Why does not AVX further improve the performance compared with SSE2?

Why does not AVX further improve the performance compared with SSE2? - c++

I am new to the field of SSE2 and AVX. I write the following code to test the performance of both SSE2 and AVX.
#include <cmath>
#include <iostream>
#include <chrono>
#include <emmintrin.h>
#include <immintrin.h>
void normal_res(float* __restrict__ a, float* __restrict__ b, float* __restrict__ c, unsigned long N) {
for (unsigned long n = 0; n < N; n++) {
c[n] = sqrt(a[n]) + sqrt(b[n]);
}
}
void normal(float* a, float* b, float* c, unsigned long N) {
for (unsigned long n = 0; n < N; n++) {
c[n] = sqrt(a[n]) + sqrt(b[n]);
}
}
void sse(float* a, float* b, float* c, unsigned long N) {
__m128* a_ptr = (__m128*)a;
__m128* b_ptr = (__m128*)b;
for (unsigned long n = 0; n < N; n+=4, a_ptr++, b_ptr++) {
__m128 asqrt = _mm_sqrt_ps(*a_ptr);
__m128 bsqrt = _mm_sqrt_ps(*b_ptr);
__m128 add_result = _mm_add_ps(asqrt, bsqrt);
_mm_store_ps(&c[n], add_result);
}
}
void avx(float* a, float* b, float* c, unsigned long N) {
__m256* a_ptr = (__m256*)a;
__m256* b_ptr = (__m256*)b;
for (unsigned long n = 0; n < N; n+=8, a_ptr++, b_ptr++) {
__m256 asqrt = _mm256_sqrt_ps(*a_ptr);
__m256 bsqrt = _mm256_sqrt_ps(*b_ptr);
__m256 add_result = _mm256_add_ps(asqrt, bsqrt);
_mm256_store_ps(&c[n], add_result);
}
}
int main(int argc, char** argv) {
unsigned long N = 1 << 30;
auto *a = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
auto *b = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
auto *c = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
std::chrono::time_point<std::chrono::system_clock> start, end;
for (unsigned long i = 0; i < N; ++i) {
a[i] = 3141592.65358;
b[i] = 1234567.65358;
}
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
normal(a, b, c, N);
end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "normal elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
normal_res(a, b, c, N);
end = std::chrono::system_clock::now();
elapsed_seconds = end - start;
std::cout << "normal restrict elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
sse(a, b, c, N);
end = std::chrono::system_clock::now();
elapsed_seconds = end - start;
std::cout << "sse elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
avx(a, b, c, N);
end = std::chrono::system_clock::now();
elapsed_seconds = end - start;
std::cout << "avx elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
return 0;
}
I compile my program by using g++ complier as the following.
g++ -msse -msse2 -mavx -mavx512f -O2
The results are as the following. It seems that there is no further improvement when I use more advanced 256 bits vectors.
normal elapsed time: 10.5311
normal restrict elapsed time: 8.00338
sse elapsed time: 0.995806
avx elapsed time: 0.973302
I have two questions.
Why does not AVX give me further improvement? Is it because the memory bandwidth?
According to my experiment, the SSE2 perform 10 times faster than the naive version. Why is that? I expect the SSE2 can only be 4 times faster based on its 128 bits vectors with respect to single precision floating points. Thanks a lot.

There are several issues here....
Memory bandwidth is very likely to be important for these array sizes -- more notes below.
Throughput for SSE and AVX square root instructions may not be what you expect on your processor -- more notes below.
The first test ("normal") may be slower than expected because the output array is instantiated (i.e., virtual to physical mappings are created) during the timed part of the test. (Just fill c with zeros in the loop that initializes a and b to fix this.)
Memory Bandwidth Notes:
With N = 1<<30 and float variables, each array is 4GiB.
Each test reads two arrays and writes to a third array. This third array must also be read from memory before being overwritten -- this is called a "write allocate" or a "read for ownership".
So you are reading 12 GiB and writing 4 GiB in each test. The SSE and AVX tests therefore correspond to ~16 GB/s of DRAM bandwidth, which is near the high end of the range typically seen for single-threaded operation on recent processors.
Instruction Throughput Notes:
The best reference for instruction latency and throughput on x86 processors is "instruction_tables.pdf" from https://www.agner.org/optimize/
Agner defines "reciprocal throughput" as the average number of cycles per retired instruction when the processor is given a workload of independent instructions of the same type.
As an example, for an Intel Skylake core, the throughput of SSE and AVX SQRT is the same:
SQRTPS (xmm) 1/throughput = 3 --> 1 instruction every 3 cycles
VSQRTPS (ymm) 1/throughput = 6 --> 1 instruction every 6 cycles
Execution time for the square roots is expected to be (1<<31) square roots / 4 square roots per SSE SQRT instruction * 3 cycles per SSE SQRT instruction / 3 GHz = 0.54 seconds (randomly assuming a processor frequency).
Expected throughput for the "normal" and "normal_res" cases depends on the specifics of the generated assembly code.

Scalar being 10x instead of 4x slower:
You're getting page faults in c[] inside the scalar timed region because that's the first time you're writing it. If you did tests in a different order, whichever one was first would pay that large penalty. That part is a duplicate of this mistake: Why is iterating though `std::vector` faster than iterating though `std::array`? See also Idiomatic way of performance evaluation?
normal pays this cost in its first of the 5 passes over the array. Smaller arrays and a larger repeat count would amortize this even more, but better to memset or otherwise fill your destination first to pre-fault it ahead of the timed region.
normal_res is also scalar but is writing into an already-dirtied c[]. Scalar is 8x slower than SSE instead of the expected 4x.
You used sqrt(double) instead of sqrtf(float) or std::sqrt(float). On Skylake-X, this perfectly accounts for an extra factor of 2 throughput. Look at the compiler's asm output on the Godbolt compiler explorer (GCC 7.4 assuming the same system as your last question). I used -mavx512f (which implies -mavx and -msse), and no tuning options, to hopefully get about the same code-gen you did. main doesn't inline normal_res, so we can just look at the stand-alone definition for it.
normal_res(float*, float*, float*, unsigned long):
...
vpxord zmm2, zmm2, zmm2 # uh oh, 512-bit instruction reduces turbo clocks for the next several microseconds. Silly compiler
# more recent gcc would just use `vpxor xmm0,xmm0,xmm0`
...
.L5: # main loop
vxorpd xmm0, xmm0, xmm0
vcvtss2sd xmm0, xmm0, DWORD PTR [rdi+rbx*4] # convert to double
vucomisd xmm2, xmm0
vsqrtsd xmm1, xmm1, xmm0 # scalar double sqrt
ja .L16
.L3:
vxorpd xmm0, xmm0, xmm0
vcvtss2sd xmm0, xmm0, DWORD PTR [rsi+rbx*4]
vucomisd xmm2, xmm0
vsqrtsd xmm3, xmm3, xmm0 # scalar double sqrt
ja .L17
.L4:
vaddsd xmm1, xmm1, xmm3 # scalar double add
vxorps xmm4, xmm4, xmm4
vcvtsd2ss xmm4, xmm4, xmm1 # could have just converted in-place without zeroing another destination to avoid a false dependency :/
vmovss DWORD PTR [rdx+rbx*4], xmm4
add rbx, 1
cmp rcx, rbx
jne .L5
The vpxord zmm only reduces turbo clock for a few milliseconds (I think) at the start of each call to normal and normal_res. It doesn't keep using 512-bit operations so clock speed can jump back up again later. This might partially account for it not being exactly 8x.
The compare / ja is because you didn't use -fno-math-errno so GCC still calls actual sqrt for inputs < 0 to get errno set. It's doing if (!(0 <= tmp)) goto fallback, jumping on 0 > tmp or unordered. "Fortunately" sqrt is slow enough that it's still the only bottleneck. Out-of-order exec of the conversion and compare/branching means the SQRT unit is still kept busy ~100% of the time.
vsqrtsd throughput (6 cycles) is 2x slower than vsqrtss throughput (3 cycles) on Skylake-X, so using double costs a factor of 2 in scalar throughput.
Scalar sqrt on Skylake-X has the same throughput as the corresponding 128-bit ps / pd SIMD version. So 6 cycles per 1 number as a double vs. 3 cycles per 4 floats as a ps vector fully explains the 8x factor.
The extra 8x vs. 10x slowdown for normal was just from page faults.
SSE vs. AVX sqrt throughput
128-bit sqrtps is sufficient to get the full throughput of the SIMD div/sqrt unit; assuming this is a Skylake-server like your last question, it's 256 bits wide but not fully pipelined. The CPU can alternate sending a 128-bit vector into the low or high half to take advantage of the full hardware width even when you're only using 128-bit vectors. See Floating point division vs floating point multiplication (FP div and sqrt run on the same execution unit.)
See also instruction latency/throughput numbers on https://uops.info/, or on https://agner.org/optimize/.
The add/sub/mul/fma are all 512-bits wide and fully pipelined; use that (e.g. to evaluate a 6th order polynomial or something) if you want something that can scale with vector width. div/sqrt is a special case.
You'd expect a benefit from using 256-bit vectors for SQRT only if you had a bottleneck on the front-end (4/clock instruction / uop throughput), or if you were doing a bunch of add/sub/mul/fma work with the vectors as well.
256-bit isn't worse, but it doesn't help when the only computation bottleneck is on the div/sqrt unit's throughput.
See John McCalpin's answer for more details about write-only costing about the same as a read+write, because of RFOs.
With so little computation per memory access, you're probably close to bottlenecking on memory bandwidth again / still. Even if the FP SQRT hardware was wider / faster, you might not in practice have your code run any faster. Instead you'd just have the core spend more time doing nothing while waiting for data to arrive from memory.
It seems you are getting exactly the expected speedup from 128-bit vectors (2x * 4x = 8x), so apparently the __m128 version is not bottlenecked on memory bandwidth either.
2x sqrt per 4 memory accesses is about the same as the a[i] = sqrt(a[i]) (1x sqrt per load + store) you were doing in the code you posted in chat, but you didn't give any numbers for that. That one avoided the page-fault problem because it was rewriting an array in-place after initializing it.
In general rewriting an array in-place is a good idea if you for some reason keep insisting on trying to get a 4x / 8x / 16x SIMD speedup using these insanely huge arrays that won't even fit in L3 cache.
Memory access is pipelined, and overlaps with computation (assuming sequential access so prefetchers can be pulling it in continuously without having to compute the next address): faster computation doesn't speed up overall progress. Cache lines arrive from memory at some fixed maximum bandwidth, with ~12 cache line transfers in flight at once (12 LFBs in Skylake). Or L2 "superqueue" can track more cache lines than that (maybe 16?), so L2 prefetch is reading ahead of where the CPU core is stalled.
As long as your computation can keep up with that rate, making it faster will just leave more cycles of doing nothing before the next cache line arrives.
(The store buffer writing back to L1d and then evicting dirty lines is also happening, but the basic idea of core waiting for memory still works.)
You could think of it like stop-and-go traffic in a car: a gap opens ahead of your car. Closing that gap faster doesn't gain you any average speed, it just means you have to stop faster.
If you want to see the benefit of AVX and AVX512 over SSE, you'll need smaller arrays (and a higher repeat-count). Or you'll need lots of ALU work per vector, like a polynomial.
In many real-world problems, the same data is used repeatedly so caches work. And it's possible to break up your problem into doing multiple things to one block of data while it's hot in cache (or even while loaded in registers), to increase the computational intensity enough to take advantage of the compute vs. memory balance of modern CPUs.

Related

Very low FLOPs/second without any data transfer

I tested the following code on my machine to see how much throughput I can get. The code does not do very much except assigning each thread two nested loop,
#include <chrono>
#include <iostream>
int main() {
auto start_time = std::chrono::high_resolution_clock::now();
#pragma omp parallel for
for(int thread = 0; thread < 24; thread++) {
float i = 0.0f;
while(i < 100000.0f) {
float j = 0.0f;
while (j < 100000.0f) {
j = j + 1.0f;
}
i = i + 1.0f;
}
}
auto end_time = std::chrono::high_resolution_clock::now();
auto time = end_time - start_time;
std::cout << time / std::chrono::milliseconds(1) << std::endl;
return 0;
}
To my surprise, the throughput is very low according to perf
$ perf stat -e all_dc_accesses -e fp_ret_sse_avx_ops.all cmake-build-release/roofline_prediction
8907
Performance counter stats for 'cmake-build-release/roofline_prediction':
325.372.690 all_dc_accesses
240.002.400.000 fp_ret_sse_avx_ops.all
8,909514307 seconds time elapsed
202,819795000 seconds user
0,059613000 seconds sys
With 240.002.400.000 FLOPs in 8.83 seconds, the machine achieved only 27.1 GFLOPs/second, way below the CPU's capacity of 392 GFLOPs/sec (I got this number from a roofline modelling software).
My question is, how can I achieved higher throughput?
Compiler: GCC 9.3.0
CPU: AMD Threadripper 1920X
Optimization level: -O3
OpenMP's flag: -fopenmp

Compiled with GCC 9.3 with those options, the inner loop looks like this:
.L3:
addss xmm0, xmm2
comiss xmm1, xmm0
ja .L3
Some other combinations of GCC version / options may result in the loop being elided, after all it doesn't really do anything (except waste time).
The addss forms a loop-carried dependency with only itself in it. That is not fast though, on Zen 1 that takes 3 cycles per iteration, so the number of additions per cycle is 1/3. The maximum number of floating point additions per cycle could be attained by having at least 6 independent addps instructions (256bit vaddps may help a bit, but Zen 1 executes such 256bit SIMD instructions with 2 128bit operations internally), to deal with the latency of 3 and the throughput of 2 per cycle (so 6 operations need to be active at any time). That would correspond to 8 additions per cycles, 24 times as much as the current code.
From a C++ program, it may be possible to coax the compiler into generating suitable machine code by:
Using -ffast-math (if possible, which it isn't always)
Using explicit vectorization using _mm_add_ps
Manually unrolling the loop, using (at least 6) independent accumulators

How to shift elements of an array using intel intrinsic

I have an array of size 16 which is aligned to 64 byte boundary which I was trying to shift left by 1 index using intel intrinsics.
int history[16] __attribute__((aligned(64)))
for (std::size_t i = 0; i < 15; i++) {
history[i] = history[i + 1];
}
history[15] = 0;
This is the initial loop on which I want to use 512 bit wide vector instructions. Any way to to it with low latency intrinsics.

You have 2 good options for a single-uop lane-crossing shuffle, that you can use between a 512-bit load and store to shuffle the whole cache line. (vpsrldq would do 4 separate 128-bit right shifts so that's unfortunately not what you want.)
vpermd would need a vector control operand, and zero-masking to "shift" in a zero. So the compiler would need extra instructions to load the control vector, and to kmov a constant into a mask register.
valignd is a 32-bit granularity fully lane-crossing version of SSSE3 / AVX2 vpalignr. But it doesn't have any of that horrible AVX / AVX2 "in lane" behaviour where it does multiple separate 128-bit shuffles so it's actually usable to shift a whole 256 or 512-bit vector left or right by a constant number of dwords. You need either zero-masking or a zeroed vector to shift in zeros from. A zeroed vector is as cheap as a NOP to create on Intel CPUs.
(perf numbers from https://www.uops.info/table.html - valignd is 1 uop for port 5 on Skylake-AVX512, same as vpermd or even vpermt2d which could similarly grab a zero from another register.)
#include <immintrin.h>
alignas(16) int history[16]; // C++ has had portable syntax for alignment since C++11
// assumes aligned pointer input
void shift64_right_4bytes(int *arr) {
__m512i v = _mm512_load_si512(arr); // AVX512 load intrinsics conveniently take void*, not __m512i*
v = _mm512_alignr_epi32( _mm512_setzero_si512(), v, 1 ); // v = (0:v) >> 32bits
_mm512_store_si512(arr, v);
}
Compiles to this asm (Godbolt):
# GCC10.2 -O3 -march=skylake-avx512
shift64_right_4bytes(int*):
vpxor xmm0, xmm0, xmm0
valignd zmm0, zmm0, ZMMWORD PTR [rdi], 1
vmovdqa64 ZMMWORD PTR [rdi], zmm0
vzeroupper
ret
Obviously the vpxor-zeroing and vzeroupper overhead could be hoisted/sunk out of loops after inlining, if you had an outer loop around that loop you showed.
So the real ALU work is just 1 uop for port 5. Of course, if you wrote this array with narrower stores very recently, you could get a store-forwarding stall. Could still be worth it, just extra latency to load, doesn't actually stall the whole pipeline or out-of-order execution of independent work.
If the rest of your code doesn't use 512-bit vectors, you might want to avoid them here (SIMD instructions lowering CPU frequency)
2x 256-bit loads that overlap by one int might be good, then store them. i.e. a 15-byte memmove with the same strategy that glibc's memcpy / memmove uses for small copies. Then store a zero at the end.
// only needs AVX1
// With 64-byte aligned history, no load or store crosses a cache-line boundary
void shift64_right_4bytes_256b(int *history) {
__m256i v0 = _mm256_loadu_si256((const __m256i*)(history+1));
__m256i v1 = _mm256_load_si256((const __m256i*)(history+8));
_mm256_store_si256((__m256i*)history, v0);
_mm256_storeu_si256((__m256i*)(history+7), v1); // overlap by 1 dword
history[15] = 0;
}
Or maybe valignd ymm for the high half, to shift a zero into the vector instead of a separate scalar store. (That would require AVX512VL instead of just AVX1 for this version, but that's fine on AXV512 CPUs.)
Partly depends on how you want to reload it, and whether the surrounding code does a lot of stores. (Back-end pressure on the store execution units and store buffer).
Or if it was originally stored with 2x 256-bit aligned stores, then the unaligned load could hit a store-forwarding stall which you could avoid by using valignd to shift a dword between the high and low halves, as well as to shift a zero into the high half.

why is it faster to print number in binary using arithmetic instead of _bittest

The purpose of the next two code section is to print number in binary.
The first one does this by two instructions (_bittest), while the second does it by pure arithmetic instructions which is three instructions.
the first code section:
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>
long num = 78002;
int main()
{
unsigned char bits[32];
long nBit;
LARGE_INTEGER a, b, f;
QueryPerformanceCounter(&a);
for (size_t i = 0; i < 100000000; i++)
{
for (nBit = 0; nBit < 31; nBit++)
{
bits[nBit] = _bittest(&num, nBit);
}
}
QueryPerformanceCounter(&b);
QueryPerformanceFrequency(&f);
printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
printf_s("Binary representation:\n");
while (nBit--)
{
if (bits[nBit])
printf_s("1");
else
printf_s("0");
}
return 0;
}
the inner loop is compile to the instructions bt and setb
The second code section:
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>
long num = 78002;
int main()
{
unsigned char bits[32];
long nBit;
LARGE_INTEGER a, b, f;
QueryPerformanceCounter(&a);
for (size_t i = 0; i < 100000000; i++)
{
long curBit = 1;
for (nBit = 0; nBit < 31; nBit++)
{
bits[nBit] = (num&curBit) >> nBit;
curBit <<= 1;
}
}
QueryPerformanceCounter(&b);
QueryPerformanceFrequency(&f);
printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
printf_s("Binary representation:\n");
while (nBit--)
{
if (bits[nBit])
printf_s("1");
else
printf_s("0");
}
return 0;
}
The inner loop compile to and add(as shift left) and sar.
the second code section run three time faster then the first one.
Why three cpu instructions run faster then two?

Not answer (Bo did), but the second inner loop version can be simplified a bit:
long numCopy = num;
for (nBit = 0; nBit < 31; nBit++) {
bits[nBit] = numCopy & 1;
numCopy >>= 1;
}
Has subtle difference (1 instruction less) with gcc 7.2 targetting 32b.
(I'm assuming 32b target, as you convert long into 32 bit array, which makes sense only on 32b target ... and I assume x86, as it includes <windows.h>, so it's clearly for obsolete OS target, although I think windows now have even 64b version? (I don't care.))
Answer:
Why three cpu instructions run faster then two?
Because the count of instructions only correlates with performance (usually fewer is better), but the modern x86 CPU is much more complex machine, translating the actual x86 instructions into micro-code before execution, transforming that further by things like out-of-order-execution and register renaming (to break false dependency chains), and then it executes the resulting microcode, with different units of CPU capable to execute only some micro-ops, so in ideal case you may get 2-3 micro-ops executed in parallel by the 2-3 units in single cycle, and in worst case you may be executing an complete micro-code loop implementing some complex x86 instruction taking several cycles to finish, blocking most of the CPU units.
Another factor is availability of data from memory and memory writes, a single cache miss, when the data must be fetched from higher level cache, or even memory itself, creates tens-to-hundreds cycles stall. Having compact data structures favouring predictable access patterns and not exhausting all cache-lines is paramount for exploiting maximum CPU performance.
If you are at stage "why 3 instructions are faster than 2 instructions", you pretty much can start with any x86 optimization article/book, and keep reading for few months or year(s), it's quite complex topic.
You may want to check this answer https://gamedev.stackexchange.com/q/27196 for further reading...

I'm assuming you're using x86-64 MSVC CL19 (or something that makes similar code).
_bittest is slower because MSVC does a horrible job and keeps the value in memory and bt [mem], reg is much slower than bt reg,reg. This is a compiler missed-optimization. It happens even if you make num a local variable instead of a global, even when the initializer is still constant!
I included some perf analysis for Intel Sandybridge-family CPUs because they're common; you didn't say and yes it matters: bt [mem], reg has one per 3 cycle throughput on Ryzen, one per 5 cycle throughput on Haswell. And other perf characteristics differ...
(For just looking at the asm, it's usually a good idea to make a function with args to get code the compiler can't do constant-propagation on. It can't in this case because it doesn't know if anything modifies num before main runs, because it's not static.)
Your instruction-counting didn't include the whole loop so your counts are wrong, but more importantly you didn't consider the different costs of different instructions. (See Agner Fog's instruction tables and optimization manual.)
This is your whole inner loop with the _bittest intrinsic, with uop counts for Haswell / Skylake:
for (nBit = 0; nBit < 31; nBit++) {
bits[nBit] = _bittest(&num, nBit);
//bits[nBit] = (bool)(num & (1UL << nBit)); // much more efficient
}
Asm output from MSVC CL19 -Ox on the Godbolt compiler explorer
$LL7#main:
bt DWORD PTR num, ebx ; 10 uops (microcoded), one per 5 cycle throughput
lea rcx, QWORD PTR [rcx+1] ; 1 uop
setb al ; 1 uop
inc ebx ; 1 uop
mov BYTE PTR [rcx-1], al ; 1 uop (micro-fused store-address and store-data)
cmp ebx, 31
jb SHORT $LL7#main ; 1 uop (macro-fused with cmp)
That's 15 fused-domain uops, so it can issue (at 4 per clock) in 3.75 cycles. But that's not the bottleneck: Agner Fog's testing found that bt [mem], reg has a throughput of one per 5 clock cycles.
IDK why it's 3x slower than your other loop. Maybe the other ALU instructions compete for the same port as the bt, or the data dependency it's part of causes a problem, or just being a micro-coded instruction is a problem, or maybe the outer loop is less efficient?
Anyway, using bt [mem], reg instead of bt reg, reg is a major missed optimization. This loop would have been faster than your other loop with a 1 uop, 1c latency, 2-per-clock throughput bt r9d, ebx.
The inner loop compile to and add(as shift left) and sar.
Huh? Those are the instructions MSVC associates with the curBit <<= 1; source line (even though that line is fully implemented by the add self,self, and the variable-count arithmetic right shift is part of a different line.)
But the whole loop is this clunky mess:
long curBit = 1;
for (nBit = 0; nBit < 31; nBit++) {
bits[nBit] = (num&curBit) >> nBit;
curBit <<= 1;
}
$LL18#main: # MSVC CL19 -Ox
mov ecx, ebx ; 1 uop
lea r8, QWORD PTR [r8+1] ; 1 uop pointer-increment for bits
mov eax, r9d ; 1 uop. r9d holds num
inc ebx ; 1 uop
and eax, edx ; 1 uop
# MSVC says all the rest of these instructions are from curBit <<= 1; but they're obviously not.
add edx, edx ; 1 uop
sar eax, cl ; 3 uops (variable-count shifts suck)
mov BYTE PTR [r8-1], al ; 1 uop (micro-fused)
cmp ebx, 31
jb SHORT $LL18#main ; 1 uop (macro-fused with cmp)
So this is 11 fused-domain uops, and takes 2.75 clock cycles per iteration to issue from the front-end.
I don't see any loop-carried dep chains longer than that front-end bottleneck, so it probably runs about that fast.
Copying ebx to ecx every iteration instead of just using ecx as the loop counter (nBit) is an obvious missed optimization. The shift-count is needed in cl for a variable-count shift (unless you enable BMI2 instructions, if MSVC can even do that.)
There are major missed optimizations here (in the "fast" version), so you should probably write your source differently do hand-hold your compiler into making less bad code. It implements this fairly literally instead of transforming it into something the CPU can do efficiently, or using bt reg,reg / setc
How to do this fast in asm or with intrinsics
Use SSE2 / AVX. Get the right byte (containing the corresponding bit) into each byte element of a vector, and PANDN (to invert your vector) with a mask that has the right bit for that element. PCMPEQB against zero. That gives you 0 / -1. To get ASCII digits, use _mm_sub_epi8(set1('0'), mask) to subtract 0 or -1 (add 0 or 1) to ASCII '0', conditionally turning it into '1'.
The first steps of this (getting a vector of 0/-1 from a bitmask) is How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?.
Fastest way to unpack 32 bits to a 32 byte SIMD vector (has a 128b version). Without SSSE3 (pshufb), I think punpcklbw / punpcklwd (and maybe pshufd) is what you need to repeat each byte of num 8 times and make two 16-byte vectors.
is there an inverse instruction to the movemask instruction in intel avx2?.
In scalar code, this is one way that runs at 1 bit->byte per clock. There are probably ways to do better without using SSE2 (storing multiple bytes at once to get around the 1 store per clock bottleneck that exists on all current CPUs), but why bother? Just use SSE2.
mov eax, [num]
lea rdi, [rsp + xxx] ; bits[]
.loop:
shr eax, 1 ; constant-count shift is efficient (1 uop). CF = last bit shifted out
setc [rdi] ; 2 uops, but just as efficient as setc reg / mov [mem], reg
shr eax, 1
setc [rdi+1]
add rdi, 2
cmp end_pointer ; compare against another register instead of a separate counter.
jb .loop
Unrolled by two to avoid bottlenecking on the front-end, so this can run at 1 bit per clock.

The difference is that the code _bittest(&num, nBit); uses a pointer to num, which makes the compiler store it in memory. And the memory access makes the code a lot slower.
bits[nBit] = _bittest(&num, nBit);
00007FF6D25110A0 bt dword ptr [num (07FF6D2513034h)],ebx ; <-----
00007FF6D25110A7 lea rcx,[rcx+1]
00007FF6D25110AB setb al
00007FF6D25110AE inc ebx
00007FF6D25110B0 mov byte ptr [rcx-1],al
The other version stores all the variables in registers, and uses very fast register shifts and adds. No memory accesses.

Using nested vectors vs a flatten vector wrapper, strange behaviour

The problem
For a long time I had the impression that using a nested std::vector<std::vector...> for simulating an N-dimensional array is in general bad, since the memory is not guarantee to be contiguous, and one may have cache misses. I thought it's better to use a flat vector and map from multiple dimensions to 1D and vice versa. So, I decided to test it (code listed at the end). It is pretty straightforward, I timed reading/writing to a nested 3D vector vs my own 3D wrapper of an 1D vector. I compiled the code with both g++ and clang++, with -O3 optimization turned on. For each run I changed the dimensions, so I can get a pretty good idea about the behaviour. To my surprise, these are the results I obtained on my machine MacBook Pro (Retina, 13-inch, Late 2012), 2.5GHz i5, 8GB RAM, OS X 10.10.5:
g++ 5.2
dimensions nested flat
X Y Z (ms) (ms)
100 100 100 -> 16 24
150 150 150 -> 58 98
200 200 200 -> 136 308
250 250 250 -> 264 746
300 300 300 -> 440 1537
clang++ (LLVM 7.0.0)
dimensions nested flat
X Y Z (ms) (ms)
100 100 100 -> 16 18
150 150 150 -> 53 61
200 200 200 -> 135 137
250 250 250 -> 255 271
300 300 300 -> 423 477
As you can see, the "flatten" wrapper is never beating the nested version. Moreover, g++'s libstdc++ implementation performs quite badly compared to libc++ implementation, for example for 300 x 300 x 300 the flatten version is almost 4 times slower than the nested version. libc++ seems to have equal performance.
My questions:
Why isn't the flatten version faster? Shouldn't it be? Am I missing something in the testing code?
Moreover, why does g++'s libstdc++ performs so badly when using flatten vectors? Again, shouldn't it perform better?
The code I used:
#include <chrono>
#include <cstddef>
#include <iostream>
#include <memory>
#include <random>
#include <vector>
// Thin wrapper around flatten vector
template<typename T>
class Array3D
{
std::size_t _X, _Y, _Z;
std::vector<T> _vec;
public:
Array3D(std::size_t X, std::size_t Y, std::size_t Z):
_X(X), _Y(Y), _Z(Z), _vec(_X * _Y * _Z) {}
T& operator()(std::size_t x, std::size_t y, std::size_t z)
{
return _vec[z * (_X * _Y) + y * _X + x];
}
const T& operator()(std::size_t x, std::size_t y, std::size_t z) const
{
return _vec[z * (_X * _Y) + y * _X + x];
}
};
int main(int argc, char** argv)
{
std::random_device rd{};
std::mt19937 rng{rd()};
std::uniform_real_distribution<double> urd(-1, 1);
const std::size_t X = std::stol(argv[1]);
const std::size_t Y = std::stol(argv[2]);
const std::size_t Z = std::stol(argv[3]);
// Standard library nested vector
std::vector<std::vector<std::vector<double>>>
vec3D(X, std::vector<std::vector<double>>(Y, std::vector<double>(Z)));
// 3D wrapper around a 1D flat vector
Array3D<double> vec1D(X, Y, Z);
// TIMING nested vectors
std::cout << "Timing nested vectors...\n";
auto start = std::chrono::steady_clock::now();
volatile double tmp1 = 0;
for (std::size_t x = 0 ; x < X; ++x)
{
for (std::size_t y = 0 ; y < Y; ++y)
{
for (std::size_t z = 0 ; z < Z; ++z)
{
vec3D[x][y][z] = urd(rng);
tmp1 += vec3D[x][y][z];
}
}
}
std::cout << "\tSum: " << tmp1 << std::endl; // we make sure the loops are not optimized out
auto end = std::chrono::steady_clock::now();
std::cout << "Took: ";
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << ms << " milliseconds\n";
// TIMING flatten vector
std::cout << "Timing flatten vector...\n";
start = std::chrono::steady_clock::now();
volatile double tmp2 = 0;
for (std::size_t x = 0 ; x < X; ++x)
{
for (std::size_t y = 0 ; y < Y; ++y)
{
for (std::size_t z = 0 ; z < Z; ++z)
{
vec1D(x, y, z) = urd(rng);
tmp2 += vec1D(x, y, z);
}
}
}
std::cout << "\tSum: " << tmp2 << std::endl; // we make sure the loops are not optimized out
end = std::chrono::steady_clock::now();
std::cout << "Took: ";
ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << ms << " milliseconds\n";
}
EDIT
Changing the Array3D<T>::operator() return to
return _vec[(x * _Y + y) * _Z + z];
as per #1201ProgramAlarm's suggestion does indeed get rid of the "weird" behaviour of g++, in the sense that the flat and nested versions take now roughly the same time. However it's still intriguing. I thought the nested one will be much worse due to cache issues. May I just be lucky and have all the memory contiguously allocated?

Why the nested vectors are about the same speed as flat in your microbenchmark, after fixing the indexing order: You'd expect the flat array to be faster (see Tobias's answer about potential locality problems, and my other answer for why nested vectors suck in general, but not too badly for sequential access). But your specific test is doing so many things that let out-of-order execution hide the overhead of using nested vectors, and/or that just slow things down so much that the extra overhead is lost in measurement noise.
I put your performance-bugfixed source code up on Godbolt so we can look at the asm of the inner loop as compiled by g++5.2, with -O3. (Apple's fork of clang might be similar to clang3.7, but I'll just look at the gcc version.) There's a lot of code from C++ functions, but you can right-click on a source line to scroll the asm windows to the code for that line. Also, mouseover a source line to bold the asm that implements that line, or vice versa.
gcc's inner two loops for the nested version are as follows (with some comments added by hand):
## outer-most loop not shown
.L213: ## middle loop (over `y`)
test rbp, rbp # Z
je .L127 # inner loop runs zero times if Z==0
mov rax, QWORD PTR [rsp+80] # MEM[(struct vector * *)&vec3D], MEM[(struct vector * *)&vec3D]
xor r15d, r15d # z = 0
mov rax, QWORD PTR [rax+r12] # MEM[(struct vector * *)_195], MEM[(struct vector * *)_195]
mov rdx, QWORD PTR [rax+rbx] # D.103857, MEM[(double * *)_38]
## Top of inner-most loop.
.L128:
lea rdi, [rsp+5328] # tmp511, ## function arg: pointer to the RNG object, which is a local on the stack.
lea r14, [rdx+r15*8] # D.103851, ## r14 = &(vec3D[x][y][z])
call double std::generate_canonical<double, 53ul, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul> >(std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&) #
addsd xmm0, xmm0 # D.103853, D.103853 ## return val *= 2.0: [0.0, 2.0]
mov rdx, QWORD PTR [rsp+80] # MEM[(struct vector * *)&vec3D], MEM[(struct vector * *)&vec3D] ## redo the pointer-chasing from vec3D.data()
mov rdx, QWORD PTR [rdx+r12] # MEM[(struct vector * *)_150], MEM[(struct vector * *)_150]
subsd xmm0, QWORD PTR .LC6[rip] # D.103859, ## and subtract 1.0: [-1.0, 1.0]
mov rdx, QWORD PTR [rdx+rbx] # D.103857, MEM[(double * *)_27]
movsd QWORD PTR [r14], xmm0 # *_155, D.103859 # store into vec3D[x][y][z]
movsd xmm0, QWORD PTR [rsp+64] # D.103853, tmp1 # reload volatile tmp1
addsd xmm0, QWORD PTR [rdx+r15*8] # D.103853, *_62 # add the value just stored into the array (r14 = rdx+r15*8 because nothing else modifies the pointers in the outer vectors)
add r15, 1 # z,
cmp rbp, r15 # Z, z
movsd QWORD PTR [rsp+64], xmm0 # tmp1, D.103853 # spill tmp1
jne .L128 #,
#End of inner-most loop
.L127: ## middle-loop
add r13, 1 # y,
add rbx, 24 # sizeof(std::vector<> == 24) == the size of 3 pointers.
cmp QWORD PTR [rsp+8], r13 # %sfp, y
jne .L213 #,
## outer loop not shown.
And for the flat loop:
## outer not shown.
.L214:
test rbp, rbp # Z
je .L135 #,
mov rax, QWORD PTR [rsp+280] # D.103849, vec1D._Y
mov rdi, QWORD PTR [rsp+288] # D.103849, vec1D._Z
xor r15d, r15d # z
mov rsi, QWORD PTR [rsp+296] # D.103857, MEM[(double * *)&vec1D + 24B]
.L136: ## inner-most loop
imul rax, r12 # D.103849, x
lea rax, [rax+rbx] # D.103849,
imul rax, rdi # D.103849, D.103849
lea rdi, [rsp+5328] # tmp520,
add rax, r15 # D.103849, z
lea r14, [rsi+rax*8] # D.103851, # &vec1D(x,y,z)
call double std::generate_canonical<double, 53ul, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul> >(std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&) #
mov rax, QWORD PTR [rsp+280] # D.103849, vec1D._Y
addsd xmm0, xmm0 # D.103853, D.103853
mov rdi, QWORD PTR [rsp+288] # D.103849, vec1D._Z
mov rsi, QWORD PTR [rsp+296] # D.103857, MEM[(double * *)&vec1D + 24B]
mov rdx, rax # D.103849, D.103849
imul rdx, r12 # D.103849, x # redo address calculation a 2nd time per iteration
subsd xmm0, QWORD PTR .LC6[rip] # D.103859,
add rdx, rbx # D.103849, y
imul rdx, rdi # D.103849, D.103849
movsd QWORD PTR [r14], xmm0 # MEM[(double &)_181], D.103859 # store into the address calculated earlier
movsd xmm0, QWORD PTR [rsp+72] # D.103853, tmp2
add rdx, r15 # tmp374, z
add r15, 1 # z,
addsd xmm0, QWORD PTR [rsi+rdx*8] # D.103853, MEM[(double &)_170] # tmp2 += vec1D(x,y,z). rsi+rdx*8 == r14, so this is a reload of the store this iteration.
cmp rbp, r15 # Z, z
movsd QWORD PTR [rsp+72], xmm0 # tmp2, D.103853
jne .L136 #,
.L135: ## middle loop: increment y
add rbx, 1 # y,
cmp r13, rbx # Y, y
jne .L214 #,
## outer loop not shown.
Your MacBook Pro (Late 2012) has an Intel IvyBridge CPU, so I'm using numbers for that microarchitecture from Agner Fog's instruction tables and microarch guide. Things should be mostly the same on other Intel/AMD CPUs.
The only 2.5GHz mobile IvB i5 is the i5-3210M, so your CPU has 3MiB of L3 cache. This means even your smallest test case (100^3 * 8B per double ~= 7.63MiB) is larger than your last-level cache, so none of your test cases fit in cache at all. That's probably a good thing, because you allocate and default-initialize both nested and flat before testing either of them. However, you do test in the same order you allocate, so if the nested array is still cache after zeroing the flat array, the flat array may still be hot in L3 cache after the timing loop over the nested array.
If you'd used a repeat-loop to loop over the same array multiple times, you could have got times large enough to measure for smaller array sizes.
You're doing several things here that are super-weird and make this so slow that out-of-order execution can hide the extra latency of changing y, even if your inner z vectors are not perfectly contiguous.
You run a slow PRNG inside the timed loop. std::uniform_real_distribution<double> urd(-1, 1); is extra overhead on top of std::mt19937 rng{rd()};, which is already slow compared to FP-add latency (3 cycles), or compared to the L1D cache load throughput of 2 per cycle. All this extra time running the PRNG gives out-of-order execution a chance to run the array-indexing instructions so the final address is ready by the time the data is. Unless you have a lot of cache misses, you're mostly just measuring PRNG speed, because it produces results much slower than 1 per clock cycle.
g++5.2 doesn't fully inline the urd(rng) code, and the x86-64 System V calling convention has no call-preserved XMM registers. So tmp1/tmp2 have to be spilled/reloaded for every element, even if they weren't volatile.
It also loses its place in the Z vector, and has to redo the outer 2 levels of indirection before accessing the next z element. This is because it doesn't know about the internals of the function its calling, and assumes that it might have a pointer to the outer vector<>'s memory. (The flat version does two multiplies for indexing in the inner loop, instead of a simple pointer-add.)
clang (with libc++) does fully inline the PRNG, so moving to the next z is just add reg, 8 to increment a pointer in both the flat and nested versions. You could get the same behaviour from gcc by getting an iterator outside the inner loop, or getting a reference to the inner vector, instead of redoing operator[] and hoping the compiler will hoist it for you.
Intel/AMD FP add/sub/mul throughput/latency is not data-dependent, except for denormals. (x87 also slows down for NaN and maybe infinity, but SSE doesn't. 64-bit code uses SSE even for scalar float/double.) So you could have just initialized your array with zeros, or with a PRNG outisde the timing loop. (Or left them zeroed, since the vector<double> constructor does that for you, and it actually takes extra code to get it not to in cases where you're going to write something else.) Division and sqrt performance is data-dependent on some CPUs, and much slower than add/sub/mul.
You write every element right before you read it, inside the inner loop. In the source, this looks like a store/reload. That's what gcc actually does, unfortunately, but clang with libc++ (which inlines the PRNG) transforms the loop body:
// original
vec3D[x][y][z] = urd(rng);
tmp1 += vec3D[x][y][z];
// what clang's asm really does
double xmm7 = urd(rng);
vec3D[x][y][z] = xmm7;
tmp1 += xmm7;
In clang's asm:
# do { ...
addsd xmm7, xmm4 # last instruction of the PRNG
movsd qword ptr [r8], xmm7 # store it into the Z vector
addsd xmm7, qword ptr [rsp + 88]
add r8, 8 # pointer-increment to walk along the Z vector
dec r13 # i--
movsd qword ptr [rsp + 88], xmm7
jne .LBB0_74 # }while(i != 0);
It's allowed to do this because vec3D isn't volatile or atomic<>, so it would be undefined behaviour for any other thread to be writing this memory at the same time. This means it can optimize away store/reloads to objects in memory into just a store (and simply use the value it stored, without reloading). Or optimize the store away entirely if it can prove it's a dead store (a store that nothing can ever read, e.g. to an unused static variable).
In gcc's version, it does the indexing for the store before the PRNG call, and the indexing for the reload after. So I think gcc isn't sure that the function call doesn't modify a pointer, because pointers to the outer vectors have escaped the function. (And the PRNG doesn't inline).
However, even an actual store/reload in the asm is still less sensitive to cache-misses than a simple load!
Store->load forwarding still works even if the store misses in cache. So a cache-miss in a Z vector doesn't directly delay the critical path. It only slows you down if out-of-order execution can't hide the latency of the cache miss. (A store can retire as soon as the data is written to the store-buffer (and all previous instructions have retired). I'm not sure if a load can retire before the cache-line even makes it to L1D, if it got its data from store-forwarding. It may be able to, because x86 does allow StoreLoad reordering (stores are allowed to become globally visible after loads). In that case, a store/reload only adds 6 cycles of latency for the PRNG result (off the critical path from one PRNG state to next PRNG state). And it's only throughput a bottleneck if it cache-misses so much that the store-buffer fills up and prevents new store uops from executing, which in turn eventually prevents new uops from issuing into the out-of-order core when the Reservation Station or ROB fills up with un-executed or un-retired (respectively) uops.
With reversed indexing (original version of the flat code), probably the main bottleneck was the scattered stores. IDK why clang did so much better than gcc there. Maybe clang managed to invert a loop and traverse memory in sequential order after all. (Since it fully inlined the PRNG, there were no function calls that would require the state of memory to match program-order.)
Traversing each Z vector in-order means that cache misses are relatively far-between (even if each Z vector is not contiguous with the previous one), giving lots of time for for the stores to execute. Or even if a store-forwarded load can't actually retire until the L1D cache actually owns the cache line (in Modified state of the MESI protocol), speculative execution has the correct data and didn't have to wait for the latency of the cache-miss. The out-of-order instruction window is probably big enough to keep the critical path probably from stalling before the load can retire. (Cache miss loads are normally really bad, because dependent instructions can't be executed without data for them to operate on. So they much more easily create bubbles in the pipeline. With a full cache-miss from DRAM having a latency of over 300 cycles, and the out-of-order window being 168 uops on IvB, it can't hide all of the latency for code executing at even 1 uop (approximately 1 instruction) per clock.) For pure stores, the out-of-order window extends beyond the ROB size, because they don't need to commit to L1D to retire. In fact, they can't commit until after they retire, because that's the point at which they're known to be non-speculative. (So making them globally-visible earlier than that would prevent roll-back on detection of an exception or mis-speculation.)
I don't have libc++ installed on my desktop, so I can't benchmark that version against g++. With g++5.4, I'm finding Nested: 225 milliseconds and Flat: 239 milliseconds. I suspect that the extra array-indexing multiplies are a problem, and compete with ALU instructions the PRNG uses. In contrast, the nested version redoing a bunch of pointer-chasing that hits in L1D cache can happen in parallel. My desktop is a Skylake i7-6700k at 4.4GHz. SKL has a ROB (ReOrder Buffer) size of 224 uops, and RS of 97 uops, so the out-of-order window is very large. It also has FP-add latency of 4 cycles (unlike previous uarches where it was 3).
volatile double tmp1 = 0; Your accumulator is volatile, which forces the compiler to store/reload it every iteration of the inner loop. The total latency of the loop-carried dependency chain in the inner loop is 9 cycles: 3 for addsd, and 6 for store-forwarding from movsd the store to the movsd reload. (clang folds the reload into a memory operand with addsd xmm7, qword ptr [rsp + 88], but same difference. ([rsp+88] is on the stack, where variables with automatic storage are stored, if they need to be spilled from registers.)
As noted above, the non-inline function call for gcc will also force a spill/reload in the x86-64 System V calling convention (used by everything but Windows). But a smart compiler could have done 4 PRNG calls, for example, and then 4 array stores. (If you'd used an iterator to make sure gcc knew the vectors holding other vectors weren't changing.)
Using -ffast-math would have let the compiler auto-vectorize (if not for the PRNG and volatile). This would let you run through the arrays fast enough that lack of locality between different Z vectors could be a real problem. It would also let compilers unroll with multiple accumulators, to hide the FP-add latency. e.g. they could (and clang would) make asm equivalent to:
float t0=0, t1=0, t2=0, t3=0;
for () {
t0 += a[i + 0];
t1 += a[i + 1];
t2 += a[i + 2];
t3 += a[i + 3];
}
t0 = (t0 + t1) + (t2 + t3);
That has 4 separate dependency chains, so it can keep 4 FP adds in flight. Since IvB has 3 cycle latency, one per clock throughput for addsd, we only need to keep 4 in flight to saturate its throughput. (Skylake has 4c latency, 2 per clock throughput, same as mul or FMA, so you need 8 accumulators to avoid latency bottlenecks. Actually, even more is better. As testing by the asker of that question showed, Haswell did better with even more accumulators when coming close to maxing out load throughput.)
Something like that would be a much better test of how efficient it is to loop over an Array3D. If you want to stop the loop from being optimized away entirely, just use the result. Test your microbenchmark to make sure that increasing the problem size scales the time; if not then something got optimized away, or you're not testing what you think you're testing. Don't make the inner-loop temporary volatile!!
Writing microbenchmarks is not easy. You have to understand enough to write one that tests what you think you're testing. :P This is a good example of how easy it is to go wrong.
May I just be lucky and have all the memory contiguously allocated?
Yes, that probably happens for many small allocations done in-order, when you haven't allocated and freed anything before doing this. If they were large enough (usually one 4kiB page or larger), glibc malloc would switch to using mmap(MAP_ANONYMOUS) and then the kernel would choose randomized virtual addresses (ASLR). So with larger Z, you might expect locality to get worse. But on the other hand, larger Z vectors mean you spend more of your time looping over one contiguous vector so a cache miss when changing y (and x) becomes relatively less important.
Looping sequentially over your data with your apparently doesn't expose this, because the extra pointer accesses hit in cache, so the pointer-chasing has low enough latency for OOO execution to hide it with your slow loop.
Prefetch has a really easy time keeping up here.
Different compilers / library can make a big difference with this weird test. On my system (Arch Linux, i7-6700k Skylake with 4.4GHz max turbo), the best of 4 runs at 300 300 300 for g++5.4 -O3 was:
Timing nested vectors...
Sum: 579.78
Took: 225 milliseconds
Timing flatten vector...
Sum: 579.78
Took: 239 milliseconds
Performance counter stats for './array3D-gcc54 300 300 300':
532.066374 task-clock (msec) # 1.000 CPUs utilized
2 context-switches # 0.004 K/sec
0 cpu-migrations # 0.000 K/sec
54,523 page-faults # 0.102 M/sec
2,330,334,633 cycles # 4.380 GHz
7,162,855,480 instructions # 3.07 insn per cycle
632,509,527 branches # 1188.779 M/sec
756,486 branch-misses # 0.12% of all branches
0.532233632 seconds time elapsed
vs. g++7.1 -O3 (which apparently decided to branch on something that g++5.4 didn't)
Timing nested vectors...
Sum: 932.159
Took: 363 milliseconds
Timing flatten vector...
Sum: 932.159
Took: 378 milliseconds
Performance counter stats for './array3D-gcc71 300 300 300':
810.911200 task-clock (msec) # 1.000 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
54,523 page-faults # 0.067 M/sec
3,546,467,563 cycles # 4.373 GHz
7,107,511,057 instructions # 2.00 insn per cycle
794,124,850 branches # 979.299 M/sec
55,074,134 branch-misses # 6.94% of all branches
0.811067686 seconds time elapsed
vs. clang4.0 -O3 (with gcc's libstdc++, not libc++)
perf stat ./array3D-clang40-libstdc++ 300 300 300
Timing nested vectors...
Sum: -349.786
Took: 1657 milliseconds
Timing flatten vector...
Sum: -349.786
Took: 1631 milliseconds
Performance counter stats for './array3D-clang40-libstdc++ 300 300 300':
3358.297093 task-clock (msec) # 1.000 CPUs utilized
9 context-switches # 0.003 K/sec
0 cpu-migrations # 0.000 K/sec
54,521 page-faults # 0.016 M/sec
14,679,919,916 cycles # 4.371 GHz
12,917,363,173 instructions # 0.88 insn per cycle
1,658,618,144 branches # 493.887 M/sec
916,195 branch-misses # 0.06% of all branches
3.358518335 seconds time elapsed
I didn't dig into what clang did wrong, or try with -ffast-math and/or -march=native. (Those won't do much unless you remove volatile, though.)
perf stat -d doesn't show more cache misses (L1 or last-level) for clang than gcc. But it does show clang is doing more than twice as many L1D loads.
I did try with a non-square array. It's almost exactly the same time keeping total element count the same but changing the final dimension to 5 or 6.
Even a minor change to the C helps, and makes "flatten" faster than nested with gcc (from 240ms down to 220ms for 300^3, but barely making any difference for nested.):
// vec1D(x, y, z) = urd(rng);
double res = urd(rng);
vec1D(x, y, z) = res; // indexing calculation only done once, after the function call
tmp2 += vec1D(x, y, z);
// using iterators would still avoid redoing it at all.

It is because of how you're ordering your indexes in the 3D class. Since your innermost loop is changing z, that's the largest part of your index so you get a lot of cache misses. Rearrange your indexing to
_vec[(x * _Y + y) * _Z + z]
and you should see better performance.

Reading over the other answers I'm not really satisfied with the accuracy and level of detail of the answers, so I'll attempt an explanation myself:
The man issue here is not indirection but a question of spatial locality:
There are basically two things that make caching especially effective:
Temporal locality, which means that a memory word that has been accessed recently will likely be accessed again in the near future. This might for example happen at nodes near the root of a binary search tree that is accessed frequently.
Spatial locality, which means that if a memory word has been accessed, it is likely that the memory words before or after this word will soon be accessed as well. This happens in our case, for nested and flattened arrays.
To assess the impact that indirection and cache effects might have on this problem, let's just assume that we have X = Y = Z = 1024
Judging from this question, a single cache line (L1, L2 or L3) is 64 bytes long, that means 8 double values. Let's assume the L1 cache has 32 kB (4096 doubles), the L2 cache has 256 kB (32k doubles) and the L3 cache has 8 MB (1M doubles).
This means that - assuming the cache is filled with no other data (which is a bold guess, I know) - in the flattened case only every 4th value of y leads to a L1 cache miss (the L2 cache latency is probably around 10-20 cycles), only every 32nd value of y leads to a L2 cache miss (the L3 cache latency is some value lower than 100 cycles) and only in case of a L3 cache miss we actually have to access main memory. I don't want to open up the whole calculation here, since taking the whole cache hierarchy into account makes it a bit more difficult, but let's just say that almost all accesses to memory can be cached in the flattened case.
In the original formulation of this question, the flattened index was computed differently (z * (_X * _Y) + y * _X + x), an increase of the value that changes in the innermost loop (z) always means a jump of _X * _Y * 64 bit, thus leading to a much more non-local memory layout, which increased cache faults by a large amount.
In the nested case, the answer depends a lot on the value of Z:
If Z is rather large, most of the memory accesses are contiguous, since they refer to the entries of a single vector in the nested
vector<vector<vector>>>, which are laid out contiguously. Only when the y or x value is increased we need to actually use indirection to retrieve the start pointer of the next innermost vector.
If Z is rather small, we need to change our 'base pointer' memory access quite often, which might lead to cache-misses if the storage areas of the innermost vectors are placed rather randomly in memory. However, if they are more or less contiguous, we might observe minimal to no differences in cache performance.
Since there was a question about the assembly output, let me give a short overview:
If you compare the assembly output of the nested and flattened array, you notice a lot of similarities: There are three equivalent nested loops and the counting variables x, y and z are stored in registers. The only real difference - aside from the fact that the nested version uses two counters for every outer index to avoid multiplying by 24 on every address computation, and the flattened version does the same for the innermost loop and multiplying by 8 - can be found in the y loop where instead of just incrementing y and computing the flattened index, we need to do three interdependent memory loads to determine the base pointer for our inner loop:
mov rax, QWORD PTR [rdi]
mov rax, QWORD PTR [rax+r11]
mov rax, QWORD PTR [rax+r10]
But since these only happen every Zth time and the pointers for the 'middle vector' are most likely cached, the time difference is negligible.

(This doesn't really answer the question. I think I read it backwards initially, assuming that the OP had just found what I was expecting, that nested vectors are slower than flat.)
You should expect the nested-vector version to be slower for anything other than sequential access. After fixing the row/column major indexing order for your flat version, it should be faster for many uses, especially since it's easier for a compiler to auto-vectorize with SIMD over a big flat array than over many short std::vector<>.
A cache line is only 64B. That's 8 doubles. Locality on a page level matters because of limited TLB entries, and prefetching requires sequential accesses, but you'll get that anyway (close enough) with nested vectors that are allocated all at once with most malloc implementations. (This is a trivial microbenchmark that doesn't do anything before allocating its vectors. In a real program that allocates and frees some memory before making a lot of small allocations, some of them might be scattered around more.)
Besides locality, the extra levels of indirection are potentially problematic.
A reference / pointer to a std::vector just points to the the fixed-size block that holds the current size, allocated space, and the pointer to the buffer. IDK if any implementations place the buffer right after the control data as part of the same malloced block, but probably that's impossible because sizeof(std::vector<int>) has to be constant so you can have a vector of vectors. Check out the asm on godbolt: A function that just returns v[10] takes one load with an array arg, but two loads with a std::vector arg.
In the nested-vector implementation, loading v[x][y][z] requires 4 steps (assuming a pointer or reference to v is already in a register).
load v.buffer_pointer or v.bp or whatever the implementation calls it. (A pointer to an array of std::vector<std::vector<double>>)
load v.bp[x].bp (A pointer to an array of std::vector<double>)
load v.bp[x].bp[y].bp (A pointer to an array of double)
load v.bp[x].bp[y].bp[z] (The double we want)
A proper 3D array, simulated with a single std::vector, just does:
load v.bp (A pointer to an array of double)
load v.bp[(x*w + y)*h + z] (The double we want)
Multiple accesses to the same simulated 3D array with different x and y require computing a new index, but v.bp will stay in a register. So instead of 3 cache misses, we only get one.
Traversing the 3D array in order hides the penalty of the nested-vector implementation, because there's a loop over all the values in the inner-most vector hiding the overhead of changing x and y. Prefetch of the contiguous pointers in the outer vectors helps here, and Z is small enough in your testing that looping over one inner-most vector won't evict the pointer for the next y value.
What Every Programmer Should Know About Memory is getting somewhat outdated, but it covers the details of caching and locality. Software-prefetching is not nearly as important as it was on P4, so don't pay too much attention to that part of the guide.

May I just be lucky and have all the memory contiguously allocated?
Probably yes. I've modified your sample a little bit, so we have a benchmark which concentrates more on the differences between the two approaches:
array fill is done in a separate pass, so random generator speed is excluded
removed volatile
fixed a little bug (tmp1 was printed instead of tmp2)
added a #if 1 part, which can be used to randomize vec3D placement in memory. As it turned out, it has a huge difference on my machine.
Without randomization (I've used parameters: 300 300 300):
Timing nested vectors...
Sum: -131835
Took: 2122 milliseconds
Timing flatten vector...
Sum: -131835
Took: 2085 milliseconds
So there is a little difference, flatten version is a little bit faster. (I've run several tests, and put the minimal time here).
With randomization:
Timing nested vectors...
Sum: -117685
Took: 3014 milliseconds
Timing flatten vector...
Sum: -117685
Took: 2085 milliseconds
So the cache effect can be seen here: nested version is ~50% slower. This is because CPU cannot predict which memory area will be used, so its cache prefetcher is not efficient.
Here's the modified code:
#include <chrono>
#include <cstddef>
#include <iostream>
#include <memory>
#include <random>
#include <vector>
template<typename T>
class Array3D
{
std::size_t _X, _Y, _Z;
std::vector<T> _vec;
public:
Array3D(std::size_t X, std::size_t Y, std::size_t Z):
_X(X), _Y(Y), _Z(Z), _vec(_X * _Y * _Z) {}
T& operator()(std::size_t x, std::size_t y, std::size_t z)
{
return _vec[(x * _Y + y) * _Z + z];
}
const T& operator()(std::size_t x, std::size_t y, std::size_t z) const
{
return _vec[(x * _Y + y) * _Z + z];
}
};
double nested(std::vector<std::vector<std::vector<double>>> &vec3D, std::size_t X, std::size_t Y, std::size_t Z) {
double tmp1 = 0;
for (int iter=0; iter<100; iter++)
for (std::size_t x = 0 ; x < X; ++x)
{
for (std::size_t y = 0 ; y < Y; ++y)
{
for (std::size_t z = 0 ; z < Z; ++z)
{
tmp1 += vec3D[x][y][z];
}
}
}
return tmp1;
}
double flatten(Array3D<double> &vec1D, std::size_t X, std::size_t Y, std::size_t Z) {
double tmp2 = 0;
for (int iter=0; iter<100; iter++)
for (std::size_t x = 0 ; x < X; ++x)
{
for (std::size_t y = 0 ; y < Y; ++y)
{
for (std::size_t z = 0 ; z < Z; ++z)
{
tmp2 += vec1D(x, y, z);
}
}
}
return tmp2;
}
int main(int argc, char** argv)
{
std::random_device rd{};
std::mt19937 rng{rd()};
std::uniform_real_distribution<double> urd(-1, 1);
const std::size_t X = std::stol(argv[1]);
const std::size_t Y = std::stol(argv[2]);
const std::size_t Z = std::stol(argv[3]);
std::vector<std::vector<std::vector<double>>>
vec3D(X, std::vector<std::vector<double>>(Y, std::vector<double>(Z)));
#if 1
for (std::size_t i = 0 ; i < X*Y; i++)
{
std::size_t xa = rand()%X;
std::size_t ya = rand()%Y;
std::size_t xb = rand()%X;
std::size_t yb = rand()%Y;
std::swap(vec3D[xa][ya], vec3D[xb][yb]);
}
#endif
// 3D wrapper around a 1D flat vector
Array3D<double> vec1D(X, Y, Z);
for (std::size_t x = 0 ; x < X; ++x)
{
for (std::size_t y = 0 ; y < Y; ++y)
{
for (std::size_t z = 0 ; z < Z; ++z)
{
vec3D[x][y][z] = vec1D(x, y, z) = urd(rng);
}
}
}
std::cout << "Timing nested vectors...\n";
auto start = std::chrono::steady_clock::now();
double tmp1 = nested(vec3D, X, Y, Z);
auto end = std::chrono::steady_clock::now();
std::cout << "\tSum: " << tmp1 << std::endl; // we make sure the loops are not optimized out
std::cout << "Took: ";
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << ms << " milliseconds\n";
std::cout << "Timing flatten vector...\n";
start = std::chrono::steady_clock::now();
double tmp2 = flatten(vec1D, X, Y, Z);
end = std::chrono::steady_clock::now();
std::cout << "\tSum: " << tmp2 << std::endl; // we make sure the loops are not optimized out
std::cout << "Took: ";
ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << ms << " milliseconds\n";
}

SIMD XOR operation is not as effective as Integer XOR?

I have a task to calculate xor-sum of bytes in an array:
X = char1 XOR char2 XOR char3 ... charN;
I'm trying to parallelize it, xoring __m128 instead. This should give speed up factor 4.
Also, to recheck the algorithm I use int. This should give speed up factor 4.
The test program is 100 lines long, I can't make it shorter, but it is simple:
#include "xmmintrin.h" // simulation of the SSE instruction
#include <ctime>
#include <iostream>
using namespace std;
#include <stdlib.h> // rand
const int NIter = 100;
const int N = 40000000; // matrix size. Has to be dividable by 4.
unsigned char str[N] __attribute__ ((aligned(16)));
template< typename T >
T Sum(const T* data, const int N)
{
T sum = 0;
for ( int i = 0; i < N; ++i )
sum = sum ^ data[i];
return sum;
}
template<>
__m128 Sum(const __m128* data, const int N)
{
__m128 sum = _mm_set_ps1(0);
for ( int i = 0; i < N; ++i )
sum = _mm_xor_ps(sum,data[i]);
return sum;
}
int main() {
// fill string by random values
for( int i = 0; i < N; i++ ) {
str[i] = 256 * ( double(rand()) / RAND_MAX ); // put a random value, from 0 to 255
}
/// -- CALCULATE --
/// SCALAR
unsigned char sumS = 0;
std::clock_t c_start = std::clock();
for( int ii = 0; ii < NIter; ii++ )
sumS = Sum<unsigned char>( str, N );
double tScal = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;
/// SIMD
unsigned char sumV = 0;
const int m128CharLen = 4*4;
const int NV = N/m128CharLen;
c_start = std::clock();
for( int ii = 0; ii < NIter; ii++ ) {
__m128 sumVV = _mm_set_ps1(0);
sumVV = Sum<__m128>( reinterpret_cast<__m128*>(str), NV );
unsigned char *sumVS = reinterpret_cast<unsigned char*>(&sumVV);
sumV = sumVS[0];
for ( int iE = 1; iE < m128CharLen; ++iE )
sumV ^= sumVS[iE];
}
double tSIMD = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;
/// SCALAR INTEGER
unsigned char sumI = 0;
const int intCharLen = 4;
const int NI = N/intCharLen;
c_start = std::clock();
for( int ii = 0; ii < NIter; ii++ ) {
int sumII = Sum<int>( reinterpret_cast<int*>(str), NI );
unsigned char *sumIS = reinterpret_cast<unsigned char*>(&sumII);
sumI = sumIS[0];
for ( int iE = 1; iE < intCharLen; ++iE )
sumI ^= sumIS[iE];
}
double tINT = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;
/// -- OUTPUT --
cout << "Time scalar: " << tScal << " ms " << endl;
cout << "Time INT: " << tINT << " ms, speed up " << tScal/tINT << endl;
cout << "Time SIMD: " << tSIMD << " ms, speed up " << tScal/tSIMD << endl;
if(sumV == sumS && sumI == sumS )
std::cout << "Results are the same." << std::endl;
else
std::cout << "ERROR! Results are not the same." << std::endl;
return 1;
}
The typical results:
[10:46:20]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3540 ms
Time INT: 890 ms, speed up 3.97753
Time SIMD: 280 ms, speed up 12.6429
Results are the same.
[10:46:27]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3540 ms
Time INT: 890 ms, speed up 3.97753
Time SIMD: 280 ms, speed up 12.6429
Results are the same.
[10:46:35]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 880 ms, speed up 4.13636
Time SIMD: 290 ms, speed up 12.5517
Results are the same.
As you see, int version works ideally, but simd version loses 25% of the speed and this is stable. I tried to change the array sizes, this doesn't help.
Also, if I switch to -O2 I lose 75% of the speed in simd version:
[10:50:25]$ g++ test.cpp -O2 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 880 ms, speed up 4.13636
Time SIMD: 890 ms, speed up 4.08989
Results are the same.
[10:51:16]$ g++ test.cpp -O2 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 900 ms, speed up 4.04444
Time SIMD: 880 ms, speed up 4.13636
Results are the same.
Can someone explain me this?
Additional info:
I have g++ (GCC) 4.7.3; Intel(R) Xeon(R) CPU E7-4860
I use -fno-tree-vectorize to prevent auto vectorization. Without this flag with -O3 the
expected speed up is 1, since the task is simple. This is what I get:
[10:55:40]$ g++ test.cpp -O3; ./a.out
Time scalar: 270 ms
Time INT: 270 ms, speed up 1
Time SIMD: 280 ms, speed up 0.964286
Results are the same.
but with -O2 result is still strange:
[10:55:02]$ g++ test.cpp -O2; ./a.out
Time scalar: 3540 ms
Time INT: 990 ms, speed up 3.57576
Time SIMD: 880 ms, speed up 4.02273
Results are the same.
When I change
for ( int i = 0; i < N; i+=1 )
sum = sum ^ data[i];
to equivalent of:
for ( int i = 0; i < N; i+=8 )
sum = (data[i] ^ data[i+1]) ^ (data[i+2] ^ data[i+3]) ^ (data[i+4] ^ data[i+5]) ^ (data[i+6] ^ data[i+7]) ^ sum;
i do see improvment in scalar speed by factor of 2. But I don't see improvements in speed up. Before: intSpeedUp 3.98416, SIMDSpeedUP 12.5283. After: intSpeedUp 3.5572, SIMDSpeedUP 6.8523.

I think you may be bumping into the upper limits of memory bandwidth. This might be the reason for the 12.6x speedup instead of 16x speedup in the -O3 case.
However, gcc 4.7.3 puts a useless store instruction into the tiny not-unrolled vector loop when inlining, but not in the scalar or int SWAR loops (see below), so that might be the explanation instead.
The -O2 reduction in vector throughput is all due to gcc 4.7.3 doing an even worse job there and sending the accumulator on a round trip to memory (store-forwarding).
For analysis of the implications of that extra store instruction, see the section at the end.
TL;DR: Nehalem likes a bit more loop unrolling than SnB-family requires, and gcc has made major improvements in SSE code-generation in gcc5.
And typically use _mm_xor_si128, not _mm_xor_ps for bulk xor work like this.
Memory bandwidth.
N is huge (40MB), so memory/cache bandwidth is a concern. A Xeon E7-4860 is a 32nm Nehalem microarchitecture, with 256kiB of L2 cache (per core), and 24MiB of shared L3 cache. It has a quad-channel memory controller supporting up to DDR3-1066 (compared to dual-channel DDR3-1333 or DDR3-1600 for typical desktop CPUs like SnB or Haswell).
A typical 3GHz desktop Intel CPU can sustain a load bandwidth of something like ~8B / cycle from DRAM, in theory. (e.g. 25.6GB/s theoretical max memory BW for an i5-4670 with dual channel DDR3-1600). Achieving this in an actual single thread might not work, esp. when using integer 4B or 8B loads. For a slower CPU like a 2267MHz Nehalem Xeon, with quad-channel (but also slower) memory, 16B per clock is probably pushing the upper limits.
I had a look at the asm from the original unchanged code with gcc 4.7.3 on godbolt.
The stand-alone version looks fine (but the inlined version isn't), see below!), with the loop being
## float __vector Sum(...) non-inlined version
.L3:
xorps xmm0, XMMWORD PTR [rdi]
add rdi, 16
cmp rdi, rax
jne .L3
That's 3 fused-domain uops, and should issue and execute at one iteration per clock. Actually, it can't because xorps and fused compare-and-branch both need port5.
N is huge, so the overhead of the clunky char-at-a-time horizontal XOR doesn't come into play, even though gcc 4.7 emits abysmal code for it (multiple copies of sumVV stored to the stack, etc. etc.). (See Fastest way to do horizontal float vector sum on x86 for ways to reduce down to 4B with SIMD. It might be faster to then movd the data into integer regs and use integer shift/xor there for the last 4B -> 1B, esp. if you're not using AVX. The compiler might be able to take advantage of al/ah low and high 8bit component regs.)
The vector loop was inlined stupidly:
## float __vector Sum(...) inlined into main at -O3
.L12:
xorps xmm0, XMMWORD PTR [rdx]
add rdx, 16
cmp rdx, rbx
movaps XMMWORD PTR [rsp+64], xmm0
jne .L12
It's storing the accumulator every iteration, instead of just after the last iteration! Since gcc doesn't / didn't default to optimizing for macro-fusion, it didn't even put the cmp/jne next to each other where they can fuse into a single uop on Intel and AMD CPUs, so the loop has 5 fused-domain uops. This means it can only issue at one per 2 clocks, if the Nehalem frontend / loop buffer is anything like the Sandybridge loop buffer. uops issue in groups of 4, and a predicted-taken branch ends an issue block. So it issues in a 4/1/4/1 uop pattern, not 4/4/4/4. This means we can get at best one 16B load per 2 clocks of sustained throughput.
-mtune=core2 might double the throughput, because it puts the cmp/jne together. The store can micro-fuse into a single uop, and so can the xorps with a memory source operand. A gcc that old doesn't support -mtune=nehalem, or the more generic -mtune=intel. Nehalem can sustain one load and one store per clock, but obviously it would be far better not to have a store in the loop at all.
Compiling with -O2 makes even worse code with that gcc version:
The inlined inner loop now loads the accumulator from memory as well as storing it, so there's a store-forwarding round trip in the loop-carried dependency that the accumulator is part of:
## float __vector Sum(...) inlined at -O2
.L14:
movaps xmm0, XMMWORD PTR [rsp+16] # reload sum
xorps xmm0, XMMWORD PTR [rdx] # load data[i]
add rdx, 16
cmp rdx, rbx
movaps XMMWORD PTR [rsp+16], xmm0 # spill sum
jne .L14
At least with -O2 the horizontal byte-xor compiles to just a plain integer byte loop without spewing 15 copies copies of xmm0 onto the stack.
This is just totally braindead code, because we haven't let a reference / pointer to sumVV escape the function, so there are no other threads that could be observing the accumulator in progress. (And even if so, there's no synchronization stopping gcc from just accumulating in a reg and storing the final result). The non-inlined version is still fine.
That massive performance bug is still present all the way up to gcc 4.9.2, with -O2 -fno-tree-vectorize, even when I rename the function from main to something else, so it gets the full benefit of gcc's optimization efforts. (Don't put microbenchmarks inside main, because gcc marks it as "cold" and optimizes less.)
gcc 5.1 makes good code for the inlined version of template<>
__m128 Sum(const __m128* data, const int N). I didn't check with clang.
This extra loop-carried dep chain is almost certainly why the vector version has a smaller speedup with -O2. i.e. it's a compiler bug that's fixed in gcc5.
The scalar version with -O2 is
.L12:
xor bpl, BYTE PTR [rdx] # sumS, MEM[base: D.27594_156, offset: 0B]
add rdx, 1 # ivtmp.135,
cmp rdx, rbx # ivtmp.135, D.27613
jne .L12 #,
so it's basically optimal. Nehalem can only sustain one load per clock, so there's no need to use more accumulators.
The int version is
.L18:
xor ecx, DWORD PTR [rdx] # sum, MEM[base: D.27549_296, offset: 0B]
add rdx, 4 # ivtmp.135,
cmp rbx, rdx # D.27613, ivtmp.135
jne .L18 #,
so again, it's what you'd expect. It should be sustaining on load per clock.
For uarches that can sustain two loads per clock (Intel SnB-family, and AMD), you should be using two accumulators. compiler-implemented -funroll-loops usually just reduces loop overhead without introducing multiple accumulators. :(
You want the compiler to make code like:
xorps xmm0, xmm0
xorps xmm1, xmm1
.Lunrolled:
pxor xmm0, XMMWORD PTR [rdi]
pxor xmm1, XMMWORD PTR [rdi+16]
pxor xmm0, XMMWORD PTR [rdi+32]
pxor xmm1, XMMWORD PTR [rdi+48]
add rdi, 64
cmp rdi, rax
jb .Lunrolled
pxor xmm0, xmm1
# horizontal xor of xmm0
movhlps xmm1, xmm0
pxor xmm0, xmm1
...
Urolling by two (pxor / pxor / add / cmp/jne) would make a loop that can issue at one iteration per 1c, but requires four ALU execution ports. Only Haswell and later can keep up with that throughput. (Or AMD Bulldozer-family, because vector and integer instructions don't compete for execution ports, but conversely there are only two integer ALU pipes, so they only max out their instruction throughput with mixed code.)
This unroll by four is 6 fused-domain uops in the loop, so it can easily issue at one per 2c, and SnB/IvB can keep up with three ALU uops per clock.
Note that on Intel Nehalem through Broadwell, pxor (_mm_xor_si128) has better throughput than xorps (_mm_xor_ps), because it can run on more execution ports. If you're using AVX but not AVX2, it can make sense to use 256b _mm256_xor_ps instead of _mm_xor_si128, because _mm256_xor_si256 requires AVX2.
If it's not memory bandwidth, why is it only 12.6x speedup?
Nehalem's loop buffer (aka Loop Stream Decoder or LSD) has a "one clock delay" (according to Agner Fog's microarch pdf), so a loop with N uops will take ceil(N/4.0) + 1 cycles to issue out of the loop buffer if I understand him correctly. He doesn't explicitly say what happens to the last group of uops if there are less than 4, but SnB-family CPUs work this way (divide by 4 and round up). They can't issue uops from the next iteration following the taken branch. I tried to google about nehalem, but couldn't find anything useful.
So the char and int loops are presumably running at one load & xor per 2 clocks (since they're 3 fused-domain uops). Loop unrolling could ~double their throughput up to the point where they saturate the load port. SnB-family CPUs don't have that one clock delay, so they can run tiny loops at one clock per iteration.
Using perf counters or at least microbenchmarks to make sure that your absolute throughput is what you expect is a good idea. With just your relative measurements, you have no indication without this kind of analysis that you're leaving half your performance on the table.
The vector -O3 loop is 5 fused-domain uops, so it should be taking three clock cycles to issue. Doing 16x as much work, but taking 3 cycles per iteration instead of 2 would give us a speedup of 16 * 2/3 = 10.66. We're actually getting somewhat better than that, which I don't understand.
I'm going to stop here, instead of digging out a nehalem laptop and running actual benchmarks, since Nehalem is too old to be interesting to tune for at this level of detail.
Did you maybe compile with -mtune=core2? Or maybe your gcc had a different default tune setting, and didn't split up the compare-and-branch? In that case, the frontend probably wasn't the bottleneck, and throughput was maybe slightly limited by memory bandwidth, or by memory false dependencies:
Core 2 and Nehalem both have a false dependence between memory
addresses with the same set and offset, i.e. with a distance that is a
multiple of 4 kB.
This might cause a short bubble in the pipeline every 4k.
Before I checked on Nehalem's loop buffer and found the extra 1c per loop, I had a theory which I'm now confident is incorrect:
I thought the extra store uop in the loop that bumps it up over 4 uops would essentially cut the speed in half, so you'd see a speedup of ~6. However, maybe there are some execution bottlenecks that make the frontend issue throughput not the bottleneck after all?
Or maybe Nehalem's loop buffer is different from SnB's, and doesn't end an issue group at a predicted-taken branch. This would give a thoughput speedup of 16 * 4/5 = 12.8, for the -O3 vector loop, if it's 5 fused-domain uops can issue at a consistent 4 per clock. This matches the experimental data of 12.6429 speedup factor very well: slightly less than 12.8 is to be expected because of increased bandwidth requirements (occasional cache miss stalls when the prefetcher falls behind).
(The scalar loops still just run one iteration per clock: issuing more than one iteration per clock just means they bottleneck on one load per clock, and the 1 cycle xor loop-carried dependency.)
This can't be right because xorps in Nehalem can only run on port5, same as a fused compare-and-branch. So there's no way the non-unrolled vector loop could be running at more than one iteration per 2 cycles.
According to Agner Fog's tables, conditional branches have a throughput of one per 2c on Nehalem, further confirming that this is a bogus theory.

SSE2 is optimal when operating on completely parallel data. e.g.
for (int i = 0 ; i < N ; ++i)
z[i] = _mm_xor_ps(x[i], y[i]);
But in your case, each iteration of the loop depends upon the output of the previous iteration. This is known as a dependency chain. In short, it means that each consecutive xor is going to have to wait for the entire latency of the previous one before it can continue so it lowers the throughput.

jaket has already explained the likely problem: a dependency chain. I'll give it a try:
template<>
__m128 Sum(const __m128* data, const int N)
{
__m128 sum1 = _mm_set_ps1(0);
__m128 sum2 = _mm_set_ps1(0);
for (int i = 0; i < N; i += 2) {
sum1 = _mm_xor_ps(sum1, data[i + 0]);
sum2 = _mm_xor_ps(sum2, data[i + 1]);
}
return _mm_xor_ps(sum1, sum2);
}
Now there are no dependencies at all between the two lanes. Try expanding this to more lanes (e.g. 4).
You could also try using the integer version of these instructions (using __m128i). I do not understand the difference so this is just a hint.

In fact, the gcc compiler is optimized for SIMD. It explains why when you used -O2 the perf decreases significantly. You can re-check with -O1.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js