SIMD XOR operation is not as effective as Integer XOR?

I have a task to calculate xor-sum of bytes in an array:
X = char1 XOR char2 XOR char3 ... charN;
I'm trying to parallelize it by XORing __m128 values instead. Since a __m128 holds 16 bytes, this should give a speedup factor of 16.
Also, to recheck the algorithm I use int. This should give speed up factor 4.
The test program is 100 lines long, I can't make it shorter, but it is simple:
#include "xmmintrin.h" // simulation of the SSE instruction
#include <ctime>
#include <iostream>
using namespace std;
#include <stdlib.h> // rand
const int NIter = 100;
const int N = 40000000; // array size in bytes. Has to be divisible by 16 (bytes per __m128).
unsigned char str[N] __attribute__ ((aligned(16)));
template< typename T >
T Sum(const T* data, const int N)
{
T sum = 0;
for ( int i = 0; i < N; ++i )
sum = sum ^ data[i];
return sum;
}
template<>
__m128 Sum(const __m128* data, const int N)
{
__m128 sum = _mm_set_ps1(0);
for ( int i = 0; i < N; ++i )
sum = _mm_xor_ps(sum,data[i]);
return sum;
}
int main() {
// fill string by random values
for( int i = 0; i < N; i++ ) {
str[i] = 256 * ( double(rand()) / RAND_MAX ); // put a random value, from 0 to 255
}
/// -- CALCULATE --
/// SCALAR
unsigned char sumS = 0;
std::clock_t c_start = std::clock();
for( int ii = 0; ii < NIter; ii++ )
sumS = Sum<unsigned char>( str, N );
double tScal = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;
/// SIMD
unsigned char sumV = 0;
const int m128CharLen = 4*4;
const int NV = N/m128CharLen;
c_start = std::clock();
for( int ii = 0; ii < NIter; ii++ ) {
__m128 sumVV = _mm_set_ps1(0);
sumVV = Sum<__m128>( reinterpret_cast<__m128*>(str), NV );
unsigned char *sumVS = reinterpret_cast<unsigned char*>(&sumVV);
sumV = sumVS[0];
for ( int iE = 1; iE < m128CharLen; ++iE )
sumV ^= sumVS[iE];
}
double tSIMD = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;
/// SCALAR INTEGER
unsigned char sumI = 0;
const int intCharLen = 4;
const int NI = N/intCharLen;
c_start = std::clock();
for( int ii = 0; ii < NIter; ii++ ) {
int sumII = Sum<int>( reinterpret_cast<int*>(str), NI );
unsigned char *sumIS = reinterpret_cast<unsigned char*>(&sumII);
sumI = sumIS[0];
for ( int iE = 1; iE < intCharLen; ++iE )
sumI ^= sumIS[iE];
}
double tINT = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;
/// -- OUTPUT --
cout << "Time scalar: " << tScal << " ms " << endl;
cout << "Time INT: " << tINT << " ms, speed up " << tScal/tINT << endl;
cout << "Time SIMD: " << tSIMD << " ms, speed up " << tScal/tSIMD << endl;
if(sumV == sumS && sumI == sumS )
std::cout << "Results are the same." << std::endl;
else
std::cout << "ERROR! Results are not the same." << std::endl;
return 1;
}
The typical results:
[10:46:20]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3540 ms
Time INT: 890 ms, speed up 3.97753
Time SIMD: 280 ms, speed up 12.6429
Results are the same.
[10:46:27]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3540 ms
Time INT: 890 ms, speed up 3.97753
Time SIMD: 280 ms, speed up 12.6429
Results are the same.
[10:46:35]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 880 ms, speed up 4.13636
Time SIMD: 290 ms, speed up 12.5517
Results are the same.
As you can see, the int version scales ideally, but the SIMD version loses about 25% of the expected speed (12.6x instead of 16x), and this is stable. I tried changing the array size; this doesn't help.
Also, if I switch to -O2, I lose 75% of the expected speed in the SIMD version (about 4x instead of 16x):
[10:50:25]$ g++ test.cpp -O2 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 880 ms, speed up 4.13636
Time SIMD: 890 ms, speed up 4.08989
Results are the same.
[10:51:16]$ g++ test.cpp -O2 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 900 ms, speed up 4.04444
Time SIMD: 880 ms, speed up 4.13636
Results are the same.
Can someone explain this to me?
Additional info:
I have g++ (GCC) 4.7.3; Intel(R) Xeon(R) CPU E7-4860
I use -fno-tree-vectorize to prevent auto vectorization. Without this flag with -O3 the
expected speed up is 1, since the task is simple. This is what I get:
[10:55:40]$ g++ test.cpp -O3; ./a.out
Time scalar: 270 ms
Time INT: 270 ms, speed up 1
Time SIMD: 280 ms, speed up 0.964286
Results are the same.
but with -O2 result is still strange:
[10:55:02]$ g++ test.cpp -O2; ./a.out
Time scalar: 3540 ms
Time INT: 990 ms, speed up 3.57576
Time SIMD: 880 ms, speed up 4.02273
Results are the same.
When I change
for ( int i = 0; i < N; i+=1 )
sum = sum ^ data[i];
to equivalent of:
for ( int i = 0; i < N; i+=8 )
sum = (data[i] ^ data[i+1]) ^ (data[i+2] ^ data[i+3]) ^ (data[i+4] ^ data[i+5]) ^ (data[i+6] ^ data[i+7]) ^ sum;
I do see an improvement in scalar speed by a factor of 2, but the speedup factors don't improve. Before: intSpeedUp 3.98416, SIMDSpeedUP 12.5283. After: intSpeedUp 3.5572, SIMDSpeedUP 6.8523.

I think you may be bumping into the upper limits of memory bandwidth. This might be the reason for the 12.6x speedup instead of 16x speedup in the -O3 case.
However, gcc 4.7.3 puts a useless store instruction into the tiny not-unrolled vector loop when inlining, but not in the scalar or int SWAR loops (see below), so that might be the explanation instead.
The -O2 reduction in vector throughput is all due to gcc 4.7.3 doing an even worse job there and sending the accumulator on a round trip to memory (store-forwarding).
For analysis of the implications of that extra store instruction, see the section at the end.
TL;DR: Nehalem likes a bit more loop unrolling than SnB-family requires, and gcc has made major improvements in SSE code-generation in gcc5.
And typically use _mm_xor_si128, not _mm_xor_ps for bulk xor work like this.
Memory bandwidth.
N is huge (40MB), so memory/cache bandwidth is a concern. A Xeon E7-4860 is a 32nm Nehalem microarchitecture, with 256kiB of L2 cache (per core), and 24MiB of shared L3 cache. It has a quad-channel memory controller supporting up to DDR3-1066 (compared to dual-channel DDR3-1333 or DDR3-1600 for typical desktop CPUs like SnB or Haswell).
A typical 3GHz desktop Intel CPU can sustain a load bandwidth of something like ~8B / cycle from DRAM, in theory. (e.g. 25.6GB/s theoretical max memory BW for an i5-4670 with dual channel DDR3-1600). Achieving this in an actual single thread might not work, esp. when using integer 4B or 8B loads. For a slower CPU like a 2267MHz Nehalem Xeon, with quad-channel (but also slower) memory, 16B per clock is probably pushing the upper limits.
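As a rough sanity check on the bandwidth argument, the question's own timings can be converted into bytes per core clock. This is only a sketch: it assumes the core really runs at a constant 2267 MHz (no turbo), takes the roughly 280 ms SIMD time from the question, and the helper names are made up for illustration.

```cpp
#include <cstdio>

// Bytes consumed per core clock for a run that streams `bytes` in `seconds`
// on a core assumed to run at a constant `hz`.
constexpr double bytes_per_cycle(double bytes, double seconds, double hz) {
    return bytes / seconds / hz;
}

// Question's SIMD run: N = 40,000,000 bytes, NIter = 100 passes, about 0.28 s.
// The 2.267 GHz clock is the figure assumed in the text above.
constexpr double kSimdBytesPerCycle =
    bytes_per_cycle(40000000.0 * 100, 0.280, 2.267e9);

void print_bandwidth_estimate() {
    std::printf("SIMD run: ~%.1f bytes per cycle\n", kSimdBytesPerCycle);
}
```

Under those assumptions the SIMD run moves a bit over 6 bytes per cycle (about 14 GB/s), so it is well below the 16 B per clock that one 16-byte load every cycle would need.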
I had a look at the asm from the original unchanged code with gcc 4.7.3 on godbolt.
The stand-alone version looks fine (but the inlined version doesn't, see below!), with the loop being
## float __vector Sum(...) non-inlined version
.L3:
xorps xmm0, XMMWORD PTR [rdi]
add rdi, 16
cmp rdi, rax
jne .L3
That's 3 fused-domain uops, and should issue and execute at one iteration per clock. Actually, it can't because xorps and fused compare-and-branch both need port5.
N is huge, so the overhead of the clunky char-at-a-time horizontal XOR doesn't come into play, even though gcc 4.7 emits abysmal code for it (multiple copies of sumVV stored to the stack, etc. etc.). (See Fastest way to do horizontal float vector sum on x86 for ways to reduce down to 4B with SIMD. It might be faster to then movd the data into integer regs and use integer shift/xor there for the last 4B -> 1B, esp. if you're not using AVX. The compiler might be able to take advantage of al/ah low and high 8bit component regs.)
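For reference, here is one way the final 16-byte to 1-byte reduction could be written with SSE2 shuffles and integer shifts instead of a byte-at-a-time loop. It is a sketch, not what gcc emits and not the asker's code; the function name hxor_epu8 is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// XOR all 16 bytes of v together into a single byte.
inline std::uint8_t hxor_epu8(__m128i v) {
    // Fold the high 8 bytes onto the low 8 bytes.
    v = _mm_xor_si128(v, _mm_unpackhi_epi64(v, v));
    // Fold bytes 4..7 onto bytes 0..3 (whole-register byte shift).
    v = _mm_xor_si128(v, _mm_srli_si128(v, 4));
    // Finish the last 4 bytes in an integer register.
    std::uint32_t x = static_cast<std::uint32_t>(_mm_cvtsi128_si32(v));
    x ^= x >> 16;
    x ^= x >> 8;
    return static_cast<std::uint8_t>(x);
}
```

With N this large it won't change the timings, but it avoids the stack traffic described above.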
The vector loop was inlined stupidly:
## float __vector Sum(...) inlined into main at -O3
.L12:
xorps xmm0, XMMWORD PTR [rdx]
add rdx, 16
cmp rdx, rbx
movaps XMMWORD PTR [rsp+64], xmm0
jne .L12
It's storing the accumulator every iteration, instead of just after the last iteration! Since gcc doesn't / didn't default to optimizing for macro-fusion, it didn't even put the cmp/jne next to each other where they can fuse into a single uop on Intel and AMD CPUs, so the loop has 5 fused-domain uops. This means it can only issue at one per 2 clocks, if the Nehalem frontend / loop buffer is anything like the Sandybridge loop buffer. uops issue in groups of 4, and a predicted-taken branch ends an issue block. So it issues in a 4/1/4/1 uop pattern, not 4/4/4/4. This means we can get at best one 16B load per 2 clocks of sustained throughput.
-mtune=core2 might double the throughput, because it puts the cmp/jne together. The store can micro-fuse into a single uop, and so can the xorps with a memory source operand. A gcc that old doesn't support -mtune=nehalem, or the more generic -mtune=intel. Nehalem can sustain one load and one store per clock, but obviously it would be far better not to have a store in the loop at all.
Compiling with -O2 makes even worse code with that gcc version:
The inlined inner loop now loads the accumulator from memory as well as storing it, so there's a store-forwarding round trip in the loop-carried dependency that the accumulator is part of:
## float __vector Sum(...) inlined at -O2
.L14:
movaps xmm0, XMMWORD PTR [rsp+16] # reload sum
xorps xmm0, XMMWORD PTR [rdx] # load data[i]
add rdx, 16
cmp rdx, rbx
movaps XMMWORD PTR [rsp+16], xmm0 # spill sum
jne .L14
At least with -O2 the horizontal byte-xor compiles to just a plain integer byte loop without spewing 15 copies of xmm0 onto the stack.
This is just totally braindead code, because we haven't let a reference / pointer to sumVV escape the function, so there are no other threads that could be observing the accumulator in progress. (And even if so, there's no synchronization stopping gcc from just accumulating in a reg and storing the final result). The non-inlined version is still fine.
That massive performance bug is still present all the way up to gcc 4.9.2, with -O2 -fno-tree-vectorize, even when I rename the function from main to something else, so it gets the full benefit of gcc's optimization efforts. (Don't put microbenchmarks inside main, because gcc marks it as "cold" and optimizes less.)
gcc 5.1 makes good code for the inlined version of template<> __m128 Sum(const __m128* data, const int N). I didn't check with clang.
This extra loop-carried dep chain is almost certainly why the vector version has a smaller speedup with -O2. i.e. it's a compiler bug that's fixed in gcc5.
The scalar version with -O2 is
.L12:
xor bpl, BYTE PTR [rdx] # sumS, MEM[base: D.27594_156, offset: 0B]
add rdx, 1 # ivtmp.135,
cmp rdx, rbx # ivtmp.135, D.27613
jne .L12 #,
so it's basically optimal. Nehalem can only sustain one load per clock, so there's no need to use more accumulators.
The int version is
.L18:
xor ecx, DWORD PTR [rdx] # sum, MEM[base: D.27549_296, offset: 0B]
add rdx, 4 # ivtmp.135,
cmp rbx, rdx # D.27613, ivtmp.135
jne .L18 #,
so again, it's what you'd expect. It should be sustaining one load per clock.
For uarches that can sustain two loads per clock (Intel SnB-family, and AMD), you should be using two accumulators. compiler-implemented -funroll-loops usually just reduces loop overhead without introducing multiple accumulators. :(
You want the compiler to make code like:
xorps xmm0, xmm0
xorps xmm1, xmm1
.Lunrolled:
pxor xmm0, XMMWORD PTR [rdi]
pxor xmm1, XMMWORD PTR [rdi+16]
pxor xmm0, XMMWORD PTR [rdi+32]
pxor xmm1, XMMWORD PTR [rdi+48]
add rdi, 64
cmp rdi, rax
jb .Lunrolled
pxor xmm0, xmm1
# horizontal xor of xmm0
movhlps xmm1, xmm0
pxor xmm0, xmm1
...
Unrolling by two (pxor / pxor / add / cmp/jne) would make a loop that can issue at one iteration per 1c, but requires four ALU execution ports. Only Haswell and later can keep up with that throughput. (Or AMD Bulldozer-family, because vector and integer instructions don't compete for execution ports, but conversely there are only two integer ALU pipes, so they only max out their instruction throughput with mixed code.)
This unroll by four is 6 fused-domain uops in the loop, so it can easily issue at one per 2c, and SnB/IvB can keep up with three ALU uops per clock.
Note that on Intel Nehalem through Broadwell, pxor (_mm_xor_si128) has better throughput than xorps (_mm_xor_ps), because it can run on more execution ports. If you're using AVX but not AVX2, it can make sense to use 256b _mm256_xor_ps instead of _mm_xor_si128, because _mm256_xor_si256 requires AVX2.
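In intrinsics, the two-accumulator idea with _mm_xor_si128 might look like the sketch below. The function name xor_sum_sse2 is made up, and it uses unaligned loads plus a scalar tail so that it works for any pointer and length, which is an assumption beyond the original code (that code relies on a 16-byte-aligned array whose size is a multiple of 16). Whether the compiler unrolls further is up to it.

```cpp
#include <emmintrin.h>  // SSE2: _mm_xor_si128, _mm_loadu_si128
#include <cstddef>
#include <cstdint>

// XOR of all n bytes at p, using two independent vector accumulators.
std::uint8_t xor_sum_sse2(const unsigned char* p, std::size_t n) {
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();
    std::size_t i = 0;
    // Two separate dependency chains, 32 bytes per iteration.
    for (; i + 32 <= n; i += 32) {
        acc0 = _mm_xor_si128(acc0,
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i)));
        acc1 = _mm_xor_si128(acc1,
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i + 16)));
    }
    acc0 = _mm_xor_si128(acc0, acc1);

    // Horizontal XOR of the 16 lanes, done the simple way via memory.
    alignas(16) unsigned char lanes[16];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc0);
    std::uint8_t result = 0;
    for (unsigned char b : lanes) result ^= b;

    // Scalar tail for the last n % 32 bytes.
    for (; i < n; ++i) result ^= p[i];
    return result;
}
```

In the question's benchmark this would be called as xor_sum_sse2(str, N).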
If it's not memory bandwidth, why is it only 12.6x speedup?
Nehalem's loop buffer (aka Loop Stream Decoder or LSD) has a "one clock delay" (according to Agner Fog's microarch pdf), so a loop with N uops will take ceil(N/4.0) + 1 cycles to issue out of the loop buffer if I understand him correctly. He doesn't explicitly say what happens to the last group of uops if there are less than 4, but SnB-family CPUs work this way (divide by 4 and round up). They can't issue uops from the next iteration following the taken branch. I tried to google about nehalem, but couldn't find anything useful.
So the char and int loops are presumably running at one load & xor per 2 clocks (since they're 3 fused-domain uops). Loop unrolling could ~double their throughput up to the point where they saturate the load port. SnB-family CPUs don't have that one clock delay, so they can run tiny loops at one clock per iteration.
Using perf counters or at least microbenchmarks to make sure that your absolute throughput is what you expect is a good idea. With just your relative measurements, you have no indication without this kind of analysis that you're leaving half your performance on the table.
The vector -O3 loop is 5 fused-domain uops, so it should be taking three clock cycles to issue. Doing 16x as much work, but taking 3 cycles per iteration instead of 2 would give us a speedup of 16 * 2/3 = 10.66. We're actually getting somewhat better than that, which I don't understand.
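To make that arithmetic easy to replay, here is the issue model as a couple of constexpr helpers. This only encodes my reading of Agner Fog's description, ceil(uops / 4) + 1 cycles per iteration, and the function names are invented for illustration. It is a model, not a measurement.

```cpp
// Cycles to issue one loop iteration from Nehalem's loop buffer under the
// assumed model: ceil(fused_uops / 4) + 1.
constexpr int issue_cycles(int fused_uops) {
    return (fused_uops + 3) / 4 + 1;
}

// Predicted speedup over the byte-at-a-time loop (3 fused-domain uops,
// 1 byte per iteration) for a loop that handles bytes_per_iter bytes
// using loop_uops fused-domain uops per iteration.
constexpr double predicted_speedup(int bytes_per_iter, int loop_uops,
                                   int scalar_uops = 3) {
    return static_cast<double>(bytes_per_iter) * issue_cycles(scalar_uops)
           / issue_cycles(loop_uops);
}
```

With these, predicted_speedup(4, 3) is 4 for the int loop, and predicted_speedup(16, 5) is 16 * 2/3 for the -O3 vector loop with its extra store.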
I'm going to stop here, instead of digging out a nehalem laptop and running actual benchmarks, since Nehalem is too old to be interesting to tune for at this level of detail.
Did you maybe compile with -mtune=core2? Or maybe your gcc had a different default tune setting, and didn't split up the compare-and-branch? In that case, the frontend probably wasn't the bottleneck, and throughput was maybe slightly limited by memory bandwidth, or by memory false dependencies:
Core 2 and Nehalem both have a false dependence between memory
addresses with the same set and offset, i.e. with a distance that is a
multiple of 4 kB.
This might cause a short bubble in the pipeline every 4k.
Before I checked on Nehalem's loop buffer and found the extra 1c per loop, I had a theory which I'm now confident is incorrect:
I thought the extra store uop in the loop that bumps it up over 4 uops would essentially cut the speed in half, so you'd see a speedup of ~6. However, maybe there are some execution bottlenecks that make the frontend issue throughput not the bottleneck after all?
Or maybe Nehalem's loop buffer is different from SnB's, and doesn't end an issue group at a predicted-taken branch. This would give a throughput speedup of 16 * 4/5 = 12.8 for the -O3 vector loop, if its 5 fused-domain uops can issue at a consistent 4 per clock. This matches the experimental data of 12.6429 speedup factor very well: slightly less than 12.8 is to be expected because of increased bandwidth requirements (occasional cache miss stalls when the prefetcher falls behind).
(The scalar loops still just run one iteration per clock: issuing more than one iteration per clock just means they bottleneck on one load per clock, and the 1 cycle xor loop-carried dependency.)
This can't be right because xorps in Nehalem can only run on port5, same as a fused compare-and-branch. So there's no way the non-unrolled vector loop could be running at more than one iteration per 2 cycles.
According to Agner Fog's tables, conditional branches have a throughput of one per 2c on Nehalem, further confirming that this is a bogus theory.

SSE2 is optimal when operating on completely parallel data. e.g.
for (int i = 0 ; i < N ; ++i)
z[i] = _mm_xor_ps(x[i], y[i]);
But in your case, each iteration of the loop depends upon the output of the previous iteration. This is known as a dependency chain. In short, it means that each consecutive xor is going to have to wait for the entire latency of the previous one before it can continue so it lowers the throughput.

jaket has already explained the likely problem: a dependency chain. I'll give it a try:
template<>
__m128 Sum(const __m128* data, const int N)
{
__m128 sum1 = _mm_set_ps1(0);
__m128 sum2 = _mm_set_ps1(0);
for (int i = 0; i < N; i += 2) {
sum1 = _mm_xor_ps(sum1, data[i + 0]);
sum2 = _mm_xor_ps(sum2, data[i + 1]);
}
return _mm_xor_ps(sum1, sum2);
}
Now there are no dependencies at all between the two lanes. Try expanding this to more lanes (e.g. 4).
You could also try using the integer version of these instructions (using __m128i). I do not understand the difference so this is just a hint.

In fact, the gcc compiler is optimized for SIMD. It explains why when you used -O2 the perf decreases significantly. You can re-check with -O1.

Related

Very low FLOPs/second without any data transfer

I tested the following code on my machine to see how much throughput I can get. The code does not do very much except assign each thread two nested loops:
#include <chrono>
#include <iostream>
int main() {
auto start_time = std::chrono::high_resolution_clock::now();
#pragma omp parallel for
for(int thread = 0; thread < 24; thread++) {
float i = 0.0f;
while(i < 100000.0f) {
float j = 0.0f;
while (j < 100000.0f) {
j = j + 1.0f;
}
i = i + 1.0f;
}
}
auto end_time = std::chrono::high_resolution_clock::now();
auto time = end_time - start_time;
std::cout << time / std::chrono::milliseconds(1) << std::endl;
return 0;
}
To my surprise, the throughput is very low according to perf
$ perf stat -e all_dc_accesses -e fp_ret_sse_avx_ops.all cmake-build-release/roofline_prediction
8907
Performance counter stats for 'cmake-build-release/roofline_prediction':
325.372.690 all_dc_accesses
240.002.400.000 fp_ret_sse_avx_ops.all
8,909514307 seconds time elapsed
202,819795000 seconds user
0,059613000 seconds sys
With 240.002.400.000 FLOPs in 8.83 seconds, the machine achieved only 27.1 GFLOPs/second, way below the CPU's capacity of 392 GFLOPs/sec (I got this number from a roofline modelling software).
My question is, how can I achieve higher throughput?
Compiler: GCC 9.3.0
CPU: AMD Threadripper 1920X
Optimization level: -O3
OpenMP's flag: -fopenmp

Compiled with GCC 9.3 with those options, the inner loop looks like this:
.L3:
addss xmm0, xmm2
comiss xmm1, xmm0
ja .L3
Some other combinations of GCC version / options may result in the loop being elided, after all it doesn't really do anything (except waste time).
The addss forms a loop-carried dependency with only itself in it. That is not fast, though: on Zen 1 it takes 3 cycles per iteration, so the number of additions per cycle is 1/3. The maximum number of floating point additions per cycle could be attained by having at least 6 independent addps instructions in flight (256-bit vaddps may help a bit, but Zen 1 executes such 256-bit SIMD instructions as 2 128-bit operations internally), to cover the latency of 3 cycles at a throughput of 2 per cycle (so 6 operations need to be active at any time). That would correspond to 8 additions per cycle, 24 times as much as the current code.
From a C++ program, it may be possible to coax the compiler into generating suitable machine code by:
Using -ffast-math (if possible, which it isn't always)
Using explicit vectorization using _mm_add_ps
Manually unrolling the loop, using (at least 6) independent accumulators
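As a sketch of the last two suggestions combined, here is a reduction with six independent _mm_add_ps accumulators. Note the assumptions: it sums an array rather than reproducing the counter loop from the question (which has no data to vectorize), the name sum_six_accumulators is made up, and the six-accumulator count comes from the latency 3 times throughput 2 reasoning above for Zen 1.

```cpp
#include <xmmintrin.h>  // SSE: _mm_add_ps, _mm_loadu_ps
#include <cstddef>

// Sum n floats using six independent vector accumulators, so that up to six
// addps instructions can be in flight at once.
float sum_six_accumulators(const float* data, std::size_t n) {
    __m128 a0 = _mm_setzero_ps(), a1 = a0, a2 = a0, a3 = a0, a4 = a0, a5 = a0;
    std::size_t i = 0;
    for (; i + 24 <= n; i += 24) {
        a0 = _mm_add_ps(a0, _mm_loadu_ps(data + i));
        a1 = _mm_add_ps(a1, _mm_loadu_ps(data + i + 4));
        a2 = _mm_add_ps(a2, _mm_loadu_ps(data + i + 8));
        a3 = _mm_add_ps(a3, _mm_loadu_ps(data + i + 12));
        a4 = _mm_add_ps(a4, _mm_loadu_ps(data + i + 16));
        a5 = _mm_add_ps(a5, _mm_loadu_ps(data + i + 20));
    }
    // Combine the accumulators, then the four lanes.
    __m128 total = _mm_add_ps(_mm_add_ps(_mm_add_ps(a0, a1), _mm_add_ps(a2, a3)),
                              _mm_add_ps(a4, a5));
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, total);
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    // Scalar tail for the last n % 24 elements.
    for (; i < n; ++i) sum += data[i];
    return sum;
}
```

The result can differ slightly from a strictly sequential sum because the additions are reassociated, which is also why the compiler won't do this on its own without -ffast-math.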

Why does not AVX further improve the performance compared with SSE2?

I am new to the field of SSE2 and AVX. I write the following code to test the performance of both SSE2 and AVX.
#include <cmath>
#include <iostream>
#include <chrono>
#include <emmintrin.h>
#include <immintrin.h>
void normal_res(float* __restrict__ a, float* __restrict__ b, float* __restrict__ c, unsigned long N) {
for (unsigned long n = 0; n < N; n++) {
c[n] = sqrt(a[n]) + sqrt(b[n]);
}
}
void normal(float* a, float* b, float* c, unsigned long N) {
for (unsigned long n = 0; n < N; n++) {
c[n] = sqrt(a[n]) + sqrt(b[n]);
}
}
void sse(float* a, float* b, float* c, unsigned long N) {
__m128* a_ptr = (__m128*)a;
__m128* b_ptr = (__m128*)b;
for (unsigned long n = 0; n < N; n+=4, a_ptr++, b_ptr++) {
__m128 asqrt = _mm_sqrt_ps(*a_ptr);
__m128 bsqrt = _mm_sqrt_ps(*b_ptr);
__m128 add_result = _mm_add_ps(asqrt, bsqrt);
_mm_store_ps(&c[n], add_result);
}
}
void avx(float* a, float* b, float* c, unsigned long N) {
__m256* a_ptr = (__m256*)a;
__m256* b_ptr = (__m256*)b;
for (unsigned long n = 0; n < N; n+=8, a_ptr++, b_ptr++) {
__m256 asqrt = _mm256_sqrt_ps(*a_ptr);
__m256 bsqrt = _mm256_sqrt_ps(*b_ptr);
__m256 add_result = _mm256_add_ps(asqrt, bsqrt);
_mm256_store_ps(&c[n], add_result);
}
}
int main(int argc, char** argv) {
unsigned long N = 1 << 30;
auto *a = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
auto *b = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
auto *c = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
std::chrono::time_point<std::chrono::system_clock> start, end;
for (unsigned long i = 0; i < N; ++i) {
a[i] = 3141592.65358;
b[i] = 1234567.65358;
}
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
normal(a, b, c, N);
end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "normal elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
normal_res(a, b, c, N);
end = std::chrono::system_clock::now();
elapsed_seconds = end - start;
std::cout << "normal restrict elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
sse(a, b, c, N);
end = std::chrono::system_clock::now();
elapsed_seconds = end - start;
std::cout << "sse elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
avx(a, b, c, N);
end = std::chrono::system_clock::now();
elapsed_seconds = end - start;
std::cout << "avx elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
return 0;
}
I compile my program with the g++ compiler as follows.
g++ -msse -msse2 -mavx -mavx512f -O2
The results are as follows. It seems that there is no further improvement when I use the wider 256-bit vectors.
normal elapsed time: 10.5311
normal restrict elapsed time: 8.00338
sse elapsed time: 0.995806
avx elapsed time: 0.973302
I have two questions.
Why doesn't AVX give me further improvement? Is it because of memory bandwidth?
According to my experiment, SSE2 performs 10 times faster than the naive version. Why is that? I expected SSE2 to be only 4 times faster, since a 128-bit vector holds 4 single-precision floats. Thanks a lot.

There are several issues here....
Memory bandwidth is very likely to be important for these array sizes -- more notes below.
Throughput for SSE and AVX square root instructions may not be what you expect on your processor -- more notes below.
The first test ("normal") may be slower than expected because the output array is instantiated (i.e., virtual to physical mappings are created) during the timed part of the test. (Just fill c with zeros in the loop that initializes a and b to fix this.)
Memory Bandwidth Notes:
With N = 1<<30 and float variables, each array is 4GiB.
Each test reads two arrays and writes to a third array. This third array must also be read from memory before being overwritten -- this is called a "write allocate" or a "read for ownership".
So you are reading 12 GiB and writing 4 GiB in each test. The SSE and AVX tests therefore correspond to ~16 GB/s of DRAM bandwidth, which is near the high end of the range typically seen for single-threaded operation on recent processors.
Instruction Throughput Notes:
The best reference for instruction latency and throughput on x86 processors is "instruction_tables.pdf" from https://www.agner.org/optimize/
Agner defines "reciprocal throughput" as the average number of cycles per retired instruction when the processor is given a workload of independent instructions of the same type.
As an example, for an Intel Skylake core, the throughput of SSE and AVX SQRT is the same per element (the 256-bit instruction processes twice as many floats but takes twice as long):
SQRTPS (xmm) 1/throughput = 3 --> 1 instruction every 3 cycles
VSQRTPS (ymm) 1/throughput = 6 --> 1 instruction every 6 cycles
Execution time for the square roots is expected to be (1<<31) square roots / 4 square roots per SSE SQRT instruction * 3 cycles per SSE SQRT instruction / 3 GHz = 0.54 seconds (randomly assuming a processor frequency).
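The same estimate written out for both vector widths, as a sketch. The 3 GHz clock is the arbitrary assumption from above, the reciprocal throughputs are the Skylake figures just quoted, and the helper name is invented.

```cpp
// Seconds spent in square roots alone, if nothing else limits throughput.
constexpr double sqrt_seconds(double elements, int floats_per_instr,
                              double cycles_per_instr, double hz) {
    return elements / floats_per_instr * cycles_per_instr / hz;
}

// Two square roots per output element, N = 1 << 30 outputs.
constexpr double kSquareRoots = 2.0 * (1ull << 30);

// SQRTPS xmm: 4 floats every 3 cycles. VSQRTPS ymm: 8 floats every 6 cycles.
constexpr double kSseSeconds = sqrt_seconds(kSquareRoots, 4, 3.0, 3.0e9);
constexpr double kAvxSeconds = sqrt_seconds(kSquareRoots, 8, 6.0, 3.0e9);
```

Both constants come out to about 0.54 seconds, which is why going from 128-bit to 256-bit vectors is not expected to help on this core.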
Expected throughput for the "normal" and "normal_res" cases depends on the specifics of the generated assembly code.

Scalar being 10x instead of 4x slower:
You're getting page faults in c[] inside the scalar timed region because that's the first time you're writing it. If you did tests in a different order, whichever one was first would pay that large penalty. That part is a duplicate of this mistake: Why is iterating though `std::vector` faster than iterating though `std::array`? See also Idiomatic way of performance evaluation?
normal pays this cost in its first of the 5 passes over the array. Smaller arrays and a larger repeat count would amortize this even more, but better to memset or otherwise fill your destination first to pre-fault it ahead of the timed region.
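If you'd rather not pay for a full memset, touching one byte per page is enough to get the pages mapped. This is a sketch of that alternative (not the memset approach itself); it assumes 4 KiB pages, and the function name prefault is hypothetical. It overwrites the touched bytes with zero, so only use it on a destination buffer whose contents don't matter yet.

```cpp
#include <cstddef>

// Write one byte in every page of the buffer so the OS maps the pages before
// the timed region. page_size = 4096 is an assumption; query the OS if the
// target uses a different page size.
void prefault(void* buffer, std::size_t bytes, std::size_t page_size = 4096) {
    // volatile keeps the compiler from dropping the otherwise useless stores.
    volatile unsigned char* p = static_cast<volatile unsigned char*>(buffer);
    for (std::size_t off = 0; off < bytes; off += page_size)
        p[off] = 0;
}
```

In the question's program that would be prefault(c, N * sizeof(float)); placed before the first start = std::chrono::system_clock::now();.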
normal_res is also scalar but is writing into an already-dirtied c[]. Scalar is 8x slower than SSE instead of the expected 4x.
You used sqrt(double) instead of sqrtf(float) or std::sqrt(float). On Skylake-X, this perfectly accounts for an extra factor of 2 throughput. Look at the compiler's asm output on the Godbolt compiler explorer (GCC 7.4 assuming the same system as your last question). I used -mavx512f (which implies -mavx and -msse), and no tuning options, to hopefully get about the same code-gen you did. main doesn't inline normal_res, so we can just look at the stand-alone definition for it.
normal_res(float*, float*, float*, unsigned long):
...
vpxord zmm2, zmm2, zmm2 # uh oh, 512-bit instruction reduces turbo clocks for the next several microseconds. Silly compiler
# more recent gcc would just use `vpxor xmm0,xmm0,xmm0`
...
.L5: # main loop
vxorpd xmm0, xmm0, xmm0
vcvtss2sd xmm0, xmm0, DWORD PTR [rdi+rbx*4] # convert to double
vucomisd xmm2, xmm0
vsqrtsd xmm1, xmm1, xmm0 # scalar double sqrt
ja .L16
.L3:
vxorpd xmm0, xmm0, xmm0
vcvtss2sd xmm0, xmm0, DWORD PTR [rsi+rbx*4]
vucomisd xmm2, xmm0
vsqrtsd xmm3, xmm3, xmm0 # scalar double sqrt
ja .L17
.L4:
vaddsd xmm1, xmm1, xmm3 # scalar double add
vxorps xmm4, xmm4, xmm4
vcvtsd2ss xmm4, xmm4, xmm1 # could have just converted in-place without zeroing another destination to avoid a false dependency :/
vmovss DWORD PTR [rdx+rbx*4], xmm4
add rbx, 1
cmp rcx, rbx
jne .L5
The vpxord zmm only reduces turbo clock for a few milliseconds (I think) at the start of each call to normal and normal_res. It doesn't keep using 512-bit operations so clock speed can jump back up again later. This might partially account for it not being exactly 8x.
The compare / ja is because you didn't use -fno-math-errno so GCC still calls actual sqrt for inputs < 0 to get errno set. It's doing if (!(0 <= tmp)) goto fallback, jumping on 0 > tmp or unordered. "Fortunately" sqrt is slow enough that it's still the only bottleneck. Out-of-order exec of the conversion and compare/branching means the SQRT unit is still kept busy ~100% of the time.
vsqrtsd throughput (6 cycles) is 2x slower than vsqrtss throughput (3 cycles) on Skylake-X, so using double costs a factor of 2 in scalar throughput.
Scalar sqrt on Skylake-X has the same throughput as the corresponding 128-bit ps / pd SIMD version. So 6 cycles per 1 number as a double vs. 3 cycles per 4 floats as a ps vector fully explains the 8x factor.
The extra 8x vs. 10x slowdown for normal was just from page faults.
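For completeness, a scalar version that stays in single precision could look like this sketch (the function name is made up). std::sqrt has a float overload, so no conversion to double is involved; compiling with -fno-math-errno should additionally remove the compare and branch around each sqrt described above.

```cpp
#include <cmath>
#include <cstddef>
#include <type_traits>

// std::sqrt(float) returns float, unlike the C sqrt(double) the original
// loop ended up calling.
static_assert(std::is_same<decltype(std::sqrt(1.0f)), float>::value,
              "std::sqrt(float) should return float");

void sqrt_sum_float(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = std::sqrt(a[i]) + std::sqrt(b[i]);
}
```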
SSE vs. AVX sqrt throughput
128-bit sqrtps is sufficient to get the full throughput of the SIMD div/sqrt unit; assuming this is a Skylake-server like your last question, it's 256 bits wide but not fully pipelined. The CPU can alternate sending a 128-bit vector into the low or high half to take advantage of the full hardware width even when you're only using 128-bit vectors. See Floating point division vs floating point multiplication (FP div and sqrt run on the same execution unit.)
See also instruction latency/throughput numbers on https://uops.info/, or on https://agner.org/optimize/.
The add/sub/mul/fma are all 512-bits wide and fully pipelined; use that (e.g. to evaluate a 6th order polynomial or something) if you want something that can scale with vector width. div/sqrt is a special case.
You'd expect a benefit from using 256-bit vectors for SQRT only if you had a bottleneck on the front-end (4/clock instruction / uop throughput), or if you were doing a bunch of add/sub/mul/fma work with the vectors as well.
256-bit isn't worse, but it doesn't help when the only computation bottleneck is on the div/sqrt unit's throughput.
See John McCalpin's answer for more details about write-only costing about the same as a read+write, because of RFOs.
With so little computation per memory access, you're probably close to bottlenecking on memory bandwidth again / still. Even if the FP SQRT hardware was wider / faster, you might not in practice have your code run any faster. Instead you'd just have the core spend more time doing nothing while waiting for data to arrive from memory.
It seems you are getting exactly the expected speedup from 128-bit vectors (2x * 4x = 8x), so apparently the __m128 version is not bottlenecked on memory bandwidth either.
2x sqrt per 4 memory accesses is about the same as the a[i] = sqrt(a[i]) (1x sqrt per load + store) you were doing in the code you posted in chat, but you didn't give any numbers for that. That one avoided the page-fault problem because it was rewriting an array in-place after initializing it.
In general rewriting an array in-place is a good idea if you for some reason keep insisting on trying to get a 4x / 8x / 16x SIMD speedup using these insanely huge arrays that won't even fit in L3 cache.
Memory access is pipelined, and overlaps with computation (assuming sequential access so prefetchers can be pulling it in continuously without having to compute the next address): faster computation doesn't speed up overall progress. Cache lines arrive from memory at some fixed maximum bandwidth, with ~12 cache line transfers in flight at once (12 LFBs in Skylake). Or L2 "superqueue" can track more cache lines than that (maybe 16?), so L2 prefetch is reading ahead of where the CPU core is stalled.
As long as your computation can keep up with that rate, making it faster will just leave more cycles of doing nothing before the next cache line arrives.
(The store buffer writing back to L1d and then evicting dirty lines is also happening, but the basic idea of core waiting for memory still works.)
You could think of it like stop-and-go traffic in a car: a gap opens ahead of your car. Closing that gap faster doesn't gain you any average speed, it just means you have to stop faster.
If you want to see the benefit of AVX and AVX512 over SSE, you'll need smaller arrays (and a higher repeat-count). Or you'll need lots of ALU work per vector, like a polynomial.
In many real-world problems, the same data is used repeatedly so caches work. And it's possible to break up your problem into doing multiple things to one block of data while it's hot in cache (or even while loaded in registers), to increase the computational intensity enough to take advantage of the compute vs. memory balance of modern CPUs.
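A toy illustration of that last point: two passes over the data, done block by block so the second pass finds its block still in cache. The two operations (scale-and-offset, then square) are arbitrary placeholders, the function name is invented, and the default block of 4096 floats (16 KiB) is only an assumption about what fits comfortably in L1d.

```cpp
#include <algorithm>
#include <cstddef>

// Apply two passes to data, one cache-sized block at a time, instead of
// streaming the whole array through the cache twice. block must be > 0.
void scale_then_square_blocked(float* data, std::size_t n,
                               std::size_t block = 4096) {
    for (std::size_t start = 0; start < n; start += block) {
        const std::size_t end = std::min(n, start + block);
        // First pass over this block.
        for (std::size_t i = start; i < end; ++i)
            data[i] = data[i] * 2.0f + 1.0f;
        // Second pass reuses the same block while it is still hot.
        for (std::size_t i = start; i < end; ++i)
            data[i] = data[i] * data[i];
    }
}
```

For something this simple you would just fuse the two passes into one loop; blocking matters when the passes can't be merged element by element.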

why is it faster to print number in binary using arithmetic instead of _bittest

The purpose of the next two code sections is to print a number in binary.
The first one does this with two instructions (_bittest), while the second does it with pure arithmetic, which takes three instructions.
The first code section:
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>
long num = 78002;
int main()
{
unsigned char bits[32];
long nBit;
LARGE_INTEGER a, b, f;
QueryPerformanceCounter(&a);
for (size_t i = 0; i < 100000000; i++)
{
for (nBit = 0; nBit < 31; nBit++)
{
bits[nBit] = _bittest(&num, nBit);
}
}
QueryPerformanceCounter(&b);
QueryPerformanceFrequency(&f);
printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
printf_s("Binary representation:\n");
while (nBit--)
{
if (bits[nBit])
printf_s("1");
else
printf_s("0");
}
return 0;
}
The inner loop compiles to the instructions bt and setb.
The second code section:
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>
long num = 78002;
int main()
{
unsigned char bits[32];
long nBit;
LARGE_INTEGER a, b, f;
QueryPerformanceCounter(&a);
for (size_t i = 0; i < 100000000; i++)
{
long curBit = 1;
for (nBit = 0; nBit < 31; nBit++)
{
bits[nBit] = (num&curBit) >> nBit;
curBit <<= 1;
}
}
QueryPerformanceCounter(&b);
QueryPerformanceFrequency(&f);
printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
printf_s("Binary representation:\n");
while (nBit--)
{
if (bits[nBit])
printf_s("1");
else
printf_s("0");
}
return 0;
}
The inner loop compiles to and, add (as a shift left), and sar.
The second code section runs three times faster than the first one.
Why do three CPU instructions run faster than two?

Not an answer (Bo already gave one), but the second inner loop version can be simplified a bit:
long numCopy = num;
for (nBit = 0; nBit < 31; nBit++) {
bits[nBit] = numCopy & 1;
numCopy >>= 1;
}
This has a subtle difference (one instruction fewer) with gcc 7.2 targeting 32-bit.
(I'm assuming 32b target, as you convert long into 32 bit array, which makes sense only on 32b target ... and I assume x86, as it includes <windows.h>, so it's clearly for obsolete OS target, although I think windows now have even 64b version? (I don't care.))
Answer:
Why do three CPU instructions run faster than two?
Because the instruction count only correlates with performance (usually fewer is better). A modern x86 CPU is a much more complex machine: it translates the actual x86 instructions into micro-ops before execution, transforms them further with things like out-of-order execution and register renaming (to break false dependency chains), and then executes the resulting micro-ops on execution units that can each handle only some kinds of micro-op. In the ideal case you may get 2-3 micro-ops executed in parallel by 2-3 units in a single cycle; in the worst case you may be executing a complete microcode loop implementing some complex x86 instruction, taking several cycles to finish and blocking most of the CPU's units.
Another factor is the availability of data from memory and the cost of memory writes: a single cache miss, when the data must be fetched from a higher-level cache or even from memory itself, creates a stall of tens to hundreds of cycles. Having compact data structures that favour predictable access patterns and don't exhaust all cache lines is paramount for exploiting maximum CPU performance.
If you are at the stage of asking "why are 3 instructions faster than 2 instructions", you can pretty much start with any x86 optimization article or book and keep reading for a few months or year(s); it's quite a complex topic.
You may want to check this answer https://gamedev.stackexchange.com/q/27196 for further reading...

I'm assuming you're using x86-64 MSVC CL19 (or something that makes similar code).
_bittest is slower because MSVC does a horrible job and keeps the value in memory and bt [mem], reg is much slower than bt reg,reg. This is a compiler missed-optimization. It happens even if you make num a local variable instead of a global, even when the initializer is still constant!
I included some perf analysis for Intel Sandybridge-family CPUs because they're common; you didn't say and yes it matters: bt [mem], reg has one per 3 cycle throughput on Ryzen, one per 5 cycle throughput on Haswell. And other perf characteristics differ...
(For just looking at the asm, it's usually a good idea to make a function with args to get code the compiler can't do constant-propagation on. It can't in this case because it doesn't know if anything modifies num before main runs, because it's not static.)
Your instruction-counting didn't include the whole loop so your counts are wrong, but more importantly you didn't consider the different costs of different instructions. (See Agner Fog's instruction tables and optimization manual.)
This is your whole inner loop with the _bittest intrinsic, with uop counts for Haswell / Skylake:
for (nBit = 0; nBit < 31; nBit++) {
bits[nBit] = _bittest(&num, nBit);
//bits[nBit] = (bool)(num & (1UL << nBit)); // much more efficient
}
Asm output from MSVC CL19 -Ox on the Godbolt compiler explorer
$LL7@main:
bt DWORD PTR num, ebx ; 10 uops (microcoded), one per 5 cycle throughput
lea rcx, QWORD PTR [rcx+1] ; 1 uop
setb al ; 1 uop
inc ebx ; 1 uop
mov BYTE PTR [rcx-1], al ; 1 uop (micro-fused store-address and store-data)
cmp ebx, 31
jb SHORT $LL7@main ; 1 uop (macro-fused with cmp)
That's 15 fused-domain uops, so it can issue (at 4 per clock) in 3.75 cycles. But that's not the bottleneck: Agner Fog's testing found that bt [mem], reg has a throughput of one per 5 clock cycles.
IDK why it's 3x slower than your other loop. Maybe the other ALU instructions compete for the same port as the bt, or the data dependency it's part of causes a problem, or just being a micro-coded instruction is a problem, or maybe the outer loop is less efficient?
Anyway, using bt [mem], reg instead of bt reg, reg is a major missed optimization. This loop would have been faster than your other loop with a 1 uop, 1c latency, 2-per-clock throughput bt r9d, ebx.
The inner loop compiles to and, add (as a shift left), and sar.
Huh? Those are the instructions MSVC associates with the curBit <<= 1; source line (even though that line is fully implemented by the add self,self, and the variable-count arithmetic right shift is part of a different line.)
But the whole loop is this clunky mess:
long curBit = 1;
for (nBit = 0; nBit < 31; nBit++) {
bits[nBit] = (num&curBit) >> nBit;
curBit <<= 1;
}
$LL18@main: # MSVC CL19 -Ox
mov ecx, ebx ; 1 uop
lea r8, QWORD PTR [r8+1] ; 1 uop pointer-increment for bits
mov eax, r9d ; 1 uop. r9d holds num
inc ebx ; 1 uop
and eax, edx ; 1 uop
# MSVC says all the rest of these instructions are from curBit <<= 1; but they're obviously not.
add edx, edx ; 1 uop
sar eax, cl ; 3 uops (variable-count shifts suck)
mov BYTE PTR [r8-1], al ; 1 uop (micro-fused)
cmp ebx, 31
jb SHORT $LL18@main ; 1 uop (macro-fused with cmp)
So this is 11 fused-domain uops, and takes 2.75 clock cycles per iteration to issue from the front-end.
I don't see any loop-carried dep chains longer than that front-end bottleneck, so it probably runs about that fast.
Copying ebx to ecx every iteration instead of just using ecx as the loop counter (nBit) is an obvious missed optimization. The shift-count is needed in cl for a variable-count shift (unless you enable BMI2 instructions, if MSVC can even do that.)
There are major missed optimizations here (in the "fast" version), so you should probably write your source differently to hand-hold your compiler into making less bad code. It implements this fairly literally instead of transforming it into something the CPU can do efficiently, or using bt reg,reg / setc.
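One way to write the source so the compiler has less to get wrong is to shift a register copy right by one each iteration, the same shape as the numCopy loop suggested in the other answer. This is only a sketch: `to_bits` is a made-up name, and a compiler is free to emit something other than a shift and an and for it.

```cpp
#include <array>
#include <cstdint>

// bits[i] receives bit i of num (least significant bit first).
// Shifting a by-value copy by a constant count avoids both the
// memory-operand bt and the variable-count shift discussed above.
std::array<unsigned char, 32> to_bits(std::uint32_t num) {
    std::array<unsigned char, 32> bits{};
    for (int i = 0; i < 32; ++i) {
        bits[i] = static_cast<unsigned char>(num & 1u);
        num >>= 1;
    }
    return bits;
}
```

Printing from index 31 down to 0 gives the usual most-significant-bit-first order.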
How to do this fast in asm or with intrinsics
Use SSE2 / AVX. Get the right byte (containing the corresponding bit) into each byte element of a vector, and PANDN (to invert your vector) with a mask that has the right bit for that element. PCMPEQB against zero. That gives you 0 / -1. To get ASCII digits, use _mm_sub_epi8(set1('0'), mask) to subtract 0 or -1 (add 0 or 1) to ASCII '0', conditionally turning it into '1'.
The first steps of this (getting a vector of 0/-1 from a bitmask) is How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?.
Fastest way to unpack 32 bits to a 32 byte SIMD vector (has a 128b version). Without SSSE3 (pshufb), I think punpcklbw / punpcklwd (and maybe pshufd) is what you need to repeat each byte of num 8 times and make two 16-byte vectors.
is there an inverse instruction to the movemask instruction in intel avx2?.
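A minimal SSE2-only sketch of the idea, under a few assumptions: `bits_lsb_first` is a made-up name, the output is least-significant-bit first, and instead of PANDN plus a compare against zero it uses PAND and then PCMPEQB against the mask itself, which yields the same 0 / -1 bytes. The byte broadcast uses the punpck sequence mentioned above rather than pshufb.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>
#include <string>

// Returns 32 ASCII characters; character i is '1' if bit i of x is set.
std::string bits_lsb_first(std::uint32_t x) {
    // Repeat each byte of x eight times: b0 x8, b1 x8 | b2 x8, b3 x8.
    __m128i v = _mm_cvtsi32_si128(static_cast<int>(x));
    v = _mm_unpacklo_epi8(v, v);              // b0 b0 b1 b1 b2 b2 b3 b3 ...
    v = _mm_unpacklo_epi16(v, v);             // b0 x4, b1 x4, b2 x4, b3 x4
    const __m128i lo = _mm_unpacklo_epi32(v, v);  // b0 x8, b1 x8
    const __m128i hi = _mm_unpackhi_epi32(v, v);  // b2 x8, b3 x8

    // Byte j of each 8-byte half selects bit j: 0x01, 0x02, ..., 0x80.
    const __m128i mask =
        _mm_set1_epi64x(static_cast<long long>(0x8040201008040201ULL));
    const __m128i ascii0 = _mm_set1_epi8('0');

    // 0xFF (-1) where the selected bit is set, 0 elsewhere.
    const __m128i lo_set = _mm_cmpeq_epi8(_mm_and_si128(lo, mask), mask);
    const __m128i hi_set = _mm_cmpeq_epi8(_mm_and_si128(hi, mask), mask);

    // '0' - 0 = '0', '0' - (-1) = '1'.
    char out[32];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out),
                     _mm_sub_epi8(ascii0, lo_set));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 16),
                     _mm_sub_epi8(ascii0, hi_set));
    return std::string(out, 32);
}
```

Reversing the string (or the two stores plus a byte shuffle) gives the most-significant-bit-first order that the question prints.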
In scalar code, this is one way that runs at 1 bit->byte per clock. There are probably ways to do better without using SSE2 (storing multiple bytes at once to get around the 1 store per clock bottleneck that exists on all current CPUs), but why bother? Just use SSE2.
mov eax, [num]
lea rdi, [rsp + xxx] ; bits[]
.loop:
shr eax, 1 ; constant-count shift is efficient (1 uop). CF = last bit shifted out
setc [rdi] ; 2 uops, but just as efficient as setc reg / mov [mem], reg
shr eax, 1
setc [rdi+1]
add rdi, 2
cmp end_pointer ; compare against another register instead of a separate counter.
jb .loop
Unrolled by two to avoid bottlenecking on the front-end, so this can run at 1 bit per clock.

The difference is that the code _bittest(&num, nBit); uses a pointer to num, which makes the compiler store it in memory. And the memory access makes the code a lot slower.
bits[nBit] = _bittest(&num, nBit);
00007FF6D25110A0 bt dword ptr [num (07FF6D2513034h)],ebx ; <-----
00007FF6D25110A7 lea rcx,[rcx+1]
00007FF6D25110AB setb al
00007FF6D25110AE inc ebx
00007FF6D25110B0 mov byte ptr [rcx-1],al
The other version stores all the variables in registers, and uses very fast register shifts and adds. No memory accesses.

Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?

I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement an identical brute-force approach for testing the Collatz conjecture. The assembly solution was assembled with:
nasm -felf64 p14.asm && gcc p14.o -o p14
The C++ was compiled with:
g++ p14.cpp -o p14
Assembly, p14.asm:
section .data
fmt db "%d", 10, 0
global main
extern printf
section .text
main:
mov rcx, 1000000
xor rdi, rdi ; max i
xor rsi, rsi ; i
l1:
dec rcx
xor r10, r10 ; count
mov rax, rcx
l2:
test rax, 1
jpe even
mov rbx, 3
mul rbx
inc rax
jmp c1
even:
mov rbx, 2
xor rdx, rdx
div rbx
c1:
inc r10
cmp rax, 1
jne l2
cmp rdi, r10
cmovl rdi, r10
cmovl rsi, rcx
cmp rcx, 2
jne l1
mov rdi, fmt
xor rax, rax
call printf
ret
C++, p14.cpp:
#include <iostream>
int sequence(long n) {
int count = 1;
while (n != 1) {
if (n % 2 == 0)
n /= 2;
else
n = 3*n + 1;
++count;
}
return count;
}
int main() {
int max = 0, maxi;
for (int i = 999999; i > 0; --i) {
int s = sequence(i);
if (s > max) {
max = s;
maxi = i;
}
}
std::cout << maxi << std::endl;
}
I know about the compiler optimizations to improve speed and everything, but I don’t see many ways to further optimize my assembly solution (speaking programmatically, not mathematically).
The C++ code uses modulus every term and division every other term, while the assembly code only uses a single division every other term.
But the assembly is taking on average 1 second longer than the C++ solution. Why is this? I am asking mainly out of curiosity.
Execution times
My system: 64-bit Linux on 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).
g++ (unoptimized): avg 1272 ms.
g++ -O3: avg 578 ms.
asm (div) (original): avg 2650 ms.
asm (shr): avg 679 ms.
@johnfound asm (assembled with NASM): avg 501 ms.
@hidefromkgb asm: avg 200 ms.
@hidefromkgb asm, optimized by @Peter Cordes: avg 145 ms.
@Veedrac C++: avg 81 ms with -O3, 305 ms with -O0.

If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0 (compile fast, no extra optimization, and store/reload to memory after/before every C statement so a debugger can modify variables).
See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide for specific details for specific CPUs. See also the x86 tag wiki for more perf links.
See also this more general question about beating the compiler with hand-written asm: Is inline assembly language slower than native C++ code?. TL:DR: yes if you do it wrong (like this question).
Usually you're fine letting the compiler do its thing, especially if you try to write C++ that can compile efficiently. Also see is assembly faster than compiled languages?. One of the answers links to these neat slides showing how various C compilers optimize some really simple functions with cool tricks. Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” is in a similar vein.
even:
mov rbx, 2
xor rdx, rdx
div rbx
On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but out-of-order execution can run those early). High-uop-count instructions like DIV are microcoded, which can also cause front-end bottlenecks. In this case, latency is the most relevant factor because it's part of a loop-carried dependency chain.
shr rax, 1 does the same unsigned division: It's 1 uop, with 1c latency, and can run 2 per clock cycle.
For comparison, 32-bit division is faster, but still horrible vs. shifts. idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell.
As you can see from looking at gcc's -O0 asm output (Godbolt compiler explorer), it only uses shift instructions. clang -O0 does compile naively like you thought, even using 64-bit IDIV twice. (When optimizing, compilers do use both outputs of IDIV when the source does a division and modulus with the same operands, if they use IDIV at all.)
GCC doesn't have a totally-naive mode; it always transforms through GIMPLE, which means some "optimizations" can't be disabled. This includes recognizing division-by-constant and using shifts (power of 2) or a fixed-point multiplicative inverse (non power of 2) to avoid IDIV (see div_by_13 in the above godbolt link).
gcc -Os (optimize for size) does use IDIV for non-power-of-2 division,
unfortunately even in cases where the multiplicative inverse code is only slightly larger but much faster.
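To make the fixed-point multiplicative inverse concrete, here is a small sketch for unsigned 32-bit division by 13. The function name echoes div_by_13 from the Godbolt link, but the body is a reconstruction of the technique, not copied compiler output. The constant is ceil(2^34 / 13) = 1321528399 = 0x4EC4EC4F; since 13 * 0x4EC4EC4F exceeds 2^34 by only 3, the result is exact for every 32-bit input.

```cpp
#include <cstdint>

// x / 13 for 32-bit unsigned x without a divide instruction:
// widen to 64 bits, multiply by ceil(2^34 / 13), shift right by 34.
std::uint32_t div_by_13(std::uint32_t x) {
    return static_cast<std::uint32_t>(
        (std::uint64_t{x} * 0x4EC4EC4Fu) >> 34);
}
```

A multiply plus a shift has a latency of a few cycles, compared with the tens of cycles quoted above for DIV.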
Helping the compiler
(summary for this case: use uint64_t n)
First of all, it's only interesting to look at optimized compiler output. (-O3).
-O0 speed is basically meaningless.
Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place, writing your C/C++ source in a way that guides the compiler into making better code is usually the best approach. You have to know asm, and know what's efficient, but you apply this knowledge indirectly. Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing: see this answer and what I did with the non-unrolled loop in @Veedrac's code below.
This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using new ISA extension or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake. e.g. compare&branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm for one microarchitecture might not be optimal for other current and future CPUs. Comments on @johnfound's answer discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.
My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.
Hand-written asm is a black-box for the optimizer, so constant-propagation doesn't work when inlining makes an input a compile-time constant. Other optimizations are also affected. Read https://gcc.gnu.org/wiki/DontUseInlineAsm before using asm. (And avoid MSVC-style inline asm: inputs/outputs have to go through memory which adds overhead.)
In this case: your n has a signed type, and gcc uses the SAR/SHR/ADD sequence that gives the correct rounding. (IDIV and arithmetic-shift "round" differently for negative inputs, see the SAR insn set ref manual entry). (IDK if gcc tried and failed to prove that n can't be negative, or what. Signed-overflow is undefined behaviour, so it should have been able to.)
You should have used uint64_t n, so it can just SHR. And so it's portable to systems where long is only 32-bit (e.g. x86-64 Windows).
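A sketch of both points, with made-up function names. `sequence_u64` is the question's sequence() with an unsigned 64-bit n; `signed_div2` spells out the extra work a signed n / 2 needs so that negative values round toward zero. The second function assumes that >> on a negative value is an arithmetic shift, which C++20 guarantees and mainstream compilers did before that.

```cpp
#include <cstdint>

// Same algorithm as sequence() in the question, but with an unsigned
// 64-bit n so that n / 2 can compile to a single logical right shift.
int sequence_u64(std::uint64_t n) {
    int count = 1;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        ++count;
    }
    return count;
}

// What signed n / 2 has to do: add the sign bit (0 or 1) before the
// arithmetic shift, so negative odd values round toward zero.
std::int64_t signed_div2(std::int64_t n) {
    const std::int64_t sign =
        static_cast<std::int64_t>(static_cast<std::uint64_t>(n) >> 63);
    return (n + sign) >> 1;
}
```

For example, -7 / 2 is -3 in C++, while a bare arithmetic shift of -7 by one gives -4; the added sign bit is what closes that gap.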
BTW, gcc's optimized asm output looks pretty good (using unsigned long n): the inner loop it inlines into main() does this:
# from gcc5.4 -O3 plus my comments
# edx= count=1
# rax= uint64_t n
.L9: # do{
lea rcx, [rax+1+rax*2] # rcx = 3*n + 1
mov rdi, rax
shr rdi # rdi = n>>1;
test al, 1 # set flags based on n%2 (aka n&1)
mov rax, rcx
cmove rax, rdi # n= (n%2) ? 3*n+1 : n/2;
add edx, 1 # ++count;
cmp rax, 1
jne .L9 #}while(n!=1)
cmp/branch to update max and maxi, and then do the next n
The inner loop is branchless, and the critical path of the loop-carried dependency chain is:
3-component LEA (3 cycles)
cmov (2 cycles on Haswell, 1c on Broadwell or later).
Total: 5 cycles per iteration, latency bottleneck. Out-of-order execution takes care of everything else in parallel with this (in theory: I haven't tested with perf counters to see if it really runs at 5c/iter).
The FLAGS input of cmov (produced by TEST) is faster to produce than the RAX input (from LEA->MOV), so it's not on the critical path.
Similarly, the MOV->SHR that produces CMOV's RDI input is off the critical path, because it's also faster than the LEA. MOV on IvyBridge and later has zero latency (handled at register-rename time). (It still takes a uop, and a slot in the pipeline, so it's not free, just zero latency). The extra MOV in the LEA dep chain is part of the bottleneck on other CPUs.
The cmp/jne is also not part of the critical path: it's not loop-carried, because control dependencies are handled with branch prediction + speculative execution, unlike data dependencies on the critical path.
Beating the compiler
GCC did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false-dependencies for partial-flag-modifying instructions.
It could also save all the MOV instructions, and the TEST: SHR sets CF= the bit shifted out, so we can use cmovc instead of test / cmovz.
### Hand-optimized version of what gcc does
.L9: #do{
lea rcx, [rax+1+rax*2] # rcx = 3*n + 1
shr rax, 1 # n>>=1; CF = n&1 = n%2
cmovc rax, rcx # n= (n&1) ? 3*n+1 : n/2;
inc edx # ++count;
cmp rax, 1
jne .L9 #}while(n!=1)
See @johnfound's answer for another clever trick: remove the CMP by branching on SHR's flag result as well as using it for CMOV: zero only if n was 1 (or 0) to start with. (Fun fact: SHR with count != 1 on Nehalem or earlier causes a stall if you read the flag results. That's how they made it single-uop. The shift-by-1 special encoding is fine, though.)
Avoiding MOV doesn't help with the latency at all on Haswell (Can x86's MOV really be "free"? Why can't I reproduce this at all?). It does help significantly on CPUs like Intel pre-IvB, and AMD Bulldozer-family, where MOV is not zero-latency (and Ice Lake with updated microcode). The compiler's wasted MOV instructions do affect the critical path. BD's complex-LEA and CMOV are both lower latency (2c and 1c respectively), so it's a bigger fraction of the latency. Also, throughput bottlenecks become an issue, because it only has two integer ALU pipes. See @johnfound's answer, where he has timing results from an AMD CPU.
Even on Haswell, this version may help a bit by avoiding some occasional delays where a non-critical uop steals an execution port from one on the critical path, delaying execution by 1 cycle. (This is called a resource conflict). It also saves a register, which may help when doing multiple n values in parallel in an interleaved loop (see below).
LEA's latency depends on the addressing mode, on Intel SnB-family CPUs. 3c for 3 components ([base+idx+const], which takes two separate adds), but only 1c with 2 or fewer components (one add). Some CPUs (like Core2) do even a 3-component LEA in a single cycle, but SnB-family doesn't. Worse, Intel SnB-family standardizes latencies so there are no 2c uops, otherwise 3-component LEA would be only 2c like Bulldozer. (3-component LEA is slower on AMD as well, just not by as much).
So lea rcx, [rax + rax*2] / inc rcx is only 2c latency, faster than lea rcx, [rax + rax*2 + 1], on Intel SnB-family CPUs like Haswell. Break-even on BD, and worse on Core2. It does cost an extra uop, which normally isn't worth it to save 1c latency, but latency is the major bottleneck here and Haswell has a wide enough pipeline to handle the extra uop throughput.
Neither gcc, icc, nor clang (on godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things, because that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't model the pipeline in the target microarchitecture, at least not in the same detail as IACA or other static-analysis tools; they just use some heuristics.)
Simple loop unrolling won't help; this loop bottlenecks on the latency of a loop-carried dependency chain, not on loop overhead / throughput. This means it would do well with hyperthreading (or any other kind of SMT), since the CPU has lots of time to interleave instructions from two threads. This would mean parallelizing the loop in main, but that's fine because each thread can just check a range of n values and produce a pair of integers as a result.
Interleaving by hand within a single thread might be viable, too. Maybe compute the sequence for a pair of numbers in parallel, since each one only takes a couple registers, and they can all update the same max / maxi. This creates more instruction-level parallelism.
The trick is deciding whether to wait until all the n values have reached 1 before getting another pair of starting n values, or whether to break out and get a new start point for just one that reached the end condition, without touching the registers for the other sequence. Probably it's best to keep each chain working on useful data, otherwise you'd have to conditionally increment its counter.
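A rough C++ sketch of the second strategy (refill only the chain that finished), assuming two chains and made-up names `Best` and `longest_two_lanes`. It illustrates the bookkeeping only: the per-lane branches here are exactly the kind of overhead a hand-tuned asm version would try to avoid, and a compiler may not keep the two chains interleaved the way hand-written code would.

```cpp
#include <cstdint>

struct Best { std::uint32_t start; std::uint32_t len; };

// Longest Collatz sequence (counted in terms, like the question's code)
// among starting values 1 .. limit-1, advancing two independent chains
// per outer-loop iteration. A finished lane is refilled immediately,
// without touching the other lane's state.
Best longest_two_lanes(std::uint32_t limit) {
    Best best{1, 1};                    // the sequence for 1 has one term
    std::uint32_t next = 2;             // next unprocessed starting value
    std::uint64_t n[2] = {1, 1};
    std::uint32_t len[2] = {0, 0}, start[2] = {0, 0};
    bool active[2] = {false, false};

    auto refill = [&](int lane) {
        active[lane] = next < limit;
        if (active[lane]) {
            start[lane] = next;
            n[lane] = next;
            len[lane] = 1;
            ++next;
        }
    };
    refill(0);
    refill(1);

    while (active[0] || active[1]) {
        for (int lane = 0; lane < 2; ++lane) {
            if (!active[lane]) continue;
            const std::uint64_t x = n[lane];
            n[lane] = (x & 1) ? 3 * x + 1 : x >> 1;
            ++len[lane];
            if (n[lane] == 1) {
                if (len[lane] > best.len) best = Best{start[lane], len[lane]};
                refill(lane);
            }
        }
    }
    return best;
}
```

Because chains finish out of order, ties between equally long sequences are not resolved in any particular order here.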
You could maybe even do this with SSE packed-compare stuff to conditionally increment the counter for vector elements where n hadn't reached 1 yet. And then to hide the even longer latency of a SIMD conditional-increment implementation, you'd need to keep more vectors of n values up in the air. Maybe only worth it with 256b vectors (4x uint64_t).
I think the best strategy to make detection of a 1 "sticky" is to mask the vector of all-ones that you add to increment the counter. So after you've seen a 1 in an element, the increment-vector will have a zero, and +=0 is a no-op.
Untested idea for manual vectorization
# starting with YMM0 = [ n_d, n_c, n_b, n_a ] (64-bit elements)
# ymm4 = _mm256_set1_epi64x(1): increment vector
# ymm5 = all-zeros: count vector
.inner_loop:
vpaddq ymm1, ymm0, ymm0 # ymm1 = 2*n
vpaddq ymm1, ymm1, ymm0 # ymm1 = 3*n
vpaddq ymm1, ymm1, set1_epi64(1) # ymm1= 3*n + 1. Maybe could do this more efficiently?
vpsllq ymm3, ymm0, 63 # shift the low bit (bit 0) to the sign bit
vpsrlq ymm0, ymm0, 1 # n /= 2
# FP blend between integer insns may cost extra bypass latency, but integer blends don't have 1 bit controlling a whole qword.
vblendvpd ymm0, ymm0, ymm1, ymm3 # variable blend controlled by the sign bit of each 64-bit element. I might have the source operands backwards, I always have to look this up.
# ymm0 = updated n in each element.
vpcmpeqq ymm1, ymm0, set1_epi64(1)
vpandn ymm4, ymm1, ymm4 # zero out elements of ymm4 where the compare was true
vpaddq ymm5, ymm5, ymm4 # count++ in elements where n has never been == 1
vptest ymm4, ymm4
jnz .inner_loop
# Fall through when all the n values have reached 1 at some point, and our increment vector is all-zero
vextracti128 xmm0, ymm5, 1
vpmaxq .... crap this doesn't exist
# Actually just delay doing a horizontal max until the very very end. But you need some way to record max and maxi.
You can and should implement this with intrinsics instead of hand-written asm.
Algorithmic / implementation improvement:
Besides just implementing the same logic with more efficient asm, look for ways to simplify the logic, or avoid redundant work. e.g. memoize to detect common endings to sequences. Or even better, look at 8 trailing bits at once (gnasher's answer)
@EOF points out that tzcnt (or bsf) could be used to do multiple n/=2 iterations in one step. That's probably better than SIMD vectorizing; no SSE or AVX instruction can do that. It's still compatible with doing multiple scalar ns in parallel in different integer registers, though.
So the loop might look like this:
goto loop_entry; // C++ structured like the asm, for illustration only
do {
n = n*3 + 1;
loop_entry:
shift = _tzcnt_u64(n);
n >>= shift;
count += shift;
} while(n != 1);
This may do significantly fewer iterations, but variable-count shifts are slow on Intel SnB-family CPUs without BMI2. 3 uops, 2c latency. (They have an input dependency on the FLAGS because count=0 means the flags are unmodified. They handle this as a data dependency, and take multiple uops because a uop can only have 2 inputs (pre-HSW/BDW anyway)). This is the kind of thing that people complaining about x86's crazy-CISC design are referring to. It makes x86 CPUs slower than they would be if the ISA was designed from scratch today, even in a mostly-similar way. (i.e. this is part of the "x86 tax" that costs speed / power.) SHRX/SHLX/SARX (BMI2) are a big win (1 uop / 1c latency).
It also puts tzcnt (3c on Haswell and later) on the critical path, so it significantly lengthens the total latency of the loop-carried dependency chain. It does remove any need for a CMOV, or for preparing a register holding n>>1, though. @Veedrac's answer overcomes all this by deferring the tzcnt/shift for multiple iterations, which is highly effective (see below).
We can safely use BSF or TZCNT interchangeably, because n can never be zero at that point. TZCNT's machine-code decodes as BSF on CPUs that don't support BMI1. (Meaningless prefixes are ignored, so REP BSF runs as BSF).
TZCNT performs much better than BSF on AMD CPUs that support it, so it can be a good idea to use REP BSF, even if you don't care about setting ZF if the input is zero rather than the output. Some compilers do this when you use __builtin_ctzll even with -mno-bmi.
They perform the same on Intel CPUs, so just save the byte if that's all that matters. TZCNT on Intel (pre-Skylake) still has a false-dependency on the supposedly write-only output operand, just like BSF, to support the undocumented behaviour that BSF with input = 0 leaves its destination unmodified. So you need to work around that unless optimizing only for Skylake, so there's nothing to gain from the extra REP byte. (Intel often goes above and beyond what the x86 ISA manual requires, to avoid breaking widely-used code that depends on something it shouldn't, or that is retroactively disallowed. e.g. Windows 9x's assumes no speculative prefetching of TLB entries, which was safe when the code was written, before Intel updated the TLB management rules.)
Anyway, LZCNT/TZCNT on Haswell have the same false dep as POPCNT: see this Q&A. This is why in gcc's asm output for @Veedrac's code, you see it breaking the dep chain with xor-zeroing on the register it's about to use as TZCNT's destination when it doesn't use dst=src. Since TZCNT/LZCNT/POPCNT never leave their destination undefined or unmodified, this false dependency on the output on Intel CPUs is a performance bug / limitation. Presumably it's worth some transistors / power to have them behave like other uops that go to the same execution unit. The only perf upside is interaction with another uarch limitation: they can micro-fuse a memory operand with an indexed addressing mode on Haswell, but on Skylake where Intel removed the false dep for LZCNT/TZCNT they "un-laminate" indexed addressing modes while POPCNT can still micro-fuse any addr mode.
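For reference, the tzcnt-based loop sketched earlier can be written as a complete function. This is a minimal sketch, not tuned code: `sequence_ctz` is a made-up name, and __builtin_ctzll is a GCC/clang builtin, so its availability is an assumption about the toolchain. Unlike the illustration-only loop, it also counts each 3n+1 step, so its result matches the question's sequence().

```cpp
#include <cstdint>

// Number of terms in the Collatz sequence starting at n (n >= 1),
// stripping all trailing zero bits in one step instead of halving one
// bit at a time. __builtin_ctzll requires a non-zero argument, which
// holds here because n is never zero.
int sequence_ctz(std::uint64_t n) {
    int count = 1;
    int shift = __builtin_ctzll(n);
    n >>= shift;                 // n is now odd
    count += shift;
    while (n != 1) {
        n = 3 * n + 1;           // odd -> even
        ++count;
        shift = __builtin_ctzll(n);
        n >>= shift;             // back to odd
        count += shift;
    }
    return count;
}
```

Without -mbmi (or -march=native on a CPU with BMI1), compilers typically emit BSF or REP BSF for the builtin, as discussed above.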
Improvements to ideas / code from other answers:
@hidefromkgb's answer has a nice observation that you're guaranteed to be able to do one right shift after a 3n+1. You can compute this even more efficiently than just leaving out the checks between steps. The asm implementation in that answer is broken, though (it depends on OF, which is undefined after SHRD with a count > 1), and slow: ROR rdi,2 is faster than SHRD rdi,rdi,2, and using two CMOV instructions on the critical path is slower than an extra TEST that can run in parallel.
I put tidied / improved C (which guides the compiler to produce better asm), and tested+working faster asm (in comments below the C) up on Godbolt: see the link in @hidefromkgb's answer. (This answer hit the 30k char limit from the large Godbolt URLs, but shortlinks can rot and were too long for goo.gl anyway.)
Also improved the output-printing to convert to a string and make one write() instead of writing one char at a time. This minimizes impact on timing the whole program with perf stat ./collatz (to record performance counters), and I de-obfuscated some of the non-critical asm.
@Veedrac's code
I got a minor speedup from right-shifting as much as we know needs doing, and checking to continue the loop. From 7.5s for limit=1e8 down to 7.275s, on Core2Duo (Merom), with an unroll factor of 16.
code + comments on Godbolt. Don't use this version with clang; it does something silly with the defer-loop. Using a tmp counter k and then adding it to count later changes what clang does, but that slightly hurts gcc.
See discussion in comments: Veedrac's code is excellent on CPUs with BMI1 (i.e. not Celeron/Pentium)

Claiming that the C++ compiler can produce more optimal code than a competent assembly language programmer is a very bad mistake. And especially in this case. The human always can make the code better than the compiler can, and this particular situation is a good illustration of this claim.
The timing difference you're seeing is because the assembly code in the question is very far from optimal in the inner loops.
(The below code is 32-bit, but can be easily converted to 64-bit)
For example, the sequence function can be optimized to only 5 instructions:
.seq:
inc esi ; counter
lea edx, [3*eax+1] ; edx = 3*n+1
shr eax, 1 ; eax = n/2
cmovc eax, edx ; if CF eax = edx
jnz .seq ; jmp if n<>1
The whole code looks like:
include "%lib%/freshlib.inc"
@BinaryType console, compact
options.DebugMode = 1
include "%lib%/freshlib.asm"
start:
InitializeAll
mov ecx, 999999
xor edi, edi ; max
xor ebx, ebx ; max i
.main_loop:
xor esi, esi
mov eax, ecx
.seq:
inc esi ; counter
lea edx, [3*eax+1] ; edx = 3*n+1
shr eax, 1 ; eax = n/2
cmovc eax, edx ; if CF eax = edx
jnz .seq ; jmp if n<>1
cmp edi, esi
cmovb edi, esi
cmovb ebx, ecx
dec ecx
jnz .main_loop
OutputValue "Max sequence: ", edi, 10, -1
OutputValue "Max index: ", ebx, 10, -1
FinalizeAll
stdcall TerminateAll, 0
In order to compile this code, FreshLib is needed.
In my tests, (1 GHz AMD A4-1200 processor), the above code is approximately four times faster than the C++ code from the question (when compiled with -O0: 430 ms vs. 1900 ms), and more than two times faster (430 ms vs. 830 ms) when the C++ code is compiled with -O3.
The output of both programs is the same: max sequence = 525 on i = 837799.
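For readers who do not read assembly, the five-instruction loop above corresponds roughly to the following C++ sketch. The function name collatz_length is an assumption made for illustration, and 64-bit arithmetic is used here instead of the 32-bit registers of the original, so that intermediate values have more headroom. Whether a given compiler turns the ternary into a cmov is not guaranteed.

```cpp
#include <cstdint>

// Hypothetical C++ rendering of the .seq loop: both successors are computed
// on every iteration and one is selected, mirroring shr + cmovc.
unsigned collatz_length(std::uint64_t n) {
    unsigned count = 0;
    std::uint64_t half;
    do {
        ++count;                                   // inc esi
        const std::uint64_t odd_next = 3 * n + 1;  // lea edx, [3*eax+1]
        const bool was_odd = (n & 1) != 0;         // carry flag produced by shr
        half = n >> 1;                             // shr eax, 1
        n = was_odd ? odd_next : half;             // cmovc eax, edx
    } while (half != 0);                           // jnz .seq (flags come from shr)
    return count;                                  // number of terms, including the final 1
}
```

Calling collatz_length(837799) counts the terms of the longest chain below one million, which is the value the programs above print as the maximum sequence.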

For more performance: A simple change is observing that after n = 3n+1, n will be even, so you can divide by 2 immediately. And n won't be 1, so you don't need to test for it. So you could save a few if statements and write:
while (n % 2 == 0) n /= 2;
if (n > 1) for (;;) {
n = (3*n + 1) / 2;
if (n % 2 == 0) {
do n /= 2; while (n % 2 == 0);
if (n == 1) break;
}
}
Here's a big win: If you look at the lowest 8 bits of n, all the steps until you divided by 2 eight times are completely determined by those eight bits. For example, if the last eight bits are 0x01, that is in binary your number is ???? 0000 0001 then the next steps are:
3n+1 -> ???? 0000 0100
/ 2 -> ???? ?000 0010
/ 2 -> ???? ??00 0001
3n+1 -> ???? ??00 0100
/ 2 -> ???? ???0 0010
/ 2 -> ???? ???? 0001
3n+1 -> ???? ???? 0100
/ 2 -> ???? ???? ?010
/ 2 -> ???? ???? ??01
3n+1 -> ???? ???? ??00
/ 2 -> ???? ???? ???0
/ 2 -> ???? ???? ????
So all these steps can be predicted, and 256k + 1 is replaced with 81k + 1. Something similar will happen for all combinations. So you can make a loop with a big switch statement:
k = n / 256;
m = n % 256;
switch (m) {
case 0: n = 1 * k + 0; break;
case 1: n = 81 * k + 1; break;
case 2: n = 81 * k + 2; break;
...
case 155: n = 729 * k + 445; break;
...
}
Run the loop until n ≤ 128, because at that point n could become 1 with fewer than eight divisions by 2, and doing eight or more steps at a time would make you miss the point where you reach 1 for the first time. Then continue the "normal" loop - or have a table prepared that tells you how many more steps are needed to reach 1.
PS. I strongly suspect Peter Cordes' suggestion would make it even faster. There will be no conditional branches at all except one, and that one will be predicted correctly except when the loop actually ends. So the code would be something like
static const unsigned int multipliers [256] = { ... };
static const unsigned int adders [256] = { ... };
while (n > 128) {
size_t lastBits = n % 256;
n = (n >> 8) * multipliers [lastBits] + adders [lastBits];
}
In practice, you would measure whether processing the last 9, 10, 11, 12 bits of n at a time would be faster. For each bit, the number of entries in the table would double, and I expect a slowdown when the tables don't fit into L1 cache anymore.
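The two lookup tables do not have to be typed in by hand. The following is a minimal sketch of how they could be generated, assuming the 8-bit variant described above; the names Jump, build_jump_table and jump are hypothetical and not part of the original answer. It tracks n = 256*k + m symbolically as a*k + b and applies Collatz steps until eight halvings have happened.

```cpp
#include <array>
#include <cstdint>

// Hypothetical table entry: after the jump, n becomes multiplier * k + adder.
struct Jump {
    std::uint32_t multiplier;  // coefficient of k after eight halvings
    std::uint32_t adder;       // constant term after eight halvings
    std::uint32_t triples;     // how many 3n+1 steps were taken on the way
};

// For every low byte m, follow n = 256*k + m as a*k + b. While fewer than
// eight halvings have been done, a is still even, so parity depends on b only.
std::array<Jump, 256> build_jump_table() {
    std::array<Jump, 256> table{};
    for (std::uint32_t m = 0; m < 256; ++m) {
        std::uint32_t a = 256, b = m, triples = 0;
        for (int halvings = 0; halvings < 8;) {
            if (b & 1) { a *= 3; b = 3 * b + 1; ++triples; }
            else       { a /= 2; b /= 2; ++halvings; }
        }
        table[m] = {a, b, triples};
    }
    return table;
}

// One table-driven jump, equivalent to eight halvings plus the 3n+1 steps between them.
std::uint64_t jump(const std::array<Jump, 256>& table, std::uint64_t n) {
    const Jump& j = table[n % 256];
    return (n >> 8) * j.multiplier + j.adder;
}
```

The triples field is one way to recover the step count: each jump accounts for eight halvings plus that many 3n+1 operations.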
PPS. If you need the number of operations: In each iteration we do exactly eight divisions by two, and a variable number of (3n + 1) operations, so an obvious method to count the operations would be another array. But we can actually calculate the number of steps (based on number of iterations of the loop).
We could redefine the problem slightly: Replace n with (3n + 1) / 2 if odd, and replace n with n / 2 if even. Then every iteration will do exactly 8 steps, but you could consider that cheating :-) So assume there were r operations n <- 3n+1 and s operations n <- n/2. The result will be quite exactly n' = n * 3^r / 2^s, because n <- 3n+1 means n <- 3n * (1 + 1/3n). Taking the logarithm we find r = (s + log2 (n' / n)) / log2 (3).
If we do the loop until n ≤ 1,000,000 and have a precomputed table how many iterations are needed from any start point n ≤ 1,000,000 then calculating r as above, rounded to the nearest integer, will give the right result unless s is truly large.
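Spelled out, the logarithm step used above looks as follows. This is only a restatement of the argument in the text, where n_i (a symbol introduced here) denotes the value of n just before the i-th 3n+1 operation:

```latex
n' \;=\; n \cdot \frac{3^{r}}{2^{s}} \cdot \prod_{i=1}^{r}\left(1 + \frac{1}{3 n_i}\right)
\;\approx\; n \cdot \frac{3^{r}}{2^{s}}
\qquad\Longrightarrow\qquad
\log_2 \frac{n'}{n} \;\approx\; r \log_2 3 - s
\qquad\Longrightarrow\qquad
r \;\approx\; \frac{s + \log_2 (n'/n)}{\log_2 3}
```

The product of correction factors is close to 1 as long as the n_i stay large, which is why rounding r to the nearest integer works when the loop stops well above 1.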

On a rather unrelated note: more performance hacks!
[the first «conjecture» has been finally debunked by @ShreevatsaR; removed]
When traversing the sequence, we can only get 3 possible cases in the 2-neighborhood of the current element N (shown first):
[even] [odd]
[odd] [even]
[even] [even]
To leap past these 2 elements means to compute (N >> 1) + N + 1, ((N << 1) + N + 1) >> 1 and N >> 2, respectively.
Let's prove that for both cases (1) and (2) it is possible to use the first formula, (N >> 1) + N + 1.
Case (1) is obvious. Case (2) implies (N & 1) == 1, so if we assume (without loss of generality) that N is 2-bit long and its bits are ba from most- to least-significant, then a = 1, and the following holds:
(N << 1) + N + 1: (N >> 1) + N + 1:
b10 b1
b1 b
+ 1 + 1
---- ---
bBb0 bBb
where B = !b. Right-shifting the first result gives us exactly what we want.
Q.E.D.: (N & 1) == 1 ⇒ (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1.
As proven, we can traverse the sequence 2 elements at a time, using a single ternary operation. Another 2× time reduction.
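The identity can also be checked by brute force. The sketch below compares the single-formula leap against two ordinary Collatz steps; the function names step, leap and leap_matches_two_steps are made up for this illustration.

```cpp
#include <cstdint>

// One ordinary Collatz step.
std::uint64_t step(std::uint64_t n) { return (n & 1) ? 3 * n + 1 : n / 2; }

// Two steps at once: cases (1) and (2) share one formula, case (3) is n >> 2.
std::uint64_t leap(std::uint64_t n) {
    return (n & 3) ? (n >> 1) + n + 1 : (n >> 2);
}

// Returns true if leap(n) equals step(step(n)) for every n in [1, limit].
bool leap_matches_two_steps(std::uint64_t limit) {
    for (std::uint64_t n = 1; n <= limit; ++n)
        if (leap(n) != step(step(n))) return false;
    return true;
}
```

For example, leap(7) gives 11, matching 7 -> 22 -> 11.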
The resulting algorithm looks like this:
#include <stdint.h>
#include <stdio.h>
uint64_t sequence(uint64_t size, uint64_t *path) {
uint64_t n, i, c, maxi = 0, maxc = 0;
for (n = i = (size - 1) | 1; i > 2; n = i -= 2) {
c = 2;
while ((n = ((n & 3)? (n >> 1) + n + 1 : (n >> 2))) > 2)
c += 2;
if (n == 2)
c++;
if (c > maxc) {
maxi = i;
maxc = c;
}
}
*path = maxc;
return maxi;
}
int main() {
uint64_t maxi, maxc;
maxi = sequence(1000000, &maxc);
printf("%llu, %llu\n", (unsigned long long)maxi, (unsigned long long)maxc);
return 0;
}
Here we compare n > 2 because the process may stop at 2 instead of 1 if the total length of the sequence is odd.
[EDIT:]
Let's translate this into assembly!
MOV RCX, 1000000;
DEC RCX;
AND RCX, -2;
XOR RAX, RAX;
MOV RBX, RAX;
#main:
XOR RSI, RSI;
LEA RDI, [RCX + 1];
#loop:
ADD RSI, 2;
LEA RDX, [RDI + RDI*2 + 2];
SHR RDX, 1;
SHRD RDI, RDI, 2; ror rdi,2 would do the same thing
CMOVL RDI, RDX; Note that SHRD leaves OF = undefined with count>1, and this doesn't work on all CPUs.
CMOVS RDI, RDX;
CMP RDI, 2;
JA #loop;
LEA RDX, [RSI + 1];
CMOVE RSI, RDX;
CMP RAX, RSI;
CMOVB RAX, RSI;
CMOVB RBX, RCX;
SUB RCX, 2;
JA #main;
MOV RDI, RCX;
ADD RCX, 10;
PUSH RDI;
PUSH RCX;
#itoa:
XOR RDX, RDX;
DIV RCX;
ADD RDX, '0';
PUSH RDX;
TEST RAX, RAX;
JNE #itoa;
PUSH RCX;
LEA RAX, [RBX + 1];
TEST RBX, RBX;
MOV RBX, RDI;
JNE #itoa;
POP RCX;
INC RDI;
MOV RDX, RDI;
#outp:
MOV RSI, RSP;
MOV RAX, RDI;
SYSCALL;
POP RAX;
TEST RAX, RAX;
JNE #outp;
LEA RAX, [RDI + 59];
DEC RDI;
SYSCALL;
Use these commands to compile:
nasm -f elf64 file.asm
ld -o file file.o
See the C and an improved/bugfixed version of the asm by Peter Cordes on Godbolt. (editor's note: Sorry for putting my stuff in your answer, but my answer hit the 30k char limit from Godbolt links + text!)

C++ programs are translated to assembly programs during the generation of machine code from the source code. It would be virtually wrong to say assembly is slower than C++. Moreover, the binary code generated differs from compiler to compiler. So a smart C++ compiler may produce binary code more optimal and efficient than a dumb assembler's code.
However I believe your profiling methodology has certain flaws. The following are general guidelines for profiling:
Make sure your system is in its normal/idle state. Stop all running processes (applications) that you started or that use CPU intensively (or poll over the network).
Your data set must be larger.
Your test must run for something more than 5-10 seconds.
Do not rely on just one sample. Perform your test N times. Collect results and calculate the mean or median of the result.
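The last guideline can be sketched in a few lines of C++. This is a minimal example under assumptions of its own: the helper names median and median_runtime_ms are hypothetical, and std::chrono::steady_clock is chosen here as the timer.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Median of a non-empty set of samples (upper median for even sizes).
double median(std::vector<double> samples) {
    const auto mid = samples.begin() + samples.size() / 2;
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}

// Runs `work` `runs` times and returns the median wall-clock time in milliseconds.
template <typename F>
double median_runtime_ms(F&& work, int runs) {
    std::vector<double> samples;
    samples.reserve(runs);
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return median(samples);
}
```

The median is used rather than the mean so that a single run disturbed by another process does not skew the result.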

From comments:
But, this code never stops (because of integer overflow) !?! Yves Daoust
For many numbers it will not overflow.
If it does overflow for one of those unlucky initial seeds, the wrapped value will very likely converge toward 1 without another overflow.
Still, this poses an interesting question: is there some seed number that cycles through overflow?
Any simple final converging series starts with power of two value (obvious enough?).
2^64 will overflow to zero, which is an undefined infinite loop according to the algorithm (it ends only at 1), but the most optimal solution in the answers will finish because shr rax produces ZF=1.
Can we produce 2^64? If the starting number is 0x5555555555555555, it's odd number, next number is then 3n+1, which is 0xFFFFFFFFFFFFFFFF + 1 = 0. Theoretically in undefined state of algorithm, but the optimized answer of johnfound will recover by exiting on ZF=1. The cmp rax,1 of Peter Cordes will end in infinite loop (QED variant 1, "cheapo" through undefined 0 number).
How about some more complex number, which will create cycle without 0?
Frankly, I'm not sure, my Math theory is too hazy to get any serious idea, how to deal with it in serious way. But intuitively I would say the series will converge to 1 for every number : 0 < number, as the 3n+1 formula will slowly turn every non-2 prime factor of original number (or intermediate) into some power of 2, sooner or later. So we don't need to worry about infinite loop for original series, only overflow can hamper us.
So I just put few numbers into sheet and took a look on 8 bit truncated numbers.
There are three values overflowing to 0: 227, 170 and 85 (85 going directly to 0, other two progressing toward 85).
But there's no value creating cyclic overflow seed.
Funnily enough I did a check, which is the first number to suffer from 8 bit truncation, and already 27 is affected! It does reach value 9232 in proper non-truncated series (first truncated value is 322 in 12th step), and the maximum value reached for any of the 2-255 input numbers in non-truncated way is 13120 (for the 255 itself), maximum number of steps to converge to 1 is about 128 (+-2, not sure if "1" is to count, etc...).
Interestingly enough (for me) the number 9232 is maximum for many other source numbers, what's so special about it? :-O 9232 = 0x2410 ... hmmm.. no idea.
Unfortunately I can't get any deep grasp of this series, why it converges and what the implications of truncating it to k bits are, but with a cmp number,1 terminating condition it's certainly possible to put the algorithm into an infinite loop with a particular input value ending as 0 after truncation.
But the value 27 overflowing in the 8-bit case is somewhat alarming: it looks like if you count the number of steps to reach value 1, you will get the wrong result for the majority of numbers from the total k-bit set of integers. For 8-bit integers, 146 numbers out of 256 have their series affected by truncation (some of them may still hit the correct number of steps by accident, I'm too lazy to check).
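The 8-bit experiment described above is easy to reproduce. The following sketch is not the spreadsheet from the comment; the function names and the step cap are assumptions made for illustration.

```cpp
#include <cstdint>
#include <vector>

// Follows the sequence with every value truncated to 8 bits and reports whether
// it hits 0. The step cap is a safety net chosen for this sketch.
bool truncated_hits_zero(std::uint8_t start, int max_steps = 10000) {
    std::uint8_t n = start;
    for (int i = 0; i < max_steps && n != 1; ++i) {
        if (n == 0) return true;
        n = static_cast<std::uint8_t>((n & 1) ? 3 * n + 1 : n / 2);
    }
    return n == 0;
}

// True if the exact (untruncated) sequence from `start` ever exceeds 255,
// i.e. if 8-bit truncation would change the series.
bool exceeds_8_bits(std::uint64_t start) {
    for (std::uint64_t n = start; n > 1; n = (n & 1) ? 3 * n + 1 : n / 2)
        if (n > 255) return true;
    return false;
}

// All 8-bit seeds whose truncated series ends up at 0.
std::vector<int> seeds_hitting_zero() {
    std::vector<int> seeds;
    for (int s = 1; s < 256; ++s)
        if (truncated_hits_zero(static_cast<std::uint8_t>(s))) seeds.push_back(s);
    return seeds;
}
```

Counting the seeds for which exceeds_8_bits returns true is one way to re-check the "146 out of 256" figure quoted above.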

You did not post the code generated by the compiler, so there's some guesswork here, but even without having seen it, one can say that this:
test rax, 1
jpe even
... has a 50% chance of mispredicting the branch, and that will come expensive.
The compiler almost certainly does both computations (which costs negligibly more since the div/mod is quite long latency, so the multiply-add is "free") and follows up with a CMOV. Which, of course, has a zero percent chance of being mispredicted.

For the Collatz problem, you can get a significant boost in performance by caching the "tails". This is a time/memory trade-off. See: memoization
(https://en.wikipedia.org/wiki/Memoization). You could also look into dynamic programming solutions for other time/memory trade-offs.
Example python implementation:
import sys
inner_loop = 0
def collatz_sequence(N, cache):
    global inner_loop
    l = []
    stop = False
    n = N
    tails = []
    while not stop:
        inner_loop += 1
        tmp = n
        l.append(n)
        if n <= 1:
            stop = True
        elif n in cache:
            stop = True
        elif n % 2:
            n = 3*n + 1
        else:
            n = n // 2
        tails.append((tmp, len(l)))
    # Remember the tail that follows every element seen for the first time.
    for key, offset in tails:
        if key not in cache:
            cache[key] = l[offset:]
    return l
def gen_sequence(l, cache):
    # l already contains consecutive elements, so only its last element
    # needs to be continued from the cache.
    yield from l
    if l and cache.get(l[-1]):
        yield from gen_sequence(cache[l[-1]], cache)
    # No "raise StopIteration" here: inside a generator that is a
    # RuntimeError on Python 3.7+ (PEP 479); falling off the end is enough.
if __name__ == "__main__":
    le_cache = {}
    for n in range(1, 4711, 5):
        l = collatz_sequence(n, le_cache)
        print("{}: {}".format(n, len(list(gen_sequence(l, le_cache)))))
    print("inner_loop = {}".format(inner_loop))

As a generic answer, not specifically directed at this task: In many cases, you can significantly speed up any program by making improvements at a high level. Like calculating data once instead of multiple times, avoiding unnecessary work completely, using caches in the best way, and so on. These things are much easier to do in a high level language.
Writing assembler code, it is possible to improve on what an optimising compiler does, but it is hard work. And once it's done, your code is much harder to modify, so it is much more difficult to add algorithmic improvements. Sometimes the processor has functionality that you cannot use from a high level language; inline assembly is often useful in these cases and still lets you use a high level language.
In the Euler problems, most of the time you succeed by building something, finding why it is slow, building something better, finding why it is slow, and so on and so on. That is very, very hard using assembler. A better algorithm at half the possible speed will usually beat a worse algorithm at full speed, and getting the full speed in assembler isn't trivial.

Even without looking at assembly, the most obvious reason is that /= 2 is probably optimized as >>=1 and many processors have a very quick shift operation. But even if a processor doesn't have a shift operation, the integer division is faster than floating point division.
Edit: your mileage may vary on the "integer division is faster than floating point division" statement above. The comments below reveal that modern processors have prioritized optimizing fp division over integer division. So if someone were looking for the most likely reason for the speedup which this thread's question asks about, then the compiler optimizing /=2 as >>=1 would be the best first place to look.
On an unrelated note, if n is odd, the expression n*3+1 will always be even. So there is no need to check. You can change that branch to
{
n = (n*3+1) >> 1;
count += 2;
}
So the whole statement would then be
if (n & 1)
{
n = (n*3 + 1) >> 1;
count += 2;
}
else
{
n >>= 1;
++count;
}
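To convince yourself that the shortcut counts the same number of steps, the two variants can be compared directly. This is a small sketch; the function names steps_plain and steps_shortcut are invented for the comparison, and both assume n >= 1.

```cpp
#include <cstdint>

// Plain version: one Collatz step per iteration.
unsigned steps_plain(std::uint64_t n) {
    unsigned count = 0;
    while (n != 1) {
        n = (n & 1) ? 3 * n + 1 : n >> 1;
        ++count;
    }
    return count;
}

// Shortcut version: an odd n is followed by 3n+1 and the guaranteed halving at once.
unsigned steps_shortcut(std::uint64_t n) {
    unsigned count = 0;
    while (n != 1) {
        if (n & 1) {
            n = (3 * n + 1) >> 1;
            count += 2;
        } else {
            n >>= 1;
            ++count;
        }
    }
    return count;
}
```

The combined step cannot jump over 1, because for odd n > 1 the intermediate value 3n+1 is at least 10.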

The simple answer:
doing a MOV RBX, 3 and MUL RBX is expensive; keep a copy of the value and ADD it twice, or use a single LEA RAX, [RAX + RAX*2]
ADD 1 is probably faster than INC here
MOV 2 and DIV is very expensive; just shift right
64-bit code is usually noticeably slower than 32-bit code and the alignment issues are more complicated; with small programs like this you have to pack them so you are doing parallel computation to have any chance of being faster than 32-bit code
If you generate the assembly listing for your C++ program, you can see how it differs from your assembly.

Using nested vectors vs a flatten vector wrapper, strange behaviour

The problem
For a long time I had the impression that using a nested std::vector<std::vector...> for simulating an N-dimensional array is in general bad, since the memory is not guaranteed to be contiguous, and one may have cache misses. I thought it's better to use a flat vector and map from multiple dimensions to 1D and vice versa. So, I decided to test it (code listed at the end). It is pretty straightforward: I timed reading/writing to a nested 3D vector vs my own 3D wrapper of a 1D vector. I compiled the code with both g++ and clang++, with -O3 optimization turned on. For each run I changed the dimensions, so I can get a pretty good idea about the behaviour. To my surprise, these are the results I obtained on my machine, a MacBook Pro (Retina, 13-inch, Late 2012), 2.5GHz i5, 8GB RAM, OS X 10.10.5:
g++ 5.2
dimensions nested flat
X Y Z (ms) (ms)
100 100 100 -> 16 24
150 150 150 -> 58 98
200 200 200 -> 136 308
250 250 250 -> 264 746
300 300 300 -> 440 1537
clang++ (LLVM 7.0.0)
dimensions nested flat
X Y Z (ms) (ms)
100 100 100 -> 16 18
150 150 150 -> 53 61
200 200 200 -> 135 137
250 250 250 -> 255 271
300 300 300 -> 423 477
As you can see, the "flatten" wrapper is never beating the nested version. Moreover, g++'s libstdc++ implementation performs quite badly compared to libc++ implementation, for example for 300 x 300 x 300 the flatten version is almost 4 times slower than the nested version. libc++ seems to have equal performance.
My questions:
Why isn't the flatten version faster? Shouldn't it be? Am I missing something in the testing code?
Moreover, why does g++'s libstdc++ perform so badly when using flatten vectors? Again, shouldn't it perform better?
The code I used:
#include <chrono>
#include <cstddef>
#include <iostream>
#include <memory>
#include <random>
#include <string>
#include <vector>
// Thin wrapper around flatten vector
template<typename T>
class Array3D
{
std::size_t _X, _Y, _Z;
std::vector<T> _vec;
public:
Array3D(std::size_t X, std::size_t Y, std::size_t Z):
_X(X), _Y(Y), _Z(Z), _vec(_X * _Y * _Z) {}
T& operator()(std::size_t x, std::size_t y, std::size_t z)
{
return _vec[z * (_X * _Y) + y * _X + x];
}
const T& operator()(std::size_t x, std::size_t y, std::size_t z) const
{
return _vec[z * (_X * _Y) + y * _X + x];
}
};
int main(int argc, char** argv)
{
std::random_device rd{};
std::mt19937 rng{rd()};
std::uniform_real_distribution<double> urd(-1, 1);
const std::size_t X = std::stol(argv[1]);
const std::size_t Y = std::stol(argv[2]);
const std::size_t Z = std::stol(argv[3]);
// Standard library nested vector
std::vector<std::vector<std::vector<double>>>
vec3D(X, std::vector<std::vector<double>>(Y, std::vector<double>(Z)));
// 3D wrapper around a 1D flat vector
Array3D<double> vec1D(X, Y, Z);
// TIMING nested vectors
std::cout << "Timing nested vectors...\n";
auto start = std::chrono::steady_clock::now();
volatile double tmp1 = 0;
for (std::size_t x = 0 ; x < X; ++x)
{
for (std::size_t y = 0 ; y < Y; ++y)
{
for (std::size_t z = 0 ; z < Z; ++z)
{
vec3D[x][y][z] = urd(rng);
tmp1 += vec3D[x][y][z];
}
}
}
std::cout << "\tSum: " << tmp1 << std::endl; // we make sure the loops are not optimized out
auto end = std::chrono::steady_clock::now();
std::cout << "Took: ";
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << ms << " milliseconds\n";
// TIMING flatten vector
std::cout << "Timing flatten vector...\n";
start = std::chrono::steady_clock::now();
volatile double tmp2 = 0;
for (std::size_t x = 0 ; x < X; ++x)
{
for (std::size_t y = 0 ; y < Y; ++y)
{
for (std::size_t z = 0 ; z < Z; ++z)
{
vec1D(x, y, z) = urd(rng);
tmp2 += vec1D(x, y, z);
}
}
}
std::cout << "\tSum: " << tmp2 << std::endl; // we make sure the loops are not optimized out
end = std::chrono::steady_clock::now();
std::cout << "Took: ";
ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << ms << " milliseconds\n";
}
EDIT
Changing the Array3D<T>::operator() return to
return _vec[(x * _Y + y) * _Z + z];
as per @1201ProgramAlarm's suggestion does indeed get rid of the "weird" behaviour of g++, in the sense that the flat and nested versions now take roughly the same time. However, it's still intriguing. I thought the nested one would be much worse due to cache issues. Am I just lucky to have all the memory contiguously allocated?
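The difference between the two index mappings can be made explicit with a small sketch. The helper names Extents, flat_index and flat_index_original are hypothetical; they only mirror the two formulas shown in this question.

```cpp
#include <cstddef>

// Hypothetical stand-in for the dimensions stored in Array3D.
struct Extents { std::size_t X, Y, Z; };

// Corrected mapping: z is the fastest-varying index, so the innermost
// z loop touches consecutive elements.
constexpr std::size_t flat_index(const Extents& e, std::size_t x, std::size_t y,
                                 std::size_t z) {
    return (x * e.Y + y) * e.Z + z;
}

// Original mapping from the question: x is fastest-varying, so stepping z
// jumps X*Y elements at a time.
constexpr std::size_t flat_index_original(const Extents& e, std::size_t x,
                                          std::size_t y, std::size_t z) {
    return z * (e.X * e.Y) + y * e.X + x;
}
```

With the loops ordered x, y, z as in the test program, the corrected mapping walks memory with stride 1, while the original one strides by X*Y doubles on every inner iteration.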

Why the nested vectors are about the same speed as flat in your microbenchmark, after fixing the indexing order: You'd expect the flat array to be faster (see Tobias's answer about potential locality problems, and my other answer for why nested vectors suck in general, but not too badly for sequential access). But your specific test is doing so many things that let out-of-order execution hide the overhead of using nested vectors, and/or that just slow things down so much that the extra overhead is lost in measurement noise.
I put your performance-bugfixed source code up on Godbolt so we can look at the asm of the inner loop as compiled by g++5.2, with -O3. (Apple's fork of clang might be similar to clang3.7, but I'll just look at the gcc version.) There's a lot of code from C++ functions, but you can right-click on a source line to scroll the asm windows to the code for that line. Also, mouseover a source line to bold the asm that implements that line, or vice versa.
gcc's inner two loops for the nested version are as follows (with some comments added by hand):
## outer-most loop not shown
.L213: ## middle loop (over `y`)
test rbp, rbp # Z
je .L127 # inner loop runs zero times if Z==0
mov rax, QWORD PTR [rsp+80] # MEM[(struct vector * *)&vec3D], MEM[(struct vector * *)&vec3D]
xor r15d, r15d # z = 0
mov rax, QWORD PTR [rax+r12] # MEM[(struct vector * *)_195], MEM[(struct vector * *)_195]
mov rdx, QWORD PTR [rax+rbx] # D.103857, MEM[(double * *)_38]
## Top of inner-most loop.
.L128:
lea rdi, [rsp+5328] # tmp511, ## function arg: pointer to the RNG object, which is a local on the stack.
lea r14, [rdx+r15*8] # D.103851, ## r14 = &(vec3D[x][y][z])
call double std::generate_canonical<double, 53ul, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul> >(std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&) #
addsd xmm0, xmm0 # D.103853, D.103853 ## return val *= 2.0: [0.0, 2.0]
mov rdx, QWORD PTR [rsp+80] # MEM[(struct vector * *)&vec3D], MEM[(struct vector * *)&vec3D] ## redo the pointer-chasing from vec3D.data()
mov rdx, QWORD PTR [rdx+r12] # MEM[(struct vector * *)_150], MEM[(struct vector * *)_150]
subsd xmm0, QWORD PTR .LC6[rip] # D.103859, ## and subtract 1.0: [-1.0, 1.0]
mov rdx, QWORD PTR [rdx+rbx] # D.103857, MEM[(double * *)_27]
movsd QWORD PTR [r14], xmm0 # *_155, D.103859 # store into vec3D[x][y][z]
movsd xmm0, QWORD PTR [rsp+64] # D.103853, tmp1 # reload volatile tmp1
addsd xmm0, QWORD PTR [rdx+r15*8] # D.103853, *_62 # add the value just stored into the array (r14 = rdx+r15*8 because nothing else modifies the pointers in the outer vectors)
add r15, 1 # z,
cmp rbp, r15 # Z, z
movsd QWORD PTR [rsp+64], xmm0 # tmp1, D.103853 # spill tmp1
jne .L128 #,
#End of inner-most loop
.L127: ## middle-loop
add r13, 1 # y,
add rbx, 24 # sizeof(std::vector<> == 24) == the size of 3 pointers.
cmp QWORD PTR [rsp+8], r13 # %sfp, y
jne .L213 #,
## outer loop not shown.
And for the flat loop:
## outer not shown.
.L214:
test rbp, rbp # Z
je .L135 #,
mov rax, QWORD PTR [rsp+280] # D.103849, vec1D._Y
mov rdi, QWORD PTR [rsp+288] # D.103849, vec1D._Z
xor r15d, r15d # z
mov rsi, QWORD PTR [rsp+296] # D.103857, MEM[(double * *)&vec1D + 24B]
.L136: ## inner-most loop
imul rax, r12 # D.103849, x
lea rax, [rax+rbx] # D.103849,
imul rax, rdi # D.103849, D.103849
lea rdi, [rsp+5328] # tmp520,
add rax, r15 # D.103849, z
lea r14, [rsi+rax*8] # D.103851, # &vec1D(x,y,z)
call double std::generate_canonical<double, 53ul, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul> >(std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&) #
mov rax, QWORD PTR [rsp+280] # D.103849, vec1D._Y
addsd xmm0, xmm0 # D.103853, D.103853
mov rdi, QWORD PTR [rsp+288] # D.103849, vec1D._Z
mov rsi, QWORD PTR [rsp+296] # D.103857, MEM[(double * *)&vec1D + 24B]
mov rdx, rax # D.103849, D.103849
imul rdx, r12 # D.103849, x # redo address calculation a 2nd time per iteration
subsd xmm0, QWORD PTR .LC6[rip] # D.103859,
add rdx, rbx # D.103849, y
imul rdx, rdi # D.103849, D.103849
movsd QWORD PTR [r14], xmm0 # MEM[(double &)_181], D.103859 # store into the address calculated earlier
movsd xmm0, QWORD PTR [rsp+72] # D.103853, tmp2
add rdx, r15 # tmp374, z
add r15, 1 # z,
addsd xmm0, QWORD PTR [rsi+rdx*8] # D.103853, MEM[(double &)_170] # tmp2 += vec1D(x,y,z). rsi+rdx*8 == r14, so this is a reload of the store this iteration.
cmp rbp, r15 # Z, z
movsd QWORD PTR [rsp+72], xmm0 # tmp2, D.103853
jne .L136 #,
.L135: ## middle loop: increment y
add rbx, 1 # y,
cmp r13, rbx # Y, y
jne .L214 #,
## outer loop not shown.
Your MacBook Pro (Late 2012) has an Intel IvyBridge CPU, so I'm using numbers for that microarchitecture from Agner Fog's instruction tables and microarch guide. Things should be mostly the same on other Intel/AMD CPUs.
The only 2.5GHz mobile IvB i5 is the i5-3210M, so your CPU has 3MiB of L3 cache. This means even your smallest test case (100^3 * 8B per double ~= 7.63MiB) is larger than your last-level cache, so none of your test cases fit in cache at all. That's probably a good thing, because you allocate and default-initialize both nested and flat before testing either of them. However, you do test in the same order you allocate, so if the nested array is still in cache after zeroing the flat array, the flat array may still be hot in L3 cache after the timing loop over the nested array.
If you'd used a repeat-loop to loop over the same array multiple times, you could have got times large enough to measure for smaller array sizes.
You're doing several things here that are super-weird and make this so slow that out-of-order execution can hide the extra latency of changing y, even if your inner z vectors are not perfectly contiguous.
You run a slow PRNG inside the timed loop. std::uniform_real_distribution<double> urd(-1, 1); is extra overhead on top of std::mt19937 rng{rd()};, which is already slow compared to FP-add latency (3 cycles), or compared to the L1D cache load throughput of 2 per cycle. All this extra time running the PRNG gives out-of-order execution a chance to run the array-indexing instructions so the final address is ready by the time the data is. Unless you have a lot of cache misses, you're mostly just measuring PRNG speed, because it produces results much slower than 1 per clock cycle.
g++5.2 doesn't fully inline the urd(rng) code, and the x86-64 System V calling convention has no call-preserved XMM registers. So tmp1/tmp2 have to be spilled/reloaded for every element, even if they weren't volatile.
It also loses its place in the Z vector, and has to redo the outer 2 levels of indirection before accessing the next z element. This is because it doesn't know about the internals of the function it's calling, and assumes that it might have a pointer to the outer vector<>'s memory. (The flat version does two multiplies for indexing in the inner loop, instead of a simple pointer-add.)
clang (with libc++) does fully inline the PRNG, so moving to the next z is just add reg, 8 to increment a pointer in both the flat and nested versions. You could get the same behaviour from gcc by getting an iterator outside the inner loop, or getting a reference to the inner vector, instead of redoing operator[] and hoping the compiler will hoist it for you.
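Taking a reference to the inner vector outside the inner loop can be sketched as follows. The alias Nested3D and the function fill_and_sum are hypothetical, and a constant stands in for the PRNG call so that the example stays self-contained.

```cpp
#include <cstddef>
#include <vector>

using Nested3D = std::vector<std::vector<std::vector<double>>>;

// Stores `value` into every element and returns the sum of what was stored.
// Each level of indirection is resolved once, outside the loop below it.
double fill_and_sum(Nested3D& vec3D, double value) {
    double sum = 0;
    for (auto& plane : vec3D) {
        for (auto& row : plane) {       // reference to the inner vector, taken once
            for (double& elem : row) {  // walks the contiguous doubles of one row
                elem = value;
                sum += elem;
            }
        }
    }
    return sum;
}
```

Written this way, the compiler does not have to prove that vec3D[x][y] is unchanged across iterations, because the source no longer asks for it again.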
Intel/AMD FP add/sub/mul throughput/latency is not data-dependent, except for denormals. (x87 also slows down for NaN and maybe infinity, but SSE doesn't. 64-bit code uses SSE even for scalar float/double.) So you could have just initialized your array with zeros, or with a PRNG outside the timing loop. (Or left them zeroed, since the vector<double> constructor does that for you, and it actually takes extra code to get it not to in cases where you're going to write something else.) Division and sqrt performance is data-dependent on some CPUs, and much slower than add/sub/mul.
You write every element right before you read it, inside the inner loop. In the source, this looks like a store/reload. That's what gcc actually does, unfortunately, but clang with libc++ (which inlines the PRNG) transforms the loop body:
// original
vec3D[x][y][z] = urd(rng);
tmp1 += vec3D[x][y][z];
// what clang's asm really does
double xmm7 = urd(rng);
vec3D[x][y][z] = xmm7;
tmp1 += xmm7;
In clang's asm:
# do { ...
addsd xmm7, xmm4 # last instruction of the PRNG
movsd qword ptr [r8], xmm7 # store it into the Z vector
addsd xmm7, qword ptr [rsp + 88]
add r8, 8 # pointer-increment to walk along the Z vector
dec r13 # i--
movsd qword ptr [rsp + 88], xmm7
jne .LBB0_74 # }while(i != 0);
It's allowed to do this because vec3D isn't volatile or atomic<>, so it would be undefined behaviour for any other thread to be writing this memory at the same time. This means it can optimize away store/reloads to objects in memory into just a store (and simply use the value it stored, without reloading). Or optimize the store away entirely if it can prove it's a dead store (a store that nothing can ever read, e.g. to an unused static variable).
In gcc's version, it does the indexing for the store before the PRNG call, and the indexing for the reload after. So I think gcc isn't sure that the function call doesn't modify a pointer, because pointers to the outer vectors have escaped the function. (And the PRNG doesn't inline).
However, even an actual store/reload in the asm is still less sensitive to cache-misses than a simple load!
Store->load forwarding still works even if the store misses in cache. So a cache-miss in a Z vector doesn't directly delay the critical path. It only slows you down if out-of-order execution can't hide the latency of the cache miss. (A store can retire as soon as the data is written to the store-buffer (and all previous instructions have retired). I'm not sure if a load can retire before the cache-line even makes it to L1D, if it got its data from store-forwarding. It may be able to, because x86 does allow StoreLoad reordering (stores are allowed to become globally visible after loads).) In that case, a store/reload only adds 6 cycles of latency for the PRNG result (off the critical path from one PRNG state to next PRNG state). And it's only a throughput bottleneck if it cache-misses so much that the store-buffer fills up and prevents new store uops from executing, which in turn eventually prevents new uops from issuing into the out-of-order core when the Reservation Station or ROB fills up with un-executed or un-retired (respectively) uops.
With reversed indexing (original version of the flat code), probably the main bottleneck was the scattered stores. IDK why clang did so much better than gcc there. Maybe clang managed to invert a loop and traverse memory in sequential order after all. (Since it fully inlined the PRNG, there were no function calls that would require the state of memory to match program-order.)
Traversing each Z vector in-order means that cache misses are relatively far-between (even if each Z vector is not contiguous with the previous one), giving lots of time for the stores to execute. Or even if a store-forwarded load can't actually retire until the L1D cache actually owns the cache line (in Modified state of the MESI protocol), speculative execution has the correct data and didn't have to wait for the latency of the cache-miss. The out-of-order instruction window is probably big enough to keep the critical path from stalling before the load can retire. (Cache miss loads are normally really bad, because dependent instructions can't be executed without data for them to operate on. So they much more easily create bubbles in the pipeline. With a full cache-miss from DRAM having a latency of over 300 cycles, and the out-of-order window being 168 uops on IvB, it can't hide all of the latency for code executing at even 1 uop (approximately 1 instruction) per clock.) For pure stores, the out-of-order window extends beyond the ROB size, because they don't need to commit to L1D to retire. In fact, they can't commit until after they retire, because that's the point at which they're known to be non-speculative. (So making them globally-visible earlier than that would prevent roll-back on detection of an exception or mis-speculation.)
I don't have libc++ installed on my desktop, so I can't benchmark that version against g++. With g++5.4, I'm finding Nested: 225 milliseconds and Flat: 239 milliseconds. I suspect that the extra array-indexing multiplies are a problem, and compete with ALU instructions the PRNG uses. In contrast, the nested version redoing a bunch of pointer-chasing that hits in L1D cache can happen in parallel. My desktop is a Skylake i7-6700k at 4.4GHz. SKL has a ROB (ReOrder Buffer) size of 224 uops, and RS of 97 uops, so the out-of-order window is very large. It also has FP-add latency of 4 cycles (unlike previous uarches where it was 3).
volatile double tmp1 = 0;
Your accumulator is volatile, which forces the compiler to store/reload it every iteration of the inner loop. The total latency of the loop-carried dependency chain in the inner loop is 9 cycles: 3 for addsd, and 6 for store-forwarding from the movsd store to the movsd reload. (clang folds the reload into a memory operand with addsd xmm7, qword ptr [rsp + 88], but same difference. [rsp+88] is on the stack, where variables with automatic storage are stored if they need to be spilled from registers.)
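To look at the cost of the volatile accumulator in isolation, a minimal sketch along these lines could be used. The names sum_volatile and sum_plain are made up for illustration and are not from the question's code; both return the same sum, and only the accumulator's storage differs.

```cpp
#include <cstddef>
#include <vector>

// Accumulator is volatile: every iteration must reload the old value from
// memory and store the new one back, so store-forwarding latency ends up
// on the loop-carried dependency chain.
double sum_volatile(const std::vector<double>& a) {
    volatile double tmp = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        tmp = tmp + a[i];  // explicit load, add, store
    return tmp;
}

// Plain accumulator: the compiler is free to keep it in a register, so the
// dependency chain is just the FP add.
double sum_plain(const std::vector<double>& a) {
    double tmp = 0;
    for (double v : a)
        tmp += v;
    return tmp;
}
```

Timing both on the same large vector (and printing the results so nothing is optimized away) should show how much of the original loop's time was the store/reload rather than the memory layout.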
As noted above, the non-inline function call for gcc will also force a spill/reload in the x86-64 System V calling convention (used by everything but Windows). But a smart compiler could have done 4 PRNG calls, for example, and then 4 array stores. (If you'd used an iterator to make sure gcc knew the vectors holding other vectors weren't changing.)
Using -ffast-math would have let the compiler auto-vectorize (if not for the PRNG and volatile). This would let you run through the arrays fast enough that lack of locality between different Z vectors could be a real problem. It would also let compilers unroll with multiple accumulators, to hide the FP-add latency. e.g. they could (and clang would) make asm equivalent to:
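float t0 = 0, t1 = 0, t2 = 0, t3 = 0;
for (size_t i = 0; i < n; i += 4) { // assumes n is a multiple of 4
t0 += a[i + 0];
t1 += a[i + 1];
t2 += a[i + 2];
t3 += a[i + 3];
}
t0 = (t0 + t1) + (t2 + t3); // combine the partial sums once, after the loop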
That has 4 separate dependency chains, so it can keep 4 FP adds in flight. Since IvB has 3 cycle latency and one per clock throughput for addsd, 3 adds in flight are already enough to saturate its throughput, so 4 accumulators cover it. (Skylake has 4c latency, 2 per clock throughput, same as mul or FMA, so you need 8 accumulators to avoid latency bottlenecks. Actually, even more is better. As testing by the asker of that question showed, Haswell did better with even more accumulators when coming close to maxing out load throughput.)
Something like that would be a much better test of how efficient it is to loop over an Array3D. If you want to stop the loop from being optimized away entirely, just use the result. Test your microbenchmark to make sure that increasing the problem size scales the time; if not then something got optimized away, or you're not testing what you think you're testing. Don't make the inner-loop temporary volatile!!
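As a rough sketch of that advice (not the asker's code): sum4 and time_sum below are hypothetical helper names. The sum uses four accumulators, the result is returned so the loop can't be discarded, and the timing helper can be called with different sizes to check that the time scales.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Sum with four independent accumulators so four FP adds can be in flight.
double sum4(const std::vector<double>& a) {
    double t0 = 0, t1 = 0, t2 = 0, t3 = 0;
    const std::size_t n4 = a.size() - a.size() % 4;
    std::size_t i = 0;
    for (; i < n4; i += 4) {
        t0 += a[i + 0];
        t1 += a[i + 1];
        t2 += a[i + 2];
        t3 += a[i + 3];
    }
    for (; i < a.size(); ++i)  // leftover elements when size is not a multiple of 4
        t0 += a[i];
    return (t0 + t1) + (t2 + t3);
}

struct Timed {
    double result;  // returned so the caller can print it (keeps the loop alive)
    double ms;      // elapsed wall-clock time in milliseconds
};

// Time one pass over n elements that are all 1.0.
Timed time_sum(std::size_t n) {
    std::vector<double> a(n, 1.0);
    auto start = std::chrono::steady_clock::now();
    double r = sum4(a);
    auto end = std::chrono::steady_clock::now();
    return {r, std::chrono::duration<double, std::milli>(end - start).count()};
}
```

Calling time_sum with, say, n and 10*n and printing both result and ms gives a quick sanity check: if the time doesn't grow roughly with n, something was optimized away or the test isn't measuring what you think.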
Writing microbenchmarks is not easy. You have to understand enough to write one that tests what you think you're testing. :P This is a good example of how easy it is to go wrong.
May I just be lucky and have all the memory contiguously allocated?
Yes, that probably happens for many small allocations done in-order, when you haven't allocated and freed anything before doing this. If they were large enough (usually one 4kiB page or larger), glibc malloc would switch to using mmap(MAP_ANONYMOUS) and then the kernel would choose randomized virtual addresses (ASLR). So with larger Z, you might expect locality to get worse. But on the other hand, larger Z vectors mean you spend more of your time looping over one contiguous vector so a cache miss when changing y (and x) becomes relatively less important.
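One way to check this on a given system is to look at where the inner vectors' buffers actually landed. The helper below is only a sketch (count_adjacent and max_gap_bytes are names invented here); it counts how many inner buffers start shortly after the end of the previous one. The count depends entirely on the allocator, so no particular result should be expected.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Count inner vectors whose buffer begins at or after the end of the previous
// vector's buffer, with at most max_gap_bytes in between (room for malloc
// bookkeeping and alignment padding).
std::size_t count_adjacent(const std::vector<std::vector<double>>& rows,
                           std::size_t max_gap_bytes) {
    std::size_t adjacent = 0;
    for (std::size_t i = 1; i < rows.size(); ++i) {
        const auto prev = reinterpret_cast<std::uintptr_t>(rows[i - 1].data());
        const auto cur = reinterpret_cast<std::uintptr_t>(rows[i].data());
        const std::uintptr_t prev_end = prev + rows[i - 1].size() * sizeof(double);
        if (cur >= prev_end && cur - prev_end <= max_gap_bytes)
            ++adjacent;
    }
    return adjacent;
}
```

If nearly all of the inner vectors count as adjacent, the nested layout is effectively contiguous in that run, which would explain timings close to the flat version.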
Looping sequentially over your data apparently doesn't expose this, because the extra pointer accesses hit in cache, so the pointer-chasing has low enough latency for OOO execution to hide it with your slow loop.
Prefetch has a really easy time keeping up here.
Different compilers / libraries can make a big difference with this weird test. On my system (Arch Linux, i7-6700k Skylake with 4.4GHz max turbo), the best of 4 runs at 300 300 300 for g++5.4 -O3 was:
Timing nested vectors...
Sum: 579.78
Took: 225 milliseconds
Timing flatten vector...
Sum: 579.78
Took: 239 milliseconds
Performance counter stats for './array3D-gcc54 300 300 300':
532.066374 task-clock (msec) # 1.000 CPUs utilized
2 context-switches # 0.004 K/sec
0 cpu-migrations # 0.000 K/sec
54,523 page-faults # 0.102 M/sec
2,330,334,633 cycles # 4.380 GHz
7,162,855,480 instructions # 3.07 insn per cycle
632,509,527 branches # 1188.779 M/sec
756,486 branch-misses # 0.12% of all branches
0.532233632 seconds time elapsed
vs. g++7.1 -O3 (which apparently decided to branch on something that g++5.4 didn't)
Timing nested vectors...
Sum: 932.159
Took: 363 milliseconds
Timing flatten vector...
Sum: 932.159
Took: 378 milliseconds
Performance counter stats for './array3D-gcc71 300 300 300':
810.911200 task-clock (msec) # 1.000 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
54,523 page-faults # 0.067 M/sec
3,546,467,563 cycles # 4.373 GHz
7,107,511,057 instructions # 2.00 insn per cycle
794,124,850 branches # 979.299 M/sec
55,074,134 branch-misses # 6.94% of all branches
0.811067686 seconds time elapsed
vs. clang4.0 -O3 (with gcc's libstdc++, not libc++)
perf stat ./array3D-clang40-libstdc++ 300 300 300
Timing nested vectors...
Sum: -349.786
Took: 1657 milliseconds
Timing flatten vector...
Sum: -349.786
Took: 1631 milliseconds
Performance counter stats for './array3D-clang40-libstdc++ 300 300 300':
3358.297093 task-clock (msec) # 1.000 CPUs utilized
9 context-switches # 0.003 K/sec
0 cpu-migrations # 0.000 K/sec
54,521 page-faults # 0.016 M/sec
14,679,919,916 cycles # 4.371 GHz
12,917,363,173 instructions # 0.88 insn per cycle
1,658,618,144 branches # 493.887 M/sec
916,195 branch-misses # 0.06% of all branches
3.358518335 seconds time elapsed
I didn't dig into what clang did wrong, or try with -ffast-math and/or -march=native. (Those won't do much unless you remove volatile, though.)
perf stat -d doesn't show more cache misses (L1 or last-level) for clang than gcc. But it does show clang is doing more than twice as many L1D loads.
I did try with a non-square array. It's almost exactly the same time keeping total element count the same but changing the final dimension to 5 or 6.
Even a minor change to the C++ helps, and makes "flatten" faster than nested with gcc (from 240ms down to 220ms for 300^3, while barely making any difference for nested):
// vec1D(x, y, z) = urd(rng);
double res = urd(rng);
vec1D(x, y, z) = res; // indexing calculation only done once, after the function call
tmp2 += vec1D(x, y, z);
// using iterators would still avoid redoing it at all.

It is because of how you're ordering your indexes in the 3D class. Since your innermost loop is changing z, that's the largest part of your index so you get a lot of cache misses. Rearrange your indexing to
_vec[(x * _Y + y) * _Z + z]
and you should see better performance.
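To make the difference concrete, here is a small sketch comparing the two index formulas. The function names are made up here, and 300 is just the size used elsewhere on this page. With z as the fastest-varying index, stepping z moves one element; with the original ordering it moves X*Y elements.

```cpp
#include <cstddef>

// Suggested layout: z is the fastest-varying index.
constexpr std::size_t index_z_fastest(std::size_t x, std::size_t y, std::size_t z,
                                      std::size_t Y, std::size_t Z) {
    return (x * Y + y) * Z + z;
}

// Original layout from the question: x is the fastest-varying index.
constexpr std::size_t index_x_fastest(std::size_t x, std::size_t y, std::size_t z,
                                      std::size_t X, std::size_t Y) {
    return z * (X * Y) + y * X + x;
}

// Stride, in elements, between consecutive z values for a 300^3 array.
static_assert(index_z_fastest(0, 0, 1, 300, 300) -
              index_z_fastest(0, 0, 0, 300, 300) == 1,
              "z-fastest: neighbouring z are adjacent in memory");
static_assert(index_x_fastest(0, 0, 1, 300, 300) -
              index_x_fastest(0, 0, 0, 300, 300) == 300 * 300,
              "x-fastest: neighbouring z are X*Y elements apart");
```

With 8-byte doubles, a stride of 90000 elements is 720000 bytes per inner-loop step, so every access in the original ordering touches a different cache line.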

Reading over the other answers I'm not really satisfied with the accuracy and level of detail of the answers, so I'll attempt an explanation myself:
The main issue here is not indirection but a question of spatial locality:
There are basically two things that make caching especially effective:
Temporal locality, which means that a memory word that has been accessed recently will likely be accessed again in the near future. This might for example happen at nodes near the root of a binary search tree that is accessed frequently.
Spatial locality, which means that if a memory word has been accessed, it is likely that the memory words before or after this word will soon be accessed as well. This happens in our case, for nested and flattened arrays.
To assess the impact that indirection and cache effects might have on this problem, let's just assume that we have X = Y = Z = 1024
Judging from this question, a single cache line (L1, L2 or L3) is 64 bytes long, that means 8 double values. Let's assume the L1 cache has 32 kB (4096 doubles), the L2 cache has 256 kB (32k doubles) and the L3 cache has 8 MB (1M doubles).
This means that - assuming the cache is filled with no other data (which is a bold guess, I know) - in the flattened case only every 4th value of y leads to an L1 cache miss (the L2 cache latency is probably around 10-20 cycles), only every 32nd value of y leads to an L2 cache miss (the L3 cache latency is some value lower than 100 cycles), and only in case of an L3 cache miss do we actually have to access main memory. I don't want to open up the whole calculation here, since taking the whole cache hierarchy into account makes it a bit more difficult, but let's just say that almost all accesses to memory can be cached in the flattened case.
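The figures above (8 doubles per line, every 4th and every 32nd value of y) follow from simple division. The sketch below only reproduces that arithmetic under the same assumed cache sizes and Z = 1024; the constant and function names are invented here, and it is not a model of how a real cache behaves.

```cpp
#include <cstddef>

// Assumed sizes, taken from the text above.
constexpr std::size_t kLineBytes = 64;
constexpr std::size_t kL1Bytes = 32 * 1024;
constexpr std::size_t kL2Bytes = 256 * 1024;
constexpr std::size_t kL3Bytes = 8 * 1024 * 1024;
constexpr std::size_t kZ = 1024;  // length of one z-row, in doubles

// How many doubles fit in a region of the given size.
constexpr std::size_t doubles_in(std::size_t bytes) { return bytes / sizeof(double); }

// How many complete z-rows (kZ doubles each) fit in a region of the given size.
constexpr std::size_t rows_in(std::size_t bytes) { return doubles_in(bytes) / kZ; }

static_assert(sizeof(double) == 8, "the arithmetic assumes 8-byte doubles");
static_assert(doubles_in(kLineBytes) == 8, "8 doubles per cache line");
static_assert(doubles_in(kL1Bytes) == 4096, "L1 holds 4096 doubles");
static_assert(rows_in(kL1Bytes) == 4, "L1 holds 4 z-rows: every 4th y");
static_assert(rows_in(kL2Bytes) == 32, "L2 holds 32 z-rows: every 32nd y");
static_assert(rows_in(kL3Bytes) == 1024, "L3 holds 1024 z-rows");
```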
In the original formulation of this question, the flattened index was computed differently (z * (_X * _Y) + y * _X + x). There, an increase of the value that changes in the innermost loop (z) always means a jump of _X * _Y * 64 bits, which gives a much less local memory access pattern and increased cache misses by a large amount.
In the nested case, the answer depends a lot on the value of Z:
If Z is rather large, most of the memory accesses are contiguous, since they refer to the entries of a single vector in the nested vector<vector<vector<double>>>, which are laid out contiguously. Only when the y or x value is increased do we actually need to use indirection to retrieve the start pointer of the next innermost vector.
If Z is rather small, we need to change our 'base pointer' memory access quite often, which might lead to cache-misses if the storage areas of the innermost vectors are placed rather randomly in memory. However, if they are more or less contiguous, we might observe minimal to no differences in cache performance.
Since there was a question about the assembly output, let me give a short overview:
If you compare the assembly output of the nested and flattened array, you notice a lot of similarities: There are three equivalent nested loops and the counting variables x, y and z are stored in registers. The only real difference - aside from the fact that the nested version uses two counters for every outer index to avoid multiplying by 24 on every address computation, and the flattened version does the same for the innermost loop and multiplying by 8 - can be found in the y loop where instead of just incrementing y and computing the flattened index, we need to do three interdependent memory loads to determine the base pointer for our inner loop:
mov rax, QWORD PTR [rdi]
mov rax, QWORD PTR [rax+r11]
mov rax, QWORD PTR [rax+r10]
But since these only happen every Zth time and the pointers for the 'middle vector' are most likely cached, the time difference is negligible.

(This doesn't really answer the question. I think I read it backwards initially, assuming that the OP had just found what I was expecting, that nested vectors are slower than flat.)
You should expect the nested-vector version to be slower for anything other than sequential access. After fixing the row/column major indexing order for your flat version, it should be faster for many uses, especially since it's easier for a compiler to auto-vectorize with SIMD over a big flat array than over many short std::vector<>.
A cache line is only 64B. That's 8 doubles. Locality on a page level matters because of limited TLB entries, and prefetching requires sequential accesses, but you'll get that anyway (close enough) with nested vectors that are allocated all at once with most malloc implementations. (This is a trivial microbenchmark that doesn't do anything before allocating its vectors. In a real program that allocates and frees some memory before making a lot of small allocations, some of them might be scattered around more.)
Besides locality, the extra levels of indirection are potentially problematic.
A reference / pointer to a std::vector just points to the fixed-size block that holds the current size, allocated space, and the pointer to the buffer. IDK if any implementations place the buffer right after the control data as part of the same malloced block, but probably that's impossible because sizeof(std::vector<int>) has to be constant so you can have a vector of vectors. Check out the asm on godbolt: A function that just returns v[10] takes one load with an array arg, but two loads with a std::vector arg.
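Something like the following pair of functions is enough to compare in a compiler explorer. The function names are made up here, and this is not necessarily the exact code behind the godbolt link.

```cpp
#include <vector>

// Array argument: the pointer to the data is the argument itself,
// so fetching element 10 needs a single load.
int get_array(const int* a) {
    return a[10];
}

// std::vector argument: the reference points at the vector's control block,
// so the buffer pointer has to be loaded first, then the element.
int get_vector(const std::vector<int>& v) {
    return v[10];
}
```

Both return the same value for the same data; the difference only shows up in the generated asm as one load versus two.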
In the nested-vector implementation, loading v[x][y][z] requires 4 steps (assuming a pointer or reference to v is already in a register).
load v.buffer_pointer or v.bp or whatever the implementation calls it. (A pointer to an array of std::vector<std::vector<double>>)
load v.bp[x].bp (A pointer to an array of std::vector<double>)
load v.bp[x].bp[y].bp (A pointer to an array of double)
load v.bp[x].bp[y].bp[z] (The double we want)
A proper 3D array, simulated with a single std::vector, just does:
load v.bp (A pointer to an array of double)
load v.bp[(x*w + y)*h + z] (The double we want)
Multiple accesses to the same simulated 3D array with different x and y require computing a new index, but v.bp will stay in a register. So instead of 3 cache misses, we only get one.
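Spelled out as code, the two access paths look roughly like this. It is a sketch with invented names (load_nested, load_flat) that writes each dependent load on its own line; w and h follow the formula above, so w is the size of the y dimension and h the size of the z dimension.

```cpp
#include <cstddef>
#include <vector>

using Nested3D = std::vector<std::vector<std::vector<double>>>;

// Nested vectors: four dependent loads to reach one double.
double load_nested(const Nested3D& v, std::size_t x, std::size_t y, std::size_t z) {
    const std::vector<std::vector<double>>* planes = v.data();  // load v.bp
    const std::vector<double>* rows = planes[x].data();         // load v.bp[x].bp
    const double* line = rows[y].data();                        // load v.bp[x].bp[y].bp
    return line[z];                                             // load the double
}

// Flat vector: one load for the buffer pointer, one for the element.
double load_flat(const std::vector<double>& v, std::size_t x, std::size_t y,
                 std::size_t z, std::size_t w, std::size_t h) {
    const double* bp = v.data();     // load v.bp (stays in a register across accesses)
    return bp[(x * w + y) * h + z];  // load the double
}
```

In a loop, a compiler will hoist the outer loads of load_nested where it can, which is part of why in-order traversal hides the cost; random access to different x and y is where the chain of dependent loads hurts.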
Traversing the 3D array in order hides the penalty of the nested-vector implementation, because there's a loop over all the values in the inner-most vector hiding the overhead of changing x and y. Prefetch of the contiguous pointers in the outer vectors helps here, and Z is small enough in your testing that looping over one inner-most vector won't evict the pointer for the next y value.
What Every Programmer Should Know About Memory is getting somewhat outdated, but it covers the details of caching and locality. Software-prefetching is not nearly as important as it was on P4, so don't pay too much attention to that part of the guide.

May I just be lucky and have all the memory contiguously allocated?
Probably yes. I've modified your sample a little bit, so we have a benchmark which concentrates more on the differences between the two approaches:
array fill is done in a separate pass, so random generator speed is excluded
removed volatile
fixed a little bug (tmp1 was printed instead of tmp2)
added a #if 1 part, which can be used to randomize vec3D placement in memory. As it turned out, it makes a huge difference on my machine.
Without randomization (I've used parameters: 300 300 300):
Timing nested vectors...
Sum: -131835
Took: 2122 milliseconds
Timing flatten vector...
Sum: -131835
Took: 2085 milliseconds
So there is only a small difference; the flattened version is slightly faster. (I ran several tests and report the minimum time here.)
With randomization:
Timing nested vectors...
Sum: -117685
Took: 3014 milliseconds
Timing flatten vector...
Sum: -117685
Took: 2085 milliseconds
So the cache effect can be seen here: the nested version is ~50% slower. This is because the CPU cannot predict which memory area will be used next, so its cache prefetcher is not effective.
Here's the modified code:
#include <chrono>
#include <cstddef>
#include <cstdlib> // rand
#include <iostream>
#include <memory>
#include <random>
#include <string> // std::stol
#include <utility> // std::swap
#include <vector>
template<typename T>
class Array3D
{
std::size_t _X, _Y, _Z;
std::vector<T> _vec;
public:
Array3D(std::size_t X, std::size_t Y, std::size_t Z):
_X(X), _Y(Y), _Z(Z), _vec(_X * _Y * _Z) {}
T& operator()(std::size_t x, std::size_t y, std::size_t z)
{
return _vec[(x * _Y + y) * _Z + z];
}
const T& operator()(std::size_t x, std::size_t y, std::size_t z) const
{
return _vec[(x * _Y + y) * _Z + z];
}
};
double nested(std::vector<std::vector<std::vector<double>>> &vec3D, std::size_t X, std::size_t Y, std::size_t Z) {
double tmp1 = 0;
for (int iter=0; iter<100; iter++)
for (std::size_t x = 0 ; x < X; ++x)
{
for (std::size_t y = 0 ; y < Y; ++y)
{
for (std::size_t z = 0 ; z < Z; ++z)
{
tmp1 += vec3D[x][y][z];
}
}
}
return tmp1;
}
double flatten(Array3D<double> &vec1D, std::size_t X, std::size_t Y, std::size_t Z) {
double tmp2 = 0;
for (int iter=0; iter<100; iter++)
for (std::size_t x = 0 ; x < X; ++x)
{
for (std::size_t y = 0 ; y < Y; ++y)
{
for (std::size_t z = 0 ; z < Z; ++z)
{
tmp2 += vec1D(x, y, z);
}
}
}
return tmp2;
}
int main(int argc, char** argv)
{
if (argc < 4) // the three dimensions are required
{
std::cerr << "usage: " << argv[0] << " X Y Z\n";
return 1;
}
std::random_device rd{};
std::mt19937 rng{rd()};
std::uniform_real_distribution<double> urd(-1, 1);
const std::size_t X = std::stol(argv[1]);
const std::size_t Y = std::stol(argv[2]);
const std::size_t Z = std::stol(argv[3]);
std::vector<std::vector<std::vector<double>>>
vec3D(X, std::vector<std::vector<double>>(Y, std::vector<double>(Z)));
#if 1
for (std::size_t i = 0 ; i < X*Y; i++)
{
std::size_t xa = rand()%X;
std::size_t ya = rand()%Y;
std::size_t xb = rand()%X;
std::size_t yb = rand()%Y;
std::swap(vec3D[xa][ya], vec3D[xb][yb]);
}
#endif
// 3D wrapper around a 1D flat vector
Array3D<double> vec1D(X, Y, Z);
for (std::size_t x = 0 ; x < X; ++x)
{
for (std::size_t y = 0 ; y < Y; ++y)
{
for (std::size_t z = 0 ; z < Z; ++z)
{
vec3D[x][y][z] = vec1D(x, y, z) = urd(rng);
}
}
}
std::cout << "Timing nested vectors...\n";
auto start = std::chrono::steady_clock::now();
double tmp1 = nested(vec3D, X, Y, Z);
auto end = std::chrono::steady_clock::now();
std::cout << "\tSum: " << tmp1 << std::endl; // we make sure the loops are not optimized out
std::cout << "Took: ";
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << ms << " milliseconds\n";
std::cout << "Timing flatten vector...\n";
start = std::chrono::steady_clock::now();
double tmp2 = flatten(vec1D, X, Y, Z);
end = std::chrono::steady_clock::now();
std::cout << "\tSum: " << tmp2 << std::endl; // we make sure the loops are not optimized out
std::cout << "Took: ";
ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << ms << " milliseconds\n";
}
