How many cache misses will we have for this simple program?

I have a simple program as follows. When I compiled the code without any optimization, it took 5.986s (user 3.677s, sys 1.716s) to run on a Mac with a 2.4 GHz i5 processor and 16 GB of DDR3-1600 CL9 memory. I am trying to figure out how many L1 cache misses happen in this program. Any suggestions? Thanks!
int main()
{
    int size = 1024 * 1024 * 1024;
    int *a = new int[size];
    int i;
    for (i = 0; i < size; i++) a[i] = i;
    delete[] a;
}

You can measure the number of cache misses using the cachegrind tool of valgrind (for example, run valgrind --tool=cachegrind ./your_program and then cg_annotate on the generated output file). This page provides a pretty detailed summary.
Note: if you were writing C, you should be using malloc, and don't forget to call free, otherwise your program will leak memory. Since this is C++ (as the question is tagged), new[] and delete[] are the appropriate tools, and the code above already uses them.

If you want really fine-grained measurements of cache misses, you should use Intel's architectural performance counters, which can be accessed from userspace using the rdpmc instruction. The kernel module source I wrote in this answer will enable rdpmc in userspace for older CPUs.
Here is another kernel module to enable configuration of the counters for measuring last-level cache misses and last-level cache references. Note that I have hardcoded 8 cores, because that is what I happened to have for my configuration.
#include <linux/module.h>   /* Needed by all modules */
#include <linux/kernel.h>   /* Needed for KERN_INFO */
#include <linux/smp.h>      /* on_each_cpu, smp_processor_id */
#include <linux/spinlock.h>

#define PERFEVTSELx_MSR_BASE 0x00000186
#define PMCx_MSR_BASE        0x000000c1 /* NB: write when evt disabled */

#define PERFEVTSELx_USR (1U << 16) /* count in rings 1, 2, or 3 */
#define PERFEVTSELx_OS  (1U << 17) /* count in ring 0 */
#define PERFEVTSELx_EN  (1U << 22) /* enable counter */

/* Load with clear=1 to just zero and disable the counters. */
static int clear = 0;
module_param(clear, int, 0);
static void
write_msr(uint32_t msr, uint64_t val)
{
    uint32_t lo = val & 0xffffffff;
    uint32_t hi = val >> 32;
    __asm __volatile("wrmsr" : : "c" (msr), "a" (lo), "d" (hi));
}
static uint64_t
read_msr(uint32_t msr)
{
    uint32_t hi, lo;
    __asm __volatile("rdmsr" : "=d" (hi), "=a" (lo) : "c" (msr));
    return ((uint64_t) lo) | (((uint64_t) hi) << 32);
}
static uint64_t old_value_perfsel0[8];
static uint64_t old_value_perfsel1[8];
static DEFINE_SPINLOCK(mr_lock); /* protects the MSR programming below */
static unsigned long flags;
static void wrapper(void* ptr) {
    int id;
    uint64_t value;

    spin_lock_irqsave(&mr_lock, flags);
    id = smp_processor_id();

    // Save the old values before we do something stupid.
    old_value_perfsel0[id] = read_msr(PERFEVTSELx_MSR_BASE);
    old_value_perfsel1[id] = read_msr(PERFEVTSELx_MSR_BASE + 1);

    // Clear out the existing counters
    write_msr(PERFEVTSELx_MSR_BASE, 0);
    write_msr(PERFEVTSELx_MSR_BASE + 1, 0);
    write_msr(PMCx_MSR_BASE, 0);
    write_msr(PMCx_MSR_BASE + 1, 0);

    if (clear) {
        spin_unlock_irqrestore(&mr_lock, flags);
        return;
    }

    // Table 19-1 in the most recent Intel Manual - Architectural
    // Last Level Cache References: Event select 2EH, Umask 4FH
    value = 0x2E | (0x4F << 8) | PERFEVTSELx_EN | PERFEVTSELx_OS | PERFEVTSELx_USR;
    write_msr(PERFEVTSELx_MSR_BASE, value);

    // Table 19-1 in the most recent Intel Manual - Architectural
    // Last Level Cache Misses: Event select 2EH, Umask 41H
    value = 0x2E | (0x41 << 8) | PERFEVTSELx_EN | PERFEVTSELx_OS | PERFEVTSELx_USR;
    write_msr(PERFEVTSELx_MSR_BASE + 1, value);

    spin_unlock_irqrestore(&mr_lock, flags);
}
static void restore_wrapper(void* ptr) {
    int id = smp_processor_id();
    if (clear) return;
    write_msr(PERFEVTSELx_MSR_BASE, old_value_perfsel0[id]);
    write_msr(PERFEVTSELx_MSR_BASE + 1, old_value_perfsel1[id]);
}
int init_module(void)
{
    printk(KERN_INFO "Entering write-msr!\n");
    on_each_cpu(wrapper, NULL, 0);
    /*
     * A non 0 return means init_module failed; module can't be loaded.
     */
    return 0;
}

void cleanup_module(void)
{
    on_each_cpu(restore_wrapper, NULL, 0);
    printk(KERN_INFO "Exiting write-msr!\n");
}

MODULE_LICENSE("GPL");
Here is a user-space wrapper around rdpmc.
uint64_t
read_pmc(int ecx)
{
    unsigned int a, d;
    __asm __volatile("rdpmc" : "=a"(a), "=d"(d) : "c"(ecx));
    return ((uint64_t)a) | (((uint64_t)d) << 32);
}
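With the module loaded, usage could look roughly like the sketch below (my own illustration, not from the original answer). It assumes PMC0 holds LLC references and PMC1 holds LLC misses, exactly as programmed by the module above, and it ignores core migration and counter wrap-around, so pin the process to one core for meaningful numbers:
#include <cstdint>
#include <cstdio>

uint64_t read_pmc(int ecx); // the wrapper shown above

int main()
{
    int size = 1024 * 1024 * 1024;
    int *a = new int[size];

    // Snapshot both counters before the loop under test.
    uint64_t refs_before = read_pmc(0); // LLC references (PMC0)
    uint64_t miss_before = read_pmc(1); // LLC misses (PMC1)

    for (int i = 0; i < size; i++) a[i] = i;

    std::printf("LLC references: %llu, LLC misses: %llu\n",
                (unsigned long long)(read_pmc(0) - refs_before),
                (unsigned long long)(read_pmc(1) - miss_before));
    delete[] a;
}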

You have to be running on a 64-bit system. You are writing 4 GiB of data (1 Gi ints of 4 bytes each). The number of cache misses is roughly 4 x 1024 x 1024 x 1024 bytes divided by the cache line size, i.e. about 67 million for 64-byte lines. However, since all the memory accesses are sequential, you will not have many TLB misses etc., and the processor most likely optimises accesses to sequential cache lines (hardware prefetching).

Your performance here (or lack thereof) is totally dominated by paging: every new page (probably 4 KiB in your case) causes a page fault, since it is newly allocated and has never been used, and that triggers an expensive OS flow. Cachegrind and performance monitors should show you the same behavior, so you may be confused if you expected just plain data accesses.
One way to avoid this is to allocate, store once to the entire array (or even once per page) to warm up the page tables, and then measure the time internally in your application (using rdtsc or any C API you like) over the main loop only.
Alternatively, if you want to use external time measurements, just loop multiple times (> 1000) and amortize, so the initial penalty becomes less significant.
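A minimal sketch of the warm-up-then-measure approach (my own illustration; it assumes 4 KiB pages and uses std::chrono instead of rdtsc):
#include <chrono>
#include <cstddef>
#include <cstdio>

int main()
{
    const std::size_t size = 1024u * 1024 * 1024;
    int *a = new int[size];

    // Warm-up: touch one int per (assumed) 4 KiB page so the page faults
    // happen here rather than inside the timed loop.
    for (std::size_t i = 0; i < size; i += 4096 / sizeof(int)) a[i] = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < size; i++) a[i] = static_cast<int>(i);
    auto t1 = std::chrono::steady_clock::now();

    std::printf("main loop: %lld ms\n", static_cast<long long>(
        std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()));
    delete[] a;
}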
Once you do all that, the cache misses you measure should reflect one access per new 64-byte line (i.e. roughly 64M for the 4 GiB array), plus the page walks (about 1M pages, assuming they are 4 KiB, multiplied by the number of page-table levels, since each walk has to look up memory at every level).
Under a virtualized platform the paging cost roughly squares (e.g. 9 accesses instead of 3), since each level of the guest page table also needs to be translated on the host.

Related

CPU caching understanding

I tested the following code with the Google Benchmark framework to measure memory access latency with different array sizes:
int64_t MemoryAccessAllElements(const int64_t *data, size_t length) {
    for (size_t id = 0; id < length; id++) {
        volatile int64_t ignored = data[id];
    }
    return 0;
}

int64_t MemoryAccessEvery4th(const int64_t *data, size_t length) {
    for (size_t id = 0; id < length; id += 4) {
        volatile int64_t ignored = data[id];
    }
    return 0;
}
And I got the following results (the results are averaged by Google Benchmark; for big arrays there are about ~10 iterations, while much more work is performed for the smaller ones):
There is a lot of different stuff happening in this picture and, unfortunately, I can't explain all of the changes in the graph.
I tested this code on a single-core CPU with the following cache configuration:
CPU Caches:
L1 Data 32K (x1), 8 way associative
L1 Instruction 32K (x1), 8 way associative
L2 Unified 256K (x1), 8 way associative
L3 Unified 30720K (x1), 20 way associative
In this picture we can see many changes in the behavior of the graph:
There is a spike after a 64-byte array size, which can be explained by the fact that the cache line is 64 bytes long: with an array larger than 64 bytes we experience one more L1 cache miss (which can be categorized as a compulsory cache miss).
Also, the latency increases near the cache-size boundaries, which is also easy to explain: at that point we experience capacity cache misses.
But there are a lot of questions about the results that I can't explain:
Why does the latency for MemoryAccessEvery4th decrease after the array exceeds ~1024 bytes?
Why can we see another peak for MemoryAccessAllElements around 512 bytes? It is an interesting point, because at that moment we start to access more than one set of cache lines (8 * 64 bytes in one set). But is it really caused by this, and if so, how can it be explained?
Why does the latency increase after passing the L2 cache size when benchmarking MemoryAccessEvery4th, while there is no such difference for MemoryAccessAllElements?
I've tried to compare my results with the gallery of processor cache effects and What Every Programmer Should Know About Memory, but I can't fully describe my results with the reasoning from those articles.
Can someone help me understand the internal processes of CPU caching?
UPD:
I use the following code to measure performance of memory access:
#include <benchmark/benchmark.h>
#include <random>

using namespace benchmark;

void InitializeWithRandomNumbers(long long *array, size_t length) {
    // Pseudo-random fill; the repository uses its own Random helper,
    // a std::mt19937_64 serves the same purpose here.
    std::mt19937_64 random(0);
    std::uniform_int_distribution<long long> next_long(0, 1LL << 60);
    for (size_t id = 0; id < length; id++) {
        array[id] = next_long(random);
    }
}
static void MemoryAccessAllElements_Benchmark(State &state) {
    size_t size = static_cast<size_t>(state.range(0));
    auto array = new long long[size];
    InitializeWithRandomNumbers(array, size);
    for (auto _ : state) {
        DoNotOptimize(MemoryAccessAllElements(array, size));
    }
    delete[] array;
}

static void CustomizeBenchmark(benchmark::internal::Benchmark *benchmark) {
    for (int size = 2; size <= (1 << 24); size *= 2) {
        benchmark->Arg(size);
    }
}

BENCHMARK(MemoryAccessAllElements_Benchmark)->Apply(CustomizeBenchmark);
BENCHMARK_MAIN();
You can find slightly different examples in the repository, but the basic approach for the benchmark in the question is the same.

Why is std::fill(0) slower than std::fill(1)?

I have observed on a system that std::fill on a large std::vector<int> was significantly and consistently slower when setting a constant value 0 compared to a constant value 1 or a dynamic value:
5.8 GiB/s vs 7.5 GiB/s
However, the results are different for smaller data sizes, where fill(0) is faster:
With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):
This raises the secondary question, why the peak bandwidth of fill(1) is so much lower.
The test system for this was a dual-socket Intel Xeon E5-2680 v3 set to 2.5 GHz (via /sys/cpufreq) with 8x16 GiB of DDR4-2133. I tested with GCC 6.1.0 (-O3) and the Intel compiler 17.0.1 (-fast); both get identical results. GOMP_CPU_AFFINITY=0,12,1,13,2,14,3,15,4,16,5,17,6,18,7,19,8,20,9,21,10,22,11,23 was set. STREAM add with 24 threads gets 85 GiB/s on the system.
I was able to reproduce this effect on a different Haswell dual socket server system, but not any other architecture. For example on Sandy Bridge EP, memory performance is identical, while in cache fill(0) is much faster.
Here is the code to reproduce:
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <vector>

using value = int;
using vector = std::vector<value>;

constexpr size_t write_size = 8ll * 1024 * 1024 * 1024;
constexpr size_t max_data_size = 4ll * 1024 * 1024 * 1024;

void __attribute__((noinline)) fill0(vector& v) {
    std::fill(v.begin(), v.end(), 0);
}

void __attribute__((noinline)) fill1(vector& v) {
    std::fill(v.begin(), v.end(), 1);
}

void bench(size_t data_size, int nthreads) {
#pragma omp parallel num_threads(nthreads)
    {
        vector v(data_size / (sizeof(value) * nthreads));
        auto repeat = write_size / data_size;
#pragma omp barrier
        auto t0 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill0(v);
#pragma omp barrier
        auto t1 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill1(v);
#pragma omp barrier
        auto t2 = omp_get_wtime();
#pragma omp master
        std::cout << data_size << ", " << nthreads << ", " << write_size / (t1 - t0) << ", "
                  << write_size / (t2 - t1) << "\n";
    }
}

int main(int argc, const char* argv[]) {
    std::cout << "size,nthreads,fill0,fill1\n";
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, 1);
    }
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, omp_get_max_threads());
    }
    for (int nthreads = 1; nthreads <= omp_get_max_threads(); nthreads++) {
        bench(max_data_size, nthreads);
    }
}
Presented results compiled with g++ fillbench.cpp -O3 -o fillbench_gcc -fopenmp.
From your question + the compiler-generated asm from your answer:
fill(0) is an ERMSB rep stosb which will use 256b stores in an optimized microcoded loop. (Works best if the buffer is aligned, probably to at least 32B or maybe 64B).
fill(1) is a simple 128-bit movaps vector store loop. Only one store can execute per core clock cycle regardless of width, up to 256b AVX. So 128b stores can only fill half of Haswell's L1D cache write bandwidth. This is why fill(0) is about 2x as fast for buffers up to ~32kiB. Compile with -march=haswell or -march=native to fix that.
Haswell can just barely keep up with the loop overhead, but it can still run 1 store per clock even though it's not unrolled at all. But with 4 fused-domain uops per clock, that's a lot of filler taking up space in the out-of-order window. Some unrolling would maybe let TLB misses start resolving farther ahead of where stores are happening, since there is more throughput for store-address uops than for store-data. Unrolling might help make up the rest of the difference between ERMSB and this vector loop for buffers that fit in L1D. (A comment on the question says that -march=native only helped fill(1) for L1.)
Note that rep stosd (which could be used to implement fill(1) for int elements) will probably perform the same as rep stosb on Haswell.
Although the official documentation only guarantees that ERMSB gives fast rep stosb (but not rep stosd), actual CPUs that support ERMSB use similarly efficient microcode for rep stosd. There is some doubt about IvyBridge, where maybe only the b variant is fast. See @BeeOnRope's excellent ERMSB answer for updates on this.
gcc has some x86 tuning options for string ops (like -mstringop-strategy=alg and -mmemset-strategy=strategy), but IDK if any of them will get it to actually emit rep stosd for fill(1). Probably not, since I assume the code starts out as a loop rather than a memset.
With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):
A normal movaps store to a cold cache line triggers a Read For Ownership (RFO). A lot of real DRAM bandwidth is spent on reading cache lines from memory when movaps writes the first 16 bytes. ERMSB stores use a no-RFO protocol for their stores, so the memory controllers are only writing. (Except for miscellaneous reads, like page tables if any page walks miss even in L3 cache, and maybe some load misses in interrupt handlers or whatever.)
@BeeOnRope explains in comments that the difference between regular RFO stores and the RFO-avoiding protocol used by ERMSB has downsides for some ranges of buffer sizes on server CPUs, where there's high latency in the uncore/L3 cache. See also the linked ERMSB answer for more about RFO vs non-RFO, and about the high latency of the uncore (L3/memory) in many-core Intel CPUs being a problem for single-core bandwidth.
movntps (_mm_stream_ps()) stores are weakly-ordered, so they can bypass the cache and go straight to memory a whole cache-line at a time without ever reading the cache line into L1D. movntps avoids RFOs, like rep stos does. (rep stos stores can reorder with each other, but not outside the boundaries of the instruction.)
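For concreteness, here is a minimal sketch of such a non-temporal fill with _mm_stream_ps (my own illustration, not one of the benchmarked variants; it assumes a 16-byte-aligned float destination):
#include <xmmintrin.h>
#include <cstddef>

// Fill a float buffer with 1.0f using movntps: no RFO, the lines go straight to memory.
void fill1_nt_ps(float* p, std::size_t n)
{
    const __m128 ones = _mm_set1_ps(1.0f);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_stream_ps(p + i, ones); // p + i must be 16-byte aligned
    for (; i < n; ++i)              // scalar tail
        p[i] = 1.0f;
    _mm_sfence();                   // order the NT stores before later loads/stores
}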
Your movntps results in your updated answer are surprising.
For a single thread with large buffers, your results are movnt >> regular RFO > ERMSB. So that's really weird that the two non-RFO methods are on opposite sides of the plain old stores, and that ERMSB is so far from optimal. I don't currently have an explanation for that. (edits welcome with an explanation + good evidence).
As we expected, movnt allows multiple threads to achieve high aggregate store bandwidth, like ERMSB. movnt always goes straight into line-fill buffers and then memory, so it is much slower for buffer sizes that fit in cache. One 128b vector per clock is enough to easily saturate a single core's no-RFO bandwidth to DRAM. Probably vmovntps ymm (256b) is only a measurable advantage over vmovntps xmm (128b) when storing the results of a CPU-bound AVX 256b-vectorized computation (i.e. only when it saves the trouble of unpacking to 128b).
movnti bandwidth is low because storing in 4B chunks bottlenecks on 1 store uop per clock adding data to the line-fill buffers, not on sending those full line-fill buffers to DRAM (until you have enough threads to saturate memory bandwidth).
@osgx posted some interesting links in comments:
Agner Fog's asm optimization guide, instruction tables, and microarch guide: http://agner.org/optimize/
Intel optimization guide: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf.
NUMA snooping: http://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/
https://software.intel.com/en-us/articles/intelr-memory-latency-checker
Cache Coherence Protocol and Memory
Performance of the Intel Haswell-EP Architecture
See also other stuff in the x86 tag wiki.
I'll share my preliminary findings, in the hope to encourage more detailed answers. I just felt this would be too much as part of the question itself.
The compiler optimizes fill(0) to an internal memset. It cannot do the same for fill(1), since memset only works on bytes.
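A minimal illustration of that claim (my own sketch, with value = int as in the benchmark):
#include <algorithm>
#include <cstring>
#include <vector>

void demo(std::vector<int>& v)
{
    // Every byte of an int with value 0 is 0x00, so this fill is equivalent to a memset:
    std::fill(v.begin(), v.end(), 0);
    std::memset(v.data(), 0, v.size() * sizeof(int));

    // An int with value 1 is the byte pattern 01 00 00 00, which memset cannot produce:
    std::fill(v.begin(), v.end(), 1);
}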
Specifically, both glibc's __memset_avx2 and __intel_avx_rep_memset are implemented with a single hot instruction:
rep stos %al,%es:(%rdi)
Whereas the manual loop compiles down to an actual 128-bit instruction:
add $0x1,%rax
add $0x10,%rdx
movaps %xmm0,-0x10(%rdx)
cmp %rax,%r8
ja 400f41
Interestingly, while there is a template/header optimization that implements std::fill via memset for byte types, in this case it is a compiler optimization that transforms the actual loop.
Strangely, for a std::vector<char>, gcc also begins to optimize fill(1). The Intel compiler does not, despite the memset template specialization.
Since this happens only when the code is actually working in memory rather than in cache, it makes it appear that the Haswell-EP architecture fails to efficiently consolidate the single-byte writes.
I would appreciate any further insight into the issue and the related micro-architecture details. In particular, it is unclear to me why this behaves so differently for four or more threads, and why memset is so much faster in cache.
Update:
Here are results comparing:
fill(1) using -march=native (AVX2 vmovdq %ymm0): it works better in L1, but is similar to the movaps %xmm0 version for the other memory levels.
Variants of 32-, 128- and 256-bit non-temporal stores. They perform consistently with the same performance regardless of the data size. All outperform the other variants in memory, especially for small numbers of threads. 128-bit and 256-bit perform almost identically; for low numbers of threads, 32-bit performs significantly worse.
For <= 6 threads, vmovnt has a 2x advantage over rep stos when operating in memory.
Single threaded bandwidth:
Aggregate bandwidth in memory:
Here is the code used for the additional tests with their respective hot-loops:
void __attribute__ ((noinline)) fill1(vector& v) {
    std::fill(v.begin(), v.end(), 1);
}
┌─→add $0x1,%rax
│ vmovdq %ymm0,(%rdx)
│ add $0x20,%rdx
│ cmp %rdi,%rax
└──jb e0
void __attribute__ ((noinline)) fill1_nt_si32(vector& v) {
    for (auto& elem : v) {
        _mm_stream_si32(&elem, 1);
    }
}
┌─→movnti %ecx,(%rax)
│ add $0x4,%rax
│ cmp %rdx,%rax
└──jne 18
void __attribute__ ((noinline)) fill1_nt_si128(vector& v) {
    assert((long)v.data() % 32 == 0); // alignment
    const __m128i buf = _mm_set1_epi32(1);
    size_t i;
    int* data;
    int* end4 = &v[v.size() - (v.size() % 4)];
    int* end = &v[v.size()];
    for (data = v.data(); data < end4; data += 4) {
        _mm_stream_si128((__m128i*)data, buf);
    }
    for (; data < end; data++) {
        *data = 1;
    }
}
┌─→vmovnt %xmm0,(%rdx)
│ add $0x10,%rdx
│ cmp %rcx,%rdx
└──jb 40
void __attribute__ ((noinline)) fill1_nt_si256(vector& v) {
    assert((long)v.data() % 32 == 0); // alignment
    const __m256i buf = _mm256_set1_epi32(1);
    size_t i;
    int* data;
    int* end8 = &v[v.size() - (v.size() % 8)];
    int* end = &v[v.size()];
    for (data = v.data(); data < end8; data += 8) {
        _mm256_stream_si256((__m256i*)data, buf);
    }
    for (; data < end; data++) {
        *data = 1;
    }
}
┌─→vmovnt %ymm0,(%rdx)
│ add $0x20,%rdx
│ cmp %rcx,%rdx
└──jb 40
Note: I had to do manual pointer calculation in order to get the loops so compact. Otherwise it would do vector indexing within the loop, probably due to the intrinsic confusing the optimizer.

System call cost

I'm currently working on operating-system overheads.
I'm studying the cost of making a system call, and I've developed a simple C++ program to observe it.
#include <cstdint>
#include <iostream>
#include <unistd.h>
#include <sys/time.h>

uint64_t
rdtscp(void) {
    uint32_t eax, edx;
    __asm__ __volatile__("rdtscp"                 //! rdtscp instruction
                         : "=a" (eax), "=d" (edx) //! outputs
                         :                        //! inputs
                         : "ecx");                //! rdtscp also writes ecx
    return (((uint64_t)edx << 32) | eax);
}
int main(void) {
    uint64_t before;
    uint64_t after;
    struct timeval t;
    for (unsigned int i = 0; i < 10; ++i) {
        before = rdtscp();
        gettimeofday(&t, NULL);
        after = rdtscp();
        std::cout << after - before << std::endl;
        std::cout << t.tv_usec << std::endl;
    }
    return 0;
}
This program is quite straightforward.
The rdtscp function is just a wrapper to call the RDTSCP instruction (a processor instruction which loads the 64-bit cycle count into two 32-bit registers). This function is used to take the timings.
I iterate 10 times. At each iteration I call gettimeofday and determine the time it took to execute (as a number of CPU cycles).
The results are quite unexpected:
8984
64008
376
64049
164
64053
160
64056
160
64060
160
64063
164
64067
160
64070
160
64073
160
64077
Odd lines in the output are the number of cycles needed to execute the system call. Even lines are the value contained in t.tv_usec (which is set by gettimeofday, the system call that I'm studying).
I don't really understand how this is possible: the number of cycles drastically decreases, from nearly 10,000 to around 150! But the timeval struct is still updated at each call!
I've tried this on different operating systems (Debian and macOS) and the result is similar.
Even if a cache is used, I don't see how it is possible. Making a system call should result in a context switch from user to kernel mode, and we still need to read the clock in order to update the time.
Does someone have an idea?
The answer? Try another system call. There are vsyscalls on Linux, and they accelerate things for certain syscalls:
What are vdso and vsyscall?
The short version: the syscall is not performed; instead, the kernel maps a region of memory where the process can read the time information. The cost? Not much (no context switch).
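To see the vDSO effect directly, you can compare the fast path against a forced kernel entry (a Linux-only sketch; __rdtscp from <x86intrin.h> replaces the hand-rolled wrapper, and syscall(SYS_gettimeofday, ...) bypasses the vDSO):
#include <cstdint>
#include <cstdio>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>
#include <x86intrin.h>

int main()
{
    struct timeval t;
    unsigned int aux;
    for (int i = 0; i < 10; ++i) {
        uint64_t b1 = __rdtscp(&aux);
        gettimeofday(&t, NULL);               // vDSO path: no kernel entry
        uint64_t a1 = __rdtscp(&aux);

        uint64_t b2 = __rdtscp(&aux);
        syscall(SYS_gettimeofday, &t, NULL);  // forced syscall: enters the kernel
        uint64_t a2 = __rdtscp(&aux);

        std::printf("vdso: %llu cycles, syscall: %llu cycles\n",
                    (unsigned long long)(a1 - b1), (unsigned long long)(a2 - b2));
    }
    return 0;
}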

Faster bit reading?

In my application, 20% of the CPU time is spent reading bits (skip) through my bit reader. Does anyone have an idea of how one might make the following code faster? At any given time, I do not need more than 20 valid bits (which is why, in some situations, I can use fast_skip).
Bits are read in big-endian order, which is why the byte swap is needed.
class bit_reader
{
    std::uint32_t* m_data;
    std::size_t    m_pos;
    std::uint64_t  m_block;

public:
    bit_reader(void* data)
        : m_data(reinterpret_cast<std::uint32_t*>(data))
        , m_pos(0)
        , m_block(_byteswap_uint64(*reinterpret_cast<std::uint64_t*>(data)))
    {
    }

    std::uint64_t value(std::size_t n_bits = 64)
    {
        assert(m_pos + n_bits < 64);
        return (m_block << m_pos) >> (64 - n_bits);
    }

    void skip(std::size_t n_bits) // 20% cpu time
    {
        assert(m_pos + n_bits < 42);
        m_pos += n_bits;
        if(m_pos > 31)
        {
            m_block = _byteswap_uint64(reinterpret_cast<std::uint64_t*>(++m_data)[0]);
            m_pos -= 32;
        }
    }

    void fast_skip(std::size_t n_bits)
    {
        assert(m_pos + n_bits < 42);
        m_pos += n_bits;
    }
};
Target hardware is x64.
I see from an earlier comment you are unpacking Huffman/arithmetic coded streams in JPEG.
skip() and value() are really simple enough to be inlined. There's a chance that the compiler will keep the shift register and buffer pointers in registers the whole time. Marking all pointers here and in the caller with the restrict modifier might help, by telling the compiler that you won't be writing the results of Huffman decoding into the bit buffer, thus allowing further optimisation.
The average length of each Huffman/arithmetic symbol is short, so ~7 times out of 8 you won't need to top up the 64-bit shift register. Investigate giving the compiler a branch-prediction hint.
It's unusual for any symbol in the JPEG bitstream to be longer than 32 bits. Does this allow further optimization?
One very logical reason that skip() is a heavy path is that you're calling it a lot. You are consuming an entire symbol at once rather than bit by bit here, aren't you? There are some clever tricks you can do by counting leading 0s or 1s in symbols and table lookups.
You might consider arranging your shift register such that the next bit in the stream is the LSB. This will avoid the shifts in value().
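For the branch-prediction point above, here is a hedged sketch of what the hint could look like with GCC/Clang (the original code uses MSVC intrinsics, which do not provide __builtin_expect, so treat this purely as an illustration):
void skip(std::size_t n_bits)
{
    m_pos += n_bits;
    // The refill is needed only ~1 time in 8 for short symbols,
    // so tell the compiler this branch is usually not taken.
    if (__builtin_expect(m_pos > 31, 0))
    {
        m_block = __builtin_bswap64(reinterpret_cast<std::uint64_t*>(++m_data)[0]);
        m_pos -= 32;
    }
}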
Shifting by 64 bits is definitely not a good idea. On many CPUs shifting is a slow operation.
I would advise you to change your code to byte addressing. This will limit the shift to 8 bits maximum.
In many cases you really do not need the bit by itself, but rather to check whether it is present or not. This can be done with code like:
if (data[bit_inx/64] & mask[bit_inx % 64])
{
....
}
Try substituting this line in skip:
m_block = (m_block << 32) | _byteswap_uint32(*++m_data);
I don't know if it's the cause, or what the underlying implementation of _byteswap_uint64 looks like, but you should read Rob Pike's article on byte order. Maybe that's your answer.
Abstract: endianness is less of a problem than it's often made out to be, and the implementations of byte-order swapping often come with issues. But there's a simple alternative.
[EDIT] I've got a better theory. Pasted from my comment below:
Maybe it's alignment. 64-bit architectures love to align data on 64-bit boundaries; when you read across an alignment boundary, it gets pretty slow. So it could be the (++m_data)[0] part: x64 is 64-bit aligned, and when you reinterpret_cast a uint32_t* to uint64_t*, you are crossing alignment boundaries about half of the time.
If your source buffers are not huge, then you should pre-process them: byte-swap the buffers before you access them through the bit_reader!
Reading from your bit_reader will be much faster then, because:
you will save some conditional instructions
the CPU caches can be used more efficiently: the reader can read straight from memory, which is most probably already loaded into the CPU cache, instead of reading from memory that gets modified after reading each 64-bit chunk, which destroys the benefit of having had it in the cache
EDIT
Oh wait, you do not modify the source buffer. However, putting the byteswap into a pre-processing stage should at least be worth a try.
Another point: make sure those assert() calls are only compiled into the debug build.
EDIT 2
(deleted)
EDIT 3
Your code is definitely flawed, check the following usage scenario:
uint32_t source[] = { 0x00112233, 0x44556677, 0x8899AABB, 0xCCDDEEFF };
bit_reader br(source); // -> m_block = 0x7766554433221100
// reading...
br.value(16); // -> 0x77665544
br.skip(16);
br.value(16); // -> 0x33221100
br.skip(16); // -> triggers reading more bits
// -> m_block = 0xBBAA998877665544, m_pos = 0
br.value(16); // -> 0xBBAA9988
br.skip(16);
br.value(16); // -> 0x77665544
// that's not what you expect, right ???
EDIT 4
Well, no, EDIT 3 was wrong, but I can't help it: the code is flawed. Isn't it?
uint32_t source[] = { 0x00112233, 0x44556677, 0x8899AABB, 0xCCDDEEFF };
bit_reader br(source); // -> m_block = 0x7766554433221100
// reading...
br.value(16); // -> 0x7766
br.skip(16);
br.value(16); // -> 0x5544
br.skip(16); // -> triggers reading more bits (because m_pos=32, which is: m_pos>31)
// -> m_block = 0xBBAA998877665544, m_pos = 0
br.value(16); // -> 0xBBAA --> not what you expect, right?
Here is another version I tried, which didn't give any performance improvements.
class bit_reader
{
public:
    const std::uint64_t* m_data64;
    std::size_t          m_pos64;
    std::uint64_t        m_block0;
    std::uint64_t        m_block1;

    bit_reader(const void* data)
        : m_data64(reinterpret_cast<const std::uint64_t*>(data))
        , m_pos64(0)
        , m_block0(byte_swap(*m_data64++))
        , m_block1(byte_swap(*m_data64++))
    {
    }

    std::uint64_t value(std::size_t n_bits = 64)
    {
        return __shiftleft128(m_block1, m_block0, m_pos64) >> (64 - n_bits);
    }

    void skip(std::size_t n_bits)
    {
        m_pos64 += n_bits;
        if(m_pos64 > 63)
        {
            m_block0 = m_block1;
            m_block1 = byte_swap(*m_data64++);
            m_pos64 -= 64;
        }
    }

    void fast_skip(std::size_t n_bits)
    {
        skip(n_bits);
    }
};
If possible, it would be best to do this in multiple passes. Multiple runs can be optimized and branching reduced.
In general it is best to do:
uint64_t * arr = data;
for(uint64_t * i = arr; i != &arr[len / sizeof(uint64_t)]; i++)
{
    *i = _byteswap_uint64(*i);
    // no more operations here
}
// another similar for loop
Such code can reduce the run time by a huge factor.
At worst you can do it in runs of ~100k blocks, to keep cache misses at a minimum and to load data from RAM only once.
In your case you process the data in a streaming way, which is good for keeping memory usage low and for getting faster responses from a slow data source, but not for speed.

MapViewOfFile and VirtualLock

Will the following code load data from file into system memory so that access to the resulting pointer will never block threads?
auto ptr = MapViewOfFile(file_map, FILE_MAP_READ, high, low, size); // Map file to memory.
VirtualLock(ptr, size); // Lock the pages and wait for the transfer to finish.
int val0 = reinterpret_cast<int*>(ptr)[0];                      // Will not block thread?
int val1 = reinterpret_cast<int*>(ptr)[size / sizeof(int) - 1]; // Will not block thread?
VirtualUnlock(ptr, size);
UnmapViewOfFile(ptr);
EDIT:
Updated after Damon's answer.
auto ptr = MapViewOfFile(file_map, FILE_MAP_READ, high, low, size);
#pragma optimize("", off)
char dummy;
for(int n = 0; n < size; n += 4096)
    dummy = reinterpret_cast<char*>(ptr)[n];
#pragma optimize("", on)
int val0 = reinterpret_cast<int*>(ptr)[0];                      // Will not block thread?
int val1 = reinterpret_cast<int*>(ptr)[size / sizeof(int) - 1]; // Will not block thread?
UnmapViewOfFile(ptr);
If the file's size is less than the ridiculously small maximum working set size (or, if you have modified your working set size accordingly) then in theory yes. If you exceed your maximum working set size, VirtualLock will simply do nothing (that is, fail).
(In practice, I've seen VirtualLock being rather... liberal... at interpreting what it's supposed to do as opposed to what it actually does, at least under Windows XP -- might be different under more modern versions)
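If the mapping is larger than that limit, you can raise the working-set bounds before locking (a sketch; the slack sizes are arbitrary placeholders, and both calls can fail, e.g. without the required quota or privilege):
// Raise the process working-set limits so VirtualLock can pin `size` bytes.
SIZE_T min_ws = size + (1 << 20);
SIZE_T max_ws = size + (8 << 20);
if (!SetProcessWorkingSetSize(GetCurrentProcess(), min_ws, max_ws))
    { /* handle error */ }
if (!VirtualLock(ptr, size))
    { /* handle error */ }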
I've tried similar things in the past, and now I simply touch all pages that I want in RAM with a simple for loop (reading one byte per page). This leaves no questions open and works, with the sole possible exception that a page might in theory get swapped out again after being touched. In practice, this never happens (unless the machine is really, really low on RAM, in which case it's OK for it to happen).