MapViewOfFile and VirtualLock - C++

Will the following code load data from file into system memory so that access to the resulting pointer will never block threads?
auto ptr = MapViewOfFile(file_map, FILE_MAP_READ, high, low, size); // Map file to memory.
VirtualLock(ptr, size); // Lock the pages into RAM and wait for the transfer from disk to finish. (Note: VirtualLock returns a BOOL, not a pointer.)
int val0 = reinterpret_cast<int*>(ptr)[0]; // Will not block thread?
int val1 = reinterpret_cast<int*>(ptr)[size/sizeof(int) - 1]; // Will not block thread? (last int; size is in bytes)
VirtualUnlock(ptr);
UnmapViewOfFile(ptr);
EDIT:
Updated after Damon's answer.
auto ptr = MapViewOfFile(file_map, FILE_MAP_READ, high, low, size);
#pragma optimize("", off)
char dummy;
for(int n = 0; n < size; n += 4096)
dummy = reinterpret_cast<char*>(ptr)[n];
#pragma optimize("", on)
int val0 = reinterpret_cast<int*>(ptr)[0]; // Will not block thread?
int val1 = reinterpret_cast<int*>(ptr)[size/sizeof(int) - 1]; // Will not block thread? (last int; size is in bytes)
UnmapViewOfFile(ptr);

If the file's size is less than the (ridiculously small) default maximum working set size, or if you have increased your working set size accordingly, then in theory yes. If you exceed your maximum working set size, VirtualLock will simply do nothing (that is, fail).
(In practice, I've seen VirtualLock being rather... liberal... at interpreting what it's supposed to do as opposed to what it actually does, at least under Windows XP -- might be different under more modern versions)
I've been trying similar things in the past, and I'm now simply touching all pages that I want in RAM with a simple for loop (reading one byte from each). This leaves no questions open and works, with the sole possible exception that a page might in theory get swapped out again after being touched. In practice, this never happens (unless the machine is really, really low on RAM, and then it's OK for it to happen).
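If you do want VirtualLock, the working set limits can be raised first. A minimal sketch, assuming a Win32 build; lock_region is an illustrative name and error handling is reduced to return codes:
#include <cstdio>
#include <windows.h>

// VirtualLock fails once the locked bytes exceed the process working set
// maximum, so grow both limits by the region size before locking.
bool lock_region(void* ptr, SIZE_T size)
{
    HANDLE self = GetCurrentProcess();
    SIZE_T min_ws = 0, max_ws = 0;
    if (!GetProcessWorkingSetSize(self, &min_ws, &max_ws))
        return false;
    if (!SetProcessWorkingSetSize(self, min_ws + size, max_ws + size))
        return false;
    if (!VirtualLock(ptr, size)) {
        std::printf("VirtualLock failed: %lu\n", GetLastError());
        return false;
    }
    return true;
}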

Related

Does `mem_fence` provide consistency between work-groups?

I am trying to implement the bounding-box calculation as described here. Long story short, I have a binary tree of bounding boxes. The leaf nodes are all filled in, and now it is time to calculate the internal nodes. In addition to the nodes (each defining the child/parent indices), there is a counter for each internal node.
Starting at each leaf node, the parent node is visited and its flag atomically incremented. If this is the first visit to the node, the thread exits (as only one child is guaranteed to have been initialized). If it is the second visit, then both children are initialized, its bounding box is calculated and we continue with that node's parents.
Is the mem_fence between reading the flag and reading the data of its children sufficient to guarantee the data in the children will be visible?
kernel void internalBounds(global struct Bound * const bounds,
global unsigned int * const flags,
const global struct Node * const nodes) {
const unsigned int n = get_global_size(0);
const size_t D = 3;
const size_t leaf_start = n - 1;
size_t node_idx = leaf_start + get_global_id(0);
do {
node_idx = nodes[node_idx].parent;
write_mem_fence(CLK_GLOBAL_MEM_FENCE);
// Mark node as visited, both children initialized on second visit
if (atomic_inc(&flags[node_idx]) < 1)
break;
read_mem_fence(CLK_GLOBAL_MEM_FENCE);
const global unsigned int * child_idxs = nodes[node_idx].internal.children;
for (size_t d = 0; d < D; d++) {
bounds[node_idx].min[d] = min(bounds[child_idxs[0]].min[d],
bounds[child_idxs[1]].min[d]);
bounds[node_idx].max[d] = max(bounds[child_idxs[0]].max[d],
bounds[child_idxs[1]].max[d]);
}
} while (node_idx != 0);
}
I am limited to OpenCL 1.2.
No, it doesn't. CLK_GLOBAL_MEM_FENCE only provides consistency within the work-group when accessing global memory. There is no inter-workgroup synchronization in OpenCL 1.x.
Try to use a single, large workgroup and iterate over the data. And/or start with some small trees that will fit inside a single work group.
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/mem_fence.html
mem_fence(...) orders memory accesses for only a single work-item. Even if all work-items execute this line, they may not reach (and pass) it at the same time.
barrier(...) does synchronize all work-items in a work-group and makes them wait for the slowest one (the one still accessing the memory specified by the parameter), but it is only tied to its own work-group's work-items (such as 64 or 256 for AMD/Intel, and maybe 1024 for Nvidia). This is because an OpenCL device driver implementation may be designed to finish all wavefronts before loading new shards of wavefronts, since all global work-items would simply not fit inside chip memory (for example, 64M work-items each using 1 kB of local memory would need 64 GB! Even software emulation would need hundreds or thousands of passes and would decrease performance to the level of a single CPU core).
Global sync (where all work groups synchronized) is not possible.
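The same fence-versus-barrier distinction exists in standard C++, which may make it easier to see: a fence only orders a single thread's own accesses; making another thread wait still requires an explicit flag and a spin loop. A minimal C++11 sketch (not OpenCL; data and flag are illustrative):
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> data{0};
std::atomic<int> flag{0};

int main() {
    std::thread producer([] {
        data.store(42, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release); // orders this thread's accesses...
        flag.store(1, std::memory_order_relaxed);            // ...but makes no other thread wait
    });
    std::thread consumer([] {
        while (flag.load(std::memory_order_relaxed) == 0) {} // waiting requires an explicit spin
        std::atomic_thread_fence(std::memory_order_acquire); // pairs with the release fence
        std::printf("data = %d\n", data.load(std::memory_order_relaxed)); // guaranteed 42
    });
    producer.join();
    consumer.join();
    return 0;
}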
Just in case work-item, work-group and processing element get mixed up, see:
OpenCL: Work items, Processing elements, NDRange
The atomic function you put there is already accessing global memory, so adding group-scope synchronization shouldn't be important.
Also check the machine code to see whether
bounds[child_idxs[0]].min[d]
pulls the whole bounds[child_idxs[0]] struct into private memory before accessing min[d]. If it does, you can separate min into an independent array and access its items directly, to get 100% more memory bandwidth for it.
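To illustrate the layout change in plain C++ (the kernel-side equivalent would be separate min/max component buffers passed as distinct arguments; the names here are hypothetical):
#include <vector>

// AoS: reading bounds[i].min[d] may pull the entire 24-byte struct in.
struct Bound { float min[3]; float max[3]; };

// SoA: each component is contiguous on its own, so a pass over min_x
// streams only the bytes it actually needs and coalesces across work-items.
struct BoundsSoA {
    std::vector<float> min_x, min_y, min_z;
    std::vector<float> max_x, max_y, max_z;
};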
A test on an Intel HD 400, with more than 100,000 threads:
__kernel void fenceTest( __global float *c,
__global int *ctr)
{
int id=get_global_id(0);
if(id<128000)
for(int i=0;i<20000;i++)
{
c[id]+=ctr[0];
mem_fence(CLK_GLOBAL_MEM_FENCE);
}
ctr[0]++;
}
2900ms (c array has garbage)
__kernel void fenceTest( __global float *c,
__global int *ctr)
{
int id=get_global_id(0);
if(id<128000)
for(int i=0;i<20000;i++)
{
c[id]+=ctr[0];
}
ctr[0]++;
}
500 ms (the c array has garbage) - roughly 6x the performance of the fence version (my laptop has single-channel 4 GB RAM at only 5-10 GB/s, but its iGPU's local memory reaches nearly 38 GB/s: 64 B per cycle at 600 MHz). The local-fence version takes 700 ms, so it seems the fenceless version isn't even touching cache or local memory for some iterations.
Without the loop it takes 8-9 ms, so I suppose the loop wasn't being optimized away in these kernels.
Edit:
int id=get_global_id(0);
if(id==0)
{
atom_inc(&ctr[0]);
mem_fence(CLK_GLOBAL_MEM_FENCE);
}
mem_fence(CLK_GLOBAL_MEM_FENCE);
c[id]+=ctr[0];
behaves exactly as
int id=get_global_id(0);
if(id==0)
{
ctr[0]++;
mem_fence(CLK_GLOBAL_MEM_FENCE);
}
mem_fence(CLK_GLOBAL_MEM_FENCE);
c[id]+=ctr[0];
on this Intel iGPU device (only by chance; it shows that the changed memory is visible to "all" trailing threads, but it doesn't prove this always happens (the first compute unit could hiccup and the second could start first, for example), and it is not atomic when more than a single thread accesses it).

Double buffering in CUDA so the CPU can operate on data produced by a persistent kernel

I have a Monte Carlo simulation in which the state of the system is a bit string (size N) with the bits being randomly flipped. In an effort to accelerate the simulation, the code was revised to use CUDA. However, because of the large number of statistics I need calculated from the system state (the count goes as N^2), this part needs to be done on the CPU, where there is more memory. Currently the algorithm looks like this:
loop
CUDA kernel making 10s of Monte Carlo steps
Copy system state back to CPU
Calculate statistics
This is inefficient and I would like to have the kernel run persistently while the CPU occasionally queries the state of the system and calculates the statistics while the kernel continues to run.
Based on Tom's answer to this question I think the answer is double buffering, but I haven't been able to find an explanation or example of how to do this.
How does one set up the double buffering described in the third paragraph of Tom's answer for a CUDA/C++ code?
Here's a fully worked example of a "persistent" kernel, producer-consumer approach, with a double-buffered interface from device (producer) to host (consumer).
Persistent kernel design generally implies launching kernels with, at most, the number of blocks that can be simultaneously resident on the hardware (see item 1 on slide 16 here). For the most efficient usage of the machine, we'd generally like to maximize this, while still staying within the aforementioned limit. This involves an occupancy study for a specific kernel, and it will vary from kernel to kernel. Therefore I've chosen to take a shortcut here, and simply launch as many blocks as there are multiprocessors. Such an approach is always guaranteed to work (it could be considered a "lower bound" on the number of blocks to launch for a persistent kernel), but is (typically) not the most efficient usage of the machine. Nevertheless, I claim the occupancy study is beside the point of your question. Furthermore, it is arguable that proper "persistent kernel" design with guaranteed forward progress is actually quite tricky - requiring careful design of the CUDA thread code and placement of threadblocks (e.g. only use 1 threadblock per SM) to guarantee forward progress. However we don't need to delve to this level to address your question (I don't think) and the persistent kernel example I propose here only places 1 threadblock per SM.
I'm also assuming a proper UVA setup, so that I can skip the details of arranging for proper mapped memory allocations in an non-UVA setup.
The basic idea is that we will have 2 buffers on the device, along with 2 "mailboxes" in mapped memory, one for each buffer. The device kernel will fill a buffer with data, then set the "mailbox" to a value (2, in this case) that indicates the host may "consume" the buffer. The device then goes on to the other buffer and repeats the process in a ping-pong fashion between buffers. In order to make this work we must make sure that the device itself has not overrun the buffers (no thread is allowed to be more than one buffer ahead of any other thread) and that before a buffer is populated by the device, the host has consumed the previous contents.
On the host side, the host simply waits for the mailbox to indicate "full", then copies the buffer from device to host, resets the mailbox, and performs the "processing" on it (the validate function). It then goes on to the next buffer in a ping-pong fashion. The actual data "production" by the device is just to fill each buffer with the iteration number. The host then checks to see that the proper iteration number was received.
I've structured the code to call out the actual device "work" function (my_compute_function) which is where you would put whatever your Monte Carlo code is. If your code is nicely thread-independent, this should be straightforward. Thus the device side my_compute_function is the producer function, and the host side validate is the consumer function. If your device producer code is not simply thread independent, then you may need to restructure things slightly around the calling point to my_compute_function.
The net effect of this is that the device can "race ahead" and begin filling the next buffer, while the host is "consuming" the data in the previous buffer.
Because persistent kernel design imposes an upper bound on the number of blocks (and threads) in a kernel launch, I've chosen to implement the "work" producer function in a grid-striding loop, so that arbitrary size buffers can be handled by the given grid-width.
Here's a fully worked example:
$ cat t942.cu
#include <stdio.h>
#define ITERS 1000
#define DSIZE 65536
#define nTPB 256
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
__device__ volatile int blkcnt1 = 0;
__device__ volatile int blkcnt2 = 0;
__device__ volatile int itercnt = 0;
__device__ void my_compute_function(int *buf, int idx, int data){
buf[idx] = data; // put your work code here
}
__global__ void testkernel(int *buffer1, int *buffer2, volatile int *buffer1_ready, volatile int *buffer2_ready, const int buffersize, const int iterations){
// assumption of persistent block-limited kernel launch
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int iter_count = 0;
while (iter_count < iterations ){ // persistent until iterations complete
int *buf = (iter_count & 1)? buffer2:buffer1; // ping pong between buffers
volatile int *bufrdy = (iter_count & 1)?(buffer2_ready):(buffer1_ready);
volatile int *blkcnt = (iter_count & 1)?(&blkcnt2):(&blkcnt1);
int my_idx = idx;
while (iter_count - itercnt > 1); // don't overrun buffers on device
while (*bufrdy == 2); // wait for buffer to be consumed
while (my_idx < buffersize){ // perform the "work"
my_compute_function(buf, my_idx, iter_count);
my_idx += gridDim.x*blockDim.x; // grid-striding loop
}
__syncthreads(); // wait for my block to finish
__threadfence(); // make sure global buffer writes are "visible"
if (!threadIdx.x) atomicAdd((int *)blkcnt, 1); // mark my block done
if (!idx){ // am I the master block/thread?
while (*blkcnt < gridDim.x); // wait for all blocks to finish
*blkcnt = 0;
*bufrdy = 2; // indicate that buffer is ready
__threadfence_system(); // push it out to mapped memory
itercnt++;
}
iter_count++;
}
}
int validate(const int *data, const int dsize, const int val){
for (int i = 0; i < dsize; i++) if (data[i] != val) {printf("mismatch at %d, was: %d, should be: %d\n", i, data[i], val); return 0;}
return 1;
}
int main(){
int *h_buf1, *d_buf1, *h_buf2, *d_buf2;
volatile int *m_bufrdy1, *m_bufrdy2;
// buffer and "mailbox" setup
cudaHostAlloc(&h_buf1, DSIZE*sizeof(int), cudaHostAllocDefault);
cudaHostAlloc(&h_buf2, DSIZE*sizeof(int), cudaHostAllocDefault);
cudaHostAlloc(&m_bufrdy1, sizeof(int), cudaHostAllocMapped);
cudaHostAlloc(&m_bufrdy2, sizeof(int), cudaHostAllocMapped);
cudaCheckErrors("cudaHostAlloc fail");
cudaMalloc(&d_buf1, DSIZE*sizeof(int));
cudaMalloc(&d_buf2, DSIZE*sizeof(int));
cudaCheckErrors("cudaMalloc fail");
cudaStream_t streamk, streamc;
cudaStreamCreate(&streamk);
cudaStreamCreate(&streamc);
cudaCheckErrors("cudaStreamCreate fail");
*m_bufrdy1 = 0;
*m_bufrdy2 = 0;
cudaMemset(d_buf1, 0xFF, DSIZE*sizeof(int));
cudaMemset(d_buf2, 0xFF, DSIZE*sizeof(int));
cudaCheckErrors("cudaMemset fail");
// inefficient crutch for choosing number of blocks
int nblock = 0;
cudaDeviceGetAttribute(&nblock, cudaDevAttrMultiProcessorCount, 0);
cudaCheckErrors("get multiprocessor count fail");
testkernel<<<nblock, nTPB, 0, streamk>>>(d_buf1, d_buf2, m_bufrdy1, m_bufrdy2, DSIZE, ITERS);
cudaCheckErrors("kernel launch fail");
volatile int *bufrdy;
int *hbuf, *dbuf;
for (int i = 0; i < ITERS; i++){
if (i & 1){ // ping pong on the host side
bufrdy = m_bufrdy2;
hbuf = h_buf2;
dbuf = d_buf2;}
else {
bufrdy = m_bufrdy1;
hbuf = h_buf1;
dbuf = d_buf1;}
// int qq = 0; // add for failsafe - otherwise a machine failure can hang
while ((*bufrdy)!= 2); // use this for a failsafe: if (++qq > 1000000) {printf("bufrdy = %d\n", *bufrdy); return 0;} // wait for buffer to be full;
cudaMemcpyAsync(hbuf, dbuf, DSIZE*sizeof(int), cudaMemcpyDeviceToHost, streamc);
cudaStreamSynchronize(streamc);
cudaCheckErrors("cudaMemcpyAsync fail");
*bufrdy = 0; // release buffer back to device
if (!validate(hbuf, DSIZE, i)) {printf("validation failure at iter %d\n", i); exit(1);}
}
printf("Completed %d iterations successfully\n", ITERS);
}
$ nvcc -o t942 t942.cu
$ ./t942
Completed 1000 iterations successfully
$
I've tested the above code and it seems to work well on Linux. I believe it should be OK on a Windows TCC setup. On Windows WDDM, however, I think there are issues that I am still investigating.
Note that the above kernel design attempts to do a grid-wide synchronization using a block-counting atomic strategy. CUDA now (9.0 and newer) has cooperative groups, and that is the recommended approach, rather than the above methodology, to create a grid-wide sync.
This isn't a direct answer to your question but it may be of help.
I am working with a CUDA producer-consumer code that appears to be similar in basic structure to yours. I was hoping to speed up the code by making the CPU and GPU run concurrently. I attempted this by restructuring the code this way:
Launch kernel
Copy data
Loop
Launch kernel
CPU work
Copy data
CPU work
This way the CPU can work on the data from the last kernel run while the next set of data is being generated. This cut 30% off the runtime of my code. I'm guessing it could get better if the GPU/CPU work can be balanced so they take roughly the same amount of time.
I am still launching the same kernel 1000s of times. If the overhead of launching a kernel repeatedly is significant, then looking for a way to accomplish what I have described with a single launch would be worth it. Otherwise this is probably the best (simplest) solution.
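For what it's worth, the restructuring can be sketched in plain C++ with the GPU side abstracted away; gpu_step and cpu_work below are hypothetical stand-ins for the kernel launch plus copy-back and for the statistics pass - the point is only the overlap:
#include <cstdio>
#include <future>
#include <vector>

// Hypothetical stand-in for "launch kernel + copy results back".
std::vector<int> gpu_step(int iter) {
    return std::vector<int>(1 << 20, iter);
}

// Hypothetical stand-in for "calculate statistics".
void cpu_work(const std::vector<int>& data) {
    long long sum = 0;
    for (int v : data) sum += v;
    std::printf("consumed batch, sum=%lld\n", sum);
}

int main() {
    auto pending = std::async(std::launch::async, gpu_step, 0); // prime the pipeline
    for (int i = 1; i < 10; ++i) {
        auto batch = pending.get();                             // results of step i-1
        pending = std::async(std::launch::async, gpu_step, i);  // step i runs...
        cpu_work(batch);                                        // ...while we consume i-1
    }
    cpu_work(pending.get());                                    // drain the last step
    return 0;
}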

What could cause a mutex to misbehave?

I've been busy the last couple of months debugging a rare crash caused somewhere within a very large proprietary C++ image processing library, compiled with GCC 4.7.2 for an ARM Cortex-A9 Linux target. Since a common symptom was glibc complaining about heap corruption, the first step was to employ a heap corruption checker to catch oob memory writes. I used the technique described in https://stackoverflow.com/a/17850402/3779334 to divert all calls to free/malloc to my own function, padding every allocated chunk of memory with some amount of known data to catch out-of-bounds writes - but found nothing, even when padding with as much as 1 KB before and after every single allocated block (there are hundreds of thousands of allocated blocks due to intensive use of STL containers, so I can't enlarge the padding further, plus I assume any write more than 1KB out of bounds would eventually trigger a segfault anyway). This bounds checker has found other problems in the past so I don't doubt its functionality.
(Before anyone says 'Valgrind', yes, I have tried that too with no results either.)
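For concreteness, a minimal sketch of such a padding wrapper, assuming the malloc/free diversion from the linked answer is already in place; PAD, the fill byte, and the header layout are arbitrary choices (and alignment is kept coarse for brevity):
#include <cassert>
#include <cstdlib>
#include <cstring>

constexpr std::size_t PAD = 1024;    // guard bytes before and after each block
constexpr unsigned char FILL = 0xCD; // known pattern to detect overwrites

// Layout: [user size][front guard][user data][back guard]
void* checked_malloc(std::size_t size) {
    auto* raw = static_cast<unsigned char*>(std::malloc(sizeof(std::size_t) + 2 * PAD + size));
    if (!raw) return nullptr;
    std::memcpy(raw, &size, sizeof size);                           // remember the user size
    std::memset(raw + sizeof(std::size_t), FILL, PAD);              // front guard
    std::memset(raw + sizeof(std::size_t) + PAD + size, FILL, PAD); // back guard
    return raw + sizeof(std::size_t) + PAD;
}

void checked_free(void* p) {
    auto* user = static_cast<unsigned char*>(p);
    auto* raw = user - PAD - sizeof(std::size_t);
    std::size_t size;
    std::memcpy(&size, raw, sizeof size);
    for (std::size_t i = 0; i < PAD; i++) {
        assert(raw[sizeof(std::size_t) + i] == FILL); // front guard intact? (underflow check)
        assert(user[size + i] == FILL);               // back guard intact? (overflow check)
    }
    std::free(raw);
}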
Now, my memory bounds checker also has a feature where it prepends every allocated block with a data struct. These structs are all linked in one long linked list, to allow me to occasionally go over all allocations and test memory integrity. For some reason, even though all manipulations of this list are mutex protected, the list was getting corrupted. When investigating the issue, it began to seem like the mutex itself was occasionally failing to do its job. Here is the pseudocode:
pthread_mutex_t alloc_mutex;
static bool boolmutex; // set to false during init. volatile has no effect.
void malloc_wrapper() {
// ...
pthread_mutex_lock(&alloc_mutex);
if (boolmutex) {
printf("mutex misbehaving\n");
__THROW_ERROR__; // this happens!
}
boolmutex = true;
// manipulate linked list here
boolmutex = false;
pthread_mutex_unlock(&alloc_mutex);
// ...
}
The code commented with "this happens!" is occasionally reached, even though this seems impossible. My first theory was that the mutex data structure was being overwritten. I placed the mutex within a struct, with large arrays before and after it, but when this problem occurred the arrays were untouched so nothing seems to be overwritten.
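For reference, a minimal sketch of that guard arrangement, assuming pthreads; the guard size and fill pattern are arbitrary:
#include <cassert>
#include <cstring>
#include <pthread.h>

// If something tramples memory near the mutex, the guard bytes change
// and check_guards() will catch it.
struct GuardedMutex {
    unsigned char guard_before[1024];
    pthread_mutex_t mutex;
    unsigned char guard_after[1024];

    GuardedMutex() {
        std::memset(guard_before, 0xAB, sizeof guard_before);
        std::memset(guard_after, 0xAB, sizeof guard_after);
        pthread_mutex_init(&mutex, nullptr);
    }
    void check_guards() const {
        for (unsigned char c : guard_before) assert(c == 0xAB);
        for (unsigned char c : guard_after) assert(c == 0xAB);
    }
};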
So.. What kind of corruption could possibly cause this to happen, and how would I find and fix the cause?
A few more notes. The test program uses 3-4 threads for processing. Running with fewer threads seems to make the corruptions less common, but they don't disappear. The test runs for about 20 seconds each time and completes successfully in the vast majority of cases (I can have 10 units repeating the test, with the first failure occurring after 5 minutes to several hours). When the problem occurs it is quite late in the test (say, 15 seconds in), so this isn't a bad initialization issue. The memory bounds checker never catches actual out-of-bounds writes, but glibc still occasionally fails with a corrupted heap error (can such an error be caused by something other than an oob write?). Each failure generates a core dump with plenty of trace information; there is no pattern I can see in these dumps, no particular section of code that shows up more than others. This problem seems very specific to a particular family of algorithms and does not happen in other algorithms, so I'm quite certain this isn't a sporadic hardware or memory error. I have done many more tests to check for oob heap accesses which I don't want to list, to keep this post from getting any longer.
Thanks in advance for any help!
Thanks to all commenters. I had tried nearly all suggestions with no results when I finally decided to write a simple memory allocation stress test - one that would run a thread on each of the CPU cores (my unit is a Freescale i.MX6 quad-core SoC), each allocating and freeing memory in random order at high speed. The test crashed with a glibc memory corruption error within minutes, or a few hours at most.
Updating the kernel from 3.0.35 to 3.0.101 solved the problem; both the stress test and the image processing algorithm now run overnight without failing. The problem does not reproduce on Intel machines with the same kernel version, so the problem is specific either to ARM in general or perhaps to some patch Freescale included with the specific BSP version that included kernel 3.0.35.
For those curious, attached is the stress test source code. Set NUM_THREADS to the number of CPU cores and build with:
<cross-compiler-prefix>g++ -O3 test_heap.cpp -lpthread -o test_heap
I hope this information helps someone. Cheers :)
// Multithreaded heap stress test. By Itay Chamiel 20151012.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <pthread.h>
#include <sys/time.h>
#define NUM_THREADS 4 // set to number of CPU cores
#define ALIVE_INDICATOR NUM_THREADS
// Each thread constantly allocates and frees memory. In each iteration of the infinite loop, decide at random whether to
// allocate or free a block of memory. A list of 500-1000 allocated blocks is maintained by each thread. When memory is allocated
// it is added to this list; when freeing, a random block is selected from this list, freed and removed from the list.
void* thr(void* arg) {
int* alive_flag = (int*)arg;
int thread_id = *alive_flag; // this is a number between 0 and (NUM_THREADS-1) given by main()
int cnt = 0;
timeval t_pre, t_post;
gettimeofday(&t_pre, NULL);
const int ALLOCATE=1, FREE=0;
const unsigned int MINSIZE=500, MAXSIZE=1000;
const int MAX_ALLOC=10000;
char* membufs[MAXSIZE];
unsigned int membufs_size = 0;
int num_allocs = 0, num_frees = 0;
while(1)
{
int action;
// Decide whether to allocate or free a memory block.
// if we have less than MINSIZE buffers, allocate.
if (membufs_size < MINSIZE) action = ALLOCATE;
// if we have MAXSIZE, free.
else if (membufs_size >= MAXSIZE) action = FREE;
// else, decide randomly.
else {
action = ((rand() & 0x1)? ALLOCATE : FREE);
}
if (action == ALLOCATE) {
// choose size to allocate, from 1 to MAX_ALLOC bytes
size_t size = (rand() % MAX_ALLOC) + 1;
// allocate and fill memory
char* buf = (char*)malloc(size);
memset(buf, 0x77, size);
// add buffer to list
membufs[membufs_size] = buf;
membufs_size++;
assert(membufs_size <= MAXSIZE);
num_allocs++;
}
else { // action == FREE
// choose a random buffer to free
size_t pos = rand() % membufs_size;
assert (pos < membufs_size);
// free and remove from list by replacing entry with last member
free(membufs[pos]);
membufs[pos] = membufs[membufs_size-1];
membufs_size--;
assert(membufs_size >= 0);
num_frees++;
}
// once in 10 seconds print a status update
gettimeofday(&t_post, NULL);
if (t_post.tv_sec - t_pre.tv_sec >= 10) {
printf("Thread %d [%d] - %d allocs %d frees. Alloced blocks %u.\n", thread_id, cnt++, num_allocs, num_frees, membufs_size);
gettimeofday(&t_pre, NULL);
}
// indicate alive to main thread
*alive_flag = ALIVE_INDICATOR;
}
return NULL;
}
int main()
{
int alive_flag[NUM_THREADS];
printf("Memory allocation stress test running on %d threads.\n", NUM_THREADS);
// start a thread for each core
for (int i=0; i<NUM_THREADS; i++) {
alive_flag[i] = i; // tell each thread its ID.
pthread_t th;
int ret = pthread_create(&th, NULL, thr, &alive_flag[i]);
assert(ret == 0);
}
while(1) {
sleep(10);
// check that all threads are alive
bool ok = true;
for (int i=0; i<NUM_THREADS; i++) {
if (alive_flag[i] != ALIVE_INDICATOR)
{
printf("Thread %d is not responding\n", i);
ok = false;
}
}
assert(ok);
for (int i=0; i<NUM_THREADS; i++)
alive_flag[i] = 0;
}
return 0;
}

Loop Around File Mapping Kills Performance

I have a circular buffer which is backed with file mapped memory (the buffer is in the size range of 8GB-512GB).
I am writing to (8 instances of) this memory in a sequential manner from the beginning to the end, at which point it loops back around to the beginning.
It works fine until it reaches the end where it needs to perform two file mappings and loop around the memory, at which point IO performance is totally trashed and doesn't recover (even after several minutes). I can't quite figure it out.
using namespace boost::interprocess;
class mapping
{
public:
mapping()
{
}
mapping(file_mapping& file, mode_t mode, std::size_t file_size, std::size_t offset, std::size_t size)
: offset_(offset)
, mode_(mode)
{
const auto aligned_size = page_ceil(size + page_size());
const auto aligned_file_size = page_floor(file_size);
const auto aligned_file_offset = page_floor(offset % aligned_file_size);
const auto region1_size = std::min(aligned_size, aligned_file_size - aligned_file_offset);
const auto region2_size = aligned_size - region1_size;
if (region2_size)
{
const auto region1_address = mapped_region(file, read_only, 0, (region1_size + region2_size) * 2).get_address();
const auto region2_address = reinterpret_cast<char*>(region1_address) + region1_size;
region1_ = mapped_region(file, mode, aligned_file_offset, region1_size, region1_address);
region2_ = mapped_region(file, mode, 0, region2_size, region2_address);
}
else
{
region1_ = mapped_region(file, mode, aligned_file_offset, region1_size);
region2_ = mapped_region();
}
size_ = region1_.get_size() + region2_.get_size();
offset_ = aligned_file_offset;
}
auto offset() const -> std::size_t { return offset_; }
auto size() const -> std::size_t { return size_; }
auto data() const -> const void* { return region1_.get_address(); }
auto data() -> void* { return region1_.get_address(); }
auto flush(bool async = true) -> void
{
region1_.flush(async);
region2_.flush(async);
}
auto mode() const -> mode_t { return mode_; }
private:
std::size_t offset_ = 0;
std::size_t size_ = 0;
mode_t mode_;
mapped_region region1_;
mapped_region region2_;
};
struct loop_mapping::impl final
{
std::tr2::sys::path file_path_;
file_mapping file_mapping_;
std::size_t file_size_;
std::size_t map_size_ = page_floor(256000000ULL);
std::shared_ptr<mapping> mapping_ = std::shared_ptr<mapping>(new mapping());
std::shared_ptr<mapping> prev_mapping_;
bool write_;
public:
impl(std::tr2::sys::path path, bool write)
: file_path_(std::move(path))
, file_mapping_(file_path_.string().c_str(), write ? read_write : read_only)
, file_size_(page_floor(std::tr2::sys::file_size(file_path_)))
, write_(write)
{
REQUIRE(file_size_ >= map_size_ * 3);
}
~impl()
{
prev_mapping_.reset();
mapping_.reset();
}
auto data(std::size_t offset, std::size_t size, boost::optional<bool> write_opt) -> void*
{
offset = offset % page_floor(file_size_);
REQUIRE(size < file_size_ - map_size_ * 3);
const auto write = write_opt.get_value_or(write_);
REQUIRE(!write || write_);
if ((write && mapping_->mode() == read_only) || offset < mapping_->offset() || offset + size >= mapping_->offset() + mapping_->size())
{
auto new_mapping = std::make_shared<loop::mapping>(file_mapping_, write ? read_write : read_only, file_size_, page_floor(offset), std::max(size + page_size(), map_size_));
if (mapping_)
mapping_->flush((new_mapping->offset() % file_size_) < (mapping_->offset() % file_size_));
if (prev_mapping_)
prev_mapping_->flush(false);
prev_mapping_ = std::move(mapping_);
mapping_ = std::move(new_mapping);
}
return reinterpret_cast<char*>(mapping_->data()) + offset - mapping_->offset();
}
}
-
// 8 processes to 8 different files 128GB each.
loop_mapping loop(...);
for (auto n = 0; true; ++n)
{
auto src = get_new_data(5000000/8);
auto dst = loop.data(n * 5000000/8, 5000000/8, true);
std::memcpy(dst, src, 5000000/8); // This becomes very slow after loop around.
std::this_thread::sleep_for(std::chrono::seconds(1));
}
Any ideas?
Target System:
1x 3TB Seagate Constellation ES.3
2x Xeon E5-2400 (6 core, 2.6Ghz)
6x 8GB DDR3 1600Mhz ECC
Windows Server 2012
Eight buffers, each 8 to 512 GiB in size, on a system with 48 GiB of physical memory means that your mappings will have to be swapped. No surprise there.
The issue, as you have already remarked yourself, is that prior to being able to write to a page, you encounter a fault, and the page is read in. That doesn't happen on the first run, since a zero page is simply used. To make matters worse, reading pages back in competes with the write-behind of dirty pages.
Now, there is unfortunately no way of telling Windows "I'm going to overwrite this anyway", nor is there any way of making the disk load your stuff faster. However, you can start the transfer earlier (maybe when you're 3/4 of the way through the buffer).
Windows Server 2012 (which you're using) supports PrefetchVirtualMemory which is a somewhat half-assed substitute for POSIX madvise(MADV_WILLNEED).
That is, of course, not exactly what you want to do when you already know that you will overwrite the complete memory page (or several of them) anyway, but it is as good as you can get. It's worth a try in any case.
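A minimal sketch of such a prefetch, assuming Windows 8 / Server 2012 or later; prefetch_next is an illustrative name, and view/view_size stand for the next chunk of your mapping:
#include <windows.h>

// Ask the memory manager to start paging in the next chunk of the mapped
// view before the writer reaches it, so the read happens in the background.
void prefetch_next(void* view, SIZE_T view_size)
{
    WIN32_MEMORY_RANGE_ENTRY range;
    range.VirtualAddress = view;
    range.NumberOfBytes = view_size;
    PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0);
}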
Ideally, you would want to do something like a destructive madvise(MADV_DONTNEED) as implemented e.g. under Linux (and I believe FreeBSD, too) immediately before you overwrite a page, but I am not aware of any way of doing this under Windows (...short of destroying the view and the mapping and mapping from scratch, but then you throw away all data, so that's a bit useless).
Even with prefetching early you will still be limited by disk I/O bandwidth, but at least you can hide the latency.
Another "obvious" (but probably not that easy) solution would be to make the consumer faster. That would allow for a smaller buffer to begin with, and even on a huge buffer it would keep the working set smaller (both producer and consumer force pages into RAM while accessing them, so if the consumer accesses data with less delay after the producer has written them, they will both be using mostly the same set of pages.) Smaller working sets fit into RAM more easily.
But I realize that you probably didn't choose a several-gigabyte buffer for no reason.
Since your code is devoid of any comments, filled with auto variables, not compilable as is, and I don't have 512 GB available on my PC to test it anyway, this will remain a passing thought off the top of my head.
Each of your processes only writes a few hundred KB/s, so there should be ample time to flush that to disk in the background.
However, it seems you are asking the boost mapping system to flush the previous chunk either synchronously or asynchronously, depending on your mysterious offset computations:
mapping_->flush((new_mapping->offset() % file_size_) < (mapping_->offset() % file_size_));
I guess the rollover triggers a synchronous flush, which is a likely culprit for the sudden slowdown.
What the operating system does at this point depends on the boost implementation, which is not described (or at least not in a way obvious enough for me to get after a cursory look at their man page).
If boost stuffed your 48 GB of memory with unflushed pages, you could certainly experience a sudden and prolonged deceleration.
At the very least, this mysterious line is worth a comment in your code, in case it does something clever and completely different that I missed entirely.
If you are able to back the memory mapping with the page file rather than a specific file, you can use the MEM_RESET flag with VirtualAlloc to prevent Windows from paging in the old contents.
The main issue I would anticipate in using this approach is that you can't easily recover the disk space when you are done. It may also require the system's page file settings to be changed; I believe it will work with the default settings, but not if a maximum page file size has been set.
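A sketch of that idea, assuming the view is pagefile-backed as described; discard_region is an illustrative name:
#include <windows.h>

// MEM_RESET tells the memory manager the current contents of these pages
// are garbage: they may be discarded instead of being written out or read
// back in. The protection argument must be valid but is otherwise ignored.
void discard_region(void* ptr, SIZE_T region_size)
{
    VirtualAlloc(ptr, region_size, MEM_RESET, PAGE_READWRITE);
}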
I am going to assume that by "loop around" you mean that the RAM got full.
What happens is that until the RAM gets full, all you have to do is allocate a page and write to it (RAM speed); after the RAM gets full, every page allocation turns into two actions:
1. writing the dirty page back (DISK speed), and
2. allocating a page (RAM speed).
And in the worst case, you also have to bring the page in from the file on disk (DISK speed) if you are reading something from it.
So instead of working only at RAM speed (page allocation), every page allocation runs at DISK speed.
This doesn't happen with 2x8 GB because that is small enough for all of the memory of both files to remain fully in RAM.
The problem here, it turns out, is that when overwriting a valid page in memory, the page first has to be read from the drive before being overwritten. There is no way to get around this issue, as far as I know, when using memory-mapped files.
The reason it doesn't happen during the first pass is that the pages being overwritten are not "valid" and thus they do not need to be read back.

How do you calculate memory access time?

I create a large boolean 2D array (5000x5000, for a total of 25 million elements at 23MB). Then I loop through and instantiate every element with a random true or false. Then I loop through and read every single element. All 25 million elements are read in ~100ms.
23MB is too big to fit in the CPU's cache and I think my program is too simple to benefit from any type of compiler optimization so am I right to conclude that the program is reading 25 million elements from RAM in ~100ms?
#include "stdafx.h"
#include <iostream>
#include <chrono>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
bool **locs;
locs = new bool*[5000];
for(int i = 0; i < 5000; i++)
locs[i] = new bool[5000];
for(int i = 0; i < 5000; i++)
for(int i2 = 0; i2 < 5000; i2++)
locs[i][i2] = rand() % 2 == 0 ? true : false;
int *idx = new int [5000*5000];
for(int i = 0; i < 5000*5000; i++)
*(idx + i) = rand() % 4999;
bool val;
int memAccesses = 0;
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < 5000*5000; i++) {
val = locs[*(idx + i)][*(idx + ++i)];
memAccesses += 2;
}
auto finish = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(finish-start).count() << " ns\n";
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(finish-start).count() << " ms\n";
cout << "TOTAL MEMORY ACCESSES: " << memAccesses << endl;
cout << "The size of the array in memory is " << ((sizeof(bool)*5000*5000)/1048576) << "MB";
int exit; cin >> exit;
return 0;
}
/*
OUTPUT IS:
137013700 ns
137 ms
TOTAL MEMORY ACCESSES: 25000000
The size of the array in memory is 23MB
*/
As other answers have mentioned, the "speed" you are seeing (even if the CPU is executing your code and it is not stripped by the compiler) is about 250 MB/s, which is a very, very low number for modern systems.
However, your methodology seems flawed to me (admittedly, I'm not an expert in benchmarking). Here are the problems I see:
For any benchmark such as this, even in the simplest form, you need to distinguish random access from sequential access. Memory is not a random-access device (despite its name) and performs very poorly when accessed randomly. Your code seems to be accessing memory randomly, so you should add that qualifier to your conclusion: that you are "reading 25 million elements from random locations in RAM in ~100ms" (see the sketch at the end of this answer).
Another aspect of this sort of benchmark is the concept of latency vs. throughput. Again, if you want to conclude anything from your numbers and timings, you need to be aware of what exactly you are measuring.
You are counting memory accesses incorrectly. Depending on the exact code your compiler generates, this line:
val = locs[*(idx + i)][*(idx + ++i)];
might realistically access the memory system anywhere from 4 to 9 times.
At best, if i, idx, locs and val are all either in registers or access to them is eliminated, then you need to read *(idx + i), read locs[*(idx + i)] (remember that locs is an array of pointers to arrays, not a 2D array,) read *(idx + ++i), and finally read locs[*(idx + i)][*(idx + ++i)]. A few of these might be cached, but it's unlikely, with the cache-thrashing that's going on.
At worst, in addition to the above, you need two accesses for ++i (read, then write back,) one for idx, one for locs and one for val. I don't know, you might even need another read for the single i and/or two reads for the two idx occurrences (due to pointer aliasing and whatnot.)
You need to be aware that memory is never accessed in single bytes or even words. Memory is always read and written in units of cache-line. And cache line size can be different from system to system, although the most common size these days is 64 bytes. So, each time you read a memory location that is not in the cache, you are loading 64-bytes (or more) from RAM. If the memory locations you are reading are at the cache line boundary (some of the bytes in one cache line and some in the next) then you are loading two cache lines from RAM. Given a sane compiler and properly aligned variables in memory, this doesn't happen very often, but it might. So you have to at least multiply your calculated bandwidth used by the size of your cache line.
However, if you are accessing a memory location that is already in cache, then you don't load anything from RAM. You need to consider this in your conclusions too.
You also need to consider cache line eviction, your cache's associativity, number of levels, the fact that some cache levels are shared between instructions and data and some aren't, some are shared between cores and some aren't, and a lot of other things when evaluating the performance of caches and memory.
The DRAM chips also have a lot of weird and complex behaviors and characteristics. Some memory locations are faster to read after some others (due to the arrangements of rows and columns,) some accesses might get delayed a long time (at CPU speeds) because of the refresh cycle, other devices might be using the RAM or the bus that RAM is on, etc., etc. I'm far from familiar with the operations of modern memory chips, and even I know that it's a complete mess.
You have to consider the effects of compiler optimization on your code. This means that you have to look at your code after the compiler is done with it, in assembly form. You need to look at the generated assembly to be able to know what your code is actually doing: whether and which of your memory accesses are optimized out.
All in all, I don't think that you can conclude much useful information from your program. Sorry about that, but memory is very complex!
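To make the random-versus-sequential point from the first item concrete, here is a small self-contained sketch; it is illustrative, not a rigorous benchmark:
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t N = 25000000;
    std::vector<char> data(N, 1);
    std::vector<std::size_t> idx(N);
    std::iota(idx.begin(), idx.end(), std::size_t{0}); // sequential order first
    auto run = [&](const char* label) {
        auto t0 = std::chrono::high_resolution_clock::now();
        long long sum = 0;
        for (std::size_t i : idx) sum += data[i];      // same reads, order decided by idx
        auto t1 = std::chrono::high_resolution_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s: %lld ms (sum=%lld)\n", label, (long long)ms, sum); // sum defeats dead-code elimination
    };
    run("sequential");
    std::shuffle(idx.begin(), idx.end(), std::mt19937{42}); // randomize the access order
    run("random");
    return 0;
}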
Portions (blocks) of memory are stored in the processor cache at a time, which allows the processor to quickly access those items. However, that speed is perfectly reasonable for modern memory. Even the slowest DDR3 RAM can transfer data at about 6 GB/s.
Cache usage is independent of a program's complexity. Whenever data is read from RAM, it goes into the cache. Since the cache has a certain size, there's always that amount of data available. If you access a memory location next to the previous one, there is a good chance it will already be cached, in which case RAM is not accessed.
I would suggest reading CPU cache wikipedia entry to broaden your knowledge.
BTW: val = locs[*(idx + i)][*(idx + ++i)]; - are you certain that this is evaluated from left to right? I am not. This is undefined behavior. I'd suggest moving the ++i below the accessor line.
//EDIT:
There is nothing done with the value read from memory, so it is quite possible that these reads are not executed at all! Check the generated machine code, or add a (void)val; statement, which should force the read to be generated.
No. The reads won't always go all the way down to the RAM. Blocks of memory get pulled into the cache when a read (or write) is performed. As long as the block from which you are reading is already in the cache, the cache is used. If you request data from a block that is not in the cache, then the RAM is accessed to fetch the block of memory and place it in the cache. Reading from the cache is significantly cheaper than reading from RAM.
EDIT
Again, write operations cause blocks of memory to get pulled into the cache. Because you are storing the values in your program before reading them, the data you are reading is most likely already in the cache from when you stored it. Therefore, it is likely that your loop that reads the values never needs to access RAM.