Minimize memory usage with Boost's file_mapping and mapped_region?

Minimize memory usage with Boost's file_mapping and mapped_region? - c++

For this problem I am loading a large three-dimensional volume from file into a program, but only need to look at three planes (x,y,z) at a time usually. I am currently using Boost::Interprocess::File_Mapping to create a map of the file (32 GB) and loading it onto my system which has 24 GB of RAM. The current method uses a single Boost::Interprocess::Mapped_Region for the file. The memory usage quickly approaches 99%.
I am new to the world of memory mapped file i/o and want to know how best to segment the file to reduce the amount of memory usage. Would creating reduced sized regions (each Z plane for instance) improve the results? I would like to use as little memory as possible without causing adverse effects.
Am I going about this the correct way, or is there a more straightforward method for performing this?

On Windows, it normally works OK. I've created a test application (sorry I hate boost because I think its quality is appaling, my sample uses ATL instead, but underlying Windows API are the same):
HRESULT TestMain( LPCTSTR strFileName )
{
CAtlFile file;
HRESULT hr = file.Create( strFileName, GENERIC_READ, FILE_SHARE_READ, OPEN_EXISTING );
if( FAILED( hr ) )
return hr;
CAtlFileMapping<BYTE> mapping;
hr = mapping.MapFile( file );
if( FAILED( hr ) )
return hr;
size_t sz = mapping.GetMappingSize();
BYTE res = 0;
for( size_t i = 0; i < sz; i++ )
res ^= mapping[ i ];
printf( "Read the complete file, %Iu bytes, the XOR is %.2X\n", sz, int( res ) );
return S_OK;
}
When asked to read a 12GB file on my machine with 8GB RAM, I saw the effect you're describing (resource monitor memory data for my process: commit 25 MB, private 20 MB, working set and shareable 6.5 GB which is amount of my free RAM). However, multiple sources on the Internets say those numbers mean nothing and don't affect performance, because unused physical pages will be discarded as soon as any process requests more memory, and this process is very cheap (unless of course you're writing to your memory mapped file).
Or, if you're really unhappy about this behavior, you can free unused portions yourself, by calling VirtualUnlock, as described here: https://stackoverflow.com/a/1882478/126995
Or, you can only map the portions of the file you need.
But the best you can do about it - optimize the layout of your data. If in your data file you're keeping voxels as double voxels[x][y][z], store them as struct { double voxels[8][8][8] } blocks[x/8][y/8][z/8] instead. This way, the block size is exactly 4kb which is a page size, and if you only need to access e.g. XZ plane, you'll save a lot of I/O bandwidth, by orders of magnitude. Just don't mess up with misalignment, i.e. if you have a header before your data, make sure the size of the header is 4kb*n where n is integer.

Related

Using memcpy on mmap'ed region crashes, a for loop does not

I have an NVIDIA Tegra TK1 processor module on a carrier board with a PCI-e slot connecting to it. In that PCIe slot is an FPGA board which exposes some registers and a 64K memory area via PCIe.
On the ARM CPU of the Tegra board, a minimal Linux installation is running.
I am using /dev/mem and the mmap function to obtain user-space pointers to the register structs and the 64K memory area.
The distinct register files and the memory block are all assigned addresses which are aligned and do not overlap with regards to 4KB memory pages.
I explicitly map whole pages with mmap, using the result of getpagesize(), which also is 4096.
I can read/write from/to those exposed registers just fine.
I can read from the memory area (64KB), doing uint32 word-by-word reads in a for loop, just fine. I.e. read contents are correct.
But if I use std::memcpy on the same address range, though, the Tegra CPU freezes, always. I do not see any error message, if GDB is attached I also don't see a thing in Eclipse when trying to step over the memcpy line, it just stops hard. And I have to reset the CPU using the hardware reset button, as the remote console is frozen.
This is debug build with no optimization (-O0), using gcc-linaro-6.3.1-2017.05-i686-mingw32_arm-linux-gnueabihf. I was told the 64K region is accessible byte-wise, I did not try that explicitly.
Is there an actual (potential) problem that I need to worry about, or is there a specific reason why memcpy does not work and maybe should not be used in the first place in this scenario - and I can just carry on using my for loops and think nothing of it?
EDIT: Another effect has been observed: The original code snippet was missing a "vital" printf in the copying for loop, that came before the memory read. That removed, I don't get back valid data. I now updated the code snippet to have an extra read from the same address instead of the printf, which also yields correct data. The confusion intensifies.
Here the (I think) important excerpts of what's going on. With minor modifications, to make sense as shown, in this "de-fluffed" form.
// void* physicalAddr: PCIe "BAR0" address as reported by dmesg, added to the physical address offset of FPGA memory region
// long size: size of the physical region to be mapped
//--------------------------------
// doing the memory mapping
//
const uint32_t pageSize = getpagesize();
assert( IsPowerOfTwo( pageSize ) );
const uint32_t physAddrNum = (uint32_t) physicalAddr;
const uint32_t offsetInPage = physAddrNum & (pageSize - 1);
const uint32_t firstMappedPageIdx = physAddrNum / pageSize;
const uint32_t lastMappedPageIdx = (physAddrNum + size - 1) / pageSize;
const uint32_t mappedPagesCount = 1 + lastMappedPageIdx - firstMappedPageIdx;
const uint32_t mappedSize = mappedPagesCount * pageSize;
const off_t targetOffset = physAddrNum & ~(off_t)(pageSize - 1);
m_fileID = open( "/dev/mem", O_RDWR | O_SYNC );
// addr passed as null means: we supply pages to map. Supplying non-null addr would mean, Linux takes it as a "hint" where to place.
void* mapAtPageStart = mmap( 0, mappedSize, PROT_READ | PROT_WRITE, MAP_SHARED, m_fileID, targetOffset );
if (MAP_FAILED != mapAtPageStart)
{
m_userSpaceMappedAddr = (volatile void*) ( uint32_t(mapAtPageStart) + offsetInPage );
}
//--------------------------------
// Accessing the mapped memory
//
//void* m_rawData: <== m_userSpaceMappedAddr
//uint32_t* destination: points to a stack object
//int length: size in 32bit words of the stack object (a struct with only U32's in it)
// this crashes:
std::memcpy( destination, m_rawData, length * sizeof(uint32_t) );
// this does not, AND does yield correct memory contents - but only with a preceding extra read
for (int i=0; i<length; ++i)
{
// This extra read makes the data gotten in the 2nd read below valid.
// Commented out, the data read into destination will not be valid.
uint32_t tmp = ((const volatile uint32_t*)m_rawData)[i];
(void)tmp; //pacify compiler
destination[i] = ((const volatile uint32_t*)m_rawData)[i];
}

Based on the description, it looks like your FPGA code is not responding correctly to load instructions that are reading from locations on your FPGA and it is causing the CPU to lock up. It's not crashing it is permanently stalled, hence the need for the hard reset. I had this problem also when debugging my PCIE logic on an FPGA.
Another indication that your logic is not responding correctly is that you need an extra read in order to get the right responses.
Your loop is doing 32-bit loads but memcpy is doing at least 64-bit loads, which changes how your logic responds. For example, it will need to use two TLPs with 32 bits of response if the first 128 bits of the completion and the next 32 bits in the second 128 bit TLP of the completion.
What I found super-useful was to add logic to log all the PCIE transactions into an SRAM and to be able to dump the SRAM out to see how the logic was behaving or misbehaving. We have a nifty utility, pcieflat, that prints one PCIE TLP per line. It even has documentation.
When the PCIE interface is not working well enough, I stream the log to a UART in hex which can be decoded by pcieflat.
This tool is also useful for debugging performance problems -- you can look at how well your DMA reads and writes are pipelined.
Alternatively, if you have integrated logic analyzer or similar on the FPGA, you can trace the activity that way. But it's nicer to have the TLPs parsed according to PCIE protocol.

Why is my C++ disk write test much slower than a simply file copy using bash?

Using below program I try to test how fast I can write to disk using std::ofstream.
I achieve around 300 MiB/s when writing a 1 GiB file.
However, a simple file copy using the cp command is at least twice as fast.
Is my program hitting the hardware limit or can it be made faster?
#include <chrono>
#include <iostream>
#include <fstream>
char payload[1000 * 1000]; // 1 MB
void test(int MB)
{
// Configure buffer
char buffer[32 * 1000];
std::ofstream of("test.file");
of.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
auto start_time = std::chrono::steady_clock::now();
// Write a total of 1 GB
for (auto i = 0; i != MB; ++i)
{
of.write(payload, sizeof(payload));
}
double elapsed_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - start_time).count();
double megabytes_per_ns = 1e3 / elapsed_ns;
double megabytes_per_s = 1e9 * megabytes_per_ns;
std::cout << "Payload=" << MB << "MB Speed=" << megabytes_per_s << "MB/s" << std::endl;
}
int main()
{
for (auto i = 1; i <= 10; ++i)
{
test(i * 100);
}
}
Output:
Payload=100MB Speed=3792.06MB/s
Payload=200MB Speed=1790.41MB/s
Payload=300MB Speed=1204.66MB/s
Payload=400MB Speed=910.37MB/s
Payload=500MB Speed=722.704MB/s
Payload=600MB Speed=579.914MB/s
Payload=700MB Speed=499.281MB/s
Payload=800MB Speed=462.131MB/s
Payload=900MB Speed=411.414MB/s
Payload=1000MB Speed=364.613MB/s
Update
I changed from std::ofstream to fwrite:
#include <chrono>
#include <cstdio>
#include <iostream>
char payload[1024 * 1024]; // 1 MiB
void test(int number_of_megabytes)
{
FILE* file = fopen("test.file", "w");
auto start_time = std::chrono::steady_clock::now();
// Write a total of 1 GB
for (auto i = 0; i != number_of_megabytes; ++i)
{
fwrite(payload, 1, sizeof(payload), file );
}
fclose(file); // TODO: RAII
double elapsed_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - start_time).count();
double megabytes_per_ns = 1e3 / elapsed_ns;
double megabytes_per_s = 1e9 * megabytes_per_ns;
std::cout << "Size=" << number_of_megabytes << "MiB Duration=" << long(0.5 + 100 * elapsed_ns/1e9)/100.0 << "s Speed=" << megabytes_per_s << "MiB/s" << std::endl;
}
int main()
{
test(256);
test(512);
test(1024);
test(1024);
}
Which improves the speed to 668MiB/s for a 1 GiB file:
Size=256MiB Duration=0.4s Speed=2524.66MiB/s
Size=512MiB Duration=0.79s Speed=1262.41MiB/s
Size=1024MiB Duration=1.5s Speed=664.521MiB/s
Size=1024MiB Duration=1.5s Speed=668.85MiB/s
Which is just as fast as dd:
time dd if=/dev/zero of=test.file bs=1024 count=0 seek=1048576
real 0m1.539s
user 0m0.001s
sys 0m0.344s

First, you're not really measuring the disk writing speed, but (partly) the speed of writing data to the OS disk cache. To really measure the disk writing speed, the data should be flushed to disk before calculating the time. Without flushing there could be a difference depending on the file size and the available memory.
There seems to be something wrong in the calculations too. You're not using the value of MB.
Also make sure the buffer size is a power of two, or at least a multiple of the disk page size (4096 bytes): char buffer[32 * 1024];. You might as well do that for payload too. (looks like you changed that from 1024 to 1000 in an edit where you added the calculations).
Do not use streams to write a (binary) buffer of data to disk, but instead write directly to the file, using FILE*, fopen(), fwrite(), fclose(). See this answer for an example and some timings.
To copy a file: open the source file in read-only and, if possible, forward-only mode, and using fread(), fwrite():
while fread() from source to buffer
fwrite() buffer to destination file
This should give you a speed comparable to the speed of an OS file copy (you might want to test some different buffer sizes).
This might be slightly faster using memory mapping:
open src, create memory mapping over the file
open/create dest, set file size to size of src, create memory mapping over the file
memcpy() src to dest
For large files smaller mapped views should be used.

Streams are slow
cp uses syscalls directly read(2) or mmap(2).

I'd wager that it's something clever inside either CP or the filesystem. If it's inside CP then it might be that the file that you are copying has a lot of 0s in it and cp is detecting this and writing a sparse version of your file. The man page for cp says "By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well." This could mean a few things but one of them is that cp could make a sparse version of your file which would require less disk write time.
If it's within your filesystem then it might be Deduplication.
As a long-shot 3rd, it might also be something within your OS or your disk firmware that is translating the read and write into some specialized instruction that doesn't require as much synchronization as your program requires (lower bus use means less latency).

You're using a relatively small buffer size. Small buffers mean more operations per second, which increases overhead. Disk systems have a small amount of latency before they receive the read/write request and begin processing it; a larger buffer amortizes that cost a little better. A smaller buffer may also mean that the disk is spending more time seeking.
You're not issuing multiple simultaneous requests - you require one read to finish before the next starts. This means that the disk may have dead time where it is doing nothing. Since all writes depend on all reads, and your reads are serial, you're starving the disk system of read requests (doubly so, since writes will take away from reads).
The total of requested read bytes across all read requests should be larger than the bandwidth-delay product of the disk system. If the disk has 0.5 ms delay and a 4 GB/sec performance, then you want to have 4 GB * 0.5 ms = 2 MB worth of reads outstanding at all times.
You're not using any of the operating system's hints that you're doing sequential reading.
To fix this:
Change your code to have more than one outstanding read request at all times.
Have enough read requests outstanding such that you're waiting on at least 2 MBs worth of data.
Use the posix_fadvise() flags to help the OS disk schedule and page cache optimize.
Consider using mmap to cut down on overhead.
Use a larger buffer size per read request to cut down on overhead.
This answer has more information:
https://stackoverflow.com/a/3756466/344638

The problem is that you specify too small buffer for your fstream
char buffer[32 * 1000];
std::ofstream of("test.file");
of.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
Your app runs in the user mode. To write to disk, ofstream calls system write function that executed in kernel mode. Then write transfers data to system cache, then to HDD cache and then it will be written to the disk.
This buffer size affect number of system calls (1 call for every 32*1000 bytes). During system call OS must switch execution context from user mode to kernel mode and then back. Switching context is overhead. In Linux it is equivalent about 2500-3500 simple CPU commands. Because of that, your app spending the most CPU time in context switching.
In your second app you use
FILE* file = fopen("test.file", "w");
FILE using the bigger buffer by default, that is why it produce more efficient code. You can try to specify small buffer with setvbuf. In this case you should see the same performance degradation.
Please note in your case, the bottle neck is not HDD performance. It is context switching

Windows shared memory access time slow

I am currently using shared memory with two mapped files (1.9 GBytes for the first one and 600 MBytes for the second) in a software.
I am using a process that read data from the first file, process the data and write the results to the second file.
I have noticed a strong delay sometimes (the reason is out of my knowledge) when reading or writing to the mapping view with memcpy function.
Mapped files are created this way :
m_hFile = ::CreateFileW(SensorFileName,
GENERIC_READ | GENERIC_WRITE,
0,
NULL,
CREATE_ALWAYS,
FILE_ATTRIBUTE_NORMAL,
NULL);
m_hMappedFile = CreateFileMapping(m_hFile,
NULL,
PAGE_READWRITE,
dwFileMapSizeHigh,
dwFileMapSizeLow,
NULL);
And memory mapping is done this way :
m_lpMapView = MapViewOfFile(m_hMappedFile,
FILE_MAP_ALL_ACCESS,
dwOffsetHigh,
dwOffsetLow,
m_i64ViewSize);
The dwOffsetHigh/dwOffsetLow are "matching" granularity from the system info.
The process is reading about 300KB * N times, storing that in a buffer, processing and then writing 300KB * N times the processed contents of the previous buffer to the second file.
I have two different memory views (created/moved with MapViewOfFile function) with a size of 10 MBytes as default size.
For memory view size, I tested 10kBytes, 100kB, 1MB, 10MB and 100MB. Statistically no difference, 80% of the time reading process is as described below (~200ms) but writing process is really slow.
Normally :
1/ Reading is done in ~200ms.
2/ Process done in 2.9 seconds.
3/ Writing is done in ~200ms.
I can see that 80% of the time, either reading or writing (in the worst case both are slow) will take between 2 and 10 seconds.
Example : For writing, I am using the below code
for (unsigned int i = 0 ; i < N ; i++) // N = 500~3k
{
// Check the position of the memory view for ponderation
if (###)
MoveView(iOffset);
if (m_lpMapView)
{
memcpy((BYTE*)m_lpMapView + iOffset, pANNHeader, uiANNStatus);
// uiSize = ~300 kBytes
memcpy((BYTE*)m_lpMapView + iTemp, pLine[i], uiSize);
}
else
return uiANNStatus;
}
After using GetTickCount function to pinpoint where is the delay, I am seeing that the second memcpy call is always the one taking most of the time.
So, so far I am seeing N (for test, I used N = 500) calls to memcpy taking 10 seconds at the worst time when using those shared memories.
I made a temporary software that was doing the same quantity of memcpy calls, same amount of data and couldn't see the problem.
For tests, I used the following conditions, they all show the same delay :
1/ I can see this on various computers, 32 or 64 bits from windows 7 to windows 10.
2/ Using the main thread or multi-threads (up to 8 with critical sections for synchronization purpose) for reading/writing.
3/ OS on SATA or SSD, memory mapped files of the software physically on a SATA or SSD hard-disk, and if on external hard-disk, tests were done through USB1, USB2 or USB3.
I am kindly asking you what you would think my mistake is for memcpy to go slow.
Best regards.

I found a solution that works for me but not might be the case for others.
Following Thomas Matthews comments, I checked the MSDN and found two interesting functions FlushViewOfFile and FlushFileBuffers (but couldn't find anything interesting about locking memory).
Calling both after the for loop force update of the mapped file.
I am having no more "random" delay, but instead of the expected 200ms, I have an average of 400ms which is enough for my application.
After doing some tests I saw that calling those too often will cause heavy hard-disk access and will make the delay worse (10 seconds for every for loop) so the flush should be use carefully.
Thanks.

HDD benchmark in C++ - measured transfer speed is too fast

I am m trying to develop a mini benchmarking system in C++ and I have trouble measuring the HDD read and write speed. More exactly, the transfer speed measured by me is huge: 400-600 MB/s for read and above 1000 MB/s for write. I have a 5400 RPM hard disk drive (not SSD), the real read/write speed (according to a benchmarking program) is roughly about 60 MB/s.
//blockSize is 4096
//my data buffer
char* mydata = (char*)malloc(1*blockSize);
//initialized with random data
srand(time(NULL));
for(int i=0;i<blockSize;i++){
mydata[i] = rand()%256;
}
double startt, endt, difft;
int times = 10*25000;
int i=0,j=0;
DWORD written;
HANDLE f, g;
DWORD read;
f=CreateFileA(
"newfolder/myfile.txt",
GENERIC_WRITE,
0,
NULL,
CREATE_ALWAYS,
FILE_ATTRIBUTE_NORMAL,
NULL
);
if(f==INVALID_HANDLE_VALUE){
std::cout<<"Error openning for write.";
return -1;
}
startt = clock();
for(i=0;i<times;i++){
WriteFile(
f,
mydata,
blockSize,
&written,
NULL
);
}
endt = clock();
difft = 1.0*(endt-startt)/(1.0*CLOCKS_PER_SEC);
std::cout<<"\nWrite time: "<<difft;
std::cout<<"\nWrite speed: "<<1.0*times*blockSize/difft/1024/1024<<" MB/s";
CloseHandle(f);
//------------------------------------------------------------------------------------------------
g=CreateFile("newfolder/myfile.txt",
GENERIC_READ,
0,
NULL,
OPEN_ALWAYS,
FILE_ATTRIBUTE_NORMAL,
NULL
);
if(g==INVALID_HANDLE_VALUE){
std::cout<<"Error opening for read.";
return -1;
}
startt = clock();
for(i=0;i<times;i++){
ReadFile(
g,
mydata,
blockSize,
&read,
NULL
);
}
endt = clock();
difft = 1.0*(endt-startt)/(1.0*CLOCKS_PER_SEC);
std::cout<<"\nRead time:"<<difft;
std::cout<<"\nRead speed: "<<1.0*times*blockSize/difft/1024/1024<<" MB/s";
CloseHandle(g);
I tried using fopen and fwrite functions too and I got similar results.
I ran my app on another computer. The write speed was about right, the read speed was still huge.
The most interesting thing is that the application actually creates a 1GB file in about 2 seconds which corresponds to a 500 MB/s write speed.
Does anybody have any idea what am I doing wrong?

Technically, you are doing nothing wrong. The problem is, that every OS uses caching for all I/O operations. The HDD itself also caches some data, so it can perform them efficiently.
This question is very platform-specific. You would need to fool caching somehow.
Perhaps, you should look at this library: Bonnie++. You may find it useful. It was written for Unix systems, but source code could reveal some useful techniques.
On Windows, based on this resource, additional flag FILE_FLAG_NO_BUFFERING passed to CreateFile function should be enough to disable buffering for this file.
Quote:
In these situations, caching can be turned off. This is done at the time the file is opened by passing FILE_FLAG_NO_BUFFERING as a value for the dwFlagsAndAttributes parameter of CreateFile. When caching is disabled, all read and write operations directly access the physical disk. However, the file metadata may still be cached. To flush the metadata to disk, use the FlushFileBuffers function.

You are measuring the performance of cache.
Try storing a lot more data than that, once the cache fills the data should be written straight to the disk.

I think I have figured it out.
Unbuffered file writing speed depends on the size of data the WriteFile function is writing. My experiments show that the bigger the data size, the higher the writing speed. For large amounts of data (>1MB) it even outperforms buffered writing, which I was able to measure by writing data larger than 2GB.
To summarize, one can measure the hard drive writing speed accurately by:
Opening the file using CreateFile and setting the FILE_FLAG_NO_BUFFERING flag.
Writing a lot of data at a time, using WriteFile.

Memory access time slow with VirtualAllocExNuma on Windows 7/64

In our application we are running on a dual Xeon server with memory configured as 12gb local to each processor and a memory bus connecting the two Xeon's. For performance reasons, we want to control where we allocate a large (>6gb) block of memory. Below is simplified code -
DWORD processorNumber = GetCurrentProcessorNumber();
UCHAR nodeNumber = 255;
GetNumaProcessorNode((UCHAR)processorNumber, &nodeNumber );
// get amount of physical memory available of node.
ULONGLONG availableMemory = MAXLONGLONG;
GetNumaAvailableMemoryNode(nodeNumber, &availableMemory )
// make sure that we don't request too much. Initial limit will be 75% of available memory
_allocateAmt = qMin(requestedMemory, availableMemory * 3 / 4);
// allocate the cached memory region now.
HANDLE handle = (HANDLE)GetCurrentProcess ();
cacheObject = (char*) VirtualAllocExNuma (handle, 0, _allocateAmt,
MEM_COMMIT | MEM_RESERVE ,
PAGE_READWRITE| PAGE_NOCACHE , nodeNumber);
The code as is, works correctly using VS2008 on Win 7/64.
In our application this block of memory functions as a cache store for static objects (1-2mb ea) that are normally stored on the hard drive. My problem is that when we transfer data into the cache area using memcpy, it takes > 10 times as longer than if we allocate memory using new char[xxxx]. And no other code changes.
We are at a loss to understand why this is happening. Any suggestions as to where to look?

PAGE_NOCACHE is murder on perf, it disables the CPU cache. Was that intentional?

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js