I read a file into a vector like this:
int readBytes(string filename, vector<uint32_t> &v)
{
    // open the file, fstat it to get filesize, etc.
    uint32_t *filebuf = (uint32_t*)mmap(0, filesize, PROT_READ,
                                        MAP_FILE | MAP_PRIVATE,
                                        fhand, 0);
    v = std::vector<uint32_t>(filebuf, filebuf + numrecords);
    munmap(filebuf, filesize);
    return 0;
}
In main() I have two successive calls (purely as a test):
vector<uint32_t> v(10000);
readBytes(filename, v);
readBytes(filename, v);
// ...
The second call almost always gives a faster clock time:
Profile time [1st call]: 0.000214141 sec
Profile time [2nd call]: 0.000094109 sec
A look at the system calls shows that the mapped memory regions are different:
mmap(NULL, 40000, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fe843ac8000
mmap(NULL, 40000, PROT_READ, MAP_PRIVATE, 4, 0) = 0x7fe843ac7000
Why is the second call faster? Coincidence? What, if anything, is cached?
Assuming you're talking about something *NIX-ish, there's probably a page cache, whose job is precisely to cache this sort of data to get this speedup. Unless something else came along between calls to evict those pages from the cache, they'll still be there.
So, the first call potentially has to:
allocate pages
map the pages into your process address space
copy the data from those pages into your vector (possibly faulting the data from disk as it goes)
The second call probably finds the pages still in the cache, and only has to:
map the pages into your process address space
copy the data from those pages into your vector (they're pre-faulted this time, so it's a simple memory operation)
In fact, I've skipped a step: the open/fstat step in your comment is probably also accelerated, via the inode cache.
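One way to confirm that it really is the page cache (a sketch, assuming Linux): evict the file's pages between the two calls with posix_fadvise(POSIX_FADV_DONTNEED); the second call should then be back to "cold" timing. The helper name dropFromPageCache is just for illustration.
// Sketch: ask the kernel to discard a file's cached pages.
// POSIX_FADV_DONTNEED only drops clean pages, so fsync() first
// in case anything is still dirty.
#include <fcntl.h>
#include <unistd.h>

void dropFromPageCache(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return;
    fsync(fd);
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);   // length 0 = to end of file
    close(fd);
}

// In main(), between the two test calls:
//   readBytes(filename, v);
//   dropFromPageCache(filename.c_str());
//   readBytes(filename, v);   // should be slow again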
Remember that your program sees virtual memory. There is a mapping table ("page tables") that maps virtual addresses seen by your program to the real physical memory. And the OS will ensure that the two mmap() calls map two different virtual addresses seen by your program to the same physical memory. So the data only has to be loaded from disk once.
More detail:
First mmap(): OS just records the mapping
When you actually try to read the data: A "page fault" happens, since the data isn't in memory. The OS catches that, reads data from disk to its disk cache, and updates the page tables so that your program can read directly from that disk cache, then it resumes your program automatically.
First munmap(): OS disables the mapping, and updates your page tables so you can't read the file any more. Note that the file is still in the OS's disk cache.
Second mmap(): OS just records the mapping
When you actually try to read the data: A "page fault" happens, since the data isn't mapped. The OS catches that, notices that the data is already in its disk cache, and updates the page tables so that your program can read directly from that disk cache, then it resumes your program automatically.
Second munmap(): OS disables the mapping, and updates your page tables so you can't read the file any more. Note that the file is still in the OS's disk cache.
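You can watch this from user space (a sketch, assuming Linux and the readBytes() from the question; printFaultDelta is just an illustrative helper): getrusage() reports the process's minor and major page-fault counts, which should jump on the first call and barely move on the second.
// Sketch: ru_majflt counts faults that had to go to disk,
// ru_minflt counts faults satisfied from RAM (e.g. the page cache).
#include <sys/resource.h>
#include <cstdio>

static void printFaultDelta(const char* label, const rusage& a, const rusage& b)
{
    std::printf("%s: minor faults %ld, major faults %ld\n", label,
                b.ru_minflt - a.ru_minflt, b.ru_majflt - a.ru_majflt);
}

// Around the two calls in main():
//   rusage r0, r1, r2;
//   getrusage(RUSAGE_SELF, &r0);
//   readBytes(filename, v);
//   getrusage(RUSAGE_SELF, &r1);
//   readBytes(filename, v);
//   getrusage(RUSAGE_SELF, &r2);
//   printFaultDelta("1st call", r0, r1);
//   printFaultDelta("2nd call", r1, r2);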
Related
I am currently using shared memory with two mapped files (1.9 GB for the first one and 600 MB for the second) in a piece of software.
I have a process that reads data from the first file, processes it, and writes the results to the second file.
I sometimes notice a long delay (for reasons I don't understand) when reading from or writing to the mapped view with memcpy.
The mapped files are created this way:
m_hFile = ::CreateFileW(SensorFileName,
                        GENERIC_READ | GENERIC_WRITE,
                        0,
                        NULL,
                        CREATE_ALWAYS,
                        FILE_ATTRIBUTE_NORMAL,
                        NULL);

m_hMappedFile = CreateFileMapping(m_hFile,
                                  NULL,
                                  PAGE_READWRITE,
                                  dwFileMapSizeHigh,
                                  dwFileMapSizeLow,
                                  NULL);
And memory mapping is done this way:
m_lpMapView = MapViewOfFile(m_hMappedFile,
                            FILE_MAP_ALL_ACCESS,
                            dwOffsetHigh,
                            dwOffsetLow,
                            m_i64ViewSize);
The dwOffsetHigh/dwOffsetLow values are aligned to the allocation granularity reported by the system info.
The process reads about 300 KB, N times, stores it in a buffer, processes it, and then writes the processed contents of the previous buffer (again about 300 KB, N times) to the second file.
I have two different memory views (created/moved with the MapViewOfFile function), 10 MB each by default.
For the view size I tested 10 kB, 100 kB, 1 MB, 10 MB and 100 MB; statistically there is no difference. 80% of the time the reading step behaves as described below (~200 ms), but the writing step is really slow.
Normally:
1/ Reading is done in ~200 ms.
2/ Processing is done in ~2.9 seconds.
3/ Writing is done in ~200 ms.
But 80% of the time, either reading or writing (in the worst case both) takes between 2 and 10 seconds.
Example: for writing, I am using the code below.
for (unsigned int i = 0; i < N; i++)   // N = 500~3k
{
    // Check the position of the memory view for ponderation
    if (###)
        MoveView(iOffset);

    if (m_lpMapView)
    {
        memcpy((BYTE*)m_lpMapView + iOffset, pANNHeader, uiANNStatus);
        // uiSize = ~300 kBytes
        memcpy((BYTE*)m_lpMapView + iTemp, pLine[i], uiSize);
    }
    else
        return uiANNStatus;
}
After using the GetTickCount function to pinpoint where the delay is, I can see that the second memcpy call is always the one taking most of the time.
So far I am seeing the N memcpy calls (N = 500 in my tests) take up to 10 seconds in the worst case when using these shared memories.
I wrote a temporary test program that performed the same number of memcpy calls on the same amount of data and could not reproduce the problem.
For the tests I used the following conditions; they all show the same delay:
1/ I can see this on various computers, 32- or 64-bit, from Windows 7 to Windows 10.
2/ Using the main thread or multiple threads (up to 8, with critical sections for synchronization) for reading/writing.
3/ OS on a SATA disk or an SSD; the software's memory-mapped files physically on a SATA disk or an SSD; and, when on an external hard disk, tests were done over USB 1, USB 2 or USB 3.
What mistake could I be making that causes memcpy to be this slow?
Best regards.
I found a solution that works for me, but it might not work for others.
Following Thomas Matthews' comments, I checked MSDN and found two interesting functions, FlushViewOfFile and FlushFileBuffers (but couldn't find anything useful about locking memory).
Calling both after the for loop forces the mapped file to be updated.
I no longer see the "random" delay, but instead of the expected 200 ms I now average 400 ms, which is good enough for my application.
After doing some tests I saw that calling them too often causes heavy hard-disk access and makes the delay worse (10 seconds for every for loop), so the flushes should be used sparingly.
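Roughly, the flush added after the for loop looks like this (a sketch using the handle and view members from the question; error handling omitted, <windows.h> required):
// FlushViewOfFile starts writing the modified pages of the view back to the
// file; FlushFileBuffers then forces them through the OS cache to the disk.
::FlushViewOfFile(m_lpMapView, (SIZE_T)m_i64ViewSize);
::FlushFileBuffers(m_hFile);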
Thanks.
I am implementing a reader for huge compressed raster files. Decompression is performed partially, on the fly: only requested regions of the raster are decompressed and stored in a memory cache. The reader works similarly to memory-mapping a file, but the data is not mapped to memory 1:1; it is decompressed.
It is implemented using anonymous memory mapping:
char* raster_cache = static_cast<char*>(mmap(0, UNCOMPRESSED_RASTER_SIZE, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
Reading an area that is not cached yet raises a segmentation violation signal, which is caught and handled using libsigsegv (see my previous question):
struct CacheHandlerData
{
    std::mutex mutex;
    // other data needed for decompression
};

int cache_sigsegv_handler(void* fault_address, void* user_data)
{
    void* page_address = reinterpret_cast<void*>(
        reinterpret_cast<uintptr_t>(fault_address) & ~(PAGE_SIZE - 1));
    CacheHandlerData* data = static_cast<CacheHandlerData*>(user_data);

    std::lock_guard<std::mutex> lock(data->mutex);

    unsigned char cached = 0;
    mincore(page_address, 1, &cached);

    if (!cached)
    {
        mprotect(page_address, PAGE_SIZE, PROT_WRITE);
        // decompress whole page
        mprotect(page_address, PAGE_SIZE, PROT_READ);
    }

    return 1;
}
The problem is that cached pages stay in memory forever. Because I write to the pages, they are marked as dirty and never invalidated.
QUESTION: Is there some possibility to mark pages as not dirty?
If the system were running out of memory, the pages could then be removed from memory just like a normal disk cache. The removed pages would also need mprotect(page_address, PAGE_SIZE, PROT_NONE) so that accessing them again raises a segmentation violation.
Thank you.
EDIT: I could use a temporary-file-backed mapping instead of an anonymous one. Pages would then be swapped out to disk if the system ran out of memory. But this solution loses the benefits of using compressed data (smaller disk footprint, probably faster reading).
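That temporary-file-backed variant would look roughly like this (a sketch, assuming POSIX mkstemp/ftruncate; createFileBackedCache and the /tmp template are illustrative names, and the caller would pass UNCOMPRESSED_RASTER_SIZE):
// Sketch: back the cache with an unlinked temporary file so the kernel can
// write cold pages out under memory pressure instead of keeping them in RAM.
#include <sys/mman.h>
#include <unistd.h>
#include <cstdlib>

char* createFileBackedCache(size_t size)
{
    char tmpl[] = "/tmp/raster_cache_XXXXXX";   // hypothetical path template
    int fd = mkstemp(tmpl);
    if (fd == -1)
        return nullptr;
    unlink(tmpl);                               // file vanishes once unmapped and closed
    if (ftruncate(fd, static_cast<off_t>(size)) == -1)
    {
        close(fd);
        return nullptr;
    }
    void* p = mmap(nullptr, size, PROT_NONE, MAP_SHARED, fd, 0);
    close(fd);                                  // the mapping keeps the file alive
    return p == MAP_FAILED ? nullptr : static_cast<char*>(p);
}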
I am writing an app that receives a binary data stream via a simple function call like put(DataBLock, dateTime);, where each data package is 4 MB.
I have to write these data blocks to separate files for future use, together with some additional data like id, insertion time, tag, etc.
So I tried these two methods:
first with FILE:
data.id = seedFileId;
seedFileId++;
std::string fileName = getFileName(data.id);
char *fNameArray = (char*)fileName.c_str();
FILE* pFile;
pFile = fopen(fNameArray,"wb");
fwrite(reinterpret_cast<const char *>(&data.dataTime), 1, sizeof(data.dataTime), pFile);
data.dataInsertionTime = time(0);
fwrite(reinterpret_cast<const char *>(&data.dataInsertionTime), 1, sizeof(data.dataInsertionTime), pFile);
fwrite(reinterpret_cast<const char *>(&data.id), 1, sizeof(long), pFile);
fwrite(reinterpret_cast<const char *>(&data.tag), 1, sizeof(data.tag), pFile);
fwrite(reinterpret_cast<const char *>(&data.data_block[0]), 1, data.data_block.size() * sizeof(int), pFile);
fclose(pFile);
second with ostream:
ofstream fout;
data.id = seedFileId;
seedFileId++;
std::string fileName = getFileName(data.id);
char *fNameArray = (char*)fileName.c_str();
fout.open(fNameArray, ios::out| ios::binary | ios::app);
fout.write(reinterpret_cast<const char *>(&data.dataTime), sizeof(data.dataTime));
data.dataInsertionTime = time(0);
fout.write(reinterpret_cast<const char *>(&data.dataInsertionTime), sizeof(data.dataInsertionTime));
fout.write(reinterpret_cast<const char *>(&data.id), sizeof(long));
fout.write(reinterpret_cast<const char *>(&data.tag), sizeof(data.tag));
fout.write(reinterpret_cast<const char *>(&data.data_block[0]), data.data_block.size() * sizeof(int));
fout.close();
In my tests the first method looks faster, but my main problem is that, in both cases, everything goes fine at first: every file write takes about the same time (around 20 milliseconds). But after the 250th-300th package it starts to spike to 150-300 milliseconds, then drops back to 20 milliseconds, then spikes to 150 ms again, and so on. It becomes very unpredictable.
When I added some timers to the code I found that the main cause of these peaks is the fout.open(...) and pFile = fopen(...) lines. I have no idea whether this is because of the operating system, the hard drive, or some kind of cache or buffer mechanism.
So the question is: why do these file-opening lines become problematic after some time, and is there a way to make the file write operation stable, i.e. take a fixed time?
Thanks.
NOTE: I'm using Visual Studio 2008 VC++ on Windows 7 x64. (I also tried the 32-bit configuration, but the result is the same.)
EDIT: After some point the writing speed slows down as well, even when the file-open time is minimal. I tried different package sizes, so here are the results:
For 2 MB packages it takes twice as long to slow down: the slowdown begins after roughly the 600th item.
For 4 MB packages, around the 300th item.
For 8 MB packages, around the 150th item.
So it seems to me it is some sort of caching problem (in the hard drive or the OS)? But I also tried disabling the hard-drive cache and nothing changed...
Any idea?
This is all perfectly normal; you are observing the behavior of the file system cache, which is a chunk of RAM that the operating system sets aside to buffer disk data. It is normally a fat gigabyte, and can be much more if your machine has lots of RAM. Sounds like you've got 4 GB installed, not that much for a 64-bit operating system. It depends, however, on the RAM needs of the other processes running on the machine.
Your calls to fwrite() or ofstream::write() write to a small buffer created by the CRT, which in turn makes operating system calls to flush full buffers. The OS write normally completes very quickly; it is a simple memory-to-memory copy from the CRT buffer to the file system cache. Effective write speed is in excess of a gigabyte per second.
The file system driver lazily writes the file system cache data to the disk, optimized to minimize the seek time of the write head, by far the most expensive operation on a disk drive. Effective write speed is determined by the rotational speed of the disk platter and the time needed to position the write head. Typical is around 30 megabytes per second for consumer-level drives, give or take a factor of 2.
Perhaps you see the fire-hose problem here: you are writing to the file cache a lot faster than it can be emptied. This hits the wall eventually; you'll fill the cache to capacity and suddenly see the performance of your program fall off a cliff. Your program must now wait until space opens up in the cache so the write can complete, so effective write speed is now throttled by disk write speed.
The 20 msec delays you observe are normal as well. That's typically how long it takes to open a file, a time completely dominated by disk-head seek time: the head needs to travel to the file system index to write the directory entry. Nominal times are between 20 and 50 msec, and you are on the low end of that already.
Clearly there is very little you can do in your code to improve this. Which CRT functions you use certainly makes no difference, as you found out. At best you could increase the size of the files you write, which reduces the overhead spent on creating files.
Buying more RAM is always a good idea, but it of course merely delays the moment the fire hose overflows the bucket. You need better drive hardware to get ahead. An SSD is pretty nice, and so is a striped RAID array. The best thing to do is simply not to wait for your program to complete :)
So the question is: why do these file-opening lines become problematic after some time, and is there a way to make the file write operation stable, i.e. take a fixed time?
This observation (i.e. the varying time taken by the write operation) does not mean there is a problem in the OS or the file system; there could be various reasons behind it. One possible reason is that the kernel uses delayed writes to get the data to disk: it sometimes caches (buffers) the data in case another process needs to read or write it soon, so that an extra disk operation can be avoided.
This situation can lead to inconsistent times across write calls for the same amount of data.
File I/O is a fairly complex topic and depends on various other factors. For complete information on the internal algorithms of the file system, you may want to refer to the classic book "The Design of the UNIX Operating System" by Maurice J. Bach, which describes these concepts and their implementation in detail.
Having said that, you may want to call flush immediately after your write call in both versions of your program (i.e. the C and the C++ version). That way you may get more consistent file I/O write times. Otherwise your programs' behaviour looks correct to me.
// C program
fwrite(data, 1, size, fp);   // size = number of bytes in data
fflush(fp);

// C++ program
fout.write(data, size);
fout.flush();
It's possible that the spikes are not related to the I/O itself, but to NTFS metadata: when your file count reaches some limit, some NTFS AVL-like data structure needs some rebalancing and... bump!
To check this, you could preallocate the file entries, for example by creating all the files with zero size up front and then just opening them when writing (see the sketch below), purely as a test: if my theory is correct, you shouldn't see the spikes anymore.
UHH - and you must disable file indexing (the Windows Search service) there! Just remembered it... see here.
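A quick way to run that test (a sketch; preallocateFiles and packageCount are illustrative names, getFileName() is the helper from the question):
// Sketch: pre-create the directory entries so the NTFS metadata work happens
// before the timed writes; a zero-size file is enough, creating the entry is
// the point.
#include <fstream>
#include <string>

void preallocateFiles(long packageCount)
{
    for (long id = 0; id < packageCount; ++id)
    {
        std::ofstream f(getFileName(id).c_str(), std::ios::binary);
    }
}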
Good afternoon. We have implemented a C++ cKeyArray class to test whether we can use the large-file API to save physical memory. During CentOS Linux testing, we found that the Linux file API was just as fast as using the heap for random-access processing. Here are the numbers, for a 2,700,000-row SQL database where the KeySize for each row is 62 bytes.
cKeyArray class using the Linux file API:
BruteForceComparisons = 197275, BruteForceTimeElapsed = 1,763,504,445 microsecs
Each brute-force comparison requires two random accesses, so the mean time per random access = 1,763,504,445 microsecs / (2 * 197275) ≈ 4470 microsecs.
Heap, no cKeyArray class:
BruteForceComparisons = 197275, BruteForceTimeElapsed = 1,708,442,690 microsecs
The mean time per random access ≈ 4300 microsecs.
On 32-bit Windows, the numbers are:
cKeyArray class using the Windows file API:
BruteForceComparisons = 197275, BruteForceTimeElapsed = 9,243,787 millisecs
The mean time per random access is 23.4 millisecs.
Heap, no cKeyArray class:
BruteForceComparisons = 197275, BruteForceTimeElapsed = 2,141,941 millisecs
The mean time per random access is 5.4 millisecs.
We are wondering why the Linux cKeyArray numbers are just as good as the Linux heap numbers, while on 32-bit Windows the mean heap random-access time is four times as fast as the cKeyArray Windows file API. Is there some way we can speed up the Windows cKeyArray file API?
Previously, we received a lot of good suggestions from Stack Overflow on using the Windows memory-mapped file API. Based on those suggestions we have implemented a memory-mapped-file MRU caching class, which functions properly.
Because we want to develop a cross-platform solution, we want to do due diligence to understand why the Linux file API is so fast. Thank you. A portion of the cKeyArray class implementation is posted below.
#define KEYARRAY_THRESHOLD 100000000
// Use file instead of memory if requirement is above this number
cKeyArray::cKeyArray(long RecCount_, int KeySize_, int MatchCodeSize_,
                     char* TmpFileName_)
{
    RecCount = RecCount_;
    KeySize = KeySize_;
    MatchCodeSize = MatchCodeSize_;
    MemBuffer = 0;
    KeyBuffer = 0;
    MemFile = 0;
    MemFileName[0] = '\x0';
    ReturnBuffer = new char[MatchCodeSize + 1];

    if (RecCount * KeySize <= KEYARRAY_THRESHOLD) {
        InMemory = true;
        MemBuffer = new char[RecCount * KeySize];
        memset(MemBuffer, 0, RecCount * KeySize);
    } else {
        InMemory = false;
        strcpy(MemFileName, TmpFileName_);
        try {
            MemFile =
                new cFile(MemFileName, cFile::CreateAlways, cFile::ReadWrite);
        }
        catch (cException e)
        {
            throw e;
        }
        try {
            MemFile->SetFilePointer(
                (int64_t)(RecCount * KeySize), cFile::FileBegin);
        }
        catch (cException e)
        {
            throw e;
        }
        if (!(MemFile->SetEndOfFile()))
            throw cException(ERR_FILEOPEN, MemFileName);
        KeyBuffer = new char[KeySize];
    }
}
char* cKeyArray::GetKey(long Record_)
{
    memset(ReturnBuffer, 0, MatchCodeSize + 1);
    if (InMemory) {
        memcpy(ReturnBuffer, MemBuffer + Record_ * KeySize, MatchCodeSize);
    } else {
        MemFile->SetFilePointer((int64_t)(Record_ * KeySize), cFile::FileBegin);
        MemFile->ReadFile(KeyBuffer, KeySize);
        memcpy(ReturnBuffer, KeyBuffer, MatchCodeSize);
    }
    return ReturnBuffer;
}
uint32_t cKeyArray::GetDupeGroup(long Record_)
{
    uint32_t DupeGroup(0);
    if (InMemory) {
        memcpy((char*)&DupeGroup,
               MemBuffer + Record_ * KeySize + MatchCodeSize, sizeof(uint32_t));
    } else {
        MemFile->SetFilePointer(
            (int64_t)(Record_ * KeySize + MatchCodeSize), cFile::FileBegin);
        MemFile->ReadFile((char*)&DupeGroup, sizeof(uint32_t));
    }
    return DupeGroup;
}
On Linux, the OS aggressively caches file data in main memory -- so although you haven't explicitly allocated memory for the file contents, they are nevertheless stored in RAM. Here's a decent link with some more information about the page cache -- only one thing is missing from that description, which is that most Linux filesystems actually implement the standard I/O interfaces as thin wrappers around the page cache. That means that even though you haven't explicitly memory mapped the file, the system is still treating it as though it were memory mapped under-the-covers. That's why you see roughly equivalent performance with either approach.
I second the suggestion to factor the platform-specific stuff out, and use whichever approach is fastest for each platform. Be sure to benchmark -- don't ever make assumptions about performance.
Your memory-mapped solution should be as much as 10x faster than the file solution even in Linux. That is the speed I experience in my test cases.
Each file-access system call takes hundreds of CPU cycles to complete; that is time your program could be using to do real work.
One explanation for why the speeds are similar could be that your memory map has not been used before. When a memory-mapped page is accessed for the first time it must be assigned to a physical page of RAM and zeroed out, or, if it is backed by a disk file, loaded from disk into RAM. All of that takes a considerable amount of time.
If you touch (read or write a value in) each 4 KB page of the mapping before using it, you should see a significant speed increase with the memory map.
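A sketch of that warm-up pass (prefaultPages, mapBase and mapSize are illustrative names; a 4096-byte page size is assumed):
// Sketch: read one byte per page so every page is faulted in (and the file
// data pulled into RAM) before the timed random-access phase starts.
#include <cstddef>

void prefaultPages(const volatile char* mapBase, std::size_t mapSize)
{
    const std::size_t pageSize = 4096;          // assumed page size
    volatile char sink = 0;
    for (std::size_t off = 0; off < mapSize; off += pageSize)
        sink = mapBase[off];                    // the read faults the page in
    (void)sink;
}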
I am busy writing something to test the read speeds for disk IO on Linux.
At the moment I have something like this to read the files:
Edited to change code to this:
const int segsize = 1048576;
char buffer[segsize];
ifstream file;
file.open(sFile.c_str());
while(file.readsome(buffer,segsize)) {}
For foo.dat, which is 150GB, the first time I read it in, it takes around 2 minutes.
However, if I run it again within 60 seconds of the first run, it takes around 3 seconds. How is that possible? Surely the only place it could be read from that fast is the buffer cache in RAM, and the file is too big to fit in RAM.
The machine has 50 GB of RAM, and the drive is an NFS mount with all the default settings. Please let me know where I could look to confirm that this file is actually being read at this speed. Is my code wrong? It appears to take a correct amount of time the first time the file is read.
Edited to Add:
I found out that my files were only being read up to a random point. I've managed to fix this by changing segsize from 1048576 down to 1024. I have no idea why this change allows the ifstream to read the whole file instead of stopping at a random point.
Thanks for the answers.
On Linux, you can do this for a quick throughput test:
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.863904 s, 243 MB/s
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.0748273 s, 2.8 GB/s
$ sync && echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.919688 s, 228 MB/s
sync followed by echo 3 > /proc/sys/vm/drop_caches will drop the caches properly (you need root for this).
in_avail doesn't give the length of the file, but a lower bound on what is available (in particular, if the buffer has already been used, it returns the amount available in the buffer). Its purpose is to tell you what can be read without blocking.
An unsigned int is most probably unable to hold a length of more than 4 GB, so what is actually read may very well fit in the cache.
C++0x stream positioning may be interesting to you if you are using large files.
in_avail returns a lower bound on how much is available to read in the stream's read buffer, not the size of the file. To read the whole file via the stream, just keep calling the stream's readsome() method and checking how much was read with the gcount() method; when that returns zero, you have read everything.
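A literal sketch of that loop (readWholeFile is an illustrative name; note that readsome() only reports what is already in the stream's buffer, so how much it sees per call is implementation-dependent):
// Sketch: drain the stream with readsome()/gcount() as described above.
#include <fstream>

long long readWholeFile(std::ifstream& file, char* buffer, std::streamsize segsize)
{
    long long total = 0;
    for (;;)
    {
        file.readsome(buffer, segsize);
        std::streamsize got = file.gcount();    // how much the last call read
        if (got <= 0)                           // nothing currently available
            break;
        total += got;
    }
    return total;
}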
It appears to take a correct amount of time the first time the file is read.
On that first read, you're reading 150GB in about 2 minutes. That works out to about 10 gigabits per second. Is that what you're expecting (based on the network to your NFS mount)?
One possibility is that the file could be at least in part sparse. A sparse file has regions that are truly empty - they don't even have disk space allocated to them. These sparse regions also don't consume much cache space, and so reading the sparse regions will essentially only require time to zero out the userspace pages they're being read into.
You can check with ls -lsh. The first column will be the on-disk size - if it's less than the file size, the file is indeed sparse. To de-sparse the file, just write to every page of it.
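If you'd rather check from code, st_blocks from stat() tells you how much space is actually allocated (a sketch, assuming POSIX; isProbablySparse is an illustrative name, and st_blocks is in 512-byte units):
// Sketch: compare allocated bytes with the nominal size to spot a sparse file.
#include <sys/stat.h>
#include <cstdio>

bool isProbablySparse(const char* path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return false;
    long long allocated = static_cast<long long>(st.st_blocks) * 512LL;
    std::printf("size %lld, allocated %lld\n",
                static_cast<long long>(st.st_size), allocated);
    return allocated < static_cast<long long>(st.st_size);
}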
If you would like to test for true disk speeds, one option would be to use the O_DIRECT flag to open(2) to bypass the cache. Note that all IO using O_DIRECT must be page-aligned, and some filesystems do not support it (in particular, it won't work over NFS). Also, it's a bad idea for anything other than benchmarking. See some of Linus's rants in this thread.
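A minimal sketch of such a benchmark read (assuming Linux, a filesystem that supports O_DIRECT, and 4096-byte alignment; directRead is an illustrative name):
// Sketch: read a whole file with O_DIRECT, bypassing the page cache.
// The buffer, file offset and request length must all be suitably aligned.
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>

long long directRead(const char* path)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd == -1)
        return -1;

    const size_t blockSize = 1 << 20;           // 1 MiB per request
    void* buf = nullptr;
    if (posix_memalign(&buf, 4096, blockSize) != 0)
    {
        close(fd);
        return -1;
    }

    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, blockSize)) > 0)  // short read at end of file is fine
        total += n;

    free(buf);
    close(fd);
    return total;
}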
Finally, to drop all caches on a Linux system for testing, you can do:
echo 3 > /proc/sys/vm/drop_caches
If you do this on both client and server, you will force the file out of memory. Of course, this will have a negative performance impact on anything else running at the time.