Performance of Win32 memory mapped files vs. CRT fopen/fread - c++

I need to read (scan) a file sequentially and process its content.
File size can be anything from very small (some KB) to very large (some GB).
I tried two techniques using VC10/VS2010 on Windows 7 64-bit:
Win32 memory mapped files (i.e. CreateFile, CreateFileMapping, MapViewOfFile, etc.)
fopen and fread from CRT.
I thought that memory mapped file technique could be faster than CRT functions, but some tests showed that the speed is almost the same in both cases.
The following C++ statements are used for MMF:
HANDLE hFile = CreateFile(
    filename,                   // file to scan
    GENERIC_READ,
    FILE_SHARE_READ,
    NULL,
    OPEN_EXISTING,
    FILE_FLAG_SEQUENTIAL_SCAN,  // hint that the file is read sequentially
    NULL
);
HANDLE hFileMapping = CreateFileMapping(
    hFile,
    NULL,
    PAGE_READONLY,
    0,
    0,                          // both size fields 0 = map the whole file
    NULL                        // unnamed mapping
);
The file is read sequentially, chunk by chunk; each chunk is SYSTEM_INFO.dwAllocationGranularity in size.
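For illustration, a simplified sketch of such a per-chunk loop (error handling omitted; not the exact code used):

SYSTEM_INFO si;
GetSystemInfo(&si);
const DWORD chunk = si.dwAllocationGranularity;  // view offsets must be multiples of this

LARGE_INTEGER size;
GetFileSizeEx(hFile, &size);

for (ULONGLONG offset = 0; offset < (ULONGLONG)size.QuadPart; offset += chunk)
{
    ULONGLONG remaining = (ULONGLONG)size.QuadPart - offset;
    SIZE_T bytes = (SIZE_T)(remaining < chunk ? remaining : chunk);
    const char* view = (const char*)MapViewOfFile(
        hFileMapping, FILE_MAP_READ,
        (DWORD)(offset >> 32), (DWORD)(offset & 0xFFFFFFFF), bytes);
    // ... process view[0 .. bytes-1] ...
    UnmapViewOfFile(view);
}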
Considering that speed is almost the same with MMF and CRT, I'd use the CRT functions because they are simpler and multi-platform. But I'm curious: am I using the MMF technique correctly? Is it normal that MMF performance in this case of scanning a file sequentially is the same as with CRT?
Thanks.

I believe you won't see much difference if you access the file sequentially, because file I/O is very heavily cached, and read-ahead is probably also used.
Things would be different if you had many "jumps" during the file data processing. Then, each time setting a new file pointer and reading a new file portion would probably hurt the CRT approach, whereas MMF would give you the maximum possible performance.

Since you are scanning the file sequentially I would not expect disk usage pattern to be much different for either method.
For large files, MMF might reduce data locality and even result in a copy of all or part of the file being placed in the pagefile, whereas processing via CRT using a small buffer would all take place in RAM. In this instance, MMF would probably be slower. You can mitigate this by only mapping in part of the underlying file at a time, but then things get more complex without any likely win over direct sequential I/O.
MMFs are really the way Windows implements inter-process shared memory, rather than a way to speed up generalized file I/O. The file manager cache in the kernel is what you really need to leverage here.

I thought that memory mapped file technique could be faster than CRT functions, but some tests showed that the speed is almost the same in both cases.
You are probably hitting the file system cache for your tests. Unless you explicitly create file handles to bypass the file system cache (FILE_FLAG_NO_BUFFERING when calling CreateFile), the file system cache will kick in and keep recently accessed files in memory.
There is a small speed penalty when reading a file that is in the file system cache through buffered reads rather than through a mapping: the operating system has to perform an extra copy, and there is system call overhead. But for your purposes, you should probably stick with the CRT file functions.
Gustavo Duarte has a great article on memory mapped files (from a generic OS perspective).

Both methods will eventually come down to disk I/O; that will be your bottleneck. I would go with the method my higher-level functionality likes more: if I need streaming, I'll go with files; if I need sequential access and fixed-size files, I would consider memory-mapped files.
Or, if you have an algorithm that works only on memory, memory-mapped files can be the easier way out.

Using ReadFile:
Enters Kernel Mode
Does a memcpy from the Disk Cache
If data isn't in the Disk Cache, triggers a Page Fault which makes the Cache Manager read data from the disk.
Exits Kernel Mode
Cost of entering and leaving Kernel Mode was about 1600 CPU cycles when I measured it.
Avoid small reads, since every call to ReadFile has the overhead of entering and leaving Kernel Mode.
Memory Mapped Files:
Basically places the Disk Cache right into your application's address space.
If data is in the cache, you just read it.
If data isn't there, triggers a Page Fault that makes the Cache Manager read data from the disk. (There is a User/Kernel mode transition to handle this exception)
Disk reads don't always succeed. You need to be able to handle memory exceptions from the system, otherwise a disk read failure will be an application crash.
So both ways will use the same Disk Cache, and will use the same mechanism of getting data into the cache (Page Fault exception -> Cache Manager reads data from the disk). The Cache Manager is also responsible for doing data prefetching and such, so it can read more than one page at a time. You don't get a page fault on every memory page.
So the main advantages of Memory-Mapped files are:
Can possibly use data in-place without copying it out first
Fewer User<->Kernel Mode transitions (depends on access patterns)
And the disadvantages are:
Need to handle access violation exceptions for failed disk reads
Takes up address space in the program to map entire files
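To make the exception-handling point concrete, a minimal sketch using MSVC structured exception handling (view/bytes stand for the mapped pointer and its size; ProcessView is a placeholder for your own code):

__try
{
    ProcessView(view, bytes);  // any read of the mapped view may fault if the disk read fails
}
__except (GetExceptionCode() == EXCEPTION_IN_PAGE_ERROR
              ? EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH)
{
    // the underlying I/O failed while paging the view in; report an error instead of crashing
}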

Related

C++ memory mapped files [duplicate]

POSIX environments provide at least two ways of accessing files. There's the standard system calls open(), read(), write(), and friends, but there's also the option of using mmap() to map the file into virtual memory.
When is it preferable to use one over the other? What are their individual advantages that merit including two interfaces?
mmap is great if you have multiple processes accessing data in a read only fashion from the same file, which is common in the kind of server systems I write. mmap allows all those processes to share the same physical memory pages, saving a lot of memory.
mmap also allows the operating system to optimize paging operations. For example, consider two programs; program A which reads a 1MB file into a buffer created with malloc, and program B which mmaps the 1MB file into memory. If the operating system has to swap part of A's memory out, it must write the contents of the buffer to swap before it can reuse the memory. In B's case any unmodified mmap'd pages can be reused immediately because the OS knows how to restore them from the existing file they were mmap'd from. (The OS can detect which pages are unmodified by initially marking writable mmap'd pages as read only and catching seg faults, similar to a Copy on Write strategy.)
mmap is also useful for inter process communication. You can mmap a file as read / write in the processes that need to communicate and then use synchronization primitives in the mmap'd region (this is what the MAP_HASSEMAPHORE flag is for).
One place mmap can be awkward is if you need to work with very large files on a 32 bit machine. This is because mmap has to find a contiguous block of addresses in your process's address space that is large enough to fit the entire range of the file being mapped. This can become a problem if your address space becomes fragmented, where you might have 2 GB of address space free, but no individual range of it can fit a 1 GB file mapping. In this case you may have to map the file in smaller chunks than you would like to make it fit.
Another potential awkwardness with mmap as a replacement for read / write is that you have to start your mapping on offsets of the page size. If you just want to get some data at offset X you will need to fixup that offset so it's compatible with mmap.
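As a rough sketch of that fix-up (POSIX; offset, length and fd are illustrative):

long pagesize = sysconf(_SC_PAGESIZE);
off_t aligned = offset & ~((off_t)pagesize - 1);  // round the offset down to a page boundary
size_t delta = (size_t)(offset - aligned);
char* base = (char*)mmap(NULL, length + delta, PROT_READ, MAP_PRIVATE, fd, aligned);
char* data = base + delta;                        // the bytes you actually wanted
// ... use data[0 .. length-1] ...
munmap(base, length + delta);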
And finally, read / write are the only way you can work with some types of files. mmap can't be used on things like pipes and ttys.
One area where I found mmap() to not be an advantage was when reading small files (under 16K). The overhead of page faulting to read the whole file was very high compared with just doing a single read() system call. This is because the kernel can sometimes satisfy a read entirely in your time slice, meaning your code doesn't switch away. With a page fault, it seemed more likely that another program would be scheduled, making the file operation have a higher latency.
mmap has the advantage when you have random access on big files. Another advantage is that you access it with memory operations (memcpy, pointer arithmetic), without bothering with the buffering. Normal I/O can sometimes be quite difficult when using buffers when you have structures bigger than your buffer. The code to handle that is often difficult to get right, mmap is generally easier. This said, there are certain traps when working with mmap.
As people have already mentioned, mmap is quite costly to set up, so it is worth using only for a given size (varying from machine to machine).
For pure sequential accesses to the file, it is also not always the better solution, though an appropriate call to madvise can mitigate the problem.
You have to be careful with the alignment restrictions of your architecture (SPARC, Itanium); with read/write I/O the buffers are often properly aligned and do not trap when dereferencing a cast pointer.
You also have to be careful that you do not access outside of the map. It can easily happen if you use string functions on your map, and your file does not contain a \0 at the end. It will work most of the time when your file size is not a multiple of the page size as the last page is filled with 0 (the mapped area is always in the size of a multiple of your page size).
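A tiny sketch of the madvise hint mentioned above (addr and length describe an existing mapping):

// Tell the kernel the mapping will be read sequentially, so it can read ahead aggressively
// and drop pages behind the scan.
madvise(addr, length, MADV_SEQUENTIAL);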
In addition to other nice answers, a quote from Linux System Programming, written by Google's expert Robert Love:

Advantages of mmap()
Manipulating files via mmap() has a handful of advantages over the standard read() and write() system calls. Among them are:
Reading from and writing to a memory-mapped file avoids the extraneous copy that occurs when using the read() or write() system calls, where the data must be copied to and from a user-space buffer.
Aside from any potential page faults, reading from and writing to a memory-mapped file does not incur any system call or context switch overhead. It is as simple as accessing memory.
When multiple processes map the same object into memory, the data is shared among all the processes. Read-only and shared writable mappings are shared in their entirety; private writable mappings have their not-yet-COW (copy-on-write) pages shared.
Seeking around the mapping involves trivial pointer manipulations. There is no need for the lseek() system call.
For these reasons, mmap() is a smart choice for many applications.

Disadvantages of mmap()
There are a few points to keep in mind when using mmap():
Memory mappings are always an integer number of pages in size. Thus, the difference between the size of the backing file and an integer number of pages is "wasted" as slack space. For small files, a significant percentage of the mapping may be wasted. For example, with 4 KB pages, a 7 byte mapping wastes 4,089 bytes.
The memory mappings must fit into the process' address space. With a 32-bit address space, a very large number of various-sized mappings can result in fragmentation of the address space, making it hard to find large free contiguous regions. This problem, of course, is much less apparent with a 64-bit address space.
There is overhead in creating and maintaining the memory mappings and associated data structures inside the kernel. This overhead is generally obviated by the elimination of the double copy mentioned in the previous section, particularly for larger and frequently accessed files.
For these reasons, the benefits of mmap() are most greatly realized when the mapped file is large (and thus any wasted space is a small percentage of the total mapping), or when the total size of the mapped file is evenly divisible by the page size (and thus there is no wasted space).
Memory mapping has the potential for a huge speed advantage compared to traditional I/O. It lets the operating system read the data from the source file as the pages in the memory-mapped file are touched: untouched pages fault, the OS detects the fault and loads the corresponding data from the file automatically.
This works the same way as the paging mechanism and is usually optimized for high-speed I/O by reading data on system page boundaries and in page-sized chunks (usually 4 KB), a size for which most file system caches are optimized.
An advantage that isn't listed yet is the ability of mmap() to keep a read-only mapping as clean pages. If one allocates a buffer in the process's address space, then uses read() to fill the buffer from a file, the memory pages corresponding to that buffer are now dirty since they have been written to.
Dirty pages cannot be dropped from RAM by the kernel. If there is swap space, they can be paged out to swap, but this is costly, and on some systems, such as small embedded devices with only flash memory, there is no swap at all. In that case, the buffer will be stuck in RAM until the process exits, or perhaps gives it back with madvise().
Pages of an mmap() mapping that have not been written to are clean. If the kernel needs RAM, it can simply drop them and reuse the RAM the pages were in. If the process that had the mapping accesses it again, a page fault occurs and the kernel re-loads the pages from the file they came from originally, the same way they were populated in the first place.
This doesn't require more than one process using the mapped file to be an advantage.
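A small sketch of the contrast (POSIX, error handling omitted):

// (a) read() into heap memory: these pages become dirty (anonymous) and can only be evicted to swap.
char* buf = (char*)malloc(size);
read(fd, buf, size);

// (b) mmap() the file read-only: these pages stay clean, so the kernel can drop them at any time
//     and fault them back in from the file later.
char* map = (char*)mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);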

Do memory mapped files provide advantage for large buffers?

My program works with large data sets that need to be stored in contiguous memory (several Gigabytes). Allocating memory using std::allocator (i.e. malloc or new) causes system stalls as large portions of virtual memory are reserved and physical memory gets filled up.
Since the program will mostly only work on small portions at a time, my question is whether using memory mapped files would provide an advantage (i.e. mmap or the Windows equivalent), that is, creating a large sparse temporary file and mapping it into virtual memory. Or is there another technique that would change the system's paging strategy so that fewer pages are loaded into physical memory at a time?
I'm trying to avoid building a streaming mechanism that loads portions of a file at a time and instead rely on the system's vm pagination.
Yes, mmap has the potential to speed things up.
Things to consider:
Remember the VMM will page things in and out in page-sized blocks (4 KB on Linux)
If your memory access is well localised over time, this will work well. But if you do random access over your entire file, you will end up with a lot of seeking and thrashing (still). So, consider whether your 'small portions' correspond with localised bits of the file.
For large allocations, malloc and free will use mmap with MAP_ANON anyway. So the difference in memory mapping a file is simply that you are getting the VMM to do the I/O for you.
Consider using madvise with mmap to assist the VMM in paging well.
When you use open and read (plus, as erenon suggests, posix_fadvise), your file is still held in buffers anyway (i.e. it's not immediately written out) unless you also use O_DIRECT. So in both situations, you are relying on the kernel for I/O scheduling.
If the data is already in a file, it would speed up things, especially in the non-sequential case. (In the sequential case, read wins)
If using open and read, consider using posix_fadvise as well.
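As a rough sketch of the sparse-temporary-file approach from the question, with the madvise hint mentioned above (POSIX; the path and sizes are illustrative):

int fd = open("/tmp/scratch.bin", O_RDWR | O_CREAT | O_TRUNC, 0600);
ftruncate(fd, total_bytes);                    // sparse file: blocks are allocated on demand
char* p = (char*)mmap(NULL, total_bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
madvise(p, total_bytes, MADV_RANDOM);          // or MADV_SEQUENTIAL, matching the access pattern
// ... work on p; the VMM pages pieces in and out as needed ...
munmap(p, total_bytes);
close(fd);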
This really depends on your mmap() implementation. Mapping a file into memory has several advantages that can be exploited by the kernel:
The kernel knows that the contents of the mmap() pages are already present on disk. If it decides to evict these pages, it can omit the write-back.
You reduce copying operations: read() operations typically first read the data into kernel memory, then copy it over to user space.
The reduced copies also mean that less memory is used to store data from the file, which means more memory is available for other uses, which can reduce paging as well.
This is also, why it is generally a bad idea to use large caches within an I/O library: Modern kernels already cache everything they ever read from disk, caching a copy in user space means that the amount of data that can be cached is actually reduced.
Of course, you also avoid a lot of headaches that result from buffering data of unknown size in your application. But that is just a convenience for you as a programmer.
However, even though the kernel can exploit these properties, it does not necessarily do so. My experience is that Linux mmap() is generally fine; on AIX, however, I have witnessed really bad mmap() performance. So, if your goal is performance, it's the old measure-compare-decide standby.

mmap versus memory allocated with new

I have a BitVector class that can either allocate memory dynamically using new or it can mmap a file. There isn't a noticeable difference in performance when using it with small files, but when using a 16GB file I have found that the mmap file is far slower than the memory allocated with new. (Something like 10x slower or more.) Note that my machine has 64GB of RAM.
The code in question is loading values from a large disk file and placing them into a Bloom filter which uses my BitVector class for storage.
At first I thought this might be because the backing for the mmap file was on the same disk as the file I was loading from, but this didn't seem to be the issue. I put the two files on two physically different disks, and there was no change in performance. (Although I believe they are on the same controller.)
Then, I used mlock to try to force everything into RAM, but the mmap implementation was still really slow.
So, for the time being I'm just allocating the memory directly. The only thing I'm changing in the code for this comparison is a flag in the BitVector constructor.
Note that to measure performance I'm both looking at top and watching how many states I can add into the Bloom filter per second. The CPU usage doesn't even register on top when using mmap - although jbd2/sda1-8 starts to move up (I'm running on an Ubuntu server), which looks to be a process that is dealing with journaling for the drive. The input and output files are stored on two HDDs.
Can anyone explain this huge difference in performance?
Thanks!
Just to start with, mmap is a system call or interface provided to access the virtual memory of the system.
Now, in Linux (I hope you are working on *nix) a lot of the performance improvement is achieved by lazy loading, more commonly known as Copy-On-Write.
For mmap as well, this kind of lazy loading is implemented.
What happens is, when you call mmap on a file, the kernel does not immediately allocate main memory pages for the file to be mapped. Instead, it waits for the program to write to or read from the illusionary page, at which point a page fault occurs, and the corresponding interrupt handler then actually loads the particular part of the file that can be held in that page frame (the page table is also updated, so that next time, when you read or write the same page, it points to a valid frame).
Now, you can control this behavior with mlock, madvise, the MAP_POPULATE flag to mmap, etc.
The MAP_POPULATE flag to mmap tells the kernel to map the file into memory pages before the call returns, rather than page faulting every time you access a new page. So, until the file is loaded, the call will block.
From the man page:
MAP_POPULATE (since Linux 2.5.46)
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
MAP_POPULATE is supported for private mappings only since Linux 2.6.23.
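For example (Linux-specific; assume st holds the result of fstat on fd):

// Prefault the whole file up front; read-ahead is issued before mmap returns.
char* p = (char*)mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);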

Speeding up file I/O: mmap() vs. read()

I have a Linux application that reads 150-200 files (4-10GB) in parallel. Each file is read in turn in small, variably sized blocks, typically less than 2K each.
I currently need to maintain over 200 MB/s read rate combined from the set of files. The disks handle this just fine. There is a projected requirement of over 1 GB/s (which is out of the disk's reach at the moment).
We have implemented two different read systems both make heavy use of posix_advise: first is a mmaped read in which we map the entirety of the data set and read on demand.
The second is a read()/seek() based system.
Both work well, but only for moderate cases: the read() method manages our overall file cache much better and can deal well with hundreds of GB of files, but is badly rate limited; mmap is able to pre-cache data, making the sustained data rate of over 200 MB/s easy to maintain, but cannot deal with large total data set sizes.
So my question comes to these:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
B: Are there systematic ways for mmap to better deal with very large mapped data?
Mmap-vs-reading-blocks is a similar problem to the one I am working on, and it provided a good starting point, along with the discussions in mmap-vs-read.
Reads back to what? What is the final destination of this data?
Since it sounds like you are completely IO bound, mmap and read should make no difference. The interesting part is in how you get the data to your receiver.
Assuming you're putting this data to a pipe, I recommend you just dump the contents of each file in its entirety into the pipe. To do this using zero-copy, try the splice system call. You might also try copying the file manually, or forking an instance of cat or some other tool that can buffer heavily with the current file as stdin, and the pipe as stdout.
pid_t pid;
int status;
if ((pid = fork())) {            // parent: wait for the child to finish copying
    waitpid(pid, &status, 0);
} else {                         // child: read from the source file, write to the destination pipe
    dup2(dest, 1);               // stdout -> pipe
    dup2(source, 0);             // stdin  -> file
    execlp("cat", "cat", (char *)NULL);
}
Update0
If your processing is file-agnostic, and doesn't require random access, you want to create a pipeline using the options outlined above. Your processing step should accept data from stdin, or a pipe.
To answer your more specific questions:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
That's as good as it gets with regard to telling the kernel what to do from userspace. The rest is up to you: buffering, threading etc. but it's dangerous and probably unproductive guess work. I'd just go with splicing the files into a pipe.
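A minimal sketch of splicing a file into a pipe (Linux-specific, needs _GNU_SOURCE; file_fd, pipe_fd and file_size are assumed to exist):

loff_t off = 0;
while (off < file_size) {
    // move up to 1 MB per call from the file into the pipe without copying through user space
    ssize_t n = splice(file_fd, &off, pipe_fd, NULL, 1 << 20, SPLICE_F_MORE);
    if (n <= 0) break;  // 0 = end of file, -1 = error (check errno)
}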
B: Are there systematic ways for mmap to better deal with very large mapped data?
Yes. The following options may give you awesome performance benefits (and may make mmap worth using over read, with testing):
MAP_HUGETLB
Allocate the mapping using "huge pages."
This will reduce the paging overhead in the kernel, which is great if you will be mapping gigabyte sized files.
MAP_NORESERVE
Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available.
This will prevent you from running out of memory while keeping your implementation simple, if you don't actually have enough physical memory plus swap for the entire mapping.
MAP_POPULATE
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This may give you speed-ups with sufficient hardware resources, and if the prefetching is ordered, and lazy. I suspect this flag is redundant, the VFS likely does this better by default.
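Putting a couple of these together, a hedged sketch (MAP_HUGETLB is left out here because for file-backed mappings it generally requires hugetlbfs; whether each flag helps depends on the kernel, so measure):

char* p = (char*)mmap(NULL, map_len, PROT_READ,
                      MAP_PRIVATE | MAP_NORESERVE | MAP_POPULATE, fd, 0);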
Perhaps using the readahead system call might help, if your program can predict in advance the file fragments it wants to read (but this is only a guess, I could be wrong).
And I think you should tune your application, and perhaps even your algorithms, to read data in chunks much bigger than a few kilobytes. Can't that be half a megabyte instead?
The problem here doesn't seem to be which API is used. It doesn't matter whether you use mmap() or read(); the disk still has to seek to the specified point and read the data (although the OS does help to optimize the access).
mmap() has advantages over read() if you read very small chunks (a couple of bytes), because you don't have to call the OS for every chunk, which becomes very slow.
I would also advise, as Basile did, reading more than 2 KB consecutively so the disk doesn't have to seek that often.

Manipulating data in Memory instead of file

consider the function below:
int Func(char* filename);
int Func(FILE* filepointer);
These two do the same thing: read a lot of data from the given file (by name or pointer), analyze the result, and return it.
I want to call this function with lots of different data. Therefore I would have to write the data into a file, then pass the new filename to Func. But the data is huge, and reading and writing it on disk is very slow; the analysis time is actually much less than the I/O time.
Can I get rid of saving/loading the data all the time by any means?
For example, by making a FILE* pointer which points somewhere in memory?
Update: obviously I don't have the source code of Func! It's a DLL call.
You could use memory-mapped file technique or something like boost::iostreams with custom memory sinks / sources.
Actually, the second variant is a lot more flexible, but sometimes all that flexibility and versatility is simply not needed.
In many operating systems you can use an in-memory filesystem such as tmpfs -- and in Windows "temporary files" (opened with the appropriate flags, then rewound rather than closed) behave similarly (i.e., can stay in memory).
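A hedged sketch of that Windows temporary-file trick, wrapped into a FILE* via the CRT interop functions (names are illustrative):

// A temporary file the cache manager tries to keep in memory, deleted automatically on close.
HANDLE h = CreateFileA("scratch.tmp", GENERIC_READ | GENERIC_WRITE, 0, NULL,
                       CREATE_ALWAYS,
                       FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE, NULL);
int fd = _open_osfhandle((intptr_t)h, 0);   // adopt the handle into the CRT
FILE* fp = _fdopen(fd, "w+b");
// ... write the data, rewind(fp), then call Func(fp) ...
fclose(fp);  // also closes the handle and removes the file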
However, there isn't all that much to be gained there compared to writing (with lots of buffering) and reading (ditto) sequentially from an un-fragmented disk, for large files -- tmpfs's performance advantages are mostly for small files. If your performance is very bad, either the disk is horribly fragmented, or (perhaps more likely in these days of self-adjusting filesystems) you're not using buffering appropriately (possibly just not buffering enough). (Of course, both factors could be in play.) Modern devices and filesystems can have awesome performance when just streaming huge buffers to and from memory, after all.
For a given amount of RAM devoted to buffering, you can get better performance (for what at the app level looks like huge numbers of tiny writes and reads) if that RAM is in userland in your app's address space (rather than under kernel control, e.g. in a tmpfs), simply because you'll need fewer context switches -- and switches from user to kernel mode and back tend to dominate runtime when the only other ops performed are copies of small amounts of memory back and forth. When you use very large buffers in your app's stdio library, your "I/O" amounts to userland memory-to-memory copies within your address space, with very rare "streaming" ops that actually transfer those buffers back and forth.