Related
My program works with large data sets that need to be stored in contiguous memory (several Gigabytes). Allocating memory using std::allocator (i.e. malloc or new) causes system stalls as large portions of virtual memory are reserved and physical memory gets filled up.
Since the program will mostly only work on small portions at a time, my question is if using memory mapped files would provide an advantage (i.e. mmap or the Windows equivalent.) That is creating a large sparse temporary file and mapping it to virtual memory. Or is there another technique that would change the system's pagination strategy such that less pages are loaded into physical memory at a time.
I'm trying to avoid building a streaming mechanism that loads portions of a file at a time and instead rely on the system's vm pagination.
Yes, mmap has the potential to speed things up.
Things to consider:
Remember the VMM will page things in and out in page size blocked (4k on Linux)
If your memory access is well localised over time, this will work well. But if you do random access over your entire file, you will end up with a lot of seeking and thrashing (still). So, consider whether your 'small portions' correspond with localised bits of the file.
For large allocations, malloc and free will use mmap with MAP_ANON anyway. So the difference in memory mapping a file is simply that you are getting the VMM to do the I/O for you.
Consider using madvise with mmap to assist the VMM in paging well.
When you use open and read (plus, as erenon suggests, posix_fadvise), your file is still held in buffers anyway (i.e. it's not immediately written out) unless you also use O_DIRECT. So in both situations, you are relying on the kernel for I/O scheduling.
If the data is already in a file, it would speed up things, especially in the non-sequential case. (In the sequential case, read wins)
If using open and read, consider using posix_fadvise as well.
This really depends on your mmap() implementation. Mapping a file into memory has several advantages that can be exploited by the kernel:
The kernel knows that the contents of the mmap() pages is already present on disk. If it decides to evict these pages, it can omit the write back.
You reduce copying operations: read() operations typically first read the data into kernel memory, then copy it over to user space.
The reduced copies also mean that less memory is used to store data from the file, which means more memory is available for other uses, which can reduce paging as well.
This is also, why it is generally a bad idea to use large caches within an I/O library: Modern kernels already cache everything they ever read from disk, caching a copy in user space means that the amount of data that can be cached is actually reduced.
Of course, you also avoid a lot of headaches that result from buffering data of unknown size in your application. But that is just a convenience for you as a programmer.
However, even though the kernel can exploit these properties, it does not necessarily do so. My experience is that LINUX mmap() is generally fine; on AIX, however, I have witnessed really bad mmap() performance. So, if your goal is performance, it's the old measure-compare-decide stand by.
I have a Linux application that reads 150-200 files (4-10GB) in parallel. Each file is read in turn in small, variably sized blocks, typically less than 2K each.
I currently need to maintain over 200 MB/s read rate combined from the set of files. The disks handle this just fine. There is a projected requirement of over 1 GB/s (which is out of the disk's reach at the moment).
We have implemented two different read systems both make heavy use of posix_advise: first is a mmaped read in which we map the entirety of the data set and read on demand.
The second is a read()/seek() based system.
Both work well but only for the moderate cases, the read() method manages our overall file cache much better and can deal well with 100s of GB of files, but is badly rate limited, mmap is able to pre-cache data making the sustained data rate of over 200MB/s easy to maintain, but cannot deal with large total data set sizes.
So my question comes to these:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
B: Are there systematic ways for mmap to better deal with very large mapped data?
Mmap-vs-reading-blocks
is a similar problem to what I am working and provided a good starting point on this problem, along with the discussions in mmap-vs-read.
Reads back to what? What is the final destination of this data?
Since it sounds like you are completely IO bound, mmap and read should make no difference. The interesting part is in how you get the data to your receiver.
Assuming you're putting this data to a pipe, I recommend you just dump the contents of each file in its entirety into the pipe. To do this using zero-copy, try the splice system call. You might also try copying the file manually, or forking an instance of cat or some other tool that can buffer heavily with the current file as stdin, and the pipe as stdout.
if (pid = fork()) {
waitpid(pid, ...);
} else {
dup2(dest, 1);
dup2(source, 0);
execlp("cat", "cat");
}
Update0
If your processing is file-agnostic, and doesn't require random access, you want to create a pipeline using the options outlined above. Your processing step should accept data from stdin, or a pipe.
To answer your more specific questions:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
That's as good as it gets with regard to telling the kernel what to do from userspace. The rest is up to you: buffering, threading etc. but it's dangerous and probably unproductive guess work. I'd just go with splicing the files into a pipe.
B: Are there systematic ways for mmap to better deal with very large mapped data?
Yes. The following options may give you awesome performance benefits (and may make mmap worth using over read, with testing):
MAP_HUGETLB
Allocate the mapping using "huge pages."
This will reduce the paging overhead in the kernel, which is great if you will be mapping gigabyte sized files.
MAP_NORESERVE
Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available.
This will prevent you running out of memory while keeping your implementation simple if you don't actually have enough physical memory + swap for the entire mapping.**
MAP_POPULATE
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This may give you speed-ups with sufficient hardware resources, and if the prefetching is ordered, and lazy. I suspect this flag is redundant, the VFS likely does this better by default.
Perhaps using the readahead system call might help, if your program can predict in advance the file fragments it wants to read (but this is only a guess, I could be wrong).
And I think you should tune your application, and perhaps even your algorithms, to read data in chunk much bigger than a few kilobytes. Can't than be half a megabyte instead?
The problem here doesn't seem to be which api is used. It doesn't matter if you use mmap() or read(), the disc still has to seek to the specified point and read the data (although the os does help to optimize the access).
mmap() has advantages over read() if you read very small chunks (a couple of bytes) because you don't have call the os for every chunk, which becomes very slow.
I would also advise like Basile did to read more than 2kb consecutively so the disc doesn't have to seek that often.
I have a situation where I need to work with a number (15-30) of large (several hundreds mb) data structures. They won't fit into memory all at the same time. To make things worse, the algorithms operating on them work across all those structures, i.e. not first one, then the other etc. I need to make this as fast as possible.
So I figured I'd allocate memory on disk, in files that are basically direct binary representations of the data when it's loaded into memory, and use memory mapped files to access the data. I use mmap 'views' of for example 50 megabytes (50 mb of the files are loaded into memory at a time), so when I have 15 data sets, my process uses 750 mb of memory for the data. Which was OK initially (for testing), when I have more data I adjust the 50 mb down at the cost of some speed.
However this heuristic is hard-coded for now (I know the size of the data set I will test with). 'In the wild', my software will need to be able to determine the 'right' amount of memory to allocate to maximize performance. I could say 'I will target a memory use of 500 mb' and then divide 500 by the amount of data structures to come to a mmap view size. I have found that when trying to set this 'target memory usage' too high, that the virtual memory manager disk thrashing will (almost) lock up the machine and render it unusable until the processing finishes. This is to be avoided in my 'production' solution.
So my questions, all somewhat different approaches to the problem:
What is the 'best' target size for a single process? Should I just try to max out the 2gb that I have (assuming 32 bit Win XP and up, non-/3GB for now) or try to keep my process size smaller so that my software won't hog the machine? When I have 2 Visual Studio's, Outlook and a Firefox open on my machine, those use 1/2 gb of virtual memory easily by themselves - if I let my software use 2 gb of virtual memory the swapping will severely slow down the machine. But then how do I determine the 'best' process size.
What can I do to keep performance of the machine in check when working with memory-mapped files? My application does fairly simple numerical operations on the data, which basically means that it zips over hundreds of megabytes of data real quick, causing the whole memory-mapped files (several gigabytes) to be loaded into memory and swapped out again very quickly, again and again (think Monte Carlo style simulation).
Is there any chance that not using memory-mapped files and just using fseek/fgets is going to be faster or less intrusive than using memory mapped files?
Any articles, papers or books I can read about this? Either with 'cookbook' style solutions or fundamental concepts.
Thanks.
It occurs to me that you could set some predefined threshold for "too darn slow" and use the computer's wall-clock to make your alterations on the fly.
Start conservatively low. If this is below your "too darn slow" threshold, bump the size up a little bit for the next file. do this iteratively. When you go above the threshold, slowly back the size off iteratively.
I think it's a good place to try Address Windowing Extensions: http://msdn.microsoft.com/en-us/library/aa366527(v=VS.85).aspx
It will allow to use more than 4GB of memory by providing a sliding window. The drawback is that not all versions of windows have it.
I probably wouldn't use a memory-mapped file for this app. Memory-mapped files work best when you have a large virtual address space (at least relative to the size of the data you're processing). You map the entire file, and let the OS decide which pieces remain resident.
However, if you're repeatedly mapping and unmapping segments of the file (rather than the entire file), you'll probably end up doing just as well by reading chunks via fseek and fread -- note, however, that you do not want to read individual pieces of data this way (ie, do one large read rather than a lot of small reads).
The one way that manually segmented memory-mapped files might win is if you have sparse reads: if you'll only be touching, say 10% of a given file. In this case, memory mapping means the OS will read only those pages that are touched, whereas explicit reads will load the entire file.
Oh, and I would definitely not spend time trying to control my resource consumption. The OS will do that better than you can, because it knows about all competing processes.
It will probably be best to fix the size of the memory mapped file to be a some percentage of the total system memory with probably a set minimum.
Remember that the operating system will effectively load a whole memory page when you access a single byte, this may well happen in the background but will only be fast if sequential data accesses tend to be close together.
You should therefore try to keep sequential accesses to your data as close together in memory/the file as possible. You can also look a preloading strategies access your data speculatively before actually requiring the data. These are the same considerations that you will need when optimizing for memory cache efficiency.
If sequential data accesses are scattered widely in your file, you may be better off using fseek and fread to access the data since this will give you better fine-grain control of what data is written to memory when.
Also remember that there are no hard and fast rules. Optimizations can sometimes be counter-intuitive so try a whole bunch of different things and see which works best on the platform that this will need to operate on.
Perhaps you can use /LARGEADDRESSAWARE for you linker of Visual Studio, and use bcdedit for your process to use memory larger than 2GB.
Assuming the following for...
Output:
The file is opened...
Data is 'streamed' to disk. The data in memory is in a large contiguous buffer. It is written to disk in its raw form directly from that buffer. The size of the buffer is configurable, but fixed for the duration of the stream. Buffers are written to the file, one after another. No seek operations are conducted.
...the file is closed.
Input:
A large file (sequentially written as above) is read from disk from beginning to end.
Are there generally accepted guidelines for achieving the fastest possible sequential file I/O in C++?
Some possible considerations:
Guidelines for choosing the optimal buffer size
Will a portable library like boost::asio be too abstracted to expose the intricacies of a specific platform, or can they be assumed to be optimal?
Is asynchronous I/O always preferable to synchronous? What if the application is not otherwise CPU-bound?
I realize that this will have platform-specific considerations. I welcome general guidelines as well as those for particular platforms.
(my most immediate interest in Win x64, but I am interested in comments on Solaris and Linux as well)
Are there generally accepted guidelines for achieving the fastest possible sequential file I/O in C++?
Rule 0: Measure. Use all available profiling tools and get to know them. It's almost a commandment in programming that if you didn't measure it you don't know how fast it is, and for I/O this is even more true. Make sure to test under actual work conditions if you possibly can. A process that has no competition for the I/O system can be over-optimized, fine-tuned for conditions that don't exist under real loads.
Use mapped memory instead of writing to files. This isn't always faster but it allows the opportunity to optimize the I/O in an operating system-specific but relatively portable way, by avoiding unnecessary copying, and taking advantage of the OS's knowledge of how the disk actually being used. ("Portable" if you use a wrapper, not an OS-specific API call).
Try and linearize your output as much as possible. Having to jump around memory to find the buffers to write can have noticeable effects under optimized conditions, because cache lines, paging and other memory subsystem issues will start to matter. If you have lots of buffers look into support for scatter-gather I/O which tries to do that linearizing for you.
Some possible considerations:
Guidelines for choosing the optimal buffer size
Page size for starters, but be ready to tune from there.
Will a portable library like boost::asio be too abstracted to expose the intricacies
of a specific platform, or can they be assumed to be optimal?
Don't assume it's optimal. It depends on how thoroughly the library gets exercised on your platform, and how much effort the developers put into making it fast. Having said that a portable I/O library can be very fast, because fast abstractions exist on most systems, and it's usually possible to come up with a general API that covers a lot of the bases. Boost.Asio is, to the best of my limited knowledge, fairly fine tuned for the particular platform it is on: there's a whole family of OS and OS-variant specific APIs for fast async I/O (e.g. epoll, /dev/epoll, kqueue, Windows overlapped I/O), and Asio wraps them all.
Is asynchronous I/O always preferable to synchronous? What if the application is not otherwise CPU-bound?
Asynchronous I/O isn't faster in a raw sense than synchronous I/O. What asynchronous I/O does is ensure that your code is not wasting time waiting for the I/O to complete. It is faster in a general way than the other method of not wasting that time, namely using threads, because it will call back into your code when I/O is ready and not before. There are no false starts or concerns with idle threads needing to be terminated.
A general advice is to turn off buffering and read/write in large chunks (but not too large, then you will waste too much time waiting for the whole I/O to complete where otherwise you could start munching away at the first megabyte already. It's trivial to find the sweet spot with this algorithm, there's only one knob to turn: the chunk size).
Beyond that, for input mmap()ing the file shared and read-only is (if not the fastest, then) the most efficient way. Call madvise() if your platform has it, to tell the kernel how you will traverse the file, so it can do readahead and throw out the pages afterwards again quickly.
For output, if you already have a buffer, consider underpinning it with a file (also with mmap()), so you don't have to copy the data in userspace.
If mmap() is not to your liking, then there's fadvise(), and, for the really tough ones, async file I/O.
(All of the above is POSIX, Windows names may be different).
For Windows, you'll want to make sure you use the FILE_FLAG_SEQUENTIAL_SCAN in your CreateFile() call, if you opt to use the platform specific Windows API call. This will optimize caching for the I/O. As far as buffer sizes go, a buffer size that is a multiple of the disk sector size is typically advised. 8K is a nice starting point with little to be gained from going larger.
This article discusses the comparison between async and sync on Windows.
http://msdn.microsoft.com/en-us/library/aa365683(VS.85).aspx
As you noted above it all depends on the machine / system / libraries that you are using. A fast solution on one system may be slow on another.
A general guideline though would be to write in as large of chunks as possible.Typically writing a byte at a time is the slowest.
The best way to know for sure is to code a few different ways and profile them.
You asked about C++, but it sounds like you're past that and ready to get a little platform-specific.
On Windows, FILE_FLAG_SEQUENTIAL_SCAN with a file mapping is probably the fastest way. In fact, your process can exit before the file actually makes it on to the disk. Without an explicitly-blocking flush operation, it can take up to 5 minutes for Windows to begin writing those pages.
You need to be careful if the files are not on local devices but a network drive. Network errors will show up as SEH errors, which you will need to be prepared to handle.
On *nixes, you might get a bit higher performance writing sequentially to a raw disk device. This is possible on Windows too, but not as well supported by the APIs. This will avoid a little filesystem overhead, but it may not amount to enough to be useful.
Loosely speaking, RAM is 1000 or more times faster than disks, and CPU is faster still. There are probably not a lot of logical optimizations that will help, except avoiding movements of the disk heads (seek) whenever possible. A dedicated disk just for this file can help significantly here.
You will get the absolute fastest performance by using CreateFile and ReadFile. Open the file with FILE_FLAG_SEQUENTIAL_SCAN.
Read with a buffer size that is a power of two. Only benchmarking can determine this number. I have seen it to be 8K once. Another time I found it to be 8M! This varies wildly.
It depends on the size of the CPU cache, on the efficiency of OS read-ahead and on the overhead associated with doing many small writes.
Memory mapping is not the fastest way. It has more overhead because you can't control the block size and the OS needs to fault in all pages.
On Linux, buffered reads and writes speed up things a lot up, increasingly with increasing buffers sizes, but the returns are diminishing and you generally want to use BUFSIZ (defined by stdio.h) as larger buffer sizes won't help much.
mmaping provides the fastest access to files, but the mmap call itself is rather expensive. For small files (16KiB) read and write system calls win (see https://stackoverflow.com/a/39196499/1084774 for the numbers on reading through read and mmap).
I'm working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable length records. I've got a first implementation up and running and am now looking towards improving performance, particularly at doing I/O more efficiently since the input file gets scanned many times.
Is there a rule of thumb for using mmap() versus reading in blocks via C++'s fstream library? What I'd like to do is read large blocks from disk into a buffer, process complete records from the buffer, and then read more.
The mmap() code could potentially get very messy since mmap'd blocks need to lie on page sized boundaries (my understanding) and records could potentially lie across page boundaries. With fstreams, I can just seek to the start of a record and begin reading again, since we're not limited to reading blocks that lie on page sized boundaries.
How can I decide between these two options without actually writing up a complete implementation first? Any rules of thumb (e.g., mmap() is 2x faster) or simple tests?
I was trying to find the final word on mmap / read performance on Linux and I came across a nice post (link) on the Linux kernel mailing list. It's from 2000, so there have been many improvements to IO and virtual memory in the kernel since then, but it nicely explains the reason why mmap or read might be faster or slower.
A call to mmap has more overhead than read (just like epoll has more overhead than poll, which has more overhead than read). Changing virtual memory mappings is a quite expensive operation on some processors for the same reasons that switching between different processes is expensive.
The IO system can already use the disk cache, so if you read a file, you'll hit the cache or miss it no matter what method you use.
However,
Memory maps are generally faster for random access, especially if your access patterns are sparse and unpredictable.
Memory maps allow you to keep using pages from the cache until you are done. This means that if you use a file heavily for a long period of time, then close it and reopen it, the pages will still be cached. With read, your file may have been flushed from the cache ages ago. This does not apply if you use a file and immediately discard it. (If you try to mlock pages just to keep them in cache, you are trying to outsmart the disk cache and this kind of foolery rarely helps system performance).
Reading a file directly is very simple and fast.
The discussion of mmap/read reminds me of two other performance discussions:
Some Java programmers were shocked to discover that nonblocking I/O is often slower than blocking I/O, which made perfect sense if you know that nonblocking I/O requires making more syscalls.
Some other network programmers were shocked to learn that epoll is often slower than poll, which makes perfect sense if you know that managing epoll requires making more syscalls.
Conclusion: Use memory maps if you access data randomly, keep it around for a long time, or if you know you can share it with other processes (MAP_SHARED isn't very interesting if there is no actual sharing). Read files normally if you access data sequentially or discard it after reading. And if either method makes your program less complex, do that. For many real world cases there's no sure way to show one is faster without testing your actual application and NOT a benchmark.
(Sorry for necro'ing this question, but I was looking for an answer and this question kept coming up at the top of Google results.)
There are lots of good answers here already that cover many of the salient points, so I'll just add a couple of issues I didn't see addressed directly above. That is, this answer shouldn't be considered a comprehensive of the pros and cons, but rather an addendum to other answers here.
mmap seems like magic
Taking the case where the file is already fully cached1 as the baseline2, mmap might seem pretty much like magic:
mmap only requires 1 system call to (potentially) map the entire file, after which no more system calls are needed.
mmap doesn't require a copy of the file data from kernel to user-space.
mmap allows you to access the file "as memory", including processing it with whatever advanced tricks you can do against memory, such as compiler auto-vectorization, SIMD intrinsics, prefetching, optimized in-memory parsing routines, OpenMP, etc.
In the case that the file is already in the cache, it seems impossible to beat: you just directly access the kernel page cache as memory and it can't get faster than that.
Well, it can.
mmap is not actually magic because...
mmap still does per-page work
A primary hidden cost of mmap vs read(2) (which is really the comparable OS-level syscall for reading blocks) is that with mmap you'll need to do "some work" for every 4K page accessed in a new mapping, even though it might be hidden by the page-fault mechanism.
For a example a typical implementation that just mmaps the entire file will need to fault-in so 100 GB / 4K = 25 million faults to read a 100 GB file. Now, these will be minor faults, but 25 million page faults is still not going to be super fast. The cost of a minor fault is probably in the 100s of nanos in the best case.
mmap relies heavily on TLB performance
Now, you can pass MAP_POPULATE to mmap to tell it to set up all the page tables before returning, so there should be no page faults while accessing it. Now, this has the little problem that it also reads the entire file into RAM, which is going to blow up if you try to map a 100GB file - but let's ignore that for now3. The kernel needs to do per-page work to set up these page tables (shows up as kernel time). This ends up being a major cost in the mmap approach, and it's proportional to the file size (i.e., it doesn't get relatively less important as the file size grows)4.
Finally, even in user-space accessing such a mapping isn't exactly free (compared to large memory buffers not originating from a file-based mmap) - even once the page tables are set up, each access to a new page is going to, conceptually, incur a TLB miss. Since mmaping a file means using the page cache and its 4K pages, you again incur this cost 25 million times for a 100GB file.
Now, the actual cost of these TLB misses depends heavily on at least the following aspects of your hardware: (a) how many 4K TLB enties you have and how the rest of the translation caching works performs (b) how well hardware prefetch deals with with the TLB - e.g., can prefetch trigger a page walk? (c) how fast and how parallel the page walking hardware is. On modern high-end x86 Intel processors, the page walking hardware is in general very strong: there are at least 2 parallel page walkers, a page walk can occur concurrently with continued execution, and hardware prefetching can trigger a page walk. So the TLB impact on a streaming read load is fairly low - and such a load will often perform similarly regardless of the page size. Other hardware is usually much worse, however!
read() avoids these pitfalls
The read() syscall, which is what generally underlies the "block read" type calls offered e.g., in C, C++ and other languages has one primary disadvantage that everyone is well-aware of:
Every read() call of N bytes must copy N bytes from kernel to user space.
On the other hand, it avoids most the costs above - you don't need to map in 25 million 4K pages into user space. You can usually malloc a single buffer small buffer in user space, and re-use that repeatedly for all your read calls. On the kernel side, there is almost no issue with 4K pages or TLB misses because all of RAM is usually linearly mapped using a few very large pages (e.g., 1 GB pages on x86), so the underlying pages in the page cache are covered very efficiently in kernel space.
So basically you have the following comparison to determine which is faster for a single read of a large file:
Is the extra per-page work implied by the mmap approach more costly than the per-byte work of copying file contents from kernel to user space implied by using read()?
On many systems, they are actually approximately balanced. Note that each one scales with completely different attributes of the hardware and OS stack.
In particular, the mmap approach becomes relatively faster when:
The OS has fast minor-fault handling and especially minor-fault bulking optimizations such as fault-around.
The OS has a good MAP_POPULATE implementation which can efficiently process large maps in cases where, for example, the underlying pages are contiguous in physical memory.
The hardware has strong page translation performance, such as large TLBs, fast second level TLBs, fast and parallel page-walkers, good prefetch interaction with translation and so on.
... while the read() approach becomes relatively faster when:
The read() syscall has good copy performance. E.g., good copy_to_user performance on the kernel side.
The kernel has an efficient (relative to userland) way to map memory, e.g., using only a few large pages with hardware support.
The kernel has fast syscalls and a way to keep kernel TLB entries around across syscalls.
The hardware factors above vary wildly across different platforms, even within the same family (e.g., within x86 generations and especially market segments) and definitely across architectures (e.g., ARM vs x86 vs PPC).
The OS factors keep changing as well, with various improvements on both sides causing a large jump in the relative speed for one approach or the other. A recent list includes:
Addition of fault-around, described above, which really helps the mmap case without MAP_POPULATE.
Addition of fast-path copy_to_user methods in arch/x86/lib/copy_user_64.S, e.g., using REP MOVQ when it is fast, which really help the read() case.
Update after Spectre and Meltdown
The mitigations for the Spectre and Meltdown vulnerabilities considerably increased the cost of a system call. On the systems I've measured, the cost of a "do nothing" system call (which is an estimate of the pure overhead of the system call, apart from any actual work done by the call) went from about 100 ns on a typical modern Linux system to about 700 ns. Furthermore, depending on your system, the page-table isolation fix specifically for Meltdown can have additional downstream effects apart from the direct system call cost due to the need to reload TLB entries.
All of this is a relative disadvantage for read() based methods as compared to mmap based methods, since read() methods must make one system call for each "buffer size" worth of data. You can't arbitrarily increase the buffer size to amortize this cost since using large buffers usually performs worse since you exceed the L1 size and hence are constantly suffering cache misses.
On the other hand, with mmap, you can map in a large region of memory with MAP_POPULATE and the access it efficiently, at the cost of only a single system call.
1 This more-or-less also includes the case where the file wasn't fully cached to start with, but where the OS read-ahead is good enough to make it appear so (i.e., the page is usually cached by the time you want it). This is a subtle issue though because the way read-ahead works is often quite different between mmap and read calls, and can be further adjusted by "advise" calls as described in 2.
2 ... because if the file is not cached, your behavior is going to be completely dominated by IO concerns, including how sympathetic your access pattern is to the underlying hardware - and all your effort should be in ensuring such access is as sympathetic as possible, e.g. via use of madvise or fadvise calls (and whatever application level changes you can make to improve access patterns).
3 You could get around that, for example, by sequentially mmaping in windows of a smaller size, say 100 MB.
4 In fact, it turns out the MAP_POPULATE approach is (at least one some hardware/OS combination) only slightly faster than not using it, probably because the kernel is using faultaround - so the actual number of minor faults is reduced by a factor of 16 or so.
The main performance cost is going to be disk i/o. "mmap()" is certainly quicker than istream, but the difference might not be noticeable because the disk i/o will dominate your run-times.
I tried Ben Collins's code fragment (see above/below) to test his assertion that "mmap() is way faster" and found no measurable difference. See my comments on his answer.
I would certainly not recommend separately mmap'ing each record in turn unless your "records" are huge - that would be horribly slow, requiring 2 system calls for each record and possibly losing the page out of the disk-memory cache.....
In your case I think mmap(), istream and the low-level open()/read() calls will all be about the same. I would recommend mmap() in these cases:
There is random access (not sequential) within the file, AND
the whole thing fits comfortably in memory OR there is locality-of-reference within the file so that certain pages can be mapped in and other pages mapped out. That way the operating system uses the available RAM to maximum benefit.
OR if multiple processes are reading/working on the same file, then mmap() is fantastic because the processes all share the same physical pages.
(btw - I love mmap()/MapViewOfFile()).
mmap is way faster. You might write a simple benchmark to prove it to yourself:
char data[0x1000];
std::ifstream in("file.bin");
while (in)
{
in.read(data, 0x1000);
// do something with data
}
versus:
const int file_size=something;
const int page_size=0x1000;
int off=0;
void *data;
int fd = open("filename.bin", O_RDONLY);
while (off < file_size)
{
data = mmap(NULL, page_size, PROT_READ, 0, fd, off);
// do stuff with data
munmap(data, page_size);
off += page_size;
}
Clearly, I'm leaving out details (like how to determine when you reach the end of the file in the event that your file isn't a multiple of page_size, for instance), but it really shouldn't be much more complicated than this.
If you can, you might try to break up your data into multiple files that can be mmap()-ed in whole instead of in part (much simpler).
A couple of months ago I had a half-baked implementation of a sliding-window mmap()-ed stream class for boost_iostreams, but nobody cared and I got busy with other stuff. Most unfortunately, I deleted an archive of old unfinished projects a few weeks ago, and that was one of the victims :-(
Update: I should also add the caveat that this benchmark would look quite different in Windows because Microsoft implemented a nifty file cache that does most of what you would do with mmap in the first place. I.e., for frequently-accessed files, you could just do std::ifstream.read() and it would be as fast as mmap, because the file cache would have already done a memory-mapping for you, and it's transparent.
Final Update: Look, people: across a lot of different platform combinations of OS and standard libraries and disks and memory hierarchies, I can't say for certain that the system call mmap, viewed as a black box, will always always always be substantially faster than read. That wasn't exactly my intent, even if my words could be construed that way. Ultimately, my point was that memory-mapped i/o is generally faster than byte-based i/o; this is still true. If you find experimentally that there's no difference between the two, then the only explanation that seems reasonable to me is that your platform implements memory-mapping under the covers in a way that is advantageous to the performance of calls to read. The only way to be absolutely certain that you're using memory-mapped i/o in a portable way is to use mmap. If you don't care about portability and you can rely on the particular characteristics of your target platforms, then using read may be suitable without sacrificing measurably any performance.
Edit to clean up answer list:
#jbl:
the sliding window mmap sounds
interesting. Can you say a little more
about it?
Sure - I was writing a C++ library for Git (a libgit++, if you will), and I ran into a similar problem to this: I needed to be able to open large (very large) files and not have performance be a total dog (as it would be with std::fstream).
Boost::Iostreams already has a mapped_file Source, but the problem was that it was mmapping whole files, which limits you to 2^(wordsize). On 32-bit machines, 4GB isn't big enough. It's not unreasonable to expect to have .pack files in Git that become much larger than that, so I needed to read the file in chunks without resorting to regular file i/o. Under the covers of Boost::Iostreams, I implemented a Source, which is more or less another view of the interaction between std::streambuf and std::istream. You could also try a similar approach by just inheriting std::filebuf into a mapped_filebuf and similarly, inheriting std::fstream into a mapped_fstream. It's the interaction between the two that's difficult to get right. Boost::Iostreams has some of the work done for you, and it also provides hooks for filters and chains, so I thought it would be more useful to implement it that way.
I'm sorry Ben Collins lost his sliding windows mmap source code. That'd be nice to have in Boost.
Yes, mapping the file is much faster. You're essentially using the the OS virtual memory subsystem to associate memory-to-disk and vice versa. Think about it this way: if the OS kernel developers could make it faster they would. Because doing so makes just about everything faster: databases, boot times, program load times, et cetera.
The sliding window approach really isn't that difficult as multiple continguous pages can be mapped at once. So the size of the record doesn't matter so long as the largest of any single record will fit into memory. The important thing is managing the book-keeping.
If a record doesn't begin on a getpagesize() boundary, your mapping has to begin on the previous page. The length of the region mapped extends from the first byte of the record (rounded down if necessary to the nearest multiple of getpagesize()) to the last byte of the record (rounded up to the nearest multiple of getpagesize()). When you're finished processing a record, you can unmap() it, and move on to the next.
This all works just fine under Windows too using CreateFileMapping() and MapViewOfFile() (and GetSystemInfo() to get SYSTEM_INFO.dwAllocationGranularity --- not SYSTEM_INFO.dwPageSize).
mmap should be faster, but I don't know how much. It very much depends on your code. If you use mmap it's best to mmap the whole file at once, that will make you life a lot easier. One potential problem is that if your file is bigger than 4GB (or in practice the limit is lower, often 2GB) you will need a 64bit architecture. So if you're using a 32 environment, you probably don't want to use it.
Having said that, there may be a better route to improving performance. You said the input file gets scanned many times, if you can read it out in one pass and then be done with it, that could potentially be much faster.
Perhaps you should pre-process the files, so each record is in a separate file (or at least that each file is a mmap-able size).
Also could you do all of the processing steps for each record, before moving onto the next one? Maybe that would avoid some of the IO overhead?
I agree that mmap'd file I/O is going to be faster, but while your benchmarking the code, shouldn't the counter example be somewhat optimized?
Ben Collins wrote:
char data[0x1000];
std::ifstream in("file.bin");
while (in)
{
in.read(data, 0x1000);
// do something with data
}
I would suggest also trying:
char data[0x1000];
std::ifstream iifle( "file.bin");
std::istream in( ifile.rdbuf() );
while( in )
{
in.read( data, 0x1000);
// do something with data
}
And beyond that, you might also try making the buffer size the same size as one page of virtual memory, in case 0x1000 is not the size of one page of virtual memory on your machine... IMHO mmap'd file I/O still wins, but this should make things closer.
I remember mapping a huge file containing a tree structure into memory years ago. I was amazed by the speed compared to normal de-serialization which involves lot of work in memory, like allocating tree nodes and setting pointers.
So in fact I was comparing a single call to mmap (or its counterpart on Windows)
against many (MANY) calls to operator new and constructor calls.
For such kind of task, mmap is unbeatable compared to de-serialization.
Of course one should look into boosts relocatable pointer for this.
This sounds like a good use-case for multi-threading... I'd think you could pretty easily setup one thread to be reading data while the other(s) process it. That may be a way to dramatically increase the perceived performance. Just a thought.
To my mind, using mmap() "just" unburdens the developer from having to write their own caching code. In a simple "read through file eactly once" case, this isn't going to be hard (although as mlbrock points out you still save the memory copy into process space), but if you're going back and forth in the file or skipping bits and so forth, I believe the kernel developers have probably done a better job implementing caching than I can...
I think the greatest thing about mmap is potential for asynchronous reading with:
addr1 = NULL;
while( size_left > 0 ) {
r = min(MMAP_SIZE, size_left);
addr2 = mmap(NULL, r,
PROT_READ, MAP_FLAGS,
0, pos);
if (addr1 != NULL)
{
/* process mmap from prev cycle */
feed_data(ctx, addr1, MMAP_SIZE);
munmap(addr1, MMAP_SIZE);
}
addr1 = addr2;
size_left -= r;
pos += r;
}
feed_data(ctx, addr1, r);
munmap(addr1, r);
Problem is that I can't find the right MAP_FLAGS to give a hint that this memory should be synced from file asap.
I hope that MAP_POPULATE gives the right hint for mmap (i.e. it will not try to load all contents before return from call, but will do that in async. with feed_data). At least it gives better results with this flag even that manual states that it does nothing without MAP_PRIVATE since 2.6.23.