Limit buffer cache used for mmap

Limit buffer cache used for mmap - c++

I have a data structure that I'd like to rework to page out on-demand. mmap seems like an easy way to run some initial experiments. However, I want to limit the amount of buffer cache that the mmap uses. The machine has enough memory to page the entire data structure into cache, but for test reasons (and some production reasons too) I don't want to allow it to do that.
Is there a way to limit the amount of buffer cache used by mmap?
Alternatively, an mmap alternative that can achieve something similar and still limit memory usage would work too.

From my understanding, it is not possible. Memory mapping is controlled by the operating system. The kernel will make the decisions how to use the available memory in the best way, but it looks at the system in total. I'm not aware that quotas for caches on a process level are supported (at least, I have not seen such APIs in Linux or BSD).
There is madvise to give the kernel hints, but it does not support to limit the cache used for one process. You can give it hints like MADV_DONTNEED, which will reduce the pressure on the cache of other applications, but I would expect that it will do more harm than good, as it will most likely make caching less efficient, which will lead to more IO load on the system in total.
I see only two alternatives. One is trying to solve the problem at the OS level, and the other is to solve it at the application level.
At the OS level, I see two options:
You could run a virtual machine, but most likely this is not what you want. I would also expect that it will not improve the overall system performance. Still, it would be at least a way to define upper limits on the memory consumption.
Docker is the another idea that comes to mind, also operating at the OS level, but to the best of my knowledge, it does not support defining cache quotas. I don't think it will work.
That leaves only one option, which is to look at the application level. Instead of using memory mapped files, you could use explicit file system operations. If you need to have full control over the buffer, I think it is the only practical option. It is more work than memory mapping, and it is also not guaranteed to perform better.
If you want to stay with memory mapping, you could also map only parts of the file in memory and unmap other parts when you exceed your memory quota. It also has the same problem as the explicit file IO operations (more implementation work and non-trivial tuning to find a good caching strategy).
Having said that, you could question the requirement to limit the cache memory usage. I would expect that the kernel does a pretty good job at allocating memory resources in a good way. At least, it will likely be better than the solutions that I have sketched. (Explicit file IO, plus an internal cache, might be fast, but it is not trivial to implement and tune. Here, is a comparison of the trade-offs: mmap() vs. reading blocks.)
During testing, you could run the application with ionice -c 3 and nice -n 20 to somewhat reduce the impact on the other productive applications.
There is also a tool called nocache. I never used it but when reading through its documentation, it seems somewhat related to your question.

It might be possible to accomplish this through the use of mmap() and Linux Control Groups (more generally, here or here). Once installed, you have the ability to create arbitrary limits on the amount of, among other things, physical memory used by a process. As an example, here we limit the physical memory to 128 megs and swap memory to 256 megs:
cgcreate -g memory:/limitMemory
echo $(( 128 * 1024 * 1024 )) > /sys/fs/cgroup/memory/limitMemory/memory.limit_in_bytes
echo $(( 256 * 1024 * 1024 )) > /sys/fs/cgroup/memory/limitMemory/memory.memsw.limit_in_bytes

I would go the route of only map parts of the file at a time so you can retain full control on exactly how much memory is used.

you may use ipc shared memory segment, you will be the master of your memory segments.

Related

Checking number of actually allocated pages using anonymous mmap()

I'm allocating some memory using mmap() with MAP_ANONYMOUS flag. Because of the problem's features, often I need to write some data to one page ( for example, located in the middle ) of large memory chunk, and leave the other part untouched. So, I wouldn't this rather big part of unused memory to allocate physically.
In my view, such mmap() call just give me a pointer to some 0-filled virtual pages, but due to to copy-on-write and demand paging mechanisms, no one page ( besides first maybe ) isn't actually allocated in RAM until first write attempt to it's memory.
The problem is: one moment I've got MAP_FAILED from mmap() ( there were a lot of successful calls with fewer allocation queries before ), when tried to allocate a memory chunk larger than my physical RAM size. So, it seems there are much more pages actually allocated, even if there wasn't write access to them.
I need your help guys with two questions, first:
Is my views to anonymous memory allocation correct, and (if not) , what are the inaccuracies?
And the second:
How could I measure number of actually allocated pages after anonymous mmap() is done? I've tried use mincore(), and it's results show me that almost all pages are "resident" in memory ( i.e., physically allocated? ). So, it seems mincore() results are wrong, or I'm totally stuck :(
Upd.
It seems memory overcommiting mentioned by #Art indeed could influence this. But when I'm trying to disable it ( setting /proc/sys/vm/overcommit_memory to 1 mode or using mmap() with MAP_NORESERVE flag, my machine is heavily freezing, and hard reset is the only thing that helps.

"It's complicated." (caveat: I have much more experience of VM systems from not Linux, but I've picked up a few details from Linux over the years)
Generally your assumption is true. Many operating systems don't actually do much on mmap other than record "there is an allocation of X pages of object Y at V". Access to those pages causes a page fault which leads to an actual allocation. Some (or maybe all?) versions of Linux map a pre-zeroed read-only page to such allocations (not sure if it has changed or if it's still done) and the rest is handled like copy-on-write (for me it feels like a bad idea because it should generate more TLB flushes for a dubious optimization of reading zeroed memory which should happen rarely, but I guess Linux people have benchmarked it and found it good).
There are a few caveats. Just because you don't use the RAM doesn't mean that you'll be allowed to overcommit so much memory. Certain Linux distributions started default to a setting that disallows overcommit or disallows more than X% overcommit. Centos/RedHat was one of those (this was probably a decade ago, I don't know the state today). This means that even when you're just using 5% of your actual physical memory, if you have created anonymous mappings of 100+X% of your RAM, mmap will fail. There is a sysctl for it, look it up. It's something like /proc/sys/vm/overcommit_memory or overcommit_ratio or both, or probably even more under there.
Then you have to be aware of resource limits. Check with ulimit if you're allowed to create mappings that big (ulimit -v), the problem could be as simple as that. Also (I'm not sure how Linux does it here, but you can create simple test programs to try it), resource limits for data size (ulimit -d) can be accounted as potential pages you can use rather than actual pages you currently use (in other words no overcommit in resource limits).
Then let's get back to faulting in only the pages you use. There has been research into detecting access patterns and predicting future faults (so that mmap:ed files could do read-ahead), but I don't know the state of it in Linux. It feels like it would be dumb to apply this to anonymous mappings, but you never know, maybe someone is trying to be clever. To make sure, use madvise(..., MADV_RANDOM) on the mmap:ed block you get.
Finally mincore. My experience from it on Linux is that it's garbage, last time I tried it it didn't work for anonymous mappings (or was it private?). In your case it could be as simple as it reporting that the read-only copy-on-write zeroed page for anonymous mappings is considered to be "in core".

Optimising data-structures so that they take advantage of virtual memory

I would like to know how to optimise data-structures in openCV (the mat type specifically) so that I am able to leverage the operating systems built in memory/virtual memory management.
For a full context please read the Q and A here - but otherwise the situation could be summed up that I have a large collection of mats* that I'll need to access arbitrarily and rapidly. The main complication is that full amount of data is well above the amount of RAM available.
(*Conceptually the data is a recursively defined 3D array of 3D arrays, but let's not muddy the water with that confusion!)
Rather than build my own LRU cache and RAM-hungry and inefficient 'page' addressing strategies to access it, I'd rather let the OS do this for me.
I think I get the concepts, but when it comes to the actual implementation I'm twiddling thumbs:
Is this a generic C++ consideration, or something I need to address at the openCV level?
Is it as simple as making the granularity of the of data close to (but not over) 4KB? (see the solution here for the 4KB motivation)
How would the mat(s) actually be saved, accessed and represented on disk? (is this how memory-mapping is involved?)

Is this a generic C++ consideration, or something I need to address at the openCV level?
You just allocate and use boatloads of memory. The whole point of paging / virtual memory is that it's completely transparent. Everything gets extremely slow, but keeps working. You don't get ENOMEM until you're out of swap space + RAM.
On a normal Linux system, your normal swap partition should be very small (under 1GB), so you'll probably need to dd a swap file, and mkswap / swapon on it. Make sure the swap file is has read-write permission for root only. Obviously every major OS will have its own procedures.
Is it as simple as making the granularity of the of data close to (but not over) 4KB? (see the solution here for the 4KB motivation)
If you have pointers to other data, make sure you keep them together. You want all the small "hot" data to be in only a few pages that a decent OS LRU algorithm won't page out.
If you have hot data mixed with cold data, it will easily get paged out and lead to an extra page-file round trip before the cache miss for the final data can even happen.
Like Yakk says, sequential access patterns will do much better, because disk I/O does better with multi-block reads. (Even SSDs have better throughput with larger blocks). This also allows prefetching, which allows one I/O request to start before the previous one's data arrives. Maxing out I/O throughput requires pipelining requests.
Try to design your algorithms to do sequential accesses when possible. This is advantageous at all levels of memory, from paging all the way up to L1 cache. Sequential access even enables auto-vectorization with vector-registers.
Cache blocking (aka loop tiling) techniques are also applicable to page misses. Google for details, but the main idea is to do all the steps of your algorithm over a subset of the data, instead of touching all the data at each step. Then each piece of data only has to be loaded into cache once total, instead of once for each step of your algorithm.
Think of DRAM as a cache for your giant virtual address space.
How would the mat(s) actually be saved, accessed and represented on disk? (is this how memory-mapping is involved?)
Swap space / the pagefile is the backing store for your process's address space. So yes, it's very similar to what you'd get if you allocated memory by mmaping a big file instead of making an anonymous allocation.

How to allocate a memory block and place it into Cache?

I want to dynamically allocate a memory block for an array in C/C++, and this array will be accessed at a high frequency. So I want this array to stay on chip, i.e., in the Cache. How can I do this explicitly with code in C/C++?

There is no standard C++ language feature that allows you to do this.
Depending on your compiler and CPU, you may be able to use an arch-specific CPU instruction in an asm block:
T* p = new T(...);
size_t n = sizeof(T);
asm {
"CACHE n bytes at address p"
}
...or some builtin compiler function ("intrinsic") that does this.
You will need to consult your CPU manual and/or your compiler manual.
As an example, x86 CPUs have a set of instructions starting with PREFETCH.
And another example, GCC has a function called __builtin_prefetch. See GCC Data Prefetch Support

I will try to answer this question from a bit different perspective. Do you really need to do this. And even if it would be a way to do so, will it worth it? Imagine there is a "magic" void * malloc_and_lock_in_cache( int cacheLevel ) function. What you going to do with this data. If it's an application limited to while (1) loop with random array access from single thread you will have such behaviour anyway due to optimisation and CPU architecture. If you think about more real world solutions you always have logic around access. For example locking for multithreading, certain conditions, etc. The the question - do the rest of your application algorithms are so perfect that only left to do is to allocate array in cache.
Do all other access/sorting/lookup functions are state-of-art logic which cannot be reviewed rather then gaining very limited performance kickback trying to overwrite CPU optimisation.
Also do you consider to run your application without ANY operation system on a raw hardware so you shouldn't care about how your allocation will affects OS behaviour, rest of application running around?
And what should happen if your application will run inside virtual machine or environments like XEN.?
I can remember one similar popular subject 15-18 years ago about physical memory usage and disk caching utilities. Indeed tools like MS-DOS smartdrive or similar utilities were REALLY useful and speed up things a lot. Usenet was full of 'tuning advices' and performance analyses for things like write-through/write-back settings.
Especially if your DOS application were processing large amounts of data and implemented some memory swapping logic (I am talking about times then 4MB RAM was luxury) that's became mostly a drama, that from one point of view you need as much memory you can, but from another point of view you need swapping, so you actually need to swap, but swapping goes through cache etc..
But what happened next. We've got VM386 mode, disk cache/memory swaps integrated into OS, and who was care anymore about things like tuning smartdrive/ramdisks. In general it was 'cheaper' to allocate as much as you need VM then implement own voodoo algorithms to swap physical memory blocks (although this functionality is still in WinAPI).
So I would really recommend to concentrate efforts on algorithms and application design rather then trying to use some very low level features with really unpredictable results until you dont develop some new microkernel OS.

I don't think you can. First, which cache? L3, L2, L1? You can prefetch, and align so it its access is more optimized, and then you can query it periodically maybe to make it stay and not go LRU'd, but you can't really make it stay in cache.

First you have to know what's the architecture of the machine you want to run the code on. Then you should check it there's an instruction doing that kind of stuff.
Actually using the memory heavily will force the cache controller to put this region in cache.
And there are three rules of optimizing, you may want to know them first :)
http://c2.com/cgi/wiki?RulesOfOptimization

Dumping buffered data when memory limit is near

My application buffers data for likely requests in the background. Currently I limit the size of the buffer based on a command-line parameter, and begin dumping less-used data when we hit this limit. This is not ideal because it relies on the user to specify a performance-critical parameter. Is there a better way to handle this? Is there a way to automatically monitor system memory use and dump the oldest/least-recently-used data before the system starts to thrash?
A complicating factor here is that my application runs on Linux, OSX, and Windows. But I'll take a good way to do this on only one platform over nothing.

Your best bet would likely be to monitor your applications working set/resident set size, and try to react when it doesn't grow after your allocations. Some pointers on what to look for:
Windows: GetProcessMemoryInfo
Linux: /proc/self/statm
OS X: task_info()
Windows also has GlobalMemoryStatusEx which gives you a nice Available Physical Memory figure.

I like your current solution. Letting the user decide is good. It's not obvious everyone would want the buffer to be as big as possible, is it? If you do invest in implemting some sort of memory monitor for automatically adjusting the buffer/cache size, at least let the user choose between the user set limit and the automatic/dynamic one.

I know this isn't a direct answer, but I'd say step back a bit and maybe don't do this.
Even if you have the API to see current physical memory usage, that's not enough to choose an ideal cache size. That would depend on your typical and future workloads for both the program and the machine (and the overall system of all clients running this program + the server(s) they're querying), the platform's caching behavior, whether the system should be tuned for throughput or latency, and so on. In a tight memory situation, you're going to be competing for memory with other opportunistic caches, including the OS's disk cache. On the one hand, you want to be exerting some pressure on them, to force out other low-value data. On the other hand, if you get greedy while there's plenty of memory, you're going to be affecting the behavior of other adaptive caches.
And with speculative caching/prefetching, the LRU value function is odd: you will (hopefully) fetch the most-likely-to-be-called data first, and less-likely data later, so the LRU data in your prefetch cache may be less valuable than older data. This could lead to perverse behavior in the systemwide set of caches by artificially "heating up" less commonly used data.
It seems unlikely that your program would be able to make a cache size choice better than a simple fixed size, perhaps scaled based on the size of overall physical memory on the machine. And there's very little chance it could beat a sysadmin who knows the machine's typical workload and its performance goals.
Using an adaptive cache sizing strategy means that your program's resource usage is going to be both variable and unpredictable. (With respect to both memory and the I/O and server requests used to populate that prefetch cache.) For a lot of server situations, that's not good. (Especially in HPC or DB servers, which this sounds like it might be for, or a high-utilization/high-throughput environment.) Consistency, configurability, and availability are often more important than maximum resource utilization. And locality of reference tends to fall off quickly, so you're likely getting very diminishing returns with larger cache sizes. If this is going to be used server-side, at least leave the option for explicit control of cache sizes, and probably make that the default, if not only, option.

There is a way: it is called virtual memory (vm). All three operating systems listed will use virtual memory (vm), unless there is no hardware support (which may be true in embedded systems). So I will assume that vm support is present.
Here is a quote from the architecture notes of the Varnish project:
The really short answer is that computers do not have two kinds of storage any more.
I would suggest you read the full text here: http://www.varnish-cache.org/trac/wiki/ArchitectNotes
It is a good read, and I believe will answer your question.

You could attempt to allocate some large-ish block of memory then check for a memory allocation exception. If the exception occurs, dump data. The problem is, this will only work when all system memory (or process limit) is reached. This means your application is likely to start swapping.
try {
char *buf = new char[10 * 1024 * 1024]; // 10 megabytes
free(buf);
} catch (const std::bad_alloc &) {
// Memory allocation failed - clean up old buffers
}
The problems with this approach are:
Running out of system memory can be dangerous and cause random applications to be shut down
Better memory management might be a better solution. If there is data that can be freed, why has it not already been freed? Is there a periodic process you could run to clean up unneeded data?

mmap() vs. reading blocks

I'm working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable length records. I've got a first implementation up and running and am now looking towards improving performance, particularly at doing I/O more efficiently since the input file gets scanned many times.
Is there a rule of thumb for using mmap() versus reading in blocks via C++'s fstream library? What I'd like to do is read large blocks from disk into a buffer, process complete records from the buffer, and then read more.
The mmap() code could potentially get very messy since mmap'd blocks need to lie on page sized boundaries (my understanding) and records could potentially lie across page boundaries. With fstreams, I can just seek to the start of a record and begin reading again, since we're not limited to reading blocks that lie on page sized boundaries.
How can I decide between these two options without actually writing up a complete implementation first? Any rules of thumb (e.g., mmap() is 2x faster) or simple tests?

I was trying to find the final word on mmap / read performance on Linux and I came across a nice post (link) on the Linux kernel mailing list. It's from 2000, so there have been many improvements to IO and virtual memory in the kernel since then, but it nicely explains the reason why mmap or read might be faster or slower.
A call to mmap has more overhead than read (just like epoll has more overhead than poll, which has more overhead than read). Changing virtual memory mappings is a quite expensive operation on some processors for the same reasons that switching between different processes is expensive.
The IO system can already use the disk cache, so if you read a file, you'll hit the cache or miss it no matter what method you use.
However,
Memory maps are generally faster for random access, especially if your access patterns are sparse and unpredictable.
Memory maps allow you to keep using pages from the cache until you are done. This means that if you use a file heavily for a long period of time, then close it and reopen it, the pages will still be cached. With read, your file may have been flushed from the cache ages ago. This does not apply if you use a file and immediately discard it. (If you try to mlock pages just to keep them in cache, you are trying to outsmart the disk cache and this kind of foolery rarely helps system performance).
Reading a file directly is very simple and fast.
The discussion of mmap/read reminds me of two other performance discussions:
Some Java programmers were shocked to discover that nonblocking I/O is often slower than blocking I/O, which made perfect sense if you know that nonblocking I/O requires making more syscalls.
Some other network programmers were shocked to learn that epoll is often slower than poll, which makes perfect sense if you know that managing epoll requires making more syscalls.
Conclusion: Use memory maps if you access data randomly, keep it around for a long time, or if you know you can share it with other processes (MAP_SHARED isn't very interesting if there is no actual sharing). Read files normally if you access data sequentially or discard it after reading. And if either method makes your program less complex, do that. For many real world cases there's no sure way to show one is faster without testing your actual application and NOT a benchmark.
(Sorry for necro'ing this question, but I was looking for an answer and this question kept coming up at the top of Google results.)

There are lots of good answers here already that cover many of the salient points, so I'll just add a couple of issues I didn't see addressed directly above. That is, this answer shouldn't be considered a comprehensive of the pros and cons, but rather an addendum to other answers here.
mmap seems like magic
Taking the case where the file is already fully cached1 as the baseline2, mmap might seem pretty much like magic:
mmap only requires 1 system call to (potentially) map the entire file, after which no more system calls are needed.
mmap doesn't require a copy of the file data from kernel to user-space.
mmap allows you to access the file "as memory", including processing it with whatever advanced tricks you can do against memory, such as compiler auto-vectorization, SIMD intrinsics, prefetching, optimized in-memory parsing routines, OpenMP, etc.
In the case that the file is already in the cache, it seems impossible to beat: you just directly access the kernel page cache as memory and it can't get faster than that.
Well, it can.
mmap is not actually magic because...
mmap still does per-page work
A primary hidden cost of mmap vs read(2) (which is really the comparable OS-level syscall for reading blocks) is that with mmap you'll need to do "some work" for every 4K page accessed in a new mapping, even though it might be hidden by the page-fault mechanism.
For a example a typical implementation that just mmaps the entire file will need to fault-in so 100 GB / 4K = 25 million faults to read a 100 GB file. Now, these will be minor faults, but 25 million page faults is still not going to be super fast. The cost of a minor fault is probably in the 100s of nanos in the best case.
mmap relies heavily on TLB performance
Now, you can pass MAP_POPULATE to mmap to tell it to set up all the page tables before returning, so there should be no page faults while accessing it. Now, this has the little problem that it also reads the entire file into RAM, which is going to blow up if you try to map a 100GB file - but let's ignore that for now3. The kernel needs to do per-page work to set up these page tables (shows up as kernel time). This ends up being a major cost in the mmap approach, and it's proportional to the file size (i.e., it doesn't get relatively less important as the file size grows)4.
Finally, even in user-space accessing such a mapping isn't exactly free (compared to large memory buffers not originating from a file-based mmap) - even once the page tables are set up, each access to a new page is going to, conceptually, incur a TLB miss. Since mmaping a file means using the page cache and its 4K pages, you again incur this cost 25 million times for a 100GB file.
Now, the actual cost of these TLB misses depends heavily on at least the following aspects of your hardware: (a) how many 4K TLB enties you have and how the rest of the translation caching works performs (b) how well hardware prefetch deals with with the TLB - e.g., can prefetch trigger a page walk? (c) how fast and how parallel the page walking hardware is. On modern high-end x86 Intel processors, the page walking hardware is in general very strong: there are at least 2 parallel page walkers, a page walk can occur concurrently with continued execution, and hardware prefetching can trigger a page walk. So the TLB impact on a streaming read load is fairly low - and such a load will often perform similarly regardless of the page size. Other hardware is usually much worse, however!
read() avoids these pitfalls
The read() syscall, which is what generally underlies the "block read" type calls offered e.g., in C, C++ and other languages has one primary disadvantage that everyone is well-aware of:
Every read() call of N bytes must copy N bytes from kernel to user space.
On the other hand, it avoids most the costs above - you don't need to map in 25 million 4K pages into user space. You can usually malloc a single buffer small buffer in user space, and re-use that repeatedly for all your read calls. On the kernel side, there is almost no issue with 4K pages or TLB misses because all of RAM is usually linearly mapped using a few very large pages (e.g., 1 GB pages on x86), so the underlying pages in the page cache are covered very efficiently in kernel space.
So basically you have the following comparison to determine which is faster for a single read of a large file:
Is the extra per-page work implied by the mmap approach more costly than the per-byte work of copying file contents from kernel to user space implied by using read()?
On many systems, they are actually approximately balanced. Note that each one scales with completely different attributes of the hardware and OS stack.
In particular, the mmap approach becomes relatively faster when:
The OS has fast minor-fault handling and especially minor-fault bulking optimizations such as fault-around.
The OS has a good MAP_POPULATE implementation which can efficiently process large maps in cases where, for example, the underlying pages are contiguous in physical memory.
The hardware has strong page translation performance, such as large TLBs, fast second level TLBs, fast and parallel page-walkers, good prefetch interaction with translation and so on.
... while the read() approach becomes relatively faster when:
The read() syscall has good copy performance. E.g., good copy_to_user performance on the kernel side.
The kernel has an efficient (relative to userland) way to map memory, e.g., using only a few large pages with hardware support.
The kernel has fast syscalls and a way to keep kernel TLB entries around across syscalls.
The hardware factors above vary wildly across different platforms, even within the same family (e.g., within x86 generations and especially market segments) and definitely across architectures (e.g., ARM vs x86 vs PPC).
The OS factors keep changing as well, with various improvements on both sides causing a large jump in the relative speed for one approach or the other. A recent list includes:
Addition of fault-around, described above, which really helps the mmap case without MAP_POPULATE.
Addition of fast-path copy_to_user methods in arch/x86/lib/copy_user_64.S, e.g., using REP MOVQ when it is fast, which really help the read() case.
Update after Spectre and Meltdown
The mitigations for the Spectre and Meltdown vulnerabilities considerably increased the cost of a system call. On the systems I've measured, the cost of a "do nothing" system call (which is an estimate of the pure overhead of the system call, apart from any actual work done by the call) went from about 100 ns on a typical modern Linux system to about 700 ns. Furthermore, depending on your system, the page-table isolation fix specifically for Meltdown can have additional downstream effects apart from the direct system call cost due to the need to reload TLB entries.
All of this is a relative disadvantage for read() based methods as compared to mmap based methods, since read() methods must make one system call for each "buffer size" worth of data. You can't arbitrarily increase the buffer size to amortize this cost since using large buffers usually performs worse since you exceed the L1 size and hence are constantly suffering cache misses.
On the other hand, with mmap, you can map in a large region of memory with MAP_POPULATE and the access it efficiently, at the cost of only a single system call.
1 This more-or-less also includes the case where the file wasn't fully cached to start with, but where the OS read-ahead is good enough to make it appear so (i.e., the page is usually cached by the time you want it). This is a subtle issue though because the way read-ahead works is often quite different between mmap and read calls, and can be further adjusted by "advise" calls as described in 2.
2 ... because if the file is not cached, your behavior is going to be completely dominated by IO concerns, including how sympathetic your access pattern is to the underlying hardware - and all your effort should be in ensuring such access is as sympathetic as possible, e.g. via use of madvise or fadvise calls (and whatever application level changes you can make to improve access patterns).
3 You could get around that, for example, by sequentially mmaping in windows of a smaller size, say 100 MB.
4 In fact, it turns out the MAP_POPULATE approach is (at least one some hardware/OS combination) only slightly faster than not using it, probably because the kernel is using faultaround - so the actual number of minor faults is reduced by a factor of 16 or so.

The main performance cost is going to be disk i/o. "mmap()" is certainly quicker than istream, but the difference might not be noticeable because the disk i/o will dominate your run-times.
I tried Ben Collins's code fragment (see above/below) to test his assertion that "mmap() is way faster" and found no measurable difference. See my comments on his answer.
I would certainly not recommend separately mmap'ing each record in turn unless your "records" are huge - that would be horribly slow, requiring 2 system calls for each record and possibly losing the page out of the disk-memory cache.....
In your case I think mmap(), istream and the low-level open()/read() calls will all be about the same. I would recommend mmap() in these cases:
There is random access (not sequential) within the file, AND
the whole thing fits comfortably in memory OR there is locality-of-reference within the file so that certain pages can be mapped in and other pages mapped out. That way the operating system uses the available RAM to maximum benefit.
OR if multiple processes are reading/working on the same file, then mmap() is fantastic because the processes all share the same physical pages.
(btw - I love mmap()/MapViewOfFile()).

mmap is way faster. You might write a simple benchmark to prove it to yourself:
char data[0x1000];
std::ifstream in("file.bin");
while (in)
{
in.read(data, 0x1000);
// do something with data
}
versus:
const int file_size=something;
const int page_size=0x1000;
int off=0;
void *data;
int fd = open("filename.bin", O_RDONLY);
while (off < file_size)
{
data = mmap(NULL, page_size, PROT_READ, 0, fd, off);
// do stuff with data
munmap(data, page_size);
off += page_size;
}
Clearly, I'm leaving out details (like how to determine when you reach the end of the file in the event that your file isn't a multiple of page_size, for instance), but it really shouldn't be much more complicated than this.
If you can, you might try to break up your data into multiple files that can be mmap()-ed in whole instead of in part (much simpler).
A couple of months ago I had a half-baked implementation of a sliding-window mmap()-ed stream class for boost_iostreams, but nobody cared and I got busy with other stuff. Most unfortunately, I deleted an archive of old unfinished projects a few weeks ago, and that was one of the victims :-(
Update: I should also add the caveat that this benchmark would look quite different in Windows because Microsoft implemented a nifty file cache that does most of what you would do with mmap in the first place. I.e., for frequently-accessed files, you could just do std::ifstream.read() and it would be as fast as mmap, because the file cache would have already done a memory-mapping for you, and it's transparent.
Final Update: Look, people: across a lot of different platform combinations of OS and standard libraries and disks and memory hierarchies, I can't say for certain that the system call mmap, viewed as a black box, will always always always be substantially faster than read. That wasn't exactly my intent, even if my words could be construed that way. Ultimately, my point was that memory-mapped i/o is generally faster than byte-based i/o; this is still true. If you find experimentally that there's no difference between the two, then the only explanation that seems reasonable to me is that your platform implements memory-mapping under the covers in a way that is advantageous to the performance of calls to read. The only way to be absolutely certain that you're using memory-mapped i/o in a portable way is to use mmap. If you don't care about portability and you can rely on the particular characteristics of your target platforms, then using read may be suitable without sacrificing measurably any performance.
Edit to clean up answer list:
#jbl:
the sliding window mmap sounds
interesting. Can you say a little more
about it?
Sure - I was writing a C++ library for Git (a libgit++, if you will), and I ran into a similar problem to this: I needed to be able to open large (very large) files and not have performance be a total dog (as it would be with std::fstream).
Boost::Iostreams already has a mapped_file Source, but the problem was that it was mmapping whole files, which limits you to 2^(wordsize). On 32-bit machines, 4GB isn't big enough. It's not unreasonable to expect to have .pack files in Git that become much larger than that, so I needed to read the file in chunks without resorting to regular file i/o. Under the covers of Boost::Iostreams, I implemented a Source, which is more or less another view of the interaction between std::streambuf and std::istream. You could also try a similar approach by just inheriting std::filebuf into a mapped_filebuf and similarly, inheriting std::fstream into a mapped_fstream. It's the interaction between the two that's difficult to get right. Boost::Iostreams has some of the work done for you, and it also provides hooks for filters and chains, so I thought it would be more useful to implement it that way.

I'm sorry Ben Collins lost his sliding windows mmap source code. That'd be nice to have in Boost.
Yes, mapping the file is much faster. You're essentially using the the OS virtual memory subsystem to associate memory-to-disk and vice versa. Think about it this way: if the OS kernel developers could make it faster they would. Because doing so makes just about everything faster: databases, boot times, program load times, et cetera.
The sliding window approach really isn't that difficult as multiple continguous pages can be mapped at once. So the size of the record doesn't matter so long as the largest of any single record will fit into memory. The important thing is managing the book-keeping.
If a record doesn't begin on a getpagesize() boundary, your mapping has to begin on the previous page. The length of the region mapped extends from the first byte of the record (rounded down if necessary to the nearest multiple of getpagesize()) to the last byte of the record (rounded up to the nearest multiple of getpagesize()). When you're finished processing a record, you can unmap() it, and move on to the next.
This all works just fine under Windows too using CreateFileMapping() and MapViewOfFile() (and GetSystemInfo() to get SYSTEM_INFO.dwAllocationGranularity --- not SYSTEM_INFO.dwPageSize).

mmap should be faster, but I don't know how much. It very much depends on your code. If you use mmap it's best to mmap the whole file at once, that will make you life a lot easier. One potential problem is that if your file is bigger than 4GB (or in practice the limit is lower, often 2GB) you will need a 64bit architecture. So if you're using a 32 environment, you probably don't want to use it.
Having said that, there may be a better route to improving performance. You said the input file gets scanned many times, if you can read it out in one pass and then be done with it, that could potentially be much faster.

Perhaps you should pre-process the files, so each record is in a separate file (or at least that each file is a mmap-able size).
Also could you do all of the processing steps for each record, before moving onto the next one? Maybe that would avoid some of the IO overhead?

I agree that mmap'd file I/O is going to be faster, but while your benchmarking the code, shouldn't the counter example be somewhat optimized?
Ben Collins wrote:
char data[0x1000];
std::ifstream in("file.bin");
while (in)
{
in.read(data, 0x1000);
// do something with data
}
I would suggest also trying:
char data[0x1000];
std::ifstream iifle( "file.bin");
std::istream in( ifile.rdbuf() );
while( in )
{
in.read( data, 0x1000);
// do something with data
}
And beyond that, you might also try making the buffer size the same size as one page of virtual memory, in case 0x1000 is not the size of one page of virtual memory on your machine... IMHO mmap'd file I/O still wins, but this should make things closer.

I remember mapping a huge file containing a tree structure into memory years ago. I was amazed by the speed compared to normal de-serialization which involves lot of work in memory, like allocating tree nodes and setting pointers.
So in fact I was comparing a single call to mmap (or its counterpart on Windows)
against many (MANY) calls to operator new and constructor calls.
For such kind of task, mmap is unbeatable compared to de-serialization.
Of course one should look into boosts relocatable pointer for this.

This sounds like a good use-case for multi-threading... I'd think you could pretty easily setup one thread to be reading data while the other(s) process it. That may be a way to dramatically increase the perceived performance. Just a thought.

To my mind, using mmap() "just" unburdens the developer from having to write their own caching code. In a simple "read through file eactly once" case, this isn't going to be hard (although as mlbrock points out you still save the memory copy into process space), but if you're going back and forth in the file or skipping bits and so forth, I believe the kernel developers have probably done a better job implementing caching than I can...

I think the greatest thing about mmap is potential for asynchronous reading with:
addr1 = NULL;
while( size_left > 0 ) {
r = min(MMAP_SIZE, size_left);
addr2 = mmap(NULL, r,
PROT_READ, MAP_FLAGS,
0, pos);
if (addr1 != NULL)
{
/* process mmap from prev cycle */
feed_data(ctx, addr1, MMAP_SIZE);
munmap(addr1, MMAP_SIZE);
}
addr1 = addr2;
size_left -= r;
pos += r;
}
feed_data(ctx, addr1, r);
munmap(addr1, r);
Problem is that I can't find the right MAP_FLAGS to give a hint that this memory should be synced from file asap.
I hope that MAP_POPULATE gives the right hint for mmap (i.e. it will not try to load all contents before return from call, but will do that in async. with feed_data). At least it gives better results with this flag even that manual states that it does nothing without MAP_PRIVATE since 2.6.23.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js