Is there a way to code data directly to the hard drive (similar to how one can do with RAM)?

My question concerns C/C++. It is possible to manipulate the data on the RAM with pretty great flexibility. You can also give the GPU direct commands using OpenGL, allowing one to manipulate VRAM as well.
My curiosity is whether it is possible to do this to the hard drive (even though this would likely be a horrible idea with many, many possibilities of corrupting existing data). The logic of my question comes from an assumption that the hard drive is similar to RAM and VRAM (bytes of data), but just accesses data slower.
I'm not asking about how to perform file IO, but instead how to directly modify bytes of memory on the hard drive (maybe via some sort of "hard-drive pointer").
If my assumption is totally off, a detailed correction about how the hard drive's data storage is different from RAM or VRAM would be very helpful. Thank you!

Modern operating systems in combination with modern CPUs offer the ability to memory-map disk clusters to memory pages.
The memory pages are initially marked as invalid, and as soon as you try to access them an invalid page "trap" or "interrupt" occurs, which is handled by the operating system, which loads the corresponding cluster into that memory page.
If you write to that page there is either a hardware-supported "dirty" bit, or another interrupt mechanism: the memory page is initially marked as read-only, so the first time you try to write to it there is another interrupt, which simply marks the page as dirty and turns it read-write. Then, you know that the page needs to be flushed to disk at a convenient time.
Note that reading and writing is usually done via Direct Memory Access (DMA) so the CPU is free to do other things while the pages are being transferred.
So, yes, you can do it, either with the help of the operating system, or by writing all that very complex code yourself.
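To make the operating-system route concrete, here is a minimal sketch that assumes a POSIX system. It maps an existing file with mmap and changes one byte with an ordinary memory store. The function name poke_byte and its parameters are made up for illustration; only open, fstat, mmap, msync and munmap are real calls.

```cpp
#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, msync, munmap
#include <sys/stat.h>   // fstat
#include <unistd.h>     // close
#include <cstddef>

// Overwrite one byte of an existing file through a memory mapping.
// Returns true on success. The name and parameters are illustrative only.
bool poke_byte(const char* path, std::size_t offset, unsigned char value) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return false;

    struct stat st;
    if (fstat(fd, &st) != 0 || offset >= static_cast<std::size_t>(st.st_size)) {
        close(fd);
        return false;
    }

    // MAP_SHARED makes stores reach the file, not just this process.
    void* p = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return false;

    static_cast<unsigned char*>(p)[offset] = value;  // a plain memory store

    // Ask the kernel to write the dirty page back before unmapping.
    bool ok = msync(p, st.st_size, MS_SYNC) == 0;
    munmap(p, st.st_size);
    return ok;
}
```

This still goes through the file system and the page cache, so it is the "hard-drive pointer" idea only in the sense the answer describes: the kernel moves the pages between disk and RAM for you.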

Not for you. Being able to write directly to the hard drive would give you infinite potential to mess up things beyond all recognition. (The technical term is FUBAR, and the F doesn't stand for Mess).
And if you write hard disk drivers, I sincerely hope you are not trying to ask for help here.

Related

C++: Is it more efficient to store data or continually read it

Ok so I'm working on a game project. Just finished rebuilding a game engine I designed some time ago. I'm looking at making a proprietary file type to store data rather than using a database like sqlite.
Looking at making this work with the game as efficiently and quickly as possible right off the bat without going too deep into it. And then improving over time.
My question is: Is it more efficient overall to load the data from the file and store it in a data manager class to be reused? Or is it more efficient overall to continually pull from the file?
Assume the file follows some form of consistent structure for its data, and that the largest "table" is something like 30 columns with roughly 1000 rows of data.

Here's a handy chart of "Latency Numbers Every Computer Programmer Should Know"
The far right hand side of the chart (red) has the time it takes to read 1 MB from disk. The green column has the same value read from RAM.
What this shows us is that you should do almost anything to avoid having to directly interact with the disk. Keeping data in RAM is good. Keeping data on disk is bad. (Memory mapped files might provide a way to handle this.)
This aside, reinventing the wheel is almost always the wrong solution. Sqlite works and works well. If it's not ideally suited for your needs, there are other file types out there.
If you're "looking at making this work with the game as efficiently and quickly as possible right off the bat without going too deep into it. And then improving over time", you'll find that's easiest to do if you reuse preexisting solutions to common problems.

Keeping reading from a file is generally not a good idea; modern operating systems do keep large IO caches (so if you keep reading the same stuff it won't really hit the disk), but syscalls are of course way more onerous than straight accessing memory - although, whether this is actually going to be a performance problem for your specific case is impossible to judge with the information you provided. On the other hand, if you have a lot of data to access keeping it all in memory can be wasteful, slow to load and, when under memory pressure, lead to paging.
The easy way out of this conundrum is to map the file in memory; the data is automatically fetched from disk when required and, unless the system is under memory pressure, frequently accessed pages remain cached in RAM, guaranteeing you fast access.
Of course this is feasible only if the data you need to map is smaller than the address space, but given the example you provided (30 columns/1000 rows, which is really small) it shouldn't be a problem at all.

If you can hold the data in RAM, then that is more efficient. This is because it is quicker for your computer to access values that are in RAM, a cache or the CPU's registers than it is to get them from the hard drive. Reading from the hard drive costs a lot of time in the operating system's drivers; therefore holding the data in memory is more efficient.
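The "load once, reuse from RAM" option could look roughly like the sketch below. The DataManager class and its members are hypothetical names, and the file is treated as an opaque byte blob because the question does not define the actual format.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical data manager: reads the whole file once, then serves
// every later request from RAM.
class DataManager {
public:
    // Returns false if the file could not be opened.
    bool load(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        if (!in) return false;
        bytes_.assign(std::istreambuf_iterator<char>(in),
                      std::istreambuf_iterator<char>());
        return true;
    }

    // Later lookups touch only memory; no file I/O happens here.
    const std::vector<char>& bytes() const { return bytes_; }
    std::size_t size() const { return bytes_.size(); }

private:
    std::vector<char> bytes_;
};
```

For a table of roughly 30 columns by 1000 rows, the whole file would plausibly be small enough that a single load at startup is the simplest choice.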

How to build an application layer pre-fetching system

I'm working in a C/C++ mixed project that has the following situation.
I need to iterate through very small chunks (and, rarely, larger ones) in a file one by one. Ideally, I should just read them once, consecutively. I think a better solution in this case would be to read a big chunk into a buffer and consume it later, rather than reading each chunk at the moment I need it.
The problem is, how do I balance the cache size? Is there any well-known algorithm/library that I can take advantage of?
UPDATE: (changes the title)
Thanks for your replies. I understand that there are different levels of caching at work in our machines, but that is not enough in my case.
I think I left out something important. I'm actually building an application on top of an existing framework, in which frequent read requests to the engine cost too much. (Yes, I believe the engine does take advantage of OS- and disk-level caches.) What I'm trying to do is build an application-level prefetching system.
Thoughts?

in general you should try to use what the OS gives you, rather than creating your own cache (because you run the risk of caching twice). for linux, you can request OS level caching via readahead(); i don't know what the windows equivalent would be.
looking into this some more, there is also a block level (ie disk) parameter, set via blockdev --setra. it's probably not a good idea to change that on your system (unless it is dedicated to just this one task), but if the value there (blockdev --getra) is already larger than your typical chunk size then you may not need to do anything else.
[and just to address the other point mentioned in the question comments - while an OS will cache file data in free memory, i don't believe that it will pre-emptively read an otherwise unread file (apart from to meet the requirements above). but if anyone knows otherwise, please post details...]
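As a rough illustration of asking the OS to fetch ahead, the sketch below uses posix_fadvise with POSIX_FADV_WILLNEED rather than the readahead() call named above; that is a swapped-in, more portable call with a similar advisory effect, not the answer's exact suggestion. The function name read_with_prefetch and the chunk_size parameter are assumptions.

```cpp
#include <fcntl.h>   // open, posix_fadvise
#include <unistd.h>  // pread, close
#include <cstddef>
#include <vector>

// Reads a file chunk by chunk. After reading chunk k it asks the kernel to
// start fetching chunk k+1. Returns total bytes read, or -1 on error.
long read_with_prefetch(const char* path, std::size_t chunk_size) {
    if (chunk_size == 0) return -1;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    std::vector<char> buf(chunk_size);
    off_t offset = 0;
    long total = 0;
    for (;;) {
        ssize_t n = pread(fd, buf.data(), buf.size(), offset);
        if (n < 0) { close(fd); return -1; }
        if (n == 0) break;  // end of file
        offset += n;
        total += n;
        // Advisory hint for the next chunk; its result is deliberately ignored.
        posix_fadvise(fd, offset, static_cast<off_t>(chunk_size), POSIX_FADV_WILLNEED);
        // A real caller would process buf[0..n) here while the kernel fetches ahead.
    }
    close(fd);
    return total;
}
```

Whether the hint helps depends on the kernel's existing read-ahead settings, which is the point made above about checking blockdev --getra first.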

Have you tried mmap()ing the file instead of read()ing from it? In some cases this might be more efficient, in some cases this might not. However it is usually best to let the system optimize for you, since it knows more about the hardware than an application. mmap() will let the system know that you need the whole file, so it might just be more optimal.

memory safety for encrypted, sensitive data

I'm writing a server in C++ that will handle secure connections over which sensitive data will be sent.
The goal is to never save the data in unencrypted form anywhere outside memory, and to keep it in a defined region of memory (to be overwritten once it's no longer needed).
Will allocating a large chunk of memory and using it to store the sensitive data be sufficient to ensure that no data leaks?

From the manual of a tool that handles passwords:
It's also debatable whether mlock() is a proper way to protect sensitive
information. According to POSIX, mlock()-ing a page guarantees that it
is in memory (useful for realtime applications), not that it isn't
in the swap (useful for security applications). Possibly an encrypted
swap partition (or no swap partition) is a better solution.
However, Linux does guarantee that it is not in the swap and specifically discusses the security applications. It also mentions:
But be aware that the suspend mode on laptops and some desktop computers will
save a copy of the system's RAM to disk, regardless of memory locks.

Why don't you use SELinux? Then no process can access other stuff unless you tell it that it can.
I think if you are securing a program handling sensitive data, you should start by using a secure OS. If the OS is not secure enough then there is nothing your application can do to fix that.
And maybe when using SELinux you don't have to do anything special in your application making your application smaller, simpler and also more secure?

What you want is locking some region of memory into RAM. See the manpage for mlock(2).
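A minimal sketch of that idea, assuming a POSIX system: a small fixed-size holder that tries to mlock its storage and wipes it on destruction. The class name LockedSecret, its capacity and its methods are invented for illustration. mlock works on whole pages and can fail when the locked-memory limit is low, so the sketch records the result instead of assuming success.

```cpp
#include <sys/mman.h>  // mlock, munlock
#include <cstddef>
#include <cstring>

// Hypothetical fixed-size holder for a secret. It tries to pin its storage
// in RAM and wipes it on destruction.
class LockedSecret {
public:
    static constexpr std::size_t kCapacity = 256;

    LockedSecret() {
        std::memset(data_, 0, kCapacity);
        // mlock can fail (for example when RLIMIT_MEMLOCK is too low),
        // so the result is recorded rather than assumed.
        locked_ = (mlock(data_, kCapacity) == 0);
    }

    ~LockedSecret() {
        wipe();
        if (locked_) munlock(data_, kCapacity);
    }

    LockedSecret(const LockedSecret&) = delete;
    LockedSecret& operator=(const LockedSecret&) = delete;

    // Copies at most kCapacity bytes; returns the number stored.
    std::size_t store(const void* src, std::size_t n) {
        if (n > kCapacity) n = kCapacity;
        std::memcpy(data_, src, n);
        return n;
    }

    // Writes through a volatile pointer so the compiler cannot drop the stores.
    void wipe() {
        volatile unsigned char* p = data_;
        for (std::size_t i = 0; i < kCapacity; ++i) p[i] = 0;
    }

    bool locked() const { return locked_; }
    const unsigned char* data() const { return data_; }

private:
    unsigned char data_[kCapacity];
    bool locked_ = false;
};
```

This only addresses swapping. As the quoted manual notes, suspend-to-disk can still copy RAM to disk regardless of memory locks.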

Locking the memory (or, if you use Linux, using large pages, since these cannot be paged out) is a good start. All other considerations left aside, this does at least not write plaintext to harddisk in unpredictable ways.
Overwriting memory when no longer needed does not hurt, but is probably useless, because:
- any pages that are reclaimed and later given to another process will be zeroed out by the operating system anyway (every modern OS does that)
- as long as some data is on a computer, you must assume that someone will be able to steal it, one way or the other
- there are more exploits in the operating system and in your own code than you are aware of (this happens to the best programmers, and it happens again and again)
There are countless concerns when attempting to prevent someone from stealing sensitive data, and it is by no means an easy endeavour. Encrypting data, trying not to have any obvious exploits, and trying to avoid the most stupid mistakes is as good as you will get. Beyond that, nothing is really safe, because for every N things you plan for, there exists an (N+1)th thing.
Take my wife's work laptop as a prime example. The intern setting up the machines in their company (at least it's my guess that he's an intern) takes every possible measure and configures everything in paranoia mode to ensure that data on the computer cannot be stolen and that working becomes as much of an ordeal as possible. What you end up with is a BitLocker-protected computer that takes 3 passwords to even boot up, on which you can do practically nothing, and a screensaver that locks the workstation every time you pick up the phone and forget to shake the mouse. At the same time, this super secure computer has an enabled FireWire port over which anybody can read and write anything in the computer's memory without a password.

How to optimize paging for large in memory database

I have an application where the entire database is implemented in memory using a stl-map for each table in the database.
Each item in the stl-map is a complex object with references to other items in the other stl-maps.
The application works with a large amount of data, so it uses more than 500 MByte RAM. Clients are able to contact the application and get a filtered version of the entire database. This is done by running through the entire database, and finding items relevant for the client.
When the application has been running for an hour or so, Windows 2003 SP2 starts to page out parts of the application's RAM (even though there is 16 GByte RAM on the machine).
After the application has been partly paged out, a client logon takes a long time (10 mins) because it now generates a page fault for each pointer lookup in the stl-map. If the client logon is run a second time right afterwards, it is fast (a few secs) because all the memory is now back in RAM.
I can see it is possible to tell Windows to lock memory in RAM, but this is generally only recommended for device drivers, and only for "small" amounts of memory.
I guess a poor man's solution could be to loop through the entire memory database, and thus tell Windows we are still interested in keeping the data model in RAM.
I guess another poor man's solution could be to disable the pagefile completely on Windows.
I guess the expensive solution would be a SQL database, and then rewriting the entire application to use a database layer. Then hopefully the database system will have implemented the means for fast access.
Are there other, more elegant solutions?

This sounds like either a memory leak, or a serious fragmentation problem. It seems to me that the first step would be to figure out what's causing 500 MB of data to use up 16 GB of RAM and still want more.
Edit: Windows has a working set trimmer that actively attempts to page out idle data. The basic idea is that it goes through and marks pages as being available, but leaves the data in them (and the virtual memory manager knows what data is in them). If, however, you attempt to access that memory before it's allocated to other purposes, it'll be marked as being in use again, which will normally prevent it from being paged out.
If you really think this is the source of your problem, you can indirectly control the working set trimmer by calling SetProcessWorkingSetSize. At least in my experience, this is only rarely of much use, but you may be in one of those unusual situations where it's really helpful.

As @Jerry Coffin said, it really sounds like your actual problem is a memory leak. Fix that.
But for the record, none of your "poor man's solutions" would work. At all.
Windows pages out some of your data because there's not room for it in RAM.
Looping through the entire memory database would load in every byte of the data model, yes... which would cause other parts of it to be paged out. In the end, you'd generate a lot of page faults, and the only difference in the end would be which parts of the data structure are paged out.
Disabling the page file? Yes, if you think a hard crash is better than low performance. Windows doesn't page data out because it's fun. It does that to handle situations where it would otherwise run out of memory. If you disable the pagefile, the app will just crash when it would otherwise page out data.
If your dataset really is so big it doesn't fit in memory, then I don't see why an SQL database would be especially "expensive". Unlike your current solution, databases are optimized for this purpose. They're meant to handle datasets too large to fit in memory, and to do this efficiently.
It sounds like you have a memory leak. Fixing that would be the elegant, efficient and correct solution.
If you can't do that, then either:
- throw more RAM at the problem (the app ends up using 16GB? Throw 32 or 64GB at it then), or
- switch to a format that's optimized for efficient disk access (a SQL database, probably)

We had a similar problem, and the solution we chose was to allocate everything in a shared memory block. AFAIK, Windows doesn't page this out. However, using stl-map here is not for the faint of heart either, and was beyond what we required.
We are using Boost Shared Memory to implement this for us and it works well. Follow the examples closely and you will be up and running quickly. Boost also has Boost.MultiIndex, which will do a lot of what you want.
For a no-cost SQL solution, have you looked at Sqlite? It has an option to run as an in-memory database.
Good luck, sounds like an interesting application.

"I have an application where the entire database is implemented in memory using a stl-map for each table in the database."
That's the start of the end: STL's std::map is extremely memory inefficient. Same applies to std::list. Every element would be allocated separately causing rather serious memory waste. I often use std::vector + sort() + find() instead of std::map in applications where it is possible (more searches than modifications) and I know in advance memory usage might become an issue.
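The std::vector + sort() + find() approach mentioned above might look like this minimal sketch. The Row type and the function names are hypothetical, and std::lower_bound stands in for the "find" step so that lookups are a binary search over one contiguous allocation.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical table: (id, name) rows kept in one contiguous, sorted array.
using Row = std::pair<int, std::string>;

// Sort once after bulk loading so lookups can use binary search.
void finalize(std::vector<Row>& rows) {
    std::sort(rows.begin(), rows.end(),
              [](const Row& a, const Row& b) { return a.first < b.first; });
}

// O(log n) lookup; returns nullptr when the id is absent.
const Row* find_row(const std::vector<Row>& rows, int id) {
    auto it = std::lower_bound(
        rows.begin(), rows.end(), id,
        [](const Row& r, int key) { return r.first < key; });
    return (it != rows.end() && it->first == id) ? &*it : nullptr;
}
```

The trade-off is the one the answer states: this suits tables with far more searches than modifications, since inserting into the middle of a sorted vector moves every later element.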
"When the application have been running for an hour or so, then Windows 2003 SP2 starts to page out parts of the RAM for the application (Eventhough there is 16 GByte RAM on the machine)."
Hard to tell without knowing how your application is written. Windows has a feature that unloads from RAM whatever memory of idle applications can be unloaded. But that normally affects memory-mapped files and the like.
Otherwise, I would strongly suggest reading up on the Windows memory management documentation. It is not very easy to understand, yet Windows makes all sorts and types of memory available to applications. I never had luck with it, but using a custom std::allocator would probably work in your application.

I can believe it is the fault of flawed pagefile behaviour. I've run my laptops mostly with the pagefile turned off since NT 4.0. In my experience, at least up to XP Pro, Windows intrusively swaps pages out just to provide the dubious benefit of having a really, really slow extension to the maximum working set space.
Ask what benefit swapping to the hard disk achieves with 16 gigabytes of real RAM available. If your working set is so big as to need more than 10 gigs of additional virtual memory, then once swapping is actually required, processes will take anything from a bit longer to thousands of times longer to complete. On Windows, the untameable file system cache seems to make matters worse.
Now when I (very) occasionally run out of working set on my XP laptops, there is no traffic jam; the guilty app just crashes. A utility to suspend memory-guzzling processes before that point and raise an alert would be nice, but there is no such thing: just a violation, a crash, and sometimes explorer.exe goes down too.
Pagefiles: who needs 'em?

---- Edit
Given snakefoot's explanation, the problem is that memory which has not been used for a while gets swapped out, so the data is not in memory when it is needed. This is the same as this question:
Can I tell Windows not to swap out a particular processes’ memory?
and the VirtualLock function should do the job:
http://msdn.microsoft.com/en-us/library/aa366895(VS.85).aspx
---- Previous answer
First of all you need to distinguish between memory leak and memory need problems.
If you have a memory leak, then converting the entire application to SQL would be a bigger effort than debugging the application.
SQL cannot be faster than a well designed, domain-specific in-memory database, and if you have bugs, chances are you will have different ones in an SQL version as well.
If this is a memory need problem, then you will need to switch to SQL anyway and this sounds like a good moment.

mmap() vs. reading blocks

I'm working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable length records. I've got a first implementation up and running and am now looking towards improving performance, particularly at doing I/O more efficiently since the input file gets scanned many times.
Is there a rule of thumb for using mmap() versus reading in blocks via C++'s fstream library? What I'd like to do is read large blocks from disk into a buffer, process complete records from the buffer, and then read more.
The mmap() code could potentially get very messy since mmap'd blocks need to lie on page sized boundaries (my understanding) and records could potentially lie across page boundaries. With fstreams, I can just seek to the start of a record and begin reading again, since we're not limited to reading blocks that lie on page sized boundaries.
How can I decide between these two options without actually writing up a complete implementation first? Any rules of thumb (e.g., mmap() is 2x faster) or simple tests?

I was trying to find the final word on mmap / read performance on Linux and I came across a nice post (link) on the Linux kernel mailing list. It's from 2000, so there have been many improvements to IO and virtual memory in the kernel since then, but it nicely explains the reason why mmap or read might be faster or slower.
A call to mmap has more overhead than read (just like epoll has more overhead than poll, which has more overhead than read). Changing virtual memory mappings is a quite expensive operation on some processors for the same reasons that switching between different processes is expensive.
The IO system can already use the disk cache, so if you read a file, you'll hit the cache or miss it no matter what method you use.
However:
- Memory maps are generally faster for random access, especially if your access patterns are sparse and unpredictable.
- Memory maps allow you to keep using pages from the cache until you are done. This means that if you use a file heavily for a long period of time, then close it and reopen it, the pages will still be cached. With read, your file may have been flushed from the cache ages ago. This does not apply if you use a file and immediately discard it. (If you try to mlock pages just to keep them in cache, you are trying to outsmart the disk cache and this kind of foolery rarely helps system performance).
- Reading a file directly is very simple and fast.
The discussion of mmap/read reminds me of two other performance discussions:
- Some Java programmers were shocked to discover that nonblocking I/O is often slower than blocking I/O, which makes perfect sense if you know that nonblocking I/O requires making more syscalls.
- Some other network programmers were shocked to learn that epoll is often slower than poll, which makes perfect sense if you know that managing epoll requires making more syscalls.
Conclusion: Use memory maps if you access data randomly, keep it around for a long time, or if you know you can share it with other processes (MAP_SHARED isn't very interesting if there is no actual sharing). Read files normally if you access data sequentially or discard it after reading. And if either method makes your program less complex, do that. For many real world cases there's no sure way to show one is faster without testing your actual application and NOT a benchmark.
(Sorry for necro'ing this question, but I was looking for an answer and this question kept coming up at the top of Google results.)

There are lots of good answers here already that cover many of the salient points, so I'll just add a couple of issues I didn't see addressed directly above. That is, this answer shouldn't be considered a comprehensive list of the pros and cons, but rather an addendum to other answers here.
mmap seems like magic
Taking the case where the file is already fully cached [1] as the baseline [2], mmap might seem pretty much like magic:
- mmap only requires 1 system call to (potentially) map the entire file, after which no more system calls are needed.
- mmap doesn't require a copy of the file data from kernel to user-space.
- mmap allows you to access the file "as memory", including processing it with whatever advanced tricks you can do against memory, such as compiler auto-vectorization, SIMD intrinsics, prefetching, optimized in-memory parsing routines, OpenMP, etc.
In the case that the file is already in the cache, it seems impossible to beat: you just directly access the kernel page cache as memory and it can't get faster than that.
Well, it can.
mmap is not actually magic because...
mmap still does per-page work
A primary hidden cost of mmap vs read(2) (which is really the comparable OS-level syscall for reading blocks) is that with mmap you'll need to do "some work" for every 4K page accessed in a new mapping, even though it might be hidden by the page-fault mechanism.
For example, a typical implementation that just mmaps the entire file will need to fault in 100 GB / 4K = 25 million pages to read a 100 GB file. Now, these will be minor faults, but 25 million page faults is still not going to be super fast. The cost of a minor fault is probably in the 100s of nanos in the best case.
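The arithmetic behind that estimate can be written out as a small sketch. The helper names are invented, and the per-fault cost is an input you supply, not a measured figure. With binary units, 100 GiB over 4 KiB pages is 26,214,400 pages, in line with the rough 25 million above; at an assumed 100 ns per fault that is about 2.6 seconds spent on faults alone.

```cpp
#include <cstdint>

// Number of pages needed to cover `bytes` of file data (rounding up).
constexpr std::uint64_t pages_for(std::uint64_t bytes, std::uint64_t page_size) {
    return (bytes + page_size - 1) / page_size;
}

// Total time in seconds if every page costs one minor fault of
// `fault_ns` nanoseconds (an assumed, caller-supplied cost).
constexpr double fault_seconds(std::uint64_t pages, double fault_ns) {
    return static_cast<double>(pages) * fault_ns / 1e9;
}
```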
mmap relies heavily on TLB performance
Now, you can pass MAP_POPULATE to mmap to tell it to set up all the page tables before returning, so there should be no page faults while accessing it. Now, this has the little problem that it also reads the entire file into RAM, which is going to blow up if you try to map a 100GB file - but let's ignore that for now [3]. The kernel needs to do per-page work to set up these page tables (shows up as kernel time). This ends up being a major cost in the mmap approach, and it's proportional to the file size (i.e., it doesn't get relatively less important as the file size grows) [4].
Finally, even in user-space accessing such a mapping isn't exactly free (compared to large memory buffers not originating from a file-based mmap) - even once the page tables are set up, each access to a new page is going to, conceptually, incur a TLB miss. Since mmaping a file means using the page cache and its 4K pages, you again incur this cost 25 million times for a 100GB file.
Now, the actual cost of these TLB misses depends heavily on at least the following aspects of your hardware: (a) how many 4K TLB entries you have and how the rest of the translation caching performs, (b) how well hardware prefetch deals with the TLB - e.g., can prefetch trigger a page walk? and (c) how fast and how parallel the page walking hardware is. On modern high-end x86 Intel processors, the page walking hardware is in general very strong: there are at least 2 parallel page walkers, a page walk can occur concurrently with continued execution, and hardware prefetching can trigger a page walk. So the TLB impact on a streaming read load is fairly low - and such a load will often perform similarly regardless of the page size. Other hardware is usually much worse, however!
read() avoids these pitfalls
The read() syscall, which is what generally underlies the "block read" type calls offered e.g., in C, C++ and other languages has one primary disadvantage that everyone is well-aware of:
Every read() call of N bytes must copy N bytes from kernel to user space.
On the other hand, it avoids most of the costs above - you don't need to map 25 million 4K pages into user space. You can usually malloc a single small buffer in user space, and re-use that repeatedly for all your read calls. On the kernel side, there is almost no issue with 4K pages or TLB misses because all of RAM is usually linearly mapped using a few very large pages (e.g., 1 GB pages on x86), so the underlying pages in the page cache are covered very efficiently in kernel space.
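The single re-used buffer pattern described here can be sketched as follows, assuming a POSIX system. The function name count_newlines and the 16 KiB buffer size are illustrative choices, not values taken from the answer.

```cpp
#include <fcntl.h>   // open
#include <unistd.h>  // read, close
#include <cstddef>

// Counts '\n' bytes in a file using read() and one small reusable buffer.
// Returns -1 on error.
long count_newlines(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    char buf[16 * 1024];  // assumed size; small enough to stay cache-friendly
    long count = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);  // kernel copies n bytes into buf
        if (n < 0) { close(fd); return -1; }
        if (n == 0) break;  // end of file
        for (ssize_t i = 0; i < n; ++i) {
            if (buf[i] == '\n') ++count;
        }
    }
    close(fd);
    return count;
}
```

Each loop iteration costs one system call plus a copy of n bytes, which is the per-byte work weighed against mmap's per-page work in the comparison that follows.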
So basically you have the following comparison to determine which is faster for a single read of a large file:
Is the extra per-page work implied by the mmap approach more costly than the per-byte work of copying file contents from kernel to user space implied by using read()?
On many systems, they are actually approximately balanced. Note that each one scales with completely different attributes of the hardware and OS stack.
In particular, the mmap approach becomes relatively faster when:
- The OS has fast minor-fault handling and especially minor-fault bulking optimizations such as fault-around.
- The OS has a good MAP_POPULATE implementation which can efficiently process large maps in cases where, for example, the underlying pages are contiguous in physical memory.
- The hardware has strong page translation performance, such as large TLBs, fast second level TLBs, fast and parallel page-walkers, good prefetch interaction with translation and so on.
... while the read() approach becomes relatively faster when:
- The read() syscall has good copy performance. E.g., good copy_to_user performance on the kernel side.
- The kernel has an efficient (relative to userland) way to map memory, e.g., using only a few large pages with hardware support.
- The kernel has fast syscalls and a way to keep kernel TLB entries around across syscalls.
The hardware factors above vary wildly across different platforms, even within the same family (e.g., within x86 generations and especially market segments) and definitely across architectures (e.g., ARM vs x86 vs PPC).
The OS factors keep changing as well, with various improvements on both sides causing a large jump in the relative speed for one approach or the other. A recent list includes:
- Addition of fault-around, described above, which really helps the mmap case without MAP_POPULATE.
- Addition of fast-path copy_to_user methods in arch/x86/lib/copy_user_64.S, e.g., using REP MOVSQ when it is fast, which really helps the read() case.
Update after Spectre and Meltdown
The mitigations for the Spectre and Meltdown vulnerabilities considerably increased the cost of a system call. On the systems I've measured, the cost of a "do nothing" system call (which is an estimate of the pure overhead of the system call, apart from any actual work done by the call) went from about 100 ns on a typical modern Linux system to about 700 ns. Furthermore, depending on your system, the page-table isolation fix specifically for Meltdown can have additional downstream effects apart from the direct system call cost due to the need to reload TLB entries.
All of this is a relative disadvantage for read() based methods as compared to mmap based methods, since read() methods must make one system call for each "buffer size" worth of data. You can't arbitrarily increase the buffer size to amortize this cost since using large buffers usually performs worse since you exceed the L1 size and hence are constantly suffering cache misses.
On the other hand, with mmap, you can map in a large region of memory with MAP_POPULATE and then access it efficiently, at the cost of only a single system call.
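A minimal sketch of the single-call MAP_POPULATE approach, assuming Linux (the flag is Linux-specific, so the sketch guards it with #ifdef). The function name sum_bytes_mapped is hypothetical. It maps the whole file at once, so it only suits files that fit comfortably in RAM.

```cpp
#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close
#include <cstddef>

// Sums all bytes of a file through a single pre-populated mapping.
// Returns -1 on error.
long long sum_bytes_mapped(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  // mmap rejects length 0

    int flags = MAP_PRIVATE;
#ifdef MAP_POPULATE
    flags |= MAP_POPULATE;  // pre-fault the page tables inside the mmap call
#endif
    void* p = mmap(nullptr, st.st_size, PROT_READ, flags, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return -1;

    const unsigned char* bytes = static_cast<const unsigned char*>(p);
    long long sum = 0;
    for (off_t i = 0; i < st.st_size; ++i) sum += bytes[i];
    munmap(p, st.st_size);
    return sum;
}
```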
[1] This more-or-less also includes the case where the file wasn't fully cached to start with, but where the OS read-ahead is good enough to make it appear so (i.e., the page is usually cached by the time you want it). This is a subtle issue though because the way read-ahead works is often quite different between mmap and read calls, and can be further adjusted by "advise" calls as described in [2].
[2] ... because if the file is not cached, your behavior is going to be completely dominated by IO concerns, including how sympathetic your access pattern is to the underlying hardware - and all your effort should be in ensuring such access is as sympathetic as possible, e.g. via use of madvise or fadvise calls (and whatever application level changes you can make to improve access patterns).
[3] You could get around that, for example, by sequentially mmaping in windows of a smaller size, say 100 MB.
[4] In fact, it turns out the MAP_POPULATE approach is (at least on some hardware/OS combinations) only slightly faster than not using it, probably because the kernel is using faultaround - so the actual number of minor faults is reduced by a factor of 16 or so.
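The windowed mapping suggested in the third note could be sketched like this, assuming a POSIX system. The function name count_newlines_windowed and the window_pages parameter are assumptions; a real program would likely use much larger windows, such as the 100 MB mentioned in the note.

```cpp
#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // sysconf, close
#include <algorithm>   // std::min, std::count
#include <cstddef>

// Counts '\n' bytes by mapping the file one window at a time.
// `window_pages` is a hypothetical tuning knob. Returns -1 on error.
long count_newlines_windowed(const char* path, std::size_t window_pages) {
    if (window_pages == 0) return -1;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }

    // Window offsets must be multiples of the page size.
    const off_t window = static_cast<off_t>(sysconf(_SC_PAGESIZE)) *
                         static_cast<off_t>(window_pages);
    long count = 0;
    for (off_t off = 0; off < st.st_size; off += window) {
        const std::size_t len =
            static_cast<std::size_t>(std::min<off_t>(window, st.st_size - off));
        void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, off);
        if (p == MAP_FAILED) { close(fd); return -1; }
        const char* bytes = static_cast<const char*>(p);
        count += std::count(bytes, bytes + len, '\n');
        munmap(p, len);  // release this window before mapping the next
    }
    close(fd);
    return count;
}
```

Counting single bytes sidesteps the record-boundary problem raised in the question; records that straddle two windows would need extra handling, for example overlapping consecutive windows by the maximum record size.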

The main performance cost is going to be disk i/o. "mmap()" is certainly quicker than istream, but the difference might not be noticeable because the disk i/o will dominate your run-times.
I tried Ben Collins's code fragment (see above/below) to test his assertion that "mmap() is way faster" and found no measurable difference. See my comments on his answer.
I would certainly not recommend separately mmap'ing each record in turn unless your "records" are huge - that would be horribly slow, requiring 2 system calls for each record and possibly losing the page out of the disk-memory cache.....
In your case I think mmap(), istream and the low-level open()/read() calls will all be about the same. I would recommend mmap() in these cases:
- There is random access (not sequential) within the file, AND
- the whole thing fits comfortably in memory OR there is locality-of-reference within the file so that certain pages can be mapped in and other pages mapped out. That way the operating system uses the available RAM to maximum benefit.
- OR if multiple processes are reading/working on the same file, then mmap() is fantastic because the processes all share the same physical pages.
(btw - I love mmap()/MapViewOfFile()).

mmap is way faster. You might write a simple benchmark to prove it to yourself:
char data[0x1000];
std::ifstream in("file.bin");
while (in)
{
in.read(data, 0x1000);
// do something with data
}
versus:
const int file_size=something;
const int page_size=0x1000;
int off=0;
void *data;
int fd = open("filename.bin", O_RDONLY);
while (off < file_size)
{
data = mmap(NULL, page_size, PROT_READ, MAP_PRIVATE, fd, off); // flags must include MAP_PRIVATE or MAP_SHARED
// do stuff with data
munmap(data, page_size);
off += page_size;
}
Clearly, I'm leaving out details (like how to determine when you reach the end of the file in the event that your file isn't a multiple of page_size, for instance), but it really shouldn't be much more complicated than this.
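To fill in the end-of-file detail mentioned above, here is one possible sketch: take the size from fstat() and shorten the final window to whatever is left. The name count_byte_windowed and the byte-counting workload are placeholders for "do stuff with data", not anything from the original benchmark.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: walks a file in mapped windows and counts bytes equal to `needle`.
// `window` must be a multiple of the page size so every mmap offset stays page-aligned.
// Returns -1 on error.
long long count_byte_windowed(const char* path, unsigned char needle, size_t window) {
    const size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    if (window == 0 || window % page != 0) return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }

    const off_t file_size = st.st_size;
    long long count = 0;
    for (off_t off = 0; off < file_size; off += static_cast<off_t>(window)) {
        // The last window is shortened to the bytes that actually remain in the file.
        const size_t len = static_cast<size_t>(
            std::min<off_t>(static_cast<off_t>(window), file_size - off));
        void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, off);
        if (p == MAP_FAILED) { close(fd); return -1; }

        const unsigned char* bytes = static_cast<const unsigned char*>(p);
        count += std::count(bytes, bytes + len, needle);
        munmap(p, len);
    }
    close(fd);
    return count;
}
```

A window of one page mirrors the loop above; a larger multiple of the page size means fewer mmap/munmap calls per byte processed.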
If you can, you might try to break up your data into multiple files that can be mmap()-ed in whole instead of in part (much simpler).
A couple of months ago I had a half-baked implementation of a sliding-window mmap()-ed stream class for boost_iostreams, but nobody cared and I got busy with other stuff. Most unfortunately, I deleted an archive of old unfinished projects a few weeks ago, and that was one of the victims :-(
Update: I should also add the caveat that this benchmark would look quite different in Windows because Microsoft implemented a nifty file cache that does most of what you would do with mmap in the first place. I.e., for frequently-accessed files, you could just do std::ifstream.read() and it would be as fast as mmap, because the file cache would have already done a memory-mapping for you, and it's transparent.
Final Update: Look, people: across a lot of different platform combinations of OS and standard libraries and disks and memory hierarchies, I can't say for certain that the system call mmap, viewed as a black box, will always always always be substantially faster than read. That wasn't exactly my intent, even if my words could be construed that way. Ultimately, my point was that memory-mapped i/o is generally faster than byte-based i/o; this is still true. If you find experimentally that there's no difference between the two, then the only explanation that seems reasonable to me is that your platform implements memory-mapping under the covers in a way that is advantageous to the performance of calls to read. The only way to be absolutely certain that you're using memory-mapped i/o in a portable way is to use mmap. If you don't care about portability and you can rely on the particular characteristics of your target platforms, then using read may be suitable without measurably sacrificing any performance.
Edit to clean up answer list:
@jbl:
"the sliding window mmap sounds interesting. Can you say a little more about it?"
Sure - I was writing a C++ library for Git (a libgit++, if you will), and I ran into a similar problem to this: I needed to be able to open large (very large) files and not have performance be a total dog (as it would be with std::fstream).
Boost::Iostreams already has a mapped_file Source, but the problem was that it was mmapping whole files, which limits you to 2^(wordsize). On 32-bit machines, 4GB isn't big enough. It's not unreasonable to expect to have .pack files in Git that become much larger than that, so I needed to read the file in chunks without resorting to regular file i/o. Under the covers of Boost::Iostreams, I implemented a Source, which is more or less another view of the interaction between std::streambuf and std::istream. You could also try a similar approach by just inheriting std::filebuf into a mapped_filebuf and similarly, inheriting std::fstream into a mapped_fstream. It's the interaction between the two that's difficult to get right. Boost::Iostreams has some of the work done for you, and it also provides hooks for filters and chains, so I thought it would be more useful to implement it that way.

I'm sorry Ben Collins lost his sliding windows mmap source code. That'd be nice to have in Boost.
Yes, mapping the file is much faster. You're essentially using the OS virtual memory subsystem to associate memory-to-disk and vice versa. Think about it this way: if the OS kernel developers could make it faster they would. Because doing so makes just about everything faster: databases, boot times, program load times, et cetera.
The sliding window approach really isn't that difficult, as multiple contiguous pages can be mapped at once. So the size of the record doesn't matter so long as the largest of any single record will fit into memory. The important thing is managing the book-keeping.
If a record doesn't begin on a getpagesize() boundary, your mapping has to begin on the previous page. The length of the region mapped extends from the first byte of the record (rounded down if necessary to the nearest multiple of getpagesize()) to the last byte of the record (rounded up to the nearest multiple of getpagesize()). When you're finished processing a record, you can munmap() it, and move on to the next.
This all works just fine under Windows too using CreateFileMapping() and MapViewOfFile() (and GetSystemInfo() to get SYSTEM_INFO.dwAllocationGranularity --- not SYSTEM_INFO.dwPageSize).
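On the POSIX side, the book-keeping described above might be sketched as follows. MappedRecord, map_record and unmap_record are names invented for this example; the point is only the rounding of the start offset down to a page boundary and remembering the real base address for munmap().

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>      // open(), needed by callers that produce the descriptor
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical record handle: `data` points at the record itself, while
// `base`/`map_len` describe the page-aligned region that munmap() needs.
struct MappedRecord {
    const unsigned char* data = nullptr;
    void* base = nullptr;
    size_t map_len = 0;
};

// Maps `length` bytes starting at file offset `offset` of an open descriptor.
// On failure the returned record has data == nullptr.
MappedRecord map_record(int fd, off_t offset, size_t length) {
    MappedRecord rec;
    if (length == 0 || offset < 0) return rec;

    const off_t page = static_cast<off_t>(sysconf(_SC_PAGESIZE));
    const off_t aligned = offset - (offset % page);            // round the start down to a page boundary
    const size_t slack = static_cast<size_t>(offset - aligned); // bytes mapped before the record begins
    const size_t map_len = slack + length;                      // the kernel rounds the end up to a full page

    void* p = mmap(nullptr, map_len, PROT_READ, MAP_PRIVATE, fd, aligned);
    if (p == MAP_FAILED) return rec;

    rec.base = p;
    rec.map_len = map_len;
    rec.data = static_cast<const unsigned char*>(p) + slack;
    return rec;
}

// Releases the mapping and resets the handle.
void unmap_record(MappedRecord& rec) {
    if (rec.base != nullptr) munmap(rec.base, rec.map_len);
    rec = MappedRecord{};
}
```

A caller would open the file once, then call map_record() and unmap_record() per record. On Windows the same rounding would presumably use dwAllocationGranularity instead of the page size, as noted above.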

mmap should be faster, but I don't know how much. It very much depends on your code. If you use mmap it's best to mmap the whole file at once; that will make your life a lot easier. One potential problem is that if your file is bigger than 4GB (or in practice the limit is lower, often 2GB) you will need a 64-bit architecture. So if you're using a 32-bit environment, you probably don't want to use it.
Having said that, there may be a better route to improving performance. You said the input file gets scanned many times; if you can read it in one pass and then be done with it, that could potentially be much faster.

Perhaps you should pre-process the files, so each record is in a separate file (or at least that each file is a mmap-able size).
Also could you do all of the processing steps for each record, before moving onto the next one? Maybe that would avoid some of the IO overhead?

I agree that mmap'd file I/O is going to be faster, but while you're benchmarking the code, shouldn't the counterexample be somewhat optimized?
Ben Collins wrote:
char data[0x1000];
std::ifstream in("file.bin");
while (in)
{
in.read(data, 0x1000);
// do something with data
}
I would suggest also trying:
char data[0x1000];
std::ifstream ifile( "file.bin");
std::istream in( ifile.rdbuf() );
while( in )
{
in.read( data, 0x1000);
// do something with data
}
And beyond that, you might also try making the buffer size the same size as one page of virtual memory, in case 0x1000 is not the size of one page of virtual memory on your machine... IMHO mmap'd file I/O still wins, but this should make things closer.
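If you do benchmark the stream version, something along these lines may be a fairer baseline. It is only a sketch: sum_file_stream is a made-up name, the buffer is sized from sysconf(_SC_PAGESIZE) as suggested above, and gcount() is checked so the short final read is handled correctly.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <vector>
#include <unistd.h>

// Hypothetical baseline: sums the bytes of a file via std::ifstream,
// using a buffer the size of one virtual-memory page.
// Returns -1 if the file cannot be opened.
long long sum_file_stream(const char* path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return -1;

    const long page = sysconf(_SC_PAGESIZE);
    std::vector<char> buf(static_cast<size_t>(page > 0 ? page : 4096));

    long long sum = 0;
    // read() sets failbit on a short final read, so gcount() is consulted
    // to process only the bytes that actually arrived.
    while (in.read(buf.data(), static_cast<std::streamsize>(buf.size())) || in.gcount() > 0) {
        const std::streamsize got = in.gcount();
        for (std::streamsize i = 0; i < got; ++i) sum += static_cast<unsigned char>(buf[i]);
    }
    return sum;
}
```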

I remember mapping a huge file containing a tree structure into memory years ago. I was amazed by the speed compared to normal de-serialization, which involves a lot of work in memory, like allocating tree nodes and setting pointers.
So in fact I was comparing a single call to mmap (or its counterpart on Windows)
against many (MANY) calls to operator new and constructor calls.
For this kind of task, mmap is unbeatable compared to de-serialization.
Of course one should look into Boost's relocatable pointer for this.
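To illustrate why relocatable references matter, here is a small sketch that stores a binary tree as a flat array of nodes whose children are array indices instead of raw pointers, so the file can be mapped at any address and traversed without fix-ups. This is not Boost's mechanism, just a simpler stand-in for the same idea; FlatNode, sum_tree and sum_tree_file are invented names, and a real format would carry a header and validate the indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical on-disk node: children are referenced by index into the node
// array rather than by pointer, so the data is valid at any mapping address.
struct FlatNode {
    int32_t value;
    int32_t left;    // index of the left child, or -1 if none
    int32_t right;   // index of the right child, or -1 if none
};

// Sums all values reachable from node `i` in a flat node array.
long long sum_tree(const FlatNode* nodes, int32_t i) {
    if (i < 0) return 0;
    return nodes[i].value + sum_tree(nodes, nodes[i].left) + sum_tree(nodes, nodes[i].right);
}

// Maps a file of FlatNode records and sums the tree rooted at node 0.
// Returns -1 on error; no allocation or pointer patching happens on load.
long long sum_tree_file(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size < static_cast<off_t>(sizeof(FlatNode))) { close(fd); return -1; }

    const size_t len = static_cast<size_t>(st.st_size);
    void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return -1;

    const long long total = sum_tree(static_cast<const FlatNode*>(p), 0);
    munmap(p, len);
    return total;
}
```

The "de-serialization" here is just the mmap call itself, which is where the speed-up described above comes from.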

This sounds like a good use-case for multi-threading... I'd think you could pretty easily set up one thread to be reading data while the other(s) process it. That may be a way to dramatically increase the perceived performance. Just a thought.

To my mind, using mmap() "just" unburdens the developer from having to write their own caching code. In a simple "read through file exactly once" case, this isn't going to be hard (although as mlbrock points out you still save the memory copy into process space), but if you're going back and forth in the file or skipping bits and so forth, I believe the kernel developers have probably done a better job implementing caching than I can...

I think the greatest thing about mmap is potential for asynchronous reading with:
addr1 = NULL;
while( size_left > 0 ) {
r = min(MMAP_SIZE, size_left);
addr2 = mmap(NULL, r,
PROT_READ, MAP_FLAGS,
0, pos);
if (addr1 != NULL)
{
/* process mmap from prev cycle */
feed_data(ctx, addr1, MMAP_SIZE);
munmap(addr1, MMAP_SIZE);
}
addr1 = addr2;
size_left -= r;
pos += r;
}
feed_data(ctx, addr1, r);
munmap(addr1, r);
Problem is that I can't find the right MAP_FLAGS to give a hint that this memory should be synced from file asap.
I hope that MAP_POPULATE gives the right hint for mmap (i.e. it will not try to load all contents before returning from the call, but will do so asynchronously with feed_data). At least it gives better results with this flag, even though the manual states that it does nothing without MAP_PRIVATE since 2.6.23.
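One alternative to hunting for a mapping flag is to issue the hint separately with posix_madvise(..., POSIX_MADV_WILLNEED) right after mapping the next window. The sketch below restructures the loop above that way; sum_with_prefetch and consume are made-up names (consume stands in for feed_data), and the call is advisory only, so whether the kernel really reads ahead in the background is something to measure rather than assume.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical consumer standing in for feed_data(): adds up the bytes it is given.
static void consume(long long& acc, const unsigned char* p, size_t n) {
    for (size_t i = 0; i < n; ++i) acc += p[i];
}

// Processes a file window by window, mapping window N+1 and requesting
// read-ahead on it before window N is consumed. `window` must be a multiple
// of the page size. Returns -1 on error.
long long sum_with_prefetch(const char* path, size_t window) {
    const size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    if (window == 0 || window % page != 0) return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }

    long long acc = 0;
    void* prev = nullptr;
    size_t prev_len = 0;
    for (off_t pos = 0; pos < st.st_size; pos += static_cast<off_t>(window)) {
        const size_t len = static_cast<size_t>(
            std::min<off_t>(static_cast<off_t>(window), st.st_size - pos));
        void* next = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, pos);
        if (next == MAP_FAILED) {
            if (prev != nullptr) munmap(prev, prev_len);
            close(fd);
            return -1;
        }
        // Advisory hint: the kernel may begin reading these pages now.
        posix_madvise(next, len, POSIX_MADV_WILLNEED);

        if (prev != nullptr) {   // process the window mapped in the previous cycle
            consume(acc, static_cast<const unsigned char*>(prev), prev_len);
            munmap(prev, prev_len);
        }
        prev = next;
        prev_len = len;
    }
    if (prev != nullptr) {       // the final window has no successor to overlap with
        consume(acc, static_cast<const unsigned char*>(prev), prev_len);
        munmap(prev, prev_len);
    }
    close(fd);
    return acc;
}
```

Unlike the loop above, this version tracks the length of each window separately, so a short final window is unmapped with its own size.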
