Why is reading from a memory mapped file so fast? - c++

I don't have much experience with memory mapped i/o, but after using them for the first time I'm stunned at how fast they are. In my performance tests, I'm seeing that reading from memory mapped files is 30X faster than reading through regular c++ stdio.
My test data is a 3GB binary file containing 20 large double precision floating point arrays. The way my test program is structured, I call an external module's read method, which uses memory mapped i/o behind the scenes. Every time I call the read method, this external module returns a pointer and the size of the data that the pointer points to. Upon returning from this method, I call memcpy to copy the contents of the returned buffer into another array. Since I'm doing a memcpy to copy data out of the memory mapped file anyway, I expected the memory mapped reads to be not considerably faster than normal stdio, but I'm astonished that they are 30X faster.
Why is reading from a memory mapped file so fast?
PS: I'm on a Windows machine. I benchmarked my I/O speeds, and my machine's max disk transfer rate is around 90 MiB/s.

The OS kernel routines for I/O, like read or write calls, are still just functions. Those functions copy data between a userspace buffer and a kernel-space structure, and then to or from a device. When you consider that there is a user buffer, an I/O library buffer (the stdio buffer, for example), a kernel buffer, and then the file, the data may potentially go through three copies to get between your program and the disk. The I/O routines also have to be robust, and lastly, the system calls themselves impose latency (trapping to the kernel, context switching, waking the process up again).
When you memory map a file, you skip much of that, eliminating the buffer copies. By effectively treating the file like a big virtual array, you enable random access without going through the syscall overhead, so you lower the latency per I/O, and if the original code is inefficient (many small random I/O calls), the overhead is reduced even more drastically.
The abstraction of a virtual memory, multiprocessing OS has a price, and this is it.
You can, however, improve I/O in some cases by disabling buffering when you know it will hurt performance, such as for large contiguous writes, but beyond that, you really can't improve on the performance of memory mapped I/O without eliminating the OS altogether.
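For illustration, here is a minimal sketch of what a read-only mapping looks like on Windows (since the question mentions a Windows machine). The file name and buffer size are made up and error handling is omitted; the point is that after MapViewOfFile the whole file behaves like one big array, so the only copy left is your own memcpy:

#include <windows.h>
#include <cstring>
#include <vector>

int main() {
    // Open and map the file read-only; sizes of 0 mean "the whole file".
    HANDLE file = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    const char* view = static_cast<const char*>(MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));

    // Copy one slice out of the view; assumes the file is at least this large.
    std::vector<double> dest(1 << 20);
    std::memcpy(dest.data(), view, dest.size() * sizeof(double));

    UnmapViewOfFile(view);
    CloseHandle(mapping);
    CloseHandle(file);
}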

Related

Is it better to read an entire file in std::string or to manipulate a file with std::ifstream?

I am actually developing scientific C++ simulation programs which read data, compute lots of values from them, and finally store the results in a file. I wanted to know whether reading all the data at once at the beginning of the program is faster than repeatedly accessing the file via std::ifstream during the program.
The data I am using are not very big (several MB), but I do not even know what "big" is for a heap allocation...
I guess it depends on the data and so on (and after some tests, it effectively does depend), but I was wondering what it depends on and whether there is a kind of general principle we should be following.
Long story short, the question is: is keeping a file open and using file manipulators faster than a potentially big heap allocation and using string manipulators?
Take a look at mmap. This API allows you to map a file descriptor into your address space using the same paging mechanism that is used for RAM. This way you get the benefit of random access to the data without unnecessarily copying unneeded data into RAM.
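A minimal POSIX sketch of that idea; the file name is illustrative and error handling is omitted:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("data.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // Map the whole file read-only; pages are faulted in on first access.
    const double* values = static_cast<const double*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    double sum = 0.0;
    for (size_t i = 0; i < st.st_size / sizeof(double); ++i)
        sum += values[i];                  // random access works just as well

    munmap(const_cast<double*>(values), st.st_size);
    close(fd);
    return sum != 0.0 ? 0 : 1;             // use the result so it isn't optimized away
}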
Is reading all the data at once at the beginning of the program faster than keep accessing the file via std::ifstream during the program? Yes, probably it is. Keep in mind that working memory is fast and expensive, while storage memory (a hard drive) exists precisely to be cheap at the cost of being slow.
What is "big" for a heap allocation? The operating system is going to try to fool your process into thinking that all existing working memory is free. This is not actually true, and the OS will "swap" one type of memory for the other if some process requests too much memory. But in principle, you should think that a heap allocation is big if it is comparable to the total size of working memory.
Is keeping a file open and using file manipulators faster than a potentially big heap allocation and using string manipulators? No, it is not faster, but it has another advantage: it is memory-efficient. If you only bring the data you need into memory in order to work with it, you are saving memory for all the other processes on the machine (which could be other threads of your program, for instance). This is a very interesting property for building scalable software.
(Expect this to be closed because it is an "opinion based" question.)
My thoughts:
This sounds like premature optimization. Write it the easy way, then optimize if it is too slow.
Working in memory is generally thousands of times faster. Heap allocations slow down based on the number of allocations, not the size of the allocations. It does not sound like you are working with a whole lot of data though.
If your files are "several MB", then the OS will probably cache them anyway.
Reading data from a file in large chunks is a lot faster than many read requests of small size. For example, 1 read of 10MB is faster than 10 reads of 1MB.
When I optimize file I/O, I read the data into a uint8_t buffer and then parse the buffer. One thorn with this method is reading of text files. There is a possibility that the text encoded data may span across a buffer boundary. For example, you have 4 numbers per text line and only 2 are in the buffer (or only 2 digits of the number are in the buffer). You will have to write code to handle these cases.
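One way to handle a value that spans a buffer boundary is to carry the unterminated tail over to the next chunk. A rough sketch, assuming newline-delimited numbers in a text file (file name and chunk size are illustrative):

#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Read whitespace/newline-delimited numbers in fixed-size chunks, carrying any
// incomplete trailing line over to the next chunk before parsing.
std::vector<double> readNumbers(const char* path) {
    std::vector<double> out;
    std::string carry;                                   // unfinished text from the previous chunk
    std::FILE* f = std::fopen(path, "rb");
    char buf[64 * 1024];
    size_t n;
    while ((n = std::fread(buf, 1, sizeof(buf), f)) > 0) {
        std::string text = carry + std::string(buf, n);
        size_t lastNewline = text.find_last_of('\n');    // anything after this may be incomplete
        if (lastNewline == std::string::npos) { carry = text; continue; }
        carry = text.substr(lastNewline + 1);
        std::string complete = text.substr(0, lastNewline);
        const char* p = complete.c_str();
        char* end = nullptr;
        for (;;) {
            double v = std::strtod(p, &end);
            if (end == p) break;                         // no more numbers in this chunk
            out.push_back(v);
            p = end;
        }
    }
    if (!carry.empty()) {                                // parse the final partial line, if any
        char* end = nullptr;
        double v = std::strtod(carry.c_str(), &end);
        if (end != carry.c_str()) out.push_back(v);
    }
    std::fclose(f);
    return out;
}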
If you consider your program as a pipeline, you may be able to optimize futher. You can implement threads: a reading thread, a processing thread, and a writing (output) thread. The reading thread reads into buffers. When there is enough data for processing, the reading thread wakes up the processing thread. The processing thread processes the data that was read, and when there is some output, it stores it into a shared buffer and wakes up the output thread. So with the pipeline model, the data enters the pipe via the reading thread. At some point in the pipe, the processing thread processes the data. The writing thread takes the data from the processing thread and outputs it (exiting the pipeline).
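A stripped-down sketch of the reader and processing stages of such a pipeline, using a shared queue and a condition variable; the chunk size, the file name, and the processing step are placeholders:

#include <condition_variable>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::queue<std::vector<uint8_t>> chunkQueue;
std::mutex queueMutex;
std::condition_variable queueCv;
bool readingDone = false;

void readerThread(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    std::vector<uint8_t> chunk(1 << 20);                 // 1 MiB read buffer
    size_t n;
    while ((n = std::fread(chunk.data(), 1, chunk.size(), f)) > 0) {
        chunk.resize(n);
        {
            std::lock_guard<std::mutex> lock(queueMutex);
            chunkQueue.push(std::move(chunk));
        }
        queueCv.notify_one();                            // wake the processing thread
        chunk.assign(1 << 20, 0);                        // fresh buffer for the next read
    }
    std::fclose(f);
    { std::lock_guard<std::mutex> lock(queueMutex); readingDone = true; }
    queueCv.notify_one();
}

void processorThread() {
    for (;;) {
        std::unique_lock<std::mutex> lock(queueMutex);
        queueCv.wait(lock, [] { return !chunkQueue.empty() || readingDone; });
        if (chunkQueue.empty() && readingDone) return;   // pipeline drained
        std::vector<uint8_t> chunk = std::move(chunkQueue.front());
        chunkQueue.pop();
        lock.unlock();
        // ... process chunk here, hand results to an output thread ...
    }
}

int main() {
    std::thread reader(readerThread, "input.bin");
    std::thread processor(processorThread);
    reader.join();
    processor.join();
}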
Organizing your data so it fits into a processor cache line will also speed up your program.

Do memory mapped files provide advantage for large buffers?

My program works with large data sets that need to be stored in contiguous memory (several Gigabytes). Allocating memory using std::allocator (i.e. malloc or new) causes system stalls as large portions of virtual memory are reserved and physical memory gets filled up.
Since the program will mostly only work on small portions at a time, my question is whether using memory mapped files would provide an advantage (i.e. mmap or the Windows equivalent), that is, creating a large sparse temporary file and mapping it into virtual memory. Or is there another technique that would change the system's paging strategy such that fewer pages are loaded into physical memory at a time?
I'm trying to avoid building a streaming mechanism that loads portions of a file at a time and instead rely on the system's vm pagination.
Yes, mmap has the potential to speed things up.
Things to consider:
Remember the VMM will page things in and out in page-sized blocks (4 KiB on Linux)
If your memory access is well localised over time, this will work well. But if you do random access over your entire file, you will end up with a lot of seeking and thrashing (still). So, consider whether your 'small portions' correspond with localised bits of the file.
For large allocations, malloc and free will use mmap with MAP_ANON anyway. So the difference in memory mapping a file is simply that you are getting the VMM to do the I/O for you.
Consider using madvise with mmap to assist the VMM in paging well.
When you use open and read (plus, as erenon suggests, posix_fadvise), your file is still held in buffers anyway (i.e. it's not immediately written out) unless you also use O_DIRECT. So in both situations, you are relying on the kernel for I/O scheduling.
If the data is already in a file, it would speed up things, especially in the non-sequential case. (In the sequential case, read wins)
If using open and read, consider using posix_fadvise as well (both hints are sketched just below).
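A hedged sketch of both hints; the file name and window sizes are illustrative, and which advice constants actually help depends on the access pattern:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("data.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // mmap route: tell the VMM the region will be read mostly sequentially,
    // and ask it to start paging in the first window right away.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    madvise(p, st.st_size, MADV_SEQUENTIAL);
    madvise(p, 1 << 20, MADV_WILLNEED);

    // open/read route: the rough equivalent for ordinary reads.
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);      // len 0 means "to end of file"

    munmap(p, st.st_size);
    close(fd);
}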
This really depends on your mmap() implementation. Mapping a file into memory has several advantages that can be exploited by the kernel:
The kernel knows that the contents of the mmap() pages are already present on disk. If it decides to evict these pages, it can omit the write-back.
You reduce copying operations: read() operations typically first read the data into kernel memory, then copy it over to user space.
The reduced copies also mean that less memory is used to store data from the file, which means more memory is available for other uses, which can reduce paging as well.
This is also why it is generally a bad idea to use large caches within an I/O library: modern kernels already cache everything they ever read from disk, so caching a copy in user space means that the amount of data that can be cached is actually reduced.
Of course, you also avoid a lot of headaches that result from buffering data of unknown size in your application. But that is just a convenience for you as a programmer.
However, even though the kernel can exploit these properties, it does not necessarily do so. My experience is that Linux mmap() is generally fine; on AIX, however, I have witnessed really bad mmap() performance. So, if your goal is performance, it's the old measure-compare-decide standby.

What's the best way of implementing a buffer of fixed size when using fread in C++?

Suppose that you have a file of integers and you want to read them one by one.
You have two options for buffering.
Declare an array buffer of size N and use setvbuf to tell the stream which buffer to use. Then when calling fread to read an integer you write fread(&myInt, sizeof(myInt), 1, inputFile);
Declare the same array buffer, but this time don't use setvbuf. Instead handle the buffering yourself, so call fread(buffer, bufferSize*sizeof(int), 1, inputFile) and pick the integers out of buffer.
From my understanding setvbuf exists to make your life easier, but does it come at a cost? Which method would you use in terms of performance?
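For concreteness, option 1 would look roughly like this; the file name is illustrative and the buffer is made deliberately large so that each one-int fread rarely has to touch the OS:

#include <cstdio>
#include <vector>

int main() {
    std::FILE* inputFile = std::fopen("ints.bin", "rb");
    static char buffer[1 << 20];                              // 1 MiB stdio buffer
    std::setvbuf(inputFile, buffer, _IOFBF, sizeof(buffer));  // must be set before any read

    int myInt;
    long long sum = 0;
    while (std::fread(&myInt, sizeof(myInt), 1, inputFile) == 1)
        sum += myInt;                                         // stdio refills buffer in big chunks

    std::fclose(inputFile);
    std::printf("%lld\n", sum);
}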
I would use neither of your examples. I don't think that part of the I/O is the performance bottleneck.
The vbuf is an area for the input routine to place data before putting it into your destination. It could be used as a cache or as a preformatting buffer.
Most of the time, I/O bottlenecks are related to the quantity of data fetched and the number of fetches. For example, reading one byte at a time is less efficient than reading a block of bytes.
Another I/O related bottleneck is the duration between input requests. I/O devices prefer to keep streaming data, non-stop. Some input devices, like hard drives, have an overhead time between when the request is received and when the data starts transmitting. For hard drives, this would be the disk spin-up time.
Your best performance is not to waste development time messing with the C or C++ libraries. You need to use hardware assist. Some platforms have a device called a Direct Memory Access controller (DMA). This device can take data from an input source and deliver it to memory without using the CPU. The CPU can be executing instructions while the DMA is transferring data. In order to use hardware assistance, you need to write code at the OS driver level, or access the OS drivers directly.
The C and C++ I/O libraries are designed for a platform independent concept called streams. There may be execution overhead associated with this (such as extra buffering). If you don't care about different platforms, then access the OS drivers directly.
Don't waste your time messing with the C and C++ libraries. Not much performance gain there. More performance lies in accessing the OS drivers directly (or using your own). How and when you access the I/O will show bigger performance gains than tweaking the C and C++ libraries.
Lastly, using the processor's data cache effectively will gain you performance too.

Speeding up file I/O: mmap() vs. read()

I have a Linux application that reads 150-200 files (4-10GB) in parallel. Each file is read in turn in small, variably sized blocks, typically less than 2K each.
I currently need to maintain over 200 MB/s read rate combined from the set of files. The disks handle this just fine. There is a projected requirement of over 1 GB/s (which is out of the disk's reach at the moment).
We have implemented two different read systems, both of which make heavy use of posix_fadvise: the first is an mmap-ed read in which we map the entirety of the data set and read on demand.
The second is a read()/seek() based system.
Both work well, but only for the moderate cases; the read() method manages our overall file cache much better and can deal well with 100s of GB of files, but is badly rate limited, while mmap is able to pre-cache data, making the sustained data rate of over 200MB/s easy to maintain, but cannot deal with large total data set sizes.
So my question comes to these:
A: Can read()-type file I/O be further optimized beyond the posix_fadvise calls on Linux, or, having tuned the disk scheduler, VMM and posix_fadvise calls, is that as good as we can expect?
B: Are there systematic ways for mmap to better deal with very large mapped data?
Mmap-vs-reading-blocks
is a similar problem to the one I am working on and provided a good starting point on this problem, along with the discussions in mmap-vs-read.
Reads back to what? What is the final destination of this data?
Since it sounds like you are completely IO bound, mmap and read should make no difference. The interesting part is in how you get the data to your receiver.
Assuming you're putting this data into a pipe, I recommend you just dump the contents of each file in its entirety into the pipe. To do this using zero-copy, try the splice system call. You might also try copying the file manually, or forking an instance of cat or some other tool that can buffer heavily, with the current file as stdin and the pipe as stdout.
pid_t pid;
if ((pid = fork())) {
    int status;
    waitpid(pid, &status, 0);          // parent: wait for the child to finish
} else {
    dup2(dest, 1);                     // child: the pipe becomes stdout
    dup2(source, 0);                   // the current file becomes stdin
    execlp("cat", "cat", (char *)NULL);
}
Update0
If your processing is file-agnostic, and doesn't require random access, you want to create a pipeline using the options outlined above. Your processing step should accept data from stdin, or a pipe.
To answer your more specific questions:
A: Can read()-type file I/O be further optimized beyond the posix_fadvise calls on Linux, or, having tuned the disk scheduler, VMM and posix_fadvise calls, is that as good as we can expect?
That's as good as it gets with regard to telling the kernel what to do from userspace. The rest is up to you: buffering, threading, etc., but it's dangerous and probably unproductive guesswork. I'd just go with splicing the files into a pipe.
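A rough Linux-only sketch of that: pushing an already-open file into the write end of a pipe with splice(2). The descriptor names are placeholders, error handling is minimal, and _GNU_SOURCE may need to be defined on some toolchains:

#include <fcntl.h>
#include <unistd.h>

// Move the whole file into the pipe without copying through userspace.
void spliceFileToPipe(int fileFd, int pipeWriteFd) {
    loff_t offset = 0;
    for (;;) {
        ssize_t n = splice(fileFd, &offset, pipeWriteFd, nullptr,
                           1 << 20, SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n <= 0) break;             // 0 = end of file, negative = error
    }
}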
B: Are there systematic ways for mmap to better deal with very large mapped data?
Yes. The following options may give you awesome performance benefits (and may make mmap worth using over read, with testing); a sketch combining them follows the list:
MAP_HUGETLB
Allocate the mapping using "huge pages."
This will reduce the paging overhead in the kernel, which is great if you will be mapping gigabyte sized files.
MAP_NORESERVE
Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available.
This will prevent you from running out of memory while keeping your implementation simple if you don't actually have enough physical memory plus swap for the entire mapping.
MAP_POPULATE
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This may give you speed-ups with sufficient hardware resources, and if the prefetching is ordered and lazy. I suspect this flag is redundant; the VFS likely does this better by default.
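The sketch promised above; the file name is illustrative. MAP_HUGETLB is left out because it only works with hugetlbfs-backed or anonymous mappings, not an ordinary file, so check your setup before adding it:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("big.dat", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // MAP_POPULATE prefaults the pages (read-ahead); MAP_NORESERVE skips swap reservation.
    void* p = mmap(nullptr, st.st_size, PROT_READ,
                   MAP_PRIVATE | MAP_POPULATE | MAP_NORESERVE, fd, 0);

    // ... access p like a huge in-memory array ...

    munmap(p, st.st_size);
    close(fd);
}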
Perhaps using the readahead system call might help, if your program can predict in advance the file fragments it wants to read (but this is only a guess, I could be wrong).
And I think you should tune your application, and perhaps even your algorithms, to read data in chunks much bigger than a few kilobytes. Can't that be half a megabyte instead?
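If the program can predict the next window, a tiny Linux-specific sketch of that readahead guess (the offset bookkeeping is assumed to live elsewhere):

#include <fcntl.h>
#include <unistd.h>

// Ask the kernel to start fetching the next half-megabyte window into the page
// cache while the current one is being consumed; readahead(2) returns quickly.
void prefetchNextWindow(int fd, off_t nextOffset) {
    readahead(fd, nextOffset, 512 * 1024);
}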
The problem here doesn't seem to be which API is used. It doesn't matter whether you use mmap() or read(); the disk still has to seek to the specified point and read the data (although the OS does help to optimize the access).
mmap() has advantages over read() if you read very small chunks (a couple of bytes), because you don't have to call the OS for every chunk, which becomes very slow.
I would also advise, like Basile did, reading more than 2KB consecutively so the disk doesn't have to seek that often.

Manipulating data in Memory instead of file

consider the function below:
int Func(char* filename);
int Func(FILE* filepointer);
These two do the same thing: they read a lot of data from the given file (by name or pointer), analyze the result, and return it.
I want to call this function with lots of different data. Therefore I would have to write the data into a file, then pass the new filename to Func. But the data is huge, and reading from and writing to the hard disk is very slow; the analysis time is actually much less than the I/O time.
Can I get rid of saving/loading the data all the time by any means?
For example, by making a FILE* pointer which points somewhere in memory?
Update: obviously I don't have the source code of Func! It's a DLL call.
You could use memory-mapped file technique or something like boost::iostreams with custom memory sinks / sources.
Actually, the second variant is a lot more flexible, but sometimes all that flexibility and versatility is simply not needed.
In many operating systems you can use an in-memory filesystem such as tmpfs -- and in Windows "temporary files" (opened with the appropriate flags, then rewound rather than closed) behave similarly (i.e., can stay in memory).
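As a concrete example of the tmpfs route: write the in-memory data to a file under a tmpfs mount and hand that path to Func. Using /dev/shm as the mount point and the helper name below are assumptions, and error handling is omitted:

#include <cstdio>
#include <vector>

int Func(char* filename);                      // the DLL entry point from the question

int runFuncOnBuffer(const std::vector<char>& data) {
    char path[] = "/dev/shm/func_input.tmp";   // assumed tmpfs-backed location
    std::FILE* f = std::fopen(path, "wb");
    std::fwrite(data.data(), 1, data.size(), f);   // RAM-to-RAM, no disk involved
    std::fclose(f);

    int result = Func(path);                   // Func reads it back through its normal file path
    std::remove(path);
    return result;
}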
However, there isn't all that much to be gained there compared to writing (with lots of buffering) and reading (ditto) sequentially from an un-fragmented disk, for large files -- tmpfs's performance advantages are mostly for small files. If your performance is very bad, either the disk is horribly fragmented, or (perhaps more likely in these days of self-adjusting filesystems) you're not using buffering appropriately (possibly just not buffering enough); of course, both factors could be in play. Modern devices and filesystems can have awesome performance when just streaming huge buffers to and from memory, after all.
For a given amount of RAM devoted to buffering, you can get better performance (for what at the app level look like huge numbers of tiny writes and reads) if that RAM is in userland, in your app's address space (rather than under kernel control, e.g. in a tmpfs), simply because you'll need fewer context switches -- and switches from user to kernel mode and back tend to dominate runtime when the only other ops performed are copies of small amounts of memory back and forth. When you use very large buffers in your app's stdio library, your "I/O" amounts to userland memory-to-memory copies within your address space, with very rare "streaming" ops that actually transfer those buffers back and forth.