Concurrently writing to file while reading it out using mmap

The situation is this.
A large buffer of data (which shall exceed reasonable RAM
consumption) is being generated by the program.
The program concurrently serves a websocket which will allow a web
client to specify a small subset of this buffer of data to view.
To support the first goal, the file is written to using standard methods (I use portable C-stdio fopen and fwrite because it's been shown to be faster than various "pure C++" methods. Doesn't matter. Data gets appended to file; stdio will buffer the writes and periodically flush them.)
To support the second goal (on BSD, in particular iOS), the file is opened (open from sys/fcntl.h -- not as portable as stdio.h) and memory-mapped (mmap from sys/mman.h -- ditto). By deciding to use memory mapping I have to give up some portability with this code. It seems like Boost is something I could look at to avoid wheel reinvention.
Anyway, my question is about how exactly I'm supposed to do this, because there will be at least two threads: The main program thread appending to the file periodically, and the network (or a worker) thread which responds to web requests and delivers data read out of the memory regions that are mapped to the file on disk.
Supposing the file starts out 1024 bytes in size, mmap is called initially mapping 1024 bytes. As the main thread writes a further 512 bytes into the file, how can the network thread be notified or know anything about the current actual size of the file (so that it can munmap and mmap again with a larger buffer corresponding to the new size)? Furthermore, if I do this naively, I am wary of a situation where the main thread reports that 512 bytes are written, so the other thread now maps 1536 bytes of the file, but not all of the new 512 bytes actually got written to disk yet (OS is still working on writing it, maybe). What happens now? Could there be some garbage that shows up? Will my program crash?
How can I determine when data has been properly flushed? How can I be notified in a timely fashion after the data has been flushed so that I can memory map it?
In particular, is calling fflush the only way to guarantee that the file is now updated w.r.t. the stream, and then can I guarantee (once fflush returns) that the memory map can access the new size without an access violation? What about fsync?

When you are using POSIX API directly in the form of mmap, you should also be using it directly for the writing. POSIX and LibC interfaces just don't play well together.
write is a system call that transfers the data directly to the kernel. It would be slow for writing byte by byte, but for writing large buffers it is a tiny fraction faster because it has less overhead (fwrite ends up calling write under the hood anyway). And it is definitely more efficient than fwrite+fflush, because those may end up being two or more calls to write, whereas a direct write is just one.
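The documentation of mmap is not very clear about it, but you should not access more bytes than the file actually has: on POSIX systems, touching mapped pages that lie wholly beyond the end of the file raises SIGBUS. So write the data first, and only then map (or remap) up to the new size.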
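A minimal sketch of that advice, assuming a POSIX system with a unified page cache (so bytes passed to write are visible to a later mapping without fflush or fsync; fsync matters for durability, not visibility). The names AppendLog, open_log, append and read_range are hypothetical and not from the original post. The writer thread publishes the committed size through an atomic only after write has returned, and the reader never maps past that size.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper type: one append-only file shared by a writer and readers.
struct AppendLog {
    int fd = -1;
    std::atomic<std::size_t> committed{0};  // bytes the reader may safely map
};

// Create a fresh file that is opened for both appending and reading.
inline bool open_log(AppendLog& al, const char* path) {
    al.fd = ::open(path, O_RDWR | O_CREAT | O_TRUNC | O_APPEND, 0644);
    al.committed.store(0, std::memory_order_relaxed);
    return al.fd >= 0;
}

// Writer thread: append with write(2), then publish the new size.
inline bool append(AppendLog& al, const void* data, std::size_t len) {
    assert(al.fd >= 0);
    const char* p = static_cast<const char*>(data);
    std::size_t left = len;
    while (left > 0) {
        ssize_t n = ::write(al.fd, p, left);
        if (n < 0) return false;  // a real program would retry on EINTR
        p += n;
        left -= static_cast<std::size_t>(n);
    }
    // Release: a reader that sees the new size also sees the completed write.
    al.committed.fetch_add(len, std::memory_order_release);
    return true;
}

// Reader thread: copy [offset, offset + len) out of a fresh read-only mapping.
inline bool read_range(const AppendLog& al, std::size_t offset,
                       std::size_t len, std::string& out) {
    const std::size_t size = al.committed.load(std::memory_order_acquire);
    if (len == 0 || offset > size || len > size - offset) return false;
    void* base = ::mmap(nullptr, size, PROT_READ, MAP_SHARED, al.fd, 0);
    if (base == MAP_FAILED) return false;
    out.assign(static_cast<const char*>(base) + offset, len);
    ::munmap(base, size);
    return true;
}
```

Remapping the whole file on every request keeps the sketch short; a real server would more likely keep one mapping and only remap when the committed size has grown past it.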

Related

Memory mapped IO concept details

I'm attempting to figure out what the best way is to write files in Windows. For that, I've been running some tests with memory mapping, in an attempt to figure out what is happening and how I should organize things...
Scenario: The file is intended to be used in a single process, in multiple threads. You should see a thread as a worker that works on the file storage; some of them will read, some will write - and in some cases the file will grow. I want my state to survive both process and OS crashes. Files can be large, say: 1 TB.
After reading a lot on MSDN, I whipped up a small test case. What I basically do is the following:
Open a file (CreateFile) using FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH.
Build a mmap file handle (CreateFileMapping) on the file, using some file growth mechanism.
Map the memory regions (MapViewOfFile) using a multiple of the sector size (from STORAGE_PROPERTY_QUERY). The mode I intend to use is READ+WRITE.
So far I've been unable to figure out how to use these constructs exactly (tools like diskmon won't work for good reasons) so I decided to ask here. What I basically want to know is: how can I best use these constructs for my scenario?
If I understand correctly, this is more or less the correct approach; however, I'm unsure as to the exact role of CreateFileMapping vs MapViewOfFile and if this will work in multiple threads (e.g. the way writes are ordered when they are flushed to disk).
I intend to open the file once per process as per (1).
Per thread, I intend to create a mmap file handle as per (2) for the entire file. If I need to grow the file, I will estimate how much space I need, close the handle and reopen it using CreateFileMapping.
While the worker is doing its thing, it needs pieces of the file. So, I intend to use MapViewOfFile (which seems limited to 2 GB) for each piece, process it and unmap it again.
Questions:
Do I understand the concepts correctly?
When is data physically read and written to disk? So, when I have a loop that writes 1 MB of data in (3), will it write that data after the unmap call? Or will it write data the moment I hit memory in another page? (After all, disks are block devices so at some point we have to write a block...)
Will this work in multiple threads? This is about the calls themselves - I'm not sure if they will error if you have -say- 100 workers.
I do understand that (written) data is immediately available in other threads (unless it's a remote file), which means I should be careful with read/write concurrency. If I intend to write stuff, and afterwards update a (single-physical-block) header (indicating that readers should use another pointer from now on), is it then guaranteed that the data is written prior to the header?
Will it matter if I use 1 file or multiple files (assuming they're on the same physical device of course)?

Memory mapped files generally work best for READING; not writing. The problem you face is that you have to know the size of the file before you do the mapping.
You say:
in some cases the file will grow
Which really rules out a memory mapped file.
When you create a memory mapped file on Windoze, you are creating your own page file and mapping a range of memory to that page file. This tends to be the fastest way to read binary data, especially if the file is contiguous.
For writing, memory mapped files are problematic.

Multithreaded Files Reading

I need to read / parse a large binary file (4 ~ 6 GB) that comes in fixed chunks of 8192 bytes. My current solution involves streaming the file chunks using the Single Producer Multiple Consumer (SPMC) pattern.
EDIT
File size = N * 8192 Bytes
All I am required to do is to do something to each of these 8192 bytes. The file is only required to be read once top down.
Having thought that this should be an embarrassingly parallel problem, I would like to have X threads to read at equal ranges of (File Size / X) sizes independently. The threads do not need to communicate with each other at all.
I've tried spawning X threads to open the same file and seek to their respective sections to process; however, this solution seems to suffer from HDD mechanical seeks and apparently performs worse than the SPMC solution.
Would there be any difference if this method is used on the SSD instead?
Or would it be more straightforward to just memory map the whole file and use #pragma omp parallel for to process the chunks? I suppose I would need sufficient RAM to do this?
What would you suggest?

What would you suggest?
Don't use mmap()
Per Linus Torvalds himself:
People love mmap() and other ways to play with the page tables to
optimize away a copy operation, and sometimes it is worth it.
HOWEVER, playing games with the virtual memory mapping is very
expensive in itself. It has a number of quite real disadvantages that
people tend to ignore because memory copying is seen as something very
slow, and sometimes optimizing that copy away is seen as an obvious
improvment.
Downsides to mmap:
quite noticeable setup and teardown costs. And I mean noticeable.
It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the
mappings. It's The TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Upsides of mmap:
if the data gets re-used over and over again (within a single map operation), or if you can avoid a lot of other logic by just mapping something in, mmap() is just the greatest thing since sliced bread.
This may be a file that you go over many times (the binary image of an executable is the obvious case here - the code jumps all around the place), or a setup where it's just so convenient to map the whole thing in without regard of the actual usage patterns that mmap() just wins. You may have random access patterns, and use mmap() as a way of keeping track of what data you actually needed.
if the data is large, mmap() is a great way to let the system know what it can do with the data-set. The kernel can forget pages as memory pressure forces the system to page stuff out, and then just automatically re-fetch them again.
And the automatic sharing is obviously a case of this.
But your test-suite (just copying the data once) is probably pessimal
for mmap().
Note the last - just using the data once is a bad use-case for mmap().
For a file on an SSD, since there are no physical head seek movements:
Open the file once, using open() to get a single int file descriptor.
Use pread() per thread to read appropriate 8kB chunks. pread() reads from a specified offset without using lseek(), and does not affect the current offset of the file being read from.
You'll probably need somewhat more threads than CPU cores, since there's going to be significant IO waiting on each thread.
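The SSD steps above can be sketched as follows. This is an illustration under assumptions, not the answerer's code: process_chunk is a placeholder that just sums bytes, and sum_chunk_range and parallel_sum are hypothetical names. Every thread shares one descriptor and calls pread with its own offsets, so no thread disturbs the file position seen by the others.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

constexpr std::size_t kChunk = 8192;  // chunk size taken from the question

// Placeholder for the real per-chunk work: here, just sum the bytes.
inline std::uint64_t process_chunk(const unsigned char* data, std::size_t len) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < len; ++i) sum += data[i];
    return sum;
}

// Work done by one thread: chunks [first, last) read with pread on a shared fd.
inline std::uint64_t sum_chunk_range(int fd, std::size_t first, std::size_t last) {
    assert(first <= last);
    std::vector<unsigned char> buf(kChunk);
    std::uint64_t total = 0;
    for (std::size_t c = first; c < last; ++c) {
        const ssize_t n = ::pread(fd, buf.data(), kChunk,
                                  static_cast<off_t>(c * kChunk));
        if (n != static_cast<ssize_t>(kChunk)) break;  // short read or error
        total += process_chunk(buf.data(), kChunk);
    }
    return total;
}

// Split the file into equal chunk ranges, one per thread, and combine results.
inline std::uint64_t parallel_sum(int fd, unsigned num_threads) {
    assert(num_threads > 0);
    struct stat st;
    if (::fstat(fd, &st) != 0) return 0;
    const std::size_t chunks = static_cast<std::size_t>(st.st_size) / kChunk;
    std::vector<std::uint64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&partial, fd, chunks, num_threads, t] {
            const std::size_t first = chunks * t / num_threads;
            const std::size_t last = chunks * (t + 1) / num_threads;
            partial[t] = sum_chunk_range(fd, first, last);
        });
    }
    for (auto& w : workers) w.join();
    std::uint64_t total = 0;
    for (std::uint64_t v : partial) total += v;
    return total;
}
```

The thread count is left as a parameter because, as noted above, the best value is probably somewhat higher than the number of cores and has to be found by measurement.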
For a file on mechanical disk(s):
You want to minimize head seek(s) on the mechanical disk.
Open the file once, using open() with direct IO (assuming Linux, open( filename, O_RDONLY | O_DIRECT );) to bypass the page cache (since you're going to stream the file and never re-read any portion of it, the page cache does you no good here)
Using a single producer thread, read large chunks (say 64k to 1MB+)
into one of N page-aligned buffers.
When a buffer is read, pass it to the worker threads, then read to fill the next buffer
When all workers are done using their part of the buffer, pass the
buffer back to the reading thread.
You'll need to experiment with the proper read() size, the number of worker threads, and the number of buffers passed around. Larger read()s will be more efficient, but the larger buffer size makes the memory requirements larger and makes the latency of getting that buffer back from the worker threads much more unpredictable. You want to make as few copies of the data as possible, so you'd want the worker threads to work directly on the buffer read from the file.

Even if the processing of each 8K block is significant (short of OCR processing), the i/o is the bottleneck. Unless it can be arranged for parts of the file to be already cached by previous operations....
If the system this is to run on can be dedicated to the problem:
Obtain the file size (fstat)
Allocate a buffer that size.
Open and read the whole file into the buffer.
Figure out how to partition the data per thread and spin off the threads.
Time that algorithm.
Then, revise it using asynchronous reading. See man aio_read and man 7 aio to learn what needs to be done.

Is it better to read an entire file in std::string or to manipulate a file with std::ifstream?

I am actually developing scientific C++ simulation programs which read data, compute lots of values from them and finally store the results in a file. I wanted to know if reading all the data at once at the beginning of the program is faster than keep accessing the file via std::ifstream during the program.
The data I am using are not very big (several MB), but I do not even know what "big" is for a heap allocation...
I guess it depends on the data and so on (and after some test, effectively it depends), but I was wondering on what it was depending and whether there is a kind of general principle we should be following.
Long story short, the question is: is keeping a file open and using file manipulators faster than a potentially big heap allocation and using string manipulators?

Take a look at mmap. This API allows you to map a file descriptor into your address space using the same paging mechanism as is used for RAM. This way you should get the benefit of random access to the data without unnecessarily copying unneeded data into RAM.
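As a rough illustration of that suggestion, here is a small read-only mapping wrapper for POSIX systems. The class name MappedFile and its interface are assumptions made for this sketch; nothing like it appears in the original answer. Pages are only brought into memory when an index is actually touched.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical RAII wrapper around a read-only file mapping (POSIX only).
class MappedFile {
public:
    explicit MappedFile(const char* path) {
        const int fd = ::open(path, O_RDONLY);
        if (fd < 0) return;
        struct stat st;
        if (::fstat(fd, &st) == 0 && st.st_size > 0) {
            const std::size_t len = static_cast<std::size_t>(st.st_size);
            void* p = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) {
                data_ = static_cast<const unsigned char*>(p);
                size_ = len;
            }
        }
        ::close(fd);  // the mapping stays valid after the descriptor is closed
    }
    ~MappedFile() {
        if (data_) ::munmap(const_cast<unsigned char*>(data_), size_);
    }
    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    bool ok() const { return data_ != nullptr; }
    std::size_t size() const { return size_; }

    // Random access: the kernel faults the page in on first touch.
    unsigned char operator[](std::size_t i) const {
        assert(i < size_);
        return data_[i];
    }

private:
    const unsigned char* data_ = nullptr;
    std::size_t size_ = 0;
};
```

For files of only a few MB, as in the question, the difference from reading everything into a std::string is likely to be small, so it is worth measuring before committing to either approach.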

Is reading all the data at once at the beginning of the program faster than keep accessing the file via std::ifstream during the program? Yes, probably it is. Keep in mind that working memory is fast and expensive, while storage memory (a hard drive) exists precisely to be cheap at the cost of being slow.
What is "big" for a heap allocation? The operating system is going to try to fool your process into thinking that all existing working memory is free. This is not actually true, and the OS will "swap" one type of memory for the other if some process requests too much memory. But in principle, you should think that a heap allocation is big if it is comparable to the total size of working memory.
Is keeping a file open and using file manipulators faster than a potentially big heap allocation and using string manipulators? No, it is not faster, but it has another advantage: it is memory-efficient. If you only put the needed data into memory in order to work with them, you are saving memory for all other processes in the machine (which could be other threads of your program, for instance). This is a very interesting property in order to have scalable software.

(Expect this to be closed because it is an "opinion based" question.)
My thoughts:
This sounds like premature optimization. Write it the easy way, then optimize if it is too slow.
Working in memory is generally thousands of times faster. Heap allocations slow down based on the number of allocations, not the size of the allocations. It does not sound like you are working with a whole lot of data though.
If your files are "several MB" then the OS will probably cache it anyway.

Reading data from a file in large chunks is a lot faster than many read requests of small size. For example, 1 read of 10MB is faster than 10 reads of 1MB.
When I optimize file I/O, I read the data into a uint8_t buffer and then parse the buffer. One thorn with this method is reading of text files. There is a possibility that the text encoded data may span across a buffer boundary. For example, you have 4 numbers per text line and only 2 are in the buffer (or only 2 digits of the number are in the buffer). You will have to write code to handle these cases.
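The boundary problem described above can be handled by carrying the unfinished token over to the next chunk. The sketch below assumes a text file of whitespace-separated non-negative integers and simply sums them; the function name sum_numbers_chunked and the summing task are illustrative choices, not part of the original answer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Sum the non-negative integers in a text file, reading it in fixed-size chunks.
// Returns -1 if the file cannot be opened.
inline long long sum_numbers_chunked(const char* path, std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return -1;

    std::vector<std::uint8_t> buf(chunk_size);
    std::string carry;  // digits of a number that may continue in the next chunk
    long long total = 0;
    std::size_t n = 0;

    while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0) {
        for (std::size_t i = 0; i < n; ++i) {
            const char c = static_cast<char>(buf[i]);
            if (c >= '0' && c <= '9') {
                carry.push_back(c);  // keep collecting, even across chunk boundaries
            } else if (!carry.empty()) {
                total += std::strtoll(carry.c_str(), nullptr, 10);
                carry.clear();
            }
        }
        // No flush here: a number cut off by the boundary stays in `carry`.
    }
    if (!carry.empty()) total += std::strtoll(carry.c_str(), nullptr, 10);

    std::fclose(f);
    return total;
}
```

Because the partial token lives outside the read buffer, the result does not depend on the chunk size, which makes it easy to tune the chunk size for throughput alone.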
If you consider your program as a pipeline, you may be able to optimize further. You can implement threads: a reading thread, a processing thread, and a writing (output) thread. The reading thread reads into buffers. When there is enough data for processing, the reading thread wakes up the processing thread. The processing thread processes the data that was read, and when there is some output, it stores it into a shared buffer and wakes up the output thread. So with the pipeline model, the data enters the pipe via the reading thread. At some point in the pipe, the processing thread processes the data. The writing thread takes the data from the processing thread and outputs it (exiting the pipeline).
Also, organizing your data so it fits into a processor cache line will also speed up your program.

Understanding buffering behavior of fwrite()

I am using the function call fwrite() to write data to a pipe on Linux.
Earlier, fwrite() was being called for small chunks of data (average 20 bytes) repeatedly and buffering was left to fwrite(). strace on the process showed that 4096 bytes of data was being written at a time.
It turned out that this writing process was the bottleneck in my program. So I decided to buffer data in my code into blocks of 64KB and then write the entire block at a time using fwrite(). I used setvbuf() to set the FILE* pointer to 'No Buffering'.
The performance improvement was not as significant as I'd expected.
More importantly, the strace output showed that data was still being written 4096 bytes at a time. Can someone please explain this behavior to me? If I am calling fwrite() with 64KB of data, why is it writing only 4096 bytes at a time?
Is there an alternative to fwrite() for writing data to a pipe using a FILE* pointer?

The 4096 comes from the Linux machinery that underlies pipelines. There are two places it occurs. One is the capacity of the pipeline. The capacity is one system page on older versions of Linux, which is 4096 bytes on a 32 bit i386 machine. (On more modern versions of Linux the capacity is 64K.)
The other place you'll run into that 4096 bytes problem is in the defined constant PIPE_BUF, the number of bytes that are guaranteed to be treated atomically. On Linux this is 4096 bytes. What this limit means depends on whether you have set the pipeline to blocking or non-blocking. Do a man -S7 pipe for all the gory details.
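Rather than relying on the numbers quoted here, both limits can be queried at run time. The sketch below uses fpathconf with _PC_PIPE_BUF (POSIX) for the atomic-write limit and, where the Linux-specific F_GETPIPE_SZ fcntl command is available, the pipe capacity. The names PipeLimits, pipe_buf_for and query_pipe_limits are hypothetical.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical result type: -1 means "could not be determined".
struct PipeLimits {
    long atomic_write;  // PIPE_BUF: largest write guaranteed to be atomic
    long capacity;      // total pipe buffer size (Linux only)
};

// Atomic-write limit for an already open pipe or FIFO descriptor.
inline long pipe_buf_for(int fd) {
    assert(fd >= 0);
    return ::fpathconf(fd, _PC_PIPE_BUF);
}

// Create a throwaway pipe and report its limits.
inline PipeLimits query_pipe_limits() {
    PipeLimits out{-1, -1};
    int fds[2];
    if (::pipe(fds) != 0) return out;
    out.atomic_write = pipe_buf_for(fds[1]);
#ifdef F_GETPIPE_SZ
    out.capacity = ::fcntl(fds[1], F_GETPIPE_SZ);  // Linux-specific command
#endif
    ::close(fds[0]);
    ::close(fds[1]);
    return out;
}
```

On a system where the capacity is larger than a page, seeing 4096-byte writes in strace would point at something other than the pipe capacity, such as the stdio buffer size.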
If you are trying to exchange huge volumes of data at a high rate you might want to rethink your use of pipes. You're on a Linux box, so shared memory is an option. You can use pipes to send relatively small amounts of data as a signaling mechanism.

If you want to change the buffering behavior, you must do so immediately after the fopen (or before any I/O, for the standard filehandles stdin, stdout, stderr). You also do not want to disable buffering and try to manage the buffer yourself; rather, specify your 64K buffer to setvbuf so that it can be used properly.
If you really want to manage the buffering manually, do not use stdio; use the lower level open, write, and close calls.
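A minimal sketch of the setvbuf advice, assuming fixed 20-byte records like the ones mentioned in the question. The function name write_records_buffered and the record contents are made up for illustration. The buffer is handed to setvbuf immediately after fopen, before any I/O, and it outlives the stream.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// Write `count` small records through a 64 KB, fully buffered stdio stream.
inline bool write_records_buffered(const char* path, int count) {
    assert(count >= 0);
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;

    // Declared before any I/O and destroyed only after fclose below.
    std::vector<char> buf(64 * 1024);
    if (std::setvbuf(f, buf.data(), _IOFBF, buf.size()) != 0) {
        std::fclose(f);
        return false;
    }

    const char record[20] = "0123456789abcdefghi";  // 19 characters plus '\0'
    bool ok = true;
    for (int i = 0; i < count && ok; ++i) {
        // Small writes land in the 64 KB buffer; stdio flushes it when full.
        ok = std::fwrite(record, 1, sizeof record, f) == sizeof record;
    }
    // fclose flushes whatever is left in the buffer.
    return (std::fclose(f) == 0) && ok;
}
```

When the destination is a pipe rather than a regular file, the size of each underlying write can still be limited by the pipe itself, as the other answer explains.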

Speeding up file I/O: mmap() vs. read()

I have a Linux application that reads 150-200 files (4-10GB) in parallel. Each file is read in turn in small, variably sized blocks, typically less than 2K each.
I currently need to maintain over 200 MB/s read rate combined from the set of files. The disks handle this just fine. There is a projected requirement of over 1 GB/s (which is out of the disk's reach at the moment).
We have implemented two different read systems, both of which make heavy use of posix_advise. The first is an mmapped read in which we map the entirety of the data set and read on demand.
The second is a read()/seek() based system.
Both work well, but only for the moderate cases. The read() method manages our overall file cache much better and can deal well with 100s of GB of files, but is badly rate limited. mmap is able to pre-cache data, making the sustained data rate of over 200MB/s easy to maintain, but cannot deal with large total data set sizes.
So my question comes to these:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
B: Are there systematic ways for mmap to better deal with very large mapped data?
Mmap-vs-reading-blocks
is a similar problem to what I am working and provided a good starting point on this problem, along with the discussions in mmap-vs-read.

Reads back to what? What is the final destination of this data?
Since it sounds like you are completely IO bound, mmap and read should make no difference. The interesting part is in how you get the data to your receiver.
Assuming you're putting this data to a pipe, I recommend you just dump the contents of each file in its entirety into the pipe. To do this using zero-copy, try the splice system call. You might also try copying the file manually, or forking an instance of cat or some other tool that can buffer heavily with the current file as stdin, and the pipe as stdout.
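if ((pid = fork()) > 0) {
    waitpid(pid, ...);                    /* parent: wait for cat to finish */
} else if (pid == 0) {
    dup2(dest, 1);                        /* child: the pipe becomes stdout */
    dup2(source, 0);                      /* child: the file becomes stdin */
    execlp("cat", "cat", (char *)NULL);   /* the argument list must be NULL-terminated */
    _exit(127);                           /* only reached if exec fails */
}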
Update0
If your processing is file-agnostic, and doesn't require random access, you want to create a pipeline using the options outlined above. Your processing step should accept data from stdin, or a pipe.
To answer your more specific questions:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
That's as good as it gets with regard to telling the kernel what to do from userspace. The rest is up to you: buffering, threading etc. but it's dangerous and probably unproductive guess work. I'd just go with splicing the files into a pipe.
B: Are there systematic ways for mmap to better deal with very large mapped data?
Yes. The following options may give you awesome performance benefits (and may make mmap worth using over read, with testing):
MAP_HUGETLB
Allocate the mapping using "huge pages."
This will reduce the paging overhead in the kernel, which is great if you will be mapping gigabyte sized files.
MAP_NORESERVE
Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available.
This will prevent you running out of memory while keeping your implementation simple if you don't actually have enough physical memory + swap for the entire mapping.
MAP_POPULATE
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This may give you speed-ups with sufficient hardware resources, and if the prefetching is ordered, and lazy. I suspect this flag is redundant, the VFS likely does this better by default.
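For reference, a hedged sketch of how two of these flags are passed to mmap. MAP_POPULATE and MAP_NORESERVE are Linux flags, so they are guarded with #ifdef; MAP_HUGETLB is left out because, as far as I know, it does not apply to ordinary file mappings outside hugetlbfs. The names Mapping, map_prefaulted and unmap are assumptions of this example, and whether the flags help at all has to be measured on the real workload.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical handle for a mapped region; addr is nullptr on failure.
struct Mapping {
    void* addr;
    std::size_t len;
};

// Map a whole file read-only, asking the kernel to prefault it where supported.
inline Mapping map_prefaulted(const char* path) {
    assert(path != nullptr);
    Mapping m{nullptr, 0};
    const int fd = ::open(path, O_RDONLY);
    if (fd < 0) return m;

    struct stat st;
    if (::fstat(fd, &st) == 0 && st.st_size > 0) {
        int flags = MAP_PRIVATE;
#ifdef MAP_POPULATE
        flags |= MAP_POPULATE;   // prefault page tables, triggering read-ahead
#endif
#ifdef MAP_NORESERVE
        flags |= MAP_NORESERVE;  // do not reserve swap space for the mapping
#endif
        const std::size_t len = static_cast<std::size_t>(st.st_size);
        void* p = ::mmap(nullptr, len, PROT_READ, flags, fd, 0);
        if (p != MAP_FAILED) {
            m.addr = p;
            m.len = len;
        }
    }
    ::close(fd);  // the mapping remains valid without the descriptor
    return m;
}

inline void unmap(Mapping& m) {
    if (m.addr) ::munmap(m.addr, m.len);
    m.addr = nullptr;
    m.len = 0;
}
```

With MAP_POPULATE the mmap call itself blocks while the file is read in, so for the multi-gigabyte files in the question it may be preferable to map in smaller windows.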

Perhaps using the readahead system call might help, if your program can predict in advance the file fragments it wants to read (but this is only a guess, I could be wrong).
And I think you should tune your application, and perhaps even your algorithms, to read data in chunks much bigger than a few kilobytes. Can't that be half a megabyte instead?

The problem here doesn't seem to be which api is used. It doesn't matter if you use mmap() or read(), the disc still has to seek to the specified point and read the data (although the os does help to optimize the access).
mmap() has advantages over read() if you read very small chunks (a couple of bytes) because you don't have to call the OS for every chunk, which becomes very slow.
I would also advise like Basile did to read more than 2kb consecutively so the disc doesn't have to seek that often.