I have an image stream coming in from a camera at about 100 frames/second, with each image being about 2 MB. Now just because of the disk write speed I know I can't write each frame, so I'm only trying to save about a third of those frames each second.
The stream is a circular buffer of large char arrays. Right now I'm using fwrite to dump each array to a temporary file as it gets buffered, but it only seems to be writing at about 20-30 MB/s, while the hard drive should theoretically go up to 80-100 MB/s.
Any thoughts? Is there a faster way to write than fwrite() or a way to optimize it?
More generally, what is the fastest way to dump large amounts of data to a standard hard drive?
What if you use memory-mapped files limited to, for example, 1 GB each? This should provide enough speed and buffer to work with all frames, especially if you manage to perform zero-copy frame allocation.
fwrite is buffered, which is what you want, though with files and writes that big it shouldn't make much or any difference. Maybe experiment with a larger stream buffer using the setvbuf call (setbuf itself always uses a buffer of BUFSIZ bytes).
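To make the larger-stream-buffer experiment concrete, here is a minimal sketch. The helper name write_frames and its parameters are made up for illustration, and it writes the same frame repeatedly only to keep the example short; the buffer size is the knob to experiment with.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write `count` copies of a `frame_size`-byte frame to
 * `path` through a stdio stream that uses a `buf_size`-byte buffer.
 * Returns the number of frames written, or 0 on any failure. */
size_t write_frames(const char *path, const unsigned char *frame,
                    size_t frame_size, size_t count, size_t buf_size)
{
    assert(frame != NULL && frame_size > 0 && buf_size > 0);

    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return 0;

    /* setvbuf must be called before any other operation on the stream,
     * and the buffer has to stay alive until fclose(). */
    char *buf = malloc(buf_size);
    if (buf == NULL || setvbuf(fp, buf, _IOFBF, buf_size) != 0) {
        fclose(fp);
        free(buf);
        return 0;
    }

    size_t written = 0;
    for (size_t i = 0; i < count; i++) {
        if (fwrite(frame, frame_size, 1, fp) != 1)
            break;
        written++;
    }

    if (fclose(fp) != 0)    /* fclose flushes; a failure means data was lost */
        written = 0;
    free(buf);
    return written;
}
```

Whether a bigger buffer helps at all depends on the platform, so time it with a few sizes (say 64 KiB up to a few MiB) rather than assuming.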
Since you are limited by physical disk i/o speeds, as long as you are making it as easy as possible for the system to use each available disk io efficiently there's not really more you can do.
vmstat on linux (other similar tools on other systems) can tell you how many disk i/os your disk is doing, so you can test if your changes help anything.
Asynchronous non-buffered output is the key to success in your case. Buffered I/O will only cause double-buffering overhead, and synchronous I/O will make the HDD heads miss sequential sectors.
Boost.Asio provides a relatively good encapsulation of system-specific APIs for popular platforms.
There are a few things to remember:
On most non-Windows platforms you will have to write to raw partitions to get the system's buffering and internal threading out of the way.
Keep the write queue non-empty all the time, so the SATA controller can help you by means of NCQ.
Pay attention to system-specific requirements on buffer alignment and size for async non-buffered I/O to work.
The file open mode is also important to make the system do what you want.
Related
Suppose that you have a file of integers and you want to read them one by one.
You have two options for buffering.
Declare an array buffer of size N and use setvbuf to tell fread which buffer to use. Then when calling the function fread to read an integer you write fread(&myInt, sizeof(myInt), 1, inputFile);
Declare the same array buffer but this time don't use the function setvbuf. Instead work on the buffering by yourself. So call fread(buffer, bufferSize*sizeof(int), 1, inputFile)
From my understanding setvbuf exists to make your life easier, but does it come at a cost? Which method would you use in terms of performance?
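For reference, the two options could look roughly like this. It is only a sketch: the function names and the BLOCK_INTS size are assumptions, and the second variant passes sizeof(int) as the element size so that a short final block is still counted correctly.

```c
#include <assert.h>
#include <stdio.h>

enum { BLOCK_INTS = 4096 };   /* assumed buffer size, in ints */

/* Option 1: hand stdio a buffer with setvbuf; one fread call per integer.
 * Stores the sum of all integers in *sum. Returns 0 on success, -1 on error. */
int sum_with_setvbuf(const char *path, long long *sum)
{
    static char stream_buf[BLOCK_INTS * sizeof(int)];
    assert(path != NULL && sum != NULL);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    setvbuf(fp, stream_buf, _IOFBF, sizeof stream_buf);

    int value;
    *sum = 0;
    while (fread(&value, sizeof value, 1, fp) == 1)
        *sum += value;
    fclose(fp);
    return 0;
}

/* Option 2: do the buffering yourself; one fread call per block. */
int sum_with_own_buffer(const char *path, long long *sum)
{
    static int block[BLOCK_INTS];
    assert(path != NULL && sum != NULL);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    setvbuf(fp, NULL, _IONBF, 0);   /* avoid buffering the same data twice */

    size_t n;
    *sum = 0;
    while ((n = fread(block, sizeof block[0], BLOCK_INTS, fp)) > 0) {
        for (size_t i = 0; i < n; i++)
            *sum += block[i];
    }
    fclose(fp);
    return 0;
}
```

Both variants issue about the same number of reads to the OS; the difference is the per-integer function call overhead in the first one, which is worth measuring before assuming it matters.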
I would use neither of your examples. I don't think that part of the I/O is the performance bottleneck.
The vbuf is an area for the input routine to place data before putting it into your destination. It could be used as a cache or as a preformatting buffer.
Most of the time, I/O bottlenecks are related to the quantity of data fetched and the number of fetches. For example, reading one byte at a time is less efficient than reading a block of bytes.
Another I/O related bottleneck is the duration between input requests. I/O devices prefer to keep streaming data, non-stop. Some input devices, like hard drives, have an overhead time between when the request is received and when the data starts transmitting. For hard drives, this is the seek time and rotational latency, plus the spin-up time if the drive was idle.
Your best performance is not to waste development time messing with the C or C++ libraries. You need to use hardware assist. Some platforms have a device called a Direct Memory Access controller (DMA). This device can take data from an input source and deliver it to memory without using the CPU. The CPU can be executing instructions while the DMA is transferring data. In order to use hardware assistance, you need to write code at the OS driver level, or access the OS drivers directly.
The C and C++ I/O libraries are designed for a platform independent concept called streams. There may be execution overhead associated with this (such as extra buffering). If you don't care about different platforms, then access the OS drivers directly.
Don't waste your time messing with the C and C++ libraries. Not much performance gain there. More performance lies in accessing the OS drivers directly (or using your own). How and when you access the I/O will show bigger performance gains than tweaking the C and C++ libraries.
Lastly, using the processor's data cache effectively will gain you performance too.
I am unable to find an explanation of the underlying concept of I/O stream buffering and what it means.
Any tutorials and links will be helpful.
Buffering is a fundamental part of software that handles input and output. The buffer holds data that is in between the software interface and the hardware interface, since hardware and software run at different speeds.
A component which produces data can put it into a buffer, and later the buffer is "flushed" by sending the collected data to the next component. Likewise the other component may be "waiting on the buffer" until a complete piece of data, or enough data to be efficiently processed, is available for input.
In C++, std::basic_filebuf implements a buffer over a filesystem file. It stores up to a fixed number of bytes so the operating system always works with a minimum transaction size, while the program can access individual characters if desired.
See Wikipedia.
Buffering means using memory (user-space memory) instead of sending the data straight to the OS (i.e. disk). It saves on context switches.
Here's the concept. Imagine you have an application that needs to write its data onto the hard drive. Let's say it wants to write something (e.g. update a log file) every half of a second. Is this good? No, and here is the reason.
Software can be very fast, but the speed at which the HDD can operate is limited, and it's much slower than memory and your application. To write something, the HDD needs to reposition its magnetic heads to a specific track, wait for the right sector to rotate under them, write the data, and reposition back to where it was. So your application could operate very slowly (well, that's a theoretical example of course).
Buffering helps to deal with this. Instead of writing to the disk each time, the data is accumulated in a buffer somewhere in memory. Once a sufficient amount of data is gathered, the buffer is flushed: the data from it gets written to the disk. Such an approach helps to minimize HDD operations and improve overall speed.
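The accumulate-then-flush idea can be sketched in a few lines. Everything here (LogBuffer, log_append, log_flush, the 4096-byte capacity) is made up for illustration, and real code would also want to retry on EINTR.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

enum { LOG_BUF_SIZE = 4096 };   /* assumed capacity */

typedef struct {
    int    fd;                  /* destination file descriptor */
    size_t used;                /* bytes currently held in data[] */
    size_t flushes;             /* how many times we actually went to the OS */
    char   data[LOG_BUF_SIZE];
} LogBuffer;

/* Write everything accumulated so far. Returns 0 on success, -1 on error. */
int log_flush(LogBuffer *b)
{
    size_t off = 0;
    while (off < b->used) {
        ssize_t n = write(b->fd, b->data + off, b->used - off);
        if (n < 0)
            return -1;
        off += (size_t)n;
    }
    if (b->used > 0)
        b->flushes++;
    b->used = 0;
    return 0;
}

/* Append a message in memory; only touch the OS when the buffer is full. */
int log_append(LogBuffer *b, const char *msg)
{
    size_t len = strlen(msg);
    assert(len <= LOG_BUF_SIZE);    /* sketch: oversized messages not handled */
    if (b->used + len > LOG_BUF_SIZE && log_flush(b) != 0)
        return -1;
    memcpy(b->data + b->used, msg, len);
    b->used += len;
    return 0;
}
```

With 10-byte messages, a thousand appends turn into only a handful of write() calls instead of a thousand. The price is that anything still in the buffer is lost if the program crashes before the next flush.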
I'm writing streams of images to a hard disk using std::fstream. Since most hard disk drives have a 32MB cache, is it more efficient to create a buffer to accumulate image data up to 32MB and then write to disk, or is it as efficient to just write every image onto the disk?
The cache is used as a read/write cache to alleviate problems due to queuing. Here are my experiences with disks:
If the disk is not an SSD, then it's better to write serially than to seek between files. Seeking is a killer for I/O performance.
Disks typically write in whole sectors. Sector sizes are usually 512 bytes or 4 KiB (newer disks). Try to write data in whole-sector units.
Bunching I/O is always faster than multiple small I/Os. The simple reason is that the processor on the disk has a smaller queue to flush.
Whatever you can serve from memory, serve. Use disk only if necessary. You can always do a modify/invalidate cache entry on write, depending on your reliability policy. Make sure you don't swap, so your memory cache size must be reasonable to begin with.
If you're doing this I/O management yourself, make sure you don't double-buffer with the OS page cache. Use O_DIRECT for this.
Use non-blocking I/O (O_NONBLOCK) if reliability isn't an issue. Note that on Linux this flag has no effect on regular files.
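Writing in whole sectors, and O_DIRECT in particular, usually means the buffer address and the transfer size both have to be multiples of the sector size. A minimal sketch of that bookkeeping follows; the helper names are hypothetical, and the sector size is passed in by the caller (4096 is a common assumption, but real code should query the device).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Round `n` up to the next multiple of `align`, which must be a power of two. */
size_t round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Allocate a zero-filled buffer whose address and size are both multiples
 * of `sector`. The padded size is stored in *padded. Release with free(). */
void *alloc_sector_buffer(size_t payload, size_t sector, size_t *padded)
{
    assert(payload > 0 && padded != NULL);
    *padded = round_up(payload, sector);

    /* C11 aligned_alloc requires the size to be a multiple of the alignment. */
    void *p = aligned_alloc(sector, *padded);
    if (p != NULL)
        memset(p, 0, *padded);
    return p;
}
```

For a 2,000,000-byte image and 4096-byte sectors this pads each write to 2,002,944 bytes, so the reader has to know the real payload length from somewhere else (a header or an index).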
Every part of your system, from fstream down to the disk driver knows more about specific efficiency than your application even has access to.
You couldn't improve upon the various buffering schemes if you tried, so don't bother.
I have a Linux application that reads 150-200 files (4-10GB) in parallel. Each file is read in turn in small, variably sized blocks, typically less than 2K each.
I currently need to maintain over 200 MB/s read rate combined from the set of files. The disks handle this just fine. There is a projected requirement of over 1 GB/s (which is out of the disk's reach at the moment).
We have implemented two different read systems; both make heavy use of posix_advise. The first is an mmapped read in which we map the entirety of the data set and read on demand.
The second is a read()/seek() based system.
Both work well, but only for the moderate cases. The read() method manages our overall file cache much better and can deal well with 100s of GB of files, but is badly rate limited. mmap is able to pre-cache data, making the sustained data rate of over 200 MB/s easy to maintain, but cannot deal with large total data set sizes.
So my question comes to these:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
B: Are there systematic ways for mmap to better deal with very large mapped data?
mmap-vs-reading-blocks is a similar problem to the one I am working on and provided a good starting point, along with the discussions in mmap-vs-read.
Reads back to what? What is the final destination of this data?
Since it sounds like you are completely IO bound, mmap and read should make no difference. The interesting part is in how you get the data to your receiver.
Assuming you're putting this data to a pipe, I recommend you just dump the contents of each file in its entirety into the pipe. To do this using zero-copy, try the splice system call. You might also try copying the file manually, or forking an instance of cat or some other tool that can buffer heavily with the current file as stdin, and the pipe as stdout.
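pid_t pid = fork();
if (pid > 0) {
    waitpid(pid, NULL, 0);               /* parent: wait for cat to finish */
} else if (pid == 0) {
    dup2(source, 0);                     /* child: the file becomes stdin */
    dup2(dest, 1);                       /* the pipe becomes stdout */
    execlp("cat", "cat", (char *)NULL);  /* argument list must end with NULL */
    _exit(127);                          /* reached only if exec failed */
}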
Update0
If your processing is file-agnostic, and doesn't require random access, you want to create a pipeline using the options outlined above. Your processing step should accept data from stdin, or a pipe.
To answer your more specific questions:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
That's as good as it gets with regard to telling the kernel what to do from userspace. The rest is up to you: buffering, threading, etc., but it's dangerous and probably unproductive guesswork. I'd just go with splicing the files into a pipe.
B: Are there systematic ways for mmap to better deal with very large mapped data?
Yes. The following options may give you awesome performance benefits (and may make mmap worth using over read, with testing):
MAP_HUGETLB
Allocate the mapping using "huge pages."
This will reduce the paging overhead in the kernel, which is great if you will be mapping gigabyte sized files.
MAP_NORESERVE
Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available.
This will prevent you from running out of memory while keeping your implementation simple if you don't actually have enough physical memory + swap for the entire mapping.
MAP_POPULATE
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This may give you speed-ups with sufficient hardware resources, and if the prefetching is ordered and lazy. I suspect this flag is redundant; the VFS likely does this better by default.
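As a rough sketch of how these flags are passed: the function name map_whole_file is made up, and the flags are guarded with #ifdef because MAP_NORESERVE and MAP_POPULATE are not available everywhere. MAP_HUGETLB is left out on purpose, since as far as I know it only works for anonymous or hugetlbfs-backed mappings, not for ordinary files.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map `path` read-only. On success returns the mapping and stores its
 * length in *len. Returns NULL on failure or for an empty file.
 * Release the mapping with munmap(). */
const unsigned char *map_whole_file(const char *path, size_t *len)
{
    assert(path != NULL && len != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    int flags = MAP_PRIVATE;
#ifdef MAP_NORESERVE
    flags |= MAP_NORESERVE;   /* do not reserve swap space for the mapping */
#endif
#ifdef MAP_POPULATE
    flags |= MAP_POPULATE;    /* Linux: prefault pages, triggering read-ahead */
#endif

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, flags, fd, 0);
    close(fd);                /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;

    *len = (size_t)st.st_size;
    return p;
}
```

Note that MAP_POPULATE on a multi-gigabyte file makes the mmap call itself block while the pages are read in, so test with and without it.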
Perhaps using the readahead system call might help, if your program can predict in advance the file fragments it wants to read (but this is only a guess, I could be wrong).
And I think you should tune your application, and perhaps even your algorithms, to read data in chunks much bigger than a few kilobytes. Can't that be half a megabyte instead?
The problem here doesn't seem to be which api is used. It doesn't matter if you use mmap() or read(), the disc still has to seek to the specified point and read the data (although the os does help to optimize the access).
mmap() has advantages over read() if you read very small chunks (a couple of bytes) because you don't have to call the OS for every chunk, which becomes very slow.
I would also advise, like Basile did, reading more than 2 KB consecutively so the disc doesn't have to seek that often.
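A sketch of what reading in bigger chunks could look like: fetch something like half a megabyte per read() and pick the small records out of the buffer in memory. The function name and the idea of just counting bytes are placeholders; the posix_fadvise call is only a hint and is compiled in only where the constant exists.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read `path` front to back with `chunk`-byte requests. Returns the total
 * number of bytes read, or -1 on error. */
long long read_in_chunks(const char *path, size_t chunk)
{
    assert(path != NULL && chunk > 0);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

#ifdef POSIX_FADV_SEQUENTIAL
    /* Hint only: tell the kernel we will read the file sequentially. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
#endif

    unsigned char *buf = malloc(chunk);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, chunk)) > 0)
        total += n;             /* parse the small records out of buf here */

    free(buf);
    close(fd);
    return n < 0 ? -1 : total;
}
```

Calling read_in_chunks(path, 512 * 1024) turns a few hundred 2 KB requests into one system call, which also gives the disk one long sequential run per file instead of many short ones.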
Assuming the following for...
Output:
The file is opened...
Data is 'streamed' to disk. The data in memory is in a large contiguous buffer. It is written to disk in its raw form directly from that buffer. The size of the buffer is configurable, but fixed for the duration of the stream. Buffers are written to the file, one after another. No seek operations are conducted.
...the file is closed.
Input:
A large file (sequentially written as above) is read from disk from beginning to end.
Are there generally accepted guidelines for achieving the fastest possible sequential file I/O in C++?
Some possible considerations:
Guidelines for choosing the optimal buffer size
Will a portable library like boost::asio be too abstracted to expose the intricacies of a specific platform, or can they be assumed to be optimal?
Is asynchronous I/O always preferable to synchronous? What if the application is not otherwise CPU-bound?
I realize that this will have platform-specific considerations. I welcome general guidelines as well as those for particular platforms.
(my most immediate interest in Win x64, but I am interested in comments on Solaris and Linux as well)
Are there generally accepted guidelines for achieving the fastest possible sequential file I/O in C++?
Rule 0: Measure. Use all available profiling tools and get to know them. It's almost a commandment in programming that if you didn't measure it you don't know how fast it is, and for I/O this is even more true. Make sure to test under actual work conditions if you possibly can. A process that has no competition for the I/O system can be over-optimized, fine-tuned for conditions that don't exist under real loads.
Use mapped memory instead of writing to files. This isn't always faster, but it allows the opportunity to optimize the I/O in an operating system-specific but relatively portable way, by avoiding unnecessary copying and taking advantage of the OS's knowledge of how the disk is actually being used. ("Portable" if you use a wrapper, not an OS-specific API call.)
Try and linearize your output as much as possible. Having to jump around memory to find the buffers to write can have noticeable effects under optimized conditions, because cache lines, paging and other memory subsystem issues will start to matter. If you have lots of buffers look into support for scatter-gather I/O which tries to do that linearizing for you.
Some possible considerations:
Guidelines for choosing the optimal buffer size
Page size for starters, but be ready to tune from there.
Will a portable library like boost::asio be too abstracted to expose the intricacies of a specific platform, or can they be assumed to be optimal?
Don't assume it's optimal. It depends on how thoroughly the library gets exercised on your platform, and how much effort the developers put into making it fast. Having said that, a portable I/O library can be very fast, because fast abstractions exist on most systems, and it's usually possible to come up with a general API that covers a lot of the bases. Boost.Asio is, to the best of my limited knowledge, fairly finely tuned for the particular platform it is on: there's a whole family of OS and OS-variant specific APIs for fast async I/O (e.g. epoll, /dev/poll, kqueue, Windows overlapped I/O), and Asio wraps them all.
Is asynchronous I/O always preferable to synchronous? What if the application is not otherwise CPU-bound?
Asynchronous I/O isn't faster in a raw sense than synchronous I/O. What asynchronous I/O does is ensure that your code is not wasting time waiting for the I/O to complete. It is faster in a general way than the other method of not wasting that time, namely using threads, because it will call back into your code when I/O is ready and not before. There are no false starts or concerns with idle threads needing to be terminated.
General advice is to turn off buffering and read/write in large chunks, but not too large: otherwise you will waste too much time waiting for the whole I/O to complete when you could already be munching away at the first megabyte. It's trivial to find the sweet spot with this approach, since there's only one knob to turn: the chunk size.
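To turn that single knob, something like the following sketch can be run with a few chunk sizes. The function name is hypothetical, it uses the C11 timespec_get for wall-clock time, and it only measures how long the writes take to be handed to the OS, not how long they take to reach the platters.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Write `total` zero bytes to `path` in `chunk`-byte fwrite calls with
 * stdio buffering switched off. Returns elapsed wall-clock seconds,
 * or -1.0 on error. */
double time_chunked_write(const char *path, size_t total, size_t chunk)
{
    assert(path != NULL && chunk > 0);

    char *data = calloc(1, chunk);
    FILE *fp = fopen(path, "wb");
    if (data == NULL || fp == NULL) {
        if (fp != NULL)
            fclose(fp);
        free(data);
        return -1.0;
    }
    setvbuf(fp, NULL, _IONBF, 0);   /* unbuffered: each fwrite goes to the OS */

    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);

    int ok = 1;
    for (size_t done = 0; ok && done < total; ) {
        size_t n = total - done < chunk ? total - done : chunk;
        ok = fwrite(data, 1, n, fp) == n;
        done += n;
    }
    ok = (fclose(fp) == 0) && ok;

    timespec_get(&t1, TIME_UTC);
    free(data);
    if (!ok)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Run it for chunk sizes from a few KiB to a few MiB with a total well above the amount of RAM the page cache can absorb, otherwise the numbers mostly reflect memory speed.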
Beyond that, for input mmap()ing the file shared and read-only is (if not the fastest, then) the most efficient way. Call madvise() if your platform has it, to tell the kernel how you will traverse the file, so it can do readahead and throw out the pages afterwards again quickly.
For output, if you already have a buffer, consider underpinning it with a file (also with mmap()), so you don't have to copy the data in userspace.
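Underpinning an output buffer with a file might look like this on a POSIX system. The two function names are made up for the example, and it assumes the final size is known up front, because the file has to be extended to the mapping's length before it is mapped.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create `path` with room for `size` bytes and map it read/write.
 * Returns the mapping, or NULL on failure. Data stored through the
 * pointer ends up in the file without a separate write() copy. */
void *create_mapped_output(const char *path, size_t size)
{
    assert(path != NULL && size > 0);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)size) != 0) {  /* file must cover the mapping */
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                              /* mapping stays valid */
    return p == MAP_FAILED ? NULL : p;
}

/* Flush dirty pages to the file and drop the mapping. Returns 0 on success. */
int finish_mapped_output(void *p, size_t size)
{
    int rc = msync(p, size, MS_SYNC);
    if (munmap(p, size) != 0)
        rc = -1;
    return rc;
}
```

The producer then fills the returned memory directly (for example, the camera driver or decoder writes frames into it) and calls finish_mapped_output when the file is complete.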
If mmap() is not to your liking, then there's posix_fadvise(), and, for the really tough ones, async file I/O.
(All of the above is POSIX, Windows names may be different).
For Windows, you'll want to make sure you use the FILE_FLAG_SEQUENTIAL_SCAN in your CreateFile() call, if you opt to use the platform specific Windows API call. This will optimize caching for the I/O. As far as buffer sizes go, a buffer size that is a multiple of the disk sector size is typically advised. 8K is a nice starting point with little to be gained from going larger.
This article discusses the comparison between async and sync on Windows.
http://msdn.microsoft.com/en-us/library/aa365683(VS.85).aspx
As you noted above it all depends on the machine / system / libraries that you are using. A fast solution on one system may be slow on another.
A general guideline, though, would be to write in chunks that are as large as possible. Typically, writing a byte at a time is the slowest.
The best way to know for sure is to code a few different ways and profile them.
You asked about C++, but it sounds like you're past that and ready to get a little platform-specific.
On Windows, FILE_FLAG_SEQUENTIAL_SCAN with a file mapping is probably the fastest way. In fact, your process can exit before the file actually makes it on to the disk. Without an explicitly-blocking flush operation, it can take up to 5 minutes for Windows to begin writing those pages.
You need to be careful if the files are not on local devices but a network drive. Network errors will show up as SEH errors, which you will need to be prepared to handle.
On *nixes, you might get a bit higher performance writing sequentially to a raw disk device. This is possible on Windows too, but not as well supported by the APIs. This will avoid a little filesystem overhead, but it may not amount to enough to be useful.
Loosely speaking, RAM is 1000 or more times faster than disks, and CPU is faster still. There are probably not a lot of logical optimizations that will help, except avoiding movements of the disk heads (seek) whenever possible. A dedicated disk just for this file can help significantly here.
You will get the absolute fastest performance by using CreateFile and ReadFile. Open the file with FILE_FLAG_SEQUENTIAL_SCAN.
Read with a buffer size that is a power of two. Only benchmarking can determine this number. I have seen it to be 8K once. Another time I found it to be 8M! This varies wildly.
It depends on the size of the CPU cache, on the efficiency of OS read-ahead and on the overhead associated with doing many small writes.
Memory mapping is not the fastest way. It has more overhead because you can't control the block size and the OS needs to fault in all pages.
On Linux, buffered reads and writes speed things up a lot, increasingly so with larger buffer sizes, but the returns are diminishing, and you generally want to use BUFSIZ (defined by stdio.h), as larger buffer sizes won't help much.
mmapping provides the fastest access to files, but the mmap call itself is rather expensive. For small files (16KiB) read and write system calls win (see https://stackoverflow.com/a/39196499/1084774 for the numbers on reading through read and mmap).