I'm writing streams of images to a hard disk using std::fstream. Since most hard disk drives have a 32MB cache, is it more efficient to create a buffer to accumulate image data up to 32MB and then write to disk, or is it as efficient to just write every image onto the disk?
The cache is used as a read/write cache to alleviate problems due to queuing. Here are my experiences with disks:
If the disk is not an SSD, then it's better to write sequentially than to seek between files. Seeking is a killer for I/O performance.
Disks typically write in sector-sized units. Sector sizes are usually 512 bytes or 4 KB (newer disks). Try to write data in whole sectors at a time.
Bunching I/O is always faster than multiple small I/Os. The simple reason is that the processor on the disk has a smaller queue to flush.
Whatever you can serve from memory, serve. Use the disk only if necessary. You can always do a modify/invalidate of a cache entry on write, depending on your reliability policy. Make sure you don't swap, so your memory cache size must be reasonable to begin with.
If you're doing this I/O management yourself, make sure you don't double-buffer with your OS page cache. Use O_DIRECT for this.
Use non-blocking I/O (O_NONBLOCK) if reliability isn't an issue.
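To make the bunching point concrete, here is a minimal sketch, assuming the images arrive as raw byte blocks; the 32 MB threshold and file name are assumptions, not a measured optimum:

#include <fstream>
#include <vector>

// Hypothetical sketch: accumulate images in RAM and write them in one large
// chunk instead of issuing one small write per image.
constexpr std::size_t kFlushThreshold = 32 * 1024 * 1024;   // 32 MB, an assumption

std::ofstream out("images.bin", std::ios::binary);
std::vector<char> pending;

void addImage(const char* data, std::size_t size) {
    pending.insert(pending.end(), data, data + size);
    if (pending.size() >= kFlushThreshold) {
        out.write(pending.data(), static_cast<std::streamsize>(pending.size()));
        pending.clear();
    }
    // Remember to write out whatever remains in `pending` before closing.
}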
Every part of your system, from fstream down to the disk driver, knows more about efficient I/O at its level than your application even has access to.
You couldn't improve upon the various buffering schemes if you tried, so don't bother.
Related
I'm trying to pack and compress game client resource data using zlib. Compressing the data reduces disk I/O because of the smaller file size, but it increases CPU usage when decompressing.
Question 1
If a resource used for rendering is compressed, processing it (rendering and decompressing) uses CPU, so I think it would be rather slow. Is that right?
With no compression, disk I/O is unchanged and there is no additional CPU usage. And if you read only a portion of the file, disk I/O can be reduced by using the CreateFileMapping() and MapViewOfFile() functions.
Question 2
In the case of a resource such as an uncompressed image (for example TGA, not PNG), when we have to read the whole file we can't take advantage of CreateFileMapping()/MapViewOfFile(), so I think compressing the resource is better. What do you think?
Question 3
What do you think about compressing resource data when packing?
Resources for games are not only packed to reduce size, but also to reduce the number of seeks by collapsing many small files into one, which matters a lot more than the size on disk. A single unnecessary seek on a conventional hard disk costs as much time as sequentially reading roughly a megabyte of data. Even if your "compression" consists of only concatenating small files together, you already gain performance.
As a small bonus, having resources packed in an archive somewhat obscures them from computer-unsavvy people, deterring them from modifying game assets (though admittedly, this is not a very big hurdle!).
Q1: Depending on what compression algorithm you use, you can easily get upwards of 1 GB/s decompression (close to 2 GB/s with a fast CPU). Sequential disk I/O is still around 300-400 MB/s maximum even on solid state (and usually less). Random access disk I/O is 5-20 times slower, depending on the disk and the access pattern.
On the other hand, you can get as little as a few dozen kilobytes per second in decompression speed if you choose a slow algorithm, which is much worse than just loading more data from disk. The secret is to choose an algorithm that compresses reasonably well (not perfectly, just reasonably) and runs at good decompression speed. Compression speed usually does not matter, since this is done offline once. Candidate algorithms are for example LZF, Snappy, or LZ4.
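To make the decompression side concrete, a minimal sketch assuming LZ4 (the same idea applies to LZF or Snappy); the function and parameter names are hypothetical, with the sizes coming from your archive index:

#include <lz4.h>
#include <vector>

// Hypothetical sketch: decompress one packed resource into a preallocated buffer.
std::vector<char> unpackResource(const char* packed, int packedSize, int rawSize) {
    std::vector<char> raw(rawSize);
    int n = LZ4_decompress_safe(packed, raw.data(), packedSize, rawSize);
    if (n < 0)
        raw.clear();                 // corrupt or truncated input
    return raw;
}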
File mapping can generally be used regardless of whether the contents are compressed. Also, file mapping is not only advantageous for very small portions; on the contrary, the larger your reads, the more advantageous it becomes (very small views may actually be faster using conventional reads).
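For reference, a minimal sketch of the Windows file-mapping calls mentioned in the question, mapping a whole file read-only; the helper name and error handling are illustrative:

#include <windows.h>

// Hypothetical sketch: map an entire resource file read-only and return a view.
const char* mapResourceFile(const char* path, HANDLE* fileOut, HANDLE* mapOut) {
    *fileOut = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    *mapOut = CreateFileMappingA(*fileOut, NULL, PAGE_READONLY, 0, 0, NULL);
    return (const char*)MapViewOfFile(*mapOut, FILE_MAP_READ, 0, 0, 0);
    // Later: UnmapViewOfFile(view); CloseHandle(*mapOut); CloseHandle(*fileOut);
}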
Q2: Uncompressed images do not normally occur in a game. Most of the time you will want to use DXT compression, not so much to reduce disk I/O but to reduce memory and PCIe bandwidth requirements and GPU memory consumption. DXT is a rather weak compression scheme, but it works in hardware and has an exactly predictable compression ratio. You can compress DXT-compressed textures again with a conventional general-purpose compressor, with varying rates depending on which compressor you use (some are especially optimized for that purpose).
Q3: Packing resources is definitely advisable for any non-trivial game.
I have an image stream coming in from a camera at about 100 frames/second, with each image being about 2 MB. Now just because of the disk write speed I know I can't write each frame, so I'm only trying to save about a third of those frames each second.
The stream is a circular buffer of large char arrays. Right now I'm using fwrite to dump each array to a temporary file as it gets buffered, but it only seems to be writing at about 20-30 MB/s, while the hard drive should theoretically reach 80-100 MB/s.
Any thoughts? Is there a faster way to write than fwrite() or a way to optimize it?
More generally what is the fastest way to dump large amounts of a data to a standard hard drive?
What if you use memory-mapped files limited to, for example, 1 GB each? This should provide enough speed and buffering to work with all the frames, especially if you manage to perform zero-copy frame allocation.
fwrite is buffered, which is what you want. Though with files and writes that big it shouldn't make much, if any, difference. Maybe experiment with a larger stream buffer via the setvbuf call.
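A minimal sketch of that, assuming a plain C stream; the buffer size and file name are assumptions:

#include <cstdio>

// Give stdio a larger buffer so each physical write is bigger.
int main() {
    std::FILE* f = std::fopen("frames.tmp", "wb");
    static char buf[1 << 22];                      // 4 MB
    std::setvbuf(f, buf, _IOFBF, sizeof buf);      // must be called before the first I/O
    // std::fwrite(frame, 1, frameSize, f);        // then fwrite frames as usual
    std::fclose(f);
}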
Since you are limited by physical disk I/O speed, as long as you make it as easy as possible for the system to use each available disk I/O efficiently, there's not much more you can do.
vmstat on Linux (or similar tools on other systems) can tell you how many disk I/Os your disk is doing, so you can test whether your changes help anything.
Asynchronous, non-buffered output is the key to success in your case. Buffered I/O will only cause double-buffering overhead, and synchronous I/O will make the HDD heads miss sequential sectors.
Boost.Asio provides a relatively good encapsulation of system-specific APIs for popular platforms.
There are a few things to remember:
On most non-Windows platforms you will have to write to raw partitions to get the system's buffering and internal threading out of the way.
Keep the write queue non-empty all the time, so the SATA controller can help you by means of NCQ.
Pay attention to the system-specific requirements on buffer alignment and size for async non-buffered I/O to work (see the sketch after this list).
The file open mode is also important to make the system do what you want.
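A minimal POSIX sketch of the alignment and open-mode points, assuming Linux, a 4 KB sector size and a 1 MB write chunk (path is a placeholder, error handling omitted):

#define _GNU_SOURCE                              // Linux: needed for O_DIRECT
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>

int main() {
    int fd = open("/data/output.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    void* buf = nullptr;
    const size_t chunk = 1 << 20;                // 1 MB, a multiple of the 4 KB sector
    posix_memalign(&buf, 4096, chunk);           // O_DIRECT requires aligned memory
    // fill buf with frame data, then:
    write(fd, buf, chunk);                       // length must also be sector-aligned
    close(fd);
    std::free(buf);
}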
I have a Linux application that reads 150-200 files (4-10GB) in parallel. Each file is read in turn in small, variably sized blocks, typically less than 2K each.
I currently need to maintain over 200 MB/s read rate combined from the set of files. The disks handle this just fine. There is a projected requirement of over 1 GB/s (which is out of the disk's reach at the moment).
We have implemented two different read systems, both of which make heavy use of posix_fadvise. The first is an mmap-based read in which we map the entirety of the data set and read on demand.
The second is a read()/seek() based system.
Both work well, but only for the moderate cases: the read() method manages our overall file cache much better and can deal well with hundreds of GB of files, but is badly rate-limited; mmap is able to pre-cache data, making a sustained data rate of over 200 MB/s easy to maintain, but it cannot deal with large total data set sizes.
So my question comes to these:
A: Can read()-type file I/O be further optimized beyond the posix_fadvise calls on Linux, or, having tuned the disk scheduler, VMM and posix_fadvise calls, is that as good as we can expect?
B: Are there systematic ways for mmap to better deal with very large mapped data?
Mmap-vs-reading-blocks
is a similar problem to the one I am working on, and it provided a good starting point on this problem, along with the discussions in mmap-vs-read.
Reads back to what? What is the final destination of this data?
Since it sounds like you are completely IO bound, mmap and read should make no difference. The interesting part is in how you get the data to your receiver.
Assuming you're putting this data to a pipe, I recommend you just dump the contents of each file in its entirety into the pipe. To do this using zero-copy, try the splice system call. You might also try copying the file manually, or forking an instance of cat or some other tool that can buffer heavily with the current file as stdin, and the pipe as stdout.
pid_t pid;
if ((pid = fork())) {
    waitpid(pid, NULL, 0);              /* parent: wait for the child to finish */
} else {
    dup2(dest, 1);                      /* the pipe becomes the child's stdout  */
    dup2(source, 0);                    /* the current file becomes its stdin   */
    execlp("cat", "cat", (char *)NULL); /* argument list must be NULL-terminated */
    _exit(127);                         /* only reached if exec fails           */
}
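For the zero-copy route mentioned above, a minimal splice sketch (Linux-specific; one end must be a pipe, and the descriptor names and byte count are placeholders):

#define _GNU_SOURCE
#include <fcntl.h>

/* Move up to len bytes from the current file into the pipe without copying
   through userspace; returns the number of bytes moved or -1 on error. */
ssize_t spliceIntoPipe(int file_fd, int pipe_write_fd, size_t len) {
    return splice(file_fd, NULL, pipe_write_fd, NULL, len,
                  SPLICE_F_MOVE | SPLICE_F_MORE);
}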
Update0
If your processing is file-agnostic, and doesn't require random access, you want to create a pipeline using the options outlined above. Your processing step should accept data from stdin, or a pipe.
To answer your more specific questions:
A: Can read()-type file I/O be further optimized beyond the posix_fadvise calls on Linux, or, having tuned the disk scheduler, VMM and posix_fadvise calls, is that as good as we can expect?
That's as good as it gets with regard to telling the kernel what to do from userspace. The rest is up to you: buffering, threading, etc., but it's dangerous and probably unproductive guesswork. I'd just go with splicing the files into a pipe.
B: Are there systematic ways for mmap to better deal with very large mapped data?
Yes. The following options may give you awesome performance benefits (and may make mmap worth using over read, with testing); a combined sketch follows the list:
MAP_HUGETLB
Allocate the mapping using "huge pages."
This will reduce the paging overhead in the kernel, which is great if you will be mapping gigabyte sized files.
MAP_NORESERVE
Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available.
This will prevent you from running out of memory while keeping your implementation simple if you don't actually have enough physical memory plus swap for the entire mapping.
MAP_POPULATE
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This may give you speed-ups with sufficient hardware resources, and if the prefetching is ordered and lazy. I suspect this flag is redundant; the VFS likely does this better by default.
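A combined sketch of these flags, assuming a read-only mapping of a single file; MAP_HUGETLB is commented out because it normally requires hugetlbfs backing:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only with the flags discussed above. */
const char *mapWholeFile(const char *path, size_t *lenOut) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    fstat(fd, &st);
    void *p = mmap(NULL, st.st_size, PROT_READ,
                   MAP_PRIVATE | MAP_NORESERVE | MAP_POPULATE /* | MAP_HUGETLB */,
                   fd, 0);
    close(fd);                          /* the mapping stays valid after close */
    if (p == MAP_FAILED) return NULL;
    *lenOut = (size_t)st.st_size;
    return (const char *)p;
}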
Perhaps using the readahead system call might help, if your program can predict in advance the file fragments it wants to read (but this is only a guess, I could be wrong).
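If that applies, a minimal readahead sketch (Linux-specific; the predicted offset and length are whatever your access pattern supplies):

#define _GNU_SOURCE
#include <fcntl.h>

/* Ask the kernel to start fetching the next predicted fragment into the
   page cache while the current one is being processed. */
void prefetchFragment(int fd, off64_t next_off, size_t next_len) {
    readahead(fd, next_off, next_len);  /* non-blocking hint */
}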
And I think you should tune your application, and perhaps even your algorithms, to read data in chunks much bigger than a few kilobytes. Couldn't that be half a megabyte instead?
The problem here doesn't seem to be which API is used. It doesn't matter whether you use mmap() or read(); the disk still has to seek to the specified point and read the data (although the OS does help to optimize the access).
mmap() has advantages over read() if you read very small chunks (a couple of bytes) because you don't have to call the OS for every chunk, which becomes very slow.
I would also advise, like Basile did, reading more than 2 KB consecutively so the disk doesn't have to seek that often.
I have a C++ program which reads files from the hard disk and does some processing on the data in the files. I am using standard Win32 APIs to read the files. My problem is that this program is blazingly fast sometimes and then suddenly slows down to 1/6th of the previous speed. If I read the same files again and again over multiple runs, then normally the first run will be the slowest one. Then it maintains the speed until I read some other set of files. So my obvious guess was to profile the disk access time. I used the perfmon utility and measured the IO Read Bytes/sec for my program. And as expected there was a huge difference (~5 times) in the number of bytes read. My questions are:
(1). Does OS (Windows in my case) cache the recently read files somewhere so that the subsequent loads are faster?
(2). If I can guarantee that all the files I read reside in the same directory then is there any way I can place them in the hard disk so that my disk access time is faster?
Is there anything I can do for this?
1) Windows does cache recently read files in memory. The book Windows Internals includes an excellent description of how this works. Modern versions of Windows also use a technology called SuperFetch which will try to preemptively fetch disk contents into memory based on usage history and ReadyBoost which can cache to a flash drive, which allows faster random access. All of these will increase the speed with which data is accessed from disk after the initial run.
2) Directory really doesn't affect layout on disk. Defragmenting your drive will group file data together. Windows Vista on up will automatically defragment your disk. Ideally, you want to do large sequential reads and minimize your writes. Small random accesses and interleaving writes with reads significantly hurts performance. You can use the Windows Performance Toolkit to profile your disk access.
Your numbered questions seem to be answered already. If you're still wondering what you can do to improve hard drive read speed, here are some tips:
Read with the OS functions (e.g., ReadFile) rather than wrapper libraries (like iostreams or stdio) if possible. Many wrappers introduce more levels of buffering.
Read sequentially, and let Windows know you're going to read sequentially with the FILE_FLAG_SEQUENTIAL_SCAN flag (see the sketch after this list).
If you're only going to read (and not write), be sure to open the file just for reading.
Read in chunks, not bytes or characters.
Ideally the chunks should be multiples of the disk's cluster size.
Read from the disc at cluster-aligned offsets.
Read to memory at page-boundaries. (If you're allocating a big chunk, it's probably page aligned.)
Advanced: If you can start your computation after reading just the beginning of the file, then you can use overlapped I/O to try to parallelize the computation and the subsequent reads as much as possible.
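A minimal sketch tying the list together: a sequential, hinted, chunked read. The file name and chunk size are assumptions, and error handling is omitted:

#include <windows.h>
#include <vector>

void readWholeFileSequentially(const char* path) {
    HANDLE h = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    const DWORD chunk = 1 << 20;               // 1 MB, a multiple of the cluster size
    std::vector<char> buf(chunk);
    DWORD got = 0;
    while (ReadFile(h, buf.data(), chunk, &got, NULL) && got > 0) {
        // process buf[0..got) here
    }
    CloseHandle(h);
}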
Yes, Windows (and most modern OS's) keep recently read file data in otherwise unused RAM so that if that file data is requested again in the near future it will already be available in RAM and disk access can be avoided.
As far as making disk access faster, you could try defragmenting your drive, but I wouldn't expect it to help too much. Drive access is just slow compared to RAM access, which is why RAM caching provides such a nice speedup.
As a diagnostic test, can you accurately measure the time it takes to load the very first time?
Then take that to determine the transfer rate. Then you can take that transfer rate and compare that to what you get when running HD Tune. For what it's worth, I ran this myself and got 44.2 MB/s minimum, 87 MB/s average, 110 MB/s max read speeds with my Western Digital RE3 drive (one of the faster 7200 RPM SATA drives available).
The point of all this is to see if your own application is doing the best it can. In other words, aside from caching you can't really read the files any faster than what your hard drive is capable of. So if you're reaching that limit then there's nothing more to do.
Also, make sure that you are not running out of memory during your tests. Run perfmon and monitor Memory > Available Bytes and PhysicalDisk > Disk Read Bytes/sec for the physical drive you are reading. Monitoring process' I/O is a good idea too. Keep in mind that the latter combines all I/O (network included).
You should expect 50 MB/s for sequential reads from a single average SATA drive. A couple of good striped serial SCSI drives will give you about 220 MB/s. If you are seeing available memory going to near zero, that would be your problem. If it stays flat after you did the first round of reading, then it has something to do with your app.
A Microsoft utility called contig can be used to defragment a single file on disk or to create a new unfragmented file.
For the crazy answer, you could try formatting the drive such that you place your info on the fastest portion, and see if that helps any.
Tom's Hardware had a review on how that might be done.
My problem is this: I have a C/C++ app that runs under Linux, and this app receives a constant-rate, high-bandwidth (~27 MB/sec) stream of data that it needs to stream to a file (or files). The computer it runs on is a quad-core 2GHz Xeon running Linux. The filesystem is ext4, and the disk is a solid state E-SATA drive which should be plenty fast for this purpose.
The problem is Linux's too-clever buffering behavior. Specifically, instead of writing the data to disk immediately, or soon after I call write(), Linux will store the "written" data in RAM, and then at some later time (I suspect when the 2GB of RAM starts to get full) it will suddenly try to write out several hundred megabytes of cached data to the disk, all at once. The problem is that this cache-flush is large, and holds off the data-acquisition code for a significant period of time, causing some of the current incoming data to be lost.
My question is: is there any reasonable way to "tune" Linux's caching behavior, so that either it doesn't cache the outgoing data at all, or if it must cache, it caches only a smaller amount at a time, thus smoothing out the bandwidth usage of the drive and improving the performance of the code?
I'm aware of O_DIRECT, and will use it if I have to, but it does place some behavioral restrictions on the program (e.g. buffers must be aligned to and a multiple of the disk sector size, etc.) that I'd rather avoid if I can.
You can use the posix_fadvise() with the POSIX_FADV_DONTNEED advice (possibly combined with calls to fdatasync()) to make the system flush the data and evict it from the cache.
See this article for a practical example.
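A minimal sketch of that flush-and-evict pattern, assuming you write in fixed chunks and track their offsets; the helper name is hypothetical:

#include <fcntl.h>
#include <unistd.h>

/* After writing each chunk, force it to disk and drop the cached pages so
   the kernel never builds up a huge dirty backlog. */
void writeChunkAndDrop(int fd, const char *buf, size_t len, off_t chunk_start) {
    write(fd, buf, len);
    fdatasync(fd);                                             /* push the data out now    */
    posix_fadvise(fd, chunk_start, len, POSIX_FADV_DONTNEED);  /* evict it from the cache  */
}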
If you have latency requirements that the OS cache can't meet on its own (the default IO scheduler is usually optimized for bandwidth, not latency), you are probably going to have to manage your own memory buffering. Are you writing out the incoming data immediately? If you are, I'd suggest dropping that architecture and going with something like a ring buffer, where one thread (or multiplexed I/O handler) is writing from one side of the buffer while the reads are being copied into the other side.
At some size, this will be large enough to handle the latency required by a pessimal OS cache flush. Or not, in which case you're actually bandwidth limited and no amount of software tuning will help you until you get faster storage.
You can adjust the page cache settings in /proc/sys/vm (see /proc/sys/vm/dirty_ratio and /proc/sys/vm/swappiness specifically) to tune the page cache to your liking.
If we are talking about std::fstream (or any C++ stream object)
You can specify your own buffer using:
streambuf* ios::rdbuf(streambuf* streambuffer);
By defining your own buffer you can customize the behavior of the stream.
Alternatively you can always flush the buffer manually at pre-set intervals.
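A minimal sketch, assuming the common pubsetbuf route rather than a hand-written streambuf; the buffer size and file name are assumptions, and pubsetbuf behavior is implementation-defined:

#include <fstream>

int main() {
    static char bigBuf[1 << 22];                    // 4 MB
    std::ofstream out;
    out.rdbuf()->pubsetbuf(bigBuf, sizeof bigBuf);  // must precede open() on most implementations
    out.open("images.bin", std::ios::binary);
    // out.write(imageData, imageSize);             // write images as they arrive
    out.flush();                                    // or flush manually at preset intervals
}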
Note: there is a reason for having a buffer. It is quicker than writing to the disk directly (every 10 bytes). There is very little reason to write to a disk in chunks smaller than the disk block size. If you write too frequently, the disk controller will become your bottleneck.
But I do have an issue with you using the same thread for the write process, making it block the read processes.
While the data is being written there is no reason why another thread cannot continue to read data from your stream (you may need to do some fancy footwork to make sure they are reading/writing to different areas of the buffer). But I don't see any real potential issue with this, as the I/O system will go off and do its work asynchronously (potentially stalling your write thread, depending on your use of the I/O system, but not necessarily your application).
I know this question is old, but we know a few things now we didn't know when this question was first asked.
Part of the problem is that the default values for /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio are not appropriate for newer machines with lots of memory. Linux begins the flush when dirty_background_ratio is reached, and blocks all I/O when dirty_ratio is reached. Lower dirty_background_ratio to start flushing sooner, and raise dirty_ratio to start blocking I/O later. On very large memory systems, (32GB or more) you may even want to use dirty_bytes and dirty_background_bytes, since the minimum increment of 1% for the _ratio settings is too coarse. Read https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ for a more detailed explanation.
Also, if you know you won't need to read the data again, call posix_fadvise with FADV_DONTNEED to ensure cache pages can be reused sooner. This has to be done after Linux has flushed the page to disk, otherwise the flush will move the page back to the active list (effectively negating the effect of fadvise).
To ensure you can still read incoming data in the cases where Linux does block on the call to write(), do file writing in a different thread than the one where you are reading.
Well, try this ten-pound-hammer solution that might prove useful to see if I/O system caching contributes to the problem: every 100 MB or so, call sync().
You could use a multithreaded approach: have one thread simply read data packets and add them to a FIFO, and another thread remove packets from the FIFO and write them to disk. This way, even if the write to disk stalls, the program can continue to read incoming data and buffer it in RAM.
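A minimal sketch of that FIFO idea, assuming frames arrive as byte vectors; the names are illustrative, and a real version would bound the queue and handle errors:

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <vector>

std::queue<std::vector<char>> fifo;
std::mutex m;
std::condition_variable cv;
bool done = false;

// Run this on its own thread; it drains the FIFO and writes to disk.
void writerThread(std::FILE* f) {
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !fifo.empty() || done; });
        if (fifo.empty() && done) break;
        std::vector<char> frame = std::move(fifo.front());
        fifo.pop();
        lk.unlock();                               // write without holding the lock
        std::fwrite(frame.data(), 1, frame.size(), f);
    }
}

// Called by the acquisition thread for each incoming packet.
void onFrame(std::vector<char> frame) {
    { std::lock_guard<std::mutex> lk(m); fifo.push(std::move(frame)); }
    cv.notify_one();
}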