Idea/Fact #1
I was reading a few posts about how streams are buffered, so fwrite() usually writes to a buffered stream. write(), on the other hand, is not buffered.
Why the fwrite libc function is faster than the syscall write function?
Idea/Fact #2
I was also reading an article about disk caching and how Linux uses it heavily to improve disk performance substantially.
http://www.linuxatemyram.com/play.html
So given the disk buffering that Linux does by default, shouldn't fwrite() and write() deliver the same performance? What fwrite() is doing is "buffering over an already buffered disk", which should not give a huge boost. What am I missing here?
fwrite buffering and disk caching work on two very different levels.
fwrite works on the program level: it buffers numerous small writes and pools them together to make one system call, rather than an individual system call for each small write. This saves you the repeated overhead of switching from user mode to kernel mode and back.
Disk caching works on the kernel level, by pooling disk writes and allowing them to be delayed. Hard disks can be slow, so if you had to wait for all the data to be consumed by the disk driver, your program would be delayed. By writing to the cache, which is generally much faster than the drive, you can complete the write much sooner and return to the program. While the program continues running, the cache is gradually emptied onto the disk, without the program having to wait for it.
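To make the program-level half of this concrete, here is a minimal sketch (not a benchmark) that writes the same small records two ways on a POSIX system. The function names and paths are made up for illustration. The first version pays one user/kernel transition per record; the second lets stdio collect the records in a user-space buffer, so it typically issues a write() only when that buffer fills. Both end up in the same kernel page cache afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: writes `count` records of `size` bytes with one
// write() system call per record. Returns the number of system calls
// issued, or -1 on error.
long write_unbuffered(const char* path, const char* record,
                      std::size_t size, long count) {
    assert(size > 0 && count >= 0);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    long calls = 0;
    for (long i = 0; i < count; ++i) {
        if (write(fd, record, size) != static_cast<ssize_t>(size)) {
            close(fd);
            return -1;
        }
        ++calls;  // one user/kernel round trip per record
    }
    close(fd);
    return calls;
}

// Hypothetical helper: same data through fwrite(). stdio copies each record
// into its own buffer and only calls into the kernel when the buffer is full
// or the stream is closed.
bool write_buffered(const char* path, const char* record,
                    std::size_t size, long count) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    bool ok = true;
    for (long i = 0; i < count && ok; ++i)
        ok = std::fwrite(record, 1, size, f) == size;
    return (std::fclose(f) == 0) && ok;  // fclose flushes what is left
}
```

Running both under strace (or timing them with many small records) should show the difference in system call counts; the resulting files are identical.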
Related
I have an image stream coming in from a camera at about 100 frames/second, with each image being about 2 MB. Now just because of the disk write speed I know I can't write each frame, so I'm only trying to save about a third of those frames each second.
The stream is a circular buffer of large char arrays. And right now I'm using fwrite to dump each array to a temporary file as it gets buffered, but it only seems to be writing at about 20-30 MB/s while the hard drive should theoretically go up to 80-100 MB/s
Any thoughts? Is there a faster way to write than fwrite() or a way to optimize it?
More generally what is the fastest way to dump large amounts of a data to a standard hard drive?
What if you use memory-mapped files limited to, for example, 1 GB each? This should provide enough speed and buffer to work with all frames, especially if you manage to perform zero-copy frame allocation.
fwrite is buffered, which is what you want, though with writes that large it shouldn't make much difference, if any. Maybe experiment with a larger stream buffer using the setvbuf call (setbuf can only install a buffer of the fixed size BUFSIZ).
Since you are limited by physical disk i/o speeds, as long as you are making it as easy as possible for the system to use each available disk io efficiently there's not really more you can do.
vmstat on linux (other similar tools on other systems) can tell you how many disk i/os your disk is doing, so you can test if your changes help anything.
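For the "larger stream buffer" experiment, a sketch along these lines may help. It assumes standard C stdio and uses setvbuf, since that is the call that takes a buffer size. The function name, the frame contents and the buffer size are placeholders, not measured recommendations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: writes `frames` copies of `frame` to `path` through a
// fully buffered stdio stream whose buffer is `buf_size` bytes.
bool dump_frames(const char* path, const std::vector<char>& frame,
                 int frames, std::size_t buf_size) {
    assert(buf_size > 0);
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    std::vector<char> buf(buf_size);  // must stay alive until fclose
    // setvbuf has to be called before any other operation on the stream.
    if (std::setvbuf(f, buf.data(), _IOFBF, buf.size()) != 0) {
        std::fclose(f);
        return false;
    }
    bool ok = true;
    for (int i = 0; i < frames && ok; ++i)
        ok = std::fwrite(frame.data(), 1, frame.size(), f) == frame.size();
    return (std::fclose(f) == 0) && ok;  // flushes while buf still exists
}
```

Something like dump_frames("frames.tmp", frame, 30, 8u << 20) would try an 8 MB buffer; whether that changes throughput is exactly what vmstat should tell you.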
Asynchronous non-buffered output is the key to success in your case. Buffered IO will only cause double-buffering overhead, and sync IO will make the HDD heads miss sequential sectors.
Boost.Asio provides a relatively good encapsulation of system-specific APIs for popular platforms.
There are a few things to remember:
on most non-Windows platforms you will have to write to raw partitions to get the system's buffering and internal threading out of the way.
keep the write queue non-empty all the time, so the SATA controller can help you by means of NCQ.
pay attention to the system-specific buffer alignment and size requirements for async non-buffered IO to work.
the file open mode is also important for making the system do what you want.
Let's say I am using C++ file streams asynchronously, meaning I never use std::flush or std::endl. My application writes a lot of data to a file and then abruptly crashes.
Is the data remaining in the cache system flushed to the disk, or discarded (and lost)?
Complicating this problem is that there are multiple 'caches' in play.
C++ streams have their own internal buffering mechanism. Streams don't ask the OS to write to disk until either (a) you've sent enough data into the buffer that the streams library thinks the write wouldn't be wasted, (b) you ask for a flush specifically, or (c) the stream is in line-buffering mode and you've sent along the endl. Any data in these buffers is lost when the program crashes.
The OS will buffer writes to make best use of the limited amount of disk IO available. Writes will typically be flushed within five to thirty seconds; sooner if the programmer (or libraries) calls fdatasync(2) or fsync(2) or sync(2) (which asks for all dirty data to be flushed). Any data in the OS buffers are written to disk (eventually) when the program crashes, lost if the kernel crashes.
The hard drive will buffer writes to try to make the best use of its slow head, rotational latency, etc. Data arrives in this buffer when the OS flushes its caches. Data in these buffers are written to disk when the program crashes, will probably be written to disk if the kernel crashes, and might be written to disk if the power is suddenly removed from the drive. (Some have enough power to continue writing their buffers, typically this would take less than a second anyway.)
The stuff in the library buffer (which you flush with std::flush or such) is lost, the data in the OS kernel buffers (which you can flush e.g. with fsync()) is not lost unless the OS itself crashes.
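The two flush levels mentioned above can be sketched like this on a POSIX system. It uses C stdio rather than std::ofstream because standard C++ streams do not expose the file descriptor that fsync() needs. The function name is made up for the example.

```cpp
#include <cassert>
#include <stdio.h>   // <stdio.h> rather than <cstdio> for POSIX fileno()
#include <unistd.h>  // fsync()

// Hypothetical helper: writes `text` and pushes it through both
// software cache layers before returning.
bool write_durably(const char* path, const char* text) {
    assert(path != nullptr && text != nullptr);
    FILE* f = fopen(path, "w");
    if (!f) return false;
    bool ok = fputs(text, f) >= 0;      // data sits in the library buffer:
                                        // lost if the program crashes here
    ok = ok && fflush(f) == 0;          // library buffer -> kernel page cache:
                                        // now survives a program crash
    ok = ok && fsync(fileno(f)) == 0;   // page cache -> drive:
                                        // now survives a kernel crash
    return (fclose(f) == 0) && ok;
}
```

What the drive does with its own write cache after fsync() returns depends on the drive and the filesystem's barrier settings, so the last step is "as durable as the OS can make it" rather than an absolute guarantee.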
Currently I am working on an MFC application which reads from and writes to the disk. Sometimes this application runs amazingly fast and sometimes it is damn slow. I am guessing that it is because of the disk access involved, hence I want to profile it. These are some questions in this regard:
(1). Currently I am using AQTime profiler to profile the application. Has anybody tried profiling disk access using this? Or is there any other tool available which I can use?
(2). What are the most important disk parameters I should be looking at?
(3). If I have multiple threads trying to read and write the data from disk does it affect the performance? i.e. am I better off having a single threaded access to the disk?
You can use the Windows Performance Toolkit for this. You can enable trace providers for disk I/O events and see the I/O time and disk service time for each. It does have a bit of a learning curve though. This will also let you determine which file I/O's actually result in real-access to the disk and aren't handled by the cache manager.
Most important parameters are disk service time and queue length. Disk service time is how long the disk actually took to service the request. Queue length indicates if your disk request is backed up behind other requests.
For many threads w/ reads & writes - Many disks have poor performance in the face of reads with background writes. If you have various threads doing lots of disk I/O to random locations on the disk, you may wind up starving certain requests.
To help you with (2):
Try to batch up your writes to disk to avoid many small calls to write. When you're done flushing your buffer, call commit. commit (aka fsync) is an expensive operation, and the cost adds up quickly when it is called after lots of small writes.
On Windows file handles you can experiment with FILE_FLAG_WRITE_THROUGH to increase write speeds. Supposedly commit doesn't have to be called with handles using this flag.
If data you are writing to disk will also be accessed through reading, consider writing to an in memory structure first, having another thread read from the structure to write it to disk. This will help avoid calls to read data from disk that you have just written.
Hopefully this helps....
What I would do is, if you can't pause all threads at the same time and examine their state, focus on one of them and pause that, while it's being "damn slow". This is a little known but effective technique.
Since it is being extremely slow compared to what it could be, whatever it is waiting for it is waiting for probably 99% of the time, so when you pause it you will see it. That's true whether it's one big wait, or a zillion little ones. Look at the whole call stack. The culprit may be somewhere in the middle of the stack.
If you're not sure, pause it two or three times. The culprit will be on all stack samples.
My problem is this: I have a C/C++ app that runs under Linux, and this app receives a constant-rate high-bandwidth (~27MB/sec) stream of data that it needs to stream to a file (or files). The computer it runs on is a quad-core 2GHz Xeon running Linux. The filesystem is ext4, and the disk is a solid state E-SATA drive which should be plenty fast for this purpose.
The problem is Linux's too-clever buffering behavior. Specifically, instead of writing the data to disk immediately, or soon after I call write(), Linux will store the "written" data in RAM, and then at some later time (I suspect when the 2GB of RAM starts to get full) it will suddenly try to write out several hundred megabytes of cached data to the disk, all at once. The problem is that this cache-flush is large, and holds off the data-acquisition code for a significant period of time, causing some of the current incoming data to be lost.
My question is: is there any reasonable way to "tune" Linux's caching behavior, so that either it doesn't cache the outgoing data at all, or if it must cache, it caches only a smaller amount at a time, thus smoothing out the bandwidth usage of the drive and improving the performance of the code?
I'm aware of O_DIRECT, and will use it if I have to, but it does place some behavioral restrictions on the program (e.g. buffers must be aligned and a multiple of the disk sector size, etc.) that I'd rather avoid if I can.
You can use the posix_fadvise() with the POSIX_FADV_DONTNEED advice (possibly combined with calls to fdatasync()) to make the system flush the data and evict it from the cache.
See this article for a practical example.
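In case the linked article is unavailable, the idea can be sketched roughly as follows. This is not taken from that article; the function name is invented, and how often to call it (per chunk, every few megabytes) is something to tune by measurement. The order matters: dirty pages cannot be dropped, so the data is synced first and only then marked as not needed.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>   // posix_fadvise(), POSIX_FADV_DONTNEED
#include <unistd.h>  // write(), fdatasync()

// Hypothetical helper: writes `len` bytes to `fd`, forces them to disk,
// then asks the kernel to drop the file's pages from the page cache.
bool write_and_evict(int fd, const char* data, std::size_t len) {
    assert(fd >= 0);
    std::size_t done = 0;
    while (done < len) {                      // write() may be partial
        ssize_t n = write(fd, data + done, len - done);
        if (n <= 0) return false;
        done += static_cast<std::size_t>(n);
    }
    if (fdatasync(fd) != 0) return false;     // flush dirty pages first
    // offset 0 with length 0 means "from the start to the end of the file".
    // posix_fadvise returns 0 on success and an error number otherwise.
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) == 0;
}
```

Because fdatasync() blocks until the data is on disk, this trades the occasional huge flush for many small, predictable ones.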
If you have latency requirements that the OS cache can't meet on its own (the default IO scheduler is usually optimized for bandwidth, not latency), you are probably going to have to manage your own memory buffering. Are you writing out the incoming data immediately? If you are, I'd suggest dropping that architecture and going with something like a ring buffer, where one thread (or multiplexed I/O handler) is writing from one side of the buffer while the reads are being copied into the other side.
At some size, this will be large enough to handle the latency required by a pessimal OS cache flush. Or not, in which case you're actually bandwidth limited and no amount of software tuning will help you until you get faster storage.
You can adjust the page cache settings in /proc/sys/vm, (see /proc/sys/vm/dirty_ratio, /proc/sys/vm/swappiness specifically) to tune the page cache to your liking.
If we are talking about std::fstream (or any C++ stream object)
You can specify your own buffer using:
streambuf* ios::rdbuf ( streambuf* streambuffer);
By defining your own buffer you can customize the behavior of the stream.
Alternatively you can always flush the buffer manually at pre-set intervals.
Note: there is a reason for having a buffer. It is quicker than writing to a disk directly (every 10 bytes). There is very little reason to write to a disk in chunks smaller than the disk block size. If you write too frequently the disk controller will become your bottleneck.
But I have an issue with the write process blocking the read process because they share the same thread.
While the data is being written there is no reason why another thread cannot continue to read data from your stream (you may need some fancy footwork to make sure they are reading from and writing to different areas of the buffer). I don't see any real potential issue with this, as the IO system will go off and do its work asynchronously (potentially stalling your write thread, depending on your use of the IO system, but not necessarily your application).
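As one way to control the stream's buffer, here is a small sketch that hands std::ofstream a caller-owned buffer through pubsetbuf. That is a different route from swapping the whole streambuf with rdbuf, and the standard leaves the exact effect to the implementation. With common implementations it works when called before the file is opened. The function name and record format are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: writes `lines` short records through an ofstream that
// uses a caller-supplied buffer of `buf_size` bytes.
bool write_lines(const std::string& path, int lines, std::size_t buf_size) {
    assert(buf_size > 0);
    std::vector<char> buf(buf_size);  // declared first so it outlives `out`
    std::ofstream out;
    // Install the buffer before opening; doing it later is
    // implementation-defined and may be ignored.
    out.rdbuf()->pubsetbuf(buf.data(),
                           static_cast<std::streamsize>(buf.size()));
    out.open(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    for (int i = 0; i < lines; ++i)
        out << "record " << i << '\n';  // '\n', not std::endl, to avoid a
                                        // flush on every line
    out.flush();                        // one manual flush at the end
    return static_cast<bool>(out);
}
```

Flushing at fixed intervals instead of only at the end would go where the loop is, for example every N records.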
I know this question is old, but we know a few things now we didn't know when this question was first asked.
Part of the problem is that the default values for /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio are not appropriate for newer machines with lots of memory. Linux begins the flush when dirty_background_ratio is reached, and blocks all I/O when dirty_ratio is reached. Lower dirty_background_ratio to start flushing sooner, and raise dirty_ratio to start blocking I/O later. On very large memory systems, (32GB or more) you may even want to use dirty_bytes and dirty_background_bytes, since the minimum increment of 1% for the _ratio settings is too coarse. Read https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ for a more detailed explanation.
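To see why a 1% step is coarse, a bit of arithmetic helps. The sketch below is only an approximation: the kernel applies the ratio to its own notion of available memory (free plus reclaimable pages), not to installed RAM, and the function name is made up.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: approximate size in bytes of a dirty-page threshold
// expressed as a whole percentage of `available_bytes`.
std::uint64_t ratio_to_bytes(std::uint64_t available_bytes, unsigned percent) {
    assert(percent <= 100);
    return available_bytes / 100 * percent;  // divide first to avoid overflow
}
```

With 32 GiB available, ratio_to_bytes(34359738368, 1) gives 343597383 bytes, roughly 328 MiB per 1% step. At the ~27 MB/s rate from the question that is more than ten seconds of data per step, which is why dirty_background_bytes and dirty_bytes give finer control on large machines.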
Also, if you know you won't need to read the data again, call posix_fadvise with POSIX_FADV_DONTNEED to ensure cache pages can be reused sooner. This has to be done after Linux has flushed the page to disk, otherwise the flush will move the page back to the active list (effectively negating the effect of fadvise).
To ensure you can still read incoming data in the cases where Linux does block on the call to write(), do file writing in a different thread than the one where you are reading.
Well, try this ten pound hammer solution that might prove useful to see if i/o system caching contributes to the problem: every 100 MB or so, call sync().
You could use a multithreaded approach: have one thread simply read data packets and add them to a FIFO, and have the other thread remove packets from the FIFO and write them to disk. This way, even if the write to disk stalls, the program can continue to read incoming data and buffer it in RAM.
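A minimal sketch of that two-thread FIFO, assuming C++11 threads. The class name DiskWriter is invented, the queue is unbounded (a real acquisition program would cap it and decide what to drop), and error handling is reduced to an assert to keep the example short.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical class: the acquisition thread calls push(); a background
// thread drains the FIFO and does the (possibly stalling) disk writes.
class DiskWriter {
public:
    explicit DiskWriter(const char* path) : file_(std::fopen(path, "wb")) {
        assert(file_ != nullptr);  // sketch only: real code should report this
        worker_ = std::thread([this] { run(); });
    }
    ~DiskWriter() {
        { std::lock_guard<std::mutex> lock(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();            // worker drains the FIFO before exiting
        std::fclose(file_);
    }
    void push(std::vector<char> packet) {  // never touches the disk
        { std::lock_guard<std::mutex> lock(m_); fifo_.push(std::move(packet)); }
        cv_.notify_one();
    }
private:
    void run() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return done_ || !fifo_.empty(); });
            if (fifo_.empty()) return;     // done_ is set and nothing is left
            std::vector<char> packet = std::move(fifo_.front());
            fifo_.pop();
            lock.unlock();                 // do the slow write without the lock
            std::fwrite(packet.data(), 1, packet.size(), file_);
            lock.lock();
        }
    }
    std::FILE* file_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::vector<char>> fifo_;
    bool done_ = false;
    std::thread worker_;
};
```

The reader thread would construct one DiskWriter and call push() for each incoming packet. If the disk stalls, packets pile up in RAM instead of being lost, as long as the stall is shorter than the memory you can spare.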