What's the mechanism of iostream buffering? - C++

To start with, cplusplus.com says every stream object has an associated std::streambuf.
And C++ Primer, 5th edition, says:
Each output stream manages a buffer, which it uses to hold the data that the program reads and writes. For example, when the following code is executed
os << "please enter a value: ";
the literal string might be printed immediately, or the operating system might store the data in a buffer to be printed later.
There are several conditions that cause the buffer to be flushed, that is, written, to the actual output device or file: the program completes normally, a manipulator such as endl is used, and so on.
In my understanding, the phrase "the operating system might store the data in a buffer" (a buffer, not the buffer) means that both the stream object and the OS use their own buffers: one in the process address space, the other in kernel space managed by the OS.
And here are my questions:
Why does every process/object (like cout) manage its own buffer? Why not just issue a system call and hand the data directly to the OS buffer?
Furthermore, does the term 'flushed' refer to the object's buffer or the OS buffer? I guess a flush actually issues a system call telling the OS to put the data in its buffer onto the screen immediately.

Why does every process/object (like cout) manage its own buffer? Why not just issue a system call and hand the data directly to the OS buffer?
As a bit of a pre-answer, you could always rewrite the stream buffer to flush to an OS call on every output (or input). In fact, your system may already do this; it just depends on the implementation. The design allows buffering at the level of the iostreams library, but doesn't necessarily require it, as far as I remember.
For buffering, it is not always most efficient to send or read data byte by byte. For cout and cin on many systems this may be handled well enough by the OS, but you can also adapt iostreams to handle streams that read and write sockets (I/O over network connections). With sockets, you could send each individual character in its own packet to your target over the internet, but this can become really slow depending on the type of link and how busy it is. When you read a socket, a message can be split across packets, so you need to buffer the input until you hit 'critical mass'. There are potentially ways to do this buffering at the OS level, but I found I could get much better performance by handling most of this buffering myself (since the size of messages usually had a large standard deviation across the runtime). So the buffering within iostreams was a useful way to manage input and output to optimize performance, and this especially helped when juggling I/O from multiple connections at the same time.
But you can't always assume the OS will do the right thing. I remember once we were using this FUSE module that allowed us to have a distributed file system across multiple computer nodes. It had a really weird problem when writing and reading single characters. Whereas reading or writing a long sequence of single characters would take at most seconds on a normal hard disk using an ext4 system, the same operation would take days on the FUSE system (ignoring for the moment why we did it this way in the first place). Through debugging, we found the hang was at the level of I/O, and reading and writing individual characters exacerbated this run-time problem. We had to re-write the code to buffer our reads and writes. The best we could figure out is that the OS on ext4 did its own buffering but this FUSE file system didn't do a similar buffering when reading and writing to the hard disk.
In any case, the OS may do its own buffering, but there are a number of cases where this buffering is non-existent or minimal. Buffering on the iostream end could help your performance.
Furthermore, does the term 'flushed' refer to the object's buffer or the OS buffer? I guess a flush actually issues a system call telling the OS to put the data in its buffer onto the screen immediately.
I believe most texts talk about 'flushing' in terms of the standard I/O streams in C++. Your program probably doesn't have direct control over how the OS handles its I/O. But in general, the I/O of the OS and your program will stay in sync on most systems.
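To make the two layers concrete, here is a minimal C++ sketch (my own illustration, not from the texts above): data first lands in the stream's library-level buffer, and flushing is what hands it off to the next layer.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>

// Output goes through the stream's std::streambuf first; flush() is what
// hands the buffered bytes to the next layer (a write() system call for
// std::cout, the backing string for a std::ostringstream).
void demo(std::ostream& os) {
    os << "please enter a value: ";  // may sit in the library-level buffer
    os.flush();                      // force the handoff now
    os << "done" << std::endl;       // std::endl writes '\n' and then flushes
}
```

What happens after the handoff (kernel page cache, device cache) is the OS's business; the stream has no further say in it.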

Related

Does fwrite block until data has been written to disk?

Does the fwrite() function return after the data to be written to disk has been handed over to the operating system or does it return only after the data is actually physically written to the disk?
For my case, I'm hoping that it's the first case since I don't want to wait until all the data is physically written to the disk. I'm hoping that another OS thread transfers it in the background.
I'm curious about behavior on Windows 10 in this particular case.
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fwrite
There are several places where data is buffered to improve efficiency when using fwrite(): within the C runtime, in the operating system's file system interface, and within the actual disk hardware.
The default is to delay the actual physical writing of data to disk until there is an explicit request to flush buffers, or until the appropriate indicators are turned on to perform physical writes as the write requests are made.
If you want to change the behavior of fwrite(), take a look at the setbuf() function setbuf redirection as well as the setbuf() Linux man page, and here is the Microsoft documentation on setbuf().
And if you look at the documentation for the underlying Windows CreateFile() function you will see there are a number of flags which include flags as to whether buffering of data should be done or not.
FILE_FLAG_NO_BUFFERING 0x20000000
The file or device is being opened with no system caching for data
reads and writes. This flag does not affect hard disk caching or
memory mapped files.
There are strict requirements for successfully working with files
opened with CreateFile using the FILE_FLAG_NO_BUFFERING flag, for
details see File Buffering.
And see the Microsoft documentation topic File Buffering.
In a simple example, the application would open a file for write
access with the FILE_FLAG_NO_BUFFERING flag and then perform a call to
the WriteFile function using a data buffer defined within the
application. This local buffer is, in these circumstances, effectively
the only file buffer that exists for this operation. Because of
physical disk layout, file system storage layout, and system-level
file pointer position tracking, this write operation will fail unless
the locally-defined data buffers meet certain alignment criteria,
discussed in the following section.
Take a look at this discussion about settings at the OS level for what looks to be Linux https://superuser.com/questions/479379/how-long-can-file-system-writes-be-cached-with-ext4
Does the fwrite(fp,...) function return after the data to be written to disk has been handed over to the operating system or does it return only after the data is actually physically written to the disk?
No. In fact, it does not even (necessarily) wait until data has been handed over to the OS -- fwrite may just put the data in its internal buffer and return immediately without actually writing anything.
To force data to the OS, you need to use fflush(fp) on the FILE pointer, but that still does not necessarily write data to the disk, though it will generally queue it for writing. But it does not wait for those queued writes to finish.
So to guarantee the data is written to disk, you need to make an OS-level call to wait until the queued writes complete. On POSIX systems (such as Linux), that is fsync(fileno(fp)). You'll need to study the Windows documentation to figure out the equivalent on Windows.
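Putting the three layers together, here is a hedged POSIX-only sketch (the function name is mine; on Windows the last step would be _commit(_fileno(fp)) or FlushFileBuffers() instead of fsync()):

```cpp
#include <cassert>
#include <cstdio>
#include <unistd.h>   // fsync, fileno (POSIX)

// Push data through all three layers: stdio buffer -> kernel page cache -> disk.
// Returns true on success.
bool write_fully(const char* path, const char* data, size_t len) {
    FILE* fp = std::fopen(path, "wb");
    if (!fp) return false;
    // fwrite may only copy into stdio's internal buffer and return immediately.
    if (std::fwrite(data, 1, len, fp) != len) { std::fclose(fp); return false; }
    // fflush hands the buffered bytes to the OS (a write() system call) ...
    if (std::fflush(fp) != 0) { std::fclose(fp); return false; }
    // ... and fsync blocks until the kernel has pushed them to the device.
    if (fsync(fileno(fp)) != 0) { std::fclose(fp); return false; }
    return std::fclose(fp) == 0;
}
```

If you only need the data to survive your process dying (not a power failure), fflush() alone is enough and much cheaper than fsync().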

How do I do stdin.byLine, but with a buffer?

I'm reading multi-gigabyte files and processing them from stdin. I'm reading from stdin like this.
string line;
foreach(line1; stdin.byLine){
line = to!string(line1);
...
}
Is there a faster way to do this? I tried a threading approach with
auto childTid = spawn(&fn, thisTid);
string line;
foreach(line1; stdin.byLine){
line = to!string(line1);
receiveOnly!(int);
send(childTid, line);
}
int x= 0;
send(childTid, x);
That allows it to load at least one more line from disk while my processing runs, at the cost of a copy operation, but this is still silly. What I need is fgets, or a way to combine stdin.byChunk(4096) with readln. I tried fgets.
char[] buf = new char[4096];
fgets(buf.ptr, 4096, stdio)
but it always fails because stdio is a file and not a stream. I'm not sure how to make it a stream. Any help with whatever approach you think best would be appreciated. I'm not very good at D; apologies for any noob mistakes.
There are actually already two layers of buffering under the hood (excluding the hardware itself): the C runtime library and the kernel both do a layer of buffering to minimize I/O costs.
First, the kernel keeps data from disk in its own buffer and will look ahead, loading beyond what you request in a single call if you are following a predictable pattern. This is to mitigate the low-level costs associated with seeking the device and will cache across processes - if you read a file with one program then again with a second, the second will probably get it from the kernel memory cache instead of the physical disk and may be noticeably much faster.
Second, the C library, on which D's std.stdio is built, also keeps a buffer. readln ultimately calls C file I/O functions which read a chunk from the kernel at a time. (Fun fact: writes are also buffered by the C library too, by line if the output is interactive and by chunk otherwise. Writing is quite slow, and doing it by chunk makes a big difference, but sometimes the C lib thinks a pipe isn't interactive when it is, which leads to a FAQ: Simple D program Output order is wrong)
These C lib buffers also mitigate the costs of many small reads and writes by batching them up before even sending to the kernel. In the case of readln, it will likely read several kilobytes at once, even if you ask for just one line or one byte, and the rest stays in the buffer for next time.
So your readln loop is already going to be automatically buffered and should get decent I/O performance.
You might be able to do better yourself with a few techniques, though. You could try using std.mmfile for a memory-mapped file and reading it as if it were an array, but your files are too big to fit in the address space on 32 bit. It might work on 64 bit, though. (Note that a memory-mapped file is NOT loaded all at once; it is just mapped to a memory address. When you actually touch part of it, the operating system loads/saves on demand.)
Or, of course, you can use the lower-level operating system functions like write from import core.sys.posix.unistd or WriteFile from import core.sys.windows.windows, which bypass the C lib's layers (but keep the kernel layers, which you want; don't try to bypass those).
You can look for any win32 or posix system call C tutorials if you want to know more about using those functions. It is the same in D as in C, with minor caveats like the import instead of #include.
Once you load the chunk, you will want to scan it for the newline and slice it in all probability to form the range to pass to the loop or other algorithms. The std.range and std.algorithm modules also have searching, splitting, and chunking functions that might help, but you need to be careful with lines that span the edges of your buffers to keep correctness and efficiency.
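The carry-over logic is the fiddly part. Here is a small sketch of it (in C++ for illustration, since the idea is the same in D's byChunk loop; the function name is made up):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split a chunk of bytes into complete lines, carrying any partial trailing
// line over in `carry` so that lines spanning chunk boundaries stay correct.
std::vector<std::string> split_lines(const char* chunk, size_t len,
                                     std::string& carry) {
    std::vector<std::string> lines;
    size_t start = 0;
    for (size_t i = 0; i < len; ++i) {
        if (chunk[i] == '\n') {
            carry.append(chunk + start, i - start);
            lines.push_back(carry);   // a complete line is ready
            carry.clear();
            start = i + 1;
        }
    }
    carry.append(chunk + start, len - start);  // partial line waits for next chunk
    return lines;
}
```

After the final chunk, anything left in carry is the last (unterminated) line.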
But if your performance is good enough as it is, I'd say just leave it - the C lib+kernel's buffering do a pretty good job in most cases.

How does the Linux buffer cache behave when an application crashes?

Let's say I am using C++ file streams asynchronously; I mean I never use std::flush or std::endl. My application writes a lot of data to a file and abruptly crashes.
Is the data remaining in the cache system flushed to the disk, or discarded (and lost)?
Complicating this problem is that there are multiple 'caches' in play.
C++ streams have their own internal buffering mechanism. Streams don't ask the OS to write to disk until either (a) you've sent enough data into the buffer that the streams library thinks the write wouldn't be wasted (b) you ask for a flush specifically (c) the stream is in line-buffering mode, and you've sent along the endl. Any data in these buffers are lost when the program crashes.
The OS will buffer writes to make the best use of the limited disk I/O available. Writes will typically be flushed within five to thirty seconds; sooner if the programmer (or libraries) calls fdatasync(2), fsync(2), or sync(2) (which asks for all dirty data to be flushed). Any data in the OS buffers is (eventually) written to disk when the program crashes, and lost only if the kernel crashes.
The hard drive will buffer writes to try to make the best use of its slow head, rotational latency, etc. Data arrives in this buffer when the OS flushes its caches. Data in these buffers is written to disk when the program crashes, will probably be written to disk if the kernel crashes, and might be written to disk if power is suddenly removed from the drive. (Some drives have enough power to finish writing their buffers; in any case this would typically take less than a second.)
The stuff in the library buffer (which you flush with std::flush or such) is lost; the data in the OS kernel buffers (which you can flush e.g. with fsync()) is not lost unless the OS itself crashes.
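A POSIX-only sketch that makes the distinction observable (names are mine; _exit skips stdio/stream cleanup, which is a fair stand-in for a crash): data still in the library buffer vanishes, data already handed to the kernel survives the process dying.

```cpp
#include <cassert>
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

// A child process "crashes" via _exit (no stdio flush, no destructors).
// Returns how many bytes of its write actually survived in the file.
long bytes_surviving_crash(const char* path, bool flush_before_crash) {
    if (fork() == 0) {                       // child
        FILE* fp = std::fopen(path, "wb");
        std::fwrite("12345", 1, 5, fp);      // sits in the stdio buffer
        if (flush_before_crash)
            std::fflush(fp);                 // handed to the kernel page cache
        _exit(0);                            // "crash": no fclose, no flush
    }
    wait(nullptr);                           // parent: wait for the child
    FILE* fp = std::fopen(path, "rb");
    if (!fp) return -1;
    std::fseek(fp, 0, SEEK_END);
    long size = std::ftell(fp);
    std::fclose(fp);
    return size;
}
```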

All things equal what is the fastest way to output data to disk in C++?

I am running simulation code that is largely bound by CPU speed. I am not interested in pushing data in/out to a user interface, simply saving it to disk as it is computed.
What would be the fastest solution that would reduce overhead? iostreams? printf? I have previously read that printf is faster. Will this depend on my code and is it impossible to get an answer without profiling?
This will be running in Windows and the output data needs to be in text format, tab/comma separated, with formatting/precision options for mostly floating point values.
Construct (large-ish) blocks of data which can be sequentially written and use asynchronous IO.
Accurate profiling will be painful; read some papers on the subject: scholar.google.com.
I haven't used them myself, but I've heard memory mapped files offer the best optimisation opportunities to the OS.
Edit: related question, and Wikipedia article on memory mapped files — both mention performance benefits.
My thought is that you are tackling the wrong problem. Why are you writing out vast quantities of text-formatted data? If it is because you want it to be human readable, write a quick browser program to read the data in binary format on the fly; this way the simulation application can quickly write out binary data and the browser can do the grunt work of formatting the data as and when needed. If it is because you are using some stats package to read and analyse text data, then write one that inputs binary data.
Scott Meyers' More Effective C++ point 23 "Consider alternate libraries" suggests using stdio over iostream if you prefer speed over safety and extensibility. It's worth checking.
The fastest way is whatever is fastest for your particular application running on its typical target OS and hardware. The only sensible thing to do is to try several approaches and time them. You probably don't need a complete profile, and the exercise should only take a few hours. I would test, in this order:
normal C++ stream I/O
normal stream I/O using ostream::write()
use of the C I/O library
use of system calls such as write()
asynch I/O
And I would stop when I found a solution that was fast enough.
Text format means it's for human consumption. The speed at which humans can read is far, far lower than the speed of any reasonable output method. There's a contradiction somewhere; I suspect it's the "output must be text format" requirement.
Therefore, I believe the correct way is to output binary and provide a separate viewer to convert individual entries to readable text. Formatting in the viewer need only be as fast as people can read.
Mapping the file to memory (i.e. using a Memory Mapped File) then just memcopy-ing data there is a really fast way of reading/writing.
You can use several threads/cores to write to the data, and the OS/kernel will sync the pages to disk, using the same kind of routines used for virtual memory, which one can expect to be optimized to hell and back, more or less.
Chiefly, there should be few extra copies/buffers in memory when doing this. The writes are caught by interrupts and added to the disk queue once a page has been written.
Open the file in binary mode, and write "unformatted" data to the disc.
fstream myFile;
...
myFile.open("mydata.bin", ios::in | ios::out | ios::binary);
...
struct Data {
    int key;
    double value;
    char desc[10];
};
Data x;
myFile.seekp(location1);
myFile.write((char*)&x, sizeof(Data));
EDIT: The OP added the "Output data needs to be in text format, whether tab or comma separated." constraint.
If your application is CPU bound, the formatting of output is an overhead that you do not need. Binary data is much faster to write and read than ascii, is smaller on the disc (e.g. there are fewer total bytes written with binary than with ascii), and because it is smaller it is faster to move around a network (including a network mounted file system). All indicators point to binary as a good overall optimization.
Viewing the binary data can be done after the run with a simple utility that will dump the data to ascii in whatever format is needed. I would encourage some version information be added to the resulting binary data to ensure that changes in the format of the data can be handled in the dump utility.
Moving from binary to ascii, and then quibbling over the relative performance of printf versus iostreams is likely not the best use of your time.
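As a hedged sketch of that suggestion (the names and the tab-separated layout are mine): write versioned raw records during the run, then convert to text afterwards with a tiny dump step.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Versioned binary record, written raw during the simulation and converted
// to text only afterwards. (A real format would also pin down endianness
// and struct padding, e.g. with fixed-width fields.)
struct Record {
    int    key;
    double value;
};
const int kFormatVersion = 1;

bool write_records(const char* path, const Record* recs, size_t n) {
    FILE* fp = std::fopen(path, "wb");
    if (!fp) return false;
    std::fwrite(&kFormatVersion, sizeof kFormatVersion, 1, fp);  // header
    std::fwrite(recs, sizeof(Record), n, fp);                    // raw records
    return std::fclose(fp) == 0;
}

// The "dump" step: binary in, tab-separated text out.
std::string dump_records(const char* path) {
    FILE* fp = std::fopen(path, "rb");
    if (!fp) return "";
    int version = 0;
    std::fread(&version, sizeof version, 1, fp);
    std::string out = "version " + std::to_string(version) + "\n";
    Record r;
    char line[64];
    while (std::fread(&r, sizeof r, 1, fp) == 1) {
        std::snprintf(line, sizeof line, "%d\t%.3f\n", r.key, r.value);
        out += line;
    }
    std::fclose(fp);
    return out;
}
```

The version header is what lets the dump utility keep working when the record layout changes later.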
The fastest way is completion-based asynchronous IO.
By giving the OS a set of data to write, which it hasn't actually written when the call returns, the OS can reorder it to optimise write performance.
The API for doing this is OS-specific: on Linux it's called AIO; on Windows it's called I/O Completion Ports.
A fast method is to use double buffering and multiple threads (at least two).
One thread is in charge of writing data to the hard drive. This thread checks the buffer and, if it is not empty (or some other rule is satisfied), begins writing to the hard drive.
The other thread writes formatted text to the buffer.
One performance issue with hard drives is the amount of time required to get up to speed and position the head to the correct location. To avoid this from happening, the objective is to continually write to the hard drive so that it doesn't stop. This is tricky and may involve stuff outside of your program's scope (such as other programs running at the same time). The larger the chunk of data written to the hard drive, the better.
Another thorn is finding empty slots on the hard drive to put the data. A fragmented hard drive would be slower than a formatted or defragmented drive.
If portability is not an issue, you can check your OS for some APIs that perform block writes to the hard drive. Or you can go down lower and use the API that writes directly to the drive.
You may also want your program to change its priority so that it is one of the most important tasks running.
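A minimal sketch of the double-buffering idea with std::thread (the class and names are mine, assuming a single producer): the producer fills the front buffer while the writer thread drains the back buffer, so a disk stall doesn't block formatting.

```cpp
#include <cassert>
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>

// Double buffering: the producer appends formatted text to the front buffer;
// the writer thread swaps buffers and writes the back buffer out, so the
// slow disk write happens with the lock released.
class DoubleBufferedWriter {
public:
    explicit DoubleBufferedWriter(const std::string& path)
        : out_(path), done_(false), writer_([this] { run(); }) {}

    void append(const std::string& text) {
        std::lock_guard<std::mutex> lk(m_);
        front_ += text;
        cv_.notify_one();
    }

    ~DoubleBufferedWriter() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        writer_.join();                     // drain everything before closing
    }

private:
    void run() {
        std::string back;                   // writer-owned back buffer
        for (;;) {
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return !front_.empty() || done_; });
                if (front_.empty() && done_) break;
                std::swap(front_, back);    // cheap swap under the lock
            }
            out_ << back;                   // slow disk write, lock NOT held
            back.clear();
        }
        out_.flush();
    }

    std::ofstream out_;
    std::mutex m_;
    std::condition_variable cv_;
    std::string front_;
    bool done_;
    std::thread writer_;                    // declared last: starts fully initialized
};
```

A production version would also cap the front buffer's size so the producer blocks (or drops data deliberately) instead of growing without bound when the disk can't keep up.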

How best to manage Linux's buffering behavior when writing a high-bandwidth data stream?

My problem is this: I have a C/C++ app that runs under Linux, and this app receives a constant-rate high-bandwith (~27MB/sec) stream of data that it needs to stream to a file (or files). The computer it runs on is a quad-core 2GHz Xeon running Linux. The filesystem is ext4, and the disk is a solid state E-SATA drive which should be plenty fast for this purpose.
The problem is Linux's too-clever buffering behavior. Specifically, instead of writing the data to disk immediately, or soon after I call write(), Linux will store the "written" data in RAM, and then at some later time (I suspect when the 2GB of RAM starts to get full) it will suddenly try to write out several hundred megabytes of cached data to the disk, all at once. The problem is that this cache-flush is large, and holds off the data-acquisition code for a significant period of time, causing some of the current incoming data to be lost.
My question is: is there any reasonable way to "tune" Linux's caching behavior, so that either it doesn't cache the outgoing data at all, or if it must cache, it caches only a smaller amount at a time, thus smoothing out the bandwidth usage of the drive and improving the performance of the code?
I'm aware of O_DIRECT and will use it if I have to, but it places some behavioral restrictions on the program (e.g. buffers must be aligned and a multiple of the disk sector size, etc.) that I'd rather avoid if I can.
You can use the posix_fadvise() with the POSIX_FADV_DONTNEED advice (possibly combined with calls to fdatasync()) to make the system flush the data and evict it from the cache.
See this article for a practical example.
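A hedged Linux-only sketch of that combination (the function name is mine): force each chunk out with fdatasync(), then tell the kernel the now-clean pages can be evicted, so dirty data never piles up into one giant stall.

```cpp
#include <cassert>
#include <cstdio>
#include <fcntl.h>    // open flags, posix_fadvise
#include <unistd.h>   // write, fdatasync, close

// Write a chunk, push it to the device, and release its page-cache pages.
bool write_and_drop(const char* path, const char* data, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    ssize_t n = write(fd, data, len);          // lands in the page cache
    bool ok = (n == (ssize_t)len);
    ok = ok && fdatasync(fd) == 0;             // force it to the device now
    // The pages are clean now, so the kernel can actually evict them:
    ok = ok && posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) == 0;
    return (close(fd) == 0) && ok;
}
```

In a streaming application you would do this per chunk (say, every few tens of megabytes) on an already-open descriptor rather than reopening the file each time.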
If you have latency requirements that the OS cache can't meet on its own (the default IO scheduler is usually optimized for bandwidth, not latency), you are probably going to have to manage your own memory buffering. Are you writing out the incoming data immediately? If you are, I'd suggest dropping that architecture and going with something like a ring buffer, where one thread (or multiplexed I/O handler) is writing from one side of the buffer while the reads are being copied into the other side.
At some size, this will be large enough to handle the latency required by a pessimal OS cache flush. Or not, in which case you're actually bandwidth limited and no amount of software tuning will help you until you get faster storage.
You can adjust the page cache settings in /proc/sys/vm, (see /proc/sys/vm/dirty_ratio, /proc/sys/vm/swappiness specifically) to tune the page cache to your liking.
If we are talking about std::fstream (or any C++ stream object)
You can specify your own buffer using:
streambuf* ios::rdbuf ( streambuf* streambuffer);
By defining your own buffer you can customize the behavior of the stream.
Alternatively you can always flush the buffer manually at pre-set intervals.
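Both suggestions can be sketched like this (a hedged example, names mine): pubsetbuf() on the stream's existing filebuf is the simplest way to install your own buffer, and it has to be called before open() to work portably; you then flush manually at whatever interval suits you.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Give an ofstream a large private buffer so the library hands data to the
// OS in big chunks, and flush manually at a point you choose.
bool write_with_big_buffer(const char* path) {
    static char buf[1 << 20];                 // 1 MiB library-level buffer
    std::ofstream out;
    out.rdbuf()->pubsetbuf(buf, sizeof buf);  // install it before open()
    out.open(path);
    if (!out) return false;
    for (int i = 0; i < 1000; ++i)
        out << "sample line " << i << '\n';   // accumulates in buf
    out.flush();                              // manual flush at a preset interval
    return out.good();
}
```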
Note: there is a reason for having a buffer. It is quicker than writing to the disk directly (every 10 bytes). There is very little reason to write to a disk in chunks smaller than the disk block size, and if you write too frequently the disk controller will become your bottleneck.
But I do have an issue with using the same thread for writing, so that the write process needs to block the read processes.
While the data is being written, there is no reason another thread cannot continue to read data into your stream (you may need some fancy footwork to make sure they are reading/writing to different areas of the buffer). I don't see any real potential issue with this, as the I/O system will go off and do its work asynchronously (potentially stalling your write thread, depending on your use of the I/O system, but not necessarily your application).
I know this question is old, but we know a few things now we didn't know when this question was first asked.
Part of the problem is that the default values for /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio are not appropriate for newer machines with lots of memory. Linux begins the flush when dirty_background_ratio is reached, and blocks all I/O when dirty_ratio is reached. Lower dirty_background_ratio to start flushing sooner, and raise dirty_ratio to start blocking I/O later. On very large memory systems, (32GB or more) you may even want to use dirty_bytes and dirty_background_bytes, since the minimum increment of 1% for the _ratio settings is too coarse. Read https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ for a more detailed explanation.
Also, if you know you won't need to read the data again, call posix_fadvise with FADV_DONTNEED to ensure cache pages can be reused sooner. This has to be done after linux has flushed the page to disk, otherwise the flush will move the page back to the active list (effectively negating the effect of fadvise).
To ensure you can still read incoming data in the cases where Linux does block on the call to write(), do file writing in a different thread than the one where you are reading.
Well, try this ten-pound-hammer solution that might prove useful for seeing whether I/O system caching contributes to the problem: every 100 MB or so, call sync().
You could use a multithreaded approach: have one thread simply read data packets and add them to a FIFO, and another thread remove packets from the FIFO and write them to disk. This way, even if the write to disk stalls, the program can continue to read incoming data and buffer it in RAM.