How is a read system call different from the istream::read function? - c++

My Operating Systems professor was talking today about how a read system call is unbuffered while an istream::read function has a buffer. This left me a bit confused, as you still make a buffer for the istream::read function when using it.
The only thing I can think of is that there is more than one buffer involved in the istream::read function call. Why?
What does the istream::read() function do differently from the read() function system call?

The professor was talking about buffers internal to the istream rather than the buffer provided by the calling code where the data ends up after the read.
As an example, say you are reading individual int objects out of an istream. The istream is likely to have an internal buffer where some number of bytes is stored, so the next read can be satisfied out of that buffer rather than by going to the OS. Note, however, that whatever the istream is hooked to very likely has internal buffers as well. Most OSes have a means of performing zero-copy reads (that is, reading directly from the I/O source into your buffer), but that facility comes with severe restrictions (the read size must be a multiple of some particular number of bytes, and if reading from a disk file, the file pointer must also be on a multiple of that byte count). Most of the time such zero-copy reads are not worth the hassle.

Related

C++-What is the need of both buffer and stream?

I have read about buffers and streams and how they work with files in C++, but I don't understand why a buffer is needed if a stream is there; the stream is always there to transfer the data of a file to the program. So why do we use buffers to store data (performing the same task the stream does), and what are buffered and unbuffered streams?
Consider a stream that writes to a file. If there were no buffer, then every time your program wrote a single byte to the stream, you'd have to write a single byte to the file. That's very inefficient. So streams have buffers to decouple operations on one side of the stream from operations on the other side of the stream.
OK, let's start from scratch. Suppose you want to work with files. For this purpose you would have to manage how the data is entered into your file, whether sending the data to the file was successful or not, and all the other basic housekeeping. You can either manage all that on your own, which would take a lot of time and hard work, or you can use a stream.
Yes, you can allocate a stream for such purposes. Streams work through abstraction, i.e. we C++ programmers don't know how they work internally; we only know that we are at one end of the stream (our program's side), we offer our data to the stream, and it has the responsibility of transferring the data from one end to the other (the file's side).
For example:
std::ofstream file("abc.txt"); // requires <fstream>; creates an output file stream object
file << "Hello"; // we just hand our data to the stream, and it transfers it to the file
file.close(); // writes out anything still buffered and closes the file
If you work with files, you should know that accessing a file is a really expensive operation, i.e. it takes more time to access a file than to access memory, and we don't need to perform a file operation every single time. So programmers introduced the buffer, which is a part of the computer's memory that stores data temporarily while handling files.
Suppose that instead of going to the file every time you read data, you just read a memory location where the file's data has been copied temporarily. That is a less expensive task, as you are reading memory, not the file.
Streams that use a buffer, i.e. that copy a chunk of the file's data into the buffer when they access the file, are called buffered streams, whereas streams that don't use any buffer are called unbuffered streams.
If you write data to a buffered stream, that data is queued up until the stream is flushed (flushing means writing the buffer's contents out to the file). Unbuffered streams deliver data sooner (from the point of view of the user at one end of the stream), as the data is not temporarily stored in a buffer and is sent to the file as soon as it reaches the stream.
A buffer and a stream are different concepts.
A buffer is a part of the memory to temporarily store data. It can be implemented and structured in various ways. For instance, if one wants to read a very large file, chunks of the file can be read and stored in the buffer. Once a certain chunk is processed the data can be discarded and the next chunk can be read. A chunk in this case could be a line of the file.
Streams are the way C++ handles input and output. Their implementation uses buffers.
I do agree that streams are probably the most poorly written and most badly understood part of the standard library. People use them every day, and many of them haven't the slightest clue how the constructs they use work. For a little fun, try asking around what std::endl is; you might find that some of the answers are funny.
At any rate, streams and streambufs have different responsibilities. Streams are supposed to provide formatted input and output, that is, translate an integer to a sequence of bytes (or the other way around), and buffers are responsible for conveying the sequence of bytes to the media.
Unfortunately, this design is not clear from the implementation. For instance, we have all those numerous streams (file streams and string streams, for example) while the only difference between them is the buffer. The stream code remains exactly the same. I believe many people would redesign streams if they had their way, but I am afraid this is not going to happen.

Boost::knuth_morris_pratt over a std::istream

I would like to use boost::algorithm::knuth_morris_pratt over some huge files (several hundred gigabytes). This means I can't just read the whole file into memory or mmap it; I need to read it in chunks.
knuth_morris_pratt operates on an iterator, so I guess it is possible to make it read input data "lazily" (on-demand), it would be a matter of writing a "lazy" iterator for some file access class like ifstream, or better istream.
I would like to know if there is some adapter available (already written) that adapts istream to Boost's knuth_morris_pratt so that it won't read all file data all at once?
I know there is a boost::spirit::istream_iterator, but it lacks some methods (like operator+), so it would have to be modified to work.
On StackOverflow there's an implementation of bidirectional_iterator here, but it still requires some work before it can be used with knuth_morris_pratt.
Are there any istream iterators that are already written, tested and working?
Update: I can't do mmap, because my software should work on multiple operating systems, both on 32-bit and 64-bit architectures. Also very often I don't have the files anyway, they're being generated on-the-fly, that's why I search for a solution that involves streams.
You should simply memory map it.
In practice, 64-bit processors usually have 48-bit address space which is enough for 256 terabytes of memory.
Last I checked, Linux allows 128TB of virtual address space per process on x86-64
(from https://superuser.com/a/168121/75721)
Spirit's istream_iterator is actually its multi_pass_adaptor, and it has different design goals. Unless you have a way to flush the buffer, it will grow indefinitely (by default allocating a deque buffer).

Could I read lines from a C++ socket using Ubuntu?

I wonder if I could read several lines from a C++ socket using Ubuntu?
Please note that every line is to be used for a different purpose (e.g. maybe the first is used as a string and the second as a char array).
I.e. Could I put those two lines directly after each other without encountering any problem?
read(socketFileDescriptor, buffer1, BUFFER_SIZE);
read(socketFileDescriptor, buffer2, BUFFER_SIZE);
Thanks in advance,
Regards,
You can call read twice in sequence without any problem in itself.
What you get from each call may not correspond to a single line of input, though. read basically just does "raw" reading, much as it does when reading from a file on disk: if data is available, it will read as much of it as fits in the buffer you gave it (up to the size you specified), and it may return fewer bytes than you asked for.
TCP treats data as a stream, so data you pass to two (or more) separate calls to write could end up being put into a single packet and transmitted together. On the receiving end, all that data could be read by a single call to read--or, depending on the buffer size you pass, it might read only part of one, or might read all of the first and part of the second, etc.
If you want to read the input as "lines", you could (for one example) create a stream buffer that reads data from a socket, and create an iostream object that parses data from that buffer to read lines. This initially seems attractive to many people (it did to me, anyway), but has never worked out very well, at least for me. Iostreams basically assume a synchronous protocol, but sockets are mostly asynchronous. Trying to treat sockets as synchronous tends to lead to more problems rather than to solutions.

How to read blocks of data from a file and then read from that block into a vector?

Suppose I have a file which has x records. One 'block' holds m records, so the total number of blocks in the file is n = x/m. If I know the size of one record, say b bytes (so the size of one block is b*m), I can read a complete block at once using the read() system call (is there any other method?). Now, how do I read each record from this block and put each record as a separate element into a vector?
The reason I want to do this in the first place is to reduce the number of disk I/O operations, as disk I/O operations are much more expensive according to what I have learned.
Or will it take the same amount of time as when I read record by record from file and directly put it into vectors instead of reading block by block? On reading block by block, I will have only n disk I/O's whereas x I/O's if I read record by record.
Thanks.
You should consider using mmap() instead of reading your files using read().
What's nice about mmap is that you can treat file contents as simply mapped into your process space as if you already had a pointer into the file contents. By simply inspecting memory contents and treating it as an array, or by copying data using memcpy() you will implicitly perform read operations, but only as necessary - operating system virtual memory subsystem is smart enough to do it very efficiently.
The only possible reason to avoid mmap may be if you are running on a 32-bit OS and the file size exceeds 2 gigabytes (or slightly less than that). In this case the OS may have trouble allocating address space for your mmap-ed memory. But on a 64-bit OS, using mmap should never be a problem.
Also, mmap can be cumbersome if you are writing a lot of data and the size of the data is not known up front. Other than that, it is always better and faster to use it than read.
Actually, most modern operating systems rely on mmap extensively. For example, in Linux, to execute some binary, your executable is simply mmap-ed and executed from memory as if it was copied there by read, without actually reading it.
Reading a block at a time won't necessarily reduce the number of I/O operations at all. The standard library already does buffering as it reads data from a file, so you do not (normally) expect to see an actual disk input operation every time you attempt to read from a stream (or anything close).
It's still possible reading a block at a time would reduce the number of I/O operations. If your block is larger than the buffer the stream uses by default, then you'd expect to see fewer I/O operations used to read the data. On the other hand, you can accomplish the same by simply adjusting the size of buffer used by the stream (which is probably a lot easier).

When is an opportune moment to flush the output buffer, and some basic C++

I'm reading accelerated c++ and the author writes:
Flushing the output buffers at opportune moments is an important habit when you are writing programs that might take a long time to run. Otherwise, some of the program's output might languish in the system's buffers for a long time between when your program writes it and when you see it.
Please correct me if I misunderstand any of these concepts:
Buffer: a block of random access memory that is used to hold input or output temporarily.
Flushing: freeing up random access memory that had been... eh.. assigned to certain ..umm
There is this explanation I found:
Flushing an output device means that all preceding output operations are required to be completed immediately. This is related to the issue of buffering, which is an optimization technique used by the operating system. Roughly speaking, the operating system reserves (and usually exerts) the right to put the data “on stand by” until it decides that it has an amount of data large enough to justify the cost associated to sending the data to the screen. In some cases, however, we need the guarantee that the output operations performed in our program are completed at a given point in the execution of our program, so we flush the output device.
Continuing from that explanation, I read that three events cause the system to flush the buffer:
Buffer becomes full and will automatically flush
The library might be asked to read from the standard input stream (is reading from the standard input stream something like std::cin >> name;?)
The third occasion is when we explicitly tell it to. How do we explicitly tell it to?
Despite this, I don't feel like I fully grasp the following:
What an output buffer is vs. just a buffer and presumably other types of buffers...
What it means to flush a buffer. Does it simply mean to clear the RAM?
What is the "output device" referred to in the above explanation?
And finally, after all this, when are opportune moments to flush your buffer... ugh, that doesn't sound pleasant.
To flush an std::ostream, you use the std::flush manipulator. i.e.
std::cout << std::flush;
Note that std::endl already flushes the stream. So if you are in the habit of ending your insertions with it, you don't need to do anything additional. Note that this means if you are seeing poor performance because you flush too much, you need to switch from inserting std::endl to inserting a newline: '\n'.
A stream is a sequence of characters (i.e. things of type char). An output stream is one you write characters to. Typical applications are writing data to files, printing text on screen, or storing them in a std::string.
Streams often have the feature that writing 1024 characters at once is an order of magnitude (or more!) faster than writing 1 character at a time 1024 times. One of the main purposes of the notion of 'buffering' is to deal with this in a convenient fashion. Rather than writing directly to wherever you actually want the characters to go, you instead write to the buffer. Then, when you're ready, you "flush" the buffer: you move the characters from the buffer to the place where you want them. Or, if you don't care about the precise details, you use a buffer that flushes itself automatically, e.g. the buffer used in an std::ofstream is typically of fixed size, and will flush whenever it's full.
When is it an opportune time to flush, you ask? I say you're optimizing prematurely. :) Rather than looking for the perfect moments to flush, just do it often. Put in enough flushes that you'll never find yourself in a situation where, e.g., you want to look at the data in a file but it's sitting unwritten in a buffer. Then, if it really does turn out that too many flushes are hurting performance, that's when you spend time looking into it.
You explicitly flush a stream with your_stream.flush();.
What an output buffer is vs. just a buffer and presumably other types of buffers...
A buffer is usually a block of memory used to hold data waiting for processing. One typical use is data that's just been read from a stream, or data waiting to be written to disk. Either way, it's generally more efficient to read/write large blocks of data at a time, so the library reads/writes an entire buffer at a time, while the client code can read/write in whatever amount is convenient (e.g., one character or one line at a time).
What it means to flush a buffer. Does it simply mean to clear the ram?
That depends. For an input buffer, yes, it typically means just clearing the contents of the buffer, discarding any data that's been read into the buffer (though it doesn't usually clear the RAM -- it just sets its internal book-keeping to say the buffer is empty).
For an output buffer, flushing the buffer normally means forcing whatever data is in the buffer to be written to the associated stream immediately.
What is the "output device" referred to in the above explanation?
When you're writing data, it's whatever device you're ultimately writing to. That could be a file on the disk, the screen, etc.
And finally, after all this, when are opportune moments to flush your buffer... ugh, that doesn't sound pleasant.
One obvious opportune moment is right when you finish writing data for a while, and you're going to go back to processing (or whatever) that doesn't produce any output (at least to the same destination) for a while. You don't want to flush the buffer if you're likely to produce more data going the same place right afterward -- but you also don't want to leave the data in the buffer when there's going to be a noticeable delay before you fill the buffer (or whatever) so the data will get written to its destination.
This depends very much on the type of application, but one rule of thumb is to flush after you have written one record. For text that is usually after every line, for binary data after every object. If the performance seems to be too slow, then flush after every X records you write, and experiment with X until you find a value where you are happy with the performance but X is not so big that you lose too much data in case of a crash.
I think the author means stream buffers. An opportune moment to flush a buffer really depends on what your code does, how it's constructed, how the buffer is allocated, and probably the scope it's initialized in.
For stream and output buffers take a look at this.
Yes, reading from the standard input stream means using the >> operator (mostly).
You can explicitly tell a stream buffer to flush by calling, for example, ofstream::flush. Of course, other types of buffers have their own explicit flushing methods, and some might require a manual implementation.
Taking your questions one by one:
A buffer, in general, is just a block of memory used to temporarily
hold data. When writing to an `std::ofstream`, characters are sent to a
`std::filebuf`, which typically, by default, will simply put them into a
buffer rather than outputting immediately to the system. When using an
`std::ofstream`, there are actually two buffers in play, one in the
`ofstream` (within your process), and one in the OS.
The standard speaks of the underlying data as a sequence of characters
on an external support, with the buffer representing a window into that
sequence; outputting data may only update the image in the buffer, and
flushing "synchronizes" the image in the buffer with the image of the
data the OS has. Which is a reasonably good description if you're
outputting to a real file, but doesn't really fit if you're outputting
directly to a serial port, or something like that, where the OS doesn't
maintain any "image" of the data. Basically, if you've written data
to the stream which hasn't been transferred to the OS, flushing the
buffer will transfer it to the OS (which means that the `ofstream` can
reuse the buffer memory for further buffering). Flushing the buffer
typically (i.e. on all of the implementations I know) only synchronizes
with the OS (which is all that the standard requires); it doesn't ensure
that the data has actually been written to disk. Depending on the
application, this may or may not be an issue.
The "output device" is anything the system wants it to be. A file, a
window on the screen, or in older times or on simpler systems, a printer
or a serial port. And the explanation you cite is very misleading (or
rather isn't talking about `ofstream`), because flushing an `ofstream`
doesn't ensure that all preceding output operations are fully finished.
All it ensures is that the data in the stream buffer has been transferred
to (synchronized with) the OS. In most cases (at least under Windows
and Unix), all this means is that the data has been moved from one
buffer (in your process) to another (in the OS).
The opportune moments will depend a lot on what the application is
doing. As a general rule, I'd suggest flushing often, so that if your
program crashes, you can see more or less how far it has gotten.
(Remember, outputting `std::endl` flushes. For most simple use, just
using `std::endl` instead of `'\n'` is sufficient.) There are at least
two cases where you will want to think more about flushing, however; if
you're outputting a very large amount of data in a block (i.e. without
doing much more than formatting between the outputs), excessive flushing
can slow the output down considerably. In such cases, you may want to
consider using `'\n'` instead of `std::endl`. And the other is for
things like logging, where you want the data to appear immediately, even
if the following data will not be output for a while—in this case,
you want to be sure that the data has been flushed before continuing.
Data will be explicitly flushed if you call std::ostream::flush() or
std::ofstream::close(). (In the latter case, of course, you cannot
write more data later.)
Note too that because the data is not actually "written" until it is
flushed, most possible errors cannot be detected until then. In
particular, something like:
if ( output << data ) {
// succeeded...
}
doesn't actually work; the "success" reported by the ofstream is only
that it has successfully copied the characters into its buffer (which
can hardly fail).
The usual idiom when writing a large block of data, without
interruption, is to just write it, without flushing, then close the file
and check for errors then. This is not appropriate when writing with
interruptions if you want the data to appear immediately, and it has the
disadvantage that if your program crashes, some of the data you've
"written" will have disappeared, which can make debugging harder.