Clarify the difference between input/output stream and input/output buffer - c++

Input stream and Input buffer
From what I understand, when a key is pressed on the keyboard, the character goes into the input stream (stdin) and gets stored in a buffer. Then scanf (in C) or cin (in C++) extracts the character from the buffer and places it in main memory.
Output stream and Output buffer
Similarly, before characters are displayed on the screen, they are first stored in a buffer. Then printf (in C) or cout (in C++) extracts the characters from the buffer (when it is full) and sends them to the output stream (stdout).
Am I right? I've been stuck on this for quite a while now and my logic may be flawed.

Side note: scanf() is not the function to read input, see more here.
Now for your question: when asking about C (and C++), i.e., the language, you should stay within the abstract concepts the language provides. So don't start at the keyboard; that's far outside your program.
Start here: The operating system wants to deliver some input to you. Now, your C runtime provides a stream of input to your code. The stream is an abstract concept; it just means something you can continuously read from. This stream can be buffered or unbuffered, and if it's buffered, there are different modes (fully buffered or line buffered) available. You can configure all of that.
If your stream is unbuffered, this means the operating system has to wait until your code wants to read from the input stream. By default, your standard input stream is line buffered, which means your C runtime accepts the input immediately and puts it into a buffer until there is a newline -- your code calling input functions will get a result from that buffer.
Conceptually the same happens with output, just the other way around. If your output stream is for example line buffered, your C runtime will fill a buffer until there is a newline and deliver that whole line to the operating system for output. If the output is unbuffered, every single character is immediately passed to the operating system.
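To make the buffering modes concrete, here is a minimal sketch (not part of the original answer, just an illustration) using the standard C setvbuf() function to pick the mode for stdout; the buffer size is arbitrary:
#include <cstdio>

int main() {
    static char buf[BUFSIZ];

    // Fully buffered: characters accumulate until the buffer is full.
    std::setvbuf(stdout, buf, _IOFBF, sizeof buf);

    // Line buffered: the buffer is handed to the OS at each newline.
    // std::setvbuf(stdout, buf, _IOLBF, sizeof buf);

    // Unbuffered: every character is passed to the OS immediately.
    // std::setvbuf(stdout, nullptr, _IONBF, 0);

    std::printf("hello\n");
}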
Disclaimer: this is still a lot simplified, but should be enough to start with.
As you ask about the term "buffer overflow" in the comments, mentioning gets() -- this is about a buffer inside your own code. With any input function that reads more than a single value/char, you have to provide your own buffer for it to store the result to. With gets(), there's no way to tell the function how large this buffer is, so it will just overflow it whenever the input available is too large. This is why gets() is ill-defined and meanwhile removed from the language C. It has nothing to do with buffers of your C runtime that are possibly used in the implementation of the I/O streams.
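As an illustration (a sketch, not from the original question): with fgets() you pass the size of your own buffer, so the function cannot write past its end, which is exactly what gets() could not guarantee:
#include <cstdio>

int main() {
    char buf[32];                        // our own buffer, inside our own code

    // gets(buf);                        // no size argument: longer input would
                                         // overflow buf (gets() was removed from C)

    if (std::fgets(buf, sizeof buf, stdin) != nullptr)
        std::printf("read: %s", buf);    // at most 31 characters plus '\0' are stored
}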

Related

Why think in terms of buffers and not lines? And can fgets() read multiple lines at one time?

Recently I have been reading Unix Network Programming, Vol. 1. In section 3.9, the last two paragraphs above Figure 3.18 say, here I quote:
...But our advice is to think in terms of buffers and not lines. Write your code to read buffers of data, and if a line is expected, check the buffer to see if it contains that line.
And in the next paragraph, the authors gave a more specific example, here I quote:
...as we'll see in Section 6.3. System functions like select still won't know about readline's internal buffer, so a carelessly written program could easily find itself waiting in select for data already received and stored in readline's buffers.
In section 6.5, the actual problem is "mixing of stdio and select()", which would make the program, here I quote the book, "error-prone". But how?
I know that the authors gave the answer later in the same section, and according to my understanding of the book, it is because the data is hidden from select(), so select() cannot know whether data that has already been read has been consumed or not.
The answer is literally there, but my first problem is that I have a hard time getting it: I cannot imagine what damage it would do to the program. Maybe I need a demo program that suffers from the problem to help me understand it.
Still in section 6.5, the authors tried to explain the problem further by giving, here I quote:
... Consider the case when several lines of input are available from the standard input.
select will cause the code at line 20 to read the input using fgets and that, in turn, will read the available lines into a buffer used by stdio. But, fgets only returns a single line and leaves any remaining data sitting in the stdio buffer ...
The "line 20" mentioned above is:
if (Fgets(sendline, MAXLINE, fp) == NULL)
where sendline is an array of char and fp is a pointer to FILE. I looked into the detailed implementation of Fgets, and it just wraps fgets() with some extra error-handling logic and nothing more.
And here comes my second question, how does fgets manage to, here I quote again, read the available lines? I mean, I looked up the man-page of fgets, it says fgets normally stops on the first newline character. Doesn't this mean that only one line would be read by fgets? More specifically, if I type one line in the terminal and press the enter key, then fgets reads this exact line. I do this again, then the next new line is read by fgets, and the point is one line at a time.
Thanks for your patience in reading all the descriptions, and looking forward to your answers.
One of the main reasons to think about buffers rather than lines (when it comes to network programming) is because TCP is a streaming protocol, where data is just a stream of bytes beginning with a connection and ending with a disconnection.
There are no message boundaries, and there are no "lines", except what the application-level protocol on top of TCP has decided.
That makes it impossible to read a "line" from a TCP connection; there are no such primitive functions for it. You must read using buffers. And because of the streaming and the lack of any kind of boundaries, a single call to receive data may give your application less than you ask for, and it may be a partial application-level message. Or you might get more than a single message, including a partial message at the end.
Another note of importance is that sockets are blocking by default, so a socket that doesn't have any data ready to be received will cause any read call to block and wait until there is data. The select call only tells you whether a read call won't block right now. If you do the read call multiple times in a loop, it can (and ultimately will) block when the data to receive is exhausted.
All this makes it really hard to use high-level functions like fgets (after a fdopen call, of course) to read data from TCP sockets, as it can block at any time if you use a blocking socket. Or it can return with a failure if you use non-blocking sockets and the read call reports that it would block (yes, that is reported as an error).
If you use your own buffering, you can use select in the same loop as read or recv, to make sure that the call won't block. Or if you use non-blocking sockets, you can gather data (and append it to your buffer) with single read calls, and add detection for when you have a full message (either by knowing its length or by detecting the message terminator or separator, like a newline).
As for fgets reading "multiple lines", it can cause the underlying reads to fill the buffers with multiple lines, but the fgets function itself will only fill your supplied buffer with a single line.
fgets will never give you multiple lines.
select is a Linux kernel call. It will tell you if the Linux kernel has data that your process hasn't received yet.
fgets is a C library call. To reduce the number of Linux kernel calls (which are typically slower) the C library uses buffering. It will try to read a big chunk of data from the Linux kernel (typically something like 4096 bytes) and then return just the part you asked for. Next time you call it, it will see if it already read the part you asked for, and then it won't need to read it from the kernel. For example, if it's able to read 5 lines at once from the kernel, it will return the first line, and the other 4 will be stored in the C library and returned in the next 4 calls.
When fgets reads 5 lines, returns 1, and stores 4, the Linux kernel will see that all the data has been read. It doesn't know your program is using the C library to read the data. Therefore, select will say there is no data to read, and your program will get stuck waiting for the next line, even though there already is one.
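To make that concrete, here is a small demo (hypothetical, POSIX-only) in the spirit of the book's example that mixes select() with stdio's fgets() on standard input:
#include <cstdio>
#include <sys/select.h>
#include <unistd.h>

int main() {
    char line[128];
    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(STDIN_FILENO, &readfds);

        // select() only sees the kernel's data, not stdio's internal buffer.
        if (select(STDIN_FILENO + 1, &readfds, nullptr, nullptr, nullptr) < 0)
            return 1;

        // fgets() may pull several lines into the stdio buffer at once,
        // but it returns only the first one; the rest stay hidden from select().
        if (std::fgets(line, sizeof line, stdin) == nullptr)
            break;
        std::printf("got: %s", line);
    }
}
Feed it several lines at once from a pipe that then stays open, for example (printf 'a\nb\nc\n'; sleep 60) | ./demo (the name demo is just an example): the underlying read() pulls all three lines into the stdio buffer, fgets() returns only "a", and the next select() blocks even though "b" and "c" are already sitting in that buffer.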
So how do you resolve this? You basically have two options: don't do buffering at all, or do your own buffering so you get to control how it works.
Option 1 means you read 1 byte at a time until you get a \n and then you stop reading. The kernel knows exactly how much data you have read, and it will be able to accurately tell you whether there's more data. However, making a kernel call for every single byte is relatively slow (measure it) and also, the computer on the other end of the connection could cause your program to freeze simply by not sending a \n at all.
I want to point out that option 1 is completely viable if you are just making a prototype. It does work, it's just not very good. If you try to fix the problems with option 1, you will find the only way to fix them is to do option 2.
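A minimal sketch of option 1 (hypothetical helper, error handling kept thin): one read() call per byte, so the kernel always knows exactly how much your program has consumed:
#include <string>
#include <unistd.h>

// Read one byte at a time from 'fd' (assumed to be a connected socket)
// until '\n'; returns false on EOF or error.
bool read_line_bytewise(int fd, std::string &line) {
    line.clear();
    char c;
    for (;;) {
        ssize_t n = read(fd, &c, 1);   // one kernel call per byte: simple but slow
        if (n <= 0)
            return false;              // EOF or error
        if (c == '\n')
            return true;               // complete line (newline not stored)
        line.push_back(c);
    }
}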
Option 2 means doing your own buffering. You keep an array of say 4096 bytes per connection. Whenever select says there is data, you try to fill up the array as much as possible, and you check whether there is a \n in the array. If so, you process that line, remove the line from the array*, and repeat. This means you minimize kernel calls, and you also won't freeze if the other computer doesn't send a \n since the unfinished line will just stay in the array. If all 4096 bytes are used, and there is still no \n, you can either choose to process it as a big line (if this makes sense, e.g. in a chat program) or you can disconnect the connection, since the other computer is breaking the rules. Of course you can choose to use a bigger number than 4096.
* Extra for experts: "removing the line from the array" can be fast if you implement a "circular buffer" data structure.
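And a minimal sketch of option 2 (again with hypothetical names; a plain std::string is used here instead of a circular buffer for brevity). Call it whenever select() reports the socket readable:
#include <string>
#include <sys/socket.h>

// Per-connection buffering: append whatever is available, then hand every
// complete line to process_line(); an unfinished line simply stays in 'buf'.
void on_readable(int fd, std::string &buf,
                 void (*process_line)(const std::string &)) {
    char chunk[4096];
    ssize_t n = recv(fd, chunk, sizeof chunk, 0);
    if (n <= 0)
        return;                               // EOF or error: close elsewhere
    buf.append(chunk, static_cast<size_t>(n));

    std::string::size_type pos;
    while ((pos = buf.find('\n')) != std::string::npos) {
        process_line(buf.substr(0, pos));     // one complete line, without '\n'
        buf.erase(0, pos + 1);                // remove it from the buffer
    }
    // If buf grows very large with no '\n', treat it as a protocol violation
    // (or as one big line, if that makes sense for the application).
}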

Why does writing to a file with a stream increase the file size by 4k each time? [duplicate]

In the case of a buffered stream, it is said in a book that it waits until the buffer is full to write back to the monitor. For example:
cout << "hi";
What do they mean by "the buffer is full"?
cerr << "hi";
It is said in my book that everything sent to cerr is written to the standard error device immediately. What does that mean?
char *ch;
cin>> ch; // I typed "hello world";
In this example ch will be assigned "hello" and "world" will be ignored. Does that mean it is still in the buffer and will affect the results of future statements?
Your book doesn't seem very helpful.
1) The output streams send their bytes to a std::streambuf, which may contain a buffer; the std::filebuf (derived from streambuf) used by std::ofstream will generally be buffered. That means that when you output a character, it isn't necessarily output immediately; it will be written to a buffer, and output to the OS only when the buffer is full, or when you explicitly request it in some way, generally by calling flush() on the stream (directly, or indirectly, by using std::endl). This can vary, however; output to std::cout is synchronized with stdout, and most implementations will more or less follow the rules of stdout for std::cout, changing the buffering strategy if the output is going to an interactive device.
At any rate, if you're unsure, and you want to be sure that the output really does leave your program, just add a call to flush.
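For example (only a sketch; the file name is made up):
#include <iostream>
#include <fstream>

int main() {
    std::ofstream out("log.txt");                // buffered filebuf underneath
    out << "important line";
    out.flush();                                 // explicit flush of this stream

    std::cout << "progress..." << std::flush;    // manipulator form, no newline
    std::cout << "done" << std::endl;            // '\n' plus a flush
}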
2) Your book is wrong here.
One of the buffering strategies is unitbuf; this is a flag in the std::ostream which you can set or reset (with std::ios_base::setf() and std::ios_base::unsetf(); std::ios_base is a base class of std::ostream, so you can call these functions on an std::ostream object). When unitbuf is set, std::ostream adds a call to flush() to the end of every output function, so when you write:
std::cerr << "hello, world";
the stream will be flushed after all of the characters in the string are output, provided unitbuf is set. On start-up, unitbuf is set for std::cerr; by default, it is not set on any other stream. But you are free to set or unset it as you wish. I would recommend against unsetting it on std::cerr, but if std::cout is outputting to an interactive device, it makes a lot of sense to set it there.
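In code, setting and clearing the flag looks roughly like this (there are also the std::unitbuf / std::nounitbuf manipulators, which do the same thing):
#include <iostream>

int main() {
    std::cout.setf(std::ios_base::unitbuf);      // flush after every output operation
    std::cout << "flushed immediately";
    std::cout.unsetf(std::ios_base::unitbuf);    // back to normal buffering

    std::cout << std::unitbuf << "also flushed immediately" << std::nounitbuf;
}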
Note that all that is in question here is the buffer in the streambuf. Typically, the OS also buffers. All flushing the buffer does is transfer the characters to the OS; this fact means that you cannot use ofstream directly when transactional integrity is required.
3) When you input to a string or a character buffer using >>, the std::istream first skips leading white space, and then inputs up to but not including the next white space. In the formal terms of the standard, it "extracts" the characters from the stream, so that they will not be seen again (unless you seek, if the stream supports it). The next input will pick up wherever the previous one left off. Whether the following characters are in a buffer, or still on disk, is really irrelevant.
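A tiny example of that behavior (if the input typed is "  hello   world"):
#include <iostream>
#include <string>

int main() {
    std::string a, b;
    std::cin >> a;                        // skips leading whitespace, extracts "hello"
    std::cin >> b;                        // picks up where the last read stopped: "world"
    std::cout << a << '|' << b << '\n';   // prints "hello|world"
}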
Note that the buffering of input is somewhat complex, in that it occurs at several different levels, and at the OS level, it takes different forms depending on the device. Typically, the OS will buffer a file by sectors, often reading several sectors in advance. The OS will always return as many characters as were demanded, unless it encounters end of file. Most OSs will buffer a keyboard by line: not returning from a read request until a complete line has been entered, and never returning characters beyond the end of the current line in a read request.
In the same manner as std::ostream uses a streambuf for output, std::istream uses one to get each individual character. In the case of std::cin, it will normally be a filebuf; when the istream requests a character, the filebuf will return one from its buffer if it has one; if it doesn't, it will attempt to refill the buffer, requesting e.g. 512 (or whatever its buffer size is) characters from the OS, which will respond according to its buffering policy for the device, as described above.
At any rate, if std::cin is connected to the keyboard, and you've typed "hello world", all of the characters you've typed will be read by the stream eventually. (But if you're using >>, there'll be a lot of whitespace that you won't see.)
Streams in C++ are buffered to increase efficiency; file and console I/O is very slow in comparison to memory operations.
To combat this, C++ streams have a buffer (a bank of memory) that holds everything to be written to the file or output device; when it is full, it is flushed to the file. The inverse is true for input: more data is fetched when the buffer is depleted.
This is very important for streams, because the following
std::cout << 1 << "hello" << ' ' << "world\n";
would be four writes to a file, which is inefficient.
However, in the case of std::cout, std::cin and std::cerr, these streams are by default synchronized with the C stdio streams (which effectively disables this buffering) so that they can be used in conjunction with std::printf, std::puts, etc.
To turn the synchronization off and get the buffering back (which I recommend doing):
std::ios_base::sync_with_stdio(false);
But don't use C-style console output while this is set to false, or bad things may happen.
You can check out the differences yourself with a small app.
#include <iostream>

int main() {
    std::cout << "Start";
    //std::cout << "Start" << std::endl; // line buffered, endl will flush.
    double j = 2;
    for (int i = 0; i < 100000000; i++) {
        j = i / (i + 1);
    }
    std::cout << j;
    return 0;
}
Try out the difference between the two "Start" statements, then change to cerr. The difference you notice is due to buffering.
The for-statement takes about 2 seconds on my rig, you might need to tweak the i < condition on yours.
1) What do they mean by "the buffer is full"?
With buffered output there's a region of memory, called a buffer, where the stuff you write out is stored before it is actually written to the output. When you say cout << "hi" the string is probably only copied into that buffer and not written out. cout usually waits until that memory has been filled up before it actually starts writing things out.
The reason for this is because usually the process of starting to actually write data is slow, and so if you do that for every character you get terrible performance. A buffer is used so that the program only has to do that infrequently and you get much better performance.
2) It is said in my book that everything sent to cerr is written to the standard error device immediately. Does this mean it sends 'h' and then 'i'...?
It just means that no buffer is used. cerr might still send 'h' and 'i' at the same time since it already has both of them.
3) In this example ch will be assigned "hello" and "world" will be ignored. Does that mean it is still in the buffer and will affect the results of future statements?
This doesn't really have anything to do with buffering. The operator >> for char* is defined to read until it sees whitespace, so it stops at the space between "hello" and "world". But yes, the next time you read you will get "world".
(Although if that code isn't just a paraphrase of the actual code, then it has undefined behavior because you're reading the text into an undefined memory location. Instead you should do:
std::string s;
cin >> s;
)
Each call to write to the terminal is slow, so to avoid doing slow things often, the data is stored in memory until either a certain amount of data has accumulated or the buffer is flushed manually with fflush or std::endl. The result of this is that sometimes text might not be written to the terminal at the moment you expect it to be.
Since the timing of error messages is more critical than that of normal output, the performance hit is ignored and the data is not buffered. However, since a string is passed as a single piece of data, it is written in one call (inside a loop somewhere).
"world" would still be in the buffer, but it's quite easy to prove this yourself by trying it in a three-line program. However, your example will fail since you are attempting to write into unallocated memory. You should be taking input into a std::string instead.
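A small program along those lines (only a sketch) that shows "world" is still waiting in the buffer:
#include <iostream>
#include <string>

int main() {
    std::string first, second;
    std::cin >> first;          // type "hello world" followed by Enter
    std::cin >> second;         // returns immediately: "world" was still buffered
    std::cout << first << '\n' << second << '\n';
}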

C++ istream::peek - shouldn't it be nonblocking?

It seems well accepted that the istream::peek operation is blocking.
The standard, though arguably a bit ambiguous, leans towards nonblocking behavior. peek calls sgetc in turn, whose behavior is:
"The character at the current position of the controlled input sequence, as a value of type int.
If there are no more characters to read from the controlled input sequence, the function returns the end-of-file value (EOF)."
It doesn't say "If there are no more characters.......wait until there are"
Am I missing something here? Or are the peek implementations we use just kinda wrong?
The controlled input sequence is the file (or whatever) from which you're reading. So if you're at end of file, it returns EOF. Otherwise it returns the next character from the file.
I see nothing here that's ambiguous at all--if it needs a character that hasn't been read from the file, then it needs to read it (and wait till it's read, and return it).
If you're reading from something like a socket, then it's going to wait until data arrives (or the network stack detects EOF, such as the peer disconnecting).
The description from cppreference.com might be clearer than the one in your question:
Ensures that at least one character is available in the input area by [...] reading more data in from the input sequence (if applicable)."
"if applicable" does apply in this case; and "reading data from the input sequence" entails waiting for more data if there is none and the stream is not in an EOF or other error state.
When I get confused about console input I remind myself that console input can be redirected to come from a file, so the behavior of the keyboard more or less mimics the behavior of a file. When you try to read a character from a file, you can get one of two results: you get a character, or you get EOF because you've reached the end of the file -- there are no more characters to be read. Same thing for keyboard input: either you get a character, or you get EOF because you've reached the end of the file. With a file, there is no notion of waiting for more characters: either a file has unread characters or it doesn't. Same thing for the keyboard. So if you haven't reached EOF on the keyboard, reading a character returns the next character. You reach EOF on the keyboard by typing whatever character your system recognizes as EOF; on Unix systems that's ctrl-D, on Windows that's ctrl-Z. If you haven't reached EOF, there are more characters to be read.
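A small sketch showing the blocking behavior on an interactive standard input: nothing happens until you either type a line or signal EOF (ctrl-D on Unix, ctrl-Z on Windows):
#include <iostream>

int main() {
    std::cout << "waiting for input...\n";
    int c = std::cin.peek();    // blocks until a character or EOF is available
    if (c == std::istream::traits_type::eof())
        std::cout << "got EOF\n";
    else
        std::cout << "next character is '" << static_cast<char>(c) << "'\n";
}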

c++ flushing the buffer

I know there are many buffer questions on here but I can't seem find a clear answer on this.
std::cout << "write to screen" << std::endl;
I know this code will write to the screen and flush the buffer because of the "endl", but if I wrote this:
std::cout << "write to screen";
Wouldn't the buffer be flushed regardless since the text has been outputted to the screen?
Wouldn't the buffer be flushed regardless since the text has been outputted to the screen?
Assuming that you have seen the text outputted to the screen, then yes, the buffer has been flushed.
I believe the confusion is regarding this line:
std::cout << "write to screen";
The absence of std::endl doesn't mean "don't flush the buffer". It simply means "I'm not saying when to flush the buffer".
There are multiple ways to ensure your std::ostream is flushed:
Manually with std::endl, std::flush, or a direct call to ostream::flush().
Depending on a later used input stream being bound to your ostream: std::basic_ios::tie().
Depending on the tie to C streams: std::ios_base::sync_with_stdio
This means that anything which would flush the corresponding C stream will also flush the C++ stream, like a call to fflush(), or the (maybe automatically) selected buffering strategy, such as line-buffering.
From C11 draft:
7.21.3 Files
3 When a stream is unbuffered, characters are intended to appear from the source or at the
destination as soon as possible. Otherwise characters may be accumulated and
transmitted to or from the host environment as a block. When a stream is fully buffered,
characters are intended to be transmitted to or from the host environment as a block when
a buffer is filled. When a stream is line buffered, characters are intended to be
transmitted to or from the host environment as a block when a new-line character is
encountered. Furthermore, characters are intended to be transmitted as a block to the host
environment when a buffer is filled, when input is requested on an unbuffered stream, or
when input is requested on a line buffered stream that requires the transmission of
characters from the host environment. Support for these characteristics is
implementation-defined, and may be affected via the setbuf and setvbuf functions.
7 At program startup, three text streams are predefined and need not be opened explicitly
— standard input (for reading conventional input), standard output (for writing
conventional output), and standard error (for writing diagnostic output). As initially
opened, the standard error stream is not fully buffered; the standard input and standard
output streams are fully buffered if and only if the stream can be determined not to refer
to an interactive device.
And finally, simply waiting for the internal buffer to overflow.
Now, as a general guideline: don't manually flush your streams; doing so can significantly degrade performance. Unless, of course, it's necessary for correctness.
std::cout << "Hello" << std::endl;
will write to the screen before executing the next line of code, while
std::cout << "Hello\n";
will print the same, but not necessarily right away: the output is only guaranteed to appear by the time your program exits normally or when you next use std::cin (or another input stream you tie to std::cout by hand). That means that if your program is terminated abruptly or hangs in an infinite loop, you might not see the output at all.
"Wouldn't the buffer be flushed regardless since the text has been outputted to the screen?"
No! std::endl implies flushing. Without it, the underlying buffer won't be flushed (written to the screen) until a certain watermark (the buffer size) is hit.
If you want to have it flushed, call cout.flush() explicitly:
std::cout << "write to screen";
std::cout.flush();
The real key to the solution is, what the underlying std::basic_streambuf interface actually implements.
There could be various implementations:
Calling flush() every time the certain watermark of the underlying buffer is hit
Calling flush() every time (not very efficient)
Calling flush() as soon as a '\n' has been printed
Calling flush() as guaranteed with std::endl
The internal buffer management shouldn't be your business of concern, unless you're trying to provide your own std::basic_streambuf implementation.
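For illustration, a toy std::streambuf implementation (a sketch, not anyone's production code) that keeps no buffer at all and forwards every character straight to stdout; a buffered implementation would instead fill an internal array in overflow() and write it out in sync():
#include <iostream>
#include <streambuf>
#include <cstdio>

// Unbuffered streambuf: every character is handed over immediately.
class immediate_buf : public std::streambuf {
protected:
    int_type overflow(int_type ch) override {
        if (ch != traits_type::eof())
            std::fputc(ch, stdout);        // forward each character right away
        return traits_type::not_eof(ch);   // signal success
    }
    int sync() override {
        std::fflush(stdout);               // honor explicit flush requests
        return 0;
    }
};

int main() {
    immediate_buf buf;
    std::ostream out(&buf);                // an ostream driven by our streambuf
    out << "every character goes straight out\n" << std::flush;
}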

How does the buffer know how many characters to transfer from the external file during a flush operation?

Say I have an input operation:
file >> x;
If the internal buffer of file is empty, underflow() will be called to import characters from the external device into the internal buffer of file. It is implementation-defined whether the buffer will be partially or completely filled after this flush operation. Taking that into account, is it possible, if x is a string and I am expecting an input value of a certain length, that the buffer is within its rights to transfer fewer characters than that? Can this happen?
There is no real constraint on how many characters underflow() makes available. The only real constraint is that a stream which hasn't reached EOF needs to make at least one character available. With respect specifically to std::filebuf (or std::basic_filebuf<...>), the stream may be unbuffered (if setbuf(0, 0) was called), in which case it would, indeed, make individual characters available. Otherwise, the stream will try to fill its internal buffer and rely on the operating system to have the underlying read operation return a suitable number of bytes if only a few are available yet.
I'm not sure I quite understand your question: the operation file >> x will return once x is completely read, which can happen when the stream indicated by file has reached its end or when a whitespace character is found (and if by "string" you mean char*, a non-zero value stored in file.width() is also taken into account). With respect to the underlying stream buffer, clearly x may require multiple reads from the underlying representation, i.e., it is unpredictable how many calls to underflow() are made. Given that the file's internal buffer probably matches the disk's block size, I would expect at most one call to underflow() to be made for "normal" strings. However, if the file being read is huge and doesn't contain any spaces, many calls to underflow() may be made. Given that the stream needs to find whitespace, it has no way to predict how many characters are needed in the first place.