Formatted and unformatted input and output and streams - c++

I have been reading a few articles on some sites about formatted and unformatted I/O, but I am now more confused than before.
I know this is a very basic question, but could anyone give a link [ to some site or previously answered question on Stackoverflow ] that explains the idea of streams in C and C++?
I would also like to know about formatted and unformatted I/O.

The standard doesn't define what these terms mean; it only says which of the functions it defines are formatted IO and which are not, and it places some requirements on the implementation of these functions.
Formatted IO is simply the IO done using the << and >> operators. They are meant to be used with a text representation of the data, and they involve some parsing, analysis and conversion of the data being read or written. Formatted input skips whitespace (by default):
Each formatted input function begins execution by constructing an object of class sentry with the noskipws (second) argument false.
Unformatted IO reads and writes the data just as a sequence of 'characters' (possibly applying the codecvt facet of the imbued locale). It's meant to read and write binary data, or to function as a lower-level layer used by the formatted IO implementation. Unformatted input doesn't skip whitespace:
Each unformatted input function begins execution by constructing an object of class sentry with the default argument noskipws (second) argument true.
It also allows you to retrieve the number of characters read by the last input operation using gcount():
Returns: The number of characters extracted by the last unformatted input member function called for the object.
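To make the difference concrete, here is a minimal sketch that runs against an in-memory std::istringstream. The helper names are made up for illustration and are not from the standard; the point is only that operator>> skips leading whitespace, get() does not, and gcount() reports what the last read() extracted.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Formatted input: operator>> skips leading whitespace before extracting
// (the skipws flag is set by default).
char first_char_formatted(const std::string& text) {
    std::istringstream in(text);
    char c = '\0';
    in >> c;
    return c;
}

// Unformatted input: get() returns the very next character, whitespace included.
char first_char_unformatted(const std::string& text) {
    std::istringstream in(text);
    char c = '\0';
    in.get(c);
    return c;
}

// Unformatted read() followed by gcount(): reports how many characters
// the last unformatted operation actually extracted.
std::streamsize chars_read(const std::string& text, std::streamsize want) {
    std::istringstream in(text);
    std::string buf(static_cast<std::size_t>(want), '\0');
    in.read(&buf[0], want);
    return in.gcount();
}
```

With the input "  x", first_char_formatted yields 'x' while first_char_unformatted yields ' '. chars_read("abc", 10) reports 3, because only three characters were available.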

Formatted IO means that your output is determined by a "format string", that means you provide a string with certain placeholders, and you additionally give arguments that should be used to fill these placeholders:
const char *daughter_name = "Lisa";
int daughter_age = 5;
printf("My daughter %s is %d years old\n", daughter_name, daughter_age);
The placeholders in the example are %s, indicating that this shall be substituted using a string, and %d, indicating that this is to be replaced by a signed integer number. There are a lot more options that give you control over how the final string will present itself. It's a convenience for you as the programmer, because it relieves you from the burden of converting the different data types into a string and it additionally relieves you from string appending operations via strcat or anything similar.
Unformatted IO on the other hand means you simply write character or byte sequences to a stream, not using any format string while you are doing so.
Which brings us to your question about streams. The general concept behind "streaming" is that you don't have to load a file or whatever input as a whole all the time. For small data loading everything does work, but imagine you need to process terabytes of data: there is no way this will fit into a single byte array without your machine running out of memory. That's why streaming allows you to process data in smaller chunks, one at a time, one after the other, so that at any given time you only have to deal with a fixed-size amount of data. You read the data into a helper variable over and over again and process it, until your underlying stream tells you that you are done and there is no more data left.
The same works on the output side: you write your output step by step, chunk by chunk, rather than writing the whole thing at once.
This concept brings other nice features, too. Because you can nest streams within streams within streams, you can build a whole chain of transformations, where each stream may modify the data until you finally receive the end result, not knowing about the single transformations, because you treat your stream as if there were just one.
This can be very useful. For example, C or C++ streams buffer the data that they read natively from e.g. a file to avoid unnecessary calls and to read the data in optimized chunks, so that the overall performance will be much better than if you read directly from the file system.
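The chunk-by-chunk reading described above might look like this in C++. It is only a sketch: the 4096-byte chunk size and the function name are arbitrary choices for illustration, and the "processing" is just a sum of the byte values.

```cpp
#include <array>
#include <cstddef>
#include <istream>
#include <sstream>   // only needed for the in-memory usage shown below
#include <string>

// Sums all bytes of a stream while holding at most kChunkSize bytes in memory.
unsigned long long checksum_in_chunks(std::istream& in) {
    constexpr std::size_t kChunkSize = 4096;   // arbitrary illustrative size
    std::array<char, kChunkSize> chunk;
    unsigned long long sum = 0;
    while (in) {
        // read() may deliver fewer bytes than requested on the last chunk;
        // gcount() tells us how many actually arrived.
        in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        for (std::size_t i = 0; i < got; ++i) {
            sum += static_cast<unsigned char>(chunk[i]);
        }
    }
    return sum;
}
```

The same function works unchanged on a std::ifstream or on std::istringstream in("abc"), because it only depends on the std::istream interface.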

Unformatted Input/Output is the most basic form of input/output. Unformatted input/output transfers the internal binary representation of the data directly between memory and the file. Formatted output converts the internal binary representation of the data to ASCII characters which are written to the output file. Formatted input reads characters from the input file and converts them to internal form. Formatted

Related

Dynamic vs Static Nature of Input and Output String Streams

Rogue Wave Standard C++ Library Iostreams and Locale User’s Guide mentions the following:
As with file streams, there are three class templates that implement
string streams: basic_istringstream <charT,traits,Allocator>,
basic_ostringstream <charT,traits,Allocator>, and basic_stringstream
<charT,traits,Allocator> … For convenience, there are the regular
typedefs istringstream, ostringstream, and stringstream…
and
Output string streams are dynamic. The internal buffer is allocated
once an output string stream is constructed. The buffer is
automatically extended during insertion each time the internal buffer
is full.
Input string streams are always static. You can extract as many items
as are available in the string you provided the string stream.
Now if output string streams are dynamic and input string streams are always static, how do we reconcile the two and realize the (input/output) stringstream?

Using Getline on a Binary File

I have read that getline behaves as an unformatted input function. Which I believe should allow it to be used on a binary file. Let's say for example that I've done this:
ofstream output("foo.txt", ios_base::binary);
const auto foo = "lorem ipsum";
output.write(foo, strlen(foo) + 1);
output.close();
ifstream input("foo.txt", ios_base::binary);
string bar;
getline(input, bar, '\0');
Is that breaking any rules? It seems to work fine, I think I've just traditionally seen arrays handled by writing the size and then writing the array.
No, it's not breaking any rules that I can see.
Yes, it's more common to write an array with a prefixed size, but using a delimiter to mark the end can work perfectly well also. The big difference is that (like with a text file) you have to read through data to find the next item. With a prefixed size, you can look at the size, and skip directly to the next item if you don't need the current one. Of course, you also need to ensure that if you're using something to mark the end of a field, that it can never occur inside the field (or come up with some way of detecting when it's inside a field, so you can read the rest of the field when it does).
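As a rough sketch of the prefixed-size alternative (the function names and the 4-byte host-order length are assumptions made for this example, not a standard format), each field is written as a length followed by its bytes, which lets a reader skip a field without scanning it:

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Writes one field as a 4-byte length followed by the raw bytes.
// The length is stored in host byte order, so the data is not portable
// between machines with different endianness.
void write_field(std::ostream& out, const std::string& field) {
    const std::uint32_t size = static_cast<std::uint32_t>(field.size());
    out.write(reinterpret_cast<const char*>(&size), sizeof size);
    out.write(field.data(), static_cast<std::streamsize>(field.size()));
}

// Reads the next field; returns false on a truncated or exhausted stream.
bool read_field(std::istream& in, std::string& field) {
    std::uint32_t size = 0;
    if (!in.read(reinterpret_cast<char*>(&size), sizeof size)) return false;
    field.resize(size);
    return size == 0 ||
           static_cast<bool>(in.read(&field[0], static_cast<std::streamsize>(size)));
}

// Skips the next field by seeking past it instead of reading its contents.
bool skip_field(std::istream& in) {
    std::uint32_t size = 0;
    if (!in.read(reinterpret_cast<char*>(&size), sizeof size)) return false;
    return static_cast<bool>(in.seekg(size, std::ios::cur));
}
```

Unlike the '\0'-delimited version, a field written this way may itself contain '\0' bytes.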
Depending on the situation, that can mean (for example) using Unicode text. This gives you a lot of options for values that can't occur inside the text (because they aren't legal Unicode). That, on the other hand, would also mean that your "binary" file is really a text file, and has to follow some basic text-file rules to make sense.
Which is preferable depends on how likely it is that you'll want to read random pieces of the file rather than reading through it from beginning to end, as well as the difficulty (if any) of finding a unique delimiter and, if you don't have one, the complexity of making the delimiter recognizable from data inside a field. If the data is only meaningful if written in order, then having to read it in order doesn't really pose a problem. If you can read individual pieces meaningfully, then being able to do so is much more likely to be useful.
In the end, it comes down to a question of what you want out of your file being "binary". In the typical case, all "binary" really means is that end-of-line markers that might otherwise be translated from a new-line character to (for example) a carriage-return/line-feed pair won't be. Depending on the OS you're using, it might not even mean that much, though. For example, on Linux, there's normally no difference between binary and text mode at all.
Well, no rules are broken and you'll get away with that just fine, except that you may miss some of the precision of reading binary data from a stream object.
With binary input, you usually want to know how many characters were read successfully, which you can obtain afterwards with gcount(). Using std::getline will not be reflected in gcount().
Of course, you can simply get that information from the size of the string you passed into std::getline. But the stream will no longer record the number of bytes you consumed in the last unformatted operation.
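A small sketch of that gcount() difference, using an in-memory stream in place of foo.txt. The two helper names and the 64-byte buffer are illustrative assumptions only.

```cpp
#include <sstream>
#include <string>

// Reads one '\0'-terminated string with the non-member std::getline.
// This overload does not affect gcount(), so on a fresh stream the
// value returned here is still the initial 0.
std::streamsize gcount_after_std_getline(const std::string& bytes, std::string& out) {
    std::istringstream in(bytes);
    std::getline(in, out, '\0');
    return in.gcount();
}

// Reads the same data with the member function istream::getline, which
// does update gcount() (the extracted delimiter is counted as well).
std::streamsize gcount_after_member_getline(const std::string& bytes, std::string& out) {
    std::istringstream in(bytes);
    char buf[64] = {};                  // assumed large enough for this sketch
    in.getline(buf, sizeof buf, '\0');
    out = buf;
    return in.gcount();
}
```

For the 12 bytes "lorem ipsum\0", the first helper returns 0 and the second returns 12, while both store "lorem ipsum" in out.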

Using formatted IO operations in binary mode?

Is there any problem with using the formatted IO operations in binary mode, especially if I'm only dealing with text files?
(1):
For binary files, reading and writing data with the extraction and insertion operators (<< and >>) and functions like getline is not efficient, since we do not need to format any data and data is likely not formatted in lines.
(2):
Normally, for binary file i/o you do not use the conventional text-oriented << and >> operators! It can be done, but that is an advanced topic.
The "advanced topic" nature is what made me question mixing these two. There is a mingw bug with the seek and tell functions which can be resolved by opening the file in binary mode. Is there any issue with using << and >> in binary mode compared to text mode, or must I always resort to unformatted IO if opening in binary? As far as I can tell, for text files I just have to account for carriage returns (\r), which aren't implicitly removed/added for me, but is that all there is to account for?
Is there any problem with using the formatted IO operations in binary
mode, especially if I'm only dealing with text files?
I just have to account for carriage-returns (\r) which aren't
implicitly removed/added for me
If you want or need \r in your data, you are probably dealing with text / strings. For that you do not need to use binary files. You could, however, open text files in binary mode to do a quick scan for newlines (a line count, for example) without having to call a less efficient line-by-line getline().
Binary files are used to store binary values directly (mostly numbers or data structures), without the need to convert them to text and back to binary again.
Another advantage of binary files is that you don't have to do any parsing. You can access all your data directly, wherever it may be in the file (assuming the data is stored in a well structured manner).
For example: if you need to store records, each containing 5 32-bit numbers, you can write those directly to the binary file in their native binary format (no time wasted with converting and parsing). To later read record nr 1000, for example, you can seek directly to position 5 x 4 x (1000-1) and read your 20-byte record from there. With text files, on the other hand, you would need to scan every byte from the beginning of the file until you have counted 1000 lines (which could also be of different lengths).
You would use read() and write() (or fread() / fwrite()) directly (although << and >> could be used too for serialization of objects with variable lengths).
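The record example could be sketched as follows. The Record alias and both function names are hypothetical, and the values are stored in the machine's native byte order, so this is not a portable file format.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>   // only needed for the in-memory usage shown below

using Record = std::array<std::int32_t, 5>;
constexpr std::size_t kRecordBytes = 5 * sizeof(std::int32_t);   // 5 x 4 = 20

// Appends one record in the machine's native binary layout.
void write_record(std::ostream& out, const Record& rec) {
    out.write(reinterpret_cast<const char*>(rec.data()), kRecordBytes);
}

// Reads record number `nr`, counting from 1 as in the text above, by
// seeking straight to byte offset 20 x (nr - 1).
bool read_record(std::istream& in, std::size_t nr, Record& rec) {
    const std::streamoff pos = static_cast<std::streamoff>(kRecordBytes * (nr - 1));
    if (!in.seekg(pos, std::ios::beg)) return false;
    return static_cast<bool>(in.read(reinterpret_cast<char*>(rec.data()), kRecordBytes));
}
```

With a real file you would pass a std::fstream opened with std::ios::binary; a std::stringstream behaves the same way for experimenting.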
Binary files should also have a header with some basic information. See my answer here for more information on that.

Is there any way to read characters that satisfy certain conditions only from stdin in C++?

I am trying to read the characters that satisfy a certain condition from stdin with the iostream library, while leaving those that do not satisfy the condition in stdin so that the skipped characters can be read later. Is it possible?
For example, I want only the characters in a-c, and the input stream is abdddcxa.
First read in all characters in a-c (abca); after this input has finished, start reading the remaining characters (dddx). (These two reads can't happen simultaneously. They might be in two different functions.)
Wouldn't it be simpler to read everything, then split the input into the two parts you need and finally send each part to the function that needs to process it?
Keeping the data in the stdin buffer is akin to using globals, it makes your program harder to understand and leaves the risk of other code (or the user) changing what is in the buffer while you process it.
On the other hand, dividing your program into "the part that reads the data", "the part that parses the data and divides the workload" and the "part that does the work" makes for a better structured program which is easy to understand and test.
You can probably use regex to do the actual split.
What you're asking for is the putback method (for more details see: http://www.cplusplus.com/reference/istream/istream/putback/). You would have to read everything, filter the part that you don't want to keep out, and put it back into the stream. So for instance:
cin >> myString;
// Fill putbackBuf[] with the characters to put back, in reverse order, terminated by '\0'
pPutbackBuf = &putbackBuf[0];
do {
    cin.putback(*(pPutbackBuf++));
} while (*pPutbackBuf);
Another solution (which is not exactly what you're asking for) would be to split the input into two strings and then feed the "non-inputted" string into a stringstream and pass that to whatever function needs to do something with the rest of the characters.
What you want to do is not possible in general; ungetc and putback exist, but they're not guaranteed to work for more than one character. They don't actually change stdin; they just push back on an input buffer.
What you could do instead is to explicitly keep a buffer of your own, by reading the input into a string and processing that string. Streams don't let you safely rewind in many cases, though.
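A minimal sketch of the "keep your own buffer" idea, using the a-c example from the question. The function name split_by_range is made up here.

```cpp
#include <string>
#include <utility>

// Splits `input` into the characters inside [lo, hi] and everything else,
// keeping the original order within each part.
std::pair<std::string, std::string> split_by_range(const std::string& input,
                                                   char lo, char hi) {
    std::string wanted, rest;
    for (char c : input) {
        (c >= lo && c <= hi ? wanted : rest) += c;
    }
    return {wanted, rest};
}
```

For "abdddcxa" with the range 'a' to 'c' this gives "abca" and "dddx". If the second function expects a stream, the remaining part can be wrapped in a std::istringstream and handed over.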
No, random access is not possible for streams (except for fstream and stringstream). You will have to read in the whole line/input and process the resulting string (which you could, however, do using iostreams/std::stringstream if you think it is the best tool for that -- I don't think that but iostreams gurus may differ).

Read binary data from std::cin

What is the easiest way to read binary (unformatted) data from std::cin into either a string or a stringstream?
std::cin is not opened in binary mode (std::ios_base::binary). If you must use cin, then you need to reopen it, which isn't part of the standard.
Some ideas here: https://comp.unix.programmer.narkive.com/jeVj1j3I/how-can-i-reopen-std-cin-and-std-cout-in-binary-mode
Once it's binary, you can use cin.read() to read bytes. If you know that in your system, there is no difference between text and binary (and you don't need to be portable), then you can just use read without worrying.
For Windows, you can use the _setmode function in conjunction with cin.read(), as already mentioned.
_setmode(_fileno(stdin), _O_BINARY);
cin.read(...);
See solution source here: http://talmai-oliveira.blogspot.com/2011/06/reading-binary-files-from-cin.html
cin.read would store a fixed number of bytes, without any logic searching for delimiters of the type that @Jason mentioned.
However, there may still be translations active on the stream, such as CRLF -> NL, so it still isn't ideal for binary data.
On a Unix/POSIX system, you can use the cin.get() method to read byte-by-byte and save the data into a container like a std::vector<unsigned int>, or you can use cin.read() in order to read a fixed amount of bytes into a buffer. You could also use cin.peek() to check for any end-of-data-stream indicators.
Remember to avoid the operator>> overloads for this type of operation: using operator>> will cause breaks to occur whenever a delimiter character is observed, and it will also remove the delimiting character from the stream itself. This would include any binary values that are equivalent to a space, tab, etc. Thus the binary data you end up storing from std::cin using that method will not match the input binary stream byte-for-byte.
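Here is a hedged sketch of the contrast (function names are illustrative). The first helper copies the stream buffer into a string without interpreting the characters; the second shows how operator>> silently drops whitespace bytes. Note that neither undoes a text-mode translation such as CRLF -> NL if the underlying stream performs one.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Copies every remaining byte of `in` into a string, whitespace included.
std::string read_all_bytes(std::istream& in) {
    std::ostringstream buffer;
    buffer << in.rdbuf();   // bulk copy of the stream buffer, no parsing
    return buffer.str();
}

// For contrast: operator>> treats whitespace as a delimiter and discards it.
std::string read_all_tokens(std::istream& in) {
    std::string all, token;
    while (in >> token) all += token;
    return all;
}
```

Calling read_all_bytes(std::cin) returns a std::string holding the bytes as delivered by the stream; given the input "a b\tc\n", read_all_tokens returns only "abc".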
All predefined iostream objects are obligated to be bound to corresponding C streams:
The object cin controls input from a stream buffer associated with the object stdin, declared in <cstdio>.
http://eel.is/c++draft/narrow.stream.objects
and thus the method of obtaining binary data is same as for C:
Basically, the best you can really do is this:
freopen(NULL, "rb", stdin);
This will reopen stdin to be the same input stream, but in binary
mode. In the normal mode, reading from stdin on Windows will convert
\r\n (Windows newline) to the single character ASCII 10. Using the
"rb" mode disables this conversion so that you can properly read in
binary data.
https://stackoverflow.com/a/1599093/6049796
cplusplus.com:
Unformatted input
Most of the other member functions of the istream
class are used to perform unformatted input, i.e. no interpretation is
made on the characters got from the input. These member functions can
get a determined number of characters from the input character
sequence (get, getline, peek, read, readsome)...
As Lou Franco pointed out, std::cin isn't opened with std::ios_base::binary, but one of those functions might get you close to the behavior you're looking for.
With windows/mingw/msys/bash, if you need to pipe different commands with binary streams in between, you need to manipulate std::cin and std::cout as binary streams.
The _setmode solution from Mikhail works perfectly.
Using MinGW, the needed headers are the following:
#include <io.h>
#include <fcntl.h>
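Putting the pieces together, a hedged sketch of a helper that applies the _setmode call only on Windows builds. The function name set_stdin_binary is made up for this example; _setmode, _fileno and _O_BINARY are the Microsoft CRT names already used in this thread, and whether your particular toolchain exposes them under these headers is an assumption you should check.

```cpp
#include <cstdio>

#ifdef _WIN32
#include <fcntl.h>   // _O_BINARY
#include <io.h>      // _setmode
#endif

// Switches stdin to binary mode where that distinction exists.
// Returns false if the mode change fails.
bool set_stdin_binary() {
#ifdef _WIN32
    return _setmode(_fileno(stdin), _O_BINARY) != -1;
#else
    return true;   // no text/binary distinction to switch off here
#endif
}
```

Call it once at the start of main, before the first read from std::cin, and then use cin.read() as described above.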