Inserters and Extractors reading/writing binary data vs text - c++

I've been trying to read up on iostreams and understand them better. Occasionally I find it stressed that inserters (<<) and extractors (>>) are meant to be used in textual serialization. It's a few places, but this article is a good example:
http://spec.winprog.org/streams/
Outside of the <iostream> universe, there are cases where the << and >> are used in a stream-like way yet do not obey any textual convention. For instance, they write binary encoded data when used by Qt's QDataStream:
http://doc.qt.nokia.com/latest/qdatastream.html#details
At the language level, the << and >> operators belong to your project to overload (hence what QDataStream does is clearly acceptable). My question would be whether it is considered a bad practice for those using <iostream> to use the << and >> operators to implement binary encodings and decodings. Is there (for instance) any expectation that if written to a file on disk that the file should be viewable and editable with a text editor?
Should one always be using other method names and base them on read() and write()? Or should textual encodings be considered merely a default behavior that classes integrating with the standard library iostream can elect to ignore?
UPDATE A key terminology issue on this seems to be the distinction of I/O that is "formatted" vs "unformatted" (as opposed to the terms "textual" vs "binary"). I found this question:
writing binary data (std::string) to an std::ofstream?
It has a comment from #TomalakGeret'kal saying "I'd not want to use << for binary data anyway, as my brain reads it as "formatted output" which is not what you're doing. Again, it's perfectly valid, but I just would not confuse my brain like that."
The accepted answer to the question says it's fine as long as you use ios::binary. That seems to bolster the "there's nothing wrong with it" side of the debate...but I still don't see any authoritative source on the issue.

Actually the operators << and >> are bit shift operators; using them for I/O is strictly speaking already a misuse. However that misuse is about as old as operator overloading itself, and I/O today is the most common usage of them, therefore they are widely regarded as I/O insertion/extraction operators. I'm pretty sure if there weren't the precedent of iostreams, nobody would use those operators for I/O (especially with C++11 which has variadic templates, solving the main problem which using those operators solved for iostreams, in a much cleaner way). On the other hand, from the language point of view, overloaded operator<< and operator>> can mean whatever you want them to mean.
So the question boils down to what would be an acceptable use of those operators. For this, I think one has to distinguish two cases: First, new overloads working on iostream classes, and second, new overloads working on other classes, possibly designed to work like iostreams.
Let's consider first new operators on iostream classes. Let me start with the observation that the iostream classes are all about formatting (and the reverse process, which could be called "deformatting"; "lexing" IMHO wouldn't be quite the right term here because the extractors don't determine the type, but only try to interpret the data according to the type given). The classes responsible for the actual I/O of raw data are the streambufs. However note that a proper binary file is not a file where you just dump internal raw data. Just like a text file (actually even more so), a binary file should have a well-specified encoding of the data it contains. Especially if the files are expected to be read on different systems. Therefore the concept of formatted output makes perfect sense also for binary files; just the formatting is different (e.g. writing a pre-determined number of bytes with the most significant one first for an integer value).
The iostreams themselves are classes which are intended to work on text files, that is, on files whose content is interpreted as textual representation of data. A lot of built-in behaviour is optimized for that, and may cause problems if used on binary files. An obvious example is that by default spaces are skipped before any input is attempted. For a binary file, this would be clearly the wrong behaviour. Also the use of locales doesn't make sense for binary files (although one might argue that there could be a "binary locale", but I don't think locales as defined for iostreams provide a suitable interface for that). Therefore I'd say writing binary operator<< or operator>> for iostream classes would be wrong.
The other case is where you define a separate class for binary input/output (possibly reusing the streambuf layer for doing the actual I/O). Since we are now speaking about different classes, the argumentation above doesn't apply any more. So the question now is: Should operator<< and operator>> on I/O be regarded as "text insertion/extraction operators" or more generally as "formatted data insertion/extraction operators"? The standard classes only use them for text, but then, there are no standard classes for binary I/O insertion/extraction at all, so the standard usage cannot distinguish between the two.
I personally would say that binary insertion/extraction is close enough to textual insertion/extraction that this usage is justified. Note that you also could make meaningful binary I/O manipulators, e.g. bigendian, littleendian and intwidth(n) to determine the format in which integers are to be output.
Beyond that there's also the use of those operators for things which are not really I/O (and where you wouldn't even think of using the streambuf layer), like reading from or inserting into a container. In my opinion, that already constitutes misuse of the operators, because there the data isn't translated into or out of a different format. It is just stored in a container.

The abstraction of the iostreams in the standard is that of a textually
formatted stream of data; there is no support for any non-text format.
That is the abstraction of iostreams. There's nothing wrong about
defining a different stream class whose abstraction is a binary format,
but doing so in an iostream will likely break existing code, and not
work.

The overloaded operators >> and << perform formatted IO. The rest IO functions (put, get, read, write, etc) perform unformatted IO. Unformatted IO means that the IO library only accepts a buffer, a sequence of unsigned character for its input. This buffer might contain textual message or a binary content. It’s the application’s responsibility to interpret the buffer. However the formatted IO would take the locale into consideration. In the case of text files, depending on the environment where the application runs, some special character conversion may occur in input/output operations to adapt them to a system-specific text file format. In many environments, such as most UNIX-based systems, it makes no difference to open a file as a text file or a binary file. Note that you could overload the operator >> and << for your own types. That means you are capable of applying the formatted IO without locale information to your own types, though that’s a bit tricky.

Related

Technical background to the C++ fmt::print vs. fmt::format_to naming?

Why is it fmt::format_to(OutputIt, ...) and not fmt::print(OutputIt, ...)??
I'm currently familiarizing myself with {fmt}, a / the modern C++ formatting library.
While browsing the API, I found the naming a bit disjoint, but given my little-to-no experience with the library (and my interest in API design), I would like to get behind these naming choices: (fmt core API reference)
There's fmt::format(...) -> std::string which makes sense, it returns a formatted string.
Then we have void fmt::print([stream, ] ...) which also makes sense naming wise (certainly given the printf legacy).
But then we have fmt::format_to(OutputIt, ...) -> OutputIt which resembles, apart from the return type, what print does with streams.
Now obviously, one can bike shed names all day, but here the question is not on why we have format vs. print (which is quite explainable to me), but why a function that clearly(?) behaves like the write-to-stream-kind has been bundled with the format_... naming style.
So, as the question title already asks, is there a technical difference in how fmt::print(stream, ...) behaves when formatting to a streams vs. how fmt::format_to(OutputIt, ...) behaves when formatting to an output iterator?
Or was/is this purely a style choice? Also, given that the GitHube repo explicitly lists the fmt tag here, I was hoping that we could get a authoritative answer on this from the original API authors.
While it would be possible to name format_to print, the former is closer conceptually to format than to print. format_to is a generalization of format that writes output via an output iterator instead of std::string. Therefore the naming reflects that.
print without a stream argument on the other hand writes to stdout and print with an argument generalizes that to an arbitrary stream. Writing to a stream is fundamentally different from writing to an output iterator because it involves additional buffering, synchronization, etc. Other languages normally use "print" for this sort of functionality so this convention is followed in {fmt}.
This becomes blurry because you can have an output iterator that writes to a stream but even there the current naming reflects what high-level APIs are being used.
In other words format_to is basically a fancy STL algorithm (there is even format_to_n similarly to copy/copy_n) while print is a formatted output function.

scanf on an istream object

NOTE: I've seen the post What is the cin analougus of scanf formatted input? before asking the question and the post doesn't solve my problem here. The post seeks for C++-way to do it, but as I mentioned already, it is inconvenient to just use C++-way to do it sometimes and I have clear examples for that.
I am trying to read data from an istream object, and sometimes it is inconvenient to just use C++-style ways such as operator>>, e.g. the data are in special form 123:456 so you have to imbue to make ':' as space (which is very hacky, as opposed to %d:%d in scanf), or 00123 where you want to read as string and convert decimal instead of octal (as opposed to %d in scanf), and possibly many other cases.
The reason I chose istream as interface is because it can be derived and therefore more flexible. For example, we can create in-memory streams, or some customized streams that generated on the fly, etc. C-style FILE*, on the other hand, is very limited, at least in a standard-compliant way, on creating customized streams.
So my questions is, is there a way to do scanf-like data extraction on istream object? I think fscanf internally read character by character from FILE* using fgetc, while istream also provides such interface. So it is possible by just copying and pasting the code of fscanf and replace the FILE* with the istream object, but that's very hacky. Is there a smarter and cleaner way, or is there some existing work on this?
Thanks.
You should never, under any circumstances, use scanf or its relatives for anything, for three reasons:
Many format strings, including for instance all the simple uses of %s, are just as dangerous as gets.
It is almost impossible to recover from malformed input, because scanf does not tell you how far in characters into the input it got when it hit something unexpected.
Numeric overflow triggers undefined behavior: yes, that means scanf is allowed to crash the entire program if a numeric field in the input has too many digits.
Prior to C++11, the C++ specification defined istream formatted input of numbers in terms of scanf, which means that last objection is very likely to apply to them as well! (In C++11 the specification is changed to use strto* instead and to do something predictable if that detects overflow.)
What you should do instead is: read entire lines of input into std::string objects with getline, hand-code logic to split them up into fields (I don't remember off the top of my head what the C++-string equivalent of strsep is, but I'm sure it exists) and then convert numeric strings to machine numbers with the strtol/strtod family of functions.
I cannot emphasize this enough: THE ONLY 100% RELIABLE WAY TO CONVERT STRINGS TO NUMBERS IN C OR C++, unless you are lucky enough to have a C++ runtime that is already C++11-conformant in this regard, IS WITH THE strto* FUNCTIONS, and you must use them correctly:
errno = 0;
result = strtoX(s, &ends, 10); // omit 10 for floats
if (s == ends || *ends || errno)
parse_error();
(The OpenBSD manpages, linked above, explain why you have to do this fairly convoluted thing.)
(If you're clever, you can use ends and some manual logic to skip that colon, instead of strsep.)
I do not recommend you to mix C++ input output and C input output. No that they are really incompatible but they could just plain interoperate wrong.
For example Oracle docs recommend not to mix it http://www.oracle.com/technetwork/articles/servers-storage-dev/mixingcandcpluspluscode-305840.html
But no one stops you from reading data into the buffer and parsing it with standard c functions like sscanf.
...
string curString;
int a, b;
...
std::getline(inputStream, curString);
int sscanfResult == sscanf(curString.cstr(), "%d:%d", &a, &b);
if (2 != sscanfResult)
throw "error";
...
But it won't help in some situations when your stream is just one long contiguous sequence of symbols(like some string turned into memory stream).
Making your own fscanf from scratch or porting(?) the original CRT function actually isn't the worst possible idea. Just make sure you have tested it thoroughly(low level custom char manipulation was always a source of pain in C).
I've never really tried the boost\spirit and such parsing infrastructure could really be an overkill for your project. But boost libraries are usually well tested and designed. You could at least try to use it.
Based on #tmyklebu's comment, I implemented streamScanf which wraps istream as FILE* via fopencookie: https://github.com/likan999/codejam/blob/master/Common/StreamScanf.cpp

Formatted and unformatted input and output and streams

I had been reading a few articles on some sites about Formatted and Unformatted I/O, however i have my mind more messed up now.
I know this is a very basic question, but i would request anyone can give a link [ to some site or previously answered question on Stackoverflow ] which explains, the idea of streams in C and C++.
Also, i would like to know about Formatted and Unformatted I/O.
The standard doesn't define what these terms mean, it just says which of the functions defined in the standard are formatted IO and which are not. It places some requirements on the implementation of these functions.
Formatted IO is simply the IO done using the << and >> operators. They are meant to be used with text representation of the data, they involve some parsing, analyzing and conversion of the data being read or written. Formatted input skips whitespace:
Each formatted input function begins execution by constructing an object of class sentry with the noskipws (second) argument false.
Unformatted IO reads and writes the data just as a sequence of 'characters' (with possibly applying the codecvt of the imbued locale). It's meant to read and write binary data, or function as a lower-level used by the formatted IO implementation. Unformatted input doesn't skip whitespace:
Each unformatted input function begins execution by constructing an object of class sentry with the default argument noskipws (second) argument true.
And allows you to retrieve the number of characters read by the last input operation using gcount():
Returns: The number of characters extracted by the last unformatted input member function called for the object.
Formatted IO means that your output is determined by a "format string", that means you provide a string with certain placeholders, and you additionally give arguments that should be used to fill these placeholders:
const char *daughter_name = "Lisa";
int daughter_age = 5;
printf("My daughter %s is %d years old\n", daughter_name, daughter_age);
The placeholders in the example are %s, indicating that this shall be substituted using a string, and %d, indicating that this is to be replaced by a signed integer number. There are a lot more options that give you control over how the final string will present itself. It's a convenience for you as the programmer, because it relieves you from the burden of converting the different data types into a string and it additionally relieves you from string appending operations via strcat or anything similar.
Unformatted IO on the other hand means you simply write character or byte sequences to a stream, not using any format string while you are doing so.
Which brings us to your question about streams. The general concept behind "streaming" is that you don't have to load a file or whatever input as a whole all the time. For small data this does work though, but imagine you need to process terabytes of data - no way this will fit into a single byte array without your machine running out of memory. That's why streaming allows you to process data in smaller-sized chunks, one at a time, one after the other, so that at any given time you just have to deal with a fix-sized amount of data. You read the data into a helper variable over and over again and process it, until your underlying stream tells you that you are done and there is no more data left.
The same works on the output side, you write your output step for step, chunk for chunk, rather than writing the whole thing at once.
This concept brings other nice features, too. Because you can nest streams within streams within streams, you can build a whole chain of transformations, where each stream may modify the data until you finally receive the end result, not knowing about the single transformations, because you treat your stream as if there were just one.
This can be very useful, for example C or C++ streams buffer the data that they read natively from e.g. a file to avoid unnecessary calls and to read the data in optimized chunks, so that the overall performance will be much better than if you would read directly from the file system.
Unformatted Input/Output is the most basic form of input/output. Unformatted input/output transfers the internal binary representation of the data directly between memory and the file. Formatted output converts the internal binary representation of the data to ASCII characters which are written to the output file. Formatted input reads characters from the input file and converts them to internal form. Formatted

What's the difference between istringstream, ostringstream and stringstream? / Why not use stringstream in every case?

When would I use std::istringstream, std::ostringstream and std::stringstream and why shouldn't I just use std::stringstream in every scenario (are there any runtime performance issues?).
Lastly, is there anything bad about this (instead of using a stream at all):
std::string stHehe("Hello ");
stHehe += "stackoverflow.com";
stHehe += "!";
Personally, I find it very rare that I want to perform streaming into and out of the same string stream.
Usually I want to either initialize a stream from a string and then parse it; or stream things to a string stream and then extract the result and store it.
If you're streaming to and from the same stream, you have to be very careful with the stream state and stream positions.
Using 'just' istringstream or ostringstream better expresses your intent and gives you some checking against silly mistakes such as accidental use of << vs >>.
There might be some performance improvement but I wouldn't be looking at that first.
There's nothing wrong with what you've written. If you find it doesn't perform well enough, then you could profile other approaches, otherwise stick with what's clearest. Personally, I'd just go for:
std::string stHehe( "Hello stackoverflow.com!" );
A stringstream is somewhat larger, and might have slightly lower performance -- multiple inheritance can require an adjustment to the vtable pointer. The main difference is (at least in theory) better expressing your intent, and preventing you from accidentally using >> where you intended << (or vice versa). OTOH, the difference is sufficiently small that especially for quick bits of demonstration code and such, I'm lazy and just use stringstream. I can't quite remember the last time I accidentally used << when I intended >>, so to me that bit of safety seems mostly theoretical (especially since if you do make such a mistake, it'll almost always be really obvious almost immediately).
Nothing at all wrong with just using a string, as long as it accomplishes what you want. If you're just putting strings together, it's easy and works fine. If you want to format other kinds of data though, a stringstream will support that, and a string mostly won't.
In most cases, you won't find yourself needing both input and output on the same stringstream, so using std::ostringstream and std::istringstream explicitly makes your intention clear. It also prevents you from accidentally typing the wrong operator (<< vs >>).
When you need to do both operations on the same stream you would obviously use the general purpose version.
Performance issues would be the least of your concerns here, clarity is the main advantage.
Finally there's nothing wrong with using string append as you have to construct pure strings. You just can't use that to combine numbers like you can in languages such as perl.
istringstream is for input, ostringstream for output. stringstream is input and output.
You can use stringstream pretty much everywhere.
However, if you give your object to another user, and it uses operator >> whereas you where waiting a write only object, you will not be happy ;-)
PS:
nothing bad about it, just performance issues.
std::ostringstream::str() creates a copy of the stream's content, which doubles memory usage in some situations. You can use std::stringstream and its rdbuf() function instead to avoid this.
More details here: how to write ostringstream directly to cout
To answer your third question: No, that's perfectly reasonable. The advantage of using streams is that you can enter any sort of value that's got an operator<< defined, while you can only add strings (either C++ or C) to a std::string.
Presumably when only insertion or only extraction is appropriate for your operation you could use one of the 'i' or 'o' prefixed versions to exclude the unwanted operation.
If that is not important then you can use the i/o version.
The string concatenation you're showing is perfectly valid. Although concatenation using stringstream is possible that is not the most useful feature of stringstreams, which is to be able to insert and extract POD and abstract data types.
Why open a file for read/write access if you only need to read from it, for example?
What if multiple processes needed to read from the same file?

C++ ofstream vs. C++ cout piped to file

I'm writing a set of unit tests that write calculated values out to files. Each test produces a square matrix that holds anywhere from 50,000 to 500,000 doubles, and I have a total of 128 combinations of test cases.
Is there any significant overhead involved in writing cout statements and then piping that output to files, or would I be better off writing directly to the file using an ofstream?
This is going to be dependent on your system and environment. This likely to be very little difference, but there is only one way to be sure: try both approaches and measure them.
Since the dimensions involved are so large I'm assuming that these files are not meant to be read by a human being? Just make sure you write them out as binary and not human-readable text because that will make so much more difference than the difference between using ofstream or piping cout.
Whether this means you have to use ofstream or not I don't know. I've never written binary to cout so I can't say whether that's possible...
As Charles Bailey said, it's implementation dependent; what follows is mostly for linux implementation with gnu toolchain, but I hardly imagine it being very different in other os.
In libstdc++ 4.4.2:
An fstream contain an underlying stdio_filebuf which is a basic_filebuf. This basic_filebuf contain it's own buffer by inheriting basic_streambuf, and actually contain a __basic_file, itself containing an underlying plain C stdio abstraction (FILE* or std::__c_file*), in which it flush the buffer.
cout, which is an ostream is initialized with a stdio_sync_filebuf itself initialized with the C file abstraction stdout. stdio_sync_filebuf call plain C stdio functions.
Considering only C++, it appear that an fstream may be more efficient thanks to two layers of buffer.
Considering C only, if the process is forked with the stdout file descriptor redirected in a file, there should be no difference between writing to a new opened file (what fstream does at the end) or to stdout since the fd point to a file anyway (what cout does at the end).
If I were you, I would use an fstream since it's your intent.