What Makes a Binary Stream Special? - c++

std::fstream has the option to consider streams as binary, rather than textual. What's the difference?
As far as I know, it all depends on how the file is opened in other programs. If I write A to a binary stream, it'll just get converted to 01000001 (65, A's ASCII code) - the exact same representation. That can be read as the letter "A" in text editors, or the binary sequence 01000001 for other programs.
Am I missing something, or does it not make any difference whether a stream is considered binary or not?

In text streams, the platform's newline sequence is translated to and from the \n character; with binary streams, this doesn't happen. The reason is that different OSes have different conventions for storing newlines: Unix uses \n, Windows uses \r\n, and old-school Macs used \r. To a C++ program using text streams, these all appear as \n.

On Linux/Unix/Android there is no difference.
On Mac OS X or later there is no difference, but I think older Macs might change a '\n' to a '\r' on reading, and the reverse on writing (only for a text stream).
On Windows, for a text stream some characters get treated specially. A '\n' character is written as "\r\n", and a "\r\n" pair is read as '\n'. A '\x1A' character is treated as "end of file" and terminates reading.
I think Symbian and Palm OS/webOS behave the same as Windows.
A binary stream just writes bytes and won't do any transformation on any platform.
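A minimal sketch of that difference (the file names here are just placeholders): write the same string through a text-mode and a binary-mode std::ofstream and compare what ends up on disk:
#include <fstream>

int main() {
    // Text mode (the default): on Windows, each '\n' is written out as "\r\n".
    std::ofstream text("text.txt");
    text << "line 1\nline 2\n";

    // Binary mode: the bytes are written exactly as given, on every platform.
    std::ofstream bin("binary.txt", std::ios::binary);
    bin << "line 1\nline 2\n";

    // On Windows, text.txt ends up two bytes longer than binary.txt;
    // on Linux and macOS the two files are byte-for-byte identical.
}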

You got it the other way around: it's text streams that are special, specifically because of the \n translation to either \n or \r\n (or even \r) depending on your system.

The practical difference is the treatment of line-ending sequences on Microsoft operating systems.
Binary streams return the data in the file precisely as it is stored. Text streams normalize line-ending sequences, replacing them with '\n'.

If you open it as text then the C or C++ runtime will perform newline conversions depending on the host (Windows or Linux).


How to keep CR when reading file into string? [duplicate]

I tried using fopen in C; the second parameter is the open mode. The two modes "r" and "rb" tend to confuse me a lot. They seem to be the same, but sometimes it is better to use "rb". So, why does "r" exist?
Explain it to me in detail or with examples.
Thank you.
You should use "r" for opening text files. Different operating systems have slightly different ways of storing text, and this will perform the correct translations so that you don't need to know about the idiosyncracies of the local operating system. For example, you will know that newlines will always appear as a simple "\n", regardless of where the code runs.
You should use "rb" if you're opening non-text files, because in this case, the translations are not appropriate.
On Linux, and Unix in general, "r" and "rb" are the same. More specifically, a FILE pointer obtained by fopen()ing a file in text mode and in binary mode behaves the same way on Unixes. On Windows, and in general, on systems that use more than one character to represent "newlines", a file opened in text mode behaves as if all those characters are just one character, '\n'.
If you want to portably read/write text files on any system, use "r", and "w" in fopen(). That will guarantee that the files are written and read properly. If you are opening a binary file, use "rb" and "wb", so that an unfortunate newline-translation doesn't mess your data.
Note that a consequence of the underlying system doing the newline translation for you is that you can't determine the number of bytes you can read from a text-mode file using fseek(file, 0, SEEK_END) followed by ftell(file).
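To see the effect, here is a rough sketch (data.txt stands for any file containing "\r\n" line endings) that counts the characters fgetc() delivers under each mode; on Windows the "r" count is smaller than the file's size on disk, which is exactly why the fseek/ftell trick is unreliable for text streams:
#include <cstdio>

// Count how many characters fgetc() delivers for a given open mode.
static long count_chars(const char* path, const char* mode) {
    std::FILE* f = std::fopen(path, mode);
    if (!f) return -1;
    long n = 0;
    while (std::fgetc(f) != EOF) ++n;
    std::fclose(f);
    return n;
}

int main() {
    // On Windows, "r" reports fewer characters than "rb" for a file with
    // "\r\n" line endings, because each pair is read back as a single '\n'.
    // On Linux/Unix the two counts are identical.
    std::printf("\"r\":  %ld\n", count_chars("data.txt", "r"));
    std::printf("\"rb\": %ld\n", count_chars("data.txt", "rb"));
}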
Finally, see What's the difference between text and binary I/O? on comp.lang.c FAQs.
use "rb" to open a binary file. Then the bytes of the file won't be encoded when you read them
"r" is the same as "rt" for Translated mode
"rb" is
non-translated mode.
This makes a difference on Windows, at least. See that link for details.
On most POSIX systems, it is ignored. But check your system to be sure.
XNU
The mode string can also include the letter 'b' either as last character or as a character between the characters in any of the two-character strings described above. This is strictly for compatibility with ISO/IEC 9899:1990 ('ISO C90') and has no effect; the 'b' is ignored.
Linux
The mode string can also include the letter 'b' either as a last character or as a character between the characters in any of the two-character strings described above. This is strictly for compatibility with C89 and has no effect; the 'b' is ignored on all POSIX conforming systems, including Linux. (Other systems may treat text files and binary files differently, and adding the 'b' may be a good idea if you do I/O to a binary file and expect that your program may be ported to non-UNIX environments.)

What is the behavior of writing a non-printing character in C/C++?

Is the behavior of writing a non-printing character undefined or implementation-defined, if the character is written via printf/fprintf? I am confused because the words in the C standard N1570/5.2.2 only talk about the display semantics for printing characters and alphabetic escape sequences.
In addition, what if the character is written via std::ostream (C++ only)?
The output of ASCII non-printable (control) characters is implementation defined.
Specifically, interpretation is the responsibility of the output device.
Edit 1:
When the output device is a file, it can be opened as binary. When opened as binary, the output is not translated (e.g. line endings are written unchanged).
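A small illustrative sketch (the file name is a placeholder): when the destination is a file opened in binary mode, a control character is stored as its raw byte value, and any special meaning is left to whatever device eventually displays it:
#include <cstdio>

int main() {
    // In binary mode the bytes are stored exactly as written; no translation
    // and no display interpretation happens at this level.
    std::FILE* f = std::fopen("out.bin", "wb");
    if (!f) return 1;
    std::fputc(0x07, f);  // BEL: a terminal might beep; the file just gets byte 0x07
    std::fputc(0x1A, f);  // SUB: Windows text-mode *reads* treat this as end of file,
                          // but as a stored byte it is nothing special
    std::fclose(f);
}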

What is the value of '\n' under C compilers for old Mac OS?

Background:
In versions of Mac OS up to version 9, the standard representation for text files used an ASCII CR (carriage return) character, value decimal 13, to mark the end of a line.
Mac OS 10, unlike earlier releases, is UNIX-like, and uses the ASCII LF (line feed) character, value decimal 10, to mark the end of a line.
The question is, what are the values of the character constants '\n' and '\r' in C and C++ compilers for Mac OS releases prior to OS X?
There are (at least) two possible approaches that could have been taken:
Treat '\n' as the ASCII LF character, and convert it to and from CR on output to and input from text streams (similar to the conversion between LF and CR-LF on Windows systems); or
Treat '\n' as the ASCII CR character, which requires no conversion on input or output.
There would be some potential problems with the second approach. One is that code that assumes '\n' is LF could fail. (Such code is inherently non-portable anyway.) The other is that there still needs to be a distinct value for '\r', and on an ASCII-based system CR is the only sensible value. And the C standard doesn't permit '\n' == '\r' (thanks to mafso for finding the citation, 5.2.2 paragraph 3), so some other value would have to be used for '\r'.
What is the output of this C program when compiled and executed under Mac OS N, for N less than 10?
#include <stdio.h>

int main(void) {
    printf("'\\n' = %d\n", '\n');
    printf("'\\r' = %d\n", '\r');
    if ('\n' == '\r') {
        printf("Hmm, this could be a problem\n");
    }
}
The question applies to both C and C++. I presume the answer would be the same for both.
The answer could also vary from one C compiler to another -- but I would hope that compiler implementers would have maintained consistency with each other.
To be clear, I am not asking what representation old releases of Mac OS used to represent end-of-line in text files. My question is specifically and only about the values of the constants '\n' and '\r' in C or C++ source code. I'm aware that printing '\n' (whatever its value is) to a text stream causes it to be converted to the system's end-of-line representation (in this case, ASCII CR); that behavior is required by the C standard.
The values of the character constants \r and \n were exactly the same in Classic Mac OS environments as they were everywhere else: \r was CR, ASCII 13 (0x0d); \n was LF, ASCII 10 (0x0a). The only thing that was different on Classic Mac OS was that \r was used as the "standard" line ending in text editors, just like \n is used on UNIX systems, or \r\n on DOS and Windows systems.
Here's a screenshot of a simple test program running in Metrowerks CodeWarrior on Mac OS 9, for instance:
Keep in mind that Classic Mac OS systems didn't have a system-wide standard C library! Functions like printf() were only present as part of compiler-specific libraries like SIOUX for CodeWarrior, which implemented C standard I/O by writing output to a window with a text field in it. As such, some implementations of standard file I/O may have performed some automatic translation between \r and \n, which may be what you're thinking of. (Many Windows systems do similar things for \r\n if you don't pass the "b" flag to fopen(), for instance.) There was certainly nothing like that in the Mac OS Toolbox, though.
I've done a search and found this page with an old discussion; in particular, the following can be found there:
The Metrowerks MacOS implementation goes a step further by reversing the significance of CR and LF with regard to the '\r' and '\n' escapes in i/o involving a file, but not in any other context. This means that if you open a FILE or fstream in text mode, every '\r' will be output there as an LF as well as every '\n' being output as CR, and the same is true of input - the escape-to-ASCII-binary correspondences are reversed. They are not reversed however in memory, e.g. with sprintf() to a buffer or with a std::stringstream. I find this confusing and, if not non-standard, at least worse than other implementations.
It turns out there is a workaround with MSL - if you open the file in binary mode then '\n' always == LF and '\r' always == CR. This is what I wanted but in getting this information I also got a lot of justification from folks over there that this was the "standard" way to get what I wanted, when I feel like this is more like a workaround for a bug in their implementation. After all, CR and LF are 7-bit ASCII values and I'd expect to be able to use them in a standard way with a file opened in text mode.
(An answer makes clear that this is indeed not a violation of the standard.)
So obviously there was at least one implementation which used \n and \r with the usual ASCII values, but translated them in (non-binary) file output (by just exchanging them).
On older Mac compilers, the roles of \r and \n were reversed: we had '\n' == 13 and '\r' == 10, while today '\n' == 10 and '\r' == 13. Great fun during the transition phase. Write a '\n' to a file with an old compiler, read the file with a new compiler, and get a '\r' (of course, both times you actually had a number 13).
C-language specification:
5.2.2
...
2 Alphabetic escape sequences representing nongraphic characters in the execution character set are intended to produce actions on display devices as follows:
...
\n (new line) Moves the active position to the initial position of the next line.
\r (carriage return) Moves the active position to the initial position of the current line.
So \n represents the appropriate character in the execution character encoding; in ASCII that is the LF character.
I don't have an old Mac compiler to check if they follow this, but the numeric value of '\n' should be the same as the ASCII new line character (given that those compilers used ASCII compatible encoding as the execution encoding, which I believe they did). '\r' should have the same numeric value as the ASCII carriage return.
The library or OS functions that handle writing text mode files is responsible for converting the numeric value of '\n' to whatever the OS uses to terminate lines. The numeric values of these characters at runtime are determined entirely by the execution character set.
Thus, since we're still using ASCII-compatible execution encodings, the numeric values should be the same as with classic Mac compilers.

Are fprintf, fputs... functions on windows legal according to the C standard?

I was wondering: why does writing to a file with the standard lib convert your \n into \r\n? Is it standard behaviour according to C99 or a "commodity" added by MS? If it is standard, with the old Apple convention (\r), would writing "toto\n" to a file write "toto\r"? Is the current behaviour here so that you could read UNIX files but UNIXes could not read yours? (I love conspiracy theories)
why does writing to a file with the standard lib convert your \n into \r\n?
It's to make code more portable; you can just use \n in your program and have it work on UNIX, Windows, Macs, and (supposedly) everything else.
Is it standard behaviour according to C99 or a "commodity" added by MS?
Yes, it's standard.
If it is standard, with the old Apple convention (\r), would writing "toto\n" to a file write "toto\r"?
Yes, translating end-of-line characters is expected.
Is the current behaviour here so that you could read UNIX file but UNIXes could not read yours?
No, there's no conspiracy.
From the C11 spec, §7.21.2 Streams, ¶2:
… Characters may have to be added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment. Thus, there need not be a one-to-one correspondence between the characters in a stream and those in the external representation. …
If you don't want this behaviour, open your file as a binary stream rather than as a text stream.
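For instance, a minimal sketch of that choice (the file names are placeholders), writing the same "toto\n" through a text stream and a binary stream:
#include <cstdio>

int main() {
    // Text stream: on Windows the '\n' reaches the file as "\r\n" (and an old
    // Mac OS implementation could have written it as "\r").
    std::FILE* text = std::fopen("text.txt", "w");
    if (text) { std::fprintf(text, "toto\n"); std::fclose(text); }

    // Binary stream: the '\n' is written as the single byte 0x0A everywhere.
    std::FILE* bin = std::fopen("binary.txt", "wb");
    if (bin) { std::fprintf(bin, "toto\n"); std::fclose(bin); }
}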

Why CR LF is changed to LF in Windows?

On Windows, when you read the characters \r\n from a file (or stdin) in text mode, the \r gets deleted and you only read \n.
Is there a standard according to which it should be so?
Can I be sure that this will be true for any compiler on Windows? Will other platform-specific character combinations be replaced by \n on those platforms too?
I use this code to generate the input and this code to read it. The results are here. You may notice a few missing \r's.
Yes, this comes from compatibility with C. In C text streams, lines are terminated by a newline character. This is the internal representation of the text stream as seen by the program. The I/O library converts between the internal representation and some external one.
The internal representation is platform-independent, whereas there are different platform-specific conventions for text. That's the point of having a text mode in the stream library; portable text manipulating programs can be written which do not have to contain a pile of #ifdef directives to work on different platforms, or build their own platform-independent text abstraction.
It so happens that the internal representation for C text streams matches the native Unix representation of text files, since the C language and its library originated on Unix. For portability of C programs to other platforms, the text stream abstraction was added which makes text files on non-Unix system look like Unix text files.
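A short sketch of what that translation looks like in practice, assuming input.txt contains the five bytes "abc\r\n" (one Windows-style line):
#include <cstdio>
#include <cstring>

int main() {
    char buf[64];

    // Text mode on Windows: the "\r\n" pair is read back as a single '\n',
    // so the line is "abc\n" and its length is 4.
    std::FILE* text = std::fopen("input.txt", "r");
    if (text && std::fgets(buf, sizeof buf, text))
        std::printf("text mode length:   %zu\n", std::strlen(buf));
    if (text) std::fclose(text);

    // Binary mode: both bytes come through untouched, so the line is
    // "abc\r\n" and its length is 5.
    std::FILE* bin = std::fopen("input.txt", "rb");
    if (bin && std::fgets(buf, sizeof buf, bin))
        std::printf("binary mode length: %zu\n", std::strlen(buf));
    if (bin) std::fclose(bin);
}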
In the ISO/IEC 9899:1999 standard ("C99"), we have this:
7.19.2 Streams
[...]
A text stream is an ordered sequence of characters composed into lines, each line consisting of zero or more characters plus a terminating new-line character. Whether the last line requires a terminating new-line character is implementation-defined. Characters may have to be added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment. Thus, there need not be a one-to-one correspondence between the characters in a stream and those in the external representation.
Bold emphasis mine. C++ streams are defined in terms of C streams. There is no explanation of text versus binary mode in the C++ standard, except for a table which maps various stream mode flag combinations to strings suitable as mode arguments to fopen.