On Windows, when you read the characters \r\n from a file (or stdin) in text mode, the \r gets deleted and you only read \n.
Is there a standard according to which it should be so?
Can I be sure that it will be true for any compiler on Windows? Will other platform-specific character combinations be replaced by \n on those platforms too?
I use this code to generate the input and this code to read it. The results are here. You may note a few missing \r's.
Yes, this comes from compatibility with C. In C text streams, lines are terminated by a newline character. This is the internal representation of the text stream as seen by the program. The I/O library converts between the internal representation and some external one.
The internal representation is platform-independent, whereas there are different platform-specific conventions for text. That's the point of having a text mode in the stream library; portable text manipulating programs can be written which do not have to contain a pile of #ifdef directives to work on different platforms, or build their own platform-independent text abstraction.
It so happens that the internal representation for C text streams matches the native Unix representation of text files, since the C language and its library originated on Unix. For portability of C programs to other platforms, the text stream abstraction was added, which makes text files on non-Unix systems look like Unix text files.
In the ISO/IEC 9899:1999 standard ("C99"), we have this:
7.19.2 Streams
[...]
A text stream is an ordered sequence of characters composed into lines, each line consisting of zero or more characters plus a terminating new-line character. Whether the last line requires a terminating new-line character is implementation-defined. Characters may have to be added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment. Thus, there need not be a one-to-one correspondence between the characters in a stream and those in the external representation.
Emphasis mine, on the sentence about characters being added, altered, or deleted. C++ streams are defined in terms of C streams. There is no explanation of text versus binary mode in the C++ standard, except for a table which maps various stream mode flag combinations to strings suitable as mode arguments to fopen.
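To make the translation concrete, here is a minimal sketch (file name and contents are illustrative) that writes a single line through a text stream and then re-reads the file through a binary stream to inspect the external representation; on Windows the text-mode write typically ends the line with the two bytes 0D 0A, while on Unix-like systems it is just 0A.
#include <cstdio>

int main() {
    // Write one line through a text stream ("w" means text mode).
    std::FILE* out = std::fopen("demo.txt", "w");
    if (!out) return 1;
    std::fputs("hello\n", out);              // the program only ever deals with '\n'
    std::fclose(out);

    // Re-open the same file as a binary stream to see the bytes actually stored.
    std::FILE* in = std::fopen("demo.txt", "rb");
    if (!in) return 1;
    int c;
    while ((c = std::fgetc(in)) != EOF)
        std::printf("%02X ", c);             // Windows: ... 0D 0A, Unix: ... 0A
    std::fclose(in);
    return 0;
}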
Related
I tried using fopen in C; the second parameter is the open mode. The two modes "r" and "rb" confuse me a lot. They seem to be the same, but sometimes it is better to use "rb". So why does "r" exist?
Explain it to me in detail or with examples.
Thank You.
You should use "r" for opening text files. Different operating systems have slightly different ways of storing text, and this will perform the correct translations so that you don't need to know about the idiosyncracies of the local operating system. For example, you will know that newlines will always appear as a simple "\n", regardless of where the code runs.
You should use "rb" if you're opening non-text files, because in this case, the translations are not appropriate.
On Linux, and Unix in general, "r" and "rb" are the same. More specifically, a FILE pointer obtained by fopen()ing a file in text mode and in binary mode behaves the same way on Unixes. On Windows, and in general on systems that use more than one character to represent "newlines", a file opened in text mode behaves as if all those characters are just one character, '\n'.
If you want to portably read/write text files on any system, use "r" and "w" in fopen(). That will guarantee that the files are written and read properly. If you are opening a binary file, use "rb" and "wb", so that an unfortunate newline translation doesn't mess up your data.
Note that a consequence of the underlying system doing the newline translation for you is that you can't determine the number of characters you will be able to read from a text file by seeking to its end with fseek(file, 0, SEEK_END) and calling ftell().
Finally, see What's the difference between text and binary I/O? on comp.lang.c FAQs.
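As a rough sketch of the fseek() caveat mentioned above (the file name is illustrative, and what ftell() reports for a text stream varies by platform), the byte count reported at the end of a text stream need not match the number of characters the program actually reads, because of the newline translation:
#include <cstdio>

int main() {
    // Open in text mode, so newline translation applies.
    std::FILE* f = std::fopen("notes.txt", "r");
    if (!f) return 1;

    std::fseek(f, 0, SEEK_END);
    long reported = std::ftell(f);           // typically the size of the file on disk
    std::rewind(f);

    long actually_read = 0;
    while (std::fgetc(f) != EOF)
        ++actually_read;                     // characters as seen by the program

    // On Windows, actually_read can be smaller than reported,
    // because each "\r\n" pair in the file is delivered as a single '\n'.
    std::printf("reported=%ld read=%ld\n", reported, actually_read);
    std::fclose(f);
    return 0;
}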
use "rb" to open a binary file. Then the bytes of the file won't be encoded when you read them
"r" is the same as "rt" for Translated mode
"rb" is
non-translated mode.
This makes a difference on Windows, at least. See that link for details.
On most POSIX systems, it is ignored. But, check your system to be sure.
XNU
The mode string can also include the letter 'b' either as last character or as a character between the characters in any of the two-character strings described above. This is strictly for compatibility with ISO/IEC 9899:1990 ('ISO C90') and has no effect; the 'b' is ignored.
Linux
The mode string can also include the letter 'b' either as a last character or as a character between the characters in any of the two-character strings described above. This is strictly for compatibility with C89 and has no effect; the 'b' is ignored on all POSIX conforming systems, including Linux. (Other systems may treat text files and binary files differently, and adding the 'b' may be a good idea if you do I/O to a binary file and expect that your program may be ported to non-UNIX environments.)
Is the behavior of writing a non-printing character undefined or implementation-defined, if the character is written via printf/fprintf? I am confused because the wording in the C standard (N1570 §5.2.2) only talks about the display semantics for printing characters and alphabetic escape sequences.
In addition, what if the character is written via std::ostream (C++ only)?
The output of ASCII non-printable (control) characters is implementation-defined.
Specifically, interpretation is the responsibility of the output device.
Edit 1:
When the output device is opened as a file, it can be opened as binary. When opened as binary the output is not translated (e.g. line endings).
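A minimal sketch of the point above (the file name is illustrative): the same control character can be sent to a terminal, where the device decides how to interpret it, or stored untranslated in a file opened in binary mode.
#include <cstdio>

int main() {
    // '\a' is the BEL control character (0x07). Whether it beeps, shows a
    // symbol, or is ignored is up to the output device, not the C library.
    std::fprintf(stdout, "alert:%c\n", '\a');

    // Written to a binary stream, the byte is simply stored as 0x07.
    std::FILE* f = std::fopen("raw.bin", "wb");
    if (f) {
        std::fputc(0x07, f);
        std::fclose(f);
    }
    return 0;
}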
When creating string literals in C++, I would like to know how the strings are encoded -- I can specify the encoding form (UTF-8, 16, or 32), but I want to know how the compiler determines the unspecified parts of the encoding.
For UTF-8 the byte-ordering is not relevant, and I would assume the byte ordering of UTF-16 and UTF-32 is, by default, the system byte-ordering. This leaves the normalization. As an example:
std::string u8foo = u8"Föo";
std::u16string u16foo = u"Föo";
std::u32string u32foo = U"Föo";
In all three cases, there are at least two possible encodings -- decomposed or composed. For more complex characters there might be multiple possible encodings, but I would assume that the compiler would generate one of the normalized forms.
Is this a safe assumption? Can I know in advance in what normalization the text in u8foo and u16foo is stored? Can I specify it somehow?
I am of the impression this is not defined by the standard, and that it is implementation-specific. How does GCC handle it? Other compilers?
The interpretation of character strings outside of the basic source character set is implementation-dependent. (Standard quote below.) So there is no definitive answer; an implementation is not even obliged to accept source characters outside of the basic set.
Normalisation involves a mapping of possibly multiple source codepoints to possibly multiple internal codepoints, including the possibility of reordering the source character sequence (if, for example, diacritics are not in the canonical order). Such transformations are more complex than the source→internal transformation anticipated by the standard, and I suspect that a compiler which attempted them would not be completely conformant. In any event, I know of no compiler which does so.
So, in general, you should ensure that the source code you provide to the compiler is normalized as per your desired normalization form, if that matters to you.
In the particular case of GCC, the compiler interprets the source according to the default locale's encoding, unless told otherwise (with the -finput-charset command-line option). It will recode if necessary to Unicode codepoints. But it does not alter the sequence of codepoints. So if you give it a normalized UTF-8 string, that's what you get. And if you give it an unnormalized string, that's also what you get.
In this example on coliru, the first string is composed and the second one decomposed (although they are both in some normalization form). (The rendering of the second example string on coliru seems to be browser-dependent. On my machine, Chrome renders them correctly, while Firefox shifts the diacritics one position to the left. YMMV.)
The C++ standard defines the basic source character set (in §2.3/1) to be letters, digits, five whitespace characters (space, newline, tab, vertical tab and formfeed) and the symbols:
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
It gives the compiler a lot of latitude as to how it interprets the input, and how it handles characters outside of the basic source character set. §2.2 paragraph 1 (from C++14 draft n4527):
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (e.g., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
It's worth adding that diacritics are characters, from the perspective of the C++ standard. So the composed ñ (\u00f1) is one character and the decomposed ñ (\u006e \u0303) is two characters, regardless of how it looks to you.
A close reading of the above paragraph from the standard suggests that normalization or other transformations which are not strictly 1-1 are not permitted, although the compiler may be able to reject an input which contains characters outside the basic source character set.
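A minimal sketch of the point about normalization (using \u escapes so the source file encoding does not matter): the compiler stores exactly the codepoint sequence you wrote, so the composed and decomposed spellings of the same visible character produce byte strings of different lengths.
#include <cstdio>

int main() {
    // "ñ" as one composed codepoint vs. 'n' followed by the combining tilde U+0303.
    // The compiler keeps whichever sequence you wrote; it does not normalize.
    std::printf("%zu\n", sizeof(u8"\u00F1") - 1);   // 2 UTF-8 bytes: C3 B1
    std::printf("%zu\n", sizeof(u8"n\u0303") - 1);  // 3 UTF-8 bytes: 6E CC 83
    return 0;
}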
Microsoft Visual C++ will keep the normalization used in the source file.
The main problem you have when doing this cross-platform is making sure the compilers are using the right encodings. Here is how MSVC handles it:
Source file encoding
The compiler has to read your source file with the right encoding.
MSVC doesn't have an option to specify the encoding on the command line but relies on the BOM to detect encoding, so it can read the following encodings:
UTF-16 with BOM, if the file starts with that BOM
UTF-8, if the file starts with "\xef\xbb\xbf" (the UTF-8 "BOM")
in all other cases, the file is read using an ANSI code page dependent on your system language setting. In practice this means you can only use ASCII characters in your source files.
Output encoding
Your Unicode strings will be converted to some encoding before being written to your executable as a byte string.
Wide literals (L"...") are always written as UTF-16.
In MSVC 2010 you can use #pragma execution_character_set("utf-8") to have char strings encoded as UTF-8. By default they are encoded in your local code page. That pragma is apparently missing from MSVC 2012, but it's back in MSVC 2013.
#pragma execution_character_set("utf-8")
const char a[] = "ŦεŞŧ";
Support for the Unicode literals (u"..." and friends) was only introduced with MSVC 2015.
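Putting the pieces above together, here is a minimal sketch of the literal prefixes and the encodings MSVC gives them (the string contents are illustrative):
#pragma execution_character_set("utf-8")   // MSVC 2010/2013+: narrow literals become UTF-8

const char     narrow[] = "ŦεŞŧ";          // UTF-8 because of the pragma; local code page otherwise
const wchar_t  wide[]   = L"ŦεŞŧ";         // always stored as UTF-16 by MSVC
const char16_t utf16[]  = u"ŦεŞŧ";         // u"...", u8"..." and U"..." require MSVC 2015 or later
const char32_t utf32[]  = U"ŦεŞŧ";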
I was wondering: why does writing to a file with the standard library convert your \n into \r\n? Is it standard behaviour according to C99 or a "convenience" added by MS? If it is standard, with the old Apple convention (\r), would writing "toto\n" to a file write "toto\r"? Is the current behaviour here so that you can read UNIX files but UNIXes cannot read yours? (I love conspiracy theories.)
Why does writing to a file with the standard library convert your \n into \r\n?
It's to make code more portable; you can just use \n in your program and have it work on UNIX, Windows, Macs, and (supposedly) everything else.
Is it standard behaviour according to C99 or a "convenience" added by MS?
Yes, it's standard.
If it is standard, with the old Apple convention (\r), would writing "toto\n" to a file write "toto\r"?
Yes, translating end-of-line characters is expected.
Is the current behaviour here so that you can read UNIX files but UNIXes cannot read yours?
No, there's no conspiracy.
From the C11 spec, §7.21.2 Streams, ¶2:
… Characters may have to be added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment. Thus, there need not be a one-to-one correspondence between the characters in a stream and those in the external representation. …
If you don't want this behaviour, open your file as a binary stream rather than as a text stream.
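A minimal sketch of that choice (the file names are illustrative): the same string written through a text stream and through a binary stream ends up with different bytes on Windows.
#include <cstdio>

int main() {
    // Text mode: on Windows each '\n' is stored in the file as "\r\n".
    std::FILE* t = std::fopen("text_out.txt", "w");
    if (t) { std::fputs("toto\n", t); std::fclose(t); }

    // Binary mode: the bytes are stored exactly as given, no translation.
    std::FILE* b = std::fopen("binary_out.txt", "wb");
    if (b) { std::fputs("toto\n", b); std::fclose(b); }

    // On Windows, text_out.txt ends with 0D 0A while binary_out.txt ends with 0A.
    return 0;
}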
std::fstream has the option to consider streams as binary, rather than textual. What's the difference?
As far as I know, it all depends on how the file is opened in other programs. If I write A to a binary stream, it'll just get converted to 01000001 (65, A's ASCII code) - the exact same representation. That can be read as the letter "A" in text editors, or the binary sequence 01000001 for other programs.
Am I missing something, or does it not make any difference whether a stream is considered binary or not?
In text streams, newline characters may be translated to and from the \n character; with binary streams, this doesn't happen. The reason is that different OS's have different conventions for storing newlines; Unix uses \n, Windows \r\n, and old-school Macs used \r. To a C++ program using text streams, these all appear as \n.
On Linux/Unix/Android there is no difference.
On Mac OS X or later there is no difference, but I think older (classic) Macs might change a '\n' to a '\r' on writing, and the reverse on reading (only for a text stream).
On Windows, for a text stream some characters get treated specially. A '\n' character is written as "\r\n", and a "\r\n" pair is read as '\n'. A '\x1A' character is treated as "end of file" and terminates reading.
I think Symbian and PalmOS/WebOS behave the same as Windows.
A binary stream just writes bytes and won't do any transformation on any platform.
You got it the other way around: it's text streams that are special, specifically because of the '\n' translation to either "\n" or "\r\n" (or even "\r") depending on your system.
The practical difference is the treatment of line-ending sequences on Microsoft operating systems.
Binary streams return the data in the file precisely as it is stored. Text streams normalize line-ending sequences, replacing them with '\n'.
If you open it as text, then the C or C++ runtime will perform newline conversions depending on the host (Windows or Linux).
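To see the same distinction through std::fstream, here is a minimal sketch (file names are illustrative) of opening a stream with and without std::ios::binary:
#include <fstream>

int main() {
    // Text stream: '\n' may be translated to the platform's line ending on write.
    std::ofstream text("lines.txt");
    text << "first line\n";

    // Binary stream: the bytes pass through untranslated.
    std::ofstream binary("lines.bin", std::ios::binary);
    binary << "first line\n";

    // On Windows the first file gets "\r\n"; the second keeps the single '\n'.
    return 0;
}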