Let's start with a simple line of C++:
char const* hello = "動画、読書な"; // I hope it is not offensive, I don't know what this means ))
and note that this line is stored in a UTF-8 encoded file.
When I pass the file with this line for compilation (the result is binary code), the compiler does the following steps:
Reads the file (it needs to know the file encoding; in the case of UTF-8 that is probably easy when a BOM is present, but what about other encodings?)
Parses the file content using its grammar, builds a syntax tree, ...
If everything is fine, it starts writing binary code; at this stage it stores the constants in the code.
The question is how it will store the constant above ("動画、読書な"). Does it convert it somehow?
Or does it just read the bytes between the " characters from the file and store them as-is? Does that mean the final binary code depends on the original source file encoding?
The source code has to be mapped to the basic source character set (essentially a subset of ASCII) in an implementation-defined way, with characters outside that set preserved as universal-character-names (escape sequences) where necessary:
ISO/IEC 14882:2003(E)
2.1.1 Phases of translation
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)
...
15) The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.
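To see concretely what a particular compiler stored, you can dump the bytes of the literal at run time. A minimal sketch, assuming a C++11 (but pre-C++20) compiler, where the u8 prefix requests UTF-8 as the execution encoding:

#include <cstdio>

int main() {
    // The u8 prefix (C++11) asks for UTF-8 in the executable, independent of the source encoding.
    // Without a prefix, the stored bytes depend on the implementation's execution character set.
    const char* hello = u8"動画、読書な";
    // Print the raw bytes the compiler emitted for the literal.
    for (const unsigned char* p = reinterpret_cast<const unsigned char*>(hello); *p != 0; ++p)
        std::printf("%02X ", *p);
    std::printf("\n");
}

Comparing the output of plain "..." and u8"..." on your compiler shows exactly how the literal from the question ends up in the binary.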
As stated in Phases of translation
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences (2.4) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character.
Source file characters are mapped to the basic source character set in an implementation-defined way; that much I clearly understand. But given "Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character", what should I do when I have to accept a character which is not a Unicode character? Should I map it to a single basic character set member, or to a universal-character-name (and further convert it to a basic-source-character sequence "\uXXXX"), both in my own manner? The latter, as far as I know, is not referred to in the standard.
When creating string literals in C++, I would like to know how the strings are encoded -- I can specify the encoding form (UTF-8, 16, or 32), but I want to know how the compiler determines the unspecified parts of the encoding.
For UTF-8 the byte-ordering is not relevant, and I would assume the byte ordering of UTF-16 and UTF-32 is, by default, the system byte-ordering. This leaves the normalization. As an example:
std::string u8foo = u8"Föo";
std::u16string u16foo = u"Föo";
std::u32string u32foo = U"Föo";
In all three cases, there are at least two possible encodings -- decomposed or composed. For more complex characters there might be multiple possible encodings, but I would assume that the compiler would generate one of the normalized forms.
Is this a safe assumption? Can I know in advance in what normalization the text in u8foo and u16foo is stored? Can I specify it somehow?
I am of the impression this is not defined by the standard, and that it is implementation specific. How does GCC handle it? Other compilers?
The interpretation of character strings outside of the basic source character set is implementation-dependent. (Standard quote below.) So there is no definitive answer; an implementation is not even obliged to accept source characters outside of the basic set.
Normalisation involves a mapping of possibly multiple source codepoints to possibly multiple internal codepoints, including the possibility of reordering the source character sequence (if, for example, diacritics are not in the canonical order). Such transformations are more complex than the source→internal transformation anticipated by the standard, and I suspect that a compiler which attempted them would not be completely conformant. In any event, I know of no compiler which does so.
So, in general, you should ensure that the source code you provide to the compiler is normalized as per your desired normalization form, if that matters to you.
In the particular case of GCC, the compiler interprets the source according to the default locale's encoding, unless told otherwise (with the -finput-charset command-line option). It will recode if necessary to Unicode codepoints. But it does not alter the sequence of codepoints. So if you give it a normalized UTF-8 string, that's what you get. And if you give it an unnormalized string, that's also what you get.
In this example on coliru, the first string is composed and the second one decomposed (although they are both in some normalization form). (The rendering of the second example string in coliru seems to be browser-dependent. On my machine, chrome renders them correctly, while firefox shifts the diacritics one position to the left. YMMV.)
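One way to check what a given compiler actually emits, assuming a C++11 compiler, is to compare a composed and a decomposed spelling of the same text; the code point counts (and the comparison) show that no normalization takes place:

#include <iostream>
#include <string>

int main() {
    // Composed: 'ö' as the single code point U+00F6.
    std::u32string composed = U"F\u00F6o";
    // Decomposed: 'o' followed by the combining diaeresis U+0308.
    std::u32string decomposed = U"Fo\u0308o";

    std::cout << composed.size() << '\n';          // 3 code points
    std::cout << decomposed.size() << '\n';        // 4 code points
    std::cout << (composed == decomposed) << '\n'; // 0 -- the compiler does not normalize
}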
The C++ standard defines the basic source character set (in §2.3/1) to be letters, digits, five whitespace characters (space, newline, tab, vertical tab and formfeed) and the symbols:
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
It gives the compiler a lot of latitude as to how it interprets the input, and how it handles characters outside of the basic source character set. §2.2 paragraph 1 (from C++14 draft n4527):
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (e.g., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
It's worth adding that diacritics are characters, from the perspective of the C++ standard. So the composed ñ (\u00f1) is one character and the decomposed ñ (\u006e \u0303) is two characters, regardless of how it looks to you.
A close reading of the above paragraph from the standard suggests that normalization or other transformations which are not strictly 1-1 are not permitted, although the compiler may be able to reject an input which contains characters outside the basic source character set.
Microsoft Visual C++ will keep the normalization used in the source file.
The main problem you have when doing this cross-platform is making sure the compilers are using the right encodings. Here is how MSVC handles it:
Source file encoding
The compiler has to read your source file with the right encoding.
MSVC doesn't have an option to specify the encoding on the command line but relies on the BOM to detect encoding, so it can read the following encodings:
UTF-16 with BOM, if the file starts with that BOM
UTF-8, if the file starts with "\xef\xbb\xbf" (the UTF-8 "BOM")
in all other cases, the file is read using an ANSI code page dependent on your system language setting. In practice this means you can only use ASCII characters in your source files.
Output encoding
Your Unicode strings will be encoded with some encoding before being written to your executable as a byte string.
Wide literals (L"...") are always written as UTF-16.
In MSVC 2010 you can use #pragma execution_character_set("utf-8") to have char strings encoded as UTF-8. By default they are encoded in your local code page. That pragma is apparently missing from MSVC 2012, but it's back in MSVC 2013.
#pragma execution_character_set("utf-8")
const char a[] = "ŦεŞŧ";
Support for the Unicode literals (u"..." and friends) was only introduced in MSVC 2015.
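For portable code, a sketch of the literal prefixes that pin down the output encoding regardless of the local code page (assuming MSVC 2015+ or a current GCC/Clang, and a source file the compiler reads correctly, e.g. UTF-8 with BOM for MSVC):

// These prefixes fix the encoding of the stored bytes independent of the system code page.
const char     utf8[]  = u8"ŦεŞŧ";  // UTF-8 (C++11 u8 prefix)
const char16_t utf16[] = u"ŦεŞŧ";   // UTF-16
const char32_t utf32[] = U"ŦεŞŧ";   // UTF-32
const wchar_t  wide[]  = L"ŦεŞŧ";   // UTF-16 on Windows, UTF-32 on most other platforms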
I have an array defined as follows:
extern const char config_reg[] = {
    0x05, //comment
    0x00, //comment
    0x00, // \\ <-- double backslash
    0x01, //comment
    0x03
};
As you can see, there is a double backslash inside a comment (the <-- double backslash and preceding spaces do not appear in the actual source file). When I compile this code (minus the "<-- double backslash") it acts as if the following line were non-existent - i.e. equivalent to writing:
extern const char config_reg[] = {
    0x05, //comment
    0x00, //comment
    0x00, //
    0x03
};
Is this intended C++ behaviour? If so, what is its intended purpose?
I am using the Parallax Propeller Simple IDE to compile my code - not a particularly good compiler, by all accounts. Is it likely that the compiler implementation is causing this behaviour?
That's correct, assuming that the <-- double backslash and preceding spaces aren't actually in the code.
A single backslash would also produce the same effect.
The newline splicing for backslash-newline occurs before comment analysis, so the 0x01 line is part of the same line as the // \\ comment, so it isn't seen when the comment analysis is done.
The ISO/IEC 14882:2011 (C++11) standard says:
2.2 Phases of translation [lex.phases]
¶1 The precedence among the syntax rules of translation is specified by the following phases. [11]

1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences (2.4) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)

2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. If, as a result, a character sequence that matches the syntax of a universal-character-name is produced, the behavior is undefined. A source file that is not empty and that does not end in a new-line character, or that ends in a new-line character immediately preceded by a backslash character before any such splicing takes place, shall be processed as if an additional new-line character were appended to the file.

3. The source file is decomposed into preprocessing tokens (2.5) and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment. [12] Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is unspecified. The process of dividing a source file's characters into preprocessing tokens is context-dependent. [ Example: see the handling of < within a #include preprocessing directive. —end example ]

11) Implementations must behave as if these separate phases occur, although in practice different phases might be folded together.

12) A partial preprocessing token would arise from a source file ending in the first portion of a multi-character token that requires a terminating sequence of characters, such as a header-name that is missing the closing " or >. A partial comment would arise from a source file ending with an unclosed /* comment.
Yes, the second phase of translation involves "splicing physical source lines to form logical source lines"; if a line ends with a backslash, the following line is considered to be a continuation of that line. This is the standard behaviour. This occurs before the removal of comments in the third phase, so the fact that the backslash occurs in a comment doesn't change anything.
Line splicing is used quite frequently in C to split macros over multiple lines, since a preprocessor directive extends to the end of the line. It is much rarer in C++, which relies much less on macros than C.
I believe the original purpose in C was to work around restrictions on line length that existed on some now-archaic systems.
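A sketch of both uses of splicing, the deliberate one (continuing a macro definition) and the accidental one from the question (a comment ending in a backslash); note that GCC and Clang warn about the latter with -Wcomment:

#include <iostream>

// The backslash-newline splices the macro definition onto one logical line.
#define GREET(name) \
    std::cout << "Hello, " << (name) << '\n'

int main() {
    int visible = 1;    // this comment ends in a backslash \
    int swallowed = 2;  // ...so this whole physical line is part of the comment above
    GREET("world");
    std::cout << visible << '\n';
    // std::cout << swallowed << '\n';  // would not compile: 'swallowed' was never declared
    return 0;
}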
A \ at the end of a line escapes the newline character.
Thus in your example, it will extend the comment to the next line. The writer of the snippet probably used \\ instead of just \ for aesthetic purposes. But it doesn't only work with comments. For example this is allowed (but redundant):
int a; \
int b;
Some compilers allow whitespace between the \ and the newline but may issue a warning.
I can't understand what this means in the C++ standard:
Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
As I understand it, if the compiler sees a character not in the basic character set, it just replaces it with a sequence of characters in the format \uNNNN or \UNNNNNNNN. But I don't get how it obtains this NNNN or NNNNNNNN.
So this is my question: how is the conversion done?
Note the preceding sentence which states:
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary.
That is, it's entirely up to the compiler how it actually interprets the characters or bytes that make up your file. In doing this interpretation, it must decide which of the physical characters belong to the basic source character set and which don't. If a character does not belong, then it is replaced with the universal character name (or at least, the effect is as if it had done).
The point of this is to reduce the source file down to a very small set of characters - there are only 96 characters in the basic source character set. Any character not in the basic source character set has been replaced by \, u or U, and some hexadecimal digits (0-F).
A universal character name is one of:
\uNNNN
\UNNNNNNNN
Where each N is a hexadecimal digit. The meaning of these digits is given in §2.3:
The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for a universal-character-name corresponds to a surrogate code point (in the range 0xD800–0xDFFF, inclusive), the program is ill-formed.
The ISO/IEC 10646 standard originated before Unicode and defined the Universal Character Set (UCS). It assigned code points to characters and specified how those code points should be encoded. The Unicode Consortium and the ISO group then joined forces to work on Unicode. The Unicode standard specifies much more than ISO/IEC 10646 does (algorithms, functional character specifications, etc.) but both standards are now kept in sync.
So you can think of the NNNN or NNNNNNNN as the Unicode code point for that character.
As an example, consider a line in your source file containing this:
const char* str = "Hellô";
Since ô is not in the basic source character set, that line is internally translated to:
const char* str = "Hell\u00F4";
This will give the same result.
There are only certain parts of your code in which a universal-character-name is permitted:
In string literals
In character literals
In identifiers (however, this is not very well supported)
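A small sketch of those three contexts (identifier support in particular varies between compilers):

int main() {
    const char* s = "Hell\u00F4";   // in a string literal: the same as "Hellô"
    char32_t c = U'\u00F4';         // in a character literal
    int caf\u00E9 = 1;              // in an identifier (not well supported everywhere)
    (void)s; (void)c; (void)caf\u00E9;
}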
But I don't get how to obtain this NNNN or NNNNNNNN. So this is my question: how to do conversion?
The mapping is implementation-defined (e.g. §2.3 footnote 14). For instance if I save the following file as Latin-1:
#include <iostream>
int main() {
    std::cout << "Hellö\n";
}
And compile it with g++ on OS X, I get the following output after running it:
Hell�
… but if I had saved it as UTF-8 I would have gotten this:
Hellö
Because GCC assumes UTF-8 as the input encoding on my system.
Other compilers may perform different mappings.
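If the source is not in the encoding the compiler assumes, you can say so explicitly; for GCC this is the -finput-charset option mentioned above (and -fexec-charset controls the encoding used for narrow literals in the output):

g++ -finput-charset=ISO-8859-1 -fexec-charset=UTF-8 hello.cpp  # read Latin-1 source, emit UTF-8 literals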
So, if your file is called Hello°¶.c, then when the compiler uses that name internally, e.g. if we do:
cout << __FILE__ << endl;
the compiler would translate Hello°¶.c to Hello\u00b0\u00b6.c.
However, when I just tried this with g++ it doesn't do that...
But the assembler output contains:
.string "Hello\302\260\302\266.c"
The latest draft of C++0x, n3126, says:
Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines.
...
Within the r-char-sequence of a raw string literal, any transformations performed in
phases 1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted.
Technically this means that the C++ preprocessor only recognizes a backslash followed by the newline character, but I know that some C++ implementations also accept Windows- or classic Mac-style line endings.
Will conforming implementations of C++0x be required to preserve the newline sequence that immediately followed a backslash character \ within the r-char-sequence of a raw string? Maybe a better question is: would it be expected of a Windows C++0x compiler to undo each line splice with "\\\r\n" instead of "\\\n"?
Translation phase 1 starts with
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing newline characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced [...]
I'd interpret the requirement "any transformations performed in phases 1 and 2 (trigraphs, universal-character-names, and line splicing)" as explicitly not reverting the transformation from source file characters to the basic source character set. Instead, source characters are later converted to the execution character set, and you get newline characters there.
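A sketch of that behaviour, under the assumption (true for the implementations I know of) that the newline stored in a raw string literal is the single '\n' introduced in phase 1, whatever line ending the source file used:

#include <cstring>

int main() {
    // The physical line break inside the raw string became '\n' in phase 1,
    // and reverting phases 1 and 2 does not bring an original "\r\n" back.
    const char* raw = R"(line one
line two)";
    return std::strchr(raw, '\r') != nullptr;  // expected exit status: 0 (no carriage return)
}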
If you need a specific line ending sequence, you can insert it explicitly, and use string literal concatenation:
const char* nitpicky = "I must have a \\r\\n line ending!\r\n"
                       "Otherwise, some other piece of code will misinterpret this line!";