C++ Compilation. Phase of translation #1. Universal character name - c++

I can't understand what this means in the C++ standard:
Any source file character not in the basic source character set (2.3)
is replaced by the universal-character-name that designates that
character. (An implementation may use any internal encoding, so long
as an actual extended character encountered in the source file, and
the same extended character expressed in the source file as a
universal-character-name (i.e., using the \uXXXX notation), are
handled equivalently except where this replacement is reverted in a
raw string literal.)
As I understand it, if the compiler sees a character not in the basic character set, it just replaces it with a sequence of characters in the format '\uNNNN' or '\UNNNNNNNN'. But I don't get how to obtain this NNNN or NNNNNNNN.
So this is my question: how is this conversion done?

Note the preceding sentence which states:
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary.
That is, it's entirely up to the compiler how it actually interprets the characters or bytes that make up your file. In doing this interpretation, it must decide which of the physical characters belong to the basic source character set and which don't. If a character does not belong, then it is replaced with the universal character name (or at least, the effect is as if it had done).
The point of this is to reduce the source file down to a very small set of characters - there are only 96 characters in the basic source character set. Any character not in the basic source character set has been replaced by \, u or U, and some hexadecimal digits (0-F).
A universal character name is one of:
\uNNNN
\UNNNNNNNN
Where each N is a hexadecimal digit. The meaning of these digits is given in §2.3:
The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for a universal-character-name corresponds to a surrogate code point (in the range 0xD800–0xDFFF, inclusive), the program is ill-formed.
The ISO/IEC 10646 standard originated before Unicode and defined the Universal Character Set (UCS). It assigned code points to characters and specified how those code points should be encoded. The Unicode Consortium and the ISO group then joined forces to work on Unicode. The Unicode standard specifies much more than ISO/IEC 10646 does (algorithms, functional character specifications, etc.) but both standards are now kept in sync.
So you can think of the NNNN or NNNNNNNN as the Unicode code point for that character.
As an example, consider a line in your source file containing this:
const char* str = "Hellô";
Since ô is not in the basic source character set, that line is internally translated to:
const char* str = "Hell\u00F4";
This will give the same result.
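To make the "how do I obtain the NNNN?" part concrete, here is a minimal sketch, assuming the source and execution encodings are both UTF-8 (the default for GCC and Clang on most platforms): it checks that the two spellings of the literal are handled equivalently and decodes the UTF-8 bytes of ô back into the code point that the UCN digits name.

#include <cstdio>
#include <cstring>

int main() {
    const char* a = "Hellô";        // extended character written directly
    const char* b = "Hell\u00F4";   // the same character written as a UCN

    // Phase 1 requires the two spellings to be handled equivalently, so with
    // a UTF-8 execution encoding the stored byte sequences are identical.
    std::printf("equal: %d\n", std::strcmp(a, b) == 0);   // equal: 1

    // Decode the two UTF-8 bytes of ô (0xC3 0xB4) back to its code point.
    unsigned char c0 = static_cast<unsigned char>(a[4]);
    unsigned char c1 = static_cast<unsigned char>(a[5]);
    unsigned cp = ((c0 & 0x1Fu) << 6) | (c1 & 0x3Fu);
    std::printf("code point: U+%04X\n", cp);              // code point: U+00F4
}

In other words, the NNNN digits are simply the ISO/IEC 10646 (Unicode) code point of the character, however your compiler happens to read it from the file.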
A universal-character-name is permitted only in certain parts of your code (a short sketch follows this list):
In string literals
In character literals
In identifiers (however, this is not very well supported)
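As a rough illustration of those three contexts (a sketch, not a portability guarantee: the identifier line in particular may be rejected by older compilers, which is the "not very well supported" caveat above):

#include <iostream>

int main() {
    const char* s = "caf\u00E9";   // UCN in a string literal ("café" under a UTF-8 execution charset)
    wchar_t c = L'\u00E9';         // UCN in a character literal
    int caf\u00E9 = 42;            // UCN in an identifier (the identifier "café")
    std::cout << s << ' ' << static_cast<long>(c) << ' ' << caf\u00E9 << '\n';
}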

But I don't get how to obtain this NNNN or NNNNNNNN. So this is my question: how is this conversion done?
The mapping is implementation-defined (e.g. §2.3 footnote 14). For instance if I save the following file as Latin-1:
#include <iostream>

int main() {
    std::cout << "Hellö\n";
}
and compile it with g++ on OS X, I get the following output when I run it:
Hell�
… but if I had saved it as UTF-8 I would have gotten this:
Hellö
Because GCC assumes UTF-8 as the input encoding on my system.
Other compilers may perform different mappings.
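If you want to see which mapping your compiler actually applied, one simple check is to dump the bytes it stored for the literal (with GCC you can also steer the mapping explicitly via -finput-charset and -fexec-charset). A sketch, assuming a UTF-8 source file:

#include <cstdio>

int main() {
    const char s[] = "Hellö";
    for (unsigned char b : s)
        std::printf("%02x ", b);   // e.g. 48 65 6c 6c c3 b6 00 with UTF-8 in and out
    std::printf("\n");
}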

So, if your file is called Hello°¶.c, then when the compiler uses that name internally, e.g. if we do:
cout << __FILE__ << endl;
it would translate Hello°¶.c to Hello\u00b0\u00b6.c.
However, when I just tried this with g++, it didn't do that...
But the assembler output contains:
.string "Hello\302\260\302\266.c"

Related

Why are string literals parsed for trigraph sequences in Gnu gcc/g++?

Consider this innocuous C++ program:
#include <iostream>

int main() {
    std::cout << "(Is this a trigraph??)" << std::endl;
    return 0;
}
When I compile it using g++ version 5.4.0, I get the following diagnostic:
me@my-laptop:~/code/C++$ g++ -c test_trigraph.cpp
test_trigraph.cpp:4:36: warning: trigraph ??) ignored, use -trigraphs to enable [-Wtrigraphs]
std::cout << "(Is this a trigraph??)" << std::endl;
^
The program runs, and its output is as expected:
(Is this a trigraph??)
Why are string literals parsed for trigraphs at all?
Do other compilers do this, too?
Trigraphs were handled in translation phase 1 (they are removed in C++17, however). String literal related processing happens in subsequent phases. As the C++14 standard specifies (n4140) [lex.phases]/1.1:
The precedence among the syntax rules of translation is specified by
the following phases.
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set
(introducing new-line characters for end-of-line indicators) if
necessary. The set of physical source file characters accepted is
implementation-defined. Trigraph sequences ([lex.trigraph]) are
replaced by corresponding single-character internal representations.
Any source file character not in the basic source character set
([lex.charset]) is replaced by the universal-character-name that
designates that character. (An implementation may use any internal
encoding, so long as an actual extended character encountered in the
source file, and the same extended character expressed in the source
file as a universal-character-name (i.e., using the \uXXXX notation),
are handled equivalently except where this replacement is reverted in
a raw string literal.)
This happens first because, as you were told in the comments, the characters that trigraphs stood for needed to be printable as well.
This behaviour is inherited from C compilers and the old days of serial terminals, where only 7 bits were used (the 8th being a parity bit). To allow non-English languages with special characters (for example the accented àéèêîïôù in French or ñ in Spanish), the ISO/IEC 646 code pages reused some ASCII (7-bit) codes to represent them. In particular, the codes 0x23 and 0x24 (# and $ in ASCII), 0x40 (@), 0x5B to 0x5E ([\]^), 0x60 (`) and 0x7B to 0x7E ({|}~) could be replaced by national variants1.
As they have special meaning in C, they could be replaced in source code with trigraphs using only the invariant part of ISO 646.
For compatibility reasons, this was kept up to C++14, when only dinosaurs still remember the (not so good) days of ISO 646 and 7-bit-only code pages.
1 For example, the French variant used: 0x23 £, 0x40 à 0x5B-0x5D °ç§, 0x60 µ, 0x7B-0x7E éùè¨
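For completeness, here is what trigraphs looked like in practice. This is only a sketch for pre-C++17 compilers: GCC ignores trigraphs by default (with the warning shown in the question), so it needs something like g++ -std=c++14 -trigraphs to build.

#include <cstdio>

int main() ??<                        // ??< is the trigraph for {
    int a??(3??) = ??< 1, 2, 3 ??>;   // ??( ??) stand for [ ], ??< ??> for { }
    std::printf("%d\n", a??(2??));    // prints 3
    return 0;
??>                                   // ??> is the trigraph for }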

String characters interpretation by compiler

Let's start with a simple line of C++:
char const* hello = "動画、読書な"; // I hope it is not offensive, I don't know what this means ))
Note that this line is stored in a UTF-8 encoded file.
When I pass the file with this line for compilation (the result is binary code), the compiler does the following steps:
Reads the file (it needs to know the file encoding; in the case of UTF-8 that will probably be easy thanks to the BOM, but what about other encodings?)
Parses the file content according to the grammar, builds a syntax tree, ...
If everything is fine, it starts writing binary code; at this stage it saves the constants into the code.
The question is how it will store the constant above ("動画、読書な"). Does it convert it somehow?
Or does it just read the bytes after the " character until another " and store them as they are? Does that mean the final binary code depends on the original source file encoding?
The source code has to be translated into the basic source character set (essentially ASCII) in an implementation-defined way, preserving characters outside that set by means of escape sequences where necessary:
ISO/IEC 14882:2003(E)
2.1.1 Phases of translation
Physical source file characters are mapped, in an implementation-defined manner, to the basic source
character set (introducing new-line characters for end-of-line
indicators) if necessary. Trigraph sequences (2.3) are replaced by
corresponding single-character internal representations. Any source
file character not in the basic source character set (2.2) is replaced
by the universal-character-name that designates that character. (An
implementation may use any internal encoding, so long as an actual
extended character encountered in the source file, and the same
extended character expressed in the source file as a
universal-character-name (i.e. using the \uXXXX notation), are handled
equivalently.)
...
15) The glyphs for the members of the basic
source character set are intended to identify characters from the
subset of ISO/IEC 10646 which corresponds to the ASCII character set.
However, because the mapping from source file characters to the source
character set (described in translation phase 1) is specified as
implementation-defined, an implementation is required to document how
the basic source characters are represented in source files.
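To see what this means for the literal in the question, here is a sketch assuming the file is saved as UTF-8 and the execution charset is also UTF-8 (the GCC/Clang default; MSVC may re-encode to the local ANSI code page unless told otherwise), which simply dumps what ended up in the binary:

#include <cstdio>

int main() {
    char const* hello = "動画、読書な";
    // 6 characters, 3 UTF-8 bytes each, plus the terminating 0 = 19 bytes
    std::printf("sizeof: %zu\n", sizeof("動画、読書な"));
    for (const char* p = hello; *p; ++p)
        std::printf("%02x ", static_cast<unsigned char>(*p));
    std::printf("\n");
}

With a different execution charset the stored bytes (and the sizeof) could differ, which is exactly the "it depends on the implementation" answer above.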

identifier character set (clang)

I never use clang.
And I accidentally discovered that this piece of code:
#include <iostream>

void функция(int переменная)
{
    std::cout << переменная << std::endl;
}

int main()
{
    int русская_переменная = 0;
    функция(русская_переменная);
}
compiles fine: http://rextester.com/NFXBL38644 (clang 3.4 (clang++ -Wall -std=c++11 -O2)).
Is this a Clang extension? And why?
Thanks.
UPD: I'm asking more about why Clang made this decision, since I never found any discussion in which someone asked for more characters than the C++ standard currently allows (2.3, rev. 3691).
It's not so much an extension as it is Clang's interpretation of the Multibyte characters part of the standard. Clang supports UTF-8 source code files.
As to why, I guess "why not?" is the only real answer; it seems useful and reasonable to me to support a larger character set.
Here are the relevant parts of the standard (C11 draft):
5.2.1 Character sets
1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.
2 In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters. A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.
3 Both the basic source and basic execution character sets shall have the following members: the 26 uppercase letters of the Latin alphabet
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
the 26 lowercase letters of the Latin alphabet
a b c d e f g h i j k l m
n o p q r s t u v w x y z
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
the following 29 graphic characters
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
the space character, and control characters representing horizontal tab, vertical tab, and form feed. The representation of each member of the source and execution basic character sets shall fit in a byte. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. In source files, there shall be some way of indicating the end of each line of text; this International Standard treats such an end-of-line indicator as if it were a single new-line character. In the basic execution character set, there shall be control characters representing alert, backspace, carriage return, and new line. If any other characters are encountered in a source file (except in an identifier, a character constant, a string literal, a header name, a comment, or a preprocessing token that is never
converted to a token), the behavior is undefined.
4 A letter is an uppercase letter or a lowercase letter as defined above; in this International Standard the term does not include other characters that are letters in other alphabets.
5 The universal character name construct provides a way to name other characters.
And also:
5.2.1.2 Multibyte characters
1 The source character set may contain multibyte characters, used to represent members of the extended character set. The execution character set may also contain multibyte characters, which need not have the same encoding as for the source character set. For both character sets, the following shall hold:
— The basic character set shall be present and each character shall be encoded as a single byte.
— The presence, meaning, and representation of any additional members is locale- specific.
— A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence. While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state. The interpretation for subsequent bytes in the sequence is a function of the current shift state.
— A byte with all bits zero shall be interpreted as a null character independent of shift state. Such a byte shall not occur as part of any other multibyte character.
2 For source files, the following shall hold:
— An identifier, comment, string literal, character constant, or header name shall begin and end in the initial shift state.
— An identifier, comment, string literal, character constant, or header name shall consist of a sequence of valid multibyte characters.
Given clang's usage of UTF-8 as the source encoding, this behavior is mandated by the standard:
C++ defines an identifier as the following:
identifier:
identifier-nondigit
identifier identifier-nondigit
identifier digit
identifier-nondigit:
nondigit
universal-character-name
other implementation-defined characters
The important part here is that identifiers can include universal-character-names. The specification also lists the allowed UCNs:
Annex E (normative)
Universal character names for identifier characters [charname]
E.1 Ranges of characters allowed [charname.allowed]
00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00B7-00BA, 00BC-00BE, 00C0-00D6, 00D8-00F6, 00F8-00FF
0100-167F, 1681-180D, 180F-1FFF
200B-200D, 202A-202E, 203F-2040, 2054, 2060-206F
2070-218F, 2460-24FF, 2776-2793, 2C00-2DFF, 2E80-2FFF
3004-3007, 3021-302F, 3031-303F
3040-D7FF
F900-FD3D, FD40-FDCF, FDF0-FE44, FE47-FFFD
10000-1FFFD, 20000-2FFFD, 30000-3FFFD, 40000-4FFFD, 50000-5FFFD,
60000-6FFFD, 70000-7FFFD, 80000-8FFFD, 90000-9FFFD, A0000-AFFFD,
B0000-BFFFD, C0000-CFFFD, D0000-DFFFD, E0000-EFFFD
The Cyrillic characters in your identifier are in the range 0100-167F.
The C++ specification further mandates that characters encoded in the source encoding be handled identically to UCNs:
Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently
— n3337 §2.2 Phases of translation [lex.phases]/1
So given clang's choice of UTF-8 as the source encoding, the spec mandates that these characters be converted to UCNs (or that clang's behavior be indistinguishable from performing such a conversion), and these UCNs are permitted by the spec to appear in identifiers.
It goes even further. Emoji characters happen to be in the ranges allowed by the C++ spec, so if you've seen some of those examples of Swift code with emoji identifiers and were surprised by such capability you might be even more surprised to know that C++ has exactly the same capability:
http://rextester.com/EPYJ41676
http://imgur.com/EN6uanB
Another fact that may be surprising is that this behavior isn't new with C++11; C++ has mandated it since C++98. Compilers simply ignored it for a long time: Clang implemented this feature in version 3.3 in 2013, and according to this documentation Microsoft Visual Studio supports it as of 2015.
Even today, GCC 6.1 only supports UCNs in identifiers when they are written literally, and does not obey the mandate that any character in its extended source character set must be treated identically to the corresponding universal-character-name. E.g. GCC allows int \u043a\u043e\u0448\u043a\u0430 = 10; but will not allow int кошка = 10; even with -finput-charset=utf-8.
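As a complete example of the distinction, the UCN-spelled identifier below compiles with the GCC versions discussed here (and with Clang), whereas spelling it as raw UTF-8 (int кошка = 10;) did not work in GCC 6.1; recent GCC releases accept both spellings.

#include <iostream>

int main() {
    int \u043a\u043e\u0448\u043a\u0430 = 10;         // the identifier "кошка", spelled with UCNs
    std::cout << \u043a\u043e\u0448\u043a\u0430 << '\n';
}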

Printing universal characters

Can anyone explain why universal character literals (eg "\u00b1") are being encoded into char strings as UTF-8? Why does the following print the plus/minus symbol?
#include <iostream>
#include <cstring>

int main()
{
    std::cout << "\u00b1" << std::endl;
    return 0;
}
Is this related to my current locale?
2.13.2. [...]
5/ A universal-character-name is translated to the encoding, in
the execution character set, of the character named. If there is no
such encoding, the universal-character-name is translated to an
implementation defined encoding. [Note: in translation phase 1, a
universal-character-name is introduced whenever an actual extended
character is encountered in the source text. Therefore, all extended
characters are described in terms of universal-character-names.
However, the actual compiler implementation may use its own native
character set, so long as the same results are obtained. ]
and
2.2. [...] The values of the members of the execution character sets
are implementation-defined, and any additional members are
locale-specific.
In short, the answer to your question is in your compiler documentation. However:
2.2. 2/ The character designated by the universal-character-name
\UNNNNNNNN is that character whose character short name in ISO/IEC
10646 is NNNNNNNN; the character designated by the
universal-character-name \uNNNN is that character whose character
short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for
a universal character name is less than 0x20 or in the range 0x7F-0x9F
(inclusive), or if the universal character name designates a character
in the basic source character set, then the program is ill-formed.
so you are guaranteed that the character you name is translated into an implementation-defined encoding, possibly locale-specific.
\u00b1 is the ± symbol because that is its Unicode code point, regardless of locale.
Your code at ideone, see here.
String literals, e.g. "abcdef", are simple byte arrays (of type const char[]). The compiler encodes non-ASCII characters in them into something implementation-defined. Rumor has it that Visual C++ uses the current Windows ANSI code page, and GCC uses UTF-8, so you're probably on GCC :)
So, \uABCD is interpreted by the compiler at compile time and converted into the corresponding value in that encoding. That is, it can put one or more bytes into the byte array:
sizeof("\uFE58z") == 3 // visual C++ 2010
sizeof("\uFE58z") == 5 // gcc 4.4 mingw
And yet how cout prints the byte array depends on the locale settings. You can change the stream's locale via a std::ios_base::imbue() call.
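A sketch that reproduces the sizeof comparison above by dumping the bytes the compiler produced for the UCN (with a UTF-8 execution charset, as in GCC/Clang by default, \u00b1 becomes the two bytes c2 b1; with Visual C++ and a Windows-1252 execution charset it becomes the single byte b1):

#include <cstdio>

int main() {
    const char s[] = "\u00b1";
    std::printf("sizeof: %zu\nbytes: ", sizeof(s));
    for (unsigned char b : s)
        std::printf("%02x ", b);   // c2 b1 00 under UTF-8
    std::printf("\n");
}
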
C++ Character Sets
With the standardization of C++, it's useful to review some of the mechanisms included in the language for dealing with character sets. This might seem like a very simple issue, but there are some complexities to contend with.
The first idea to consider is the notion of a "basic source character set" in C++. This is defined to be:
all ASCII printing characters 041 - 0177, save for @ $ ` DEL
space
horizontal tab
vertical tab
form feed
newline
or 96 characters in all. These are the characters used to compose a C++ source program.
Some national character sets, such as the European ISO-646 one, use some of these character positions for other letters. The ASCII characters so affected are:
# [ ] { } | \ ^ ~
To get around this problem, C++ defines trigraph sequences that can be used to represent these characters:
[ ??(
] ??)
{ ??<
} ??>
| ??!
\ ??/
# ??=
^ ??'
~ ??-
Trigraph sequences are mapped to the corresponding basic source character early in the compilation process.
C++ also has the notion of "alternative tokens", that can be used to replace tokens with others. The list of tokens and their alternatives is this:
{ <%
} %>
[ <:
] :>
# %:
## %:%:
&& and
| bitor
|| or
^ xor
~ compl
& bitand
&= and_eq
|= or_eq
^= xor_eq
! not
!= not_eq
Another idea is the "basic execution character set". This includes all of the basic source character set, plus control characters for alert, backspace, carriage return, and null. The "execution character set" is the basic execution character set plus additional implementation-defined characters. The idea is that a source character set is used to define a C++ program itself, while an execution character set is used when a C++ application is executing.
Given this notion, it's possible to manipulate additional characters in a running program, for example characters from Cyrillic or Greek. Character constants can be expressed using any of:
\137 octal
\xabcd hexadecimal
\U12345678 universal character name (ISO/IEC 10646)
\u1234 -> \U00001234
This notation uses the source character set to define execution set characters. Universal character names can be used in identifiers (if letters) and in character literals:
'\u1234'
L'\u2345'
The above features may not yet exist in your local C++ compiler. They are important to consider when developing internationalized applications.
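As a small sketch of the escape notations above (assuming an ASCII-based execution character set, where '@' has the value 0x40; '@' is deliberately chosen because it is not in the basic source character set):

#include <cstdio>

int main() {
    char a = '\100';      // octal escape
    char b = '\x40';      // hexadecimal escape
    char c = '\u0040';    // universal character name for U+0040
    std::printf("%c %c %c\n", a, b, c);   // prints: @ @ @
}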

Why does stringizing a euro sign within a string literal using UTF-8 not produce a UCN?

The spec says that at phase 1 of compilation
Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character.
And at phase 4 it says
Preprocessing directives are executed, macro invocations are expanded
At phase 5, we have
Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set
For the # operator, we have
a \ character is inserted before each " and \ character of a character literal or string literal (including the delimiting " characters).
Hence I conducted the following test
#define GET_UCN(X) #X
GET_UCN("€")
With an input character set of UTF-8 (matching my file's encoding), I expected the following preprocessing result of the #X operation: "\"\\u20AC\"". GCC, Clang and boost.wave don't transform the € into a UCN and instead yield "\"€\"". I feel like I'm missing something. Can you please explain?
It's simply a bug. §2.1/1 says about Phase 1,
(An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)
This is not a note or footnote. C++0x adds an exception for raw string literals, which might solve your problem at hand if you have one.
This program clearly demonstrates the malfunction:
#include <iostream>

#define GET_UCN(X) L ## #X

int main() {
    std::wcout << GET_UCN("€") << '\n' << GET_UCN("\u20AC") << '\n';
}
http://ideone.com/lb9jc
Because both strings are wide, the first is required to be corrupted into several characters if the compiler fails to interpret the input multibyte sequence. In your given example, total lack of support for UTF-8 could cause the compiler to slavishly echo the sequence right through.
"and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set"
used to be
"or universal-character-name in character literals and string literals is converted to a member of the execution character set"
Maybe you need a future version of g++.
I'm not sure where you got that citation for translation phase 1—the C99 standard says this about translation phase 1 in §5.1.1.2/1:
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
So in this case, the Euro character € (represented as the multibyte sequence E2 82 AC in UTF-8) is mapped into the source character set, which also happens to be UTF-8, so its representation remains the same. It doesn't get converted into a universal character name because, well, there's nothing that says that it should.
I suspect you'll find that the euro sign does not satisfy the condition "Any source file character not in the basic source character set", so the rest of the text you quote doesn't apply.
Open your test file with your favourite binary editor and check what value is used to represent the euro sign in GET_UCN("€")
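Instead of a binary editor, you can also dump at run time the bytes that the preprocessor and compiler produced for the stringized literal. With UTF-8 input you would expect to see 22 e2 82 ac 22 (a quoted euro sign) and no sign of a \u20AC spelling, which is the behaviour observed above. A sketch:

#include <cstdio>

#define GET_UCN(X) #X

int main() {
    const char* s = GET_UCN("€");
    for (const char* p = s; *p; ++p)
        std::printf("%02x ", static_cast<unsigned char>(*p));
    std::printf("\n%s\n", s);   // prints: "€"
}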