I was reading the GCC C preprocessor documentation, section Tokenization, in which it is mentioned that
Preprocessing tokens fall into five broad classes:
identifiers
preprocessing numbers
string literals
punctuators
other.
Any other single character is considered “other”.
It is passed on to the preprocessor's output unmolested.
The C compiler will almost certainly reject source code containing “other” tokens.
In ASCII, the only other characters are ‘#’, ‘$’, ‘`’, and control characters other
than NUL (all bits zero).
I was also browsing the web and came across a 'C Character Set' article in which '#' is listed as one of the characters.
Is the article that lists '#' as part of the 'C Character Set' wrong, or is my understanding wrong?
Thanks.
There are some compilers that allow "extra" characters, such as # or $, as part of identifiers. This is not part of the standard, but an extension. From memory, the C++ standard is worded in a way that indicates that "a compiler may add extra characters". (A small sketch of such an extension follows the quoted passage below.)
Section 2.3:
The basic source character set consists of 96 characters: the space
character, the control characters representing horizontal tab,
vertical tab, form feed, and new-line, plus the following 91 graphical
characters:(14)
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
(Note 14: The glyphs for the members of the basic source character set
are intended to identify characters from the subset of ISO/IEC 10646
which corresponds to the ASCII character set. However, because the
mapping from source file characters to the source character set
(described in translation phase 1) is specified as
implementation-defined, an implementation is required to document how
the basic source characters are represented in source files.)
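As a concrete illustration of such an extension (a sketch, not something the standard guarantees): GCC and Clang accept '$' in identifiers on most targets, and GCC exposes this through -fdollars-in-identifiers / -fno-dollars-in-identifiers.
// Accepted as an extension by GCC and Clang on most targets;
// strict ISO / pedantic settings may warn about or reject it.
int $count = 0;
int get$count(void)
{
    return $count;
}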
I'm not sure that your question is completely clear. Both the
C and the C++ standards require the compiler to support all of
the characters in Unicode, although not necessarily in
a transparent fashion: how the compiler maps input into its
internal character set is implementation defined. But by this
definition, all compilers are required to accept #, $,
etc.
What you can do with any specific character is a different
question, and there are a lot of characters (like # and $)
which can only appear in a comment, a string literal or
a character literal (which falls under the string literals class in the text you quote). Identifiers, for example, may only contain _
and characters for which the Unicode type is a letter or a digit
(roughly speaking—the standard specifies exactly what
characters are and are not allowed).
Since how the implementation maps the characters in the
input to the source character set is implementation defined,
a compiler can map 0x23 (which would be a # in ASCII, Latin-1
or Unicode) to some other character, which is allowed in
a symbol. I don't know of any which take this route; I suspect,
in fact, that a compiler which wanted to allow # or $ in
a symbol would simply choose to be non-conformant, rather than
make it impossible to have the character in a string literal.
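For instance (a minimal sketch): '#' is fine at the start of a directive and inside string or character literals, but a stray '#' in ordinary code is rejected by the compiler proper, exactly as the GCC text quoted above predicts.
#include <cstdio>              // '#' introducing a directive: fine

int main()
{
    const char *s = "#1";      // '#' inside a string literal: fine
    char c = '#';              // '#' inside a character literal: fine
    std::printf("%s %c\n", s, c);
    // int x = 1 # 2;          // error if uncommented: stray '#' in program
}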
I presume you mean the character set you get when you set LANG=C?
That's not the same thing at all. That's a locale that basically just says "use ASCII" with no special extras. It requires no extra translation files or terminal support. It just means you get the default output from everything.
Alternatively, maybe you really did mean the set of characters that may appear in a C program?
Don't forget that C programs may use those characters inside quotes. Just because they don't have a meaning in any language keyword or variable doesn't mean they can't exist in the file. On the other hand, it may be an error to include UTF-8 characters inside a C string, for example.
Just because a character is valid inside a C program, doesn't mean it's valid everywhere. The if keyword is not valid outside a function, for instance.
Related
I want to extensively use the ##-operator and enum magic to handle a huge bunch of similar access-operations, error handling and data flow.
If applying the ## and # preprocessor operators results in an invalid pp-token, the behavior is undefined in C.
The order of preprocessor operation in general is not defined (*) in C90 (see The token pasting operator). Now in some cases it happens (said so in different sources, including the MISRA Committee and the referenced page) that the order of multiple ##/# operators influences the occurrence of undefined behavior. But I have a really hard time understanding the examples from these sources and pinning down the common rule.
So my questions are:
What are the rules for valid pp-tokens?
Are there differences between the different C and C++ standards?
My current problem: Is the following legal with both operator orders? (**)
#define test(A) test_## A ## _THING
int test(0001) = 2;
Comments:
(*) I don't say "is undefined" because IMHO this has nothing to do with undefined behavior yet, but rather with unspecified behavior. Applying more than one ## or # operator does not in general make the program erroneous. There is obviously some order; we just can't predict which, so the order is unspecified.
(**) This is not the actual application of the numbering, but the pattern is equivalent.
What are the rules for valid pp-tokens?
These are spelled out in the respective standards; C11 §6.4 and C++11 §2.4. In both cases, they correspond to the production preprocessing-token. Aside from pp-number, they shouldn't be too surprising. The remaining possibilities are identifiers (including keywords), "punctuators" (in C++, preprocessing-op-or-punc), string and character literals, and any single non-whitespace character which doesn't match any other production.
With a few exceptions, any sequence of characters can be decomposed into a sequence of valid preprocessing-tokens. (One exception is unmatched quotes and apostrophes: a single quote or apostrophe is not a valid preprocessing-token, so a text including an unterminated string or character literal cannot be tokenised.)
In the context of the ## operator, though, the result of the concatenation must be a single preprocessing-token. So an invalid concatenation is a concatenation whose result is a sequence of characters which comprise multiple preprocessing-tokens.
Are there differences between C and C++?
Yes, there are slight differences:
C++ has user defined string and character literals, and allows "raw" string literals. These literals will be tokenized differently in C, so they might be multiple preprocessing-tokens or (in the case of raw string literals) even invalid preprocessing-tokens.
C++ includes the symbols ::, .* and ->*, all of which would be tokenised as two punctuator tokens in C. Also, in C++, some things which look like keywords (eg. new, delete) are part of preprocessing-op-or-punc (although these symbols are valid preprocessing-tokens in both languages.)
C allows hexadecimal floating point literals (eg. 1.1p-3), which are not valid preprocessing-tokens in C++.
C++ allows apostrophes to be used in integer literals as separators (1'000'000'000). In C, this would probably result in unmatched apostrophes.
There are minor differences in the handling of universal character names (eg. \u0234).
In C++, <:: will be tokenised as <, :: unless it is followed by : or >. (<::: and <::> are tokenised normally, using the longest-match rule.) In C, there is no exception to the longest-match rule, so the first token of <:: is always the digraph <:. (A small sketch of why C++ carves out this exception follows.)
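To see why C++ needs the <:: exception, consider this declaration (a small sketch): without the special case, <:: would tokenise as the digraph <: followed by :, and the template argument list could not be parsed.
#include <string>
#include <vector>

std::vector<::std::string> names;   // C++11: tokenised as '<' '::' 'std' ...
                                    // pure longest-match would give '<:' ':' ... and fail to parse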
Is it legal to concatenate test_, 0001, and _THING, even though concatenation order is unspecified?
Yes, that is legal in both languages.
test_ ## 0001 => test_0001 (identifier)
test_0001 ## _THING => test_0001_THING (identifier)
0001 ## _THING => 0001_THING (pp-number)
test_ ## 0001_THING => test_0001_THING (identifier)
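A small sketch confirming that either paste order ends up at the same identifier:
#define test(A) test_ ## A ## _THING

int test(0001) = 2;              // expands to: int test_0001_THING = 2;

int main()
{
    return test_0001_THING;      // same identifier, regardless of paste order
}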
What are examples of invalid token concatenation?
Suppose we have
#define concat3(a, b, c) a ## b ## c
Now, the following are invalid regardless of concatenation order:
concat3(., ., .)
.. is not a token even though ... is. But the concatenation must proceed in some order, and .. would be a necessary intermediate value; since that is not a single token, the concatenation would be invalid.
concat3(27,e,-7)
Here, -7 is two tokens, so it cannot be concatenated.
And here is a case in which concatenation order matters:
concat3(27e, -, 7)
If this is concatenated left-to-right, it will result in 27e- ## 7, which is the concatenation of two pp-numbers. But - cannot be concatenated with 7, because (as above) -7 is not a single token.
What exactly is a pp-number?
In general terms, pp-numbers are a superset of the tokens which might be converted into (single) numeric literals; some of them are not valid as any literal. The definition is intentionally broad, partly in order to allow (some) token concatenations, and partly to insulate the preprocessor from periodic changes in numeric formats. The precise definition can be found in the respective standards, but informally a token is a pp-number if:
It starts with a decimal digit or a period (.) followed by a decimal digit.
The rest of the token is letters, numbers and periods, possibly including sign characters (+, -) if preceded by an exponent symbol. The exponent symbol can be E or e in both languages; and also P and p in C since C99.
In C++, a pp-number can also include (but not start with) an apostrophe followed by a letter or digit.
Note: Above, letter includes underscore. Also, universal character names can be used (except following an apostrophe in C++).
Once preprocessing is terminated, all pp-numbers will be converted to numeric literals if possible. If the conversion is not possible (because the token doesn't correspond to the syntax for any numeric literal), the program is invalid.
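A short sketch of that last point: a pp-number that never has to become a numeric literal (because it is stringified away) is harmless, while the same token used as a literal makes the program invalid.
#define STR2(x) #x
#define STR(x) STR2(x)

const char *s = STR(12ab3);   // fine: 12ab3 is a valid pp-number, and it is
                              // stringified before literal conversion happens
// int n = 12ab3;             // error if uncommented: not a valid numeric
                              // literal (exact diagnostic varies by compiler)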
#define test(A) test_## A ## _THING
int test(0001) = 2;
This is legal with both LTR and RTL evaluation, since both test_0001 and 0001_THING are valid preprocessing tokens. The former is an identifier, while the latter is a pp-number; pp-numbers are not checked for suffix correctness until a later stage of compilation (think of 0001u, an unsigned octal literal, for example).
A few examples to show that the order of evaluation does matter:
#define paste2(a,b) a##b
#define paste(a,b) paste2(a,b)
#if defined(LTR)
#define paste3(a,b,c) paste(paste(a,b),c)
#elif defined(RTL)
#define paste3(a,b,c) paste(a,paste(b,c))
#else
#define paste3(a,b,c) a##b##c
#endif
double a = paste3(1,.,e3), b = paste3(1e,+,3); // OK LTR, invalid RTL
#define stringify2(x) #x
#define stringify(x) stringify2(x)
#define stringify_paste3(a,b,c) stringify(paste3(a,b,c))
char s[] = stringify_paste3(%:,%,:); // invalid LTR, OK RTL
If your compiler uses a consistent order of evaluation (either LTR or RTL) and presents an error on generation of an invalid pp-token, then precisely one of these lines will generate an error. Naturally, a lax compiler could well allow both, while a strict compiler might allow neither.
The second example is rather contrived; because of the way the grammar is constructed, it's very difficult to find a pp-token that is valid when built RTL but not when built LTR.
There are no significant differences between C and C++ in this regard; the two standards have identical language (up to section headers) describing the process of macro replacement. The only way the language could influence the process would be in the valid preprocessing-tokens: C++ (especially recently) has more forms of valid preprocessing-tokens, such as user-defined string literals.
The C++ standard mentions multiple different character sets. In particular, it mentions the following character sets:
In 2.2 [lex.phases] bullet 1 physical source file characters and their mapping to the basic source character set is mentioned.
In 2.2 [lex.phases] bullet 2 execution character set is mentioned.
In 2.3 [lex.charset] paragraph 3 a basic execution character set and a basic execution wide-character set are mentioned.
The same section 2.3 [lex.charset] 3 also mentions an execution character set and an execution wide-character set.
When reading or writing files, yet another character set may be involved.
What are all those different character set used for, how are conversions between them done, and which of these values are locale dependent? In particular, how are string literals represented?
Here is a breakdown of the different character sets used by the compiler itself (all references to the standard are to C++14):
The physical source file characters are those used in the C++ source. Most likely these are now encoded using some Unicode encoding, e.g., UTF-8 or UTF-16. If you are from a European or American background, you may be using ASCII, whose characters are conveniently encoded identically in UTF-8 (every ASCII file is a UTF-8 file but not the other way around). The physical source file characters may also be something unusual like EBCDIC.
The basic source character set is what the compiler, at least conceptually, consumes. It is produced from the physical source file characters by mapping each of them either to its respective basic character or to a sequence of basic characters representing the physical source character via a universal character name (see 2.2 [lex.phases] paragraph 1). The basic source character set is just a set of 96 characters (2.3 [lex.charset] paragraph 1):
a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
and the 5 special characters space (' '), horizontal tab (\t), vertical tab (\v), form feed (\f), and newline (\n)
The mapping between the physical source character set and the basic character set is implementation defined.
The basic execution character set and the basic execution wide-character set are character sets capable of representing the basic source character set, expanded by a few special characters:
alert ('\a'), backspace ('\b'), carriage return ('\r'), and a null character ('\0')
The difference between the non-wide and the wide version is whether the characters are represented using char or wchar_t.
The execution character set and the execution wide-character set are implementation defined extensions of the basic character set and the basic wide-character set. In 2.3 [lex.charset] paragraph 3 it is stated that the additional members and the values of the additional members of execution character set are locale specific. It isn't clear which locale is referred to but I suspect the locale used during compilation is meant. In any case, the execution character sets are implementation defined (also according to 2.3 [lex.charset] paragraph 3).
Character and string literals are originally represented using the basic source character set, with some characters possibly using universal character names. All of these are converted at compile time into the execution character set. According to 2.14.3 [lex.ccon], character literals representable as one char in the execution character set just work. If multiple chars are needed, the character literal may be conditionally supported (and it would have type int). For string literals the conversion is described in 2.14.5 [lex.string]. Paragraph 9 states that UTF-8 string literals (e.g. u8"hello") result in a sequence of values corresponding to the code units of the UTF-8 encoding of the string. Otherwise the translation of characters and universal character names is the same as for character literals (in particular, it is implementation defined), although characters resulting in multi-byte sequences for narrow strings just result in multiple characters (this case isn't necessarily supported for character literals).
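A quick check of the UTF-8 string literal rule (a sketch, assuming a C++11/C++14 compiler where u8 literals have type const char[N]): the plus/minus sign U+00B1 occupies two UTF-8 code units, plus the terminating null.
static_assert(sizeof(u8"\u00b1") == 3,
              "two UTF-8 code units for U+00B1 plus the terminating NUL");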
So far, only the result of compilation has been considered. Any character which isn't part of a character or a string literal is used to specify what the code does. The interesting question is what happens to the literals. The literals are all basically translated into an implementation-defined representation. "Implementation defined" means that what is supposed to happen is documented somewhere, but it can differ between implementations.
How does that help when dealing with characters or strings coming from somewhere? Well, any character or string which is read is converted to the corresponding execution character set. In particular, when a file is read, all characters are transformed to this common representation. Of course, for this transformation to work, the locale used for reading the file needs to be set up according to the encoding of that file. If no locale is explicitly specified, the global locale is used, which is initially determined by the system. The initial global locale is probably set based on user preferences, e.g., on environment variables. If a file is read which uses a different encoding than this global locale, a different locale matching the encoding of the file needs to be used.
Correspondingly, when writing characters using one of the execution character sets, these are converted according to the encoding specified by the current locale. Again, it may be necessary to replace the locale if a specific encoding is needed.
All this effectively means that, internally to a program, all string and character processing happens using the implementation-defined execution character set. All characters read by a program need to be converted to this character set, and all characters written start out as characters in this execution character set and need to be converted appropriately to the external encoding. Of course, in an ideal setup the conversion between the execution character set and the external representation is the identity, e.g., because the execution character set uses UTF-8 and the external representation also uses UTF-8. The same goes for the execution wide-character set, except that in that case UTF-16 would be used (in one of its two variations, as UTF-16 can use either a big-endian or a little-endian representation).
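A minimal sketch of the reading side, assuming the platform provides a locale named "en_US.UTF-8" (common on Linux, not guaranteed by the standard) and a hypothetical file name input.txt: the imbued locale converts the file's UTF-8 bytes into the execution wide-character set as they are read.
#include <fstream>
#include <locale>
#include <string>

int main()
{
    std::wifstream in("input.txt");           // hypothetical UTF-8 encoded file
    in.imbue(std::locale("en_US.UTF-8"));     // decode UTF-8 on input
    std::wstring line;
    while (std::getline(in, line)) {
        // 'line' now holds characters in the execution wide-character set;
        // process them here
    }
}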
I had never used clang before, and I accidentally discovered that this piece of code:
#include <iostream>
void функция(int переменная)
{
std::cout << переменная << std::endl;
}
int main()
{
int русская_переменная = 0;
функция(русская_переменная);
}
compiles fine: http://rextester.com/NFXBL38644 (clang 3.4, clang++ -Wall -std=c++11 -O2).
Is this a clang extension? And why?
Thanks.
UPD: What I'm really asking is why clang made this decision, since I never found any discussion indicating that anyone wanted more characters than the C++ standard currently allows (2.3, rev. 3691).
It's not so much an extension as it is Clang's interpretation of the Multibyte characters part of the standard. Clang supports UTF-8 source code files.
As to why, I guess "why not?" is the only real answer; it seems useful and reasonable to me to support a larger character set.
Here are the relevant parts of the standard (C11 draft):
5.2.1 Character sets
1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.
2 In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters. A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.
3 Both the basic source and basic execution character sets shall have the following members: the 26 uppercase letters of the Latin alphabet
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
the 26 lowercase letters of the Latin alphabet
a b c d e f g h i j k l m
n o p q r s t u v w x y z
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
the following 29 graphic characters
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
the space character, and control characters representing horizontal tab, vertical tab, and form feed. The representation of each member of the source and execution basic character sets shall fit in a byte. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. In source files, there shall be some way of indicating the end of each line of text; this International Standard treats such an end-of-line indicator as if it were a single new-line character. In the basic execution character set, there shall be control characters representing alert, backspace, carriage return, and new line. If any other characters are encountered in a source file (except in an identifier, a character constant, a string literal, a header name, a comment, or a preprocessing token that is never
converted to a token), the behavior is undefined.
4 A letter is an uppercase letter or a lowercase letter as defined above; in this International Standard the term does not include other characters that are letters in other alphabets.
5 The universal character name construct provides a way to name other characters.
And also:
5.2.1.2 Multibyte characters
1 The source character set may contain multibyte characters, used to represent members of the extended character set. The execution character set may also contain multibyte characters, which need not have the same encoding as for the source character set. For both character sets, the following shall hold:
— The basic character set shall be present and each character shall be encoded as a single byte.
— The presence, meaning, and representation of any additional members is locale- specific.
— A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence. While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state. The interpretation for subsequent bytes in the sequence is a function of the current shift state.
— A byte with all bits zero shall be interpreted as a null character independent of shift state. Such a byte shall not occur as part of any other multibyte character.
2 For source files, the following shall hold:
— An identifier, comment, string literal, character constant, or header name shall begin and end in the initial shift state.
— An identifier, comment, string literal, character constant, or header name shall consist of a sequence of valid multibyte characters.
Given clang's usage of UTF-8 as the source encoding, this behavior is mandated by the standard:
C++ defines an identifier as the following:
identifier:
identifier-nondigit
identifier identifier-nondigit
identifier digit
identifier-nondigit:
nondigit
universal-character-name
other implementation-defined characters
The important part here is that identifiers can include universal-character-names. The specification also lists the allowed UCNs:
Annex E (normative)
Universal character names for identifier characters [charname]
E.1 Ranges of characters allowed [charname.allowed]
00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00B7-00BA, 00BC-00BE, 00C0-00D6, 00D8-00F6, 00F8-00FF
0100-167F, 1681-180D, 180F-1FFF
200B-200D, 202A-202E, 203F-2040, 2054, 2060-206F
2070-218F, 2460-24FF, 2776-2793, 2C00-2DFF, 2E80-2FFF
3004-3007, 3021-302F, 3031-303F
3040-D7FF
F900-FD3D, FD40-FDCF, FDF0-FE44, FE47-FFFD
10000-1FFFD, 20000-2FFFD, 30000-3FFFD, 40000-4FFFD, 50000-5FFFD,
60000-6FFFD, 70000-7FFFD, 80000-8FFFD, 90000-9FFFD, A0000-AFFFD,
B0000-BFFFD, C0000-CFFFD, D0000-DFFFD, E0000-EFFFD
The Cyrillic characters in your identifier are in the range 0100-167F.
The C++ specification further mandates that characters encoded in the source encoding be handled identically to UCNs:
Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently.)
— n3337 §2.2 Phases of translation [lex.phases]/1
So given clang's choice of UTF-8 as the source encoding, the spec mandates that these characters be converted to UCNs (or that clang's behavior be indistinguishable from performing such a conversion), and these UCNs are permitted by the spec to appear in identifiers.
It goes even further. Emoji characters happen to be in the ranges allowed by the C++ spec, so if you've seen some of those examples of Swift code with emoji identifiers and were surprised by such capability you might be even more surprised to know that C++ has exactly the same capability:
http://rextester.com/EPYJ41676
http://imgur.com/EN6uanB
Another fact that may be surprising is that this behavior isn't new with C++11; C++ has mandated it since C++98. It's just that compilers ignored it for a long time: Clang implemented this feature in version 3.3 in 2013, and according to this documentation Microsoft Visual Studio supports it as of 2015.
Even today, GCC 6.1 only supports UCNs in identifiers when they are written literally, and does not obey the mandate that any character in its extended source character set must be treated identically to the corresponding universal-character-name. E.g. gcc allows int \u043a\u043e\u0448\u043a\u0430 = 10; but will not allow int кошка = 10;, even with -finput-charset=utf-8.
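A small sketch of the equivalence (with a compiler that implements the mandate, such as a recent Clang; GCC 6 accepts only the first spelling):
int \u043a\u043e\u0448\u043a\u0430 = 10;   // кошка, spelled with universal-character-names

int f()
{
    return кошка;                          // the same identifier, spelled in UTF-8
}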
Can anyone explain why universal character literals (eg "\u00b1") are being encoded into char strings as UTF-8? Why does the following print the plus/minus symbol?
#include <iostream>
#include <cstring>
int main()
{
std::cout << "\u00b1" << std::endl;
return 0;
}
Is this related to my current locale?
2.13.2. [...]
5/ A universal-character-name is translated to the encoding, in
the execution character set, of the character named. If there is no
such encoding, the universal-character-name is translated to an
implementation defined encoding. [Note: in translation phase 1, a
universal-character-name is introduced whenever an actual extended
character is encountered in the source text. Therefore, all extended
characters are described in terms of universal-character-names.
However, the actual compiler implementation may use its own native
character set, so long as the same results are obtained. ]
and
2.2. [...] The values of the members of the execution character sets
are implementation-defined, and any additional members are
locale-specific.
In short, the answer to your question is in your compiler documentation. However:
2.2. 2/ The character designated by the universal-character-name
\UNNNNNNNN is that character whose character short name in ISO/IEC
10646 is NNNNNNNN; the character designated by the
universal-character-name \uNNNN is that character whose character
short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for
a universal character name is less than 0x20 or in the range 0x7F-0x9F
(inclusive), or if the universal character name designates a character
in the basic source character set, then the program is ill-formed.
so you are guaranteed that the character you name is translated into an implementation defined encoding, possibly locale specific.
\u00b1 is the ± symbol, as that is the correct Unicode representation regardless of locale.
String literals, e.g. "abcdef", are simple byte arrays (of type const char[]). The compiler encodes non-ASCII characters in them into something that is implementation-defined. Rumor has it that Visual C++ uses the current Windows ANSI codepage, while GCC uses UTF-8, so you're probably on GCC :)
So, \uABCD is interpreted by the compiler at compile time and converted into the corresponding value in that encoding. That is, it can put one or more bytes into the byte array:
sizeof("\uFE58z") == 3 // visual C++ 2010
sizeof("\uFE58z") == 5 // gcc 4.4 mingw
And yet, how cout will print the byte array depends on locale settings. You can change a stream's locale via the std::ios_base::imbue() call.
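A quick way to see which encoding your compiler actually chose for the literal (a sketch): dump the bytes of the array. With GCC's UTF-8 this prints c2 b1; with a Windows-1252 execution character set it would print b1.
#include <cstdio>
#include <cstring>

int main()
{
    const char s[] = "\u00b1";
    // Print each byte the compiler stored for the literal, in hex.
    for (std::size_t i = 0; i < std::strlen(s); ++i)
        std::printf("%02x ", static_cast<unsigned char>(s[i]));
    std::printf("\n");
}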
C++ Character Sets
With the standardization of C++, it's useful to review some of the mechanisms included in the language for dealing with character sets. This might seem like a very simple issue, but there are some complexities to contend with.
The first idea to consider is the notion of a "basic source character set" in C++. This is defined to be:
all ASCII printing characters 041 - 0177, save for @ $ ` DEL
space
horizontal tab
vertical tab
form feed
newline
or 96 characters in all. These are the characters used to compose a C++ source program.
Some national character sets, such as the European ISO-646 one, use some of these character positions for other letters. The ASCII characters so affected are:
[ ] { } | \
To get around this problem, C++ defines trigraph sequences that can be used to represent these characters:
[ ??(
] ??)
{ ??<
} ??>
| ??!
\ ??/
# ??=
^ ??'
~ ??-
Trigraph sequences are mapped to the corresponding basic source character early in the compilation process.
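A brief sketch of trigraphs in action (note that trigraphs were removed from C++ in C++17, and GCC only replaces them in strict ISO modes or with -trigraphs):
int a??(3??) = ??< 1, 2, 3 ??>;   // same as: int a[3] = { 1, 2, 3 };
int pipe_me = 1 ??! 2;            // same as: int pipe_me = 1 | 2;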
C++ also has the notion of "alternative tokens" that can be used in place of certain tokens. The list of tokens and their alternatives is this (a short usage sketch follows the list):
{ <%
} %>
[ <:
] :>
# %:
## %:%:
&& and
| bitor
|| or
^ xor
~ compl
& bitand
&= and_eq
|= or_eq
^= xor_eq
! not
!= not_eq
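As mentioned above, a short usage sketch: in C++ these alternative tokens are part of the language proper (in C, the word-like ones come from <iso646.h>).
bool neither(bool a, bool b) <%
    return not (a or b);          // same as: return !(a || b);
%>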
Another idea is the "basic execution character set". This includes all of the basic source character set, plus control characters for alert, backspace, carriage return, and null. The "execution character set" is the basic execution character set plus additional implementation-defined characters. The idea is that a source character set is used to define a C++ program itself, while an execution character set is used when a C++ application is executing.
Given this notion, it's possible to manipulate additional characters in a running program, for example characters from Cyrillic or Greek. Character constants can be expressed using any of:
\137 octal
\xabcd hexadecimal
\U12345678 universal character name (ISO/IEC 10646)
\u1234 -> \U00001234
This notation uses the source character set to define execution set characters. Universal character names can be used in identifiers (if letters) and in character literals:
'\u1234'
L'\u2345'
The above features may not yet exist in your local C++ compiler. They are important to consider when developing internationalized applications.
Possible Duplicate:
C++ source in unicode
I just discovered this line of code in a project:
string überwachung;
I was surprised because I thought you were not allowed to use umlauts like 'äöü' in C++ code outside of strings and the like, and that doing so would result in a compiler error. But this compiles just fine with Visual Studio 2008.
Is this a special microsoft feature, or are umlauts allowed with other compilers too?
Are there any potential problems with that (portability,system language settings..)?
I can clearly remember this was not allowed. When did it change?
Kind regards for any clarification
P.S.: the tool cppcheck will even mark this usage as an error, even though it compiles
GCC complains about it (codepad):
error: stray '\303' in program
The C++ language standard itself limits the basic source character set to 91 graphical characters plus space, horizontal and vertical tab, form feed and new-line, all of which are within ASCII. However, there's a nice footnote:
The glyphs for the members of the basic source character set are intended to identify characters from the subset of
ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the
source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to
document how the basic source characters are represented in source files.
.. translation phase 1 is (emphasis mine)
Physical source file characters are mapped, in an implementation-defined manner, to the basic source
character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical
source file characters accepted is implementation-defined.
Generally, you shouldn't use umlauts or other special characters in your code. It may work, but if it does, it's a compiler-specific feature.
See section E/2 of the C++03 standard:
1 This clause lists the complete set of hexadecimal code values that are valid in universal-character-names in C++ identifiers (2.10).
…
Latin: 00c0–00d6, 00d8–00f6, 00f8–01f5, 01fa–0217, 0250–02a8, 1e00–1e9a, 1ea0–1ef9
This includes most accented letters.
The problem is that C++03 didn't specify UTF-8 as the input format. Even C++11 maintains compatibility with EBCDIC.
So, you can certainly create an identifier with an umlaut; the problem is getting a text editor that will interpret the universal-character-name and display it properly. Otherwise you're stuck inputting Unicode directly in hexadecimal format \uXXXX, e.g. \u00FC for ü.
A compiler which accepts UTF-8 in string constants but not in identifiers suffers from a shortsighted implementation. Clang, at least, properly translates UTF-8 to universal-character-names in Phase 1.
I believe this is the clause that applies...
2.2 Character Sets
The basic source character set
consists of 96 characters: the space
character, the control characters
representing horizontal tab, vertical
tab, form feed, and new-line, plus the
following 91 graphical characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
So the use of the umlaut would appear to be a compiler-specific extension.
The compiler is free to support any characters in identifiers it desires. Your compiler apparently supports umlauts. However, it is not guaranteed by the language standard. You can't use umlauts if you expect your program to be standard-compliant.
For another example, some compilers allow using the $ character in identifiers, even though the language specification does not support it.
This would be allowed by the standard if and only if your editor was translating from the character with an umlaut (or other diacritical) into one of the allowed characters. In particular, an identifier in C++ is defined as:
identifier:
nondigit
identifier nondigit
identifier digit
nondigit: one of
universal-character-name
_ a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
As far as I can see, that doesn't allow characters with diacriticals (except as a UCN). It looks to me like a compiler is required to issue at least one diagnostic for a program that contains any character other than those above (though it is still allowed to translate the program). Doing a quick check, I haven't been able to find a compiler flag that gets VC++ to issue a diagnostic for this code. At least IMO, it fails to conform in this respect.
On the other hand, this could just be viewed as VC++ implementing one of the new features of C++11. At least as of N3242, the new C++ draft adds a new item after the table above: "other implementation-defined characters". This gives the compiler permission to accept any other characters it wants to (though it is supposed to document what they are).