What guarantees does C++ make about the ordering of character literals? - c++

What guarantees does C++ make about the ordering of character literals? Is there a definite ordering of characters in the basic source character set? (e.g. is 'a' < 'z' guaranteed to be true? How about 'A' < 'z'?)

The standard only guarantees an ordering for the decimal digits 0 to 9; from the draft C++11 standard, section 2.3 [lex.charset]:
In both the source and execution basic character sets, the value of
each character after 0 in the above list of decimal digits shall be
one greater than the value of the previous.
and otherwise says (emphasis mine):
The basic execution character set and the basic execution
wide-character set shall each contain all the members of the basic
source character set, plus control characters representing alert,
backspace, and carriage return, plus a null character (respectively,
null wide character), whose representation has all zero bits. For each
basic execution character set, the values of the members shall be
non-negative and distinct from one another.
Note that EBCDIC, for example, has non-consecutive letters in its character set.
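A minimal sketch of the distinction (the digit check is guaranteed to pass on every conforming implementation; the letter checks merely happen to hold on ASCII-based systems and are not guaranteed by the standard):
#include <iostream>

int main()
{
    // Guaranteed: the decimal digits are contiguous and in order.
    static_assert('9' - '0' == 9, "digits are contiguous");

    // Not guaranteed: letter ordering and contiguity are implementation-defined.
    // Both expressions are true on ASCII; the second is false on EBCDIC,
    // where the alphabet has gaps.
    std::cout << std::boolalpha
              << ('a' < 'z') << '\n'
              << ('z' - 'a' == 25) << '\n';
}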

Related

What is the "initial shift state"?

In the standard, the term "initial shift state" is cited frequently, seemingly in various contexts such as multibyte characters (strings) and files. But the standard never explains what exactly this is.
What is that? And what is a "shift" here in general?
Also:
Because the term seems to be used in different contexts (in the context of characters, of strings, and of files), I will point to a few passages from the standard (specifically ISO/IEC 9899:2018 (C18)) that include the term "initial shift state":
§ 5.2.1.2 - Multibyte characters
— A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence.
— An identifier, comment, string literal, character constant, or header name shall begin and end in the initial shift state.
§ 7.21.3 - Files
"— A file need not begin nor end in the initial shift state.274)"
"274)Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream (because of possible trailing null characters) or for any stream with state-dependent encoding that does not assuredly end in the initial shift state."
§7.21.6.2 - The fscanf function
For the s conversion specifier:
"If an l length modifier is present, the input shall be a sequence of multibyte characters that begins in the initial shift state."
What is meant by the "initial shift state"? What is it?
What is a "shift" in context?
In the context of strings, is it the double quotation mark " that marks the beginning and end of a format string?
Thanks in advance.
A shift state is a state that governs how a particular byte sequence is interpreted as characters; it is encoding-dependent.
From https://www.gnu.org/software/libc/manual/html_node/Shift-State.html
In some multibyte character codes, the meaning of any particular byte
sequence is not fixed; it depends on what other sequences have come
earlier in the same string. Typically there are just a few sequences
that can change the meaning of other sequences; these few are called
shift sequences and we say that they set the shift state for other
sequences that follow.
To illustrate shift state and shift sequences, suppose we decide that
the sequence 0200 (just one byte) enters Japanese mode, in which pairs
of bytes in the range from 0240 to 0377 are single characters, while
0201 enters Latin-1 mode, in which single bytes in the range from 0240
to 0377 are characters, and interpreted according to the ISO Latin-1
character set. This is a multibyte code that has two alternative shift
states (“Japanese mode” and “Latin-1 mode”), and two shift sequences
that specify particular shift states.
The initial shift state is just the shift state initially, i.e. at the start of processing; in the example above it would be whichever of ISO Latin-1 or Japanese the sequence in question begins in.
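As a minimal C++ sketch of where the initial shift state shows up in practice (an illustration, not taken from the question): a zero-initialized std::mbstate_t represents the initial shift state, std::mbsinit tests for it, and std::mbrtowc moves the state through any locale-specific shift states as it converts. The empty locale name just picks up the environment's locale.
#include <clocale>
#include <cstring>
#include <cwchar>
#include <iostream>

int main()
{
    std::setlocale(LC_ALL, "");                  // use the environment's locale

    std::mbstate_t state = std::mbstate_t();     // zero-initialized = initial shift state
    std::cout << std::boolalpha
              << (std::mbsinit(&state) != 0) << '\n'; // true: we are in the initial shift state

    // Converting a multibyte sequence may move 'state' into other
    // locale-specific shift states; mbrtowc updates it as it consumes bytes.
    const char* s = "hello";
    wchar_t wc;
    std::size_t n = std::mbrtowc(&wc, s, std::strlen(s), &state);
    std::cout << n << '\n';                      // bytes consumed for the first character
}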

identifier character set (clang)

I never use clang.
And I accidentally discovered that this piece of code:
#include <iostream>
void функция(int переменная)
{
std::cout << переменная << std::endl;
}
int main()
{
int русская_переменная = 0;
функция(русская_переменная);
}
compiles fine: http://rextester.com/NFXBL38644 (clang 3.4 (clang++ -Wall -std=c++11 -O2)).
Is it a clang extension? And why?
Thanks.
UPD: I'm really asking why clang made this decision, because I have never found any discussion where someone asked for more characters than the C++ standard currently allows (2.3, rev. 3691).
It's not so much an extension as it is Clang's interpretation of the Multibyte characters part of the standard. Clang supports UTF-8 source code files.
As to why, I guess "why not?" is the only real answer; it seems useful and reasonable to me to support a larger character set.
Here are the relevant parts of the standard (C11 draft):
5.2.1 Character sets
1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.
2 In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters. A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.
3 Both the basic source and basic execution character sets shall have the following members: the 26 uppercase letters of the Latin alphabet
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
the 26 lowercase letters of the Latin alphabet
a b c d e f g h i j k l m
n o p q r s t u v w x y z
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
the following 29 graphic characters
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
the space character, and control characters representing horizontal tab, vertical tab, and form feed. The representation of each member of the source and execution basic character sets shall fit in a byte. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. In source files, there shall be some way of indicating the end of each line of text; this International Standard treats such an end-of-line indicator as if it were a single new-line character. In the basic execution character set, there shall be control characters representing alert, backspace, carriage return, and new line. If any other characters are encountered in a source file (except in an identifier, a character constant, a string literal, a header name, a comment, or a preprocessing token that is never
converted to a token), the behavior is undefined.
4 A letter is an uppercase letter or a lowercase letter as defined above; in this International Standard the term does not include other characters that are letters in other alphabets.
5 The universal character name construct provides a way to name other characters.
And also:
5.2.1.2 Multibyte characters
1 The source character set may contain multibyte characters, used to represent members of the extended character set. The execution character set may also contain multibyte characters, which need not have the same encoding as for the source character set. For both character sets, the following shall hold:
— The basic character set shall be present and each character shall be encoded as a single byte.
— The presence, meaning, and representation of any additional members is locale- specific.
— A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence. While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state. The interpretation for subsequent bytes in the sequence is a function of the current shift state.
— A byte with all bits zero shall be interpreted as a null character independent of shift state. Such a byte shall not occur as part of any other multibyte character.
2 For source files, the following shall hold:
— An identifier, comment, string literal, character constant, or header name shall begin and end in the initial shift state.
— An identifier, comment, string literal, character constant, or header name shall consist of a sequence of valid multibyte characters.
Given clang's usage of UTF-8 as the source encoding, this behavior is mandated by the standard:
C++ defines an identifier as the following:
identifier:
identifier-nondigit
identifier identifier-nondigit
identifier digit
identifier-nondigit:
nondigit
universal-character-name
other implementation-defined characters
The important part here is that identifiers can include universal-character-names. The specification also lists the allowed UCNs:
Annex E (normative)
Universal character names for identifier characters [charname]
E.1 Ranges of characters allowed [charname.allowed]
00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00B7-00BA, 00BC-00BE, 00C0-00D6, 00D8-00F6, 00F8-00FF
0100-167F, 1681-180D, 180F-1FFF
200B-200D, 202A-202E, 203F-2040, 2054, 2060-206F
2070-218F, 2460-24FF, 2776-2793, 2C00-2DFF, 2E80-2FFF
3004-3007, 3021-302F, 3031-303F
3040-D7FF
F900-FD3D, FD40-FDCF, FDF0-FE44, FE47-FFFD
10000-1FFFD, 20000-2FFFD, 30000-3FFFD, 40000-4FFFD, 50000-5FFFD,
60000-6FFFD, 70000-7FFFD, 80000-8FFFD, 90000-9FFFD, A0000-AFFFD,
B0000-BFFFD, C0000-CFFFD, D0000-DFFFD, E0000-EFFFD
The Cyrillic characters in your identifier are in the range 0100-167F.
The C++ specification further mandates that characters encoded in the source encoding be handled identically to UCNs:
Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently.)
— n3337 §2.2 Phases of translation [lex.phases]/1
So given clang's choice of UTF-8 as the source encoding, the spec mandates that these characters be converted to UCNs (or that clang's behavior be indistinguishable from performing such a conversion), and these UCNs are permitted by the spec to appear in identifiers.
It goes even further. Emoji characters happen to be in the ranges allowed by the C++ spec, so if you've seen some of those examples of Swift code with emoji identifiers and were surprised by such capability you might be even more surprised to know that C++ has exactly the same capability:
http://rextester.com/EPYJ41676
http://imgur.com/EN6uanB
Another fact that may be surprising is that this behavior isn't new with C++11; C++ has mandated it since C++98. It's just that compilers ignored this for a long time: Clang implemented this feature in version 3.3 in 2013. According to this documentation, Microsoft Visual Studio supports this as of 2015.
Even today, GCC 6.1 only supports UCNs in identifiers when they are written literally, and does not obey the mandate that any character in its extended source character set must be treated identically with the corresponding universal-character-name. E.g. gcc allows int \u043a\u043e\u0448\u043a\u0430 = 10; but will not allow int кошка = 10; even with -finput-charset=utf-8.
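For illustration, here is a hedged sketch of the two spellings the answer contrasts; per [lex.phases] they name the same identifier, but, as noted above, older GCC versions accept only the UCN form:
#include <iostream>

int main()
{
    int \u043a\u043e\u0448\u043a\u0430 = 10;   // defined with the UCN spelling

    // Used with the literal UTF-8 spelling: the standard requires these to be
    // treated as the same identifier (assuming the source file is UTF-8).
    std::cout << кошка << '\n';
}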

What's the value of characters in execution character set?

Quote from C++03 2.2 Character sets:
"The basic execution character set and the basic execution
wide-character set shall each contain all the members of the basic
source character set..The values of the members of the execution
character sets are implementation-defined, and any additional members
are locale-specific."
According to this, the value of 'A', which belongs to the execution character set, is implementation-defined. So it's not necessarily 65 (the ASCII code of 'A' in decimal)?!
// Not always 65?
printf ("%d", 'A');
Or I've a misunderstanding as to the value of a character in execution character set?
Of course it can be ASCII's 65, if the execution character set is ASCII or a superset (such as UTF-8).
It doesn't say "it can't be ASCII"; it says that it is something called "the execution character set".
So, the standard allows the "execution character set" to be something other than ASCII or ASCII derivatives. One example would be the EBCDIC character set that IBM used for a long time (there are probably still machines around using EBCDIC, but I suspect anything built in the last 10-15 years wouldn't be using it). The encoding of characters in EBCDIC is completely different from ASCII.
So expecting, in code, that 'A' has any particular value is not portable. There is also a whole heap of other "common assumptions" that will fail: that there are no "holes" between A and Z, and that 'a' - 'A' == 32; both are false in EBCDIC. At least the characters A-Z are in the correct order! ;)
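A small sketch of the portable alternatives, assuming nothing beyond the standard: arithmetic on digits is guaranteed, while case conversion should go through the library rather than an ASCII offset:
#include <cctype>
#include <iostream>

int main()
{
    // Portable: '0'..'9' are guaranteed to be contiguous and in order.
    int value = '7' - '0';                       // always 7

    // Not portable: char upper = 'q' - 32; assumes the ASCII layout.
    // Portable: let the library do the case mapping.
    char upper = static_cast<char>(std::toupper(static_cast<unsigned char>('q')));

    std::cout << value << ' ' << upper << '\n';  // prints "7 Q"
}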

C++11 character literal '\xC4' standard type with UTF-8 execution character set?

Consider a C++11 compiler that has an execution character set of UTF-8 (and is compliant with the x86-64 ABI, which requires the char type to be a signed 8-bit byte).
The letter Ä (umlaut) has the Unicode code point 0xC4 and a two-code-unit UTF-8 representation of {0xC3, 0x84}.
The compiler assigns the character literal '\xC4' a type of int with a value of 0xC4.
Is the compiler standard-compliant and ABI-compliant? What is your reasoning?
Relevant quotes from C++11 standard:
2.14.3.1
An ordinary character literal that contains a single c-char has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. An ordinary character literal that contains more than
one c-char is a multicharacter literal. A multicharacter literal has type int and implementation-defined
value.
2.14.3.4
The escape \xhhh consists of the backslash followed by x followed by
one or more hexadecimal digits that are taken to specify the value of the desired character. The value of a character
literal is implementation-defined if it falls outside of the implementation-defined range defined for char
§2.14.3 paragraph 1 is undoubtedly the relevant text in the (C++11) standard. However, there were several defects in the original text, and the latest version contains the following text, emphasis added:
A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.
Although this has been accepted as a defect, it does not actually form part of any standard. However, it stands as a recommendation and I suspect that many compilers will implement it.
From 2.14.3p4:
The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char
x86 compilers historically (and as you point out, that practice is now an official standard of some sort) have signed chars. \xc4 is out of range for that, so the implementation is required to document the literal value it will produce.
It looks like your implementation promotes out-of-range char literals specified with \x escapes to (in-range) integer literals.
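A hedged way to see what a given compiler actually does with this literal (the values in the comments are what a typical x86-64 compiler with a signed 8-bit char produces; the standard leaves the result implementation-defined or, under the defect resolution, conditionally-supported):
#include <iostream>
#include <type_traits>

int main()
{
    // Does '\xC4' still have type char, or has it been given type int?
    std::cout << std::boolalpha
              << std::is_same<decltype('\xC4'), char>::value << '\n';

    // On a typical x86-64 compiler with signed char this prints -60
    // (0xC4 reinterpreted as a signed byte), not 196.
    std::cout << static_cast<int>('\xC4') << '\n';
}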
You're mixing apples, oranges, pears and kumquats :)
Yes, "\xc4" is a legal character literal. Specifically, what the standard calls a "narrow character literal".
From the C++ standard:
The glyphs for the members of the basic source character set are
intended to identify characters from the subset of ISO/IEC 10646 which
corresponds to the ASCII character set. However, because the mapping
from source file characters to the source character set (described in
translation phase 1) is specified as implementation-defined, an
implementation is required to document how the basic source characters
are represented in source files.
This might help clarify:
Rules for C++ string literals escape character
This might also help, if you're not familiar with it:
The absolute minimum every software developer should know about Unicode
Here is another good, concise - and illuminating - reference:
IBM Developerworks: Character literals

Is the character set of a char literal guaranteed to be ASCII?

Coming from a discussion started here, does the standard specify values for characters? So, is '0' guaranteed to be 48? That's what ASCII would tell us, but is it guaranteed? If not, have you seen any compiler where '0' isn't 48?
No. There's no requirement for either the source or execution character sets to use an encoding with an ASCII subset. I haven't seen any non-ASCII implementations, but I know someone who knows someone who has. (It is required that '0' through '9' have contiguous integer values, but that's a duplicate question somewhere else on SO.)
The encoding used for the source character set controls how the bytes of your source code are interpreted into the characters used in the C++ language. The standard describes the members of the execution character set as having values. It is the encoding that maps these characters to their corresponding values that determines the integer value of '0'.
Although at least all of the members of the basic source character set plus some control characters and a null character with value zero must be present (with appropriate values) in the execution character set, there is no requirement for the encoding to be ASCII or to use ASCII values for any particular subset of characters (other than the null character).
No, the Standard is very careful not to specify what the source character encoding is.
C and C++ compilers run on EBCDIC computers too, you know, where '0' != 0x30.
However, I believe it is required that '1' == '0' + 1.
It's 0xF0 in EBCDIC. I've never used an EBCDIC compiler, but I'm told that they were all the rage at IBM for a while.
There's no requirement in the C++ standard that the source or execution encodings are ASCII-based. It is guaranteed that '0' == '1' - 1 (and in general that the digits are contiguous and in order). It is not guaranteed that the letters are contiguous, and indeed in EBCDIC 'J' != 'I' + 1 and 'S' != 'R' + 1.
According to the C++11 standard draft N3225:
The glyphs for the members of the basic source character set are
intended to identify characters from the subset of ISO/IEC 10646 which
corresponds to the ASCII character set. However, because the mapping
from source file characters to the source character set (described in
translation phase 1) is specified as implementation-defined, an
implementation is required to document how the basic source characters
are represented in source files
In short, the character set is not required to map to the ASCII table, even though I've never heard of a different implementation.
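If code does rely on ASCII values, one hedged option is to make that assumption explicit with a compile-time check instead of relying on it silently; a minimal sketch:
// Guaranteed by the standard: the digits are contiguous, so this must pass.
static_assert('0' + 9 == '9', "digits are contiguous");

// Not guaranteed: this documents and enforces an ASCII assumption.
// It would fail on an EBCDIC implementation, where '0' is 0xF0.
static_assert('0' == 48 && 'A' == 65 && 'a' == 97,
              "this code assumes an ASCII-based execution character set");

int main() {}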