Quote from C++03 2.2 Character sets:
"The basic execution character set and the basic execution
wide-character set shall each contain all the members of the basic
source character set..The values of the members of the execution
character sets are implementation-defined, and any additional members
are locale-specific."
According to this, 'A', which belongs to the execution character set, its value is implementation-defined. So it's not 65(ASCII code of 'A' in decimal), what?!
// Not always 65?
printf ("%d", 'A');
Or I've a misunderstanding as to the value of a character in execution character set?
Of course it can be ASCII's 65, if the execution character set is ASCII or a superset (such as UTF-8).
It doesn't say "it can't be ASCII", it says that it is something called "the execution character set".
So, the standard allows that the "execution character set" is other things than ASCII or ASCII derivatives. One example would be the EBCDIC character set that IBM used for a long time (there's probably still machines about using EBCDIC, but I suspect anything built in the last 10-15 years wouldn't be using that). The encoding of characters in EBCDIC is completely different from ASCII.
So, expecting, in code, that the value of 'A' is any particular value is not portable. There are also a whole heap of other "common assumptions" that will fail - that there are no "holes" between A-Z, and that 'A'-'a' == 32 are both false in EBCDIC. At least the characters A-Z are in the correct order! ;)
Related
In the standard, the term of the "initial shift state" is often cited, also seemingly in various contexts, such as multibyte characters(strings) and files. But the standard missed an explanation of what this exactly shall be.
What is that? And what is a "shift" here in general?
Also:
Because of the term for me seems to be used in different contexts( in the context of characters, in the context of strings and in the context of files), I will point to a few text phrases from the standard (especially ISO/IEC:9899/2018 (C18)) which include the term of "initial shift state":
§ 5.2.1.2 - Multibyte characters
— A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence.
— An identifier, comment, string literal, character constant, or header name shall begin and end in the initial shift state.
§ 7.21.3 - Files
"— A file need not begin nor end in the initial shift state.274)"
"274)Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream (because of possible trailing null characters) or for any stream with state-dependent encoding that does not assuredly end in the initial shift state."
§7.21.6.2 - The fscanf function
For the s conversion specifier:
"If an l length modifier is present, the input shall be a sequence of multibyte characters that begins in the initial shift state."
What is meant by the "inital shift state"? What is that?
What is a "shift" in context?
Is it in the context of strings the double quotation mark " which is the beginning and end of a format string?
Thanks in advance.
A shift state refers to a state which informs the interpretation of some byte sequence as characters, this is encoding dependent.
From https://www.gnu.org/software/libc/manual/html_node/Shift-State.html
In some multibyte character codes, the meaning of any particular byte
sequence is not fixed; it depends on what other sequences have come
earlier in the same string. Typically there are just a few sequences
that can change the meaning of other sequences; these few are called
shift sequences and we say that they set the shift state for other
sequences that follow.
To illustrate shift state and shift sequences, suppose we decide that
the sequence 0200 (just one byte) enters Japanese mode, in which pairs
of bytes in the range from 0240 to 0377 are single characters, while
0201 enters Latin-1 mode, in which single bytes in the range from 0240
to 0377 are characters, and interpreted according to the ISO Latin-1
character set. This is a multibyte code that has two alternative shift
states (“Japanese mode” and “Latin-1 mode”), and two shift sequences
that specify particular shift states.
The initial shift state is just the shift state initially, i.e. at the start of processing; in the example above it would be whichever of ISO Latin-1 or Japanese the sequence in question begins in.
This is an extended question of this one: Is std::string suppose to have only Ascii characters
I want to build a simple console application that take an input from the user as set of characters. Those characters include 0->9 digits and a->z letters.
I am dealing with input supposing that it is an Ascii. For example, I am using something like : static_cast<unsigned int>(my_char - '0') to get the number as unsigned int.
How can I make this code cross-platform? How can tell that I want the input to be Ascii always? Or I have missed a lot of concepts and static_cast<unsigned int>(my_char - '0') is just a bad way?
P.S. In Ascii (at least) digits have sequenced order. However, in others encoding, I do not know if they have. (I am pretty sure that they are but there is no guarantee, right?)
How can force the user/OS to input an Ascii string
You cannot, unless you let the user specify the numeric values of such ASCII input.
It all depends how the terminal implementation used to serve std::cin translates key strokes like 0 to a specific number, and what your toolchain expects to match that number with it's intrinsic translation for '0'.
You simply shouldn't expect ASCII values explicitly (e.g. using magic numbers), but char literals to provide portable code. The assumption that my_char - '0' will result in the actual digits value is true for all character sets. The C++ standard states in [lex.charset]/3 that
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.[...]
emphasis mine
You can't force or even verify that beforehand . "Evil user" can always sneak a UTF-8 encoded string into your application, with no characters above U+7F. And such string happens to be also Ascii-encoded.
Also, whatever platform specific measure you take, user can pipe a UTF-16LE encoded file. Or /dev/urandom
Your mistakes string encoding with some magic property of an input stream - and it is not. It is, well, encoding, like JPEG or AVI, and must be handled exactly the same way - read an input, match with format, report errors on parsing failure.
For your case, if you want to accept only ASCII, read input stream byte by byte and throw/exit with error if you ever encounters a byte with the value outside ASCII domain.
However, if later you encounter a terminal providing data with some incompatible encoding, like UTF16LE, you have no choice but to write a detection (based on byte order mark) and a conversion routine.
What guarantees does C++ make about the ordering of character literals? Is there a definite ordering of characters in the basic source character set? (e.g. is 'a' < 'z' guaranteed to be true? How about 'A' < 'z'?)
The standard only provides a guarantee for ordering of the decimal digits 0 to 9, from the draft C++11 standard section 2.3 [lex.charset]:
In both the source and execution basic character sets, the value of
each character after 0 in the above list of decimal digits shall be
one greater than the value of the previous.
and otherwise says (emphasis mine):
The basic execution character set and the basic execution
wide-character set shall each contain all the members of the basic
source character set, plus control characters representing alert,
backspace, and carriage return, plus a null character (respectively,
null wide character), whose representation has all zero bits. For each
basic execution character set, the values of the members shall be
non-negative and distinct from one another.
Note, EBCDIC has a non-consecutive character set.
The C++ standard mentions multiple different character sets. In particular, it mentions the following character sets:
In 2.2 [lex.phases] bullet 1 physical source file characters and their mapping to the basic source character set is mentioned.
In 2.2 [lex.phases] bullet 2 execution character set is mentioned.
In 2.3 [lex.charset] paragraph 3 a basic execution character set and a basic execution wide-character set are mentioned.
The same section 2.3 [lex.charset] 3 also mentions an execution character set and an execution wide-character set.
When reading or writing files these use some other character set.
What are all those different character set used for, how are conversions between them done, and which of these values are locale dependent? In particular, how are string literals represented?
Here is a break down of the different character sets used by the compiler itself (all reference to the standard are for C++14, actually):
The physical source file characters are those used in the C++ source. Most likely these are now encoded using some Unicode encoding, e.g., UTF-8 or UTF-16. If you are from a European or an American background you may be using ASCII whose characters are conveniently encoded identically in UTF-8 (every ASCII file is a UTF-8 file but not the other way around). The physical source file characters_ may also be something unusual like EBCDIC.
The basic source character set is what the compiler, at least conceptually, consumes. It is produced from the physical source file characters and either mapping them to their respective basic character or to a sequence of basic characters representing the physical source character using a universal character name (see 2.2 [lex.phases] paragraph 1). The basic source character set is a just a set of 96 character (2.3 [lex.charset] paragraph 1):
a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " ’
and the 5 special characters space (' '), horizontal tab (\t), vertical tab (\v), form feed (\f), and newline (\n)
The mapping between the physical source character set and the basic character set is implementation defined.
The basic execution character set and the basic execution wide-character set are characters set capable of representing the basic source character set expanded by a few special character:
alert ('\a'), backspace ('\b'), carriage return ('\r'), and a null character ('\0')
The difference between the non-wide and the wide version is whether the characters are represented using char or wchar_t.
The execution character set and the execution wide-character set are implementation defined extensions of the basic character set and the basic wide-character set. In 2.3 [lex.charset] paragraph 3 it is stated that the additional members and the values of the additional members of execution character set are locale specific. It isn't clear which locale is referred to but I suspect the locale used during compilation is meant. In any case, the execution character sets are implementation defined (also according to 2.3 [lex.charset] paragraph 3).
Character and string literals are originally represented using the basic source character set with some characters possibly using universal character names. All of these are converted at compile time into the execution character set. According to 2.14.3 [lex.ccon] character literals representable as one char in the execution character set just work. If multiple chars are needed the character literals may be conditionally supported (and they'd have type int). For string literals the conversion is described in 2.14.5 [lex.string]. Paragraph 9 states that UTF-8 string literals (e.g. u8"hello") result in a sequence of values corresponding to the code units of the UTF-8 string. Otherwise the translation of characters and universal character names is the same as that for character literals (in particular, it is implementation defined) although characters resulting in multi-byte sequences for narrow string just result in multiple characters (this case isn't necessary support for character literals).
So far, only the result of compilation is considered. Any character which isn't part of a character or a string literal is used to specify what the code does. The interesting question is what happened to the literals? The literals are all basically translated into an implementation defined representation. That is implementation defined means that it is somewhere documented what is supposed to happen but it can differ between different implementations.
How does that help when dealing with characters or strings coming from somewhere? Well, any character or string which is read is converted to the corresponding execution character set. In particular, when a file is read, all characters are transformed to this common representation. Of course, for this transformation to work, the locale used for reading a file needs to be setup according to the encoding of that file. If the locale isn't explicitly mentioned the global locale is used which is initially determined by the system is used. The initial global locale is probably set somehow based on user preferences, e.g., based no environment variables. If a a file is read which uses a different encoding than this global locale, a corresponding different locale matching the encoding of the file needs to be used.
Correspondingly, when writing characters using one of the execution character sets, these are converted according to the encoding specified by the current locale. Again, it may be necessary to replace the locale if a specific encoding is needed.
All this effectively means that internally to a program all string and character processing happens using the implementation defined execution character set. All characters being read by a program need to be converted to this character set and all characters written start as characters in this execution character set and need to be converted appropriately to the external encoding. If course, in an ideal set up the conversion between the execution character set and the external representation is the identity, e.g., because the execution character set uses UTF-8 and the external representation also uses UTF-8. Correspondingly for the execution wide-character set except in this case UTF-16 would be used (one of the two variations as UTF-16 can either use big endian or little endian representation).
Coming from a discussion started here, does the standard specify values for characters? So, is '0' guaranteed to be 48? That's what ASCII would tell us, but is it guaranteed? If not, have you seen any compiler where '0' isn't 48?
No. There's no requirement for the either the source or execution character sets to use an encoding with an ASCII subset. I haven't seen any non-ASCII implementations but I know someone who knows someone who has. (It is required that '0' - '9' have contiguous integer values, but that's a duplicate question somewhere else on SO.)
The encoding used for the source character set controls how the bytes of your source code are interpreted into the characters used in the C++ language. The standard describes the members of the execution character set as having values. It is the encoding that maps these characters to their corresponding values the determines the integer value of '0'.
Although at least all of the members of the basic source character set plus some control characters and a null character with value zero must be present (with appropriate values) in the execution character set, there is no requirement for the encoding to be ASCII or to use ASCII values for any particular subset of characters (other than the null character).
No, the Standard is very careful not to specify what the source character encoding is.
C and C++ compilers run on EBCDIC computers too, you know, where '0' != 0x30.
However, I believe it is required that '1' == '0' + 1.
It's 0xF0 in EBCDIC. I've never used an EBCDIC compiler, but I'm told that they were all the rage at IBM for a while.
There's no requirement in the C++ standard that the source or execution encodings are ASCII-based. It is guaranteed that '0' == '1' - 1 (and in general that the digits are contiguous and in order). It is not guaranteed that the letters are contiguous, and indeed in EBCDIC 'J' != 'I' + 1 and 'S' != 'R' + 1.
According to the C++11 stardard N3225
The glyphs for the members of the basic source character set are
intended to identify characters from the subset of ISO/IEC 10646 which
corresponds to the ASCII character set. However, because the mapping
from source file characters to the source character set (described in
translation phase 1) is specified as implementation-defined, an
implementation is required to document how the basic source characters
are represented in source files
In short, the character set is not required to be mapped to the ASCII table, even though I've never heard about any different implementation