What is the "initial shift state"? - c++

The standard frequently uses the term "initial shift state", seemingly in various contexts such as multibyte characters (strings) and files. But the standard never explains what exactly this is.
What is that? And what is a "shift" here in general?
Also:
Because the term seems to be used in different contexts (characters, strings, and files), I will point to a few passages from the standard (specifically ISO/IEC 9899:2018 (C18)) that include the term "initial shift state":
§ 5.2.1.2 - Multibyte characters
— A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence.
— An identifier, comment, string literal, character constant, or header name shall begin and end in the initial shift state.
§ 7.21.3 - Files
"— A file need not begin nor end in the initial shift state.274)"
"274)Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream (because of possible trailing null characters) or for any stream with state-dependent encoding that does not assuredly end in the initial shift state."
§7.21.6.2 - The fscanf function
For the s conversion specifier:
"If an l length modifier is present, the input shall be a sequence of multibyte characters that begins in the initial shift state."
What is meant by the "initial shift state"? What is that?
What is a "shift" in context?
In the context of strings, is it the double quotation mark " that marks the beginning and end of a format string?
Thanks in advance.

A shift state is a state that governs how a given byte sequence is interpreted as characters; it is encoding-dependent.
From https://www.gnu.org/software/libc/manual/html_node/Shift-State.html
In some multibyte character codes, the meaning of any particular byte
sequence is not fixed; it depends on what other sequences have come
earlier in the same string. Typically there are just a few sequences
that can change the meaning of other sequences; these few are called
shift sequences and we say that they set the shift state for other
sequences that follow.
To illustrate shift state and shift sequences, suppose we decide that
the sequence 0200 (just one byte) enters Japanese mode, in which pairs
of bytes in the range from 0240 to 0377 are single characters, while
0201 enters Latin-1 mode, in which single bytes in the range from 0240
to 0377 are characters, and interpreted according to the ISO Latin-1
character set. This is a multibyte code that has two alternative shift
states (“Japanese mode” and “Latin-1 mode”), and two shift sequences
that specify particular shift states.
The initial shift state is just the shift state in effect initially, i.e. at the start of processing, before any shift sequence has been seen; in the example above it would be whichever of ISO Latin-1 or Japanese mode the encoding starts out in.
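Concretely, in C and C++ this initial shift state is what a zero-initialized mbstate_t object represents, and mbsinit() tests whether a conversion is (back) in that state. A minimal sketch, not from the answer above; the "hello" string and the setlocale call are purely illustrative:

#include <clocale>
#include <cstdio>
#include <cstring>
#include <cwchar>

int main()
{
    std::setlocale(LC_ALL, "");        // use the environment's locale (possibly multibyte, e.g. UTF-8)

    const char *seq = "hello";         // a multibyte sequence in that locale
    wchar_t wc;

    // A zero-initialized mbstate_t represents the initial conversion (shift) state.
    std::mbstate_t state = std::mbstate_t();

    // Conversion starts in the initial shift state; 'state' is updated
    // if a shift sequence is consumed along the way.
    std::size_t len = std::mbrtowc(&wc, seq, std::strlen(seq), &state);

    // mbsinit() reports whether 'state' describes the initial shift state.
    std::printf("consumed %zu byte(s); in initial shift state: %d\n",
                len, std::mbsinit(&state) != 0);
}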

Related

How can I force the user/OS to input an ASCII string

This is a follow-up to this question: Is std::string suppose to have only Ascii characters
I want to build a simple console application that takes input from the user as a set of characters. Those characters include the digits 0-9 and the letters a-z.
I am handling the input under the assumption that it is ASCII. For example, I use something like static_cast<unsigned int>(my_char - '0') to get the number as an unsigned int.
How can I make this code cross-platform? How can I tell that I want the input to always be ASCII? Or have I missed a lot of concepts, and is static_cast<unsigned int>(my_char - '0') just a bad approach?
P.S. In ASCII (at least), digits are ordered sequentially. However, I do not know whether they are in other encodings. (I am pretty sure they are, but there is no guarantee, right?)
How can I force the user/OS to input an ASCII string
You cannot, unless you let the user specify the numeric values of such ASCII input.
It all depends on how the terminal implementation behind std::cin translates keystrokes like 0 to a specific number, and on what your toolchain expects to match that number with its intrinsic translation of '0'.
You simply shouldn't rely on explicit ASCII values (e.g. magic numbers); use char literals instead to write portable code. The assumption that my_char - '0' yields the actual digit's value is true for all character sets. The C++ standard states in [lex.charset]/3 that
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.[...]
emphasis mine
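As an illustration of that guarantee, here is a minimal sketch (not part of the original answer; my_char mirrors the variable name from the question):

#include <iostream>

int main()
{
    char my_char;
    std::cin >> my_char;

    // Portable: the standard guarantees '0'..'9' have consecutive values,
    // so the subtraction yields the digit's numeric value in any character set.
    if (my_char >= '0' && my_char <= '9') {
        unsigned int value = static_cast<unsigned int>(my_char - '0');
        std::cout << "digit value: " << value << '\n';
    } else {
        std::cout << "not a decimal digit\n";
    }
}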
You can't force or even verify that beforehand. An "evil user" can always sneak a UTF-8 encoded string into your application with no characters above U+7F, and such a string also happens to be valid ASCII.
Also, whatever platform-specific measures you take, the user can pipe in a UTF-16LE encoded file. Or /dev/urandom.
You are mistaking string encoding for some magic property of an input stream, and it is not. It is, well, an encoding, like JPEG or AVI, and must be handled exactly the same way: read the input, match it against the format, and report errors on parsing failure.
For your case, if you want to accept only ASCII, read the input stream byte by byte and throw/exit with an error if you ever encounter a byte with a value outside the ASCII range (see the sketch below).
However, if you later encounter a terminal providing data in some incompatible encoding, such as UTF-16LE, you have no choice but to write detection (based on the byte order mark) and conversion routines.
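A minimal sketch of that byte-by-byte check (the function name read_ascii_only and the throw-on-failure policy are my own choices, not from the answer):

#include <iostream>
#include <stdexcept>
#include <string>

// Read the stream byte by byte and reject anything outside the 7-bit ASCII range.
std::string read_ascii_only(std::istream& in)
{
    std::string result;
    char byte;
    while (in.get(byte)) {
        if (static_cast<unsigned char>(byte) > 0x7F) {
            throw std::runtime_error("input is not ASCII");
        }
        result.push_back(byte);
    }
    return result;
}

int main()
{
    try {
        std::string text = read_ascii_only(std::cin);
        std::cout << "read " << text.size() << " ASCII bytes\n";
    } catch (const std::exception& e) {
        std::cerr << e.what() << '\n';
        return 1;
    }
}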

What guarantees does C++ make about the ordering of character literals?

What guarantees does C++ make about the ordering of character literals? Is there a definite ordering of characters in the basic source character set? (e.g. is 'a' < 'z' guaranteed to be true? How about 'A' < 'z'?)
The standard only guarantees the ordering of the decimal digits 0 to 9; from the draft C++11 standard, section 2.3 [lex.charset]:
In both the source and execution basic character sets, the value of
each character after 0 in the above list of decimal digits shall be
one greater than the value of the previous.
and otherwise says (emphasis mine):
The basic execution character set and the basic execution
wide-character set shall each contain all the members of the basic
source character set, plus control characters representing alert,
backspace, and carriage return, plus a null character (respectively,
null wide character), whose representation has all zero bits. For each
basic execution character set, the values of the members shall be
non-negative and distinct from one another.
Note that EBCDIC does not have consecutive letters: there are gaps, for example between 'i' and 'j' and between 'r' and 's'.
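To make the practical consequence concrete, here is a small sketch (my own illustration, not from the answer): range tests on digits are portable, while range tests on letters are not.

#include <cctype>
#include <iostream>

int main()
{
    char c = 'm';

    // Guaranteed portable: '0'..'9' are consecutive in every conforming character set.
    bool is_digit = (c >= '0' && c <= '9');

    // NOT guaranteed: letters need not be consecutive (EBCDIC has gaps),
    // so this range test can accept non-letters on non-ASCII platforms.
    bool naive_lower = (c >= 'a' && c <= 'z');

    // Portable alternative for letters:
    bool is_lower = std::islower(static_cast<unsigned char>(c)) != 0;

    std::cout << is_digit << ' ' << naive_lower << ' ' << is_lower << '\n';
}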

identifier character set (clang)

I never use clang, and I accidentally discovered that this piece of code:
#include <iostream>

void функция(int переменная)
{
    std::cout << переменная << std::endl;
}

int main()
{
    int русская_переменная = 0;
    функция(русская_переменная);
}
compiles fine: http://rextester.com/NFXBL38644 (clang 3.4, clang++ -Wall -std=c++11 -O2).
Is this a clang extension? And why?
Thanks.
UPD: I'm really asking why clang made this decision, because I have never found a discussion where anyone asked for more identifier characters than the C++ standard currently allows (2.3, rev. 3691).
It's not so much an extension as it is Clang's interpretation of the Multibyte characters part of the standard. Clang supports UTF-8 source code files.
As to why, I guess "why not?" is the only real answer; it seems useful and reasonable to me to support a larger character set.
Here are the relevant parts of the standard (C11 draft):
5.2.1 Character sets
1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.
2 In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters. A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.
3 Both the basic source and basic execution character sets shall have the following members: the 26 uppercase letters of the Latin alphabet
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
the 26 lowercase letters of the Latin alphabet
a b c d e f g h i j k l m
n o p q r s t u v w x y z
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
the following 29 graphic characters
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
the space character, and control characters representing horizontal tab, vertical tab, and form feed. The representation of each member of the source and execution basic character sets shall fit in a byte. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. In source files, there shall be some way of indicating the end of each line of text; this International Standard treats such an end-of-line indicator as if it were a single new-line character. In the basic execution character set, there shall be control characters representing alert, backspace, carriage return, and new line. If any other characters are encountered in a source file (except in an identifier, a character constant, a string literal, a header name, a comment, or a preprocessing token that is never converted to a token), the behavior is undefined.
4 A letter is an uppercase letter or a lowercase letter as defined above; in this International Standard the term does not include other characters that are letters in other alphabets.
5 The universal character name construct provides a way to name other characters.
And also:
5.2.1.2 Multibyte characters
1 The source character set may contain multibyte characters, used to represent members of the extended character set. The execution character set may also contain multibyte characters, which need not have the same encoding as for the source character set. For both character sets, the following shall hold:
— The basic character set shall be present and each character shall be encoded as a single byte.
— The presence, meaning, and representation of any additional members is locale- specific.
— A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence. While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state. The interpretation for subsequent bytes in the sequence is a function of the current shift state.
— A byte with all bits zero shall be interpreted as a null character independent of shift state. Such a byte shall not occur as part of any other multibyte character.
2 For source files, the following shall hold:
— An identifier, comment, string literal, character constant, or header name shall begin and end in the initial shift state.
— An identifier, comment, string literal, character constant, or header name shall consist of a sequence of valid multibyte characters.
Given clang's usage of UTF-8 as the source encoding, this behavior is mandated by the standard:
C++ defines an identifier as the following:
identifier:
    identifier-nondigit
    identifier identifier-nondigit
    identifier digit

identifier-nondigit:
    nondigit
    universal-character-name
    other implementation-defined characters
The important part here is that identifiers can include universal-character-names. The specification also lists the allowed UCNs:
Annex E (normative)
Universal character names for identifier characters [charname]
E.1 Ranges of characters allowed [charname.allowed]
00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00B7-00BA, 00BC-00BE, 00C0-00D6, 00D8-00F6, 00F8-00FF
0100-167F, 1681-180D, 180F-1FFF
200B-200D, 202A-202E, 203F-2040, 2054, 2060-206F
2070-218F, 2460-24FF, 2776-2793, 2C00-2DFF, 2E80-2FFF
3004-3007, 3021-302F, 3031-303F
3040-D7FF
F900-FD3D, FD40-FDCF, FDF0-FE44, FE47-FFFD
10000-1FFFD, 20000-2FFFD, 30000-3FFFD, 40000-4FFFD, 50000-5FFFD,
60000-6FFFD, 70000-7FFFD, 80000-8FFFD, 90000-9FFFD, A0000-AFFFD,
B0000-BFFFD, C0000-CFFFD, D0000-DFFFD, E0000-EFFFD
The Cyrillic characters in your identifier are in the range 0100-167F.
The C++ specification further mandates that characters encoded in the source encoding be handled identically to UCNs:
Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently.)
— n3337 §2.2 Phases of translation [lex.phases]/1
So given clang's choice of UTF-8 as the source encoding, the spec mandates that these characters be converted to UCNs (or that clang's behavior be indistinguishable from performing such a conversion), and these UCNs are permitted by the spec to appear in identifiers.
It goes even further. Emoji characters happen to be in the ranges allowed by the C++ spec, so if you've seen some of those examples of Swift code with emoji identifiers and were surprised by such capability you might be even more surprised to know that C++ has exactly the same capability:
http://rextester.com/EPYJ41676
http://imgur.com/EN6uanB
Another fact that may be surprising is that this behavior isn't new with C++11; C++ has mandated this behavior since C++98. It's just that compilers ignored this for a long time: Clang implemented this feature in version 3.3 in 2013. According to this documentation Microsoft Visual Studio supports this in 2015.
Even today, GCC 6.1 only supports UCNs in identifiers when they are written literally, and does not obey the mandate that any character in its extended source character set must be treated identically to the corresponding universal-character-name. E.g. gcc allows int \u043a\u043e\u0448\u043a\u0430 = 10; but will not allow int кошка = 10;, even with -finput-charset=utf-8.
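For illustration (my own sketch, not from the original answer), here is the same identifier spelled both ways; per the standard a conforming compiler must treat the two spellings identically:

#include <iostream>

int \u043a\u043e\u0448\u043a\u0430 = 10;   // UCN spelling of "кошка"
// int кошка = 20;                         // literal UTF-8 spelling: names the *same* variable,
                                           // so it is commented out to avoid a redefinition

int main()
{
    std::cout << \u043a\u043e\u0448\u043a\u0430 << '\n';   // prints 10
}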

What's the value of characters in execution character set?

Quote from C++03 2.2 Character sets:
"The basic execution character set and the basic execution
wide-character set shall each contain all the members of the basic
source character set. [...] The values of the members of the execution
character sets are implementation-defined, and any additional members
are locale-specific."
According to this, the value of 'A', which belongs to the execution character set, is implementation-defined. So it isn't necessarily 65 (the ASCII code of 'A' in decimal)? What?!
// Not always 65?
printf ("%d", 'A');
Or do I have a misunderstanding about the value of a character in the execution character set?
Of course it can be ASCII's 65, if the execution character set is ASCII or a superset (such as UTF-8).
It doesn't say "it can't be ASCII"; it says that it is something called "the execution character set".
So the standard allows the "execution character set" to be something other than ASCII or an ASCII derivative. One example would be the EBCDIC character set that IBM used for a long time (there are probably still machines around using EBCDIC, but I suspect anything built in the last 10-15 years wouldn't be using it). The encoding of characters in EBCDIC is completely different from ASCII.
So expecting, in code, that 'A' has any particular value is not portable. A whole heap of other "common assumptions" will also fail: that there are no "holes" between A and Z, and that 'a' - 'A' == 32, are both false in EBCDIC. At least the characters A-Z are in the correct order! ;)
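A tiny demonstration of the implementation-defined values involved (a sketch of my own, not part of the answer; the EBCDIC numbers in the comments refer to the common EBCDIC code pages):

#include <cstdio>

int main()
{
    // Implementation-defined: 65 where the execution character set is ASCII-based,
    // 193 (0xC1) under EBCDIC.
    std::printf("'A' == %d\n", 'A');

    // The ASCII habit of assuming 'a' - 'A' == 32 is equally non-portable:
    // under EBCDIC the result is -64, because lowercase letters come before uppercase.
    std::printf("'a' - 'A' == %d\n", 'a' - 'A');
}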

COBOL to C++ data conversion

I have a requirement in which I need to convert:
MOVE HIGH-VALUES TO W005-TEMP1.
MOVE LOW-VALUES TO W005-TEMP2.
How can I code these two statements in C++?
Thanks
Akshay
In COBOL, HIGH-VALUES represents one or more occurrences of the character that has the highest ordinal position in the collating sequence used. Similarly, LOW-VALUES represents the character having the lowest ordinal position in the collating sequence used.
The key here is "the collating sequence used". The SPECIAL-NAMES paragraph may be used to specify a customized collating sequence, but this is generally not done (still worth checking, though). In the absence of a custom collating sequence, HIGH-VALUES is equal to X'FF' and LOW-VALUES is X'00' for both the EBCDIC and ASCII character sets.
To set W005-TEMP1 to HIGH-VALUES, you need to fill each byte it occupies with X'FF'. To set W005-TEMP2 to LOW-VALUES, you need to fill each byte it occupies with X'00'.
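A rough C++ equivalent, as a sketch only (the field names and sizes are assumptions, since the COBOL picture clauses aren't shown):

#include <cstring>

int main()
{
    // Sizes are placeholders; match them to the PIC clauses of the COBOL fields.
    char w005_temp1[10];
    char w005_temp2[10];

    // MOVE HIGH-VALUES TO W005-TEMP1.  ->  fill every byte with X'FF'
    std::memset(w005_temp1, 0xFF, sizeof w005_temp1);

    // MOVE LOW-VALUES TO W005-TEMP2.   ->  fill every byte with X'00'
    std::memset(w005_temp2, 0x00, sizeof w005_temp2);
}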