COBOL to C++ data conversion - c++

I have a requirement in which I need to convert the following:
MOVE HIGH-VALUES TO W005-TEMP1.
MOVE LOW-VALUES TO W005-TEMP2.
How can I code these two statements in C++?
Thanks
Akshay

In COBOL, HIGH-VALUES represents one or more occurrences of the character that has the highest ordinal position in the collating sequence used. Similarly, LOW-VALUES represents the character having the lowest ordinal position in the collating sequence used.
The key here is "the collating sequence used". The SPECIAL-NAMES paragraph may be used to specify a customized collating sequence, but this is generally not done (still, it is worth checking). In the absence of a custom collating sequence, HIGH-VALUES is equal to X'FF' and LOW-VALUES is X'00' for both the EBCDIC and ASCII character sets.
To set W005-TEMP1 to HIGH-VALUES, you need to fill each byte it occupies with X'FF'. To set W005-TEMP2 to LOW-VALUES, you need to fill each byte it occupies with X'00'.
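In C++ this is typically done with std::memset (or std::fill). A minimal sketch, assuming the COBOL fields map to fixed-size character buffers (the 10-byte size is an assumption for illustration):

#include <algorithm>
#include <cstring>
#include <iterator>

int main() {
    // Hypothetical C++ counterparts of the COBOL fields; the sizes are assumed.
    char w005_temp1[10];
    char w005_temp2[10];

    // MOVE HIGH-VALUES TO W005-TEMP1: fill every byte with X'FF'.
    std::memset(w005_temp1, 0xFF, sizeof w005_temp1);

    // MOVE LOW-VALUES TO W005-TEMP2: fill every byte with X'00'.
    std::memset(w005_temp2, 0x00, sizeof w005_temp2);

    // Equivalent using a standard algorithm instead of memset:
    std::fill(std::begin(w005_temp2), std::end(w005_temp2), '\0');
}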

Related

C++ test for UTF-8 validation

I need to write unit tests for UTF-8 validation, but I don't know how to write incorrect UTF-8 cases in C++:
TEST(validation, Tests)
{
    std::string str = "hello";
    EXPECT_TRUE(validate_utf8(str));
    // I need incorrect UTF-8 cases
}
How can I write incorrect UTF-8 cases in C++?
You can specify individual bytes in the string with the \x escape sequence in hexadecimal form or the \000 escape sequence in octal form.
For example:
std::string str = "\xD0";
which is an incomplete UTF-8 sequence (a two-byte lead byte with no continuation byte).
Have a look at https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt for valid and malformed UTF8 test cases.
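For example, a few more malformed inputs one might add to the test from the question (validate_utf8 is the name from the question; these byte patterns are standard examples of invalid UTF-8):

// A lone continuation byte: 0x80 can never start a character.
EXPECT_FALSE(validate_utf8(std::string("\x80")));
// A truncated two-byte sequence: lead byte 0xD0 with no continuation byte.
EXPECT_FALSE(validate_utf8(std::string("\xD0")));
// An overlong encoding of '/' (0x2F), which decoders must reject.
EXPECT_FALSE(validate_utf8(std::string("\xC0\xAF")));
// The bytes 0xFE and 0xFF never appear in valid UTF-8.
EXPECT_FALSE(validate_utf8(std::string("\xFE\xFF")));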
In UTF-8, any byte with a most significant bit of 0 is an ordinary ASCII character; any other byte is part of a multi-byte sequence (MBS).
If the second most significant bit is also a 1, then this is the first byte of an MBS; otherwise it is one of the follow-up (continuation) bytes.
In the first byte of an MBS, the number of leading one-bits gives you the number of bytes in the entire sequence, e.g. 0b110xxxxx (with arbitrary values for x) is the start byte of a two-byte sequence.
Theoretically this scheme would allow sequences of up to seven bytes; the original design allowed up to six, and RFC 3629 restricts valid UTF-8 to at most four bytes (code points up to U+10FFFF).
You can now produce arbitrary code points by defining appropriate sequences, e.g. "\xc8\x85" represents the sequence 0b11001000 0b10000101, which is a legal pattern and encodes the bits 01000 000101 (note how the leading bits forming the UTF-8 headers are cut away), corresponding to a value of 0x205 or 517, i.e. code point U+0205, which happens to be an assigned character.
In the same way you can represent longer valid sequences by increasing the number of leading one-bits together with the appropriate number of follow-up bytes (note again: the number of initial one-bits is the total number of bytes, including the first byte of the MBS).
Similarly, you can produce invalid sequences in which the total number of bytes does not match (too many or too few) the count promised by the initial one-bits.
So far you can produce arbitrary valid or invalid sequences, where the valid ones represent arbitrary code points. You might still need to look up which of these code points are actually assigned.
Finally, you might additionally consider composed characters (with diacritics): they can be represented either as a base character followed by a combining character, or as a single precomposed (normalised) character. If you want to go that far, you would need to look up in the Unicode standard which combinations are legal and which normalised code points they correspond to.
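Putting the byte-level rules above together, a minimal structural checker might look like this (a sketch only: it checks lead/continuation structure and sequence length, not overlong forms, surrogates, or the U+10FFFF ceiling; the function name is made up):

#include <cstddef>
#include <string>

// Returns true if every lead byte announces a length that is actually
// followed by the right number of continuation bytes (0b10xxxxxx).
bool looks_like_utf8(const std::string& s) {
    for (std::size_t i = 0; i < s.size(); ) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        if      (b < 0x80)           len = 1;   // 0xxxxxxx: plain ASCII
        else if ((b & 0xE0) == 0xC0) len = 2;   // 110xxxxx: two-byte sequence
        else if ((b & 0xF0) == 0xE0) len = 3;   // 1110xxxx: three-byte sequence
        else if ((b & 0xF8) == 0xF0) len = 4;   // 11110xxx: four bytes (RFC 3629 maximum)
        else return false;                      // stray continuation byte or 0xF8..0xFF
        if (i + len > s.size()) return false;   // sequence truncated at end of string
        for (std::size_t k = 1; k < len; ++k)   // every follow-up byte must be 10xxxxxx
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
        i += len;
    }
    return true;
}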

What is the "initial shift state"?

In the standard, the term "initial shift state" is cited frequently, seemingly in various contexts, such as multibyte characters (and strings) and files. But the standard never explains what exactly it is.
What is it? And what is a "shift" here in general?
Also:
Because the term seems to be used in different contexts (characters, strings, and files), I will point to a few passages from the standard (specifically ISO/IEC 9899:2018 (C18)) that include the term "initial shift state":
§ 5.2.1.2 - Multibyte characters
— A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence.
— An identifier, comment, string literal, character constant, or header name shall begin and end in the initial shift state.
§ 7.21.3 - Files
"— A file need not begin nor end in the initial shift state.274)"
"274)Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream (because of possible trailing null characters) or for any stream with state-dependent encoding that does not assuredly end in the initial shift state."
§7.21.6.2 - The fscanf function
For the s conversion specifier:
"If an l length modifier is present, the input shall be a sequence of multibyte characters that begins in the initial shift state."
What is meant by the "initial shift state"? What is that?
What is a "shift" in this context?
In the context of strings, is it the double quotation mark " that marks the beginning and end of a format string?
Thanks in advance.
A shift state is a state that informs the interpretation of some byte sequence as characters; it is encoding-dependent.
From https://www.gnu.org/software/libc/manual/html_node/Shift-State.html
In some multibyte character codes, the meaning of any particular byte
sequence is not fixed; it depends on what other sequences have come
earlier in the same string. Typically there are just a few sequences
that can change the meaning of other sequences; these few are called
shift sequences and we say that they set the shift state for other
sequences that follow.
To illustrate shift state and shift sequences, suppose we decide that
the sequence 0200 (just one byte) enters Japanese mode, in which pairs
of bytes in the range from 0240 to 0377 are single characters, while
0201 enters Latin-1 mode, in which single bytes in the range from 0240
to 0377 are characters, and interpreted according to the ISO Latin-1
character set. This is a multibyte code that has two alternative shift
states (“Japanese mode” and “Latin-1 mode”), and two shift sequences
that specify particular shift states.
The initial shift state is just the shift state initially, i.e. at the start of processing; in the example above it would be whichever of ISO Latin-1 or Japanese the sequence in question begins in.
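In C and C++ the initial shift state is concretely what a zero-initialized std::mbstate_t represents. A small sketch using the standard mbrtowc conversion (the locale choice and output formatting are just for illustration):

#include <clocale>
#include <cstdio>
#include <cstring>
#include <cwchar>

int main() {
    std::setlocale(LC_ALL, "");          // pick up the environment's locale/encoding

    const char* text = "hello";
    const char* end  = text + std::strlen(text) + 1;   // include the terminating null

    std::mbstate_t state{};              // zero-initialized: the initial shift state

    wchar_t wc;
    const char* p = text;
    std::size_t n;
    // mbrtowc() consumes bytes and updates `state` whenever it crosses a shift
    // sequence; for stateless encodings such as UTF-8 the state never changes.
    while ((n = std::mbrtowc(&wc, p, static_cast<std::size_t>(end - p), &state)) != 0) {
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2))
            break;                       // invalid or incomplete multibyte sequence
        std::printf("decoded U+%04lX\n", static_cast<unsigned long>(wc));
        p += n;
    }
}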

How can I force the user/OS to input an ASCII string?

This is a follow-up to this question: Is std::string suppose to have only Ascii characters
I want to build a simple console application that takes input from the user as a set of characters. Those characters include the digits 0-9 and the letters a-z.
I am handling the input on the assumption that it is ASCII. For example, I use something like static_cast<unsigned int>(my_char - '0') to get the number as an unsigned int.
How can I make this code cross-platform? How can I say that I want the input always to be ASCII? Or have I missed a lot of concepts, and is static_cast<unsigned int>(my_char - '0') just a bad way of doing it?
P.S. In ASCII (at least) the digits are in sequential order. However, I do not know whether they are in other encodings. (I am pretty sure they are, but there is no guarantee, right?)
How can I force the user/OS to input an ASCII string?
You cannot, unless you let the user specify the numeric values of such ASCII input.
It all depends on how the terminal implementation that serves std::cin translates keystrokes like 0 into a specific number, and whether your toolchain's intrinsic value for '0' matches that number.
You simply shouldn't rely on explicit ASCII values (e.g. magic numbers); use char literals to get portable code. The assumption that my_char - '0' yields the actual digit's value holds for every conforming character set. The C++ standard states in [lex.charset]/3 that
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.[...]
emphasis mine
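Given that guarantee, the conversion from the question is portable as written; a minimal sketch (digit_value is a made-up helper name):

#include <cctype>

// Portable on every conforming implementation, because '0'..'9' are
// guaranteed to be contiguous and increasing in the execution character set.
unsigned int digit_value(char my_char) {
    // Only meaningful when the character is actually a decimal digit,
    // e.g. after checking std::isdigit(static_cast<unsigned char>(my_char)).
    return static_cast<unsigned int>(my_char - '0');
}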
You can't force or even verify that beforehand. An "evil user" can always sneak a UTF-8 encoded string into your application with no characters above U+7F, and such a string happens to be valid ASCII as well.
Also, whatever platform-specific measures you take, the user can pipe in a UTF-16LE encoded file. Or /dev/urandom.
Your mistake is treating string encoding as some magic property of an input stream - it is not. It is, well, an encoding, like JPEG or AVI, and must be handled in exactly the same way: read the input, match it against the format, and report errors on parsing failure.
For your case, if you want to accept only ASCII, read the input stream byte by byte and throw/exit with an error if you ever encounter a byte outside the ASCII range.
However, if you later encounter a terminal providing data in some incompatible encoding, such as UTF-16LE, you have no choice but to write detection (based on the byte order mark) and a conversion routine. A minimal sketch of the byte-by-byte check follows below.
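Here is that sketch (the function name and the choice to throw are illustrative):

#include <iostream>
#include <stdexcept>
#include <string>

// Read one line and reject anything outside the 7-bit ASCII range.
std::string read_ascii_line(std::istream& in) {
    std::string line;
    std::getline(in, line);
    for (unsigned char byte : line) {
        if (byte > 0x7F)
            throw std::runtime_error("input is not ASCII");
    }
    return line;
}

int main() {
    try {
        std::cout << "ok: " << read_ascii_line(std::cin) << '\n';
    } catch (const std::exception& e) {
        std::cerr << e.what() << '\n';
        return 1;
    }
}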

What guarantees does C++ make about the ordering of character literals?

What guarantees does C++ make about the ordering of character literals? Is there a definite ordering of characters in the basic source character set? (e.g. is 'a' < 'z' guaranteed to be true? How about 'A' < 'z'?)
The standard only provides a guarantee for ordering of the decimal digits 0 to 9, from the draft C++11 standard section 2.3 [lex.charset]:
In both the source and execution basic character sets, the value of
each character after 0 in the above list of decimal digits shall be
one greater than the value of the previous.
and otherwise says (emphasis mine):
The basic execution character set and the basic execution
wide-character set shall each contain all the members of the basic
source character set, plus control characters representing alert,
backspace, and carriage return, plus a null character (respectively,
null wide character), whose representation has all zero bits. For each
basic execution character set, the values of the members shall be
non-negative and distinct from one another.
Note that EBCDIC's letters are not contiguous (there are gaps between 'i'/'j' and 'r'/'s'), and its ordering differs from ASCII: in EBCDIC 'A' is 0xC1 and 'z' is 0xA9, so 'A' < 'z' does not hold there.
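A small illustration of what is and is not guaranteed (the first assertion holds on every conforming implementation; the commented-out ones would fail when compiling for EBCDIC):

// Guaranteed by [lex.charset]: the decimal digits are contiguous and increasing.
static_assert('9' - '0' == 9, "digits are contiguous on every conforming implementation");

// NOT guaranteed: letters need not be contiguous (EBCDIC has gaps between
// 'i'/'j' and 'r'/'s'), and in EBCDIC 'A' (0xC1) is greater than 'z' (0xA9).
// static_assert('z' - 'a' == 25, "may fail on EBCDIC targets");
// static_assert('A' < 'z', "may fail on EBCDIC targets");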

Is the character set of a char literal guaranteed to be ASCII?

Coming from a discussion started here, does the standard specify values for characters? So, is '0' guaranteed to be 48? That's what ASCII would tell us, but is it guaranteed? If not, have you seen any compiler where '0' isn't 48?
No. There's no requirement for either the source or execution character set to use an encoding with an ASCII subset. I haven't seen any non-ASCII implementations, but I know someone who knows someone who has. (It is required that '0' to '9' have contiguous integer values, but that's a duplicate question somewhere else on SO.)
The encoding used for the source character set controls how the bytes of your source code are interpreted into the characters used in the C++ language. The standard describes the members of the execution character set as having values. It is the encoding that maps these characters to their corresponding values that determines the integer value of '0'.
Although at least all of the members of the basic source character set plus some control characters and a null character with value zero must be present (with appropriate values) in the execution character set, there is no requirement for the encoding to be ASCII or to use ASCII values for any particular subset of characters (other than the null character).
No, the Standard is very careful not to specify what the source character encoding is.
C and C++ compilers run on EBCDIC computers too, you know, where '0' != 0x30.
However, I believe it is required that '1' == '0' + 1.
It's 0xF0 in EBCDIC. I've never used an EBCDIC compiler, but I'm told that they were all the rage at IBM for a while.
There's no requirement in the C++ standard that the source or execution encodings are ASCII-based. It is guaranteed that '0' == '1' - 1 (and in general that the digits are contiguous and in order). It is not guaranteed that the letters are contiguous, and indeed in EBCDIC 'J' != 'I' + 1 and 'S' != 'R' + 1.
According to the C++11 draft standard N3225:
The glyphs for the members of the basic source character set are
intended to identify characters from the subset of ISO/IEC 10646 which
corresponds to the ASCII character set. However, because the mapping
from source file characters to the source character set (described in
translation phase 1) is specified as implementation-defined, an
implementation is required to document how the basic source characters
are represented in source files
In short, the character set is not required to map onto the ASCII table, even though I've never heard of any implementation that differs.
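If a program relies on ASCII values, one can at least make that assumption explicit and let the build fail on a non-ASCII target; a minimal sketch:

// Fails to compile on an EBCDIC target, where '0' is 0xF0 and 'A' is 0xC1.
static_assert('0' == 48 && 'A' == 65 && 'a' == 97,
              "this code assumes an ASCII-compatible execution character set");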