What assumption is safe for a C++ implementation's character set?

In The C++ Programming Language 6.2.3, it says:
It is safe to assume that the implementation character set includes the decimal digits, the 26 alphabetic characters of English, and some of the basic punctuation characters. It is not safe to assume that:
There are no more than 127 characters in an 8-bit character set (e.g., some sets provide 255 characters).
There are no more alphabetic characters than English provides (most European languages provide more, e.g., æ, þ, and ß).
The alphabetic characters are contiguous (EBCDIC leaves a gap between 'i' and 'j').
Every character used to write C++ is available (e.g., some national character sets do not provide {, }, [, ], |, and \).
A char fits in 1 byte. There are embedded processors without byte accessing hardware for which a char is 4 bytes. Also, one could reasonably use a 16-bit Unicode encoding for the basic chars.
I'm not sure I understand the last two statements.
In section 2.3 of the standard, it says:
The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ! = , \ " ' ...
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits.
We can see that it is stated by the standard that characters like { } [ ] | \ are part of the basic execution character set. Then why TC++PL says it's not safe to assume that those characters are available in the implementation's character set?
And for the size of a char, in section 5.3.3 of the standard:
The sizeof operator yields the number of bytes in the object
representation of its operand. ... ... sizeof(char), sizeof(signed
char) and sizeof(unsigned char) are 1.
We can see that the standard states that a char is of 1 byte. What is the point TC++PL trying to make here?

The word "byte" seems to be used sloppily in the first quote. As far as C++ is concerned, a byte is always a char, but the number of bits it holds is platform-dependent (and available in CHAR_BITS). Sometimes you want to say "a byte is eight bits", in which case you get a different meaning, and that may have been the intended meaning in the phrase "a char has four bytes".
The execution character set may very well be larger than or incompatible with the input character set provided by the environment. Trigraphs and alternative tokens exist to allow execution-set characters to be written with a more limited set of input characters on such restricted platforms (e.g. not is identical for all purposes to !, and the latter is not available in all character sets or keyboard layouts).
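As a small illustration (a sketch of my own; the alternative tokens below are part of standard C++, though rarely seen in practice):
#include <iostream>

// Alternative tokens: <% %> stand for { }, <: :> for [ ], and "not" for !.
int main()
<%
    int a<:3:> = <% 1, 0, 2 %>;
    int n = 0;
    for (int i = 0; i < 3; ++i)
        if (not (a<:i:> == 0))
            ++n;
    std::cout << n << '\n'; // prints 2
    return 0;
%>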

It used to be the case that some national variants of ASCII, such as those for the Scandinavian languages, used accented alphabetic characters for the code points where US ASCII has punctuation such as [, ], {, }. These variants are the reason that C89 included trigraphs: they allow code to be written in the 'invariant subset' of ISO 646. See the chart of the characters used in the national variants on the Wikipedia page.
For example, someone in Scandinavia might have to read:
#include <stdio.h>
int main(int argc, char **argv)
æ
    for (int i = 1; i < argc; i++)
        printf("%sØn", argvÆiÅ);
    return 0;
å
instead of:
#include <stdio.h>
int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        printf("%s\n", argv[i]);
    return 0;
}
Using trigraphs, you might write:
??=include <stdio.h>
int main(int argc, char **argv)
??<
    for (int i = 1; i < argc; i++)
        printf("%s??/n", argv??(i??));
    return 0;
??>
which is equally ghastly in any language.
I'm not sure how much of an issue this still is, but that's why those caveats are in the book.

Related

How to make std::regex match Utf8

I would like a pattern like ".c", where the "." matches any single UTF-8 character followed by 'c', using std::regex.
I've tried this under Microsoft C++ and g++ and get the same result: each time, the "." only matches a single byte.
here's my test case:
#include <stdio.h>
#include <iostream>
#include <string>
#include <regex>
using namespace std;

int main(int argc, char** argv)
{
    // make a string with 3 UTF8 characters
    const unsigned char p[] = { 'a', 0xC2, 0x80, 'c', 0 };
    string tobesearched((char*)p);
    // want to match the UTF8 character before c
    string pattern(".c");
    regex re(pattern);
    std::smatch match;
    bool r = std::regex_search(tobesearched, match, re);
    if (r)
    {
        // m.size() will be bytes, and we expect 3
        // expect 0xC2, 0x80, 'c'
        string m = match[0];
        cout << "match length " << m.size() << endl;
        // but we only get 2, we get the 0x80 and the 'c'.
        // so it's matching on single bytes and not utf8
        // code here is just to dump out the byte values.
        for (int i = 0; i < m.size(); ++i)
        {
            int c = m[i] & 0xff;
            printf("%02X ", c);
        }
        printf("\n");
    }
    else
        cout << "not matched\n";
    return 0;
}
I wanted the pattern ".c" to match 3 bytes of my tobesearched string, where the first two are a 2-byte utf8 character followed by 'c'.
Some regex flavours support \X which will match a single unicode character, which may consist of a number of bytes depending on the encoding. It is common practice for regex engines to get the bytes of the subject string in an encoding the engine is designed to work with, so you shouldn't have to worry about the actual encoding (whether it is US-ASCII, UTF-8, UTF-16 or UTF-32).
Another option is the \uFFFF where FFFF refers to the unicode character at that index in the unicode charset. With that, you could create a ranged match inside a character class i.e. [\u0000-\uFFFF]. Again, it depends on what the regex flavour supports. There is another variant of \u in \x{...} which does the same thing, except the unicode character index must be supplied inside curly braces, and need not be padded e.g. \x{65}.
Edit: This website is amazing for learning more about regex across various flavours https://www.regular-expressions.info
Edit 2: To match any Unicode-exclusive character, i.e. excluding the 1-byte characters in the ASCII table, you can try "[\x{80}-\x{10FFFF}]", i.e. any character with a value from 128 (the first code point outside the ASCII range) up to 1,114,111 (U+10FFFF, the highest code point Unicode defines; UTF-8 encodes these with up to 4 bytes).
A loop through the individual bytes would be more efficient, though (a sketch implementing this follows the list below):
If the lead bit is 0, i.e. if its signed value is > -1, it is a 1 byte char representation. Skip to the next byte and start again.
Else if the lead bits are 11110 i.e. if its signed value is > -17, n=4.
Else if the lead bits are 1110 i.e. if its signed value is > -33, n=3.
Else if the lead bits are 110 i.e. if its signed value is > -65, n=2.
Optionally, check that the next n-1 bytes each start with 10, i.e. each has a signed value less than -64 (between -128 and -65); if any does not, it is invalid UTF-8 encoding.
You now know that the previous n bytes constitute a unicode-exclusive character. So, if the NEXT character is 'c' i.e. == 99, you can say it matched - return true.
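A minimal sketch of that byte loop (my own illustration, not code from the answer; it assumes the input is valid UTF-8 and uses unsigned bit masks rather than signed comparisons for clarity):
#include <iostream>
#include <string>

// Return true if the string contains a multi-byte (non-ASCII) UTF-8
// character immediately followed by 'c'.
bool utf8_char_then_c(const std::string& s)
{
    for (std::size_t i = 0; i < s.size(); )
    {
        unsigned char b = s[i];
        std::size_t n = 1;                    // bytes in this character
        if      ((b & 0x80) == 0x00) n = 1;   // 0xxxxxxx : ASCII
        else if ((b & 0xE0) == 0xC0) n = 2;   // 110xxxxx
        else if ((b & 0xF0) == 0xE0) n = 3;   // 1110xxxx
        else if ((b & 0xF8) == 0xF0) n = 4;   // 11110xxx
        if (n > 1 && i + n < s.size() && s[i + n] == 'c')
            return true;
        i += n;
    }
    return false;
}

int main()
{
    const unsigned char p[] = { 'a', 0xC2, 0x80, 'c', 0 };
    std::cout << std::boolalpha << utf8_char_then_c((const char*)p) << '\n'; // true
}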

Difference between converting int to char by (char) and by ASCII

I have an example:
int var = 5;
char ch = (char)var;
char ch2 = var+48;
cout << ch << endl;
cout << ch2 << endl;
I had some other code where (char) returned the wrong answer but +48 did not. When I changed ONLY the (char) cast to +48, my code worked correctly.
What is the difference between converting int to char by using (char) and +48 (ASCII) in C++?
char ch = (char)var; has the same effect as char ch = var; and assigns the numeric value 5 to ch. You're using ASCII (supported by all modern systems), and ASCII character code 5 represents Enquiry ('ENQ'), an old terminal control code. Perhaps some old-timer has a clue what it did!
char ch2 = var + 48; assigns the numeric value 53 to ch2, which happens to represent the ASCII character for the digit '5'. ASCII 48 is zero ('0'), and the digits all appear in the ASCII table in order after that. So 48 + 5 lands on 53 (which represents the character '5').
In C++ char is an integer type. The value is interpreted as representing an ASCII character, but it should be thought of as holding a number.
Its numeric range is either [-128,127] or [0,255]. That's because C++ requires sizeof(char)==1 and all modern platforms have 8-bit bytes.
NB: C++ doesn't actually mandate ASCII, but again that will be the case on all modern platforms.
PS: I think it's an unfortunate artifact of C (inherited by C++) that sizeof(char)==1 and there isn't a separate fundamental type called byte.
A char is simply the smallest integral denomination in C++. Output facilities like cout and printf map char values to the corresponding entry in the platform's character mapping; on Windows computers this is typically ASCII.
Note that value 5 in ASCII maps to the Enquiry character, which has no printable representation, while value 53 maps to the printable character '5'.
A generally accepted hack to store a number 0-9 in a char is to do: const char ch = var + '0'. It's important to note the caveats here:
You might worry that on a non-ASCII character mapping the digits 0 through 9 are not laid out in order; in fact both C and C++ guarantee that '0' through '9' are contiguous, so this part is safe (the same is not guaranteed for letters).
If var is outside the 0-9 range, var + '0' will map to something other than a digit character.
A way to get the most significant digit of a number that also works outside the 0-9 range is to use:
const auto ch = std::to_string(var).front();
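A small self-contained illustration of the approaches above (my own sketch, not part of the original answer):
#include <iostream>
#include <string>

int main()
{
    int var = 5;

    char ch  = static_cast<char>(var);       // value 5: the unprintable ENQ control code
    char ch2 = var + '0';                    // value 53: the printable character '5'
    char ch3 = std::to_string(var).front();  // also '5'; works for any magnitude of var

    std::cout << static_cast<int>(ch) << ' ' << ch2 << ' ' << ch3 << '\n'; // prints: 5 5 5
}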
Generally char represents a number just as int does. Casting an int value to char doesn't produce its ASCII representation.
The ASCII codes for the digit characters range from 48 (== '0') to 57 (== '9'). So to get the printable digit you have to add '0' (or 48).
The difference is that casting to char only changes the type while keeping the same numeric value, whereas adding 48 shifts the value to the ASCII code of the corresponding digit character.
It's important to note that an int is typically 32 bits and a char is typically 8 bits. This means that the number you can store in a char is from -128 to +127 (or 0 to 255, i.e. 2^8 - 1, if you use unsigned char) and in an int from -2,147,483,648 (-2^31) to 2,147,483,647 (2^31 - 1) (or 0 to 2^32 - 1 for unsigned).
Adding 48 to a value is not changing the type to char.

VS Intellisense shows escaped characters for some (not all) byte constants

In Visual Studio C++ I have defined a series of channelID constants with decimal values from 0 to 15. I have made them of type uint8_t for reasons having to do with the way they are used in the embedded context in which this code runs.
When hovering over one of these constants, I would like intellisense to show me the numeric value of the constant. Instead, it shows me the character representation. For some non-printing values it shows an escaped character representing some ASCII value, and for others it shows an escaped octal value for that character.
const uint8_t channelID_Observations = 1; // '\001'
const uint8_t channelID_Events = 2; // '\002'
const uint8_t channelID_Wav = 3; // '\003'
const uint8_t channelID_FFT = 4; // '\004'
const uint8_t channelID_Details = 5; // '\005'
const uint8_t channelID_DebugData = 6; // '\006'
const uint8_t channelID_Plethysmography = 7; // '\a'
const uint8_t channelID_Oximetry = 8; // '\b'
const uint8_t channelID_Position = 9; // ' ' ** this is displayed as a space between single quotes
const uint8_t channelID_Excursion = 10; // '\n'
const uint8_t channelID_Motion = 11; // '\v'
const uint8_t channelID_Env = 12; // '\f'
const uint8_t channelID_Cmd = 13; // '\r'
const uint8_t channelID_AudioSnore = 14; // '\016'
const uint8_t channelID_AccelSnore = 15; // '\017'
Some of the escaped codes are easily recognized and the hex or decimal equivalents easily remembered (\n == newline == 0x0A) but others are more obscure. For example decimal 7 is shown as '\a', which in some systems represents the ASCII BEL character.
Some of the representations are mystifying to me -- for example decimal 9 would be an ASCII tab, which today often appears as '\t', but intellisense shows it as a space character.
Why is an 8-bit unsigned integer always treated as a character, no matter how I try to define it as a numeric value?
Why are only some, but not all of these characters shown as escaped symbols for their ASCII equivalents, while others get their octal representation?
What is the origin of the obscure symbols used? For example, '\a' for decimal 7 matches the ISO-defined Control0 set, which has a unicode representation -- but then '\t' should be shown for decimal 9. Wikipedia C0 control codes
Is there any way to make intellisense hover tips show me the numeric value of such constants rather than a character representation? Decoration? VS settings? Typedefs? #defines?
You are misreading the intellisense. 0x7 = '\a' not the literal char 'a'. '\a' is the bell / alarm.
See the following article on escape sequences - https://en.wikipedia.org/wiki/Escape_sequences_in_C
'\a' does indeed have the value 0x7. If you assign 0x07 to a uint8_t, you can be pretty sure that the compiler will not change that assignment to something else. IntelliSense just represents the value in another way, it doesn't change your values.
Also, 'a' has the value 0x61, that's what probably tripped you up.
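A quick way to convince yourself of the mapping (a sketch of my own, assuming an ASCII-based execution character set, which is what MSVC uses):
#include <cstdint>

// These all hold on ASCII-based platforms; '\a' is 7, not the letter 'a' (0x61).
static_assert('\a' == 0x07, "alert");
static_assert('\b' == 0x08, "backspace");
static_assert('\t' == 0x09, "horizontal tab");
static_assert('\n' == 0x0A, "newline");
static_assert('a'  == 0x61, "lowercase letter a");

const std::uint8_t channelID_Plethysmography = 7; // same value as '\a'

int main() {}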
After more than a year, I've decided to document what I found when I pursued this further. The correct answer was implied in djgandy's answer, which cited Wikipedia, but I want to make it explicit.
With the exception of one value (0x09), Intellisense does appear to treat these values consistently, and that treatment is rooted in an authoritative source: my constants are unsigned 8-bit constants, thus they are "character-constants" per the C11 language standard (section 6.4.4).
For character constants that do not map to a displayable character, Section 6.4.4.4 defines their syntax as
6.4.4.4 Character constants
Syntax
. . .
simple-escape-sequence: one of
\'
\"
\?
\\
\a
\b
\f
\n
\r
\t
\v
octal-escape-sequence:
\ octal-digit
\ octal-digit octal-digit
\ octal-digit octal-digit octal-digit
"Escape Sequences" are further defined in the C language definition section 5.2.2:
§5.2.2 Character display semantics
2) Alphabetic escape sequences representing nongraphic characters in
the execution character set are intended to produce actions on display
devices as follows:
\a (alert) Produces an audible or visible alert
without changing the active position.
\b (backspace) Moves the active position to the previous position on
the current line. If the active position is at the initial position of
a line, the behavior of the display device is unspecified.
\f (form feed) Moves the active position to the initial position at
the start of the next logical page.
\n (new line) Moves the active position to the initial position of the
next line.
\r (carriage return) Moves the active position to the initial position
of the current line.
\t (horizontal tab) Moves the active position
to the next horizontal tabulation position on the current line. If the
active position is at or past the last defined horizontal tabulation
position, the behavior of the display device is unspecified.
\v (vertical tab) Moves the active position to the initial position of
the next vertical tabulation position. If the active position is at or
past the last defined vertical tabulation position, the behavior of
the display device is unspecified.
3) Each of these escape sequences shall produce a unique
implementation-defined value which can be stored in a single char
object. The external representations in a text file need not be
identical to the internal representations, and are outside the scope
of this International Standard.
Thus the only place where Intellisense falls down is in the handling of 0x09, which should be displayed as
'\t'
but actually is displayed as
' ' (a space between single quotes)
So what's that all about? I suspect Intellisense considers a tab to be a printable character, but suppresses the tab action in its formatting. This seems to me inconsistent with the C and C++ standards and is also inconsistent with its treatment of other escape characters, but perhaps there's some justification for it that "escapes" me :)

Why extended ASCII (special) characters take 2 bytes to get stored?

ASCII characters ranging from 32 to 126 are printable. 127 is DEL, and the codes after that are considered the extended characters.
To check, how are they stored in the std::string, I wrote a test program:
#include <iostream>
#include <string>
using namespace std;

int main ()
{
    string s; // ASCII
    s += "!"; // 33
    s += "A"; // 65
    s += "a"; // 97
    s += "â"; // 131
    s += "ä"; // 132
    s += "à"; // 133

    cout << s << endl; // Print directly

    for (auto i : s)   // Print after iteration
        cout << i;

    cout << "\ns.size() = " << s.size() << endl; // outputs 9!
}
The special characters visible in the code above actually look different and those can be seen in this online example (also visible in vi).
In the string s, the first 3 normal characters take 1 byte each, as expected. The next 3 extended characters surprisingly take 2 bytes each.
Questions:
Despite being ASCII (within the range of 0 to 256), why do those 3 extended characters take 2 bytes of space?
When we iterate through s using a range-based loop, how is it figured out that it has to advance 1 byte for normal characters and 2 bytes for extended characters?
[Note: This may also apply to C and other languages.]
Despite being ASCII (within the range of 0 to 256), why do those 3 extended characters take 2 bytes of space?
If you define 'being ASCII' as containing only bytes in the range [0, 256), then all data is ASCII: [0, 256) is the same as the range a byte is capable of representing, thus all data represented with bytes is ASCII, under your definition.
The issue is that your definition is incorrect, and you're looking incorrectly at how data types are determined; The kind of data represented by a sequence of bytes is not determined by those bytes. Instead, the data type is metadata that is external to the sequence of bytes. (This isn't to say that it's impossible to examine a sequence of bytes and determine statistically what kind of data it is likely to be.)
Let's examine your code, keeping the above in mind. I've taken the relevant snippets from the two versions of your source code:
s += "â"; // 131
s += "ä"; // 132
s += "â"; // 131
s += "ä"; // 132
You're viewing these source code snippets as text rendered in a browser, rather than as raw binary data. You've presented these two things as the 'same' data, but in fact they're not the same. Pictured above are two different sequences of characters.
However, there is something interesting about these two sequences of text: one of them, encoded into bytes under one encoding scheme, produces exactly the same bytes as the other does under a different encoding scheme. That is, the same sequence of bytes on disk may represent two different sequences of text depending on the encoding! In other words, to figure out what a sequence of bytes means, we have to know what kind of data it is, and therefore which decoding scheme to use.
So here's what probably happened. In vi you wrote:
s += "â"; // 131
s += "ä"; // 132
You were under the impression that vi would represent those characters using extended ASCII, and thus use the bytes 131 and 132. But that was incorrect. vi did not use extended ASCII, and instead it represented those characters using a different scheme (UTF-8) which happens to use two bytes to represent each of those characters.
Later, when you opened the source code in a different editor, that editor incorrectly assumed the file was extended ASCII and displayed it as such. Since extended ASCII uses a single byte for every character, it took the two bytes vi used to represent each of those characters, and showed one character for each byte.
The bottom line is that you're incorrect that the source code is using extended ASCII, and so your assumption that those characters would be represented by single bytes with the values 131 and 132 was incorrect.
When we iterate through s using a range-based loop, how is it figured out that it has to advance 1 byte for normal characters and 2 bytes for extended characters?
Your program isn't doing this. The characters print okay in your ideone.com example because printing the two bytes that represent such a character one after the other still displays that character correctly. Here's an example that makes this clear: live example.
std::cout << "Printed together: '";
std::cout << (char)0xC3;
std::cout << (char)0xA2;
std::cout << "'\n";
std::cout << "Printed separated: '";
std::cout << (char)0xC3;
std::cout << '/';
std::cout << (char)0xA2;
std::cout << "'\n";
Printed together: 'â'
Printed separated: '�/�'
The '�' character is what shows up when an invalid encoding is encountered.
If you're asking how you can write a program that does do this, the answer is to use code that understands the details of the encoding being used. Either get a library that understands UTF-8 or read the UTF-8 spec yourself.
You should also keep in mind that the use of UTF-8 here is simply because this editor and compiler use UTF-8 by default. If you were to write the same code with a different editor and compile it with a different compiler, the encoding could be completely different; assuming code is UTF-8 can be just as wrong as your earlier assumption that the code was extended ASCII.
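As a rough illustration of code that understands the encoding (my own sketch, assuming well-formed UTF-8 input): counting code points rather than bytes only requires skipping continuation bytes, which all have the bit pattern 10xxxxxx.
#include <iostream>
#include <string>

// Count UTF-8 code points by ignoring continuation bytes (10xxxxxx).
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++count;
    return count;
}

int main()
{
    std::string s = "!Aa\xC3\xA2\xC3\xA4\xC3\xA0"; // "!Aaâäà" encoded as UTF-8
    std::cout << s.size() << " bytes, "             // 9 bytes
              << utf8_length(s) << " characters\n"; // 6 characters
}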
Your terminal probably uses UTF-8 encoding. It uses one byte for ASCII characters, and 2-4 bytes for everything else.
The basic source character set for C++ source code does not include extended ASCII characters (ref. §2.3 in ISO/IEC 14882:2011) :
The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
So, an implementation has to map those characters from the source file to characters in the basic source character set, before passing them on to the compiler. They will likely be mapped to universal character names, following ISO/IEC 10646 (UCS) :
The universal-character-name construct provides a way to name other characters.
The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN.
A universal character name in a narrow string literal (as in your case) may be mapped to multiple chars, using multibyte encoding (ref. §2.14.5 in ISO/IEC 14882:2011) :
In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding.
That's what you're seeing for those 3 last characters.

convert char[] of hexadecimal numbers to char[] of letters corresponding to the hexadecimal numbers in ascii table and reversing it

I have a char a[] of hexadecimal characters like this:
"315c4eeaa8b5f8aaf9174145bf43e1784b8fa00dc71d885a804e5ee9fa40b16349c146fb778cdf2d3aff021dfff5b403b510d0d0455468aeb98622b137dae857553ccd8883a7bc37520e06e515d22c954eba5025b8cc57ee59418ce7dc6bc41556bdb36bbca3e8774301fbcaa3b83b220809560987815f65286764703de0f3d524400a19b159610b11ef3e"
I want to convert it to letters corresponding to each hexadecimal number like this:
68656c6c6f = hello
and store it in char b[] and then do the reverse
I don't want a block of code please; I want an explanation of what libraries were used and how to use them.
Thanks
Assuming you are talking about ASCII codes: the first step is to find the size of b. Assuming every character is represented by 2 hexadecimal digits (for example, a tab would be 09), the size of b is simply strlen(a) / 2 + 1.
That done, you need to go through the letters of a, 2 by 2, convert them to their integer value and store it as a string. Written as a formula you have:
b[i] = (to_digit(a[2*i]) << 4) + to_digit(a[2*i+1])
where to_digit(x) converts '0'-'9' to 0-9 and 'a'-'f' or 'A'-'F' to 10-15.
Note that if characters below 0x10 are shown with only one hex digit (the only one I can think of is tab), then instead of using 2*i as the index into a, you should keep a next_index in your loop which is advanced by 2 if a[next_index] < '8', or by 1 otherwise. In the latter case, b[i] = to_digit(a[next_index]).
The reverse of this operation is very similar. Each character b[i] is written as:
a[2*i] = to_char(b[i] >> 4)
a[2*i+1] = to_char(b[i] & 0xf)
where to_char is the opposite of to_digit.
Converting the hexadecimal string to a character string can be done by using std::string::substr to get the next two characters of the hex string, then using std::stoi to convert the substring to an integer. This can be cast to a character that is added to a std::string. The std::stoi function is C++11 only; if you don't have it you can use e.g. std::strtol.
To do the opposite you loop over each character in the input string, cast it to an integer and put it in an std::ostringstream preceded by manipulators to have it presented as a two-digit, zero-prefixed hexadecimal number. Append to the output string.
Use std::string::c_str to get an old-style C char pointer if needed.
No external library, only using the C++ standard library.
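A brief sketch of that round trip (my own illustration; it assumes a well-formed, even-length hex string):
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::string hex = "68656c6c6f";

    // Hex string -> text: take two characters at a time, convert with std::stoi (base 16).
    std::string text;
    for (std::size_t i = 0; i < hex.size(); i += 2)
        text += static_cast<char>(std::stoi(hex.substr(i, 2), nullptr, 16));
    std::cout << text << '\n';   // hello

    // Text -> hex string: print each byte as a two-digit, zero-padded hex number.
    std::ostringstream oss;
    for (unsigned char c : text)
        oss << std::hex << std::setw(2) << std::setfill('0') << static_cast<int>(c);
    std::cout << oss.str() << '\n'; // 68656c6c6f
}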
Forward:
Read two hex chars from input.
Convert to int (0..255). (hint: sscanf is one way)
Append int to output char array
Repeat 1-3 until out of chars.
Null terminate the array
Reverse:
Read single char from array
Convert to 2 hexadecimal chars (hint: sprintf is one way).
Concat buffer from (2) to final output string buffer.
Repeat 1-3 until out of chars.
Almost forgot to mention: only stdio.h and the regular C runtime are required, assuming you're using sscanf and sprintf. You could alternatively create a pair of conversion tables that would radically speed up the conversions.
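A minimal sketch of those steps (my own illustration, not the answerer's code; it assumes an even-length input and buffers large enough for the example):
#include <cstdio>
#include <cstring>

int main()
{
    const char a[] = "68656c6c6f";
    char b[64] = {0};

    // Forward: read two hex chars at a time, convert to int, append as a char.
    std::size_t n = std::strlen(a) / 2;
    for (std::size_t i = 0; i < n; ++i)
    {
        unsigned int value = 0;
        std::sscanf(a + 2 * i, "%2x", &value);  // "%2x" reads at most two hex digits
        b[i] = static_cast<char>(value);
    }
    b[n] = '\0';                                // null terminate the array
    std::printf("%s\n", b);                     // hello

    // Reverse: write each char back out as a two-digit hex number.
    char out[128] = {0};
    for (std::size_t i = 0; i < n; ++i)
    {
        unsigned int byte = static_cast<unsigned char>(b[i]);
        std::sprintf(out + 2 * i, "%02x", byte);
    }
    std::printf("%s\n", out);                   // 68656c6c6f
}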
Here's a simple piece of code to do the trick:
#include <string>

unsigned int hex_digit_value(char c)
{
    if ('0' <= c && c <= '9') { return c - '0'; }
    if ('a' <= c && c <= 'f') { return c + 10 - 'a'; }
    if ('A' <= c && c <= 'F') { return c + 10 - 'A'; }
    return -1; // wraps to UINT_MAX, used as an error marker
}

std::string dehexify(std::string const & s)
{
    std::string result(s.size() / 2, '\0');  // one output byte per two hex digits
    for (std::size_t i = 0; i != s.size() / 2; ++i)
    {
        result[i] = hex_digit_value(s[2 * i]) * 16
                  + hex_digit_value(s[2 * i + 1]);
    }
    return result;
}
Usage:
char const a[] = "12AB";
std::string s = dehexify(a);
Notes:
A proper implementation would add checks that the input string length is even and that each digit is in fact a valid hex numeral.
Dehexifying has nothing to do with ASCII. It just turns any hexified sequence of nibbles into a sequence of bytes. I just use std::string as a convenient "container of bytes", which is exactly what it is.
There are dozens of answers on SO showing you how to go the other way; just search for "hexify".
Each hexadecimal digit corresponds to 4 bits, because 4 bits has 16 possible bit patterns (and there are 16 possible hex digits, each standing for a unique 4-bit pattern).
So, two hexadecimal digits correspond to 8 bits.
And on most computers nowadays (some Texas Instruments digital signal processors are an exception) a C++ char is 8 bits.
This means that each C++ char is represented by 2 hex digits.
So, simply read two hex digits at a time, convert to int using e.g. an istringstream, convert that to char, and append each char value to a std::string.
The other direction is just opposite, but with a twist.
Because char is signed on most systems, you need to convert to unsigned char before converting that value again to hex digits.
Conversion to and from hexadecimal can be done using hex, like e.g.
cout << hex << x;
cin >> hex >> x;
for a suitable definition of x, e.g. int x
This should work for string streams as well.