VS Intellisense shows escaped characters for some (not all) byte constants - c++

In Visual Studio C++ I have defined a series of channelID constants with decimal values from 0 to 15. I have made them of type uint8_t for reasons having to do with the way they are used in the embedded context in which this code runs.
When hovering over one of these constants, I would like intellisense to show me the numeric value of the constant. Instead, it shows me the character representation. For some non-printing values it shows an escaped character representing some ASCII value, and for others it shows an escaped octal value for that character.
const uint8_t channelID_Observations = 1; // '\001'
const uint8_t channelID_Events = 2; // '\002'
const uint8_t channelID_Wav = 3; // '\003'
const uint8_t channelID_FFT = 4; // '\004'
const uint8_t channelID_Details = 5; // '\005'
const uint8_t channelID_DebugData = 6; // '\006'
const uint8_t channelID_Plethysmography = 7; // '\a'
const uint8_t channelID_Oximetry = 8; // '\b'
const uint8_t channelID_Position = 9; // ' ' ** this is displayed as a space between single quotes
const uint8_t channelID_Excursion = 10; // '\n'
const uint8_t channelID_Motion = 11; // '\v'
const uint8_t channelID_Env = 12; // '\f'
const uint8_t channelID_Cmd = 13; // '\r'
const uint8_t channelID_AudioSnore = 14; // '\016'
const uint8_t channelID_AccelSnore = 15; // '\017'
Some of the escaped codes are easily recognized and the hex or decimal equivalents easily remembered (\n == newline == 0x0A) but others are more obscure. For example decimal 7 is shown as '\a', which in some systems represents the ASCII BEL character.
Some of the representations are mystifying to me -- for example decimal 9 would be an ASCII tab, which today often appears as '\t', but intellisense shows it as a space character.
Why is an 8-bit unsigned integer always treated as a character, no matter how I try to define it as a numeric value?
Why are only some, but not all of these characters shown as escaped symbols for their ASCII equivalents, while others get their octal representation?
What is the origin of the obscure symbols used? For example, '\a' for decimal 7 matches the ISO-defined C0 control set, which has a Unicode representation, but then '\t' should be shown for decimal 9 (see the Wikipedia article on C0 control codes).
Is there any way to make intellisense hover tips show me the numeric value of such constants rather than a character representation? Decoration? VS settings? Typedefs? #defines?

You are misreading the IntelliSense output: 0x7 is '\a', not the literal character 'a'. '\a' is the bell/alert character.
See the following article on escape sequences - https://en.wikipedia.org/wiki/Escape_sequences_in_C

'\a' does indeed have the value 0x7. If you assign 0x07 to a uint8_t, you can be pretty sure that the compiler will not change that assignment to something else. IntelliSense just represents the value in another way; it doesn't change your values.
Also, 'a' has the value 0x61; that's probably what tripped you up.
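If you want to convince yourself of those values, a quick compile-time check works; a minimal sketch, assuming an ASCII-based execution character set:
static_assert('\a' == 0x07, "'\\a' is the alert (BEL) character");
static_assert('\n' == 0x0A, "'\\n' is the line feed character");
static_assert('a' == 0x61, "the letter 'a' has the ASCII value 0x61");
int main() { return 0; } // nothing to run; the checks happen at compile time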

After more than a year, I've decided to document what I found when I pursued this further. The correct answer was implied in djgandy's answer, which cited Wikipedia, but I want to make it explicit.
With the exception of one value (0x09), Intellisense does appear to treat these values consistently, and that treatment is rooted in an authoritative source: my constants are unsigned 8-bit constants, thus they are "character-constants" per the C11 language standard (section 6.4.4).
For character constants that do not map to a displayable character, Section 6.4.4.4 defines their syntax as
6.4.4.4 Character constants
Syntax
. . .
simple-escape-sequence: one of
\'
\"
\?
\\
\a
\b
\f
\n
\r
\t
\v
octal-escape-sequence:
\ octal-digit
\ octal-digit octal-digit
\ octal-digit octal-digit octal-digit
"Escape Sequences" are further defined in the C language definition section 5.2.2:
§5.2.2 Character display semantics
2) Alphabetic escape sequences representing nongraphic characters in
the execution character set are intended to produce actions on display
devices as follows:
\a (alert) Produces an audible or visible alert
without changing the active position.
\b (backspace) Moves the active position to the previous position on
the current line. If the active position is at the initial position of
a line, the behavior of the display device is unspecified.
\f (form feed) Moves the active position to the initial position at
the start of the next logical page.
\n (new line) Moves the active position to the initial position of the
next line.
\r (carriage return) Moves the active position to the initial position
of the current line.
\t (horizontal tab) Moves the active position
to the next horizontal tabulation position on the current line. If the
active position is at or past the last defined horizontal tabulation
position, the behavior of the display device is unspecified.
\v (vertical tab) Moves the active position to the initial position of
the next vertical tabulation position. If the active position is at or
past the last defined vertical tabulation position, the behavior of
the display device is unspecified.
3) Each of these escape sequences shall produce a unique
implementation-defined value which can be stored in a single char
object. The external representations in a text file need not be
identical to the internal representations, and are outside the scope
of this International Standard.
Thus the only place where Intellisense falls down is in the handling of 0x09, which should be displayed as
'\t'
but actually is displayed as
' '
So what's that all about? I suspect Intellisense considers a tab to be a printable character, but suppresses the tab action in its formatting. This seems to me inconsistent with the C and C++ standards and is also inconsistent with its treatment of other escape characters, but perhaps there's some justification for it that "escapes" me :)
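A related observation: the character treatment is not limited to IntelliSense. Because uint8_t is, on virtually every implementation, a typedef for unsigned char, iostreams also print it as a character unless you promote it first; a minimal sketch:
#include <cstdint>
#include <iostream>

const uint8_t channelID_Cmd = 13; // '\r'

int main()
{
    std::cout << channelID_Cmd << '\n';                   // streamed as a char: emits a carriage return
    std::cout << +channelID_Cmd << '\n';                  // unary + promotes to int: prints 13
    std::cout << static_cast<int>(channelID_Cmd) << '\n'; // explicit cast: also prints 13
}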

Related

Why does the size of this std::string change, when characters are changed?

I have an issue in which the size of the string is affected by the presence of a '\0' character. I searched all over SO and still could not find the answer.
Here is the snippet.
#include <iostream>
#include <string>

int main()
{
    std::string a = "123123\0shai\0";
    std::cout << a.length();
}
http://ideone.com/W6Bhfl
The output in this case is
6
Whereas the same program with a different string, having numerals instead of characters,
#include <iostream>
#include <string>

int main()
{
    std::string a = "123123\0123\0";
    std::cout << a.length();
}
http://ideone.com/mtfS50
gives an output of
8
What exactly is happening under the hood? How does presence of a '\0' character change the behavior?
The sequence \012 when used in a string (or character) literal is an octal escape sequence. It's the octal number 12 which corresponds to the ASCII linefeed ('\n') character.
That means your second string is actually equal to "123123\n3\0" (plus the actual string literal terminator).
It would have been very clear if you tried to print the contents of the string.
Octal sequences are one to three digits long, and the compiler will use as many digits as possible.
If you check the coloring at ideone you will see that \012 has a different color. That is because this is a single character written in octal.
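If you want to see every byte, including the embedded NULs, dump the raw literal rather than the std::string (which stops at the first NUL when constructed from a const char*); a small sketch:
#include <cstddef>
#include <iostream>

int main()
{
    const char raw[] = "123123\0123\0"; // '\012' is a single octal escape: the line feed character
    // sizeof raw includes the implicit terminating NUL; print each byte's numeric value.
    for (std::size_t i = 0; i < sizeof raw; ++i)
        std::cout << static_cast<int>(raw[i]) << ' ';
    std::cout << '\n'; // prints: 49 50 51 49 50 51 10 51 0 0
}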

std::string optimal way to truncate utf-8 at safe place

I have a valid UTF-8 encoded string in a std::string, and a limit in bytes. I would like to truncate the string and append "..." at MAX_SIZE - 3 - x, where x is whatever offset is needed to avoid cutting a UTF-8 character in the middle.
Is there a function that could determine x based on MAX_SIZE without having to scan from the beginning of the string?
If you have a location in a string, and you want to go backwards to find the start of a UTF-8 character (and therefore a valid place to cut), this is fairly easily done.
You start from the last byte in the sequence. If the top two bits of the last byte are 10, then it is part of a UTF-8 sequence, so keep backing up until the top two bits are not 10 (or until you reach the start).
The way UTF-8 works is that a byte can be one of three things, based on the upper bits of the byte. If the topmost bit is 0, then the byte is an ASCII character, and the next 7 bits are the Unicode codepoint value itself. If the top two bits are 10, then the 6 bits that follow are extra bits for a multi-byte sequence. But the beginning of a multibyte sequence is coded with 11 in the top 2 bits.
So if the top bits of a byte are not 10, then it's either an ASCII character or the start of a multibyte sequence. Either way, it's a valid place to cut.
Note however that, while this algorithm will break the string at codepoint boundaries, it ignores Unicode grapheme clusters. This means that combining characters can be cut away from the base characters they combine with; accents can be lost from characters, for example. Doing proper grapheme cluster analysis would require access to the Unicode table that says whether a codepoint is a combining character.
But it will at least be a valid Unicode UTF-8 string. So that's better than most people do ;)
The code would look something like this (in C++14):
#include <cassert>
#include <cstddef>
#include <string>

size_t FindCutPosition(const std::string &str, size_t max_size)
{
    assert(str.size() >= max_size && "Make sure stupidity hasn't happened.");
    assert(str.size() > 3 && "Make sure stupidity hasn't happened.");
    max_size -= 3;
    for (size_t pos = max_size; pos > 0; --pos)
    {
        unsigned char byte = static_cast<unsigned char>(str[pos]); // Perfectly valid
        if ((byte & 0xC0) != 0x80) // not a continuation byte, so a safe place to cut
            return pos;
    }
    unsigned char byte = static_cast<unsigned char>(str[0]); // Perfectly valid
    if ((byte & 0xC0) != 0x80)
        return 0;
    // If your first byte isn't even a valid UTF-8 starting point, then something terrible has happened.
    throw bad_utf8_encoded_text(...);
}
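A possible way to use it might look like this (TruncateUtf8 is just an illustrative name; it assumes the FindCutPosition sketch above):
// Truncate at a codepoint boundary and append "...".
std::string TruncateUtf8(std::string str, size_t max_size)
{
    if (str.size() <= max_size)
        return str;                              // already short enough
    str.resize(FindCutPosition(str, max_size));  // cut at a valid UTF-8 boundary
    str += "...";                                // fits, since the cut is at or before max_size - 3
    return str;
}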

Why extended ASCII (special) characters take 2 bytes to get stored?

ASCII characters ranging from 32 to 126 are printable; 127 is DEL, and everything after that is considered an extended character.
To check, how are they stored in the std::string, I wrote a test program:
#include <iostream>
#include <string>
using namespace std;

int main ()
{
    string s; // ASCII
    s += "!"; // 33
    s += "A"; // 65
    s += "a"; // 97
    s += "â"; // 131
    s += "ä"; // 132
    s += "à"; // 133
    cout << s << endl; // Print directly
    for(auto i : s)    // Print after iteration
        cout << i;
    cout << "\ns.size() = " << s.size() << endl; // outputs 9!
}
The special characters visible in the code above actually look different in the online example (and when viewed in vi).
In the string s, the first 3 normal characters take 1 byte each, as expected. The next 3 extended characters surprisingly take 2 bytes each.
Questions:
Despite being ASCII (within the range 0 to 256), why do those 3 extended characters take 2 bytes of space?
When we iterate through s using a range-based loop, how is it figured out that for normal characters it has to increment 1 time and for extended characters 2 times?
[Note: This may also apply to C and other languages.]
Despite being ASCII (within the range 0 to 256), why do those 3 extended characters take 2 bytes of space?
If you define 'being ASCII' as containing only bytes in the range [0, 256), then all data is ASCII: [0, 256) is the same as the range a byte is capable of representing, thus all data represented with bytes is ASCII, under your definition.
The issue is that your definition is incorrect, and you're looking incorrectly at how data types are determined; the kind of data represented by a sequence of bytes is not determined by those bytes. Instead, the data type is metadata that is external to the sequence of bytes. (This isn't to say that it's impossible to examine a sequence of bytes and determine statistically what kind of data it is likely to be.)
Let's examine your code, keeping the above in mind. I've taken the relevant snippets from the two versions of your source code:
s += "â"; // 131
s += "ä"; // 132
s += "â"; // 131
s += "ä"; // 132
You're viewing these source code snippets as text rendered in a browser, rather than as raw binary data. You've presented these two things as the 'same' data, but in fact they're not the same: they were originally two different sequences of bytes, even though they may render identically as text here.
However there is something interesting about these two sequences of text elements: one of them, when encoded into bytes using a certain encoding scheme, is represented by the same sequence of bytes as the other sequence of text elements when that sequence is encoded into bytes using a different encoding scheme. That is, the same sequence of bytes on disk may represent two different sequences of text elements depending on the encoding scheme! In other words, in order to figure out what the sequence of bytes means, we have to know what kind of data it is, and therefore what decoding scheme to use.
So here's what probably happened. In vi you wrote:
s += "â"; // 131
s += "ä"; // 132
You were under the impression that vi would represent those characters using extended ASCII, and thus use the bytes 131 and 132. But that was incorrect. vi did not use extended ASCII, and instead it represented those characters using a different scheme (UTF-8) which happens to use two bytes to represent each of those characters.
Later, when you opened the source code in a different editor, that editor incorrectly assumed the file was extended ASCII and displayed it as such. Since extended ASCII uses a single byte for every character, it took the two bytes vi used to represent each of those characters, and showed one character for each byte.
The bottom line is that you're incorrect that the source code is using extended ASCII, and so your assumption that those characters would be represented by single bytes with the values 131 and 132 was incorrect.
When we iterate through s using a range-based loop, how is it figured out that for normal characters it has to increment 1 time and for extended characters 2 times?
Your program isn't doing this. The characters print okay in your ideone.com example because independently printing out the two bytes that represent such a character still displays that character. Here's an example that makes this clear:
std::cout << "Printed together: '";
std::cout << (char)0xC3;
std::cout << (char)0xA2;
std::cout << "'\n";
std::cout << "Printed separated: '";
std::cout << (char)0xC3;
std::cout << '/';
std::cout << (char)0xA2;
std::cout << "'\n";
Printed together: 'â'
Printed separated: '�/�'
The '�' character is what shows up when an invalid encoding is encountered.
If you're asking how you can write a program that does do this, the answer is to use code that understands the details of the encoding being used. Either get a library that understands UTF-8 or read the UTF-8 spec yourself.
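For example, if all you need is to walk the string one codepoint at a time, a rough sketch (assuming well-formed UTF-8 input, with no validation) can count codepoints simply by skipping continuation bytes:
#include <iostream>
#include <string>

int main()
{
    const std::string s = "a\xC3\xA2"; // 'a' followed by U+00E2 ("â") encoded as the two bytes 0xC3 0xA2
    std::size_t codepoints = 0;
    for (unsigned char byte : s)
        if ((byte & 0xC0) != 0x80) // count every byte that is not a continuation byte (10xxxxxx)
            ++codepoints;
    std::cout << "bytes: " << s.size() << ", codepoints: " << codepoints << '\n';
    // prints: bytes: 3, codepoints: 2
}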
You should also keep in mind that the use of UTF-8 here is simply because this editor and compiler use UTF-8 by default. If you were to write the same code with a different editor and compile it with a different compiler, the encoding could be completely different; assuming code is UTF-8 can be just as wrong as your earlier assumption that the code was extended ASCII.
Your terminal probably uses UTF-8 encoding. It uses one byte for ASCII characters, and 2-4 bytes for everything else.
The basic source character set for C++ source code does not include extended ASCII characters (ref. §2.3 in ISO/IEC 14882:2011) :
The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
So, an implementation has to map those characters from the source file to characters in the basic source character set, before passing them on to the compiler. They will likely be mapped to universal character names, following ISO/IEC 10646 (UCS) :
The universal-character-name construct provides a way to name other characters.
The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN.
A universal character name in a narrow string literal (as in your case) may be mapped to multiple chars, using multibyte encoding (ref. §2.14.5 in ISO/IEC 14882:2011) :
In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding.
That's what you're seeing for those 3 last characters.
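You can observe that multibyte mapping directly; a small sketch, assuming the compiler uses a UTF-8 narrow execution character set (the default for GCC and Clang):
#include <cstring>
#include <iostream>

int main()
{
    const char *s = "\u00E2"; // universal-character-name for 'â'
    std::cout << std::strlen(s) << '\n'; // prints 2: the character is encoded as the two bytes 0xC3 0xA2
}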

What assumption is safe for a C++ implementation's character set?

In The C++ Programming Language 6.2.3, it says:
It is safe to assume that the implementation character set includes
the decimal digits, the 26 alphabetic characters of English, and some
of the basic punctuation characters. It is not safe to assume
that:
There are no more than 127 characters in an 8-bit character set (e.g., some sets provide 255 characters).
There are no more alphabetic characters than English provides (most European
languages provide more, e.g., æ, þ, and ß).
The alphabetic characters are contiguous (EBCDIC leaves a gap between 'i' and 'j').
Every character used to write C++ is available (e.g.,
some national character sets do not provide {, }, [, ], |, and
\).
A char fits in 1 byte. There are embedded processors
without byte accessing hardware for which a char is 4 bytes. Also, one
could reasonably use a 16-bit Unicode encoding for the basic chars.
I'm not sure I understand the last two statements.
In section 2.3 of the standard, it says:
The basic source character set consists of 96 characters: the space
character, the control characters representing horizontal tab,
vertical tab, form feed, and new-line, plus the following 91 graphical
characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
...
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic
source character set, plus control characters representing alert,
backspace, and carriage return, plus a null character (respectively,
null wide character), whose representation has all zero bits.
We can see that the standard states that characters like { } [ ] | \ are part of the basic execution character set. Then why does TC++PL say it's not safe to assume that those characters are available in the implementation's character set?
And for the size of a char, in section 5.3.3 of the standard:
The sizeof operator yields the number of bytes in the object
representation of its operand. ... ... sizeof(char), sizeof(signed
char) and sizeof(unsigned char) are 1.
We can see that the standard states that a char is 1 byte. What is the point TC++PL is trying to make here?
The word "byte" seems to be used sloppily in the first quote. As far as C++ is concerned, a byte is always a char, but the number of bits it holds is platform-dependent (and available in CHAR_BITS). Sometimes you want to say "a byte is eight bits", in which case you get a different meaning, and that may have been the intended meaning in the phrase "a char has four bytes".
The execution character set may very well be larger than or incompatible with the input character set provided by the environment. Trigraphs and alternate tokens exist to allow the representation of execution-set characters with fewer input characters on such restricted platforms (e.g. not is identical for all purposes to !, and the latter is not available in all character sets or keyboard layouts).
It used to be the case that some national variants of ASCII, such as those for the Scandinavian languages, used accented alphabetic characters for the code points where US ASCII has punctuation such as [, ], {, }. This is the reason that C89 included trigraphs: they allow code to be written in the 'invariant subset' of ISO 646. See the chart of the characters used in the national variants on the Wikipedia page.
For example, someone in Scandinavia might have to read:
#include <stdio.h>
int main(int argc, char **argv)
Å
for (int i = 1; i < argc; i++)
printf("%s\n", argvÆiØ);
return 0;
ø
instead of:
#include <stdio.h>
int main(int argc, char **argv)
{
for (int i = 1; i < argc; i++)
printf("%s\n", argv[i]);
return 0;
}
Using trigraphs, you might write:
??=include <stdio.h>
int main(int argc, char **argv)
??<
for (int i = 1; i < argc; i++)
printf("%s??/n", argv??(i??));
return 0;
??>
which is equally ghastly in any language.
I'm not sure how much of an issue this still is, but that's why the comments are there.

C++ MFC RegEx issue

I am using a regex to restrict the characters entered into a text box.
I am using the below for the allowed characters:
CAtlRegExp<> regex;
CString csText2 = "Some Test £";
CString m_szRegex = "([a-zA-Z0-9\\.\\,\";\\:'##$£?\\+\\*\\-\\/\\%! ()])";
REParseError status = regex.Parse(m_szRegex, true);
CAtlREMatchContext<> mc;
if (!regex.Match(csText2, &mc))
{
    AfxMessageBox("Invalid Char");
}
This works fine, except for the £ symbol; it doesn't seem to pick this up.
Can anyone advise on what I am missing?
Thanks
This seems to be a bug that affects all Extended ASCII characters (those higher than 0x7F).
The character value is converted to an integer and used as an index into some kind of attributes array. Since char is signed, it undergoes sign extension, so any character above 0x7F produces a negative index.
size_t u = static_cast<size_t>(static_cast<_TUCHAR>(* ((RECHAR *) sz)));
if (pBits[u >> 3] & 1 << (u & 0x7))
You can find more discussions on this topic here: CAtlRegExp crashes with pound sign!
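The underlying sign-extension problem is easy to reproduce outside of ATL; a minimal illustration, assuming plain char is signed (as it is with MSVC on x86):
#include <iostream>

int main()
{
    char pound = '\xA3';  // £ in Latin-1 / Windows-1252
    int index = pound;    // sign-extended where char is signed: -93, a bogus (negative) array index
    std::cout << index << '\n';
    // Going through unsigned char first yields the intended 0..255 value:
    std::cout << static_cast<int>(static_cast<unsigned char>(pound)) << '\n'; // prints 163
}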