0xc2 character in std::string - c++

The following string has size 4, not 3 as I would have expected.
std::string s = "\r\n½";
int ss = s.size(); //ss is 4
When I loop through the string character by character, printing each one in hex, I get
0x0D (hex code for carriage return)
0x0A (hex code for line feed)
0xc2 (hex code, but what is this?)
0xbd (hex code for the ½ character)
Where does the 0xc2 come from?
Is it some sort of encoding information? I thought std::string stored one char per visible character in the string. Can someone confirm whether 0xc2 is a "character set modifier"?

"½" has, in unicode, the code point U+00BD and is represented by UTF-8 by the two bytes sequence 0xc2bd. This means, your string contains only three characters, but is four bytes long.
See https://www.fileformat.info/info/unicode/char/00bd/index.htm
Additional reading on SO: std::wstring VS std::string.
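For reference, a minimal sketch (assuming the source and execution character sets are UTF-8) that prints each byte of the string in hex, reproducing the sequence above:
#include <cstdio>
#include <string>

int main()
{
    std::string s = "\r\n½";         // 4 bytes when the literal is UTF-8 encoded
    for (unsigned char c : s)        // unsigned char so high bytes print correctly
        std::printf("0x%02x\n", static_cast<unsigned>(c));  // 0x0d 0x0a 0xc2 0xbd
}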

Related

What does this GDB output mean?

I have a buffer of chars; when I run info locals I get this:
buf = "\310X\346\354\376\177\000\000E2\360\025\241\177\000\000pG\356\025\241\177\000\000\000\000\000\000\211\320\005\000\340G\356\025\241\177\000\000\000 \000\000\000\000\000\000 \247\244\025\241\177\000\000\000\243\341\021\000\000\000\000\030L\356\025\241\177\000\000W\220\244\025\241\177\000\000\032\000\000\000\000\000\000\000hJ\356\025\241\177\000\000hJ\356\025\241\177\000\000\241\005$\026"
I am confused about how to interpret this output. I expected pairs of hex digits (2 hex digits = 1 byte). How do I read this? It's not even decimal notation, since some values are greater than 255.
I am confused about how to interpret this output.
By default, GDB prints unprintable characters in octal notation, so \310 is 0xc8. The first characters in your buffer are 0xc8 0x58 0xe6 0xec ....
It might be easier to interpret the buffer if it is printed as hex. Use x/20xb buf to examine the first 20 characters, or p/x buf to examine the entire buffer until terminating NUL.
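If it is easier to do from inside the program instead, a minimal C++ sketch (dump_hex is a hypothetical helper, not part of GDB) that prints a buffer as two hex digits per byte:
#include <cstdio>
#include <cstddef>

// Hypothetical helper: print the first n bytes of a buffer in hex.
void dump_hex(const unsigned char* buf, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        std::printf("%02x ", buf[i]);   // e.g. c8 58 e6 ec ...
    std::printf("\n");
}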

Finding and comparing a Unicode character in C++

I am writing a Lexical analyzer that parses a given string in C++. I have a string
line = R"(if n = 4 # comment
return 34;
if n≤3 retur N1
FI)";
All I need to do is output all words, numbers and tokens in a vector.
My program works with regular tokens, words and numbers, but I cannot figure out how to parse Unicode characters. The only Unicode characters my program needs to save in a vector are ≤ and ≠.
So far, my code basically takes the string line by line, reads the first word, number or token, chops it off, and recursively continues to eat tokens until the string is empty. I am unable to compare line[0] with ≠ (of course), and I am also not clear on how much of the string I need to chop off to get rid of the Unicode char. In the case of "!=" I simply remove line[0] and line[1].
If your input file is UTF-8, just treat your Unicode characters ≤, ≠, etc. as strings. You can then use the same logic to recognize "≤" as you would for "<=". The length of such a character is given by strlen("≤").
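For example, a minimal sketch (assuming both the source file and the input line are UTF-8; startsWithLessEqual is a hypothetical helper):
#include <cstring>
#include <string>

// Hypothetical helper: does `line` start with the three-byte UTF-8 token "≤"?
bool startsWithLessEqual(const std::string& line)
{
    return line.compare(0, std::strlen("≤"), "≤") == 0;  // strlen("≤") == 3 in UTF-8
}
// On a match, chop off all of it:  line.erase(0, std::strlen("≤"));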
All Unicode encodings are variable-length except UTF-32. Therefore the next character isn't necessarily a single char and must be read as a string. Since you're using a char* or std::string, the encoding is likely UTF-8, and the next character can be returned as a std::string.
The UTF-8 encoding is very simple and you can read about it everywhere. In short, the first byte of a sequence indicates how long that sequence is, so you can get the next character like this:
std::string getNextChar(const std::string& str, size_t index)
{
    unsigned char c = static_cast<unsigned char>(str[index]); // avoid sign extension
    if ((c & 0x80) == 0)          // 1-byte sequence
        return std::string(1, str[index]);
    else if ((c & 0xE0) == 0xC0)  // 2-byte sequence
        return std::string(&str[index], 2);
    else if ((c & 0xF0) == 0xE0)  // 3-byte sequence
        return std::string(&str[index], 3);
    else if ((c & 0xF8) == 0xF0)  // 4-byte sequence
        return std::string(&str[index], 4);
    throw "Invalid codepoint!";
}
It's a very simple decoder and doesn't handle invalid code points or broken data streams. If you need better handling, you'll have to use a proper UTF-8 library.
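As an illustration, a hedged usage sketch of the decoder above inside a tokenizer loop (splitIntoUtf8Chars is a hypothetical helper built on getNextChar):
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split a UTF-8 line into whole characters using getNextChar() above.
std::vector<std::string> splitIntoUtf8Chars(const std::string& line)
{
    std::vector<std::string> chars;
    std::size_t i = 0;
    while (i < line.size())
    {
        std::string ch = getNextChar(line, i);  // one complete UTF-8 sequence
        i += ch.size();                         // advance past the whole sequence
        chars.push_back(ch);
    }
    return chars;
}
// Each element can then be compared directly, e.g.  if (chars[k] == "≤") ...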

UTF Encoding for "Ü" returns 3 bytes instead of the "real" unicode

I was playing around with the code mentioned in:
https://stackoverflow.com/a/21575607/2416394, as I have issues writing proper UTF-8 XML with TinyXML.
Well, I need to encode the "LATIN CAPITAL LETTER U WITH DIAERESIS", which is Ü, so that it is written properly to XML etc.
Here is the code taken from the post above:
std::string codepage_str = "Ü";
int size = MultiByteToWideChar( CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                                codepage_str.length(), nullptr, 0 );
std::wstring utf16_str( size, '\0' );
MultiByteToWideChar( CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                     codepage_str.length(), &utf16_str[ 0 ], size );
int utf8_size = WideCharToMultiByte( CP_UTF8, 0, utf16_str.c_str(),
                                     utf16_str.length(), nullptr, 0,
                                     nullptr, nullptr );
std::string utf8_str( utf8_size, '\0' );
WideCharToMultiByte( CP_UTF8, 0, utf16_str.c_str(),
                     utf16_str.length(), &utf8_str[ 0 ], utf8_size,
                     nullptr, nullptr );
The result is a std::string of size 3 with the following bytes:
- utf8_str "Ü" std::basic_string<char,std::char_traits<char>,std::allocator<char> >
[size] 0x0000000000000003 unsigned __int64
[capacity] 0x000000000000000f unsigned __int64
[0] 0x55 'U' char
[1] 0xcc 'Ì' char
[2] 0x88 'ˆ' char
When I write it into a UTF-8 file, the hex values remain there: 0x55 0xCC 0x88, and Notepad++ shows me the proper char Ü.
However, when I add another Ü to the file via Notepad++ and save it again, the newly written Ü is stored as 0xC3 0x9C (which is what I actually expected in the first place).
I do not understand why I get a 3-byte representation of this character and not the expected encoding of the code point U+00DC.
Although Notepad++ displays it correctly, our proprietary system renders 0xC3 0x9C as Ü but breaks on 0x55 0xCC 0x88, not recognizing 0xCC 0x88 as a two-byte UTF-8 sequence that combines with the preceding U.
Unicode is complicated. There are at least two different ways to get Ü:
LATIN CAPITAL LETTER U WITH DIAERESIS is Unicode codepoint U+00DC.
LATIN CAPITAL LETTER U is Unicode codepoint U+0055, and COMBINING DIAERESIS is Unicode codepoint U+0308.
U+00DC and U+0055 U+0308 both display as Ü.
In UTF-8, Unicode codepoint U+00DC is encoded as 0xC3 0x9C, U+0055 is encoded as 0x55, and U+0308 is encoded as 0xCC 0x88.
Your proprietary system seems to have a bug.
Edit: to get what you expect, according to the MultiByteToWideChar() documentation, use MB_PRECOMPOSED instead of MB_COMPOSITE.
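That is, only the flag changes; a minimal sketch of the adjusted first call (apply the same flag to the second MultiByteToWideChar call as well):
// With MB_PRECOMPOSED the ANSI "Ü" maps to the single code point U+00DC,
// which WideCharToMultiByte then encodes as the two UTF-8 bytes 0xC3 0x9C.
int size = MultiByteToWideChar( CP_ACP, MB_PRECOMPOSED, codepage_str.c_str(),
                                codepage_str.length(), nullptr, 0 );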
While the encoding output is technically correct, you can work around the problem in the proprietary system by using the NFC form.
In NFC form, all characters are first decomposed (for example, if you had codepoint U+00DC for Ü, it would get decomposed to the sequence U+0055 U+0308) and then re-composed to their canonical representation (in your example, as U+00DC).
In the Win32 API, see the NormalizeString() function.
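A minimal sketch of that approach (toNFC is a hypothetical helper; requires Normaliz.lib on Windows Vista or later; error handling omitted). It assumes the decomposed text is already UTF-16, e.g. the utf16_str from the question:
#include <windows.h>
#include <string>

// Re-compose a UTF-16 string to NFC, so U+0055 U+0308 becomes U+00DC.
std::wstring toNFC(const std::wstring& s)
{
    int needed = NormalizeString(NormalizationC, s.c_str(),
                                 static_cast<int>(s.length()), nullptr, 0);
    std::wstring out(needed, L'\0');
    int written = NormalizeString(NormalizationC, s.c_str(),
                                  static_cast<int>(s.length()), &out[0], needed);
    out.resize(written > 0 ? written : 0);
    return out;
}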

Why do I get three different numbers when I try to output a UTF-8 character?

I'm trying to tokenize input consisting of UTF-8 characters. While trying to learn UTF-8, I get output that I cannot understand: when I input the character π (pi), I get three different numbers: 207 128 10. How can I use them to determine which category the character belongs to?
ostringstream oss;
oss << cin.rdbuf();
string input = oss.str();
for (int i = 0; i < input.size(); i++)
{
    unsigned char code_unit = input[i];
    cout << (int)code_unit << endl;
}
Thanks in advance.
Characters encoded with UTF-8 may take up more than a single byte (and often do). The number of bytes used to encode a single code point can vary from 1 byte to 6 bytes (or 4 under RFC 3629). In the case of π, the UTF-8 encoding, in binary, is:
11001111 10000000
That is, it is two bytes. You are reading these bytes out individually. The first byte has decimal value 207 and the second has decimal value 128 (if you interpret it as an unsigned integer). The byte after that, with decimal value 10, is the Line Feed character that you produce when you hit Enter.
If you're going to do any processing of these UTF-8 characters, you're going to need to interpret what the bytes mean. What exactly you'll need to do depends on how you're categorising the characters.
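For example, a minimal sketch that combines the two bytes 207 and 128 back into the code point U+03C0:
#include <iostream>

int main()
{
    unsigned char b1 = 207, b2 = 128;           // the two bytes read for π
    int codepoint = ((b1 & 0x1F) << 6)          // low 5 bits of the lead byte 110xxxxx
                  | (b2 & 0x3F);                // low 6 bits of the continuation 10xxxxxx
    std::cout << std::hex << codepoint << '\n'; // 3c0, i.e. U+03C0 GREEK SMALL LETTER PI
}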

How many bytes does a string take? A char?

I'm doing a review of my first semester C++ class, and I think I'm missing something: how many bytes does a string take up? A char?
The examples we were given are, some being character literals and some being strings:
'n', "n", '\n', "\n", "\\n", ""
I'm particularly confused by the usage of newlines in there.
#include <iostream>
int main()
{
    std::cout << sizeof 'n' << std::endl;    // 1
    std::cout << sizeof "n" << std::endl;    // 2
    std::cout << sizeof '\n' << std::endl;   // 1
    std::cout << sizeof "\n" << std::endl;   // 2
    std::cout << sizeof "\\n" << std::endl;  // 3
    std::cout << sizeof "" << std::endl;     // 1
}
Single quotes indicate characters.
Double quotes indicate C-style strings with an invisible NUL terminator.
\n (line break) is only a single char and so is \\ (backslash). \\n is just a backslash followed by n.
'n': is not a string, is a literal char, one byte, the character code for the letter n.
"n": string, two bytes, one for n and one for the null character every string has at the end.
"\n": two bytes as \n stand for "new line" which takes one byte, plus one byte for the null char.
'\n': same as the first, literal char, not a string, one byte.
"\\n": three bytes.. one for \, one for newline and one for the null character
"": one byte, just the null character.
A char, by definition, takes up one byte.
Literals using ' are char literals; literals using " are string literals.
A string literal is implicitly null-terminated, so it will take up one more byte than the observable number of characters in the literal.
\ is the escape character and \n is a newline character.
Put these together and you should be able to figure it out.
The following will take x consecutive chars in memory:
'n' - 1 char (type char)
"n" - 2 chars (above plus zero character) (type const char[2])
'\n' - 1 char
"\n" - 2 chars
"\\n" - 3 chars ('\', 'n', and zero)
"" - 1 char
The number of bytes a string takes up is equal to the number of characters in the string plus 1 (the terminator), times the number of bytes per character. The number of bytes per character can vary. It is 1 byte for a regular char type.
All your examples are one character long except for the second to last, which is two, and the last, which is zero. (Some of them are char literals rather than strings, so they define only a single character with no terminator.)
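To see the "bytes per character" part vary, a small sketch (the wide-string size depends on the platform's wchar_t, so the second value is only indicative):
#include <iostream>

int main()
{
    std::cout << sizeof("n")  << '\n';  // 2: two 1-byte chars, 'n' plus the terminator
    std::cout << sizeof(L"n") << '\n';  // 4 or 8: two wchar_t's of 2 or 4 bytes each
}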
'n' -> One char. A char is always 1 byte. This is not a string.
"n" -> A string literal, containing one n and one terminating NULL char. So 2 bytes.
'\n' -> One char. A char is always 1 byte. This is not a string.
"\n" -> A string literal, containing one \n and one terminating NULL char. So 2 bytes.
"\\n" -> A string literal, containing one \, one '\n', and one terminating NULL char. So 3 bytes.
"" -> A string literal, containing one terminating NULL char. So 1 byte.
You appear to be referring to string constants. And distinguishing them from character constants.
A char is one byte on all architectures. A character constant uses the single quote delimiter '.
A string is a contiguous sequence of characters with a trailing NUL character to mark the end of the string. A string literal uses the double quote delimiter ".
Also, your examples use the C escape-sequence syntax, which uses backslashes to indicate special characters. \n is one character in a string constant.
So for the examples 'n', "n", '\n', "\n":
'n' is one character
"n" is a string with one character, but it takes two characters of storage (one for the letter n and one for the NUL
'\n' is one character, the newline (ctrl-J on ASCII based systems)
"\n" is one character plus a NUL.
I leave the others to puzzle out based on those.
'n' - 0x6e
"n" - 0x6e 0x00
'\n' - 0x0a
"\n" - 0x0a 0x00
"\\n" - 0x5c 0x6e 0x00
"" - 0x00
It depends on the encoding. In UTF-8 the basic ASCII characters take 1 byte each, while in UTF-16 every character takes at least 2 bytes; either way, a full byte (or pair of bytes) is reserved for the character regardless of which bits are set, and it is overwritten whenever the character changes.
A string's byte count is the bytes of the characters between the "" plus the terminating NUL.
Examples in binary:
UTF-8 char 'T' = 01010100 (1 byte)
UTF-16 char 'T' = 01010100 00000000 (2 bytes)
UTF-8 string "coding" = 011000110110111101100100011010010110111001100111 (6 bytes, plus the terminator)
UTF-16 string "coding" = 011000110000000001101111000000000110010000000000011010010000000001101110000000000110011100000000 (12 bytes, plus the terminator)
UTF-8 '\n' = 00001010 (1 byte; the escape sequence \n is a single newline character, 0x0A)
UTF-16 '\n' = 00001010 00000000 (2 bytes)
Note: every character in a literal takes only a byte or two, so unless you are targeting hardware from the early 90s with 4 MB of memory or less, you shouldn't worry about the bytes used by strings or chars.
Memory and performance problems usually come from elsewhere, e.g. heavy floating-point computation or random-number generation inside tight loops or per-frame update methods; such work is better done once at startup or on a fixed-time update and averaged over the time span.
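To make the byte counts concrete, a small C++11 sketch (counts include the terminating NUL; the u8 literal has 1-byte elements, the u literal 2-byte char16_t elements):
#include <iostream>

int main()
{
    std::cout << sizeof(u8"coding") << '\n';  // 7: six 1-byte code units plus the NUL
    std::cout << sizeof(u"coding")  << '\n';  // 14: seven 2-byte char16_t units, NUL included
}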