While debugging on gcc, I found that the Unicode literal u"万不得已" was represented as u"\007\116\015\116\227\137\362\135". Which makes sense -- 万 is 0x4E07, and 0x4E in octal is 116.
Now on Apple LLVM 9.1.0 on an Intel-powered MacBook, I find that the same literal is not handled as the same string, i.e.:
u16string{u"万不得已"} == u16string{u"\007\116\015\116\227\137\362\135"}
goes from true to false. I'm still on a little-endian system, so I don't understand what's happening.
NB. I'm not trying to use the correspondence u"万不得已" == u"\007\116\015\116\227\137\362\135". I just want to understand what's happening.
I found that the Unicode literal u"万不得已" was represented as u"\007\116\015\116\227\137\362\135"
No, actually it is not. And here's why...
u"..." string literals are encoded as a char16_t-based UTF-16 encoded string on all platforms (that is what the u prefix is specifically meant for).
u"万不得已" is represented by this UTF-16 codeunit sequence:
4E07 4E0D 5F97 5DF2
On a little-endian system, that UTF-16 sequence is represented by this raw byte sequence:
07 4E 0D 4E 97 5F F2 5D
In octal, that would be represented by "\007\116\015\116\227\137\362\135" ONLY WHEN using a char-based string (note the lack of a string prefix, or u8 would also work for this example).
u"\007\116\015\116\227\137\362\135" is NOT a char-based string! It is a char16_t-based string, where each octal number represents a separate UTF-16 codeunit. Thus, this string actually represents this UTF-16 codeunit sequence:
0007 004E 000D 004E 0097 005F 00F2 005D
That is why your two u16string objects are not comparing as the same string value. Because they are really not equal.
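To make the difference concrete, here is a minimal sketch of the comparison (my own illustration, assuming a little-endian platform and a UTF-8 encoded source file):
#include <cassert>
#include <cstring>
#include <string>

int main() {
    // Each octal escape in a char16_t literal is one 16-bit code unit,
    // so this is U+0007 U+004E U+000D U+004E U+0097 U+005F U+00F2 U+005D.
    std::u16string escaped = u"\007\116\015\116\227\137\362\135";
    std::u16string hanzi   = u"万不得已";   // U+4E07 U+4E0D U+5F97 U+5DF2

    assert(escaped != hanzi);   // different code units, so never equal

    // The octal escapes only reproduce the raw bytes in a *narrow* string,
    // and only when the char16_t values are stored little-endian.
    const char raw[] = "\007\116\015\116\227\137\362\135";
    assert(std::memcmp(raw, hanzi.data(), 8) == 0);
}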
I am trying to figure out wide characters in c. For example, I test a string that contains a single letter "Ē" that is encoded as c492 in utf8.
char* T1 = "Ē";
//This is the resulting array { 0xc4, 0x92, 0x00 }
wchar_t* T2 = L"Ē";
//This is the resulting array { 0x00c4, 0x2019, 0x0000 }
I expected that the second array would be {0xc492, 0x0000}, instead it contains an extra character that just wastes space in my opinion. Can anyone help me understand what is going on with this?
What you've managed to do here is mojibake. Your source code is written in UTF-8 but it was interpreted in Windows codepage 1252 (i.e. the compiler source character set was CP1252).
The wide string contents are the Windows codepage 1252 characters of the UTF-8 bytes 0xC4 0x92 converted to UCS-2. The easiest way out is to just use an escape instead:
wchar_t* T2 = L"\x112";
or
wchar_t* T2 = L"\u0112";
The larger problem is that, to my knowledge, neither C nor C++ has a mechanism for specifying the source character set within the code itself, so it is always a setting or option external to the source file rather than something you can easily copy-paste.
Your compiler is misinterpreting your source code file (which is saved as UTF-8) as Windows-1252 (commonly called ANSI). It does not interpret the byte sequence C4 92 as the one-character UTF-8 string "Ē", but as the two-character Windows-1252 string "Ä’". The unicode codepoint of "Ä" is U+00C4, and the unicode codepoint of "’" is U+2019. This is exactly what you see in your wide character string.
The 8-bit string only works, because the misinterpretation of the string does not matter, as it is not converted during compilation. The compiler reads the string as Windows-1252 and emits the string as Windows-1252 (so it does not need to convert anything, and considers both to be "Ä’"). You interpret the source code and the data in the binary as UTF-8, so you consider both to be "Ē".
To have the compiler treat your source code as UTF-8, use the switch /utf-8.
BTW: The correct UTF-16 encoding (which is the encoding MSVC uses for wide character strings) to be observed in a wide-character string is not {0xc492, 0x0000}, but {0x0112, 0x0000}, because "Ē" is U+0112.
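A quick way to verify this, sketched here as an illustration (it assumes the wide execution character set is Unicode-based, which is UTF-16 on MSVC and UTF-32 on most other compilers; the code unit value 0x0112 is the same either way):
#include <cassert>

int main() {
    // The universal character name removes any dependence on the
    // source file's encoding.
    const wchar_t* t2 = L"\u0112";   // "Ē" is U+0112

    assert(t2[0] == 0x0112);   // one code unit for the character...
    assert(t2[1] == 0x0000);   // ...then the terminator
}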
I need to escape unicode characters within a input string to either UTF-16 or UTF-32 escape sequences. For example, the input string literal "Eat, drink, 愛" should be escaped as "Eat, drink, \u611b". Here are the rules in a table of sorts:
Escape                               | Unicode code point
'\u' HEX HEX HEX HEX                 | A Unicode code point in the range U+0 to U+FFFF inclusive corresponding to the encoded hexadecimal value.
'\U' HEX HEX HEX HEX HEX HEX HEX HEX | A Unicode code point in the range U+0 to U+10FFFF inclusive corresponding to the encoded hexadecimal value.
It's simple to detect Unicode characters in general, since the second byte is 0 if ASCII:
L"a" = 97, 0
, which will not be escaped. With Unicode characters the second byte is never 0:
L"愛" = 27, 97
, which is escaped as \u611b. But how do I detect a UTF-32 character, since it is to be escaped differently from UTF-16, with 8 hex digits?
It is not as simple as just checking the size of the string, as UTF-16 characters are multibyte, e.g.:
L"प्रे" = 42, 9, 77, 9, 48, 9, 71, 9
I'm tasked to escape unescaped input string literals like Eat, drink, 愛 and store them to disk in their escaped literal form Eat, drink, \u611b (UTF-16 example). If my program finds a UTF-32 character it should escape those too, in the form \U8902611b (UTF-32 example), but I can't find a certain way of knowing whether I'm dealing with UTF-16 or UTF-32 in an input byte array. So, just how can I reliably distinguish UTF-32 from UTF-16 characters within a wchar_t string or byte array?
There are many questions within your question; I will try to answer the most important ones.
Q. I have a C++ string like "Eat, drink, 愛", is it a UTF-8, UTF-16 or UTF-32 string?
A. This is implementation-defined. In many implementations this will be a UTF-8 string, but this is not mandated by the standard. Consult your documentation.
Q. I have a wide C++ string like L"Eat, drink, 愛", is it a UTF-8, UTF-16 or UTF-32 string?
A. This is implementation-defined. In many implementations this will be a UTF-32 string. In some other implementations it will be a UTF-16 string. Neither is mandated by the standard. Consult your documentation.
Q. How can I have portable UTF-8, UTF-16 or UTF-32 C++ string literals?
A. In C++11 there is a way:
u8"I'm a UTF-8 string."
u"I'm a UTF-16 string."
U"I'm a UTF-32 string."
In C++03, no such luck.
Q. Does the string "Eat, drink, 愛" contain at least one UTF-32 character?
A. There are no such things as UTF-32 (and UTF-16 and UTF-8) characters. There are UTF-32 etc. strings. They all contain Unicode characters.
Q. What the heck is a Unicode character?
A. It is an element of a coded character set defined by the Unicode standard. In a C++ program it can be represented in various ways; the simplest and most straightforward one is a single 32-bit integral value corresponding to the character's code point. (I'm ignoring composite characters here and equating "character" with "code point", unless stated otherwise, for simplicity.)
Q. Given a Unicode character, how can I escape it?
A. Examine its value. If it's between 256 and 65535, print a \u escape sequence (4 hex digits). If it's greater than 65535, print a \U escape sequence (8 hex digits). Otherwise, print it as you normally would.
Q. Given a UTF-32 encoded string, how can I decompose it to characters?
A. Each element of the string (which is called a code unit) corresponds to a single character (code point). Just take them one by one. Nothing special needs to be done.
Q. Given a UTF-16 encoded string, how can I decompose it to characters?
A. Values (code units) outside of the 0xD800 to 0xDFFF range correspond to the Unicode characters with the same value. For each such value, print either a normal character or a \u escape sequence (4 hex digits). Values in the 0xD800 to 0xDFFF range are grouped in pairs, each pair representing a single character (code point) in the U+10000 to U+10FFFF range. For such a pair, print a \U escape sequence (8 hex digits). To convert a pair (v1, v2) to its character value, use this formula:
c = ((v1 - 0xd800) << 10) + (v2 - 0xdc00) + 0x10000
Note the first element of the pair must be in the range of 0xd800..0xdbff and the second one is in 0xdc00..0xdfff, otherwise the pair is ill-formed.
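Putting the UTF-16 rules above together, here is a minimal sketch; the escape_utf16 helper and its thresholds are my own illustration (assuming a UTF-8 encoded source file so the u"..." literal is read correctly), not code from the question:
#include <cstdio>
#include <string>

// Hypothetical helper: walk a char16_t string, printing \uXXXX escapes for
// BMP code points above U+00FF and \UXXXXXXXX escapes for surrogate pairs.
// Lone-surrogate error handling is omitted for brevity.
void escape_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned long v1 = s[i];
        if (v1 >= 0xD800 && v1 <= 0xDBFF && i + 1 < s.size()) {
            unsigned long v2 = s[++i];   // low surrogate
            unsigned long c = 0x10000 + ((v1 - 0xD800) << 10) + (v2 - 0xDC00);
            std::printf("\\U%08lx", c);
        } else if (v1 > 0xFF) {
            std::printf("\\u%04lx", v1);
        } else {
            std::printf("%c", static_cast<int>(v1));
        }
    }
    std::printf("\n");
}

int main() {
    escape_utf16(u"Eat, drink, 愛");   // prints: Eat, drink, \u611b
}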
Q. Given a UTF-8 encoded string, how can I decompose it to characters?
A. The UTF-8 encoding is a bit more complicated than the UTF-16 one and I will not detail it here. There are many descriptions and sample implementations out there on the 'net, look them up.
Q. What's up with my L"प्रे" string?
A. It is a composite character that is composed of four Unicode code points, U+092A, U+094D, U+0930, U+0947. Note it's not the same as a high code point being represented with a surrogate pair as detailed in the UTF-16 part of the answer. It's a case of "character" being not the same as "code point". Escape each code point separately. At this level of abstraction, you are dealing with code points, not actual characters anyway. Characters come into play when you e.g. display them for the user, or compute their position in a printed text, but not when dealing with string encodings.
Having trouble understanding the semantics of u8 literals, or rather, understanding the result on g++ 4.8.1.
This is my expectation:
const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() > 3);
This is the result on g++ 4.8.1
const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() == 3);
The source file is ISO-8859(-1)
We use these compiler directives: -m64 -std=c++11 -pthread -O3 -fpic
In my world, regardless of the encoding of the source file the resulting utf8 string should be longer than 3.
Or, have I totally misunderstood the semantics of u8, and the use-case it targets? Please enlighten me.
Update
If I explicitly tell the compiler what encoding the source file is, as many suggested, I get the expected behavior for u8 literals. But regular literals also get encoded to UTF-8.
That is:
const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() > 3);
assert( utf8 == "åäö");
compiler directive: g++ -m64 -std=c++11 -pthread -O3 -finput-charset=ISO8859-1
Tried a few other charsets defined by iconv, e.g. ISO_8859-1 and so on...
I'm even more confused now than before...
The u8 prefix really just means "when compiling this code, generate a UTF-8 string from this literal". It says nothing about how the literal in the source file should be interpreted by the compiler.
So you have several factors at play:
which encoding the source file is written in (in your case, apparently ISO-8859-1). According to this encoding, the string literal is "åäö" (3 bytes, containing the values 0xc5, 0xe4, 0xf6)
which encoding does the compiler assume when reading the source file? (I suspect that GCC defaults to UTF-8, but I could be wrong.)
the encoding that the compiler uses for the generated string in the object file. You specify this to be UTF-8 via the u8 prefix.
Most likely, #2 is where this goes wrong. If the compiler interprets the source file as ISO-8859, then it will read the three characters, convert them to UTF-8, and write those, giving you a 6-byte (I think each of those chars encodes to 2 bytes in UTF-8) string as a result.
However, if it assumes the source file to be UTF-8, then it won't need to do a conversion at all: it reads 3 bytes, which it assumes are UTF-8 (even though they're invalid garbage values for UTF-8), and since you asked for the output string to be UTF-8 as well, it just outputs those same 3 bytes.
You can tell GCC which source encoding to assume with -finput-charset, or you can encode the source as UTF-8, or you can use the \uXXXX escape sequences in the string literal ( \u00E5 instead of å, for example)
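For example, a minimal sketch of the escape-sequence approach, which behaves the same regardless of the source file's encoding (this assumes C++11/14/17 semantics; in C++20 a u8 literal has type const char8_t[] and would not initialize a std::string directly):
#include <cassert>
#include <string>

int main() {
    // Universal character names keep the literal independent of the source
    // file's encoding; the u8 prefix makes the compiler emit UTF-8 bytes.
    const std::string utf8 = u8"\u00E5\u00E4\u00F6";   // "åäö"

    assert(utf8.size() == 6);                          // two UTF-8 bytes per character
    assert(utf8 == "\xc3\xa5\xc3\xa4\xc3\xb6");
}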
Edit:
To clarify a bit, when you specify a string literal with the u8 prefix in your source code, then you are telling the compiler that "regardless of which encoding you used when reading the source text, please convert it to UTF-8 when writing it out to the object file". You are saying nothing about how the source text should be interpreted. That is up to the compiler to decide (perhaps based on which flags you passed to it, perhaps based on the process' environment, or perhaps just using a hardcoded default)
If the string in your source text contains the bytes 0xc5, 0xe4, 0xf6, and you tell it that "the source text is encoded as ISO-8859", then the compiler will recognize that "the string consists of the characters "åäö". It will see the u8 prefix, and convert these characters to UTF-8, writing the byte sequence 0xc3, 0xa5, 0xc3, 0xa4, 0xc3, 0xb6 to the object file. In this case, you end up with a valid UTF-8 encoded text string containing the UTF-8 representation of the characters "åäö".
However, if the string in your source text contains the same bytes, and you make the compiler believe that the source text is encoded as UTF-8, then there are two things the compiler may do (depending on the implementation):
it might try to parse the bytes as UTF-8, in which case it will recognize that "this is not a valid UTF-8 sequence", and issue an error. This is what Clang does.
alternatively, it might say "ok, I have 3 bytes here, I am told to assume that they form a valid UTF-8 string. I'll hold on to them and see what happens". Then, when it is supposed to write the string to the object file, it goes "ok, I have these 3 bytes from before, which are marked as being UTF-8. The u8 prefix here means that I am supposed to write this string as UTF-8. Cool, no need to do a conversion then. I'll just write these 3 bytes and I'm done". This is what GCC does.
Both are valid. The C++ language doesn't state that the compiler is required to check the validity of the string literals you pass to it.
But in both cases, note that the u8 prefix has nothing to do with your problem. That just tells the compiler to convert from "whatever encoding the string had when you read it, to UTF-8". But even before this conversion, the string was already garbled, because the bytes corresponded to ISO-8859 character data, but the compiler believed them to be UTF-8 (because you didn't tell it otherwise).
The problem you are seeing is simply that the compiler didn't know which encoding to use when reading the string literal from your source file.
The other thing you are noticing is that a "traditional" string literal, with no prefix, is going to be encoded with whatever encoding the compiler likes. The u8 prefix (and the corresponding UTF-16 and UTF-32 prefixes) were introduced precisely to allow you to specify which encoding you wanted the compiler to write the output in. The plain prefix-less literals do not specify an encoding at all, leaving it up to the compiler to decide on one.
In order to illustrate this discussion, here are some examples. Let's consider the code:
#include <iostream>

int main() {
    std::cout << "åäö\n";
}
1) Compiling this with g++ -std=c++11 encoding.cpp will produce an executable that yields:
% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a
In other words, two bytes per "grapheme cluster" (according to unicode jargon, i.e. in this case, per character), plus the final newline (0a). This is because my file is encoded in utf-8, the input-charset is assumed to be utf-8 by cpp, and the exec-charset is utf-8 by default in gcc (see https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html). Good.
2) Now if I convert my file to iso-8859-1 and compile again using the same command, I get:
% ./a.out | od -txC
0000000 e5 e4 f6 0a
i.e. the three characters are now encoded using iso-8859-1. I am not sure about the magic going on here, as this time it seems that cpp correctly guessed that the file was iso-8859-1 (without any hint), converted it to utf-8 internally (according to the link above) but the compiler still stored the iso-8859-1 string in the binary. This we can check by looking at the .rodata section of the binary:
% objdump -s -j .rodata a.out
a.out: file format elf64-x86-64
Contents of section .rodata:
400870 01000200 00e5e4f6 0a00 ..........
(Note the "e5e4f6" sequence of bytes).
This makes perfect sense as a programmer who uses latin-1 literals does not expect them to come out as utf-8 strings in his program's output.
3) Now if I keep the same iso-8859-1-encoded file, but compile with g++ -std=c++11 -finput-charset=iso-8859-1 encoding.cpp, then I get a binary that outputs utf-8 data:
% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a
I find this weird: the source encoding has not changed, I explicitly tell gcc it is latin-1, and I get utf-8 as a result! Note that this can be overridden if I explicitly request the exec-charset with g++ -std=c++11 -finput-charset=iso-8859-1 -fexec-charset=iso-8859-1 encoding.cpp:
% ./a.out | od -txC
0000000 e5 e4 f6 0a
It is not clear to me how these two options interact...
4) Now let's add the "u8" prefix into the mix:
#include <iostream>

int main() {
    std::cout << u8"åäö\n";
}
If the file is utf-8-encoded then, unsurprisingly, compiling with the default charsets (g++ -std=c++11 encoding.cpp) produces utf-8 output as well. If I request the compiler to use iso-8859-1 internally instead (g++ -std=c++11 -fexec-charset=iso-8859-1 encoding.cpp), the output is still utf-8:
% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a
So it looks like the prefix "u8" prevented the compiler from converting the literal to the execution character set. Even better, if I convert the same source file to iso-8859-1, and compile with g++ -std=c++11 -finput-charset=iso-8859-1 -fexec-charset=iso-8859-1 encoding.cpp, then I still get utf-8 output:
% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a
So it seems that"u8" actually acts as an "operator" that tells the compiler "convert this literal to utf-8".
this->textBox1->Name = L"textBox1";
Although it seems to work without the L, what is the purpose of the prefix? The way it is used doesn't even make sense to a hardcore C programmer.
It's a wchar_t literal, for the extended character set. Wikipedia has a little discussion on this topic, and C++ examples.
'L' means wchar_t, which, as opposed to a normal character, requires more than 8 bits of storage (16 bits on Windows, 32 bits on most Unix-like systems). Here's an example with 16-bit wchar_t:
"A" = 41
"ABC" = 41 42 43
L"A" = 00 41
L"ABC" = 00 41 00 42 00 43
A wchar_t is twice as big as a plain char on Windows. In daily use you don't need wchar_t, but if you are using windows.h you are going to need it.
It means the text is stored as wchar_t characters rather than plain old char characters.
(I originally said it meant unicode. I was wrong about that. But it can be used for unicode.)
It means that it is a wide character, wchar_t.
Similar to 1L being a long value.
It means it's an array of wide characters (wchar_t) instead of narrow characters (char).
It's a just a string of a different kind of character, not necessarily a Unicode string.
L is a prefix used for wide strings. Each character uses several bytes (depending on the size of wchar_t). The encoding used is independent of this prefix; it is not necessarily UTF-16, unlike what other answers here state.
Here is an example of the usage:
By adding L before the character literal you get a wide character (wchar_t), which is then implicitly converted to the char32_t return type (a U prefix would give you char32_t directly):
char32_t utfRepresentation()
{
    if (m_is_white)
    {
        return L'♔';
    }
    return L'♚';
}
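To see how platform-dependent wide strings are, here is a small, purely illustrative sketch that prints the size of wchar_t and the code units of L"ABC":
#include <cstdio>

int main() {
    const wchar_t* s = L"ABC";

    // Typically 2 on Windows (UTF-16 code units) and 4 on Linux/macOS (UTF-32).
    std::printf("sizeof(wchar_t) = %u\n", static_cast<unsigned>(sizeof(wchar_t)));

    for (const wchar_t* p = s; *p != 0; ++p)
        std::printf("%04x ", static_cast<unsigned>(*p));   // 0041 0042 0043
    std::printf("\n");
}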
Let's say I have a char array like "äa".
Is there a way to get the ASCII value (e.g. 228) of the first char, which is a multibyte character?
Even if I cast my array to a wchar_t * array, I'm not able to get the ASCII value of "ä", because it is 2 bytes long.
Is there a way to do this? I've been trying for 2 days now :(
I'm using gcc.
Thanks!
You're contradicting yourself. International characters like ä are (by definition) not in the ASCII character set, so they don't have an "ascii value".
It depends on the exact encoding of your two-character array, if you can get the code point for a single character or not, and if so which format it will be in.
You are very confused. ASCII only has values smaller than 128. The value 228 corresponds to ä in the 8-bit character sets ISO-8859-1, CP1252 and some others. It is also the UCS value of ä in the Unicode system. If you use the string literal "ä" and get a string of two characters, the string is in fact encoded in UTF-8, and you may wish to parse the UTF-8 encoding to acquire the Unicode UCS value.
More likely, what you really want to do is convert from one character set to another. How to do this heavily depends on your operating system, so more information is required. You also need to specify what exactly you want: a std::string or char* of ISO-8859-1, perhaps?
There is a standard C++ facility to do that conversion, std::ctype<wchar_t>::narrow(). It is part of the localization library. It will convert the wide character to the equivalent char value for your current locale, if possible. As the other answers have pointed out, there isn't always a mapping, which is why narrow() takes a default character that it will return if there is no mapping.
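A minimal sketch of that approach (purely illustrative; it assumes the user's default locale is valid and can represent 'ä' as a single narrow character):
#include <iostream>
#include <locale>

int main() {
    std::locale loc("");   // the user's default locale; may throw if misconfigured

    wchar_t wc = L'\u00E4';   // 'ä'
    char c = std::use_facet<std::ctype<wchar_t>>(loc).narrow(wc, '?');

    // narrow() returns the fallback '?' when no single-char mapping exists.
    std::cout << "narrowed to: " << c << '\n';
}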
Depends on the encoding used in your char array.
If your char array is Latin-1 encoded, then it is 2 bytes long (plus maybe a NUL terminator, we don't care), and those 2 bytes are:
0xE4 (lower-case a umlaut)
0x61 (lower-case a).
Note that Latin 1 is not ASCII, and 0xE4 is not an ASCII value, it's a Latin 1 (or Unicode) value.
You would get the value like this:
int i = (unsigned char) my_array[0];
If your char array is UTF-8 encoded, then it is three bytes long, and those bytes are:
binary 11000011 (first byte of UTF-8 encoded 0xE4)
binary 10100100 (second byte of UTF-8 encoded 0xE4)
0x61 (lower-case a)
To recover the Unicode value of a character encoded with UTF-8, you either need to implement it yourself based on http://en.wikipedia.org/wiki/UTF-8#Description (usually a bad idea in production code), or else you need to use a platform-specific unicode-to-wchar_t conversion routine. On linux this is mbstowcs or iconv, although for a single character you can use mbtowc provided that the multi-byte encoding defined for the current locale is in fact UTF-8:
wchar_t i;
if (mbtowc(&i, my_array, 3) == -1) {
    // handle error
}
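For completeness, here is a self-contained sketch of the mbtowc route; it assumes a UTF-8 locale named "en_US.UTF-8" is installed, which may not be true on every system:
#include <clocale>
#include <cstdio>
#include <cstdlib>

int main() {
    // mbtowc uses the current locale's multibyte encoding, so select a
    // UTF-8 locale first ("en_US.UTF-8" is an assumption; it may not exist).
    if (!std::setlocale(LC_ALL, "en_US.UTF-8"))
        return 1;

    const char my_array[] = "\xC3\xA4" "a";   // UTF-8 bytes of "äa"

    wchar_t wc;
    int len = std::mbtowc(&wc, my_array, sizeof my_array);
    if (len == -1)
        return 1;   // invalid multibyte sequence

    std::printf("first character: U+%04lX, %d byte(s)\n",
                static_cast<unsigned long>(wc), len);   // U+00E4, 2 byte(s)
}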
If it's SHIFT-JIS then this doesn't work...
What you want is called transliteration: converting letters of one language to another. It has nothing to do with Unicode and wchars. You need to have a mapping table.