Escaping Unicode characters with C/C++ - c++

I need to escape Unicode characters within an input string to either UTF-16 or UTF-32 escape sequences. For example, the input string literal "Eat, drink, 愛" should be escaped as "Eat, drink, \u611b". Here are the rules in a table of sorts:
Escape | Unicode code point
'\u' HEX HEX HEX HEX | A Unicode code point in the range U+0 to U+FFFF inclusive, corresponding to the encoded hexadecimal value.
'\U' HEX HEX HEX HEX HEX HEX HEX HEX | A Unicode code point in the range U+0 to U+10FFFF inclusive, corresponding to the encoded hexadecimal value.
It's simple to detect non-ASCII characters in general, since for ASCII the second byte is 0:
L"a" = 97, 0
which will not be escaped. With characters like 愛 the second byte is never 0:
L"愛" = 27, 97
which is escaped as \u611b. But how do I detect a character in a string that needs the UTF-32 escape, since it has to be escaped differently than UTF-16, with 8 hex digits?
It is not as simple as just checking the size of the string, since UTF-16 characters can be multibyte, e.g.:
L"प्रे" = 42, 9, 77, 9, 48, 9, 71, 9
I'm tasked with escaping unescaped input string literals like Eat, drink, 愛 and storing them to disk in their escaped literal form Eat, drink, \u611b (UTF-16 example). If my program finds a UTF-32 character it should escape that too, in the form \U8902611b (UTF-32 example), but I can't find a reliable way of knowing whether I'm dealing with UTF-16 or UTF-32 in an input byte array. So, how can I reliably distinguish UTF-32 characters from UTF-16 characters within a wchar_t string or byte array?

There are many questions within your question; I will try to answer the most important ones.
Q. I have a C++ string like "Eat, drink, 愛", is it a UTF-8, UTF-16 or UTF-32 string?
A. This is implementation-defined. In many implementations this will be a UTF-8 string, but this is not mandated by the standard. Consult your documentation.
Q. I have a wide C++ string like L"Eat, drink, 愛", is it a UTF-8, UTF-16 or UTF-32 string?
A. This is implementation-defined. In many implementations this will be a UTF-32 string. In some other implementations it will be a UTF-16 string. Neither is mandated by the standard. Consult your documentation.
Q. How can I have portable UTF-8, UTF-16 or UTF-32 C++ string literals?
A. In C++11 there is a way:
u8"I'm a UTF-8 string."
u"I'm a UTF-16 string."
U"I'm a UTF-32 string."
In C++03, no such luck.
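(These prefixes also determine the literal's element type: u8"..." yields const char[] (const char8_t[] since C++20), u"..." yields const char16_t[], and U"..." yields const char32_t[], so they can initialize std::string, std::u16string and std::u32string respectively.)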
Q. Does the string "Eat, drink, 愛" contain at least one UTF-32 character?
A. There are no such things as UTF-32 (and UTF-16 and UTF-8) characters. There are UTF-32 etc. strings. They all contain Unicode characters.
Q. What the heck is a Unicode character?
A. It is an element of a coded character set defined by the Unicode standard. In a C++ program it can be represented in various ways, the most simple and straightforward one is with a single 32-bit integral value corresponding to the character's code point. (I'm ignoring composite characters here and equating "character" and "code point", unless stated otherwise, for simplicity).
Q. Given a Unicode character, how can I escape it?
A. Examine its value. If it's between 256 and 65535, print a \u escape sequence with 4 hex digits. If it's greater than 65535, print a \U escape sequence with 8 hex digits. Otherwise, print it as you normally would.
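A minimal sketch of that decision, assuming the character is held in a char32_t and using the question's \u/\U format (the function name escape_code_point is made up for this illustration):
#include <cstdio>
#include <string>
// Sketch: escape one code point in the question's format.
// Values below 256 are passed through unchanged; adjust that threshold as needed.
std::string escape_code_point(char32_t cp) {
    char buf[16];
    if (cp < 256)
        std::snprintf(buf, sizeof buf, "%c", static_cast<char>(cp));
    else if (cp <= 0xFFFF)
        std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(cp));
    else
        std::snprintf(buf, sizeof buf, "\\U%08x", static_cast<unsigned>(cp));
    return buf;
}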
Q. Given a UTF-32 encoded string, how can I decompose it to characters?
A. Each element of the string (which is called a code unit) corresponds to a single character (code point). Just take them one by one. Nothing special needs to be done.
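For instance, a sketch assuming a std::u32string (or a wchar_t string on a platform where wchar_t is 32-bit), reusing the hypothetical escape_code_point above:
std::string escape_utf32(const std::u32string& s) {
    std::string out;
    for (char32_t cp : s)        // each code unit is already a full code point
        out += escape_code_point(cp);
    return out;
}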
Q. Given a UTF-16 encoded string, how can I decompose it to characters?
A. Values (code units) outside of the 0xD800 to 0xDFFF range correspond to the Unicode characters with the same value. For each such value, print either a normal character or a \u escape sequence with 4 hex digits. Values in the 0xD800 to 0xDFFF range are grouped in pairs, each pair representing a single character (code point) in the U+10000 to U+10FFFF range. For such a pair, print a \U escape sequence with 8 hex digits. To convert a pair (v1, v2) to its character value, use this formula:
c = ((v1 - 0xD800) << 10) + (v2 - 0xDC00) + 0x10000
Note the first element of the pair must be in the range 0xD800..0xDBFF and the second one in 0xDC00..0xDFFF, otherwise the pair is ill-formed.
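A sketch of that decoding, assuming a std::u16string and the formula above; validation of the second code unit is kept minimal here:
#include <cstddef>
#include <string>
// Read one code point starting at index i, advancing i past the
// code units consumed (1 for BMP values, 2 for a surrogate pair).
char32_t next_code_point(const std::u16string& s, std::size_t& i) {
    char32_t v1 = s[i++];
    if (v1 >= 0xD800 && v1 <= 0xDBFF && i < s.size()) {
        char32_t v2 = s[i++];    // expected to be in 0xDC00..0xDFFF
        return ((v1 - 0xD800) << 10) + (v2 - 0xDC00) + 0x10000;
    }
    return v1;                   // BMP code point (or a lone surrogate, left as-is)
}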
Q. Given a UTF-8 encoded string, how can I decompose it to characters?
A. The UTF-8 encoding is a bit more complicated than the UTF-16 one and I will not detail it here. There are many descriptions and sample implementations out there on the 'net, look them up.
Q. What's up with my L"प्रे" string?
A. It is a composite character made up of four Unicode code points: U+092A, U+094D, U+0930, U+0947. Note that it's not the same as a high code point being represented with a surrogate pair, as detailed in the UTF-16 part of the answer; it's a case of "character" not being the same as "code point". Escape each code point separately. At this level of abstraction you are dealing with code points, not actual characters, anyway. Characters come into play when you e.g. display them for the user, or compute their position in a printed text, but not when dealing with string encodings.

Related

Understanding unicode codecvt

I have a UTF-16 encoded stream and I'd like to convert it into plain ASCII, i.e. if there's an ASCII character, print it. If a code unit represents something else (e.g. Chinese characters) I don't care: output garbage.
I'm using this code
typedef std::codecvt_utf16<wchar_t> convert_typeX;
std::wstring_convert<convert_typeX, wchar_t> converterX;
std::wstring converted = converterX.from_bytes(str);
and it seems to work.. but why?
documentation for codecvt_utf16 states:
std::codecvt_utf16 is a std::codecvt facet which encapsulates conversion between a UTF-16 encoded byte string and UCS2 or UCS4 character string (depending on the type of Elem).
UCS2 is a version of Unicode as far as I know... so this code is converting to a sequence of wchar_t values that represent Unicode characters, right? How come I'm getting ASCII bytes?
The nice thing about Unicode is that Unicode values 0-127 represent ASCII characters 0-127.
So, you don't even need to waste your time with std::codecvt. All you have to do is scan your UTF-16 sequence, grab all UTF-16 values in the range 0-127 (see the Wikipedia entry for UTF-16 for the simple process of extracting UTF-16 values from the byte stream), and you'll end up with plain ASCII, as if by magic. That's because, by definition, values above 127 are not plain ASCII. You can do whatever you want with all other characters.
And if you would like to expand your universe to ISO-8859-1 rather than US-ASCII, you can expand your range to 0-255, because Unicode values 128-255 are also equivalent to characters 128-255 in the ISO-8859-1 codeset.
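A minimal sketch of that scan, assuming the bytes have already been assembled into 16-bit code units (endianness handled elsewhere); extract_ascii is just an illustrative name:
#include <string>
std::string extract_ascii(const std::u16string& utf16) {
    std::string out;
    for (char16_t unit : utf16) {
        if (unit < 128)          // values 0-127 are identical to ASCII
            out += static_cast<char>(unit);
        // everything else, including surrogate code units, is simply skipped
    }
    return out;
}
Surrogate code units (0xD800-0xDFFF) are far above 127, so characters outside the BMP drop out without any special handling.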

std::string and UTF-8 encoded unicode

If I understand correctly, it is possible to use both string and wstring to store UTF-8 text.
With char, ASCII characters take a single byte, some chinese characters take 3 or 4, etc. Which means that str[3] doesn't necessarily point to the 4th character.
With wchar_t same thing, but the minimal amount of bytes used per characters is always 2 (instead of 1 for char), and a 3 or 4 byte wide character will take 2 wchar_t.
Right ?
So, what if I want to use string::find_first_of() or string::compare(), etc. with such a weirdly encoded string? Will it work? Does the string class handle the fact that characters have a variable size? Or should I only use them as dummy feature-less byte arrays, in which case I'd rather go for a wchar_t[] buffer?
If std::string doesn't handle that, second question: are there libraries providing string classes that could handle that UTF-8 encoding, so that str[3] actually points to the 3rd character (which would be a byte array of length 1 to 4)?
You are talking about Unicode. A Unicode code point needs up to 21 bits, commonly stored in 32 bits per character. However, since that wastes memory, there are more compact encodings. UTF-8 is one such encoding. It uses bytes as units and maps Unicode characters to 1, 2, 3 or 4 bytes. UTF-16 is another; it uses 16-bit words as units and maps Unicode characters to 1 or 2 words (2 or 4 bytes).
You can use both encodings with both string and wstring. UTF-8 tends to be more compact for English text/numbers.
Some things will work regardless of encoding and type used (compare). However, all functions that need to understand one character will be broken, i.e. the 5th character is not always the 5th entry in the underlying array. It might look like it's working with certain examples, but it will eventually break.
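A small illustration of that point, assuming the literal ends up stored as UTF-8 (as it commonly does on Linux):
std::string s = "aä愛";  // 1 + 2 + 3 = 6 bytes in UTF-8
// s.size() == 6, yet the text contains only 3 characters;
// s[1] is merely the first byte of "ä", not the second character.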
string::compare will work but do not expect to get alphabetical ordering. That is language dependent.
string::find_first_of will work for some but not all. Long strings will likely work just because they are long, while shorter ones might get confused by character alignment and generate very hard-to-find bugs.
Best thing is to find a library that handles it for you and ignore the type underneath (unless you have strong reasons to pick one or the other).
You can't handle Unicode with std::string or any other tools from the Standard Library. Use an external library such as: http://utfcpp.sourceforge.net/
You are correct for those:
...Which means that str[3] doesn't necessarily point to the 4th character...only use them as dummy feature-less byte arrays...
A C++ string treats its contents as plain bytes; unlike Java's String, it has no notion of Unicode characters. You can store the encoding result (bytes) of Chinese characters in a string (a char in C/C++ is just a byte), but this is of limited use because the string functions treat those bytes as individual ASCII-like chars, so you cannot use them to process the text character by character.
wstring may be something you need.
There is something that should be clarified: UTF-8 is just an encoding method for Unicode characters (transforming characters to/from byte format).

JSON stored as UTF-8 requires two encoding conversions

A JSON string can contain the escape sequence \u four-hex-digits, which encodes a 16-bit value (two octets).
After reading the four hex digits into c1, c2, c3, c4, the JSON Spirit C++ library returns a single character whose value is (hex_to_num (c1) << 12) + (hex_to_num (c2) << 8) + (hex_to_num (c3) << 4) + hex_to_num (c4).
Based on the simplicity of the decoding scheme, and based on having only 2 octets to decode, I conclude that JSON escape sequences support only UCS-2 encoding, which is text from the BMP U+0000 to U+FFFF encoded "as is" using the code point as the 16-bit code unit.
Since UTF-16 and UCS-2 encode valid code points in U+0000 to U+FFFF as single 16-bit code units that are numerically equal to the corresponding code points (wikipedia), one can simply pretend that the decoded UCS-2 character is a UTF-16 character.
Escape sequences aside, a normal unescaped JSON string can contain "any Unicode character except " or \ or control-character" (JSON spec). Since JSON is a subset of ECMAScript, which is assumed to be UTF-16 (ECMA standard), I conclude that JSON supports UTF-16 encoding, which is broader than what the escape sequence provides.
Now, having reduced all JSON strings to UTF-16, if one converts them from UTF-16 to UTF-8, my understanding is that the UTF-8 can be stored in a std::string on Linux, because during processing one can usually ignore that a single code point may occupy several std::string chars (a multi-byte UTF-8 sequence).
If all the above assumptions and interpretations are correct, one can safely parse JSON and store it into a std::string on Linux. Can someone please verify?
You are mistaken in several regards:
1) The \u escape values in JSON are UTF-16 code units, not UCS-2 code points. Despite the claims of Wikipedia, the two are not (necessarily) the same: UCS-2 and UTF-16 are not 100% byte-compatible (although they are for all characters that existed before UTF-16 was created in the Unicode 2.0 standard).
2) The JSON escape sequence can represent all of UTF-16 by using surrogate pairs of code units.
Your end assertion is still true - you can safely parse JSON and store it in a std::string, but the conversion can't be based on the assumptions you're making (and using std::string to essentially store a bundle of bytes likely isn't what you want anyhow).
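For illustration, a sketch of turning one JSON \u escape (or a surrogate pair of them) into a code point, assuming the four hex digits have already been parsed into 16-bit values; the function name is made up for this example:
#include <cstdint>
// Combine one or two already-parsed \uXXXX values into a Unicode code point.
// 'consumed' reports whether one or two escape sequences were used.
char32_t json_escapes_to_code_point(std::uint16_t u1, std::uint16_t u2, int& consumed) {
    if (u1 >= 0xD800 && u1 <= 0xDBFF && u2 >= 0xDC00 && u2 <= 0xDFFF) {
        consumed = 2;            // surrogate pair, e.g. \uD83D\uDE00 -> U+1F600
        return ((static_cast<char32_t>(u1) - 0xD800) << 10) + (u2 - 0xDC00) + 0x10000;
    }
    consumed = 1;                // a single BMP code unit
    return u1;
}
The resulting code point can then be encoded as UTF-8 before being appended to the std::string.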

How can I convert a decimal code of a character into a Unicode string in C++?

How can I convert a decimal code of a character into a Unicode string in C++?
For example, I give it the integer 241, that is the 'ñ' Spanish letter, and I want to convert it to a Unicode string.
If your source character set is ISO 8859-1 or 8859-15 (both of which have LATIN SMALL LETTER N WITH TILDE at code point 0xF1 = 241), then the conversion needs to create the correct encoding for Unicode character U+00F1.
Now, we need to know which Unicode encoding scheme you are using. If you use UTF-8, you will need the result:
\xC3 \xB1
If you use UTF-16 BE (big endian), you need:
\x00 \xF1
If you use UTF-16 LE (little endian), you need:
\xF1 \x00
If you are using UTF-32, then you need 4 bytes instead of 2.
And if you want a C-style string, you will also need to encode a terminating U+0000 (NUL) character after it.
If you don't know which form you need, you have big problems; to use Unicode, you need to understand something of how the different forms are encoded. Your library may save you from a lot of the bother of understanding, but ultimately, you need to know at least a minimum about Unicode.
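As a rough sketch of what the UTF-8 case involves, assuming the input is already a Unicode code point (e.g. 0xF1 for ñ) and skipping validation of surrogates and out-of-range values:
#include <string>
// Encode one Unicode code point as UTF-8 bytes.
std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {     // U+00F1 lands here and yields 0xC3 0xB1
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}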
If the character code is determined at runtime and you cannot use literals as explained by Jonathan, you need to rely on your toolkit. For example, in Qt:
QString codepointToString(QString codepointDecimal) {
    int codepoint = codepointDecimal.toInt(); // TODO: check errors
    QChar character(codepoint);
    return QString(character);
}
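Note that QChar holds a single UTF-16 code unit, so the function above only covers code points up to U+FFFF. For larger values something along these lines should work (a hedged variant using QString::fromUcs4):
QString codepointToStringFull(uint codepoint) {
    return QString::fromUcs4(&codepoint, 1);  // also handles code points above U+FFFF
}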

c++: getting ascii value of a wide char

Let's say I have a char array like "äa".
Is there a way to get the ASCII value (e.g. 228) of the first char, which is a multibyte character?
Even if I cast my array to a wchar_t * array, I'm not able to get the ASCII value of "ä", because it's 2 bytes long.
Is there a way to do this? I've been trying for 2 days now :(
I'm using gcc.
Thanks!
You're contradicting yourself. International characters like ä are (by definition) not in the ASCII character set, so they don't have an "ASCII value".
Whether you can get the code point for a single character, and if so in which format, depends on the exact encoding of your two-character array.
You are very confused. ASCII only has values smaller than 128. Value 228 corresponds to ä in the 8-bit character sets ISO-8859-1, CP1252 and some others. It is also the UCS value of ä in the Unicode system. If you use the string literal "ä" and get a string of two characters, the string is in fact encoded in UTF-8, and you may wish to parse the UTF-8 coding to acquire the Unicode UCS values.
More likely, what you really want to do is convert from one character set to another. How to do this heavily depends on your operating system, so more information is required. You also need to specify what exactly you want: a std::string or char* of ISO-8859-1, perhaps?
There is a standard C++ facility to do that conversion, std::ctype<wchar_t>::narrow(). It is part of the localization library. It will convert the wide character to the equivalent char value for your current locale, if possible. As the other answers have pointed out, there isn't always a mapping, which is why narrow() takes a default character that it will return if there is no mapping.
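A sketch of how that facet is typically reached, assuming the relevant locale has been set up; '?' is the fallback returned when no mapping exists:
#include <locale>
char narrow_wchar(wchar_t wc, const std::locale& loc = std::locale()) {
    // Uses the locale's ctype<wchar_t> facet to map a wide character to char.
    return std::use_facet<std::ctype<wchar_t>>(loc).narrow(wc, '?');
}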
Depends on the encoding used in your char array.
If your char array is Latin-1 encoded, then it is 2 bytes long (plus maybe a NUL terminator, we don't care), and those 2 bytes are:
0xE4 (lower-case a umlaut)
0x61 (lower-case a).
Note that Latin 1 is not ASCII, and 0xE4 is not an ASCII value, it's a Latin 1 (or Unicode) value.
You would get the value like this:
int i = (unsigned char) my_array[0];
If your char array is UTF-8 encoded, then it is three bytes long, and those bytes are:
binary 11000011 = 0xC3 (first byte of UTF-8 encoded U+00E4)
binary 10100100 = 0xA4 (second byte of UTF-8 encoded U+00E4)
0x61 (lower-case a)
To recover the Unicode value of a character encoded with UTF-8, you either need to implement the decoding yourself based on http://en.wikipedia.org/wiki/UTF-8#Description (usually a bad idea in production code), or else you need to use a platform-specific Unicode-to-wchar_t conversion routine. On Linux this is mbstowcs or iconv, although for a single character you can use mbtowc, provided that the multibyte encoding defined for the current locale is in fact UTF-8:
wchar_t i;
// mbtowc interprets the bytes according to the current locale, so the
// locale's multibyte encoding must actually be UTF-8 for this to work.
if (mbtowc(&i, my_array, 3) == -1) {
    // handle error
}
If it's SHIFT-JIS then this doesn't work...
What you want is called transliteration: converting letters of one language to another. It has nothing to do with Unicode and wchar_t; you need a mapping table.