A JSON string can contain the escape sequence \u followed by four hex digits, which encode two octets (one 16-bit value).
After reading the four hex digits into c1, c2, c3, c4, the JSON Spirit C++ library returns a single character whose value is (hex_to_num (c1) << 12) + (hex_to_num (c2) << 8) + (hex_to_num (c3) << 4) + hex_to_num (c4).
Based on the simplicity of the decoding scheme, and based on having only 2 octets to decode, I conclude that JSON escape sequences support only UCS-2 encoding, which is text from the BMP U+0000 to U+FFFF encoded "as is" using the code point as the 16-bit code unit.
Since UTF-16 and UCS-2 encode valid code points in U+0000 to U+FFFF as single 16-bit code units that are numerically equal to the corresponding code points (wikipedia), one can simply pretend that the decoded UCS-2 character is a UTF-16 character.
The escape sequence differs from a normal unescaped JSON string, which can contain "any Unicode character except " or \ or control-character" (json spec). Since JSON is a subset of ECMAScript, which is assumed to be UTF-16 (ecma standard), I conclude that JSON strings support UTF-16 encoding, which is broader than what the escape sequence provides.
Having reduced all JSON strings to UTF-16, if one converts them from UTF-16 to UTF-8, my understanding is that the UTF-8 can be stored in a std::string on Linux, because during processing one can usually ignore that a single character may occupy several bytes of the std::string (a multi-byte UTF-8 sequence).
If all the above assumptions and interpretations are correct, one can safely parse JSON and store it into a std::string on Linux. Can someone please verify?
You are mistaken in several regards:
1) The \u escape values in JSON are UTF-16 code units, not UCS-2 code points. Despite what Wikipedia suggests, the two are not (necessarily) the same: UCS-2 and UTF-16 are not 100% byte compatible (although they are for all characters that existed before UTF-16 was introduced in the Unicode 2.0 standard).
2) The JSON escape sequence can represent all of UTF-16 by using surrogate pairs of code units.
Your end assertion is still true - you can safely parse JSON and store it in a std::string, but the conversion can't be based on the assumptions you're making (and using std::string to essentially store a bundle of bytes likely isn't what you want anyhow).
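For illustration, here is a minimal sketch (not JSON Spirit's actual code; the function name and interface are made up) of the conversion point 2) implies: the 16-bit value from one \uXXXX escape is either a BMP code point or half of a surrogate pair, and once the full code point is known it can be appended to a std::string as UTF-8:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// 'unit' is the 16-bit value read from one \uXXXX escape. If it is a high
// surrogate, 'low' must hold the value of the following \uXXXX escape.
// The resulting code point is appended to 'out' as UTF-8 (1..4 bytes).
void append_escaped_code_point(uint32_t unit, uint32_t low, std::string& out)
{
    uint32_t cp = unit;
    if (unit >= 0xD800 && unit <= 0xDBFF) {              // surrogate pair
        if (low < 0xDC00 || low > 0xDFFF)
            throw std::runtime_error("lone surrogate in JSON \\u escape");
        cp = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
    }
    if (cp < 0x80) {                                     // 1-byte sequence
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                             // 2-byte sequence
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                           // 3-byte sequence
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                             // 4-byte sequence
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

For example, "\u611b" (愛, U+611B) produces the three bytes E6 84 9B, which is exactly what a UTF-8 std::string should hold.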
Related
When searching for a codepoint in a Unicode string that carries a relevant BOM (UTF-16/32), it seems to make sense to leave the encoding as-is and adjust the codepoint I'm searching for to the byte order indicated by the string's BOM.
For example, I want to trim leading and trailing slash characters.
(pseudocode)
utf16 trim_slash(utf16 string) {
    bom = bom_from_string(string)
    utf16_slash = utf16_byte_order("/", bom)
    offset = 0
    for each codepoint, scanning from the right:
        if codepoint[i] == utf16_slash
            offset++
        else
            break
    if offset > 0
        string = string.substr(0, len(string) - offset)
    return string
}
For doing the same with leading codepoints, I would skip over the BOM, and in the case where I want to extract a substring, I would simply add the BOM back on.
I'm using ConvertUTF.cpp from LLVM for UTF operations which seems to respect the BOM when converting between encodings but I still need to take the byte order into consideration when comparing with string literals and strings from other sources.
Am I going about this the right way and is my effort justified? I want to ensure that I have as proper handling of Unicode as I can.
I'm currently standardized on converting all incoming strings to UTF-32 where I need to walk along codepoints to compare search terms and then extract some substring. But I see that this is overkill when I only need to walk along the beginning and the end of a string such as the example pseudocode. In this case it would be much faster to just return the same string if nothing changes; whereas with UTF-32 I have to convert to UTF-32 and then back to the original width and then pass the final copy as the result.
With UTF-32 the minimum is 3 copies per call versus one copy if I were to consider the BOM.
Additionally, converting between UTF formats may result in a string which does not align with the original representation (having a BOM or not, regardless of endianness).
Usually, BOMs are only relevant "on-the-wire" meaning that they signal the byte order of a file, network data, or some other protocol stream as it is transmitted between systems (see the Unicode FAQ).
When such a stream is read by a program (e.g. when your utf16 string is created), it should be converted to the platform's native byte order. That is, strings should always be in the native byte order, and the BOM becomes irrelevant. When the string is written back to a file/network/stream, it should be converted from the native byte order into whatever is appropriate for the protocol (with a BOM).
Code that works with strings (other than reading/writing byte streams) should never need to handle non-native byte orders.
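For instance, once the string has been converted to native byte order on input (and the BOM dropped), the trim_slash from the question needs no byte-order bookkeeping at all; a minimal sketch using std::u16string (names are mine, not from the question):

```cpp
#include <string>

// Trim leading and trailing '/' from a UTF-16 string that is already in
// the platform's native byte order, with no BOM. '/' (U+002F) is a single
// code unit and never part of a surrogate pair, so a code-unit scan is safe.
std::u16string trim_slashes(std::u16string s)
{
    const std::size_t first = s.find_first_not_of(u'/');
    if (first == std::u16string::npos)
        return std::u16string();              // the string was all slashes
    const std::size_t last = s.find_last_not_of(u'/');
    return s.substr(first, last - first + 1);
}
```

If nothing needs trimming, the function returns a copy of the same string; no conversion to UTF-32 and back is involved.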
I want strings with Unicode characters to be handled correctly in my file synchronizer application, but I don't know how this kind of encoding works.
In a Unicode string, I can see that a Unicode character has the form "\uxxxx", where the x's are hex digits. How does a normal C or C++ program interpret this kind of character? (Why is there a 'u' after the '\'? What is its effect?)
On the internet I see examples using "wide strings" or wchar_t??
So, what's the suitable object to handle Unicode characters? In RapidJSON (which supports Unicode: UTF-8, UTF-16, UTF-32), we can use const char* to store JSON that could have "wide characters", but those characters take more than one byte to be represented... I don't understand...
This is the kind of temporary arrangement I have found for the moment (unicode -> utf8? ascii?; listFolder is a std::string):
boost::replace_all(listFolder, "\\u00e0", "à");
boost::replace_all(listFolder, "\\u00e2", "â");
boost::replace_all(listFolder, "\\u00e4", "ä");
...
The suitable object to handle Unicode strings in C++ is icu::UnicodeString (check "API References, ICU4C" in the sidebar), at least if you want to really handle Unicode strings (as opposed to just passing them from one point of your application to another).
wchar_t was an early attempt at handling international character sets, which turned out to be a failure: Microsoft's definition of wchar_t as two bytes became insufficient once Unicode was extended beyond U+FFFF. Linux defines wchar_t as four bytes, but the inconsistency makes it (and its derived std::wstring) rather useless for portable programming.
TCHAR is a Microsoft define that resolves to char by default and to WCHAR if UNICODE is defined, with WCHAR in turn being wchar_t behind a level of indirection... yeah.
C++11 brought us char16_t and char32_t as well as the corresponding string classes, but those are still instances of basic_string<>, and as such have their shortcomings, e.g. when trying to uppercase / lowercase characters that have more than one replacement character (e.g. the German ß would have to be expanded to SS in uppercase; the standard library cannot do that).
ICU, on the other hand, goes the full way. For example, it provides normalization and decomposition, which the standard strings do not.
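As a small example of that difference (assuming ICU is installed; link with -licuuc), full case mapping expands ß to SS, which basic_string plus toupper() cannot do:

```cpp
#include <unicode/unistr.h>   // icu::UnicodeString
#include <iostream>
#include <string>

int main()
{
    // "straße", with ß written as its explicit UTF-8 bytes (C3 9F) so the
    // example does not depend on the source file's encoding.
    icu::UnicodeString s = icu::UnicodeString::fromUTF8("stra\xC3\x9F" "e");

    s.toUpper();                // full Unicode case mapping: one ß -> "SS"

    std::string out;
    s.toUTF8String(out);        // back to UTF-8 for printing
    std::cout << out << '\n';   // STRASSE
    return 0;
}
```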
\uxxxx and \UXXXXXXXX are Unicode character escapes. The xxxx is a 16-bit hexadecimal number representing a UCS-2 code point, which is equivalent to a UTF-16 code unit within the Basic Multilingual Plane.
The XXXXXXXX is a 32-bit hex number representing a UTF-32 code point, which may lie in any plane.
How those character escapes are handled depends on the context in which they appear (narrow / wide string, for example), making them somewhat less than perfect.
C++11 introduced "proper" Unicode literals:
u8"..." is always a const char[] in UTF-8 encoding.
u"..." is always a const uchar16_t[] in UTF-16 encoding.
U"..." is always a const uchar32_t[] in UTF-32 encoding.
If you use \uxxxx or \UXXXXXXXX within one of those three, the character literal will always be expanded to the proper code unit sequence.
Note that storing UTF-8 in a std::string is possible, but hazardous. You need to be aware of many things: .length() is not the number of characters in your string. .substr() can lead to partial and invalid sequences. .find_first_of() will not work as expected. And so on.
That being said, in my opinion UTF-8 is the only sane encoding choice for any stored text. There are cases to be made for handling text as UTF-16 in memory (the way ICU does), but on file, don't accept anything but UTF-8. It's space-efficient, endianness-independent, and allows for semi-sane handling even by software that is blissfully unaware of Unicode matters (see caveats above).
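A small sketch of the std::string caveats above (C++11/14/17; in C++20 the u8 literal becomes char8_t and needs an explicit conversion):

```cpp
#include <iostream>
#include <string>

int main()
{
    // u8"..." guarantees UTF-8 regardless of the source/execution charset.
    std::string s = u8"Eat, drink, \u611b";   // "Eat, drink, 愛"

    // .length() counts bytes (code units), not characters:
    // 12 ASCII bytes plus 3 bytes for U+611B.
    std::cout << s.length() << '\n';          // prints 15, not 13

    // Slicing at an arbitrary byte offset can split a multi-byte sequence
    // and leave an invalid UTF-8 tail:
    std::string broken = s.substr(0, 13);     // cuts U+611B in half
    return 0;
}
```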
In a Unicode string, I can see that a Unicode character has the form "\uxxxx", where the x's are hex digits. How does a normal C or C++ program interpret this kind of character? (Why is there a 'u' after the '\'? What is its effect?)
That is a unicode character escape sequence. It will be interpreted as a unicode character. The u after the escape character is part of the syntax and it's what differentiates it from other escape sequences. Read the documentation for more information.
So, what's the suitable object to handle unicode characters ?
char for UTF-8
char16_t for UTF-16
char32_t for UTF-32
The size of wchar_t is platform dependent, so you cannot make portable assumptions of which encoding it suits.
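A quick way to check this on your own platform (the values in the comment are typical, not guaranteed):

```cpp
#include <iostream>

int main()
{
    // Typically prints "2 2 4" on Windows and "4 2 4" on Linux;
    // only the width of wchar_t varies between platforms.
    std::cout << sizeof(wchar_t)  << ' '
              << sizeof(char16_t) << ' '
              << sizeof(char32_t) << '\n';
    return 0;
}
```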
we can use const char* to store JSON that could have "wide characters" but those characters take more than one byte to be represented...
If you mean that you can store multi-byte utf-8 characters in a char string, then you're correct.
This is the kind of temporary arrangement I have found for the moment (unicode -> utf8? ascii?; listFolder is a std::string)
What you're attempting to do there is replacing some Unicode characters with characters that have a platform-defined encoding. If you have other Unicode characters besides those, you end up with a string that has mixed encodings. Also, in some cases it may accidentally replace parts of other byte sequences. I recommend using a library to convert encodings or to do any other manipulation on encoded strings.
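Note that a JSON library such as RapidJSON already decodes \uXXXX escapes while parsing, so usually you should not have to do this yourself. Purely as an illustration of what such decoding does (hypothetical helper, BMP-only, no surrogate-pair handling), something like this replaces the whole chain of replace_all calls:

```cpp
#include <cstdint>
#include <string>

// Replace every JSON-style "\uXXXX" escape in 'in' with the UTF-8 encoding
// of that code point. Assumes BMP code points only; no error handling.
std::string unescape_bmp_to_utf8(const std::string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ) {
        if (in.compare(i, 2, "\\u") == 0 && i + 6 <= in.size()) {
            uint32_t cp = std::stoul(in.substr(i + 2, 4), nullptr, 16);
            if (cp < 0x80) {                              // ASCII
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {                      // 2-byte UTF-8
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {                                      // 3-byte UTF-8
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
            i += 6;
        } else {
            out += in[i++];
        }
    }
    return out;
}

// Hypothetical usage: listFolder = unescape_bmp_to_utf8(listFolder);
```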
I need to escape unicode characters within a input string to either UTF-16 or UTF-32 escape sequences. For example, the input string literal "Eat, drink, 愛" should be escaped as "Eat, drink, \u611b". Here are the rules in a table of sorts:
Escape | Unicode code point
'\u' HEX HEX HEX HEX | A Unicode code point in the range U+0 to U+FFFF inclusive, corresponding to the encoded hexadecimal value.
'\U' HEX HEX HEX HEX HEX HEX HEX HEX | A Unicode code point in the range U+0 to U+10FFFF inclusive, corresponding to the encoded hexadecimal value.
It's simple to detect non-ASCII characters in general, since for an ASCII character the second byte of the wide character is 0:
L"a" = 97, 0
, which will not be escaped. With non-ASCII characters the second byte is never 0:
L"愛" = 27, 97
, which is escaped as \u611b. But how do I detect a UTF-32 character in a string, since it has to be escaped differently than UTF-16, with 8 hex digits?
It is not as simple as just checking the size of the string, as UTF-16 characters are multibyte, e.g.:
L"प्रे" = 42, 9, 77, 9, 48, 9, 71, 9
I'm tasked to escape unescaped input string literals like Eat, drink, 愛 and store them to disk in their escaped literal form Eat, drink, \u611b (UTF-16 example). If my program finds a UTF-32 character, it should escape it too, in the form \U8902611b (UTF-32 example), but I can't find a reliable way of knowing whether I'm dealing with UTF-16 or UTF-32 in an input byte array. So, just how can I reliably distinguish UTF-32 from UTF-16 characters within a wchar_t string or byte array?
There are many questions within your question, I will try to answer the most important ones.
Q. I have a C++ string like "Eat, drink, 愛", is it a UTF-8, UTF-16 or UTF-32 string?
A. This is implementation-defined. In many implementations this will be a UTF-8 string, but this is not mandated by the standard. Consult your documentation.
Q. I have a wide C++ string like L"Eat, drink, 愛", is it a UTF-8, UTF-16 or UTF-32 string?
A. This is implementation-defined. In many implementations this will be a UTF-32 string. In some other implementations it will be a UTF-16 string. Neither is mandated by the standard. Consult your documentation.
Q. How can I have portable UTF-8, UTF-16 or UTF-32 C++ string literals?
A. In C++11 there is a way:
u8"I'm a UTF-8 string."
u"I'm a UTF-16 string."
U"I'm a UTF-32 string."
In C++03, no such luck.
Q. Does the string "Eat, drink, 愛" contain at least one UTF-32 character?
A. There are no such things as UTF-32 (and UTF-16 and UTF-8) characters. There are UTF-32 etc. strings. They all contain Unicode characters.
Q. What the heck is a Unicode character?
A. It is an element of a coded character set defined by the Unicode standard. In a C++ program it can be represented in various ways, the most simple and straightforward one is with a single 32-bit integral value corresponding to the character's code point. (I'm ignoring composite characters here and equating "character" and "code point", unless stated otherwise, for simplicity).
Q. Given a Unicode character, how can I escape it?
A. Examine its value. If it's between 256 and 65535, print a \u escape sequence (4 hex digits). If it's greater than 65535, print a \U escape sequence (8 hex digits, per the format above). Otherwise, print it as you normally would.
Q. Given a UTF-32 encoded string, how can I decompose it to characters?
A. Each element of the string (which is called a code unit) corresponds to a single character (code point). Just take them one by one. Nothing special needs to be done.
Q. Given a UTF-16 encoded string, how can I decompose it to characters?
A. Values (code units) outside of the 0xD800 to 0xDFFF range correspond to the Unicode characters with the same value. For each such value, print either a normal character or a \u escape sequence (4 hex digits). Values in the 0xD800 to 0xDFFF range are grouped in pairs, each pair representing a single character (code point) in the U+10000 to U+10FFFF range. For such a pair, print a \U escape sequence (8 hex digits). To convert a pair (v1, v2) to its character value, use this formula:
c = ((v1 - 0xd800) << 10) + (v2 - 0xdc00) + 0x10000
Note the first element of the pair must be in the range of 0xd800..0xdbff and the second one is in 0xdc00..0xdfff, otherwise the pair is ill-formed.
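A rough sketch of the whole procedure (assuming a native-byte-order UTF-16 string; the function name is made up), following the thresholds above:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Walk the UTF-16 code units, combine surrogate pairs, and print
// \uXXXX for BMP code points above 0xFF, \UXXXXXXXX for code points
// above 0xFFFF, and the character itself otherwise.
void escape_utf16(const std::u16string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        uint32_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.size()) {
            const uint32_t low = s[++i];                  // low surrogate
            c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
        }
        if (c > 0xFFFF)
            std::printf("\\U%08x", c);
        else if (c > 0xFF)
            std::printf("\\u%04x", c);
        else
            std::printf("%c", static_cast<char>(c));
    }
}

// escape_utf16(u"Eat, drink, \u611b") prints: Eat, drink, \u611b
```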
Q. Given a UTF-8 encoded string, how can I decompose it to characters?
A. The UTF-8 encoding is a bit more complicated than the UTF-16 one and I will not detail it here. There are many descriptions and sample implementations out there on the 'net, look them up.
Q. What's up with my L"प्रे" string?
A. It is a composite character that is composed of four Unicode code points, U+092A, U+094D, U+0930, U+0947. Note it's not the same as a high code point being represented with a surrogate pair as detailed in the UTF-16 part of the answer. It's a case of "character" being not the same as "code point". Escape each code point separately. At this level of abstraction, you are dealing with code points, not actual characters anyway. Characters come into play when you e.g. display them for the user, or compute their position in a printed text, but not when dealing with string encodings.
I have a Unicode string encoded, say, as UTF8. One string in Unicode can have few byte representations. I wonder, is there any or can be created any canonical (normalized) form of Unicode string -- so we can e.g. compare such strings with memcmp(3) etc. Can e.g. ICU or any other C/C++ library do that?
You might be looking for Unicode normalisation. There are essentially four different normal forms that each ensure that all equivalent strings have a common form afterwards. However, in many instances you need to take locale into account as well, so while this may be a cheap way of doing a byte-to-byte comparison (if you ensure the same Unicode transformation format, like UTF-8 or UTF-16 and the same normal form) it won't gain you much apart from that limited use case.
Comparing Unicode codepoint sequences:
UTF-8 is a canonical representation itself. Two Unicode strings that are composed of the same Unicode codepoints will always be encoded to exactly the same UTF-8 byte sequence and thus can be compared with memcmp. This is a necessary property of the UTF-8 encoding; otherwise it would not be easily decodable. And we can go further: this is true for all official Unicode encoding schemes, UTF-8, UTF-16 and UTF-32. They encode a string to different byte sequences, but each always encodes the same string to the same sequence. If you consider endianness and platform independence, UTF-8 is the recommended encoding scheme because you don't have to deal with byte orders when reading or writing 16-bit or 32-bit values.
So the answer is that if two strings are encoded with the same encoding scheme (e.g. UTF-8) and endianness (not an issue with UTF-8), the resulting byte sequence will be the same.
Comparing Unicode strings:
There's another issue that is more difficult to handle. In Unicode some glyphs (the character you see on the screen or paper) can be represented with a single codepoint or a combination of two consecutive codepoints (called combining characters). This is usually true for glyphs with accents, diacritic marks, etc. Because of the different codepoint representations, their corresponding byte sequences will differ. Comparing strings while taking these combining characters into consideration cannot be performed with a simple byte comparison; first you have to normalize the strings.
The other answers mention some Unicode normalization techniques, canonical forms and libraries that you can use for converting Unicode strings to their normal form. Then you will be able to compare them byte-by-byte with any encoding scheme.
You're looking to normalize the string to one of the Unicode normalization forms. libicu can do this for you, but not on a UTF-8 string. You have to first convert it to UChar, using e.g. ucnv_toUChars, then normalize with unorm_normalize, then convert back using ucnv_fromUChars. I think there's also some specific version of ucnv_* for UTF-8 encoding.
If memcmp is your only goal you can of course do that directly on the UChar array after unorm_normalize.
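A minimal sketch of the same idea using ICU's C++ API instead of the ucnv_*/unorm_* C functions (assuming ICU is available; error handling omitted): normalize both strings to NFC, then compare bytes.

```cpp
#include <unicode/normalizer2.h>   // icu::Normalizer2
#include <unicode/unistr.h>        // icu::UnicodeString
#include <cstring>
#include <string>

// NFC-normalize a UTF-8 string and return it as UTF-8, so that
// canonically equivalent strings become byte-identical.
std::string normalize_nfc(const std::string& utf8)
{
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    icu::UnicodeString u = icu::UnicodeString::fromUTF8(utf8);
    icu::UnicodeString normalized = nfc->normalize(u, status);

    std::string out;
    normalized.toUTF8String(out);
    return out;
}

// After normalization, a plain byte comparison is meaningful:
//   std::string a = normalize_nfc(u8"\u00e9");     // é, precomposed
//   std::string b = normalize_nfc(u8"e\u0301");    // e + combining acute
//   bool equal = a.size() == b.size()
//             && std::memcmp(a.data(), b.data(), a.size()) == 0;   // true
```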
I am using a Twitter API library to post a status to Twitter. Twitter requires that the post be UTF-8 encoded. The library contains a function that URL encodes a standard string, which works perfectly for all special characters such as !##$%^&*() but is the incorrect encoding for accented characters (and other UTF-8).
For example, 'é' gets converted to '%E9' rather than '%C3%A9' (it pretty much only converts the character to a hexadecimal value). Is there a built-in function that could take something like 'é' and return something like '%C3%A9'?
edit: I am fairly new to UTF-8 in case what I am requesting makes no sense.
edit: if I have a
string foo = "bar é";
I would like to convert it to
"bar %C3%A9"
Thanks
If you have a wide character string, you can encode it in UTF8 with the standard wcstombs() function. If you have it in some other encoding (e.g. Latin-1) you will have to decode it to a wide string first.
Edit: ... but wcstombs() depends on your locale settings, and it looks like you can't select a UTF8 locale on Windows. (You don't say what OS you're using.) WideCharToMultiByte() might be more useful on Windows, as you can specify the encoding in the call.
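To make that concrete, a minimal Linux-flavoured sketch (assuming a UTF-8 locale such as "en_US.UTF-8" is installed; as noted, this approach does not work on Windows, where WideCharToMultiByte with CP_UTF8 is the usual route):

```cpp
#include <clocale>
#include <cstdlib>
#include <iostream>
#include <string>

int main()
{
    // Select a UTF-8 locale so wcstombs() produces UTF-8 output.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    const wchar_t* wide = L"bar \u00e9";          // "bar é" as a wide string

    char buf[64];
    std::size_t n = std::wcstombs(buf, wide, sizeof buf);
    if (n != static_cast<std::size_t>(-1)) {
        std::string utf8(buf, n);
        std::cout << utf8.size() << '\n';         // 6: the é became 2 bytes
    }
    return 0;
}
```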
To understand what needs to be done, you have to first understand a bit of background. Different encodings use different values for the "same" character. Latin-1, for example, says "é" is a single byte with value E9 (hex), while UTF-8 says "é" is the two byte sequence C3 A9, and yet UTF-16 says that same character is the single double-byte value 00E9 – a single 16-bit value rather than two 8-bit values as in UTF-8. (Unicode, which isn't an encoding, actually uses the same codepoint value, U+E9, as Latin-1.)
To convert from one encoding to another, you must first take the encoded value, decode it to a value independent of the source encoding (i.e. Unicode codepoint), then re-encode it in the target encoding. If the target encoding doesn't support all of the source encoding's codepoints, then you'll either need to translate or otherwise handle this condition.
This re-encoding step requires knowing both the source and target encodings.
Your API function is not converting encodings; it appears to be URL-escaping an arbitrary byte string. The authors of the function apparently assume you will have already converted to UTF-8.
In order to convert to UTF-8, you must know what encoding your system is using and be able to map to Unicode codepoints. From there, the UTF-8 encoding is trivial.
Depending on your system, this may be as easy as converting the "native" character set (which has "é" as E9 for you, so probably Windows-1252, Latin-1, or something very similar) to wide characters (which is probably UTF-16 or UCS-2 if sizeof(wchar_t) is 2, or UTF-32 if sizeof(wchar_t) is 4) and then to UTF-8. Wcstombs, as Martin answers, may be able to handle the second part of this conversion, but this is system-dependent. However, I believe Latin-1 is a subset of Unicode, so conversion from this source encoding can skip the wide character step. Windows-1252 is close to Latin-1, but replaces some control characters with printable characters.
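Once the text is UTF-8, the URL escaping itself is just per-byte percent-encoding. A minimal sketch (hypothetical helper, not the Twitter library's function); note that, unlike the example in the question, it also encodes the space as %20, per RFC 3986:

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Percent-encode a string that already holds UTF-8 bytes. Every byte
// outside the RFC 3986 "unreserved" set is written as %XX, so "bar é"
// (with é as the UTF-8 bytes C3 A9) becomes "bar%20%C3%A9".
std::string url_encode_utf8(const std::string& utf8)
{
    std::string out;
    for (unsigned char byte : utf8) {
        if (std::isalnum(byte) || byte == '-' || byte == '.' ||
            byte == '_' || byte == '~') {
            out += static_cast<char>(byte);
        } else {
            char hex[4];
            std::snprintf(hex, sizeof hex, "%02X", static_cast<unsigned>(byte));
            out += '%';
            out += hex;
        }
    }
    return out;
}
```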