How to handle unicode values in JSON strings? - c++

I'm writing a JSON parser in C++ and am facing a problem when parsing JSON strings:
The JSON specification states that JSON strings can contain unicode characters in the form of:
"here comes a unicode character: \u05d9 !"
My JSON parser tries to map JSON strings to std::string, so usually one character of the JSON string becomes one character of the std::string. However, for those Unicode characters, I really don't know what to do:
Should I just put the raw byte values in my std::string like so:
std::string mystr;
mystr.push_back('\x05');
mystr.push_back('\xd9');
Or should I interpret the two characters with a library like iconv and store the UTF-8 encoded result in my string instead?
Should I use a std::wstring to store all the characters? What then on *NIX OSes, where wchar_t is 4 bytes long?
I sense something is wrong in my solutions but I fail to understand what. What should I do in that situation ?

After some digging, and thanks to H2CO3's comments and Philipp's comments, I finally understood how this is supposed to work:
Reading RFC 4627, Section 3. Encoding:
Encoding
JSON text SHALL be encoded in Unicode. The default encoding is
UTF-8.
Since the first two characters of a JSON text will always be ASCII
characters [RFC0020], it is possible to determine whether an octet
stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
at the pattern of nulls in the first four octets.
00 00 00 xx UTF-32BE
00 xx 00 xx UTF-16BE
xx 00 00 00 UTF-32LE
xx 00 xx 00 UTF-16LE
xx xx xx xx UTF-8
So it appears a JSON octet stream can be encoded in UTF-8, UTF-16, or UTF-32 (in both their BE or LE variants, for the last two).
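As an aside, a detection routine for that table could be sketched along these lines (detect_json_encoding is a made-up name; a real parser would also need to handle BOMs and streams shorter than four octets):
#include <cstddef>
#include <string>

// Guess the encoding of a JSON octet stream from the null-byte pattern
// of its first four octets, per RFC 4627 section 3 (a sketch only).
std::string detect_json_encoding(const unsigned char* p, std::size_t n)
{
    if (n < 4)                            return "UTF-8";   // too short to tell
    if (!p[0] && !p[1] && !p[2] &&  p[3]) return "UTF-32BE";
    if (!p[0] &&  p[1] && !p[2] &&  p[3]) return "UTF-16BE";
    if ( p[0] && !p[1] && !p[2] && !p[3]) return "UTF-32LE";
    if ( p[0] && !p[1] &&  p[2] && !p[3]) return "UTF-16LE";
    return "UTF-8";
}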
Once that is clear, Section 2.5. Strings explains how to handle those \uXXXX values in JSON strings:
Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. The hexadecimal letters A though
F can be upper or lowercase. So, for example, a string containing
only a single reverse solidus character may be represented as
"\u005C".
The RFC then gives a more complete explanation for characters not in the Basic Multilingual Plane:
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
Hope this helps.

If I were you, I would use std::string to store UTF-8 and UTF-8 only.
If incoming JSON text does not contain any \uXXXX sequences, std::string can be used as is, byte to byte, without any conversion.
When you parse \uXXXX, you can simply decode it and convert it to UTF-8, effectively treating it as if it were a true UTF-8 character in its place - this is what most JSON parsers are doing anyway (libjson for sure).
Granted, with this approach reading JSON with \uXXXX and immediately dumping it back using your library is likely to lose the \uXXXX sequences and replace them with their true UTF-8 representations, but who really cares? Ultimately, the net result is still exactly the same.
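For reference, converting a decoded code point to UTF-8 and appending it to a std::string could be sketched like this (append_utf8 is a hypothetical helper; there is no error handling for surrogate or out-of-range code points):
#include <cstdint>
#include <string>

// Append the UTF-8 encoding of a Unicode code point to out.
// Covers U+0000..U+10FFFF; invalid code points are not rejected here.
void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += char(cp);
    } else if (cp < 0x800) {
        out += char(0xC0 | (cp >> 6));
        out += char(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += char(0xE0 | (cp >> 12));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    } else {
        out += char(0xF0 | (cp >> 18));
        out += char(0x80 | ((cp >> 12) & 0x3F));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    }
}
// append_utf8(s, 0x05D9) appends the bytes 0xD7 0x99, the UTF-8 form of \u05d9.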

Related

Loop through Unicode string as character

With the following string, the size is incorrectly output. Why is this, and how can I fix it?
string str = " ██████";
cout << str.size();
// outputs 19 rather than 7
I'm trying to loop through str character by character so I can read it into into a vector<string> which should have a size of 7, but I can't do this since the above code outputs 19.
TL;DR
The size() and length() members of basic_string return the size in code units of the underlying string, not the number of visible characters. To get the expected number:
Use UTF-16 with u prefix for very simple strings that contain no non-BMP, no combining characters and no joining characters
Use UTF-32 with U prefix for very simple strings that don't contain any combining or joining characters
Normalize the string and count for arbitrary Unicode strings
" ██████" is a space followed by a series of 6 U+2588 characters. Your compiler seems to be using UTF-8 for std::string. UTF-8 is a variable-length encoding and many letters are encoded using multiple bytes (because obviously you can't encode more than 256 characters with just one byte). In UTF-8 code points between U+0800 and U+FFFF are encoded by 3 bytes. Therefore the length of the the string in UTF-8 is 1 + 6*3 = 19 bytes.
You can check with any Unicode converter like this one and see that the string is encoded as 20 E2 96 88 E2 96 88 E2 96 88 E2 96 88 E2 96 88 E2 96 88 in UTF-8, and you can also loop through each byte of your string to check.
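If what you actually want is the number of code points rather than bytes, one way is to skip the UTF-8 continuation bytes (those of the form 10xxxxxx) while counting. A sketch, assuming the string is valid UTF-8 and the source file is saved as UTF-8 like in the question:
#include <cstddef>
#include <iostream>
#include <string>

// Count UTF-8 code points by ignoring continuation bytes (10xxxxxx).
// Combining characters still count separately - see below.
std::size_t count_code_points(const std::string& str)
{
    std::size_t count = 0;
    for (unsigned char c : str)
        if ((c & 0xC0) != 0x80)   // not a continuation byte
            ++count;
    return count;
}

int main()
{
    std::string str = " ██████";                  // assumes UTF-8 source encoding
    std::cout << str.size() << '\n';              // 19 bytes
    std::cout << count_code_points(str) << '\n';  // 7 code points
}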
If you want the total number of visible characters in the string then it's a lot trickier, and churill's solution doesn't work. Read the example from Twitter:
If you use anything beyond the most basic letters, numbers, and punctuation the situation gets more confusing. While many people use multi-byte Kanji characters to exemplify these issues, Twitter has found that accented vowels cause the most confusion because English speakers simply expect them to work. Take the following example: the word “café”. It turns out there are two byte sequences that look exactly the same, but use a different number of bytes:
café 0x63 0x61 0x66 0xC3 0xA9 Using the “é” character, called the “composed character”.
café 0x63 0x61 0x66 0x65 0xCC 0x81 Using the combining diacritical, which overlaps the “e”
You need a Unicode library like ICU to normalize the string and count. Twitter, for example, uses Normalization Form C (NFC).
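With ICU, that could be sketched roughly as follows (it counts code points after NFC normalization, which matches Twitter's rule but is still not the same as counting grapheme clusters):
#include <iostream>
#include <string>
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// Count code points after NFC normalization, the way Twitter counts characters.
int32_t nfc_length(const std::string& utf8)
{
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status))
        return -1;
    icu::UnicodeString normalized =
        nfc->normalize(icu::UnicodeString::fromUTF8(utf8), status);
    if (U_FAILURE(status))
        return -1;
    return normalized.countChar32();   // code points, not UTF-16 code units
}

int main()
{
    // "cafe" + U+0301 combining acute (6 bytes) normalizes to "café" -> 4
    std::cout << nfc_length("cafe\xCC\x81") << '\n';
}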
Edit:
Since you're only interested in box-drawing characters, which don't seem to lie outside the BMP and don't contain any combining characters, UTF-16 and UTF-32 will work. Like std::string, std::wstring is also a basic_string and doesn't have a mandatory encoding. In most implementations it's either UTF-16 (Windows) or UTF-32 (*nix), so you may use it, but it's unreliable and depends on the source code encoding. The better way is to use std::u16string (std::basic_string<char16_t>) and std::u32string (std::basic_string<char32_t>). They'll work regardless of the system and the encoding of the source file:
std::wstring wstr = L" ██████";
std::u16string u16str = u" ██████";
std::u32string u32str = U" ██████";
std::cout << wstr.size(); // may work, returns the number of wchar_t characters
std::cout << u16str.size(); // always returns the number of UTF-16 code units
std::cout << u32str.size(); // always returns the number of UTF-32 code units
In case you're interested in how to work that out for all Unicode characters, continue reading below.
The “café” issue mentioned above raises the question of how you count the characters in the Tweet string “café”. To the human eye the length is clearly four characters. Depending on how the data is represented this could be either five or six UTF-8 bytes. Twitter does not want to penalize a user for the fact we use UTF-8 or for the fact that the API client in question used the longer representation. Therefore, Twitter does count “café” as four characters, no matter which representation is sent.
[...]
Twitter counts the length of a Tweet using the Normalization Form C (NFC) version of the text. This type of normalization favors the use of a fully combined character (0xC3 0xA9 from the café example) over the long-form version (0x65 0xCC 0x81). Twitter also counts the number of codepoints in the text rather than UTF-8 bytes. The 0xC3 0xA9 from the café example is one codepoint (U+00E9) that is encoded as two bytes in UTF-8, whereas 0x65 0xCC 0x81 is two codepoints encoded as three bytes
Twitter - Counting characters
See also
When "Zoë" !== "Zoë". Or why you need to normalize Unicode strings
Getting Twitter characters count
Why is the length of this string longer than the number of characters in it?
std::string only contains 1-byte chars (usually 8 bits, each holding one UTF-8 code unit); you need wchar_t and std::wstring to achieve what you want:
std::wstring str = L" ██████";
std::cout << str.size();
This prints 7 (one space and 6 Unicode characters). Notice the L before the string literal, so it will be interpreted as a wide string.

How to find whether byte read is japanese or english?

I have an array which contains Japanese and ASCII characters.
I am trying to find whether a character read is an English character or a Japanese character.
In order to solve this, I proceeded as follows:
Read the first byte; if the multibyte character width is not equal to one, move the pointer to the next byte.
Then display the whole two bytes together and report that a Japanese character has been read.
If the multibyte character width is equal to one, display the byte and report that an English character has been read.
The above algorithm works fine, but it fails for the half-width forms of Japanese, e.g. シ, ァ, etc., as they are only one byte.
How can I find out whether the characters are Japanese or English?
Note: What I tried
I read on the web that the first byte will tell whether it is Japanese or not, which I have covered in step 1 of my algorithm. But it won't work for half-width characters.
EDIT:
The problem I was solving: I include the control character 0x80 at the start and end of my characters to identify the string of characters.
I wrote the following to identify the end control character.
cntlchar.....(my characters, can be Japanese).....cntlchar
if ((buf[*p+1] & 0X80) && (mbMBCS_charWidth(&buf[*p]) == 1))
// end of control characters reached
else
// *p++
It worked fine for English but didn't work for Japanese half-width characters.
How can I handle this?
Your data must be using Windows Codepage 932. That is a guess, but examining the codepoints shows what you are describing.
The codepage shows that characters in the range 00 to 7F are "English" (a better description is "7-bit ASCII"), the characters in the ranges 81 to 9F and E0 to FF are the first byte of a multibyte code, and everything between A1 and DF are half-width Kana characters.
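A first-pass classifier over those ranges could look like the sketch below (it only inspects lead bytes, assumes the data really is codepage 932, and leaves skipping the trail byte of a double-byte character to the caller):
// Rough classification of a codepage 932 byte, following the ranges above.
enum class Cp932Lead { Ascii, HalfWidthKana, DoubleByteLead, Invalid };

Cp932Lead classify_lead(unsigned char b)
{
    if (b <= 0x7F)                             return Cp932Lead::Ascii;
    if (b >= 0xA1 && b <= 0xDF)                return Cp932Lead::HalfWidthKana;
    if ((b >= 0x81 && b <= 0x9F) || b >= 0xE0) return Cp932Lead::DoubleByteLead;
    return Cp932Lead::Invalid;   // 0x80, 0xA0
}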
For individual bytes this is impractical to impossible. For larger sets of data you could do statistical analysis on the bytes and see if it matches known English or Japanese patterns. For example, vowels are very common in English text but different Japanese letters would have similar frequency patterns.
Things get more complicated than testing bits if your data includes accented characters.
If you're dealing with Shift-JIS data and Windows-1252 encoded text, ideally you just remap it to UTF-8. There's no standard way to identify text encoding within a text file, although things like MIME can help if added on externally as metadata.

Qt QString from string - Strange letters

Whenever I try to convert a std::string into a QString with this letter in it ('ß'), the QString will turn into something like "Ã" or some other really strange letters. What's wrong? I used this code and it didn't cause any errors or warnings!
std::string content = "Heißes Teil.";
ui->txtFind_lang->setText(QString::fromStdString(content));
The std::string has no problem with this character. I even wrote it into a text file without problems. So what am I doing wrong?
You need to set the codec to UTF-8 :
QTextCodec::setCodecForTr(QTextCodec::codecForName("UTF-8"));
QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
By default, Qt uses the Latin-1 encoding, which is limited. By adding this code, you set the default encoding to UTF-8, which allows you to use many more characters.
Though antoyo's answer works, I wasn't too sure why. So, I decided to investigate.
All of my documents are encoded in UTF-8, as are most web pages. The ß character has the Unicode code point U+00DF.
Since UTF-8 is a variable-length encoding, in binary form ß is encoded as 11000011 10011111, or C3 9F. Because Qt relies on the Latin-1 encoding by default, it reads ß as two different characters. The first one, C3, maps to Ã, and the second one, 9F, does not map to anything, as Latin-1 does not recognize bytes between 128 and 159 (in decimal).
That's why ß appears as à when using Latin1 encoding.
Side note: You might want to brush up on how UTF-8 encoding works, because otherwise it seems a little unintuitive that ß takes two bytes even though its code point DF is less than FF and should consume just one byte.
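To make the point concrete, here is a small sketch that spells out the two bytes and uses QString::fromUtf8 to tell Qt explicitly that the bytes are UTF-8, which avoids touching the global codec settings (the surrounding code is illustrative only):
#include <QString>
#include <string>

int main()
{
    // "ß" is U+00DF; in a UTF-8 source file the literal below stores it as
    // the two bytes C3 9F (written explicitly here to be independent of the
    // source encoding).
    std::string content = "Hei\xC3\x9F" "es Teil.";

    // Decode the bytes explicitly as UTF-8 instead of relying on the default
    // C-string codec (Latin-1 in Qt 4), so C3 9F becomes 'ß' again.
    QString text = QString::fromUtf8(content.c_str());

    return 0;
}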

UTF 8 encoded Japanese string in XML

I am trying to create a SOAP call with a Japanese string. The problem I face is that when I encode this string to a UTF-8 encoded string, it has many control characters in it (e.g. 0x1B (Esc)). If I remove all such control characters to make it a valid SOAP call, then the Japanese content appears as garbage on the server side.
How can I create a valid SOAP request for Japanese characters? Any suggestion is highly appreciated.
I am using C++ with MS-DOM.
If I remember correctly, it's true that the first 32 Unicode code points are not allowed as characters in XML documents, even escaped with &#. Not sure whether they're allowed in HTML or not, but certainly the server thinks they're not allowed in your requests, and it gets the only meaningful vote.
I notice that your document claims to be encoded in iso-2022-jp, not utf-8. And indeed, the sequence of characters ESC $ B that appears in your document is valid iso-2022-jp. It indicates that the data is switching encodings (from ASCII to a 2-byte Japanese encoding called JIS X 0208-1983).
But somewhere in the process of constructing your request, something has seen that 0x1B byte and interpreted it as a character U+001B, not realising that it's intended as one byte in data that's already encoded in the document encoding. So, it has XML-escaped it as a "best effort", even though that's not valid XML.
Probably, whatever is serializing your XML document doesn't know that the encoding is supposed to be iso-2022-jp. I imagine it thinks it's supposed to be serializing the document as ASCII, ISO-Latin-1, or UTF-8, and the <meta> element means nothing to it (that's an HTML way of specifying the encoding anyway, it has no particular significance in XML). But I don't know MS-DOM, so I don't know how to correct that.
If you just remove the ESC characters from iso-2022-jp data, then you conceal the fact that the data has switched encodings, and so the decoder will continue to interpret all that 7nMK stuff as ASCII, when it's supposed to be interpreted as JIS X 0208-1983. Hence, garbage.
Something else strange -- the iso-2022-jp code to switch back to ASCII is ESC ( B, but I see |(B</font> in your data, when I'd expect the same thing to happen to the second ESC character as happened to the first: &#0x1B(B</font>. Similarly, $B#M#S(B and $BL#D+(B are mangled attempts to switch from ASCII to JIS X 0208-1983 and back, and again the ESC characters have just disappeared rather than being escaped.
I have no explanation for why some ESC characters have disappeared and one has been escaped, but it cannot be coincidence that what you generate looks almost, but not quite, like valid iso-2022-jp. I think iso-2022-jp is a 7 bit encoding, so part of the problem might be that you've taken iso-2022-jp data, and run it through a function that converts ISO-Latin-1 (or some other 8 bit encoding for which the lower half matches ASCII, for example any Windows code page) to UTF-8. If so, then this function leaves 7 bit data unchanged, it won't convert it to UTF-8. Then when interpreted as UTF-8, the data has ESC characters in it.
If you want to send the data as UTF-8, then first of all you need to actually convert it out of iso-2022-jp (to wide characters or to UTF-8, whichever your SOAP or XML library expects). Secondly you need to label it as UTF-8, not as iso-2022-jp. Finally you need to serialize the whole document as UTF-8, although as I've said you might already be doing that.
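Since you're on Windows (MS-DOM), one way to do that re-encoding is with the Win32 conversion functions; 50220 is the Windows code page identifier for iso-2022-jp. A sketch without full error handling (Iso2022JpToUtf8 is a made-up name):
#include <windows.h>
#include <string>
#include <vector>

// Convert an iso-2022-jp byte string to UTF-8 via UTF-16.
std::string Iso2022JpToUtf8(const std::string& src)
{
    if (src.empty()) return std::string();

    // iso-2022-jp (code page 50220) -> UTF-16
    int wlen = MultiByteToWideChar(50220, 0, src.c_str(), (int)src.size(), NULL, 0);
    std::vector<wchar_t> wide(wlen);
    MultiByteToWideChar(50220, 0, src.c_str(), (int)src.size(), wide.data(), wlen);

    // UTF-16 -> UTF-8
    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen, NULL, 0, NULL, NULL);
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen, &utf8[0], ulen, NULL, NULL);
    return utf8;
}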
As pointed out by Steve Jessop, it looks like you have encoded the text as iso-2022-jp, not UTF-8. So the first thing to do is to check that and ensure that you have proper UTF-8.
If the problem still persists, consider encoding the text.
The simplest option is "hex encoding" where you just write the hex value of each byte as ASCII digits. e.g. the 0x1B byte becomes "1B", i.e. 0x31, 0x42.
If you want to be fancy you could use MIME or even UUENCODE.
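A sketch of the simple hex-encoding option described above, where each byte becomes two ASCII hex digits (HexEncode is a made-up name):
#include <string>

// Hex-encode arbitrary bytes as ASCII digits, e.g. "\x1B" -> "1B" (0x31, 0x42).
std::string HexEncode(const std::string& data)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(data.size() * 2);
    for (unsigned char c : data) {
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    return out;
}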

Unicode Woes! Ms-Access 97 migration to Ms-Access 2007

The problem is categorized in two steps:
Problem Step 1. Access 97 db containing XML strings that are encoded in UTF-8.
The problem boils down to this: the Access 97 db contains XML strings that are encoded in UTF-8. So I created a patch tool to separately convert the XML strings from UTF-8 to Unicode. In order to convert a UTF-8 string to Unicode, I have used the function
MultiByteToWideChar(CP_UTF8, 0, PChar(OriginalName), -1, @newName, Size); (where newName is an array declared as "newName : Array[0..2048] of WideChar;").
This function works well in most cases; I have checked it with Spanish and Arabic characters, but when I work with Greek and Chinese characters it chokes.
For some Greek characters like "Ευγ. ΚαÏαβιά" (as stored in Access 97), the resulting new string contains null characters in between, and when it is stored to a wide string the characters get clipped.
For some Chinese characters like "?¢»?µ?" (as stored in Access 97), the result is totally absurd, like "?¢»?µ?".
Problem Step 2. Access 97 db text strings: the application GUI takes Unicode input, which is saved in Access 97.
First I checked with Arabic and Spanish characters, and it seems that no explicit character encoding is required. But again the problem comes with Greek and Chinese characters.
I tried the same function mentioned above for the text conversion (is that correct?), and the result was again disappointing. The Spanish characters, which are OK without conversion, get their Unicode characters either lost or converted to regular ASCII alphabets.
The Greek and Chinese characters show similar behaviour to that mentioned in step 1.
Please guide me. Am I taking the right approach? Is there some other way around this?
Right now I am confused and full of questions :)
There is no special requirement for working with Greek characters. The real problem is that the characters were stored in an encoding that Access doesn't recognize in the first place. When the application stored the UTF-8 values in the database, it tried to convert every single byte to the equivalent byte in the database's codepage. Every character that had no correspondence in that encoding was replaced with ?. That may mean that the Greek text is OK, while the Chinese text may be gone.
In order to convert the data to something readable you have to know the codepage they are stored in. Using this you can get the actual bytes and then convert them to Unicode.
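To make that concrete, here is a sketch in C++ (the question's code looks like Delphi, but the Win32 call is the same): reinterpret the stored bytes using the code page you determined, e.g. 1253 for Greek, and only then convert onward to whatever Unicode form you need. The function name and the code page value are illustrative:
#include <windows.h>
#include <string>

// Reinterpret bytes read from the Access 97 database using the code page they
// were actually stored in (codePage is the part you have to determine yourself,
// e.g. 1253 for Greek).
std::wstring BytesToWide(const std::string& raw, UINT codePage)
{
    if (raw.empty()) return std::wstring();
    int len = MultiByteToWideChar(codePage, 0, raw.c_str(), (int)raw.size(), NULL, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(codePage, 0, raw.c_str(), (int)raw.size(), &wide[0], len);
    return wide;
}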