Should I use UTF-8 to send data over the network? - c++

WinAPI uses UTF-16LE encoding, so if I called some WinAPI function that returns a string, it will return it as UTF-16LE encoded.
So I'm thinking of using UTF-16LE for the strings in my program, converting to UTF-8 when it's time to send data over the network, and converting back to UTF-16LE on the other side, so that there is less data to send.
Is there a reason why I shouldn't do that?

With UTF-8 encoding, you'll use:
1 byte for ASCII characters (U+0000 to U+007F)
2 bytes for code points between U+0080 and U+07FF
more bytes if necessary (3 bytes up to U+FFFF, 4 bytes beyond)
So, if your text is in a Western language, in most cases it will probably be shorter in UTF-8 than in UTF-16LE: the Western alphabets are encoded between U+0000 and U+05FF.
Conversely, if your text is Asian, then UTF-8 encoding might inflate your data significantly: the Asian character sets lie beyond U+07FF and hence require at least 3 bytes per character.
In the UTF-8 Everywhere article you can find some (basic) statistics about the length of encoded text, as well as other arguments supporting the use of UTF-8.
One that comes to mind for networking is that the UTF-8 representation is the same on all platforms, whereas for UTF-16 you have LE and BE variants, depending on the OS and CPU architecture.
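If you go the UTF-16-inside / UTF-8-on-the-wire route on Windows, the conversion in both directions is a pair of WinAPI calls. A minimal sketch (the helper names are mine, error handling is reduced to returning an empty string):

#include <windows.h>
#include <string>

// UTF-16 (as used by WinAPI) -> UTF-8 for sending over the network.
std::string Utf16ToUtf8(const std::wstring& in)
{
    if (in.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0, in.c_str(), (int)in.size(),
                                  nullptr, 0, nullptr, nullptr);
    if (len <= 0) return std::string();
    std::string out(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, in.c_str(), (int)in.size(),
                        &out[0], len, nullptr, nullptr);
    return out;
}

// UTF-8 received from the network -> UTF-16 for WinAPI.
std::wstring Utf8ToUtf16(const std::string& in)
{
    if (in.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, in.c_str(), (int)in.size(),
                                  nullptr, 0);
    if (len <= 0) return std::wstring();
    std::wstring out(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, in.c_str(), (int)in.size(), &out[0], len);
    return out;
}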

Related

Text encoding of Protocol Buffers string fields

If a C++ program receives a Protocol Buffers message that has a Protocol Buffers string field, which is represented by a std::string, what is the encoding of text in that field? Is it UTF-8?
Protobuf strings are always valid UTF-8 strings.
See the Language Guide:
A string must always contain UTF-8 encoded or 7-bit ASCII text.
(And ASCII is always also valid UTF-8.)
Not all protobuf implementations enforce this, but if I recall correctly, at least the Python library refuses to decode strings that are not valid UTF-8.
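In C++, that means the std::string you pass to a generated setter should already contain UTF-8 (or plain ASCII). A minimal sketch, assuming a hypothetical message Person with a string field name:

// Hypothetical schema: message Person { string name = 1; }
#include "person.pb.h"   // hypothetical generated header
#include <string>

void FillName(Person& person, const std::string& nameUtf8)
{
    // The 'string' field must hold UTF-8 (or 7-bit ASCII) text.
    person.set_name(nameUtf8);
    // For arbitrary binary data, use a 'bytes' field instead of 'string'.
}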

AES encrypted password in UTF-8

My application receives a UTF-16 string as the password, which should be saved, after encryption, in the database with UTF-8 encoding. I'm taking the following steps:
Take input password in wstring (UTF-16)
Reinterpret this password using reinterpret_cast to unsigned char *.
Use step 2 password and encrypt it using AES_cbc_encrypt, which returns unsigned char *
Convert step 3 output to wstring (UTF-16)
Convert the wstring to UTF-8 using Poco's UnicodeConverter class. Save this UTF-8 string in the database
Is this the correct way of saving an AES-encrypted password? Please suggest if there is a better way
Depending on your requirements you might want to consider first encoding the string to UTF-8 and then encrypting it.
The advantage of this approach is that the encrypted value stored in the DB is based on a byte sequence that is independent of endianness.
With UTF-16 you usually need to deal with endianness when you have clients on different systems implemented in different programming languages.
I think you'd be much better off converting the encrypted password to hex digits or to base-64 encoding. That way you're guaranteed to have no weird or illegal UTF-16 symbols, nor will you have \n, \r or \t in your UTF-8. The converted text will be somewhat larger - hopefully that's not a big deal.
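For what it's worth, here's a minimal sketch of that pipeline using OpenSSL's legacy AES_cbc_encrypt plus hex encoding of the ciphertext (link with -lcrypto). The key and IV are hard-coded placeholders for illustration only; real code should derive the key properly (e.g. PBKDF2), use a random IV, and preferably use the EVP_* interface with standard padding:

#include <openssl/aes.h>
#include <cstdio>
#include <string>
#include <vector>

// Hex-encode arbitrary bytes so the result is plain ASCII (and therefore valid UTF-8).
std::string ToHex(const unsigned char* data, size_t len)
{
    std::string out;
    char buf[3];
    for (size_t i = 0; i < len; ++i) {
        std::snprintf(buf, sizeof(buf), "%02x", data[i]);
        out += buf;
    }
    return out;
}

std::string EncryptPasswordUtf8(const std::string& passwordUtf8)
{
    unsigned char key[16] = { /* placeholder key - derive it properly */ };
    unsigned char iv[16]  = { /* placeholder IV - use a random one */ };

    AES_KEY aesKey;
    AES_set_encrypt_key(key, 128, &aesKey);

    // AES_cbc_encrypt writes whole 16-byte blocks, so round the output buffer up.
    size_t outLen = ((passwordUtf8.size() + AES_BLOCK_SIZE - 1) / AES_BLOCK_SIZE)
                    * AES_BLOCK_SIZE;
    std::vector<unsigned char> cipher(outLen);
    AES_cbc_encrypt(reinterpret_cast<const unsigned char*>(passwordUtf8.data()),
                    cipher.data(), passwordUtf8.size(), &aesKey, iv, AES_ENCRYPT);

    return ToHex(cipher.data(), cipher.size());   // safe to store as UTF-8 text
}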

Loading a UTF-16 image into memory

I am trying to load an ID3 image tag that has been saved in UTF-16 JFIF format. The library I am using (Juce) fails to parse the image, as it assumes that the data is in a raw binary format.
The majority of image tags I've parsed successfully report the encoding as ISO-8859 (latin-1), but because latin-1 is a subset of UTF-16 a conversion wouldn't work.
How can I get this UTF16 encoded binary block in the raw format that I want? And could anybody enlighten me as to the benefits of storing an image in UTF16 format?!
latin1 is not a subset of UTF-16!
I think you've mixed up text encoding and binary data. UTF-16 is a character encoding whose base unit is a 16-bit integer (UTF-8 uses 8-bit units).
A JPEG picture (JFIF) is binary data, and its bytes should never be run through a character-encoding conversion.
If that actually happened, you may be out of luck, since reversing a character conversion applied to a binary stream depends on whatever "source" charset was assumed at the time.
You can probably try to convert that (UTF-16) binary data back to binary by guessing the initial source charset, using iconv.
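If the bytes were mistakenly run through a latin-1-to-UTF-16 conversion, you may be able to undo it by converting in the other direction. A rough sketch with iconv, assuming the original "charset" was ISO-8859-1 (that part is a guess, as said above):

#include <iconv.h>
#include <vector>

// Attempt to undo an accidental ISO-8859-1 -> UTF-16LE conversion of binary data.
// Returns an empty vector on failure.
std::vector<char> Utf16BackToBytes(const char* data, size_t len)
{
    iconv_t cd = iconv_open("ISO-8859-1", "UTF-16LE");   // to, from
    if (cd == (iconv_t)-1) return {};

    std::vector<char> out(len);          // output can't be larger than the input here
    char* inPtr = const_cast<char*>(data);
    size_t inLeft = len;
    char* outPtr = out.data();
    size_t outLeft = out.size();

    size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == (size_t)-1) return {};

    out.resize(out.size() - outLeft);    // shrink to the bytes actually produced
    return out;
}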

UTF 8 encoded Japanese string in XML

I am trying to create a SOAP call containing a Japanese string. The problem I'm facing is that when I encode this string as UTF-8, it contains many control characters (e.g. 0x1B, Esc). If I remove all such control characters to make it a valid SOAP call, then the Japanese content appears as garbage on the server side.
How can I create a valid SOAP request for Japanese characters? Any suggestion is highly appreciated.
I am using C++ with MS-DOM.
With Best Regards.
If I remember correctly, it's true: most of the first 32 Unicode code points (everything except tab, LF and CR) are not allowed as characters in XML documents, even escaped with &#. Not sure whether they're allowed in HTML or not, but certainly the server thinks they're not allowed in your requests, and it gets the only meaningful vote.
I notice that your document claims to be encoded in iso-2022-jp, not utf-8. And indeed, the sequence of characters ESC $ B that appears in your document is valid iso-2022-jp. It indicates that the data is switching encodings (from ASCII to a 2-byte Japanese encoding called JIS X 0208-1983).
But somewhere in the process of constructing your request, something has seen that 0x1B byte and interpreted it as a character U+001B, not realising that it's intended as one byte in data that's already encoded in the document encoding. So, it has XML-escaped it as a "best effort", even though that's not valid XML.
Probably, whatever is serializing your XML document doesn't know that the encoding is supposed to be iso-2022-jp. I imagine it thinks it's supposed to be serializing the document as ASCII, ISO-Latin-1, or UTF-8, and the <meta> element means nothing to it (that's an HTML way of specifying the encoding anyway, it has no particular significance in XML). But I don't know MS-DOM, so I don't know how to correct that.
If you just remove the ESC characters from iso-2022-jp data, then you conceal the fact that the data has switched encodings, and so the decoder will continue to interpret all that 7nMK stuff as ASCII, when it's supposed to be interpreted as JIS X 0208-1983. Hence, garbage.
Something else strange -- the iso-2022-jp code to switch back to ASCII is ESC ( B, but I see |(B</font> in your data, when I'd expect the same thing to happen to the second ESC character as happened to the first: &#0x1B(B</font>. Similarly, $B#M#S(B and $BL#D+(B are mangled attempts to switch from ASCII to JIS X 0208-1983 and back, and again the ESC characters have just disappeared rather than being escaped.
I have no explanation for why some ESC characters have disappeared and one has been escaped, but it cannot be coincidence that what you generate looks almost, but not quite, like valid iso-2022-jp. I think iso-2022-jp is a 7 bit encoding, so part of the problem might be that you've taken iso-2022-jp data, and run it through a function that converts ISO-Latin-1 (or some other 8 bit encoding for which the lower half matches ASCII, for example any Windows code page) to UTF-8. If so, then this function leaves 7 bit data unchanged, it won't convert it to UTF-8. Then when interpreted as UTF-8, the data has ESC characters in it.
If you want to send the data as UTF-8, then first of all you need to actually convert it out of iso-2022-jp (to wide characters or to UTF-8, whichever your SOAP or XML library expects). Secondly you need to label it as UTF-8, not as iso-2022-jp. Finally you need to serialize the whole document as UTF-8, although as I've said you might already be doing that.
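Since you're on Windows with MS-DOM, one way to do that conversion is via code page 50220 (Windows' identifier for iso-2022-jp): decode to UTF-16 with MultiByteToWideChar, then re-encode as UTF-8 with WideCharToMultiByte. A minimal sketch (the helper name is mine, error handling kept to a minimum):

#include <windows.h>
#include <string>

// Convert iso-2022-jp bytes to a UTF-8 std::string via UTF-16.
std::string Iso2022JpToUtf8(const std::string& jis)
{
    const UINT CP_ISO2022JP = 50220;   // Windows code page for iso-2022-jp; flags must be 0

    int wlen = MultiByteToWideChar(CP_ISO2022JP, 0, jis.c_str(), (int)jis.size(),
                                   nullptr, 0);
    if (wlen <= 0) return std::string();
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_ISO2022JP, 0, jis.c_str(), (int)jis.size(),
                        &wide[0], wlen);

    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), wlen,
                                   nullptr, 0, nullptr, nullptr);
    if (ulen <= 0) return std::string();
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), wlen,
                        &utf8[0], ulen, nullptr, nullptr);
    return utf8;
}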
As pointed out by Steve Jessop, it looks like you have encoded the text as iso-2022-jp, not UTF-8. So the first thing to do is to check that and ensure that you have proper UTF-8.
If the problem still persists, consider applying a transfer encoding to the text so that only plain ASCII ends up in the XML.
The simplest option is "hex encoding" where you just write the hex value of each byte as ASCII digits. e.g. the 0x1B byte becomes "1B", i.e. 0x31, 0x42.
If you want to be fancy you could use MIME or even UUENCODE.

How to convert a single-byte const char* to a UTF-8 encoding

I have a function which requires me to pass a UTF-8 string pointed by a char*, and I have the char pointer to a single-byte string. How can I convert the string to UTF-8 encoding in C++? Is there any code I can use to do this?
Thanks!
Assuming Linux, you're looking for iconv. When you open the converter (iconv_open), you pass the from and to encodings. If you pass an empty string as the from encoding, it'll convert from the locale used on your system, which should match the file system.
On Windows, you have pretty much the same with MultiByteToWideChar where you pass CP_ACP as the codepage. But on Windows you can simply call the Unicode version of the functions to get Unicode straight away and then convert to UTF-8 with WideCharToMultiByte and CP_UTF8.
To convert a string to a different character encoding, use any of various character encoding libraries. A popular choice is iconv (the standard on most Linux systems).
However, to do this you first need to figure out the encoding of your input. There is unfortunately no general solution to this. If the input does not specify its encoding (like e.g. web pages generally do), you'll have to guess.
As to your question: You write that you get the string from calling readdir on a FAT32 file system. I'm not quite sure, but I believe readdir will return the file names as they are stored by the file system. In the case of FAT/FAT32:
The short file names are encoded in some DOS code page - which code page depends on how the files were written; there's no way to tell from just the file system AFAIK.
The long file names are in UTF-16.
If you use the standard vfat Linux kernel module to access the FAT32 partition, you should get long file names from readdir (unless a file only has an 8.3 name). These can be decoded as UTF-16. FAT32 stores the long file names in UTF-16 internally. The vfat driver will convert them to the encoding given by the iocharset= mount parameter (with the default being the default system encoding, I believe).
Additional information:
You may have to play with the mount options codepage and iocharset (see http://linux.die.net/man/8/mount ) to get filenames right on the FAT32 volume. Try to mount such that filenames are shown correctly in a Linux console, then proceed. There is some more explanation here: http://www.nslu2-linux.org/wiki/HowTo/MountFATFileSystems
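Once the mount options are right, the iconv conversion from the locale encoding to UTF-8 looks roughly like this (a sketch; instead of the empty from-string mentioned above, it uses nl_langinfo(CODESET) to get the locale's charset explicitly):

#include <iconv.h>
#include <langinfo.h>
#include <clocale>
#include <string>
#include <vector>

// Convert a filename from the current locale's encoding to UTF-8.
// Returns an empty string on failure.
std::string LocaleToUtf8(const std::string& name)
{
    setlocale(LC_ALL, "");                       // normally done once at program start
    iconv_t cd = iconv_open("UTF-8", nl_langinfo(CODESET));
    if (cd == (iconv_t)-1) return "";

    std::vector<char> buf(name.size() * 4 + 1);  // UTF-8 needs at most 4 bytes per input byte here
    char* inPtr = const_cast<char*>(name.data());
    size_t inLeft = name.size();
    char* outPtr = buf.data();
    size_t outLeft = buf.size();

    size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == (size_t)-1) return "";

    return std::string(buf.data(), buf.size() - outLeft);
}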
I guess the top bit is set on some byte of your single-byte string, so the function you're passing it to expects more bytes to follow.
First, print the string out in hex.
i.e.
const char* str = "your string";
for (size_t i = 0; i < strlen(str); i++)
    printf("[%02x]", (unsigned char)str[i]);
Now have a read of the Wikipedia article on UTF-8 encoding, which explains it well.
http://en.wikipedia.org/wiki/UTF-8
UTF-8 is variable width where each character can occupy from 1 to 4 bytes.
Therefore, convert the hex to binary and see what the code point is.
i.e. if the first byte starts with 11110 (in binary), then it introduces a 4-byte sequence. Since ASCII is 7-bit (0-127), the top bit is always zero, so an ASCII character is always a single byte. The bytes following the lead byte of a multi-byte UTF-8 character all start with the bits "10...": these are the continuation bytes, and that's what your function is complaining about - the continuation bytes are missing where it expects them.
So the string is not quite true ascii as you thought it was.
You can convert using iconv, as someone suggested, or perhaps this library: http://utfcpp.sourceforge.net/
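To make the diagnosis above concrete, here is a small helper (my own, not from any library) that classifies each byte the way a UTF-8 decoder would see it:

#include <cstdio>
#include <cstring>

// Print how each byte of a string would be interpreted by a UTF-8 decoder.
void ClassifyUtf8Bytes(const char* s)
{
    for (size_t i = 0; i < strlen(s); ++i) {
        unsigned char b = (unsigned char)s[i];
        if (b < 0x80)
            printf("%02x ASCII\n", b);
        else if ((b & 0xC0) == 0x80)
            printf("%02x continuation byte\n", b);
        else if ((b & 0xE0) == 0xC0)
            printf("%02x lead byte of a 2-byte sequence\n", b);
        else if ((b & 0xF0) == 0xE0)
            printf("%02x lead byte of a 3-byte sequence\n", b);
        else if ((b & 0xF8) == 0xF0)
            printf("%02x lead byte of a 4-byte sequence\n", b);
        else
            printf("%02x invalid in UTF-8\n", b);
    }
}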