Text encoding of Protocol Buffers string fields - c++

If a C++ program receives a Protocol Buffers message that has a Protocol Buffers string field, which is represented by a std::string, what is the encoding of text in that field? Is it UTF-8?

Yes: protobuf string fields are required to contain valid UTF-8.
See the Language Guide:
A string must always contain UTF-8 encoded or 7-bit ASCII text.
(And ASCII is always also valid UTF-8.)
Not all protobuf implementations enforce this, but if I recall correctly, at least the Python library refuses to decode strings that are not valid UTF-8.
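Since enforcement varies between implementations, a receiver may want to validate the bytes itself before treating them as UTF-8. The following is only a minimal sketch of such a check, written in Python for brevity rather than C++; the helper name `is_valid_utf8` is hypothetical and not part of any protobuf API.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict mode: raises on any malformed sequence
    except UnicodeDecodeError:
        return False
    return True


# ASCII is a subset of UTF-8, so plain ASCII bytes pass the check.
print(is_valid_utf8(b"plain ascii"))
# A correctly encoded degree sign (bytes C2 B0) followed by "C" passes too.
print(is_valid_utf8("\u00b0C".encode("utf-8")))
# A Latin-1 encoded degree sign (single byte B0) is not valid UTF-8.
print(is_valid_utf8(b"\xb0C"))
```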

Related

Standard way of Serializing utf-8 characters in a JSON String

What's the standard way of serializing a UTF-8 string in JSON? Should it use a \u escape sequence, or the hex code of the UTF-8 bytes?
I want to serialize some sensor readings with units in a JSON Format.
For example I have temperature readings with units °C. Should it be serialized as
{
"units": "\u00b0"
}
or should it be something like
{
"units":"c2b0"
}
Or are both of these supported by the standard?
If JSON is used to exchange data between systems, it must be encoded as UTF-8 (see RFC 8259). UTF-16 and UTF-32 are no longer allowed. So it is not necessary to escape the degree character, and I strongly recommend against escaping unnecessarily.
Correct and recommended
{
"units": "°C"
}
Of course, you must apply a proper UTF-8 encoding.
If JSON is used in a closed ecosystem, you can use other text encodings (though I would recommend against it unless you have a very good reason). If you need to escape the degree character in your non-UTF-8 encoding, the correct escaping sequence is \u00b0.
Possible but not recommended
{
"units": "\u00b0C"
}
Your second approach is incorrect under all circumstances.
Incorrect
{
"units":"c2b0"
}
It is also incorrect to use something like "\xc2\xb0". This is the escaping used in C/C++ source code. It is also used by debuggers to display strings. In JSON, it is always invalid.
Incorrect as well
{
"units":"\xc2\xb0"
}
JSON text is a sequence of Unicode characters, but the specification lets you use \uXXXX escape codes to represent characters that your native environment cannot represent, so it's perfectly valid to include such escape sequences and transfer the JSON-serialized data as plain ASCII.
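To make the difference between these forms concrete, here is a small sketch using Python's standard `json` module. The `reading` dictionary and the `parses` helper are made up for illustration; the point is only that the raw and the \u-escaped forms decode to the same string, while the C-style \x form is rejected by a JSON parser.

```python
import json

reading = {"units": "\u00b0C"}  # the degree sign followed by "C"

# Raw form: the degree sign is written as-is and encoded as UTF-8 on the wire.
raw = json.dumps(reading, ensure_ascii=False)
# Escaped form: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(reading, ensure_ascii=True)

print(raw)
print(escaped)

# Both forms decode to exactly the same data.
assert json.loads(raw) == json.loads(escaped) == reading

# The bytes actually sent for the raw form contain C2 B0 for the degree sign.
wire_bytes = raw.encode("utf-8")


def parses(text: str) -> bool:
    """Return True if `text` is accepted by the JSON parser (hypothetical helper)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# C-style \x escapes are not part of JSON, so this document is rejected.
print(parses('{"units": "\\xc2\\xb0"}'))
# "c2b0" parses, but it is just the four-character text c2b0, not a degree sign.
print(json.loads('{"units": "c2b0"}')["units"])
```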

Should I use UTF-8 to send data over the network?

WinAPI uses UTF-16LE encoding, so if I called some WinAPI function that returns a string, it will return it as UTF-16LE encoded.
So I'm thinking of using UTF-16LE encoding for strings in my program, and when it's time to send the data over the network, I convert it to UTF-8, and on the other side I convert it back to UTF-16LE. The idea is to reduce the amount of data to send.
Is there a reason why I shouldn't do that?
With UTF-8 encoding, you'll use:
1 byte for ASCII chars (U+0000 to U+007F)
2 bytes for Unicode chars between U+0080 and U+07FF
3 bytes for chars between U+0800 and U+FFFF, and 4 bytes beyond that
So, if your text is in a Western language, in most cases it will probably be shorter in UTF-8 than in UTF-16LE: the Western alphabets are encoded between U+0000 and U+0590.
Conversely, if your text is Asian, UTF-8 might inflate your data significantly: the Asian character sets lie beyond U+07FF and hence require at least 3 bytes per character, versus 2 bytes for most of them in UTF-16.
In the UTF8 everywhere article you can find some (basic) statistics about length of text encoding, as well as other arguments supporting the use of UTF8.
One that comes to my mind for networking is that the UTF-8 representation is the same on all platforms, whereas UTF-16 comes in LE and BE variants, depending on OS and CPU architecture.
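The size trade-off is easy to measure. The sketch below, in Python for brevity, compares encoded lengths for a few sample strings (the samples and the `encoded_sizes` helper are made up for illustration) and shows how the byte order affects UTF-16 but not UTF-8.

```python
samples = {
    "english": "temperature",
    "french": "temp\u00e9rature",  # one accented letter
    "japanese": "\u6e29\u5ea6\u30bb\u30f3\u30b5\u30fc",  # six characters
}


def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 length, UTF-16LE length) in bytes (hypothetical helper)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{name}: UTF-8 {utf8_len} bytes, UTF-16LE {utf16_len} bytes")

# UTF-16 has two byte orders for the same character; UTF-8 has only one form.
e_acute = "\u00e9"
print(e_acute.encode("utf-16-le").hex())
print(e_acute.encode("utf-16-be").hex())
print(e_acute.encode("utf-8").hex())
```

For the ASCII and mostly-ASCII samples UTF-8 is about half the size of UTF-16LE, while for the Japanese sample it is 50% larger.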

std::string with different encoding to QString

Is there any way to detect std::string encoding?
My problem: I have external web services which give data in different encodings. I also have a library which parses that data and stores it in std::string. Then I want to display the data in a Qt GUI. The problem is that the std::string can have different encodings. Some strings can be converted using QString::fromAscii(), some using QString::fromUtf8().
I haven't looked into it but I did use some Qt3.3 in the past.
ASCII vs Unicode + UTF-8
UTF-8 uses 8-bit code units, while ASCII is 7-bit. I guess you can try looking at the byte values in the string:
http://doc.qt.digia.com/3.3/qstring.html#ascii and http://doc.qt.digia.com/3.3/qstring.html#utf8
It seems ascii() returns an 8-bit ASCII representation of the string, but pure ASCII text should only have values from 0 to 127. A single character is not enough to decide; you must check more characters in the string.
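The byte-inspection idea can be sketched as follows. This is only a heuristic, shown in Python rather than C++/Qt for brevity, and `guess_encoding` is a hypothetical helper: it can tell pure ASCII from valid UTF-8, but it cannot identify which legacy 8-bit encoding a string uses, so that information should come from the web service whenever possible.

```python
def guess_encoding(data: bytes) -> str:
    """Classify bytes as 'ascii', 'utf-8' or 'unknown-8bit' (heuristic only)."""
    # Pure ASCII: every byte is in the range 0..127.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    # Bytes above 127 are present: check whether they form valid UTF-8 sequences.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Some other 8-bit encoding (Latin-1, Windows-1252, ...); cannot tell which.
        return "unknown-8bit"


print(guess_encoding(b"hello"))
print(guess_encoding("h\u00e9llo".encode("utf-8")))
print(guess_encoding("h\u00e9llo".encode("latin-1")))
```

In a Qt program the result would decide which conversion to call, for example QString::fromUtf8() for the first two cases.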

Encode gives wrong value of Japanese kanji

As a part of a scraper, I need to encode kanji to URLs, but I just can't seem to even get the correct output from a simple sign, and I'm currently blinded by everything I've tried thus far from various Stack Overflow posts.
The document is set to UTF-8.
# -*- coding: utf-8 -*-
import urllib2  # Python 2 standard library
sampleText=u'ル'
print sampleText
print sampleText.encode('utf-8')
print urllib2.quote(sampleText.encode('utf-8'))
It gives me the values:
ル
ル
%E3%83%AB
But as far as I understand, it should give me:
ル
XX
%83%8B
What am I doing wrong? Are there some settings I don't have correct? Because as far as I understand it, my output from the encode() should not be ル.
The code you show works correctly. The character ル is KATAKANA LETTER RU, and is Unicode codepoint U+30EB. When encoded to UTF-8, you'll get the Python bytestring '\xe3\x83\xab', which prints out as ル on a UTF-8 console (and as mojibake such as ãƒ« if your console encoding is Latin-1 or Windows-1252). When you URL-escape those three bytes, you get %E3%83%AB.
The value you seem to be expecting, %83%8B is the Shift-JIS encoding of ル, rather than UTF-8 encoding. For a long time there was no standard for how to encode non-ASCII text in a URL, and as this Wikipedia section notes, many programs simply assumed a particular encoding (often without specifying it). The newer standard of Internationalized Resource Identifiers (IRIs) however says that you should always convert Unicode text to UTF-8 bytes before performing percent encoding.
So, if you're generating your encoded string for a new program that wants to meet the current standards, stick with the UTF-8 value you're getting now. I would only use the Shift-JIS version if you need it for backwards compatibility with specific old websites or other software that expects that the data you send will have that encoding. If you have any influence over the server (or other program), see if you can update it to use IRIs too!
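For comparison, both percent-encoded forms can be produced side by side. This sketch uses Python 3's `urllib.parse` (the question's code is Python 2, where the equivalent function is `urllib2.quote`); the variable names are illustrative.

```python
from urllib.parse import quote, unquote

text = "\u30eb"  # KATAKANA LETTER RU

# Percent-encode the UTF-8 bytes (what current standards expect).
utf8_quoted = quote(text.encode("utf-8"))
# Percent-encode the Shift-JIS bytes (only for legacy servers that require it).
sjis_quoted = quote(text.encode("shift_jis"))

print(utf8_quoted)
print(sjis_quoted)

# Decoding only round-trips when the receiver assumes the same encoding.
print(unquote(utf8_quoted) == text)
print(unquote(sjis_quoted, encoding="shift_jis") == text)
```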

Converting character encoding within c++

I have a website which allows users to input usernames.
The problem here is that the C++ code assumes the browser encoding is Western European and converts the string received from the username text box into Unicode to compare it with the string stored in the database.
With the right browser encoding set, the string úser is received as %FAser and converted properly to úser within the program.
However, with the browser set to UTF-8, the string is received as %C3%BAser and then converted to Ãºser, because the code converts C3 and BA as separate characters.
Is there a way to convert the example %C3%BA to ú while ensuring the right conversions are being made?
You can use the ICU library to convert between almost all usable encodings. This library also provides lots of string manipulation facilities.
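Whatever library performs the conversion, the program still has to decide which encoding the bytes are in. One common heuristic, sketched here in Python rather than C++ and not taken from ICU, is to percent-decode to raw bytes first, try strict UTF-8, and fall back to the Western European encoding only if that fails. The function name `decode_form_value` is hypothetical, and the fallback to Latin-1 is an assumption based on the question.

```python
from urllib.parse import unquote_to_bytes


def decode_form_value(quoted: str) -> str:
    """Percent-decode, then try UTF-8 before falling back to Latin-1 (sketch)."""
    raw = unquote_to_bytes(quoted)  # "%C3%BAser" -> b"\xc3\xbaser"
    try:
        # Multi-byte sequences such as C3 BA are decoded as one character.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the legacy Western European encoding.
        return raw.decode("latin-1")


print(decode_form_value("%C3%BAser"))
print(decode_form_value("%FAser"))
```

This is a guess, not a guarantee: a few Latin-1 strings are also valid UTF-8 and would be misread, so it is more robust to have the page declare its encoding so that the browser always submits UTF-8.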