utf8mb4 encode/decode in C++

A third-party server echoes strings back to my client program; the strings contain both UTF-8 data and Unicode emoji (listed here).
I googled for a while and found that this is called utf8mb4 encoding, which is used in SQL applications.
I found some articles about utf8mb4 in MySQL/Python/Ruby/etc., but nothing for C++.
Is there a C++ library that can encode/decode utf8mb4?

MySQL calls utf8mb4 what is, in truth, just UTF-8:
The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per character and supports supplementary characters.
So any library that supports UTF-8 will give you utf8mb4. This question asked what solutions there are in C++ for converting to/from UTF-8: How to work with UTF-8 in C++, Conversion from other Encodings to UTF-8. The three solutions given are ICU (International Components for Unicode), Boost.Locale, and C++11.
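As a minimal sketch of the C++11 route (using <codecvt>, which works but was deprecated in C++17, so ICU or Boost.Locale are the longer-term choices), this round-trips a string containing a 4-byte emoji, exactly the kind of character utf8mb4 exists for:

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    // U+1F600 GRINNING FACE lies outside the BMP, so its UTF-8 form
    // takes four bytes -- the case MySQL's plain "utf8" cannot store.
    std::u32string codepoints = U"caf\u00E9 \U0001F600";

    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::string utf8 = conv.to_bytes(codepoints);  // encode to UTF-8 bytes
    std::u32string back = conv.from_bytes(utf8);   // decode back to code points

    std::cout << utf8.size() << " bytes, "         // 10 bytes
              << back.size() << " code points\n";  // 6 code points
}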

Related

Using Traditional Chinese with AWS DynamoDB

I have a mobile app that stores data in DynamoDB tables. There is a group of users in Taiwan that attempted to store their names in the database. When the data is stored it becomes garbled. I have researched this and see that it is because DynamoDB uses UTF encoding while traditional Chinese uses Big5 text encoding. How do I set up DynamoDB so that it will store and recall the proper characters?
So you start with a string in your head. It's a sequence of Unicode characters. There's no inherent byte encoding to the characters. The same string could be encoded into bytes in a variety of ways. Big5 is one. UTF-8 is another.
When you say that Traditional Chinese uses Big5, that's not entirely true. It may be commonly encoded in Big5, but it could be in UTF-8 instead, and UTF-8 has this cool property that it can encode all Unicode character sequences. That's why it's become the standard encoding for situations where you don't want to optimize for one character set.
So your challenge is to make sure you carefully control the characters and encodings so that you're sending UTF-8 sequences to DynamoDB. The standard SDKs will do this correctly as long as you're creating the strings as ordinary strings in them.
You also have to make sure you're not confusing yourself when you look at the data. If you look at UTF-8 bytes but interpret them as Big5, it's going to look like gibberish, or vice versa.
You don't say how they're loading the data. If they're starting with a file, that could be the problem. You'd want to read the file while telling your language it's Big5; then you'll have the string version, and then you can send the string version and rely on the SDK to correctly translate to UTF-8 on the wire.
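In C++ terms, that decoding step could look like the sketch below (assuming POSIX iconv is available; big5_to_utf8 is a hypothetical helper, not part of any AWS SDK):

#include <iconv.h>
#include <stdexcept>
#include <string>
#include <vector>

// Decode Big5 bytes (e.g. read from a user's file) into UTF-8,
// so the string handed to the SDK is already in the right encoding.
std::string big5_to_utf8(const std::string& big5) {
    iconv_t cd = iconv_open("UTF-8", "BIG5");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("BIG5 conversion not supported");

    std::vector<char> out(big5.size() * 4 + 4);  // UTF-8 output is at most ~3x
    char* in_ptr = const_cast<char*>(big5.data());
    size_t in_left = big5.size();
    char* out_ptr = out.data();
    size_t out_left = out.size();

    if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1) {
        iconv_close(cd);
        throw std::runtime_error("invalid Big5 input");
    }
    iconv_close(cd);
    return std::string(out.data(), out.size() - out_left);
}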
I remember when I first learned this stuff it was all kind of confusing. The thing to remember is that a capital A exists as an idea (and is a defined character in Unicode), and there are a whole lot of mechanisms you could use to put that letter into ones and zeros on disk. Each of those ways is an encoding. ASCII is popular, EBCDIC was another contender from the past, and UTF-16 is yet another contender now. Traditional Chinese is a character set (a set of characters), and you can encode each of those characters a bunch of ways too. It's just a question of how you map characters to bits and bytes and back again.

Text encoding of Protocol Buffers string fields

If a C++ program receives a Protocol Buffers message that has a Protocol Buffers string field, which is represented by a std::string, what is the encoding of text in that field? Is it UTF-8?
Protobuf strings are always valid UTF-8 strings.
See the Language Guide:
A string must always contain UTF-8 encoded or 7-bit ASCII text.
(And ASCII is always also valid UTF-8.)
Not all protobuf implementations enforce this, but if I recall correctly, at least the Python library refuses to decode strings that are not valid UTF-8.
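If you want to enforce the rule yourself on the C++ side before filling a string field, a hand-rolled check is straightforward (is_valid_utf8 is a hypothetical helper, not part of the protobuf API; for arbitrary binary data, use a bytes field instead):

#include <cstddef>
#include <cstdint>
#include <string>

bool is_valid_utf8(const std::string& s) {
    size_t i = 0;
    while (i < s.size()) {
        uint8_t b = s[i];
        size_t len;
        uint32_t cp;
        if (b < 0x80)              { len = 1; cp = b; }         // ASCII
        else if ((b >> 5) == 0x6)  { len = 2; cp = b & 0x1F; }  // 110xxxxx
        else if ((b >> 4) == 0xE)  { len = 3; cp = b & 0x0F; }  // 1110xxxx
        else if ((b >> 3) == 0x1E) { len = 4; cp = b & 0x07; }  // 11110xxx
        else return false;                       // invalid lead byte
        if (i + len > s.size()) return false;    // truncated sequence
        for (size_t j = 1; j < len; ++j) {
            uint8_t c = s[i + j];
            if ((c >> 6) != 0x2) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong encodings, surrogates, and out-of-range values.
        if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000) || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        i += len;
    }
    return true;
}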

Encode gives wrong value of Japanese kanji

As part of a scraper, I need to encode kanji into URLs, but I can't seem to get the correct output even for a single character, and I'm currently blinded by everything I've tried so far from various Stack Overflow posts.
The document is set to UTF-8.
# -*- coding: utf-8 -*-
import urllib2

sampleText = u'ル'
print sampleText
print sampleText.encode('utf-8')
print urllib2.quote(sampleText.encode('utf-8'))
It gives me the values:
ル
ル
%E3%83%AB
But as far as I understand, it should give me:
ル
XX
%83%8B
What am I doing wrong? Are there some settings I don't have correct? Because as far as I understand it, my output from the encode() should not be ル.
The code you show works correctly. The character ル is KATAKANA LETTER RU, and is Unicode codepoint U+30EB. When encoded to UTF-8, you'll get the Python bytestring '\xe3\x83\xab', which prints out as ル if your console encoding is UTF-8. When you URL-escape those three bytes, you get %E3%83%AB.
The value you seem to be expecting, %83%8B, is the Shift-JIS encoding of ル, rather than the UTF-8 encoding. For a long time there was no standard for how to encode non-ASCII text in a URL, and as this Wikipedia section notes, many programs simply assumed a particular encoding (often without specifying it). The newer standard of Internationalized Resource Identifiers (IRIs), however, says that you should always convert Unicode text to UTF-8 bytes before performing percent-encoding.
So, if you're generating your encoded string for a new program that wants to meet the current standards, stick with the UTF-8 value you're getting now. I would only use the Shift-JIS version if you need it for backwards compatibility with specific old websites or other software that expects that the data you send will have that encoding. If you have any influence over the server (or other program), see if you can update it to use IRIs too!
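As a side note, percent-encoding UTF-8 bytes is easy to do by hand in C++ too; below is a minimal sketch (percent_encode is a hypothetical helper following RFC 3986's unreserved set):

#include <cctype>
#include <cstdio>
#include <string>

std::string percent_encode(const std::string& utf8) {
    std::string out;
    for (unsigned char b : utf8) {
        // RFC 3986 unreserved characters pass through unescaped.
        if (std::isalnum(b) || b == '-' || b == '_' || b == '.' || b == '~') {
            out += static_cast<char>(b);
        } else {
            char buf[4];
            std::snprintf(buf, sizeof buf, "%%%02X", b);
            out += buf;
        }
    }
    return out;
}

int main() {
    // "ル" (U+30EB) encodes to the UTF-8 bytes E3 83 AB.
    std::printf("%s\n", percent_encode("\xE3\x83\xAB").c_str());  // %E3%83%AB
}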

Converting character encoding within C++

I have a website which allows users to input usernames.
The problem here is that the C++ code assumes the browser encoding is Western European and converts the string received from the username text box into Unicode to compare with strings stored within the database.
With the right browser encoding set, the username úser is received as %FAser and converted properly to úser within the program.
However, with the browser encoding set to UTF-8, the string is received as %C3%BAser and then converted to Ãºser, because the code converts C3 and BA as separate characters.
Is there a way to convert the example %C3%BA to ú while ensuring the right conversions are being made?
You can use the ICU library to convert between almost all usable encodings. This library also provides lots of string manipulation facilities.
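As a minimal sketch of that with ICU (assuming the icuuc library; latin1_to_utf8 is a hypothetical helper name), you interpret the incoming bytes in the encoding the browser actually used, then re-encode them as UTF-8:

#include <unicode/unistr.h>
#include <string>

// Decode bytes sent as ISO-8859-1 (Western European) and re-encode
// them as UTF-8, so all comparisons happen in one known encoding.
std::string latin1_to_utf8(const std::string& latin1) {
    icu::UnicodeString ustr(latin1.c_str(),
                            static_cast<int32_t>(latin1.size()),
                            "ISO-8859-1");
    std::string utf8;
    ustr.toUTF8String(utf8);
    return utf8;
}

For example, the percent-decoded byte 0xFA ("%FA") comes out as the two UTF-8 bytes 0xC3 0xBA, i.e. ú.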

How do I create a Unicode filename in Linux?

I heard fopen supports UTF-8, but I don't know how to convert an array of shorts to UTF-8.
How do I create a file with Unicode letters in its name? I'd prefer to use only built-in libraries (no Boost, which is not installed on the Linux box). I do need to use fopen, but it's pretty simple to use.
fopen(3) accepts any valid byte sequence; the encoding is unimportant to it. Use nl_langinfo(3) with CODESET to find out which charset you should use for the filename, and libiconv or ICU for the actual encoding.
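Putting that together, here is a minimal sketch (assuming a little-endian host, so the array of shorts is UTF-16LE, and POSIX iconv; the filename "ル.txt" is just an example):

#include <cstddef>
#include <cstdio>
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdint.h>

int main() {
    setlocale(LC_ALL, "");                       // adopt the user's locale
    const char* codeset = nl_langinfo(CODESET);  // e.g. "UTF-8"

    // The "array of shorts": UTF-16 code units for the filename "ル.txt".
    uint16_t name16[] = { 0x30EB, 0x002E, 0x0074, 0x0078, 0x0074 };

    iconv_t cd = iconv_open(codeset, "UTF-16LE");
    if (cd == (iconv_t)-1) return 1;

    char out[256];
    char* in_ptr = reinterpret_cast<char*>(name16);
    size_t in_left = sizeof name16;
    char* out_ptr = out;
    size_t out_left = sizeof out - 1;
    if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1)
        return 1;
    *out_ptr = '\0';                             // NUL-terminate the filename
    iconv_close(cd);

    FILE* f = fopen(out, "w");  // the bytes are in the locale's codeset
    if (f) fclose(f);
    return f ? 0 : 1;
}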