How to Convert UTF-16 Surrogate Decimal to Unicode in C++

I got some string data from a parameter, such as 😊 (received as the decimal surrogate values 55357 and 56842).
These are Unicode UTF-16 surrogate pairs represented as decimal.
How can I convert them to a Unicode code point such as "U+1F60A" with the standard library?

You can easily do it by hand. The algorithm for going from a high Unicode code point to the surrogate pair and back is not that hard. The Wikipedia page on UTF-16 says:
U+10000 to U+10FFFF
0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
That's just bitwise AND, OR, and shifts, and can trivially be implemented in C or C++.
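For example, a minimal sketch (not part of the original answer) of the reverse step, recombining the two decimal code units from the question into a code point:
#include <cstdint>
#include <iostream>

int main() {
    // The two UTF-16 code units from the question, given as decimal values.
    std::uint16_t high = 55357; // 0xD83D, high surrogate
    std::uint16_t low  = 56842; // 0xDE0A, low surrogate

    // Reverse of the recipe above: strip the surrogate offsets, recombine
    // the two 10-bit halves, and add back 0x10000.
    std::uint32_t code_point = ((std::uint32_t(high) - 0xD800) << 10)
                             | (std::uint32_t(low) - 0xDC00);
    code_point += 0x10000;

    std::cout << "U+" << std::hex << std::uppercase << code_point << '\n'; // U+1F60A
    return 0;
}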
As you said you wanted to use the standard library: what you ask for is a conversion from two 16-bit UTF-16 surrogates to one 32-bit Unicode code point, so codecvt is your friend, provided you can compile in C++11 mode or higher.
Here is an example processing your values on a little-endian architecture:
#include <iostream>
#include <locale>
#include <codecvt>
#include <cstdint>  // uint16_t, uint32_t
#include <cwchar>   // std::mbstate_t

int main() {
    // UTF-16 (little-endian) to UTF-32 conversion facet.
    std::codecvt_utf16<char32_t, 0x10ffffUL,
                       std::codecvt_mode::little_endian> cvt;
    std::mbstate_t state{};

    char16_t pair[] = { 55357, 56842 };  // 0xD83D, 0xDE0A
    const char* next;                    // the facet's external type is char
    char32_t u[2];
    char32_t* unext;

    // The UTF-16 buffer is passed to the facet as a sequence of bytes.
    cvt.in(state,
           reinterpret_cast<const char*>(pair),
           reinterpret_cast<const char*>(pair + 2),
           next, u, u + 1, unext);

    std::cout << std::hex << (uint16_t) pair[0] << " " << (uint16_t) pair[1]
              << std::endl;
    std::cout << std::hex << (uint32_t) u[0] << std::endl;
    return 0;
}
Output is as expected:
d83d de0a
1f60a

Related

UTF-8, sprintf, strlen, etc

I am trying to understand how to handle basic UTF-8 operations in C++.
Let's say we have this scenario: a user inputs a name, it's limited to 10 letters (symbols in user's language, not bytes), and it's stored.
It can be done this way in ASCII.
// ASCII
char *input;  // user's input
char buf[11]; // 10 letters + zero
snprintf(buf, 11, "%s", input); buf[10] = 0;
int len = strlen(buf); // returns 10 (correct)
Now, how to do it in UTF-8? Let's assume characters may take up to 4 bytes each (as in a charset like Chinese).
// UTF-8
char *input;  // user's input
char buf[41]; // 10 letters * 4 bytes + zero
snprintf(buf, 41, "%s", input); // ?? makes no sense, it limits by the number of bytes, not letters
int len = strlen(buf); // returns the number of bytes, not letters (incorrect)
Can it be done with standard sprintf/strlen? Are there any replacements for those functions to use with UTF-8 (in PHP there was an mb_ prefix for such functions, IIRC)? If not, do I need to write them myself? Or do I need to approach it another way?
Note: I would prefer to avoid a wide-character solution...
EDIT: Let's limit it to the Basic Multilingual Plane only.
I would prefer to avoid a wide-character solution...
Wide characters are just not enough, because if you need 4 bytes for a single glyph, that glyph is outside the Basic Multilingual Plane and will not be represented by a single 16-bit wchar_t character (assuming wchar_t is 16 bits wide, as it is on Windows).
You will have to use a true Unicode library to convert the input to a list of Unicode characters in Normalization Form C (canonical composition) or the compatibility equivalent (NFKC)(*), depending on whether, for example, you want to count one or two characters for the ligature ff (U+FB00). AFAIK, your best bet is ICU.
(*) Unicode allows multiple representations of the same glyph, notably the normal composed form (NFC) and the normal decomposed form (NFD). For example, the French character é can be represented in NFC as U+00E9 (LATIN SMALL LETTER E WITH ACUTE) or in NFD as U+0065 U+0301 (LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT, which also displays as é).
References and other examples on Unicode equivalence
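As a small illustration (not part of the original answer), the two forms of é really are different byte sequences; the escape sequences below are the UTF-8 encodings of U+00E9 and of U+0065 U+0301:
#include <cstring>
#include <iostream>

int main() {
    // NFC: U+00E9 as a single code point -> 2 UTF-8 bytes.
    const char nfc[] = "\xC3\xA9";
    // NFD: U+0065 followed by U+0301 (combining acute) -> 3 UTF-8 bytes.
    const char nfd[] = "e\xCC\x81";

    // Both display as "é" on a UTF-8 terminal, yet the byte counts
    // (and the code point counts) differ.
    std::cout << nfc << " uses " << std::strlen(nfc) << " bytes\n";
    std::cout << nfd << " uses " << std::strlen(nfd) << " bytes\n";
    return 0;
}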
strlen only counts the bytes in the input string, until the terminating NUL.
On the other hand, you seem interested in the glyph count (what you called "symbols in user's language").
The process is complicated by UTF-8 being a variable-length encoding (as is, to a lesser extent, UTF-16), so a code point can be encoded using one to four bytes. And there are also Unicode combining characters to consider.
To my knowledge, there's nothing like that in the standard C++ library. However, you may have better luck using third-party libraries like ICU.
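That said, if a code point count (rather than a true glyph count, which does call for a library like ICU) is good enough, a minimal hand-rolled sketch is to skip UTF-8 continuation bytes, which all have the bit pattern 10xxxxxx:
#include <cstddef>
#include <iostream>

// Counts Unicode code points in a NUL-terminated UTF-8 string by skipping
// continuation bytes (those of the form 10xxxxxx). Note that combining
// characters still count as separate code points.
std::size_t utf8_codepoint_count(const char* s) {
    std::size_t count = 0;
    for (; *s; ++s)
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++count;
    return count;
}

int main() {
    std::cout << utf8_codepoint_count("abc") << '\n';      // 3
    std::cout << utf8_codepoint_count("\xC3\xA9") << '\n'; // 1 (é in NFC)
    return 0;
}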
std::strlen indeed considers only single-byte characters. To compute the length of a wide (Unicode) NUL-terminated string, one can use std::wcslen instead.
Example:
#include <iostream>
#include <cwchar>
#include <clocale>
#include <locale>  // std::locale for imbue()

int main()
{
    const wchar_t* str = L"爆ぜろリアル!弾けろシナプス!パニッシュメントディス、ワールド!";

    std::setlocale(LC_ALL, "en_US.utf8");
    std::wcout.imbue(std::locale("en_US.utf8"));
    std::wcout << "The length of \"" << str << "\" is " << std::wcslen(str) << '\n';
}
If you do not want to count UTF-8 characters yourself, you can use a temporary conversion to wide characters to cut your input string. You do not need to store the intermediate values:
#include <iostream>
#include <codecvt>
#include <string>
#include <locale>

std::string cutString(const std::string& in, size_t len)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> cvt;
    auto wstring = cvt.from_bytes(in);
    if (len < wstring.length())
    {
        wstring = wstring.substr(0, len);
        return cvt.to_bytes(wstring);
    }
    return in;
}

int main() {
    std::string test = "你好世界這是演示樣本";
    std::string res = cutString(test, 5);
    std::cout << test << '\n' << res << '\n';
    return 0;
}
/****************
Output
$ ./test
你好世界這是演示樣本
你好世界這
*/

C++: Convert hex representation of UTF-16 char into decimal (like Python's int(hex_data, 16))

I found an explanation of how to decode hex representations into decimal, but only using Qt:
How to get decimal value of a unicode character in c++
As I am not using Qt, and cout << (int)c did not work (edit: it actually does work if you use it properly!):
How to do the following:
I got the hex representation of two chars which were transmitted over some socket (just figured out how to get the hex representation, finally!), and both combined yield the following UTF-16 representation:
char c = u"\0b7f"
This shall be converted into its decimal value of 2943!
(see it at utf-table http://www.fileformat.info/info/unicode/char/0b7f/index.htm)
This should be absolutely elementary stuff, but as a Python developer compelled to use C++ for a project, I have been stuck on this issue for hours...
Use a wider character type (char is only 8 bits; you need at least 16), and also the correct format for Unicode literals. This works:
#include <iostream>

int main()
{
    char16_t c = u'\u0b7f';
    std::cout << (int)c << std::endl; // output is 2943, as expected
    return 0;
}

Doublebyte encodings on MSVC (std::codecvt): Lead bytes not recognized

I want to convert a string encoded in a double-byte code page into a UTF-16 string using std::codecvt<wchar_t, char, std::mbstate_t>::in() on the Microsoft standard library implementation (MSVC11). For example, consider the following program:
#include <iostream>
#include <locale>
#include <cwchar>  // std::mbstate_t

int main()
{
    // KATAKANA LETTER A (U+30A2) in Shift-JIS (Codepage 932)
    // http://msdn.microsoft.com/en-us/goglobal/cc305152
    char const cs[] = "\x83\x41";

    std::locale loc = std::locale("Japanese");

    // Output: "Japanese_Japan.932" (as expected)
    std::cout << loc.name() << '\n';

    typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_t;
    cvt_t const& codecvt = std::use_facet<cvt_t>(loc);

    wchar_t out = 0;
    std::mbstate_t mbst = std::mbstate_t();
    char const* mid;
    wchar_t* outmid;

    // Output: "2" (error) (expected: "0" (ok))
    std::cout << codecvt.in(
        mbst, cs, cs + 2, mid,
        &out, &out + 1, outmid) << '\n';

    // Output: "0" (expected: "30a2")
    std::cout << std::hex << out << '\n';
}
When debugging, I found out that in() ends up calling the internal _Mbrtowc() function (crt\src\xmbtowc.c), passing the internal (C?) part of the std::locale, initialized with {_Page=932 _Mbcurmax=2 _Isclocale=0 ...}, where ... stands for (and this seems to be the problem) the _Isleadbyte member, initialized to an array of 32 zeros (of type unsigned char). Thus, when the function processes the '\x83' lead byte, it checks against this array and naturally comes to the (wrong) conclusion that it is not a lead byte. So it happily calls the MultiByteToWideChar() Win-API function, which, of course, fails to convert the halved character. So _Mbrtowc() returns the error code -1, which more or less cancels everything up the call stack, and ultimately 2 (std::codecvt_base::result::error) is returned.
Is this a bug in the MS standard library (it seems so)? (How) can I work around this in a portable way (i.e. with the least amount of #ifdefs)?
I reported it internally to Microsoft. They have now filed it as a new bug (DevDiv#737880). But I recommend filing a Connect item at: http://connect.microsoft.com/VisualStudio
I copy-pasted your code into VC2010 / Windows 7 64-bit.
It works as you expect. Here's the output:
Japanese_Japan.932
0
30a2
It's probably a bug introduced with VC2012...

Converting a Char to Its Int Representation

I don't see this as an option in things like sprintf().
How would I convert the letter F to 255? Basically, the reverse of a conversion using the %x format in sprintf?
I am assuming this is something simple I'm missing.
char const* data = "F";
int num = int(strtol(data, 0, 16));
Look up strtol and boost::lexical_cast for more details and options.
Use the %x format in sscanf!
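A minimal sketch of that approach (not from the original answer):
#include <cstdio>

int main() {
    const char* hex = "FF";
    unsigned int value = 0;

    // %x parses a hexadecimal integer; sscanf returns the number of items
    // successfully converted (1 on success here).
    if (std::sscanf(hex, "%x", &value) == 1)
        std::printf("%u\n", value); // prints 255
    return 0;
}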
The C++ way of doing it, with streams:
#include <iomanip>
#include <iostream>
#include <sstream>
int main() {
    std::string hexvalue = "FF";
    int value;
    // Construct an input stringstream, initialized with hexvalue
    std::istringstream iss(hexvalue);
    // Set the stream to hex mode, then read the value, with error handling
    if (iss >> std::hex >> value) std::cout << value << std::endl;
    else std::cout << "Conversion failed" << std::endl;
}
The program prints 255.
You can't get (s)printf to convert 'F' to 255 without some black magic. Printf will convert a character to other representations, but won't change its value. This might show how character conversion works:
printf("Char %c is decimal %i (0x%X)\n", 'F', 'F', 'F');
printf("The high order bits are ignored: %d: %X -> %hhX -> %c\n",
       0xFFFFFF46, 0xFFFFFF46, 0xFFFFFF46, 0xFFFFFF46);
produces
Char F is decimal 70 (0x46)
The high order bits are ignored: -186: FFFFFF46 -> 46 -> F
Yeah, I know you asked about sprintf, but that won't show you anything until you do another print.
The idea is that each generic integer parameter to a printf is put on the stack (or in a register) by promotion. That means it is expanded to its largest generic size: bytes, characters, and shorts are converted to int by sign-extending or zero-padding. This keeps the parameter list on the stack in a sensible state. It's a nice convention, but it probably had its origin in the 16-bit word orientation of the stack on the PDP-11 (where it all started).
In the printf library (on the receiving end of the call), the code uses the format specifier to determine what part of the parameter (or all of it) is processed. So if the format is '%c', only 8 bits are used. Note that there may be some variation between systems in how the hex constants are promoted. But if a value greater than 255 is passed to a character conversion, the high-order bits are ignored.

How to print Unicode characters as hexadecimal codes in C++

I am reading a string of data, which may or may not contain Unicode characters, from an Oracle database into a C++ program. Is there any way to check whether the string extracted from the database contains Unicode (UTF-8) characters? If any Unicode characters are present, they should be converted into hexadecimal format and displayed.
There are two aspects to this question.
Distinguish UTF-8-encoded characters from ordinary ASCII characters.
UTF-8 encodes any code point higher than 127 as a series of two or more bytes. Values at 127 and lower remain untouched. The resultant bytes from the encoding are also higher than 127, so it is sufficient to check a byte's high bit to see whether it qualifies.
Display the encoded characters in hexadecimal.
C++ has std::hex to tell streams to format numeric values in hexadecimal. You can use std::showbase to make the output look pretty. A char isn't treated as numeric, though; streams will just print the character. You'll have to force the value to another numeric type, such as int. Beware of sign extension, too.
Here's some code to demonstrate:
#include <iostream>

void print_characters(char const* s)
{
    std::cout << std::showbase << std::hex;
    for (char const* pc = s; *pc; ++pc) {
        if (*pc & 0x80)
            std::cout << (*pc & 0xff);
        else
            std::cout << *pc;
        std::cout << ' ';
    }
    std::cout << std::endl;
}
You could call it like this:
int main()
{
    char const* test = "ab\xef\xbb\xbfhu";
    print_characters(test);
    return 0;
}
Output on Solaris 10 with Sun C++ 5.8:
$ ./a.out
a b 0xef 0xbb 0xbf h u
The code detects UTF-8-encoded characters, but it makes no effort to decode them; you didn't mention needing to do that.
I used *pc & 0xff to convert the expression to an integral type and to mask out the sign-extended bits. Without that, the output on my computer was 0xffffffbb, for instance.
I would convert the string to UTF-32 (you can use something like UTF CPP for that; it is very easy), and then loop through the resulting string, detecting code points (characters) above 0x7F and printing them as hex.
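A minimal sketch of that idea, using std::wstring_convert with std::codecvt_utf8<char32_t> (available since C++11, deprecated in C++17) in place of UTF CPP for the UTF-8 to UTF-32 step:
#include <codecvt>
#include <cstdint>
#include <iostream>
#include <locale>
#include <string>

int main() {
    // "ab" followed by é (U+00E9) encoded in UTF-8.
    std::string input = "ab\xC3\xA9";

    // Convert the whole UTF-8 string to UTF-32 code points.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cvt;
    std::u32string u32 = cvt.from_bytes(input);

    // Print ASCII as-is and everything above 0x7F as a hex code point.
    for (char32_t cp : u32) {
        if (cp > 0x7F)
            std::cout << "U+" << std::hex << std::uppercase
                      << static_cast<std::uint32_t>(cp)
                      << std::dec << std::nouppercase << ' ';
        else
            std::cout << static_cast<char>(cp) << ' ';
    }
    std::cout << '\n'; // expected output: a b U+E9
    return 0;
}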