How to get the name of a Unicode character? - c++

I think I saw this a long time ago: a way to get a string containing the name of a Unicode character by using Win32 API calls. I'm using C++ Builder, so if there is support for it in the VCL library, that would work fine too.
For example:
GetUnicodeName(U+0021) would return a string (or fill in a struct or similar), such as "EXCLAMATION MARK".
Or, if there is some other way to get the same result from Windows with C or C++, that would also work.
The worst-case scenario would be to have a HUGE lookup table with the names of interest (mainly Latin characters).

You can use the undocumented GetUName function from getuname.dll:
std::string GetUnicodeCharacterName(wchar_t character)
{
// https://github.com/reactos/reactos/tree/master/dll/win32/getuname
typedef int(WINAPI* GetUNameFunc)(WORD wCharCode, LPWSTR lpBuf);
static GetUNameFunc pfnGetUName = reinterpret_cast<GetUNameFunc>(::GetProcAddress(::LoadLibraryA("getuname.dll"), "GetUName"));
if (!pfnGetUName)
return {};
std::array<WCHAR, 256> buffer;
int length = pfnGetUName(character, buffer.data());
return utf8::narrow(buffer.data(), length);
}
// Replace invisible code point with code point that is visible
wchar_t ReplaceInvisible(wchar_t character)
{
if (!std::iswgraph(character))
{
if (character <= 0x21)
character += 0x2400; // U+2400 Control Pictures https://www.unicode.org/charts/PDF/U2400.pdf
else
character = 0xFFFD; // REPLACEMENT CHARACTER
}
return character;
}
// Accepts a UTF-8 string.
// Returns a UTF-8 string like this:
// q <U+71 Latin Small Letter Q>
// п <U+43F Cyrillic Small Letter Pe>
// ␈ <U+8 Backspace>
// 𐌸 <U+10338 Supplementary Multilingual Plane>
// 🚒 <U+1F692 Supplementary Multilingual Plane>
std::string GetUnicodeCharacterNames(std::string string)
{
// UTF-8 <=> UTF-32 converter
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf32conv;
// UTF-8 to UTF-32
std::u32string utf32string = utf32conv.from_bytes(string);
std::string characterNames;
characterNames.reserve(35 * utf32string.size());
for (const char32_t& codePoint : utf32string)
{
if (!characterNames.empty())
characterNames.append(", ");
char32_t visibleCodePoint = (codePoint < 0x10000) ? ReplaceInvisible(static_cast<wchar_t>(codePoint)) : codePoint;
std::string charName = (codePoint < 0x10000) ? GetUnicodeCharacterName(static_cast<wchar_t>(codePoint)) : "Supplementary Multilingual Plane";
// UTF-32 to UTF-8
std::string utf8codePoint = utf32conv.to_bytes(&visibleCodePoint, &visibleCodePoint + 1);
characterNames.append(fmt::format("{} <U+{:X} {}>", utf8codePoint, static_cast<uint32_t>(codePoint), charName));
}
return characterNames;
}
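For example, a minimal usage sketch (it assumes the fmt library and the utf8::narrow helper used above are available in the project):
// prints roughly: q <U+71 Latin Small Letter Q>, ! <U+21 Exclamation Mark>
std::cout << GetUnicodeCharacterNames("q!") << std::endl;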
The downside is that it only contains characters from Unicode Basic Multilingual Plane (BMP).
Update: you can use the u_charName() ICU API that ships with Windows since the Fall Creators Update (version 1709, build 16299):
std::string GetUCharNameWrapper(char32_t codePoint)
{
typedef int32_t(*u_charNameFunc)(char32_t code, int nameChoice, char* buffer, int32_t bufferLength, int* pErrorCode);
static u_charNameFunc pfnU_charName = reinterpret_cast<u_charNameFunc>(::GetProcAddress(::LoadLibraryA("icuuc.dll"), "u_charName"));
if (!pfnU_charName)
return {};
int errorCode = 0;
std::array<char, 512> buffer;
int32_t length = pfnU_charName(codePoint, 0/*U_UNICODE_CHAR_NAME*/ , buffer.data(), static_cast<int32_t>(buffer.size() - 1), &errorCode);
if (errorCode != 0)
return {};
return std::string(buffer.data(), length);
}
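A minimal usage sketch of the wrapper (expected names shown in the comments; this requires Windows 10 1709 or later, where icuuc.dll is present):
#include <iostream>

int main()
{
    std::cout << GetUCharNameWrapper(U'!') << "\n";     // EXCLAMATION MARK
    std::cout << GetUCharNameWrapper(0x1F692) << "\n";  // FIRE ENGINE
}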

Related

How to split a string by emojis in C++

I'm trying to take a string of emojis and split them into a vector of each emoji. Given the string:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
I'm trying to get:
std::vector<std::string> splitted_emojis = {"😀", "🔍", "🦑", "😁", "🔍", "🎉", "😂", "🤣"};
Edit
I've tried to do:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
std::vector<std::string> splitted_emojis;
size_t pos = 0;
std::string token;
while ((pos = emojis.find("")) != std::string::npos)
{
token = emojis.substr(0, pos);
splitted_emojis.push_back(token);
emojis.erase(0, pos);
}
But it seems like it throws terminate called after throwing an instance of 'std::bad_alloc' after a couple of seconds.
When trying to check how many emojis are in a string using:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
std::cout << emojis.size() << std::endl; // returns 32
it returns a bigger number, which I assume is the underlying Unicode data. I don't know much about Unicode, but I'm trying to figure out how to detect where the data for one emoji begins and ends, so that I can split the string into individual emojis.
I would definitely recommend that you use a library with better Unicode support (all large frameworks have one), but in a pinch you can get by with knowing that the UTF-8 encoding spreads Unicode characters over multiple bytes, and that the first bits of the first byte determine how many bytes a character is made up of.
I stole a function from boost. The split_by_codepoint function uses an iterator over the input string and constructs a new string using the first N bytes (where N is determined by the byte count function) and pushes it to the ret vector.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>
// Taken from boost internals
inline unsigned utf8_byte_count(uint8_t c)
{
// if the most significant bit with a zero in it is in position
// 8-N then there are N bytes in this UTF-8 sequence:
uint8_t mask = 0x80u;
unsigned result = 0;
while(c & mask)
{
++result;
mask >>= 1;
}
return (result == 0) ? 1 : ((result > 4) ? 4 : result);
}
std::vector<std::string> split_by_codepoint(std::string input) {
std::vector<std::string> ret;
auto it = input.cbegin();
while (it != input.cend()) {
uint8_t count = utf8_byte_count(*it);
ret.emplace_back(std::string{it, it+count});
it += count;
}
return ret;
}
int main() {
std::string emojis = u8"😀🔍🦑😁🔍🎉😂🤣";
auto split = split_by_codepoint(emojis);
std::cout << split.size() << std::endl;
}
Note that this function simply splits a string into UTF-8 strings containing one code point each. Determining if the character is an emoji is left as an exercise: UTF-8-decode any 4-byte characters and see if they are in the proper range.
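For illustration, a rough sketch of that exercise; the range check below is a simplification (flags, ZWJ sequences and variation selectors are not covered), and real detection would use the Unicode emoji data files:
#include <string>

// Decode one 4-byte UTF-8 sequence into its code point (assumes valid input).
inline char32_t decode_utf8_4(const std::string& s)
{
    return (char32_t(s[0] & 0x07) << 18) |
           (char32_t(s[1] & 0x3F) << 12) |
           (char32_t(s[2] & 0x3F) << 6)  |
            char32_t(s[3] & 0x3F);
}

// Very rough heuristic: most emoji live in the supplementary blocks U+1F000..U+1FAFF.
inline bool looks_like_emoji(const std::string& utf8_char)
{
    if (utf8_char.size() != 4)
        return false;
    const char32_t cp = decode_utf8_4(utf8_char);
    return cp >= 0x1F000 && cp <= 0x1FAFF;
}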

How can I "convert" ISO-8859-7 strings to UTF-8 in C++?

I'm working with machines that are 10+ years old and use ISO 8859-7 to represent Greek characters, one byte each.
I need to catch those characters and convert them to UTF-8 in order to inject them in a JSON to be sent via HTTPS.
Also, I'm using GCC v4.4.7 and I don't feel like upgrading, so I can't use codecvt or the like.
Example: "OΛΑ":
I get the char values [0xcf, 0xcb, 0xc1]; I need to write the string "\u039F\u039B\u0391".
PS: I'm not a charset expert so please avoid philosophical answers like "ISO 8859 is a subset of Unicode so you just need to implement the algorithm".
Given that there are so few values to map, a simple solution is to use a lookup table.
Pseudocode:
id_offset = 0x80 // 0x00 .. 0x7F same in UTF-8
c1_offset = 0x20 // 0x80 .. 0x9F control characters
table_offset = id_offset + c1_offset
table = [
u8"\u00A0", // 0xA0
u8"‘", // 0xA1
u8"’",
u8"£",
u8"€",
u8"₯",
// ... Refer to ISO 8859-7 for full list of characters.
]
let S be the input string
let O be an empty output string
for each char C in S
reinterpret C as unsigned char U
if U less than id_offset // same in both encodings
append C to O
else if U less than table_offset // control code
append char '\xC2' to O // lead byte
append char C to O
else
append string table[U - table_offset] to O
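A minimal C++ sketch of that pseudocode (the table is deliberately truncated here and would have to be completed from the ISO 8859-7 code chart; u8 string literals are assumed to have their pre-C++20 const char* type):
#include <string>

std::string iso8859_7_to_utf8(const std::string& in)
{
    // One UTF-8 string per code 0xA0..0xFF; only the first few entries are shown.
    static const char* const table[] = {
        u8"\u00A0", u8"‘", u8"’", u8"£", u8"€", u8"₯", // ... complete from ISO 8859-7
    };
    std::string out;
    for (char c : in) {
        const unsigned char u = static_cast<unsigned char>(c);
        if (u < 0x80) {                // ASCII range: identical in UTF-8
            out += c;
        } else if (u < 0xA0) {         // C1 control codes: 0xC2 lead byte, then the same byte
            out += '\xC2';
            out += c;
        } else {                       // everything else comes from the table
            const unsigned idx = u - 0xA0;
            out += idx < sizeof table / sizeof *table ? table[idx] : "?"; // "?" only because the table is truncated here
        }
    }
    return out;
}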
All that said, I recommend saving some time by using a library instead.
One way could be to use the POSIX libiconv library. On Linux, the functions needed (iconv_open, iconv and iconv_close) are even included in libc, so no extra linkage is needed there. On your old machines you may need to install libiconv, but I doubt it.
Converting may be as simple as this:
#include <iconv.h>
#include <cerrno>
#include <cstring>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <string>
// A wrapper for the iconv functions
class Conv {
public:
// Open a conversion descriptor for the two selected character sets
Conv(const char* to, const char* from) : cd(iconv_open(to, from)) {
if(cd == reinterpret_cast<iconv_t>(-1))
throw std::runtime_error(std::strerror(errno));
}
Conv(const Conv&) = delete;
~Conv() { iconv_close(cd); }
// the actual conversion function
std::string convert(const std::string& in) {
const char* inbuf = in.c_str();
size_t inbytesleft = in.size();
// make the "out" buffer big to fit whatever we throw at it and set pointers
std::string out(inbytesleft * 6, '\0');
char* outbuf = &out[0]; // out.data() is const-qualified before C++17
size_t outbytesleft = out.size();
// the const_cast shouldn't be needed but my "iconv" function declares it
// "char**" not "const char**"
size_t non_rev_converted = iconv(cd, const_cast<char**>(&inbuf),
&inbytesleft, &outbuf, &outbytesleft);
if(non_rev_converted == static_cast<size_t>(-1)) {
// here you can add misc handling like replacing erroneous chars
// and continue converting etc.
// I'll just throw...
throw std::runtime_error(std::strerror(errno));
}
// shrink to keep only what we converted
out.resize(outbuf - out.data());
return out;
}
private:
iconv_t cd;
};
int main() {
Conv cvt("UTF-8", "ISO-8859-7");
// create a string from the ISO-8859-7 data
unsigned char data[]{0xcf, 0xcb, 0xc1};
std::string iso88597_str(std::begin(data), std::end(data));
auto utf8 = cvt.convert(iso88597_str);
std::cout << utf8 << '\n';
}
Output (in UTF-8):
ΟΛΑ
Using this you can create a mapping table, from ISO-8859-7 to UTF-8, that you include in your project instead of iconv:
Demo
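For example, a small sketch (reusing the Conv class above) that dumps such a table as C string literals; the exact output format is just an illustration:
#include <cstdio>

int main() {
    Conv cvt("UTF-8", "ISO-8859-7");
    // One entry per ISO-8859-7 byte 0xA0..0xFF, written as escaped UTF-8 bytes.
    for (int b = 0xA0; b <= 0xFF; ++b) {
        std::string iso(1, static_cast<char>(b));
        std::printf("/* 0x%02X */ \"", b);
        try {
            for (unsigned char c : cvt.convert(iso))
                std::printf("\\x%02X", c);
        } catch (const std::runtime_error&) {
            // a few ISO-8859-7 positions (e.g. 0xAE) are unassigned
        }
        std::printf("\",\n");
    }
}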
OK, I decided to do this myself instead of looking for a compatible library. Here's how I did it.
The main problem was figuring out how to fill the two bytes for Unicode using the single one for ISO, so I used the debugger to read the value of the same character, first written by the old machine and then written with a constant string (UTF-8 by default). I started with "O" and "Π" and saw that in UTF-8 the first byte was always 0xCE, while the second one was filled with the ISO value plus an offset (-0x30).
I built the following code to implement this and used a test string filled with all Greek letters, both upper and lower case. Then I realised that starting from "π" (0xF0 in ISO) both the first byte and the offset for the second one change, so I added a test to figure out which of the two rules to apply.
The following method returns a bool to let the caller know whether the original string contained ISO characters (useful for other purposes) and overwrites the original string, passed as a pointer, with the new one. I worked with char arrays instead of strings for coherence with the rest of the project, which is basically a C project written in C++.
bool iso_to_utf8(char* in){
bool wasISO=false;
if(in == NULL)
return wasISO;
// count chars
int i=strlen(in);
if(!i)
return wasISO;
// create and size new buffer
char *out = new char[2*i + 1]; // worst case every char doubles, plus the terminator
// fill with 0's, useful for watching the string as it gets built
memset(out, 0, 2*i + 1);
// ready to start from head of old buffer
i=0;
// index for new buffer
int j=0;
// for each char in old buffer
while(in[i]!='\0'){
if(in[i] >= 0){
// it's already utf8-compliant, take it as it is
out[j++] = in[i];
}else{
// it's ISO
wasISO=true;
// get plain value
int val = in[i] & 0xFF;
// first byte to CF or CE
out[j++]= val > 0xEF ? 0xCF : 0xCE;
// second char to plain value normalized
out[j++] = val - (val > 0xEF ? 0x70 : 0x30);
}
i++;
}
// add string terminator
out[j]='\0';
// paste into old char array
strcpy(in, out);
delete[] out;
return wasISO;
}
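A minimal usage sketch (hypothetical buffer; note that the caller's buffer must already be large enough to hold the expanded UTF-8 result, because the function copies it back over the input):
char buf[16] = "\xcf\xcb\xc1";     // "ΟΛΑ" in ISO-8859-7, with room to grow
bool wasISO = iso_to_utf8(buf);
printf("%s (contained ISO: %d)\n", buf, wasISO); // prints the UTF-8 text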

Lexicographical sorting for non-ascii characters

I have done lexicographical sorting for ASCII characters with the following code:
std::ifstream infile;
std::string line, new_line;
std::vector<std::string> v;
while(std::getline(infile, line))
{
// If line is empty, ignore it
if(line.empty())
continue;
new_line = line + "\n";
// Line contains string of length > 0 then save it in vector
if(new_line.size() > 0)
v.push_back(new_line);
}
sort(v.begin(), v.end());
The result should be:
a
aahr
abyutrw
bb
bhehjr
cgh
cuttrew
....
But I don't know how to do lexicographical sorting for both ASCII and non-ASCII characters in an order like this: a A À Á Ã brg Baq ckrwg CkfgF d Dgrn... Please tell me how to write code for it. Thank you!
The OP didn't mention it, but I find it worth mentioning: when speaking about non-ASCII characters, the encoding has to be considered as well.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Characters like À, Á, and Â are not part of 7-bit ASCII but are covered by a variety of 8-bit encodings, e.g. Windows-1252. Hence, it's not guaranteed that a certain character (which is not part of ASCII) has the same code point (i.e. number) in every encoding. (Most characters have no code point at all in most encodings.)
However, Unicode provides a single code table containing (I believe) the characters of every other encoding. There are encoding forms such as
UTF-8 where code points are represented by 1 or more 8 bit values (storage with char)
UTF-16 where code points are represented with 1 or 2 16 bit values (storage with std::char16_t or, maybe, wchar_t)
UTF-32 where code points are represented with 1 32 bit value (storage with std::char32_t or, maybe, wchar_t if it has sufficient size).
Concerning the size of wchar_t: Character types.
Having that said, I used wchar_t and std::wstring in my sample to make the usage of umlauts locale and platform independent.
The order used by std::sort() to sort a range of T elements is defined, by default, by bool operator<(const T&, const T&), the < operator for T.
However, there is an overload of std::sort() that takes a custom predicate instead.
The custom predicate must match this signature and must provide a strict weak ordering relation.
Hence my recommendation: use a std::map which maps the characters to an index that results in the intended order.
This is the predicate, I used in my sample:
// sort words
auto charIndex = [&mapChars](wchar_t chr)
{
const CharMap::const_iterator iter = mapChars.find(chr);
return iter != mapChars.end()
? iter->second
: (CharMap::mapped_type)mapChars.size();
};
auto pred
= [&mapChars, &charIndex](const std::wstring &word1, const std::wstring &word2)
{
const size_t len = std::min(word1.size(), word2.size());
for (size_t i = 0; i < len; ++i) {
const wchar_t chr1 = word1[i], chr2 = word2[i];
const unsigned i1 = charIndex(chr1), i2 = charIndex(chr2);
if (i1 != i2) return i1 < i2;
}
return word1.size() < word2.size();
};
std::sort(words.begin(), words.end(), pred);
From bottom to top:
std::sort(words.begin(), words.end(), pred); is called with a third parameter which provides the predicate pred for my customized order.
The lambda pred() compares two std::wstrings character by character.
The comparison is done using a std::map mapChars which maps wchar_t to unsigned, i.e. a character to its rank in my order.
mapChars stores only a selection of all character values. Hence, the character in question might not be found in mapChars. To handle this, a helper lambda charIndex() is used which returns mapChars.size() in that case, which is guaranteed to be higher than all occurring indices.
The type CharMap is simply a typedef:
typedef std::map<wchar_t, unsigned> CharMap;
To initialize a CharMap, a function is used:
CharMap makeCharMap(const wchar_t *table[], size_t size)
{
CharMap mapChars;
unsigned rank = 0;
for (const wchar_t **chars = table; chars != table + size; ++chars) {
for (const wchar_t *chr = *chars; *chr; ++chr) mapChars[*chr] = rank;
++rank;
}
return mapChars;
}
It has to be called with an array of strings which contains all groups of characters in the intended order:
const wchar_t *table[] = {
L"aA", L"äÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
L"oO", L"öÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uU", L"üÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};
The complete sample:
#include <algorithm>
#include <codecvt>
#include <iostream>
#include <locale>
#include <map>
#include <sstream>
#include <string>
#include <vector>
static const wchar_t *table[] = {
L"aA", L"äÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
L"oO", L"öÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uU", L"üÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};
static const wchar_t *tableGerman[] = {
L"aAäÄ", L"bB", L"cC", L"dD", L"eE", L"fF", L"gG", L"hH", L"iI", L"jJ", L"kK", L"lL", L"mM", L"nN",
L"oOöÖ", L"pP", L"qQ", L"rR", L"sS", L"tT", L"uUüÜ", L"vV", L"wW", L"xX", L"yY", L"zZ"
};
typedef std::map<wchar_t, unsigned> CharMap;
// fill a look-up table to map characters to the corresponding rank
CharMap makeCharMap(const wchar_t *table[], size_t size)
{
CharMap mapChars;
unsigned rank = 0;
for (const wchar_t **chars = table; chars != table + size; ++chars) {
for (const wchar_t *chr = *chars; *chr; ++chr) mapChars[*chr] = rank;
++rank;
}
return mapChars;
}
// conversion to UTF-8 found in https://stackoverflow.com/a/7561991/7478597
// needed to print to console
// Please, note: std::codecvt_utf8() is deprecated in C++17. :-(
std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_conv;
// collect words and sort according to the table
void printWordsSorted(
const std::wstring &text, const wchar_t *table[], const size_t size)
{
// make look-up table
const CharMap mapChars = makeCharMap(table, size);
// strip punctuation and other noise
std::wstring textClean;
for (const wchar_t chr : text) {
if (chr == ' ' || mapChars.find(chr) != mapChars.end()) {
textClean += chr;
}
}
// fill word list with sample text
std::vector<std::wstring> words;
for (std::wistringstream in(textClean);;) {
std::wstring word;
if (!(in >> word)) break; // bail out
// store word
words.push_back(word);
}
// sort words
auto charIndex = [&mapChars](wchar_t chr)
{
const CharMap::const_iterator iter = mapChars.find(chr);
return iter != mapChars.end()
? iter->second
: (CharMap::mapped_type)mapChars.size();
};
auto pred
= [&mapChars, &charIndex](const std::wstring &word1, const std::wstring &word2)
{
const size_t len = std::min(word1.size(), word2.size());
for (size_t i = 0; i < len; ++i) {
const wchar_t chr1 = word1[i], chr2 = word2[i];
const unsigned i1 = charIndex(chr1), i2 = charIndex(chr2);
if (i1 != i2) return i1 < i2;
}
return word1.size() < word2.size();
};
std::sort(words.begin(), words.end(), pred);
// remove duplicates
std::vector<std::wstring>::iterator last = std::unique(words.begin(), words.end());
words.erase(last, words.end());
// print result
for (const std::wstring &word : words) {
std::cout << utf8_conv.to_bytes(word) << '\n';
}
}
template<typename T, size_t N>
size_t size(const T (&arr)[N]) { return sizeof arr / sizeof *arr; }
int main()
{
// a sample string
std::wstring sampleText
= L"In the German language the ä (a umlaut), ö (o umlaut) and ü (u umlaut)"
L" have the same lexicographical rank as their counterparts a, o, and u.\n";
std::cout << "Sample text:\n"
<< utf8_conv.to_bytes(sampleText) << '\n';
// sort like requested by OP
std::cout << "Words of text sorted as requested by OP:\n";
printWordsSorted(sampleText, table, size(table));
// sort like correct in German
std::cout << "Words of text sorted as usual in German language:\n";
printWordsSorted(sampleText, tableGerman, size(tableGerman));
}
Output:
Words of text sorted as requested by OP:
a
and
as
ä
counterparts
German
have
In
language
lexicographical
o
ö
rank
same
the
their
u
umlaut
ü
Words of text sorted as usual in German language:
ä
a
and
as
counterparts
German
have
In
language
lexicographical
o
ö
rank
same
the
their
u
ü
umlaut
Live Demo on coliru
Note:
My original intention was to do the output with std::wcout. This didn't work correctly for ä, ö, ü. Hence, I looked up a simple way to convert wstrings to UTF-8. I already knew that UTF-8 is supported in coliru.
@Phil1970 reminded me that I forgot to mention something else:
Sorting of strings (according to “human dictionary” order) is usually provided by std::locale. std::collate provides a locale-dependent lexicographical ordering of strings.
The locale plays a role because the order of characters might vary with distinct locales. The std::collate doc. has a nice example for this:
Default locale collation order: Zebra ar förnamn zebra ängel år ögrupp
English locale collation order: ängel ar år förnamn ögrupp zebra Zebra
Swedish locale collation order: ar förnamn zebra Zebra år ängel ögrupp
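For completeness, a small sketch of locale-based sorting with std::collate (the locale name "sv_SE.UTF-8" is an assumption and must be installed on the system):
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

int main()
{
    // Throws std::runtime_error if the named locale is not available.
    std::locale loc("sv_SE.UTF-8");
    std::vector<std::wstring> words = { L"år", L"ängel", L"zebra", L"Zebra", L"förnamn" };
    const auto& coll = std::use_facet<std::collate<wchar_t>>(loc);
    std::sort(words.begin(), words.end(),
        [&coll](const std::wstring& a, const std::wstring& b) {
            return coll.compare(a.data(), a.data() + a.size(),
                                b.data(), b.data() + b.size()) < 0;
        });
    // words are now roughly in Swedish dictionary order: förnamn zebra Zebra år ängel
}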
Conversion of UTF-16 ⇔ UTF-32 ⇔ UTF-8 can be achieved by mere bit arithmetic. For conversion to/from any other encoding (excluding ASCII, which is a subset of Unicode), I would recommend a library, e.g. libiconv.
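As an illustration of that bit arithmetic, a small sketch encoding a single code point to UTF-8 (no validation of surrogates or out-of-range values):
#include <string>

std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                                        // 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                                // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                              // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                                // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}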

How to convert an integer to a unicode character?

So I wanted to try converting Unicode to an integer for a project of mine. I tried something like this:
unsigned int foo = (unsigned int)L'آ';
std::cout << foo << std::endl;
How do I convert it back? Or, in other words, how do I convert an int to the respective Unicode character?
EDIT: I am expecting the output to be the Unicode character for the integer, for example:
cout << (wchar_t) 1570; // This should print the character with code point 1570 (which is: آ)
I am using Visual Studio 2013 Community with its default compiler, on Windows 10 64-bit Pro.
Cheers
L'آ' will work okay as a single wide character, because it is below 0xFFFF. But in general UTF-16 uses surrogate pairs, so a Unicode code point cannot always be represented with a single wide character. You need a wide string instead.
Your problem also partly has to do with printing a UTF-16 character in the Windows console. If you use MessageBoxW to view a wide string it will work as expected:
wchar_t buf[2] = { 0 };
buf[0] = 1570;
MessageBoxW(0, buf, 0, 0);
However, in general you need a wide string to account for surrogate pairs, not a single wide char. Example:
int utf32 = 1570;
const int mask = (1 << 10) - 1;
std::wstring str;
if(utf32 < 0x10000)
{
str.push_back((wchar_t)utf32);
}
else
{
utf32 -= 0x10000;
int hi = (utf32 >> 10) & mask;
int lo = utf32 & mask;
hi += 0xD800;
lo += 0xDC00;
str.push_back((wchar_t)hi);
str.push_back((wchar_t)lo);
}
MessageBoxW(0, str.c_str(), 0, 0);
See related posts for printing UTF-16 in the Windows console.
The key here is setlocale(LC_ALL, "en_US.UTF-8");. en_US is the locale name, which you may want to set to a different value, like zh_CN for Chinese, for example.
#include <stdio.h>
#include <iostream>
int main() {
setlocale(LC_ALL, "en_US.UTF-8");
// This does not work without setlocale(LC_ALL, "en_US.UTF-8");
for(int ch=30000; ch<30030; ch++) {
wprintf(L"%lc", ch);
}
printf("\n");
return 0;
}
Things to notice here are the use of wprintf and how the format string is given: L"%lc", which tells wprintf to treat the string and the character as wide characters.
If you want to use this method to print some variables, use the type wchar_t.
Useful links:
setlocale
wprintf

One file lib to conv utf8 (char*) to wchar_t?

I am using libjson, which is awesome. The only problem I have is that I need to convert a UTF-8 string (char*) to a wide-character string (wchar_t*). I googled and tried 3 different libs and they ALL failed (due to missing headers).
I don't need anything fancy. Just a one way conversion. How do I do this?
If you're on Windows (which, chances are, you are, given your need for wchar_t), use the MultiByteToWideChar function (declared in windows.h), like so:
int length = MultiByteToWideChar(CP_UTF8, 0, src, src_length, 0, 0);
wchar_t *output_buffer = new wchar_t [length];
MultiByteToWideChar(CP_UTF8, 0, src, src_length, output_buffer, length);
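One caveat: MultiByteToWideChar writes a terminating null only if src_length covers the source's terminator (passing -1 converts the whole null-terminated string and counts the terminator in the returned length); otherwise the buffer above is not null-terminated. The buffer also has to be released with delete[] when no longer needed.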
Alternatively, if a conversion through the current locale's multibyte encoding is good enough for you (it is only correct for UTF-8 input when the current locale is UTF-8, which is improbable on Windows, but possible), use the following (stdlib.h):
wchar_t * output_buffer = new wchar_t [1024];
size_t length = mbstowcs(output_buffer, src, 1024);
if(length == 1024){ // output may have been truncated: query the required size and retry
size_t needed = mbstowcs(NULL, src, 0);
delete[] output_buffer;
output_buffer = new wchar_t[needed + 1];
mbstowcs(output_buffer, src, needed + 1);
}
Hope this helps.
The below successfully enables CreateDirectoryW() to create C:\Users\ПетрКарасев; it is basically an easier-to-understand wrapper around the MultiByteToWideChar mentioned by someone earlier.
std::wstring utf16_from_utf8(const std::string & utf8)
{
// Special case of empty input string
if (utf8.empty())
return std::wstring();
// Step 1: get the length (in wchar_t's) of the resulting UTF-16 string
const int utf16_length = ::MultiByteToWideChar(
CP_UTF8, // convert from UTF-8
0, // default flags
utf8.data(), // source UTF-8 string
utf8.length(), // length (in chars) of source UTF-8 string
NULL, // unused - no conversion done in this step
0 // request size of destination buffer, in wchar_t's
);
if (utf16_length == 0)
{
// Error: conversion failed (e.g. invalid UTF-8 in the input)
DWORD error = ::GetLastError();
throw std::runtime_error("MultiByteToWideChar failed, error " + std::to_string(error));
}
// Step 2: allocate a properly sized destination buffer for the UTF-16 string
std::wstring utf16;
utf16.resize(utf16_length);
// Step 3: do the actual conversion from UTF-8 to UTF-16
if ( ! ::MultiByteToWideChar(
CP_UTF8, // convert from UTF-8
0, // default flags
utf8.data(), // source UTF-8 string
utf8.length(), // length (in chars) of source UTF-8 string
&utf16[0], // destination buffer
utf16.length() // size of destination buffer, in wchar_t's
) )
{
// Error: the conversion itself failed
DWORD error = ::GetLastError();
throw std::runtime_error("MultiByteToWideChar failed, error " + std::to_string(error));
}
return utf16; // done
}
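A usage sketch matching the CreateDirectoryW example mentioned above (the path is only an illustration, and the source file is assumed to be saved as UTF-8, e.g. compiled with /utf-8 on MSVC):
std::wstring wide_path = utf16_from_utf8("C:\\Users\\ПетрКарасев");
if (!::CreateDirectoryW(wide_path.c_str(), NULL))
{
    DWORD error = ::GetLastError(); // e.g. ERROR_ALREADY_EXISTS
}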
Here is a piece of code I wrote. It seems to work well enough. It returns an empty string on a UTF-8 error or when a value is above 0xFFFF (which can't be held by a 16-bit wchar_t).
#include <string>
using namespace std;
std::wstring utf8_to_wchar(const char* utf8){
wstring sz;
wchar_t c;
auto p=utf8;
while(*p!=0){
auto v=(*p);
if(v>=0){
c = v;
sz+=c;
++p;
continue;
}
int shiftCount=0;
if((v&0xE0) == 0xC0){
shiftCount=1;
c = v&0x1F;
}
else if((v&0xF0) == 0xE0){
shiftCount=2;
c = v&0xF;
}
else
return {};
++p;
while(shiftCount){
v = *p;
++p;
if((v&0xC0) != 0x80) return {};
c<<=6;
c |= (v&0x3F);
--shiftCount;
}
sz+=c;
}
return sz;
}
The following (untested) code shows how to convert a multibyte string in your current locale into a wide string. So if your current locale is UTF-8, then this will suit your needs.
const char * inputStr = ... // your UTF-8 input
size_t maxSize = strlen(inputStr) + 1;
wchar_t * outputWStr = new wchar_t[maxSize];
size_t result = mbstowcs(outputWStr, inputStr, maxSize);
if (result == -1) {
cerr << "Invalid multibyte characters in input";
}
You can use setlocale() to set your locale.