How to convert an integer to a unicode character? - c++

I wanted to try converting a Unicode character to an integer for a project of mine. I tried something like this:
unsigned int foo = (unsigned int)L'آ';
std::cout << foo << std::endl;
How do I convert it back? In other words, how do I convert an int to the corresponding Unicode character?
EDIT: I expect the output to be the Unicode character for the given integer, for example:
cout << (wchar_t) 1570 ; // This should print the Unicode character with code point 1570 (which is: آ)
I am using Visual Studio 2013 Community with its default compiler, on Windows 10 Pro 64-bit.
Cheers

L'آ' works as a single wide character because its code point is below 0x10000. But in general UTF-16 includes surrogate pairs, so a Unicode code point cannot always be represented with a single wide character. You need a wide string instead.
Your problem is also partly to do with printing UTF-16 characters in the Windows console. If you use MessageBoxW to view a wide string, it works as expected:
wchar_t buf[2] = { 0 };
buf[0] = 1570;
MessageBoxW(0, buf, 0, 0);
However, in general you need a wide string to account for surrogate pairs, not a single wide char. Example:
int utf32 = 1570;
const int mask = (1 << 10) - 1;
std::wstring str;
if(utf32 < 0x10000) // the whole BMP, including U+FFFF, fits in one code unit
{
    str.push_back((wchar_t)utf32);
}
else
{
    utf32 -= 0x10000;
    int hi = (utf32 >> 10) & mask; // upper 10 bits
    int lo = utf32 & mask; // lower 10 bits
    hi += 0xD800; // high surrogate
    lo += 0xDC00; // low surrogate
    str.push_back((wchar_t)hi);
    str.push_back((wchar_t)lo);
}
MessageBoxW(0, str.c_str(), 0, 0);
See related posts for printing UTF16 in Windows console.
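The same split can be wrapped in a small function. This is only a sketch: the name encode_utf16 is hypothetical, and it uses char16_t and std::u16string (C++11) instead of wchar_t so that it behaves the same on platforms where wchar_t is 32 bits wide. It also rejects values that are not valid code points, which the snippet above does not do.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-16 code units.
// Returns an empty string for values that are not valid Unicode scalar values.
std::u16string encode_utf16(char32_t cp)
{
    std::u16string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return out; // out of range, or a lone surrogate value
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp)); // BMP: a single code unit
    } else {
        cp -= 0x10000; // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits, the resulting code units can be copied into a std::wstring and passed to MessageBoxW as above.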

The key here is setlocale(LC_ALL, "en_US.UTF-8"). en_US is the locale name, which you may want to set to a different value, such as zh_CN for Chinese.
#include <locale.h>
#include <wchar.h>
int main() {
    setlocale(LC_ALL, "en_US.UTF-8");
    // This does not work without setlocale(LC_ALL, "en_US.UTF-8");
    for(int ch=30000; ch<30030; ch++) {
        wprintf(L"%lc", (wint_t)ch);
    }
    wprintf(L"\n"); // stay with wide output: mixing printf and wprintf on one stream is undefined
    return 0;
}
Note the use of wprintf and the format string L"%lc", which tells wprintf to treat both the format string and the character as wide characters.
If you want to use this method to print variables, use the type wchar_t.
Useful links:
setlocale
wprintf

Related

How to get the name of a Unicode character?

I think I saw this a long time ago; a way to get a string containing the name of a unicode character by using Win32 API calls. I'm using C++ Builder so if there is support for it in the VCL library that would work fine too.
For example:
GetUnicodeName(U+0021) would return a string (or fill in a struct or similar), such as "EXCLAMATION MARK".
Or is there some other way to get the same result from Windows with C or C++?
The worst case scenario would be to have a HUGE lookup table with the names of interest (mainly Latin characters).
You can use the undocumented GetUName function from getuname.dll:
std::string GetUnicodeCharacterName(wchar_t character)
{
// https://github.com/reactos/reactos/tree/master/dll/win32/getuname
typedef int(WINAPI* GetUNameFunc)(WORD wCharCode, LPWSTR lpBuf);
static GetUNameFunc pfnGetUName = reinterpret_cast<GetUNameFunc>(::GetProcAddress(::LoadLibraryA("getuname.dll"), "GetUName"));
if (!pfnGetUName)
return {};
std::array<WCHAR, 256> buffer;
int length = pfnGetUName(character, buffer.data());
return utf8::narrow(buffer.data(), length);
}
// Replace invisible code point with code point that is visible
wchar_t ReplaceInvisible(wchar_t character)
{
if (!std::iswgraph(character))
{
if (character <= 0x21)
character += 0x2400; // U+2400 Control Pictures https://www.unicode.org/charts/PDF/U2400.pdf
else
character = 0xFFFD; // REPLACEMENT CHARACTER
}
return character;
}
// Accepts in UTF-8.
// Returns UTF-8 string like this:
// q <U+71 Latin Small Letter Q>
// п <U+43F Cyrillic Small Letter Pe>
// ␈ <U+8 Backspace>
// 𐌸 <U+10338 Supplementary Multilingual Plane>
// 🚒 <U+1F692 Supplementary Multilingual Plane>
std::string GetUnicodeCharacterNames(std::string string)
{
// UTF-8 <=> UTF-32 converter
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf32conv;
// UTF-8 to UTF-32
std::u32string utf32string = utf32conv.from_bytes(string);
std::string characterNames;
characterNames.reserve(35 * utf32string.size());
for (const char32_t& codePoint : utf32string)
{
if (!characterNames.empty())
characterNames.append(", ");
char32_t visibleCodePoint = (codePoint <= 0xFFFF) ? ReplaceInvisible(static_cast<wchar_t>(codePoint)) : codePoint; // <= so that U+FFFF is treated as BMP
std::string charName = (codePoint <= 0xFFFF) ? GetUnicodeCharacterName(static_cast<wchar_t>(codePoint)) : "Supplementary Multilingual Plane";
// UTF-32 to UTF-8
std::string utf8codePoint = utf32conv.to_bytes(&visibleCodePoint, &visibleCodePoint + 1);
characterNames.append(fmt::format("{} <U+{:X} {}>", utf8codePoint, static_cast<uint32_t>(codePoint), charName));
}
return characterNames;
}
The downside is that it only covers characters from the Unicode Basic Multilingual Plane (BMP).
Update: You can use the u_charName() ICU API, which has shipped with Windows since the Fall Creators Update (Version 1709 Build 16299):
std::string GetUCharNameWrapper(char32_t codePoint)
{
typedef int32_t(*u_charNameFunc)(char32_t code, int nameChoice, char* buffer, int32_t bufferLength, int* pErrorCode);
static u_charNameFunc pfnU_charName = reinterpret_cast<u_charNameFunc>(::GetProcAddress(::LoadLibraryA("icuuc.dll"), "u_charName"));
if (!pfnU_charName)
return {};
int errorCode = 0;
std::array<char, 512> buffer;
int32_t length = pfnU_charName(codePoint, 0/*U_UNICODE_CHAR_NAME*/ , buffer.data(), static_cast<int32_t>(buffer.size() - 1), &errorCode);
if (errorCode > 0) // ICU errors are positive; negative values are only warnings
return {};
return std::string(buffer.data(), length);
}

Unicode big endian some character not getting properly from wchar_t array

I am trying to extract individual "Unicode big endian" characters from an array.
The values are read directly from a big-endian Unicode file. I use VS 2015 with the MFC framework (Unicode support).
Values: 𠀐亙𠀃𠀃亙亙𠀐𠀐Val𪛕𨕥
If I read these values from the file into an array and write the unchanged array straight to another text file in Unicode big-endian format, the output is correct. But when I work with individual characters, some of them come out wrong.
The following is written directly in the editor, in a .cpp file:
wchar_t chr[] = {L'𠀐', L'亙', L'𠀃', L'𠀃', L'亙', L'亙', L'𠀐', L'𠀐', L'V', L'a', L'l', L'𪛕', L'𨕥'};
wchar_t chVal = (wchar_t) chr[0]; // getting � or a rectangle mark
if(chVal == L'𠀐')
MessageBox(_T("Show msg")); // results wrong
wchar_t chVal = (wchar_t) chr[1]; // getting 亙 proper element.
if(chVal == L'亙')
MessageBox(_T("Show msg")); // results correct
Similarly, 'V', 'a' and 'l' give correct results.
=======================================
Before this code I placed:
wchar_t* ch = _wsetlocale(LC_ALL, _T("Chinese"));
Is this a problem with _wsetlocale?
In the editor we can type those characters directly, but the results are wrong when debugging or running the exe.
Why are some characters not displayed correctly during debugging or execution?
================
updated:
// wcstring is wchar_t array with unicode characters
CStringW str; wchar_t wh;
System::Text::Encoding^ encodingWr = System::Text::Encoding::BigEndianUnicode;
StreamWriter^ writer = gcnew StreamWriter("Converted.txt", true, encodingWr );
//String^ line = reader->ReadLine();
for(int ct = 0; ct< ctTot; ct++)
{
int ln = wcstring[ct]; // correct number
wh = /*(wchar_t)*/ wcstring[ct]; //wrong
str.Format(_T("UNNUM %d %lc"), ln, wh);
/* https://learn.microsoft.com/en-us/cpp/text/how-to-convert-between-various-string-types?view=vs-2017*/
// Convert a wide character CStringW to a
// System::String.
String ^systemstringw = gcnew String(str);
//systemstringw += " (System::String)";
//Console::WriteLine("{0}", systemstringw);
//delete systemstringw;
writer->WriteLine(systemstringw);
delete systemstringw;
OutputDebugString(str);
}
But I need to write the correct Unicode character to the file.
I would also like to know about any compiler issues involved.

how to print each character of strings that mix ascii character with unicode?

For example, I want to create a typewriter effect, so I need to print strings like this:
#include <cstdio>
#include <string>
int main(){
std::string st1="ab》cd《ef";
for(int i=0;i<st1.size();i++){
std::string st2=st1.substr(0,i).c_str();
printf("%s\n",st2.c_str());
}
return 0;
}
but the output is
a
ab
ab?
ab?
ab》
ab》c
ab》cd
ab》cd?
ab》cd?
ab》cd《
ab》cd《e
and not:
a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e
How can I tell that the upcoming character is a multi-byte (non-ASCII) one?
A similar problem appears when printing each character on its own:
#include <cstdio>
#include <string>
int main(){
std::string st1="ab》cd《ef";
for(int i=0;i<st1.size();i++){
std::string st2=st1.substr(i,1).c_str();
printf("%s\n",st2.c_str());
}
return 0;
}
the output is:
a
b
?
?
?
c
d
?
?
?
e
f
not:
a
b
》
c
d
《
e
f
I think the problem is encoding. Your string is most likely UTF-8 encoded, which means characters have variable size. You cannot iterate one char at a time, because some characters are more than one char wide.
In Unicode, the only encoding in which you can reliably iterate one fixed-size code point at a time is UTF-32.
So what you can do is use a Unicode library like ICU to convert between UTF-8 and UTF-32.
If you have C++11, there are some tools to help you here, mainly std::u32string, which can hold UTF-32 encoded strings:
#include <string>
#include <iostream>
#include <unicode/ucnv.h>
#include <unicode/uchar.h>
#include <unicode/utypes.h>
// convert from UTF-32 to UTF-8
std::string to_utf8(std::u32string s)
{
UErrorCode status = U_ZERO_ERROR;
char target[1024];
int32_t len = ucnv_convert(
"UTF-8", "UTF-32"
, target, sizeof(target)
, (const char*)s.data(), s.size() * sizeof(char32_t)
, &status);
return std::string(target, len);
}
// convert from UTF-8 to UTF-32
std::u32string to_utf32(const std::string& utf8)
{
UErrorCode status = U_ZERO_ERROR;
char32_t target[256];
int32_t len = ucnv_convert(
"UTF-32", "UTF-8"
, (char*)target, sizeof(target)
, utf8.data(), utf8.size()
, &status);
return std::u32string(target, (len / sizeof(char32_t)));
}
int main()
{
// UTF-8 input (needs UTF-8 editor)
std::string utf8 = "ab》cd《ef"; // UTF-8
// convert to UTF-32
std::u32string utf32 = to_utf32(utf8);
// Now it is safe to use string indexing
// But i is for length so starting from 1
for(std::size_t i = 1; i < utf32.size(); ++i)
{
// convert back to UTF-8 for output
// NOTE: i + 1 to include the BOM
std::cout << to_utf8(utf32.substr(0, i + 1)) << '\n';
}
}
Output:
a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e
ab》cd《ef
NOTE:
The ICU library adds a BOM (Byte Order Mark) at the beginning of the strings it converts into Unicode. Therefore you need to deal with the fact that the first character of the UTF-32 string is the BOM. This is why the substring uses i + 1 for its length parameter to include the BOM.
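If pulling in ICU is not an option, the UTF-8 to UTF-32 step can also be written by hand. The following is a minimal sketch, not a replacement for ICU: the name decode_utf8 is hypothetical, malformed input is mapped to U+FFFD, and overlong encodings and encoded surrogates are not rejected. Unlike the ICU converter above, it does not add a BOM.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
// Malformed or truncated sequences become U+FFFD (REPLACEMENT CHARACTER).
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else { out.push_back(0xFFFD); ++i; continue; } // stray continuation or invalid byte
        if (i + len > in.size()) { out.push_back(0xFFFD); break; } // truncated at the end
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; } // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(0xFFFD); ++i; }
    }
    return out;
}
```

Because the result has no BOM, indexing and substr on it count characters directly, starting at 0.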
Your C++ code is simply echoing octets to your terminal, and it is your terminal display that converts octets encoded in its default character set to Unicode characters.
Based on your example, it looks like your terminal display uses UTF-8. The rules for converting UTF-8-encoded characters to Unicode are well specified (Google is your friend), so all you have to do is check the first byte of a UTF-8 sequence to figure out how many octets make up the next Unicode character.
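Checking the first byte can be sketched as follows. The function names utf8_sequence_length and typewriter_prefixes are hypothetical, and the code assumes the input is valid UTF-8; invalid lead bytes are simply treated as one-byte characters so that the loop always advances.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts with `lead`.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // continuation or invalid byte
}

// Hypothetical helper: every prefix of `text` that ends on a character boundary,
// which is what the typewriter effect in the question needs.
std::vector<std::string> typewriter_prefixes(const std::string& text)
{
    std::vector<std::string> prefixes;
    std::size_t pos = 0;
    while (pos < text.size()) {
        pos += utf8_sequence_length(static_cast<unsigned char>(text[pos]));
        if (pos > text.size())
            pos = text.size(); // truncated final sequence
        prefixes.push_back(text.substr(0, pos));
    }
    return prefixes;
}
```

Printing each returned prefix on its own line gives the output the question asks for, without ever cutting a multi-byte character in half.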

Compare std::wstring and std::string

How can I compare a wstring, such as L"Hello", to a string? If I need to have the same type, how can I convert them into the same type?
Since you asked, here's my standard conversion functions from string to wide string, implemented using C++ std::string and std::wstring classes.
First off, make sure to start your program with setlocale:
#include <clocale>
int main()
{
std::setlocale(LC_CTYPE, ""); // before any string operations
}
Now for the functions. First off, getting a wide string from a narrow string:
#include <string>
#include <vector>
#include <cassert>
#include <cstdlib>
#include <cwchar>
#include <cerrno>
#include <iostream>
// Dummy overload
std::wstring get_wstring(const std::wstring & s)
{
return s;
}
// Real worker
std::wstring get_wstring(const std::string & s)
{
const char * cs = s.c_str();
const size_t wn = std::mbsrtowcs(NULL, &cs, 0, NULL);
if (wn == size_t(-1))
{
std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
return L"";
}
std::vector<wchar_t> buf(wn + 1);
const size_t wn_again = std::mbsrtowcs(buf.data(), &cs, wn + 1, NULL);
if (wn_again == size_t(-1))
{
std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
return L"";
}
assert(cs == NULL); // successful conversion
return std::wstring(buf.data(), wn);
}
And going back, making a narrow string from a wide string. I call the narrow string "locale string", because it is in a platform-dependent encoding depending on the current locale:
// Dummy
std::string get_locale_string(const std::string & s)
{
return s;
}
// Real worker
std::string get_locale_string(const std::wstring & s)
{
const wchar_t * cs = s.c_str();
const size_t wn = std::wcsrtombs(NULL, &cs, 0, NULL);
if (wn == size_t(-1))
{
std::cout << "Error in wcsrtombs(): " << errno << std::endl;
return "";
}
std::vector<char> buf(wn + 1);
const size_t wn_again = std::wcsrtombs(buf.data(), &cs, wn + 1, NULL);
if (wn_again == size_t(-1))
{
std::cout << "Error in wcsrtombs(): " << errno << std::endl;
return "";
}
assert(cs == NULL); // successful conversion
return std::string(buf.data(), wn);
}
Some notes:
If you don't have std::vector::data(), you can say &buf[0] instead.
I've found that the r-style conversion functions mbsrtowcs and wcsrtombs don't work properly on Windows. There, you can use the mbstowcs and wcstombs instead: mbstowcs(buf.data(), cs, wn + 1);, wcstombs(buf.data(), cs, wn + 1);
In response to your question, if you want to compare two strings, you can convert both of them to wide string and then compare those. If you are reading a file from disk which has a known encoding, you should use iconv() to convert the file from your known encoding to WCHAR and then compare with the wide string.
Beware, though, that complex Unicode text may have multiple different representations as code point sequences which you may want to consider equal. If that is a possibility, you need to use a higher-level Unicode processing library (such as ICU) and normalize your strings to some common, comparable form.
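The comparison itself can be sketched in a few lines. The helper name equals_wide is hypothetical, the decoding depends on the current C locale (so call setlocale first for non-ASCII text), and the comparison is code-unit by code-unit, with none of the normalization mentioned above. Note that it uses mbsrtowcs, which per the notes above may need replacing with mbstowcs on Windows.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical helper: true when `narrow`, decoded with the current C locale,
// yields exactly the same wide characters as `wide`.
// The narrow string is read up to its first null character.
bool equals_wide(const std::string& narrow, const std::wstring& wide)
{
    std::mbstate_t state = std::mbstate_t(); // initial conversion state
    const char* src = narrow.c_str();
    // A multibyte string never decodes to more wide characters than it has bytes.
    std::vector<wchar_t> buf(narrow.size() + 1);
    const std::size_t n = std::mbsrtowcs(buf.data(), &src, buf.size(), &state);
    if (n == static_cast<std::size_t>(-1))
        return false; // invalid multibyte sequence for this locale
    return std::wstring(buf.data(), n) == wide;
}
```

Converting the narrow side to wide, rather than the other way round, avoids losing characters that the locale's narrow encoding cannot represent.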
You should convert the char string to a wchar_t string using mbstowcs, and then compare the resulting strings. Notice that mbstowcs works on char */wchar *, so you'll probably need to do something like this:
std::wstring StringToWstring(const std::string & source)
{
    std::wstring target(source.size()+1, L' ');
    std::size_t newLength=std::mbstowcs(&target[0], source.c_str(), target.size());
    if (newLength == static_cast<std::size_t>(-1))
        return std::wstring(); // invalid multibyte sequence in source
    target.resize(newLength);
    return target;
}
Writing through &target[0] is well-defined since C++11, which guarantees contiguous storage for std::basic_string; before that it worked in practice but was not guaranteed. There is also an implicit assumption that the converted string won't be longer (in number of wchar_ts) than the number of chars of the original string; this holds because every wide character is produced from at least one byte.
On the other hand, the C standard itself gives no way to ask mbstowcs for the size of the needed buffer (POSIX and the Microsoft CRT return it when the destination is a null pointer, but that is an extension), so either you go this way, or go with (better done and better defined) code from Unicode libraries (be it Windows APIs or libraries like iconv).
Still, keep in mind that comparing Unicode strings without using special functions is slippery ground, two equivalent strings may be evaluated different when compared bitwise.
Long story short: this should work, and I think it's the maximum you can do with just the standard library, but it's a lot implementation-dependent in how Unicode is handled, and I wouldn't trust it a lot. In general, it's just better to stick with an encoding inside your application and avoid this kind of conversions unless absolutely necessary, and, if you are working with definite encodings, use APIs that are less implementation-dependent.
Think twice before doing this — you might not want to compare them in the first place. If you are sure you do and you are using Windows, then convert string to wstring with MultiByteToWideChar, then compare with CompareStringEx.
If you are not using Windows, then the analogous functions are mbstowcs and wcscmp. The standard wide character C++ functions are often not portable under Windows; for instance mbstowcs is deprecated.
The cross-platform way to work with Unicode is to use the ICU library.
Take care to use special functions for Unicode string comparison, don't do it manually. Two Unicode strings could have different characters, yet still be the same.
std::wstring ConvertToUnicode(const std::string & str)
{
    UINT codePage = CP_ACP;
    DWORD flags = 0;
    const int length = static_cast<int>(str.length());
    int resultSize = MultiByteToWideChar
        ( codePage // CodePage
        , flags // dwFlags
        , str.c_str() // lpMultiByteStr
        , length // cbMultiByte
        , NULL // lpWideCharStr
        , 0 // cchWideChar
        );
    std::vector<wchar_t> result(resultSize + 1);
    MultiByteToWideChar
        ( codePage // CodePage
        , flags // dwFlags
        , str.c_str() // lpMultiByteStr
        , length // cbMultiByte
        , &result[0] // lpWideCharStr
        , resultSize // cchWideChar
        );
    return std::wstring(&result[0], resultSize); // explicit length keeps embedded null characters
}

How I can print the wchar_t values to console?

Example:
#include <iostream>
using namespace std;
int main()
{
wchar_t en[] = L"Hello";
wchar_t ru[] = L"Привет"; //Russian language
cout << ru
<< endl
<< en;
return 0;
}
This code only prints hex values that look like addresses.
How do I print the wchar_t string?
Edit: This doesn’t work if you are trying to write text that cannot be represented in your default locale. :-(
Use std::wcout instead of std::cout.
wcout << ru << endl << en;
Can I suggest std::wcout ?
So, something like this:
std::cout << "ASCII and ANSI" << std::endl;
std::wcout << L"INSERT MULTIBYTE WCHAR* HERE" << std::endl;
You might find more information in a related question here.
You cannot portably print wide strings using standard C++ facilities.
Instead you can use the open-source {fmt} library to portably print Unicode text. For example (https://godbolt.org/z/nccb6j):
#include <fmt/core.h>
int main() {
const char en[] = "Hello";
const char ru[] = "Привет";
fmt::print("{}\n{}\n", ru, en);
}
prints
Привет
Hello
This requires compiling with the /utf-8 compiler option in MSVC.
For comparison, writing to wcout on Linux:
wchar_t en[] = L"Hello";
wchar_t ru[] = L"Привет";
std::wcout << ru << std::endl << en;
may transliterate the Russian text into Latin (https://godbolt.org/z/za5zP8):
Privet
Hello
This particular issue can be fixed by switching to a locale that uses UTF-8 but a similar problem exists on Windows that cannot be fixed just with standard facilities.
Disclaimer: I'm the author of {fmt}.
The information available for Windows is very confusing. It helps to learn the C/C++ concepts on Unix/Linux before programming on Windows.
On Windows, wchar_t holds a 16-bit UTF-16 code unit (on Linux it is 32 bits wide), but wprintf() or std::wcout will not print non-English wide characters correctly on their own, because no console outputs UTF-16 directly. Windows outputs in the current locale's code page while Unix/Linux outputs UTF-8; both are multi-byte encodings. So you have to convert wide characters to multi-byte before printing. The wcstombs() function used on Unix doesn't work for this on Windows; use WideCharToMultiByte() instead.
First you need to save the source file as UTF-8 using Notepad or another editor. Then install a font in the command prompt console that can display your language, and change the console code page to UTF-8 by typing "chcp 65001" in the command prompt (Cygwin already defaults to UTF-8). Here is what I did in Thai.
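#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
    const wchar_t* in = L"ทดสอบ"; // Thai language
    char* out = (char *)malloc(15);
    if (!out)
        return 1;
    // -1 tells WideCharToMultiByte that the input is null-terminated
    if (WideCharToMultiByte(874, 0, in, -1, out, 15, NULL, NULL) > 0)
        printf("%s", out); // result is correctly in Thai although not neat
    free(out);
    return 0;
}
Note that
874=(Thai) code page in the operating system, 15=size of the output buffer in bytes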
My suggestion is to avoid printing non-english wide characters to console unless necessary because it is not easy.
#include <clocale>
#include <iostream>
using namespace std;
int main()
{
    setlocale(LC_ALL, "Russian");
    cout << "\tДОБРО ПОЖАЛОВАТЬ В КИНО!\n";
}
The way to do it is to convert UTF-16 LE (Default Windows encoding) into UTF-8, and then print to console (chcp 65001 first, to switch codepage to UTF-8).
It's pretty trivial to convert UTF-16 to UTF-8. Use this page as a guide, if you need more than 2 byte characters.
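// cmd is assumed to point to a null-terminated UTF-16 string
const unsigned short* cmd_s = (const unsigned short*)cmd;
int i = 0;
while(cmd_s[i] != 0)
{
    unsigned short u16 = cmd_s[i++]; // unsigned, so values above 0x7FFF compare correctly
    if(u16 > 0x7F)
    {
        // Two-byte form: only correct for code points up to U+07FF
        unsigned char c0 = (u16 & 0x3F) | 0x80; // Least significant
        unsigned char c1 = ((u16 >> 6) & 0x1F) | 0xC0; // Most significant
        cout << c1 << c0; // Lead byte first, then the continuation byte
    }
    else
    {
        unsigned char c0 = (unsigned char)u16;
        cout << c0;
    }
}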
Of course, you can put it in a function and extend it to handle wider characters (for Cyrillic the two-byte form is enough), but I wanted to show the basic algorithm and to prove that it's not hard at all: you don't need any libraries, just a few lines of code.
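One possible extension to wider characters is sketched below. The name utf16_to_utf8 is hypothetical, it takes a std::u16string rather than a raw pointer, and it replaces unpaired surrogates with U+FFFD; treat it as an illustration of the bit layout rather than a hardened converter.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-16 code units to UTF-8 bytes.
// Unpaired surrogates are replaced with U+FFFD (REPLACEMENT CHARACTER).
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the low surrogate that must follow.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD; // low surrogate without a preceding high surrogate
        }
        if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

The resulting std::string can be written with std::cout once the console code page is UTF-8 (chcp 65001), as described above.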
You could use a normal char array that is actually filled with UTF-8 characters. This should allow mixing characters across languages.
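You can print wide characters with wprintf.
#include <locale.h>
#include <wchar.h>
int main()
{
    setlocale(LC_ALL, ""); // use the environment's locale so non-ASCII text can be converted
    wchar_t en[] = L"Hello";
    wchar_t ru[] = L"Привет"; //Russian language
    wprintf(L"%ls\n", en); // pass the text as an argument, not as the format string
    wprintf(L"%ls\n", ru);
    return 0;
}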
Output:
Hello
Привет