C++: Convert char* with Unicode (UTF-8) symbols to std::wstring [duplicate]

This question already has answers here:
Convert const char* to wstring
(7 answers)
Closed 7 years ago.
I've run into a pretty common problem, but I can't find a solution for it. The tinyxml2 library returns a const char* from its Attribute(const char*) method. The XML file I open with that library has attributes both with and without Unicode characters. The file is encoded as UTF-8. I'm on Linux, but it would be nice to see a solution for Windows too. Any suggestions?

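Like this (Windows only, since MultiByteToWideChar is a Win32 API function):
#include <windows.h>
#include <string>
std::wstring utf8_decode(const std::string &str)
{
    if (str.empty()) return std::wstring();
    // First call: ask how many wide characters the result needs.
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
    std::wstring wstrTo(size_needed, 0);
    // Second call: convert into the buffer allocated above.
    MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
    return wstrTo;
}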
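The answer above relies on MultiByteToWideChar, which exists only on Windows, while the question asks about Linux. As a rough, portable sketch, a UTF-8 decoder can be written by hand. The function name utf8_to_wstring is hypothetical (it is not part of tinyxml2 or the standard library), and the code assumes that wchar_t holds UTF-32 when it is 32 bits wide (as on Linux) and UTF-16 when it is 16 bits wide (as on Windows).

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes a null-terminated UTF-8 string into a std::wstring.
// Throws std::invalid_argument on malformed input instead of guessing.
std::wstring utf8_to_wstring(const char* s)
{
    std::wstring out;
    if (!s) return out;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const std::uint32_t min_cp[] = {0, 0x80, 0x800, 0x10000};
    while (*p) {
        std::uint32_t cp;
        int extra; // number of continuation bytes that must follow
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        ++p;
        for (int i = 0; i < extra; ++i, ++p) {
            // A terminating '\0' also fails this check, so we never read past the end.
            if ((*p & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (*p & 0x3F);
        }
        // Reject overlong encodings, surrogates and values beyond U+10FFFF.
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");
        if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
            out.push_back(static_cast<wchar_t>(cp));
        } else {
            // 16-bit wchar_t: emit a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With tinyxml2, the const char* returned by Attribute(...) could be passed straight to this function, after checking it for a null pointer if the attribute may be missing.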

Related

BSTR conversion to UTF-8

I'm working with UIAutomation and I'm struggling with the localized BSTRs. I'm in Germany, so there are some special characters that are represented oddly in the BSTRs. I'm logging the information and need it in UTF-8 for later processing.
I have already tried every version of the answers I could find that use WideCharToMultiByte, but that just converts the odd character into an even odder one. I'd really appreciate it if anyone could tell me what I'm doing wrong; it's really bugging me.
I tried both of the following versions and got the same result both times (the upper one is the converted one, the lower one the original):
The first word should be "Schaltfläche" and the second "Fünf".
The code I tried:
BSTR* origin;
_bstr_t originWrapper(*origin);
char* originChar = originWrapper;
size_t len = strlen(originChar) + 1;
int room = MultiByteToWideChar(CP_ACP, 0, originChar, -1, NULL, 0);
wchar_t* unicodeString = (wchar_t*)malloc((sizeof(wchar_t))*room);
MultiByteToWideChar(CP_ACP, 0, originChar, -1, unicodeString, room);
int size_needed = WideCharToMultiByte(CP_UTF8, 0, unicodeString, -1, NULL, 0, NULL, NULL);
char* utf8Char = (char*) malloc(size_needed);
WideCharToMultiByte(CP_UTF8, 0, unicodeString, -1, utf8Char, size_needed, NULL, NULL);
and
BSTR* origin;
_bstr_t originWrapper(*origin);
int size_needed = WideCharToMultiByte(CP_UTF8, 0, originWrapper, SysStringByteLen(*origin), NULL, 0, NULL, NULL);
std::string resultingString(size_needed, 0);
WideCharToMultiByte(CP_UTF8, 0, *origin, SysStringByteLen(*origin), &resultingString[0], size_needed, NULL, NULL);
A BSTR is a pointer to UTF-16 (WCHAR) character data, preceded by the string length. So your round trip through narrow strings is misguided; you should call WideCharToMultiByte directly:
std::string BSTRtoUTF8(BSTR bstr) {
    int len = (int)SysStringLen(bstr);
    // special case because a NULL BSTR is a valid zero-length BSTR,
    // but regular string functions would balk on it
    if (len == 0) return "";
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, bstr, len, NULL, 0, NULL, NULL);
    std::string ret(size_needed, '\0');
    // &ret[0] is a writable buffer; ret.data() is const before C++17
    WideCharToMultiByte(CP_UTF8, 0, bstr, len, &ret[0], (int)ret.size(), NULL, NULL);
    return ret;
}
To check the validity of the conversion, don't print the result to the console, because it doesn't support UTF-8 output by default (it interprets narrow strings not even in CP_ACP but in CP_OEMCP, go figure). Instead, write the output to a file and check it with a reliable editor that supports UTF-8.
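Another way to check the result without trusting any console or editor is to look at the raw bytes. The sketch below uses a hypothetical helper, hex_dump, that is not part of the answer; it only assumes the converted text is held in a std::string.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: renders every byte of a string as two lowercase hex
// digits separated by spaces, so the encoding can be checked by eye.
std::string hex_dump(const std::string& bytes)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned value = static_cast<unsigned char>(bytes[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

For a correctly converted "Fünf" the dump should read 46 c3 bc 6e 66, because "ü" (U+00FC) is encoded in UTF-8 as the two bytes c3 bc. A single fc byte in that position would mean the text is still in a Windows-1252 style code page.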

How to show Cyrillic text in MFC Multi-Byte application?

I am new to C++ and MFC. The main problem is that I have an MFC project that needs to be translated into Russian. I can see that the best option is to switch the project to Unicode, but I can't, because it is a huge project and when I make the switch I get more than 4000 errors. Later we will move all the code to Unicode, but for now I just need to show Cyrillic on the buttons and in a CListBox.
So the main question is: how do I display Cyrillic in a Multi-Byte build?
Thanks guys!
P.S. Sorry, let me be more explicit about what I tried:
Using Russian locales:
setlocale(LC_ALL, "russian_russia.1251");
setlocale(LC_CTYPE, "rus");
But that didn't work; it shows question marks.
I also tried converting with the WideCharToMultiByte function, but it shows characters that seem to be encoded incorrectly.
std::string utf8_encode(const std::wstring &wstr)
{
if (wstr.empty()) return std::string();
int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
std::string strTo(size_needed, 0);
WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
return strTo;
}
wchar_t* wch = L"Привет";
std::string ch = utf8_encode(wch);
m_wndOutputBuild.AddString(ch.c_str()); //OUTPUT Привет
P.P.S. Now I call it like this:
setlocale(LC_ALL, "russian_russia.1251");
std::wstring wch = L"Привет";
std::string ch = encode_1251(wch);
m_wndOutputBuild.AddString(ch.c_str()); //OUTPUT Ïðèâåò
and Function:
std::string encode_1251(const std::wstring &wstr)
{
if (wstr.empty()) return std::string();
int size_needed = WideCharToMultiByte(1251, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
std::string strTo(size_needed, 0);
WideCharToMultiByte(1251, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
return strTo;
}
I found here that the Windows-1251 code page identifier is passed to WideCharToMultiByte like this.
In your utf8_encode function, when converting your Unicode UTF-16 string to a std::string, you passed CP_UTF8 to WideCharToMultiByte. Then you take the returned UTF-8 std::string, and pass it via .c_str() to the CListBox::AddString method.
However, if your application is in MBCS Cyrillic, you should convert from UTF-16 to your Cyrillic code page, instead of UTF-8, and pass the strings encoded in your Cyrillic code page to your MFC class methods, like CListBox::AddString.
In other words, you may want to substitute your utf8_encode function with a cyrillic_encode function, that takes UTF-16 text as input, and converts it to your Cyrillic code page:
// Convert from Unicode UTF-16 to Cyrillic code page
std::string cyrillic_encode(const std::wstring &utf16)
And then pass the returned string to the MFC class methods of interest, e.g.:
// From Unicode UTF-16 to Cyrillic code page
std::string cyrillic_text = cyrillic_encode(wch);
// Show Cyrillic-encoded "MBCS" text
m_wndOutputBuild.AddString(cyrillic_text.c_str());
Moreover, as correctly pointed out by @IInspectable in the comments, consider adding proper error checking to your conversion functions. In general, there can be UTF-16 text that cannot be encoded in a Cyrillic code page, because the characters such a code page can represent are a proper subset of Unicode.
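To illustrate why error checking matters, here is a small portable sketch that is deliberately not the Win32 route: a hand-written, partial Windows-1251 encoder. The name encode_1251_checked and its bool-plus-output-parameter shape are assumptions made for this example, and the mapping covers only ASCII, the basic Russian alphabet (U+0410 to U+044F) and Ё/ё. In real code on Windows you would keep calling WideCharToMultiByte with code page 1251 and check its return value and its lpUsedDefaultChar output parameter.

```cpp
#include <string>

// Hypothetical, partial Windows-1251 encoder for illustration only.
// Returns false (and clears `out`) as soon as a character has no mapping
// in the subset handled here, instead of silently producing '?'.
bool encode_1251_checked(const std::wstring& in, std::string& out)
{
    out.clear();
    for (wchar_t wc : in) {
        unsigned long cp = static_cast<unsigned long>(wc);
        if (cp < 0x80) {
            out += static_cast<char>(cp);                 // ASCII maps to itself
        } else if (cp >= 0x0410 && cp <= 0x044F) {
            out += static_cast<char>(cp - 0x0410 + 0xC0); // А..я -> 0xC0..0xFF
        } else if (cp == 0x0401) {
            out += static_cast<char>(0xA8);               // Ё
        } else if (cp == 0x0451) {
            out += static_cast<char>(0xB8);               // ё
        } else {
            out.clear();
            return false;                                 // not representable here
        }
    }
    return true;
}
```

For "Привет" this yields the bytes CF F0 E8 E2 E5 F2. Those are the same bytes that appear as "Ïðèâåò" when they are interpreted in a Western European code page, which suggests the question's second attempt encoded the text correctly but the control displayed it using a non-Cyrillic code page.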

Issues Converting wstring to TCHAR [duplicate]

This question already has answers here:
How to convert std::wstring to a TCHAR*?
(6 answers)
Closed 10 years ago.
I'm fairly new to programming, and I'm trying to write a program where a user inputs a date, then that date is added to the file directory name, then that file directory is searched.
Here is what I'm working with below. I have a number of functions to do this. I've searched online and tried doing the conversion a few different ways, and I'm just not understanding it, so I left off with a static_cast (which I know is incorrect).
Maybe I'm just not doing the conversion right... basically this will throw it back to a function that uses the WINAPI handler. Whether I can get that to work is a completely different story... Thanks in advance for any help!
wstring fDate;
wstring fileDin;
const TCHAR* s = _T (fileDin);
std::wstring(fDate);
std::wstring(fileDin) =L"Z:\\software\\A\\AC\\" + fDate;
wcout<< fileDin;
cout <<endl;
//wstring fileDin(&arc[1]);
fileDin = static_cast<TCHAR>(arc[1]);
dir(2, arc);
TCHAR can be either wchar_t (when you use Unicode) or char (when you use Multi-byte).
On the other hand std::wstring always contains characters of type wchar_t, so it's better if you use wchar_t* directly instead of TCHAR* (if possible).
Then wchar_t* to std::wstring conversion can be done by using constructor of std::wstring:
const wchar_t* wcstr = L"my string";
std::wstring wstr(wcstr);
and std::wstring to const wchar_t* by simply calling the c_str() method (the pointer is valid only while wstr is alive and unmodified):
const wchar_t* wcstr = wstr.c_str();
Then sometimes you might need to convert between "wide" strings (std::wstring objects holding wchar_t characters) and multi-byte strings (std::string objects holding chars). I usually use the following helpers:
// multi-byte (UTF-8) to wide char:
std::wstring s2ws(const std::string& str)
{
    if (str.empty()) return std::wstring();
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
    std::wstring wstrTo(size_needed, 0);
    MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
    return wstrTo;
}
// wide char to multi-byte (UTF-8, so that it round-trips with s2ws;
// the length excludes the terminator so no '\0' ends up inside the result):
std::string ws2s(const std::wstring& wstr)
{
    if (wstr.empty()) return std::string();
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), (int)wstr.length(), NULL, 0, NULL, NULL);
    std::string strTo(size_needed, 0);
    WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), (int)wstr.length(), &strTo[0], size_needed, NULL, NULL);
    return strTo;
}
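Applied to the question itself, no cast is needed at all: the directory name can be built with ordinary std::wstring concatenation, and c_str() then yields the const wchar_t* that a wide-character API expects (in a Unicode build, a const TCHAR* is exactly that). The helper name make_search_dir is hypothetical; the base path is the one from the question.

```cpp
#include <string>

// Hypothetical helper: appends the user-supplied date to the base directory
// from the question and returns the full path as a wide string.
std::wstring make_search_dir(const std::wstring& date)
{
    return L"Z:\\software\\A\\AC\\" + date;
}
```

A caller would keep the returned std::wstring in a named variable and pass its c_str() to the search function, so the pointer stays valid for the duration of the call.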

Handling UTF-8 encoded strings between std::wstring and std::string

I am using two libraries: one stores UTF-8 strings in std::wstring, and the other stores UTF-8 strings in std::string.
What is the best and most efficient way to pass strings between the two libraries?
I am currently on Windows using Visual C++ v9 Express, but I would prefer a portable solution.
Assuming you mean UTF-16 and not UTF-8 for std::wstring, you will have to encode/decode the strings when passing them from one library to the other. I'm not sure if/what the STL provides for that, but you can use Windows's own MultiByteToWideChar() and WideCharToMultiByte() functions to convert between UTF-8 and UTF-16 with just a few lines of code. You could then wrap that in your own functions so you can replace the logic when you find something more portable, e.g.:
std::wstring Utf8ToUtf16(const std::string &s)
{
    std::wstring ret;
    int len = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), (int)s.length(), NULL, 0);
    if (len > 0)
    {
        ret.resize(len);
        // write through &ret[0]; casting away const on c_str() is not guaranteed to be safe
        MultiByteToWideChar(CP_UTF8, 0, s.c_str(), (int)s.length(), &ret[0], len);
    }
    return ret;
}
std::string Utf16ToUtf8(const std::wstring &s)
{
    std::string ret;
    int len = WideCharToMultiByte(CP_UTF8, 0, s.c_str(), (int)s.length(), NULL, 0, NULL, NULL);
    if (len > 0)
    {
        ret.resize(len);
        WideCharToMultiByte(CP_UTF8, 0, s.c_str(), (int)s.length(), &ret[0], len, NULL, NULL);
    }
    return ret;
}
Consider ICU. It is portable and has a lot of converters between encodings
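If pulling in ICU is too heavy, the std::wstring to UTF-8 direction can also be sketched by hand in portable C++. The function name wstring_to_utf8 is hypothetical, and the code assumes that wchar_t holds UTF-16 when it is 16 bits wide (Windows) and UTF-32 when it is 32 bits wide (typical on Linux). This is a sketch under those assumptions, not a replacement for a full conversion library.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes a std::wstring as UTF-8.
// Surrogate pairs are combined into one code point; a lone surrogate
// or a value above U+10FFFF raises std::invalid_argument.
std::string wstring_to_utf8(const std::wstring& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(in[i]);
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 1 >= in.size())
                throw std::invalid_argument("truncated surrogate pair");
            std::uint32_t lo = static_cast<std::uint32_t>(in[i + 1]);
            if (lo < 0xDC00 || lo > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            ++i;
        } else if ((cp >= 0xDC00 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            throw std::invalid_argument("invalid code point");
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Keeping the conversion behind one function like this matches the suggestion above: the body can later be swapped for ICU or another library without touching the callers.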

How do I use MultiByteToWideChar?

I want to convert a normal string to a wstring. For this, I am trying to use the Windows API function MultiByteToWideChar.
But it does not work for me.
Here is what I have done:
string x = "This is c++ not java";
wstring Wstring;
MultiByteToWideChar( CP_UTF8 , 0 , x.c_str() , x.size() , &Wstring , 0 );
The last line produces the compiler error:
'MultiByteToWideChar' : cannot convert parameter 5 from 'std::wstring *' to 'LPWSTR'
How do I fix this error?
Also, what should be the value of the argument cchWideChar? Is 0 okay?
You must call MultiByteToWideChar twice:
1. The first call to MultiByteToWideChar is used to find the buffer size you need for the wide string. Look at Microsoft's documentation; it states:
   "If the function succeeds and cchWideChar is 0, the return value is the required size, in characters, for the buffer indicated by lpWideCharStr."
   Thus, to make MultiByteToWideChar give you the required size, pass 0 as the value of the last parameter, cchWideChar. You should also pass NULL as the one before it, lpWideCharStr.
2. Obtain a non-const buffer large enough to accommodate the wide string, using the buffer size from the previous step. Pass this buffer to another call to MultiByteToWideChar. This time, the last argument should be the actual size of the buffer, not 0.
A sketchy example:
int wchars_num = MultiByteToWideChar( CP_UTF8 , 0 , x.c_str() , -1, NULL , 0 );
wchar_t* wstr = new wchar_t[wchars_num];
MultiByteToWideChar( CP_UTF8 , 0 , x.c_str() , -1, wstr , wchars_num );
// do whatever with wstr
delete[] wstr;
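Also, note the use of -1 as the cbMultiByte argument. This makes the function process the input up to and including its terminating null character, so the resulting string is null-terminated and you don't have to add the terminator yourself.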
A few common conversions:
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <string>
// Note: both calls in each function must use the same source length. Passing -1 in
// the second call would also convert the terminator and overflow the sized buffer.
std::string ConvertWideToANSI(const std::wstring& wstr)
{
    int count = WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), (int)wstr.length(), NULL, 0, NULL, NULL);
    std::string str(count, 0);
    WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), (int)wstr.length(), &str[0], count, NULL, NULL);
    return str;
}
std::wstring ConvertAnsiToWide(const std::string& str)
{
    int count = MultiByteToWideChar(CP_ACP, 0, str.c_str(), (int)str.length(), NULL, 0);
    std::wstring wstr(count, 0);
    MultiByteToWideChar(CP_ACP, 0, str.c_str(), (int)str.length(), &wstr[0], count);
    return wstr;
}
std::string ConvertWideToUtf8(const std::wstring& wstr)
{
    int count = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), (int)wstr.length(), NULL, 0, NULL, NULL);
    std::string str(count, 0);
    WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), (int)wstr.length(), &str[0], count, NULL, NULL);
    return str;
}
std::wstring ConvertUtf8ToWide(const std::string& str)
{
    int count = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), (int)str.length(), NULL, 0);
    std::wstring wstr(count, 0);
    MultiByteToWideChar(CP_UTF8, 0, str.c_str(), (int)str.length(), &wstr[0], count);
    return wstr;
}
You can try the solution below. I tested it: it handles special characters (for example º ä ç á) and works on Windows 2000 with SP4 and later, Windows XP, and Windows 7, 8, 8.1 and 10.
Using std::wstring instead of new wchar_t[] / delete[] reduces the risk of resource leaks, buffer overflows and heap corruption.
dwFlags is set to MB_ERR_INVALID_CHARS so that the call fails on invalid input. On Windows 2000 with SP4 and later and on Windows XP, if this flag is not set, the function silently drops illegal code points.
std::wstring ConvertStringToWstring(const std::string &str)
{
if (str.empty())
{
return std::wstring();
}
int num_chars = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, str.c_str(), str.length(), NULL, 0);
std::wstring wstrTo;
if (num_chars)
{
wstrTo.resize(num_chars);
if (MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, str.c_str(), str.length(), &wstrTo[0], num_chars))
{
return wstrTo;
}
}
return std::wstring();
}
Second question about this, this morning!
WideCharToMultiByte() and MultiByteToWideChar() are a pain to use. Each conversion requires two calls to the routines and you have to look after allocating/freeing memory and making sure the strings are correctly terminated. You need a wrapper!
I have a convenient C++ wrapper on my blog, here, which you are welcome to use.
Here's the other question this morning
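The function cannot take a pointer to a C++ string object. It expects a pointer to a buffer of wide characters of sufficient size, which you must allocate yourself. You must also pass the size of that buffer as the last argument; passing 0 there only queries the required size and converts nothing.
string x = "This is c++ not java";
wstring Wstring;
// Upper bound: the UTF-16 result never has more code units than the UTF-8 input has bytes.
Wstring.resize(x.size());
int c = MultiByteToWideChar( CP_UTF8 , 0 , x.c_str() , (int)x.size() , &Wstring[0], (int)Wstring.size() );
// Trim to the number of wide characters actually written.
Wstring.resize(c);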