BSTR conversion to UTF-8 - c++

I'm working with UIAutomation and I'm struggling with the localized BSTRs. I'm in Germany, so there are some special characters that are represented funny in the BSTRs. I'm logging the information and need to have them in UTF-8 to process later on.
I tried already every version of the answers that I could find regarding to WideCharToMultiByte, but that's just converting the funny character into an even funnier one. I'd really appreciate if anyone could tell me what I'm doing wrong, it's really bugging me.
So I tried both of the following versions and got both times this result (the upper one is the converted one, the lower the original one):
The first word should be "Schaltfläche" and the second "Fünf".
My tried code:
BSTR* origin;
_bstr_t originWrapper(*origin);
char* originChar = originWrapper;
size_t len = strlen(originChar) + 1;
int room = MultiByteToWideChar(CP_ACP, 0, originChar, -1, NULL, 0);
wchar_t* unicodeString = (wchar_t*)malloc((sizeof(wchar_t))*room);
MultiByteToWideChar(CP_ACP, 0, originChar, -1, unicodeString, room);
int size_needed = WideCharToMultiByte(CP_UTF8, 0, unicodeString, -1, NULL, 0, NULL, NULL);
char* utf8Char = (char*) malloc(size_needed);
WideCharToMultiByte(CP_UTF8, 0, unicodeString, -1, utf8Char, size_needed, NULL, NULL);
and
BSTR* origin;
_bstr_t originWrapper(*origin);
int size_needed = WideCharToMultiByte(CP_UTF8, 0, originWrapper, SysStringByteLen(*origin), NULL, 0, NULL, NULL);
std::string resultingString(size_needed, 0);
WideCharToMultiByte(CP_UTF8, 0, *origin, SysStringByteLen(*origin), &resultingString[0], size_needed, NULL, NULL);

BSTR is a pointer to UTF-16 (WCHAR) character data, preceded by the string length. So, your roundtrip through narrow strings is misguided, you should straight use WideCharToMultiByte:
std::string BSTRtoUTF8(BSTR bstr) {
int len = SysStringLen(bstr);
// special case because a NULL BSTR is a valid zero-length BSTR,
// but regular string functions would balk on it
if(len == 0) return "";
int size_needed = WideCharToMultiByte(CP_UTF8, 0, bstr, len, NULL, 0, NULL, NULL);
std::string ret(size_needed, '\0');
WideCharToMultiByte(CP_UTF8, 0, bstr, len, ret.data(), ret.size(), NULL, NULL);
return ret;
}
To check the validity of the conversion don't output the result to the console, as it doesn't support UTF-8 output by default (it interprets narrow strings not even as in CP_ACP, but in CP_OEM, go figure). Instead, write the output to a file and check it with a reliable editor supporting UTF-8.

Related

Serialize string to UTF w/o BOM

I'm trying to code unicode string serialization to UTF-8 w/o BOM file. For some reason the code below gives wrong output.
static void MyWriteFile(HANDLE hFile, PTCHAR pszText, int cchLen, BOOL bAsUnicode)
{
DWORD dwBytes;
size_t utf8len = WideCharToMultiByte(CP_UTF8, 0, pszText, -1, NULL, 0, NULL, NULL);
PCHAR pszConverted = (PCHAR)LocalAlloc(LPTR, utf8len);
WideCharToMultiByte(CP_UTF8, 0, pszText, utf8len, pszConverted, utf8len, 0, 0);
WriteFile(hFile, pszConverted, utf8len, &dwBytes, NULL);
}
WideCharToMultiByte(CP_UTF8, 0, pszText, utf8len, pszConverted, utf8len, 0, 0);
The fourth parameter of WideCharToMultiByte (cchWideChar) is the size of the input string. You should leave that to -1 as it is null terminated. Otherwise your output buffer is probably not large enough, and it will contain too much data.

Issues Converting wstring to TCHAR [duplicate]

This question already has answers here:
How to convert std::wstring to a TCHAR*?
(6 answers)
Closed 10 years ago.
I'm fairly new to programming, and I'm trying to write a program where a user inputs a date, then that date is added to the file directory name, then that file directory is searched.
Here is what I'm working with below. I have a number of functions to do this.. I've searched online and tried doing the conversion a few different ways and I'm just not understanding it.... so I left off with (what I know is incorrected) a static_cast.
Maybe I'm just not doing the conversion right... basically this will throw it back to a function that uses the WINAPI handler. Whether I can get that to work is a completely different story... Thanks in advance for any help!
wstring fDate;
wstring fileDin;
const TCHAR* s = _T (fileDin);
std::wstring(fDate);
std::wstring(fileDin) =L"Z:\\software\\A\\AC\\" + fDate;
wcout<< fileDin;
cout <<endl;
//wstring fileDin(&arc[1]);
fileDin = static_cast<TCHAR>(arc[1]);
dir(2, arc);
TCHAR can be either wchar_t (when you use Unicode) or char (when you use Multi-byte).
On the other hand std::wstring always contains characters of type wchar_t, so it's better if you use wchar_t* directly instead of TCHAR* (if possible).
Then wchar_t* to std::wstring conversion can be done by using constructor of std::wstring:
wchar_t* wcstr = L"my string";
std::wstring wstr(wcstr);
and std::wstring to wchar_t* by simple calling c_str() method:
wchar_t* wcstr = wstr.c_str();
Then sometimes you might need to convert between "wide" strings (std::wstrings holding wchar_t characaters) and multi-byte strings (std::strings holding chars). I usually use following helpers:
// multi byte to wide char:
std::wstring s2ws(const std::string& str)
{
int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
std::wstring wstrTo(size_needed, 0);
MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
return wstrTo;
}
// wide char to multi byte:
std::string ws2s(const std::wstring& wstr)
{
int size_needed = WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), int(wstr.length() + 1), 0, 0, 0, 0);
std::string strTo(size_needed, 0);
WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), int(wstr.length() + 1), &strTo[0], size_needed, 0, 0);
return strTo;
}

How can I fix this code?

This code does not work as expected and I don't know what's wrong.
szTeam should change, but doesn't.
Could anyone explain this?
-----------------------------------------------------
WCHAR szTeam[MAX_PATH] = L"\u7F57\u5207\u8FBE\u5C14\u6D41\u6D6A";
char szMsg[MAX_PATH];
sprintf(szMsg , "%s" , WideStringToMultiByte(szTeam));
swprintf( szTeam , L"%s" , MultiByteToWideString(szMsg));
......
WCHAR* MultiByteToWideString(const char* szSrc)
{
int iSizeOfStr = MultiByteToWideChar(CP_ACP, 0, szSrc, -1, NULL, 0);
wchar_t* wszTgt = new wchar_t[iSizeOfStr];
if(!wszTgt)
return (NULL);
MultiByteToWideChar(CP_ACP, 0, szSrc, -1, wszTgt, iSizeOfStr);
return(wszTgt);
}
char* WideStringToMultiByte(const wchar_t* wszSrc)
{
int iSizeOfStr = WideCharToMultiByte(CP_UTF8, 0, wszSrc, -1, NULL, 0, NULL, NULL);
char* szTgt = new char[iSizeOfStr];
if(!szTgt)
return(NULL);
WideCharToMultiByte(CP_UTF8, 0, wszSrc, -1, szTgt, iSizeOfStr, NULL, NULL);
return szTgt;
}
-----------------------------------------------------
Erm, no szTeam does change. Into something unrecognizable, mojibake. You start out with "罗切达尔流浪" and convert it from its utf-16 encoding to utf-8. That works fine. The debugger won't show you anything recognizable because it neither knows nor cares that szMsg is encoded in utf-8.
You then go wrong though, you are converting that utf-8 string with CP_ACP. Which says that the string is encoded in the default system code page. It is not, it was encoded in utf-8.
Fix your problem with:
WCHAR* MultiByteToWideString(const char* szSrc)
{
int iSizeOfStr = MultiByteToWideChar(CP_UTF8, 0, szSrc, -1, NULL, 0);
// etc..
}
And now szTeam won't change because the string got properly converted back.

How do I use MultiByteToWideChar?

I want to convert a normal string to a wstring. For this, I am trying to use the Windows API function MultiByteToWideChar.
But it does not work for me.
Here is what I have done:
string x = "This is c++ not java";
wstring Wstring;
MultiByteToWideChar( CP_UTF8 , 0 , x.c_str() , x.size() , &Wstring , 0 );
The last line produces the compiler error:
'MultiByteToWideChar' : cannot convert parameter 5 from 'std::wstring *' to 'LPWSTR'
How do I fix this error?
Also, what should be the value of the argument cchWideChar? Is 0 okay?
You must call MultiByteToWideChar twice:
The first call to MultiByteToWideChar is used to find the buffer size you need for the wide string. Look at Microsoft's documentation; it states:
If the function succeeds and cchWideChar is 0, the return value is the required size, in characters, for the buffer indicated by lpWideCharStr.
Thus, to make MultiByteToWideChar give you the required size, pass 0 as the value of the last parameter, cchWideChar. You should also pass NULL as the one before it, lpWideCharStr.
Obtain a non-const buffer large enough to accommodate the wide string, using the buffer size from the previous step. Pass this buffer to another call to MultiByteToWideChar. And this time, the last argument should be the actual size of the buffer, not 0.
A sketchy example:
int wchars_num = MultiByteToWideChar( CP_UTF8 , 0 , x.c_str() , -1, NULL , 0 );
wchar_t* wstr = new wchar_t[wchars_num];
MultiByteToWideChar( CP_UTF8 , 0 , x.c_str() , -1, wstr , wchars_num );
// do whatever with wstr
delete[] wstr;
Also, note the use of -1 as the cbMultiByte argument. This will make the resulting string null-terminated, saving you from dealing with them.
Few common conversions:
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <string>
std::string ConvertWideToANSI(const std::wstring& wstr)
{
int count = WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), wstr.length(), NULL, 0, NULL, NULL);
std::string str(count, 0);
WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), -1, &str[0], count, NULL, NULL);
return str;
}
std::wstring ConvertAnsiToWide(const std::string& str)
{
int count = MultiByteToWideChar(CP_ACP, 0, str.c_str(), str.length(), NULL, 0);
std::wstring wstr(count, 0);
MultiByteToWideChar(CP_ACP, 0, str.c_str(), str.length(), &wstr[0], count);
return wstr;
}
std::string ConvertWideToUtf8(const std::wstring& wstr)
{
int count = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), wstr.length(), NULL, 0, NULL, NULL);
std::string str(count, 0);
WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, &str[0], count, NULL, NULL);
return str;
}
std::wstring ConvertUtf8ToWide(const std::string& str)
{
int count = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), str.length(), NULL, 0);
std::wstring wstr(count, 0);
MultiByteToWideChar(CP_UTF8, 0, str.c_str(), str.length(), &wstr[0], count);
return wstr;
}
You can try this solution below. I tested, it works, detect special characters (example: º ä ç á ) and works on Windows XP, Windows 2000 with SP4 and later, Windows 7, 8, 8.1 and 10.
Using std::wstring instead new wchar_t / delete, we reduce problems with leak resources, overflow buffer and corrupt heap.
dwFlags was set to MB_ERR_INVALID_CHARS to works on Windows 2000 with SP4 and later, Windows XP. If this flag is not set, the function silently drops illegal code points.
std::wstring ConvertStringToWstring(const std::string &str)
{
if (str.empty())
{
return std::wstring();
}
int num_chars = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, str.c_str(), str.length(), NULL, 0);
std::wstring wstrTo;
if (num_chars)
{
wstrTo.resize(num_chars);
if (MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, str.c_str(), str.length(), &wstrTo[0], num_chars))
{
return wstrTo;
}
}
return std::wstring();
}
Second question about this, this morning!
WideCharToMultiByte() and MultiByteToWideChar() are a pain to use. Each conversion requires two calls to the routines and you have to look after allocating/freeing memory and making sure the strings are correctly terminated. You need a wrapper!
I have a convenient C++ wrapper on my blog, here, which you are welcome to use.
Here's the other question this morning
The function cannot take a pointer to a C++ string. It will expect a pointer to a buffer of wide characters of sufficient size- you must allocate this buffer yourself.
string x = "This is c++ not java";
wstring Wstring;
Wstring.resize(x.size());
int c = MultiByteToWideChar( CP_UTF8 , 0 , x.c_str() , x.size() , &Wstring[0], 0 );

utf-8 to/from utf-16 problem

I based these two conversion functions and an answer on StackOverflow, but converting back-and-forth doesn't work:
std::wstring MultiByteToWideString(const char* szSrc)
{
unsigned int iSizeOfStr = MultiByteToWideChar(CP_ACP, 0, szSrc, -1, NULL, 0);
wchar_t* wszTgt = new wchar_t[iSizeOfStr];
if(!wszTgt) assert(0);
MultiByteToWideChar(CP_ACP, 0, szSrc, -1, wszTgt, iSizeOfStr);
std::wstring wstr(wszTgt);
delete(wszTgt);
return(wstr);
}
std::string WideStringToMultiByte(const wchar_t* wszSrc)
{
int iSizeOfStr = WideCharToMultiByte(CP_ACP, 0, wszSrc, -1, NULL, 0, NULL, NULL);
char* szTgt = new char[iSizeOfStr];
if(!szTgt) return(NULL);
WideCharToMultiByte(CP_ACP, 0, wszSrc, -1, szTgt, iSizeOfStr, NULL, NULL);
std::string str(szTgt);
delete(szTgt);
return(str);
}
[...]
// はてなブ in utf-16
wchar_t wTestUTF16[] = L"\u306f\u3066\u306a\u30d6\u306f\u306f";
// shows the text correctly
::MessageBoxW(NULL, wTestUTF16, L"Message", MB_OK);
// convert to UTF8, and back to UTF-16
std::string strUTF8 = WideStringToMultiByte(wTestUTF16);
std::wstring wstrUTF16 = MultiByteToWideString(strUTF8.c_str());
// this doesn't show the proper text. Should be same as first message box
::MessageBoxW(NULL, wstrUTF16.c_str(), L"Message", MB_OK);
Check the docs for WideCharToMultiByte(). CP_ACP converts using the current system code page. That's a very lossy one. You want CP_UTF8.