Serialize string to UTF w/o BOM - c++

I'm trying to code unicode string serialization to UTF-8 w/o BOM file. For some reason the code below gives wrong output.
static void MyWriteFile(HANDLE hFile, PTCHAR pszText, int cchLen, BOOL bAsUnicode)
{
DWORD dwBytes;
size_t utf8len = WideCharToMultiByte(CP_UTF8, 0, pszText, -1, NULL, 0, NULL, NULL);
PCHAR pszConverted = (PCHAR)LocalAlloc(LPTR, utf8len);
WideCharToMultiByte(CP_UTF8, 0, pszText, utf8len, pszConverted, utf8len, 0, 0);
WriteFile(hFile, pszConverted, utf8len, &dwBytes, NULL);
}

WideCharToMultiByte(CP_UTF8, 0, pszText, utf8len, pszConverted, utf8len, 0, 0);
The fourth parameter of WideCharToMultiByte (cchWideChar) is the size of the input string. You should leave that to -1 as it is null terminated. Otherwise your output buffer is probably not large enough, and it will contain too much data.

Related

BSTR conversion to UTF-8

I'm working with UIAutomation and I'm struggling with the localized BSTRs. I'm in Germany, so there are some special characters that are represented funny in the BSTRs. I'm logging the information and need to have them in UTF-8 to process later on.
I tried already every version of the answers that I could find regarding to WideCharToMultiByte, but that's just converting the funny character into an even funnier one. I'd really appreciate if anyone could tell me what I'm doing wrong, it's really bugging me.
So I tried both of the following versions and got both times this result (the upper one is the converted one, the lower the original one):
The first word should be "Schaltfläche" and the second "Fünf".
My tried code:
BSTR* origin;
_bstr_t originWrapper(*origin);
char* originChar = originWrapper;
size_t len = strlen(originChar) + 1;
int room = MultiByteToWideChar(CP_ACP, 0, originChar, -1, NULL, 0);
wchar_t* unicodeString = (wchar_t*)malloc((sizeof(wchar_t))*room);
MultiByteToWideChar(CP_ACP, 0, originChar, -1, unicodeString, room);
int size_needed = WideCharToMultiByte(CP_UTF8, 0, unicodeString, -1, NULL, 0, NULL, NULL);
char* utf8Char = (char*) malloc(size_needed);
WideCharToMultiByte(CP_UTF8, 0, unicodeString, -1, utf8Char, size_needed, NULL, NULL);
and
BSTR* origin;
_bstr_t originWrapper(*origin);
int size_needed = WideCharToMultiByte(CP_UTF8, 0, originWrapper, SysStringByteLen(*origin), NULL, 0, NULL, NULL);
std::string resultingString(size_needed, 0);
WideCharToMultiByte(CP_UTF8, 0, *origin, SysStringByteLen(*origin), &resultingString[0], size_needed, NULL, NULL);
BSTR is a pointer to UTF-16 (WCHAR) character data, preceded by the string length. So, your roundtrip through narrow strings is misguided, you should straight use WideCharToMultiByte:
std::string BSTRtoUTF8(BSTR bstr) {
int len = SysStringLen(bstr);
// special case because a NULL BSTR is a valid zero-length BSTR,
// but regular string functions would balk on it
if(len == 0) return "";
int size_needed = WideCharToMultiByte(CP_UTF8, 0, bstr, len, NULL, 0, NULL, NULL);
std::string ret(size_needed, '\0');
WideCharToMultiByte(CP_UTF8, 0, bstr, len, ret.data(), ret.size(), NULL, NULL);
return ret;
}
To check the validity of the conversion don't output the result to the console, as it doesn't support UTF-8 output by default (it interprets narrow strings not even as in CP_ACP, but in CP_OEM, go figure). Instead, write the output to a file and check it with a reliable editor supporting UTF-8.

Unable to obtain POST request response using wininet

I need to make a POST request to an API to get some XML data (http://freecite.library.brown.edu/welcome/api_instructions). This works fine with curl:
curl -H "Accept: application/xml" --data "citation=Udvarhelyi, I.S., Gatsonis, C.A., Epstein, A.M., Pashos, C.L., Newhouse, J.P. and McNeil, B.J. Acute Myocardial Infarction in the Medicare population: process of care and clinical outcomes. Journal of the American Medical Association, 1992; 18:2530-2536. " http://freecite.library.brown.edu:80/citations/create
So I am trying to do a similar thingy using Win32 SDK. This is my code:
void LoadData()
{
wil::unique_hinternet hInternet(InternetOpen(L"Dummy", INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0));
wil::unique_hinternet hConnect(InternetConnect(hInternet.get(), L"http://freecite.library.brown.edu", 80, NULL, NULL, INTERNET_SERVICE_HTTP, 0, 0));
wil::unique_hinternet hRequest(HttpOpenRequest(hConnect.get(), L"POST", L"/citations/create", NULL, NULL, NULL, NULL, NULL));
wstring data = L"citation=Udvarhelyi, I.S., Gatsonis, C.A., Epstein, A.M., Pashos, C.L., Newhouse, J.P. and McNeil, B.J. Acute Myocardial Infarction in the Medicare population: process of care and clinical outcomes. Journal of the American Medical Association, 1992; 18:2530-2536.";
PCWSTR szHeaders = L"Accept: application/xml";
HttpSendRequest(hRequest.get(), szHeaders, 0, (LPVOID)data.c_str(), static_cast<int>(data.length()));
DWORD availableBytes = 0;
InternetQueryDataAvailable(hRequest.get(), &availableBytes, 0, 0);
PBYTE outputBuffer = (PBYTE)HeapAlloc(GetProcessHeap(), 0, availableBytes);
PBYTE nextBytes = outputBuffer;
DWORD bytesUsed = 0; // number of used bytes.
while (availableBytes)
{
DWORD downloadedBytes;
InternetReadFile(hRequest.get(), nextBytes, availableBytes, &downloadedBytes);
bytesUsed = bytesUsed + downloadedBytes;
InternetQueryDataAvailable(hRequest.get(), &availableBytes, 0, 0);
if (availableBytes > 0)
{
// lazy buffer growth here. Only alloc for what we need. could be optimized if we end up with huge payloads (>10MB).
// otherwise, this is good enough.
outputBuffer = (PBYTE)HeapReAlloc(GetProcessHeap(), 0, outputBuffer, bytesUsed + availableBytes);
nextBytes = outputBuffer + bytesUsed; // careful, pointer might have moved! Update cursor.
}
}
// Convert outputed XML to wide char
int size_needed = MultiByteToWideChar(CP_UTF8, 0, (PCCH)outputBuffer, bytesUsed, NULL, 0);
std::wstring wstrTo(size_needed, 0);
MultiByteToWideChar(CP_UTF8, 0, (PCCH)outputBuffer, bytesUsed, &wstrTo[0], size_needed);
wstring res = wstrTo;
}
The problem is, before entering the for loop, even after the call to InternetQueryDataAvailable, availableBytes comes out to be 0. As a result, I finally end up getting a blank string as response, whereas I was expecting a XML response.
Can anyone point me what am I doing wrongly, and how to fix it?
InternetConnect expects server name or IP address, so don't include "http://" in the address. Change to:
InternetConnect(handle, L"freecite.library.brown.edu"...);
Use UTF-8 for data. Other parameters for WinAPI functions are correctly using UTF-16, they automatically make the necessary conversions.
Change the header:
std::wstring szHeaders = L"Content-Type: application/x-www-form-urlencoded\r\n";
accept should be sent through HttpOpenRequest
const wchar_t *accept[] = { L"text/xml", NULL };
HINTERNET hrequest = HttpOpenRequest(hconnect, L"POST", L"/citations/create",
NULL, NULL, accept, 0, 0);
Note, if you don't specify accept (use NULL in its place) then the result can be in plain html.
The example below should return XML.
Note, for simplicity I put optional as ANSI string, but it should be UTF8, then you convert it to UTF16 woptional and send it. result will be UTF8 string, it should be converted to UTF16 for Windows's display.
#include <iostream>
#include <string>
#include <Windows.h>
#include <WinINet.h>
#pragma comment(lib, "wininet.lib")//include WinINet library
int main()
{
std::string result;
std::wstring server = L"freecite.library.brown.edu";
std::wstring objectname = L"/citations/create"; //file in this case!
std::wstring header = L"Content-Type: application/x-www-form-urlencoded\r\n";
std::string optional = "citation=Udvarhelyi, I.S., Gatsonis, C.A., Epstein";
HINTERNET hsession = InternetOpen(L"appname",
INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
HINTERNET hconnect = InternetConnect(hsession, server.c_str(),
INTERNET_DEFAULT_HTTP_PORT, NULL, NULL, INTERNET_SERVICE_HTTP, 0, 0);
const wchar_t* accept[] = { L"text/xml", NULL };
HINTERNET hrequest = HttpOpenRequest(hconnect, L"POST", objectname.c_str(),
NULL, NULL, accept, 0, 0);
if(HttpSendRequest(hrequest, header.c_str(), header.size(),
&optional[0], optional.size()))
{
DWORD blocksize = 4096;
DWORD received = 0;
std::string block(blocksize, 0);
while (InternetReadFile(hrequest, &block[0], blocksize, &received)
&& received)
{
block.resize(received);
result += block;
}
std::cout << result << std::endl;
}
if (hrequest) InternetCloseHandle(hrequest);
if (hconnect) InternetCloseHandle(hconnect);
if (hsession) InternetCloseHandle(hsession);
return 0;
}

How can I fix this code?

This code does not work as expected and I don't know what's wrong.
szTeam should change, but doesn't.
Could anyone explain this?
-----------------------------------------------------
WCHAR szTeam[MAX_PATH] = L"\u7F57\u5207\u8FBE\u5C14\u6D41\u6D6A";
char szMsg[MAX_PATH];
sprintf(szMsg , "%s" , WideStringToMultiByte(szTeam));
swprintf( szTeam , L"%s" , MultiByteToWideString(szMsg));
......
WCHAR* MultiByteToWideString(const char* szSrc)
{
int iSizeOfStr = MultiByteToWideChar(CP_ACP, 0, szSrc, -1, NULL, 0);
wchar_t* wszTgt = new wchar_t[iSizeOfStr];
if(!wszTgt)
return (NULL);
MultiByteToWideChar(CP_ACP, 0, szSrc, -1, wszTgt, iSizeOfStr);
return(wszTgt);
}
char* WideStringToMultiByte(const wchar_t* wszSrc)
{
int iSizeOfStr = WideCharToMultiByte(CP_UTF8, 0, wszSrc, -1, NULL, 0, NULL, NULL);
char* szTgt = new char[iSizeOfStr];
if(!szTgt)
return(NULL);
WideCharToMultiByte(CP_UTF8, 0, wszSrc, -1, szTgt, iSizeOfStr, NULL, NULL);
return szTgt;
}
-----------------------------------------------------
Erm, no szTeam does change. Into something unrecognizable, mojibake. You start out with "罗切达尔流浪" and convert it from its utf-16 encoding to utf-8. That works fine. The debugger won't show you anything recognizable because it neither knows nor cares that szMsg is encoded in utf-8.
You then go wrong though, you are converting that utf-8 string with CP_ACP. Which says that the string is encoded in the default system code page. It is not, it was encoded in utf-8.
Fix your problem with:
WCHAR* MultiByteToWideString(const char* szSrc)
{
int iSizeOfStr = MultiByteToWideChar(CP_UTF8, 0, szSrc, -1, NULL, 0);
// etc..
}
And now szTeam won't change because the string got properly converted back.

How do I use MultiByteToWideChar?

I want to convert a normal string to a wstring. For this, I am trying to use the Windows API function MultiByteToWideChar.
But it does not work for me.
Here is what I have done:
string x = "This is c++ not java";
wstring Wstring;
MultiByteToWideChar( CP_UTF8 , 0 , x.c_str() , x.size() , &Wstring , 0 );
The last line produces the compiler error:
'MultiByteToWideChar' : cannot convert parameter 5 from 'std::wstring *' to 'LPWSTR'
How do I fix this error?
Also, what should be the value of the argument cchWideChar? Is 0 okay?
You must call MultiByteToWideChar twice:
The first call to MultiByteToWideChar is used to find the buffer size you need for the wide string. Look at Microsoft's documentation; it states:
If the function succeeds and cchWideChar is 0, the return value is the required size, in characters, for the buffer indicated by lpWideCharStr.
Thus, to make MultiByteToWideChar give you the required size, pass 0 as the value of the last parameter, cchWideChar. You should also pass NULL as the one before it, lpWideCharStr.
Obtain a non-const buffer large enough to accommodate the wide string, using the buffer size from the previous step. Pass this buffer to another call to MultiByteToWideChar. And this time, the last argument should be the actual size of the buffer, not 0.
A sketchy example:
int wchars_num = MultiByteToWideChar( CP_UTF8 , 0 , x.c_str() , -1, NULL , 0 );
wchar_t* wstr = new wchar_t[wchars_num];
MultiByteToWideChar( CP_UTF8 , 0 , x.c_str() , -1, wstr , wchars_num );
// do whatever with wstr
delete[] wstr;
Also, note the use of -1 as the cbMultiByte argument. This will make the resulting string null-terminated, saving you from dealing with them.
Few common conversions:
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <string>
std::string ConvertWideToANSI(const std::wstring& wstr)
{
int count = WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), wstr.length(), NULL, 0, NULL, NULL);
std::string str(count, 0);
WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), -1, &str[0], count, NULL, NULL);
return str;
}
std::wstring ConvertAnsiToWide(const std::string& str)
{
int count = MultiByteToWideChar(CP_ACP, 0, str.c_str(), str.length(), NULL, 0);
std::wstring wstr(count, 0);
MultiByteToWideChar(CP_ACP, 0, str.c_str(), str.length(), &wstr[0], count);
return wstr;
}
std::string ConvertWideToUtf8(const std::wstring& wstr)
{
int count = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), wstr.length(), NULL, 0, NULL, NULL);
std::string str(count, 0);
WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, &str[0], count, NULL, NULL);
return str;
}
std::wstring ConvertUtf8ToWide(const std::string& str)
{
int count = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), str.length(), NULL, 0);
std::wstring wstr(count, 0);
MultiByteToWideChar(CP_UTF8, 0, str.c_str(), str.length(), &wstr[0], count);
return wstr;
}
You can try this solution below. I tested, it works, detect special characters (example: º ä ç á ) and works on Windows XP, Windows 2000 with SP4 and later, Windows 7, 8, 8.1 and 10.
Using std::wstring instead new wchar_t / delete, we reduce problems with leak resources, overflow buffer and corrupt heap.
dwFlags was set to MB_ERR_INVALID_CHARS to works on Windows 2000 with SP4 and later, Windows XP. If this flag is not set, the function silently drops illegal code points.
std::wstring ConvertStringToWstring(const std::string &str)
{
if (str.empty())
{
return std::wstring();
}
int num_chars = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, str.c_str(), str.length(), NULL, 0);
std::wstring wstrTo;
if (num_chars)
{
wstrTo.resize(num_chars);
if (MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, str.c_str(), str.length(), &wstrTo[0], num_chars))
{
return wstrTo;
}
}
return std::wstring();
}
Second question about this, this morning!
WideCharToMultiByte() and MultiByteToWideChar() are a pain to use. Each conversion requires two calls to the routines and you have to look after allocating/freeing memory and making sure the strings are correctly terminated. You need a wrapper!
I have a convenient C++ wrapper on my blog, here, which you are welcome to use.
Here's the other question this morning
The function cannot take a pointer to a C++ string. It will expect a pointer to a buffer of wide characters of sufficient size- you must allocate this buffer yourself.
string x = "This is c++ not java";
wstring Wstring;
Wstring.resize(x.size());
int c = MultiByteToWideChar( CP_UTF8 , 0 , x.c_str() , x.size() , &Wstring[0], 0 );

utf-8 to/from utf-16 problem

I based these two conversion functions and an answer on StackOverflow, but converting back-and-forth doesn't work:
std::wstring MultiByteToWideString(const char* szSrc)
{
unsigned int iSizeOfStr = MultiByteToWideChar(CP_ACP, 0, szSrc, -1, NULL, 0);
wchar_t* wszTgt = new wchar_t[iSizeOfStr];
if(!wszTgt) assert(0);
MultiByteToWideChar(CP_ACP, 0, szSrc, -1, wszTgt, iSizeOfStr);
std::wstring wstr(wszTgt);
delete(wszTgt);
return(wstr);
}
std::string WideStringToMultiByte(const wchar_t* wszSrc)
{
int iSizeOfStr = WideCharToMultiByte(CP_ACP, 0, wszSrc, -1, NULL, 0, NULL, NULL);
char* szTgt = new char[iSizeOfStr];
if(!szTgt) return(NULL);
WideCharToMultiByte(CP_ACP, 0, wszSrc, -1, szTgt, iSizeOfStr, NULL, NULL);
std::string str(szTgt);
delete(szTgt);
return(str);
}
[...]
// はてなブ in utf-16
wchar_t wTestUTF16[] = L"\u306f\u3066\u306a\u30d6\u306f\u306f";
// shows the text correctly
::MessageBoxW(NULL, wTestUTF16, L"Message", MB_OK);
// convert to UTF8, and back to UTF-16
std::string strUTF8 = WideStringToMultiByte(wTestUTF16);
std::wstring wstrUTF16 = MultiByteToWideString(strUTF8.c_str());
// this doesn't show the proper text. Should be same as first message box
::MessageBoxW(NULL, wstrUTF16.c_str(), L"Message", MB_OK);
Check the docs for WideCharToMultiByte(). CP_ACP converts using the current system code page. That's a very lossy one. You want CP_UTF8.