Why does a filename have different bytes after converting UTF-16 -> UTF-8 -> UTF-16 in winapi? - c++

I have a file whose name contains non-ASCII characters.
I use ReadDirectoryChangesW to read changes in the current folder, and I get this path for the file: L"TEST Ӡ⬨☐.ipt".
Next, I want to convert this to UTF-8 and back:
std::string wstringToUtf8(const std::wstring& source) {
    const int size = WideCharToMultiByte(CP_UTF8, 0, source.data(), static_cast<int>(source.size()), NULL, 0, NULL, NULL);
    std::vector<char> buffer8(size);
    WideCharToMultiByte(CP_UTF8, 0, source.data(), static_cast<int>(source.size()), buffer8.data(), size, NULL, NULL);
    return std::string(buffer8.data(), size);
}
std::wstring utf8ToWstring(const std::string& source) {
    const int size = MultiByteToWideChar(CP_UTF8, 0, source.data(), static_cast<int>(source.size()), NULL, 0);
    std::vector<wchar_t> buffer16(size);
    MultiByteToWideChar(CP_UTF8, 0, source.data(), static_cast<int>(source.size()), buffer16.data(), size);
    return std::wstring(buffer16.data(), size);
}
int main() {
    // Some code with ReadDirectoryChangesW and
    // ...
    // std::wstring fileName = L"TEST Ӡ⬨☐.ipt";
    // ...
    std::string filenameUTF8 = wstringToUtf8(fileName);
    std::wstring filename2 = utf8ToWstring(filenameUTF8);
    assert(fileName == filename2); // FAIL!
    return 0;
}
But the assert fires: fileName and filename2 differ at index [29]. Why?

57216 (0xDF80) falls into the surrogate range, which UTF-16 uses to encode non-BMP code points. Surrogates must appear in pairs, or decoding won't give you the correct code point.
65533 (U+FFFD) is the special replacement character which the decoder emits because the other half of the surrogate pair is missing.
To put it another way: your original string is not a valid UTF-16 string.
More info on Wikipedia.
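If you want to detect this up front rather than after a failed round-trip, WideCharToMultiByte can be asked to fail on invalid UTF-16. A minimal sketch, not from the original answer (note that WC_ERR_INVALID_CHARS requires Windows Vista or later and is only valid with CP_UTF8):
bool isValidUtf16(const std::wstring& source) {
    // With WC_ERR_INVALID_CHARS the call returns 0 and sets
    // ERROR_NO_UNICODE_TRANSLATION when it meets an unpaired surrogate,
    // instead of silently substituting U+FFFD.
    const int size = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
        source.data(), static_cast<int>(source.size()), NULL, 0, NULL, NULL);
    return size != 0 || source.empty();
}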

Related

HttpSendRequest - POST data not supporting Unicode

I'm working on making a C++ agent that will post information (such as the system hostname) back to a central server using HttpSendRequest(). One of the pieces of information that I want it to post back is the OS. I created the following function to obtain the OS from the Registry.
wstring getOS()
{
    HKEY key;
    RegOpenKeyEx(HKEY_LOCAL_MACHINE, L"SOFTWARE\\Microsoft\\Windows NT\\CurrentVersion", 0, KEY_QUERY_VALUE, &key); // Obtains Registry handle
    DWORD type;
    wchar_t buffer[MAX_PATH]; // MAX_PATH = 260 - The product name should never exceed this value
    DWORD size = sizeof(buffer);
    RegQueryValueEx(key, L"ProductName", NULL, &type, (LPBYTE)&buffer, &size); // Queries Registry key - stores value in "buffer"
    RegCloseKey(key); // Close the Registry handle once we are done with it
    wstring os(buffer); // Converts from C-style character array to wstring
    return os; // Returns wstring to caller
}
This function will obtain the OS using the Registry and store it as a wstring. I then want to pass the returned "os" wstring to the following post() function, but I noticed that you must use a string instead of a wstring for the HTTP POST data. Below is the code for my post() function:
void post()
{
    HINTERNET hInternetOpen = InternetOpen(userAgent.c_str(), INTERNET_OPEN_TYPE_PROXY, L"http://127.0.0.1:9999", NULL, 0);
    HINTERNET hInternetConnect = InternetConnect(hInternetOpen, host.c_str(), INTERNET_DEFAULT_HTTP_PORT, NULL, NULL, INTERNET_SERVICE_HTTP, 0, 0);
    HINTERNET hHttpOpenRequest = HttpOpenRequest(hInternetConnect, L"POST", file.c_str(), NULL, NULL, NULL, 0, 0);
    wstring headers = L"Content-Type: application/x-www-form-urlencoded"; // Content-Type is necessary to POST
    string postData = "os="; // Why does this have to be a string and not a wstring?
    HttpSendRequest(hHttpOpenRequest, headers.c_str(), headers.length(), (LPVOID)postData.c_str(), postData.size());
    InternetCloseHandle(hInternetOpen);
    InternetCloseHandle(hInternetConnect);
    InternetCloseHandle(hHttpOpenRequest);
}
If I try to make "postData" a wstring, the raw request data captured by my sniffer comes out garbled (screenshot not reproduced here).
Can someone shed some light onto the easiest way to include a wstring as the POST data?
HttpSendRequest() only knows about raw bytes, not strings. You can send UTF-16 data using a std::wstring, but you have to tell the server that you are sending UTF-16, via a charset attribute in the Content-Type header.
wstring headers = L"Content-Type: application/x-www-form-urlencoded; charset=utf-16";
// TODO: don't forget to URL-encode the value from getOS() to
// escape reserved characters, including '=' and '&'...
wstring postData = L"os=" + getOS();
HttpSendRequest(hHttpOpenRequest, headers.c_str(), headers.length(),
    (LPVOID)postData.c_str(), postData.length() * sizeof(wchar_t));
Note the use of sizeof(wchar_t) above. In your screenshot, your sniffer is showing the raw data, and the data it shows is what UTF-16 would look like, but you see only half of your wstring data because you are setting the dwOptionalLength parameter of HttpSendRequest() to a character count (7) instead of a byte count (14):
dwOptionalLength [in]
The size of the optional data, in bytes. This parameter can be zero if there is no optional data to send.
When you use std::string, the character count and the byte count are the same value.
What you really should be sending is UTF-8 instead of UTF-16, eg:
string Utf8Encode(const wstring &wstr)
{
    // NOTE: C++11 added built-in support for converting between
    // UTF-8 and UTF-16 via the std::wstring_convert class
    // (deprecated since C++17):
    /*
    wstring_convert<codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(wstr);
    */
    string out;
    int len = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), static_cast<int>(wstr.length()), NULL, 0, NULL, NULL);
    if (len > 0)
    {
        out.resize(len);
        WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), static_cast<int>(wstr.length()), &out[0], len, NULL, NULL);
    }
    return out;
}
wstring headers = L"Content-Type: application/x-www-form-urlencoded; charset=utf-8";
// TODO: don't forget to URL-encode the value from getOS() to
// escape reserved characters, including '=' and '&'...
string postData = "os=" + Utf8Encode(getOS());
HttpSendRequest(hHttpOpenRequest, headers.c_str(), headers.length(),
postData.c_str(), postData.size());
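The URL-encoding step mentioned in the TODO comments could look like the following. This is a sketch; percentEncode is a hypothetical helper, not part of the original answer:
#include <cctype>
#include <cstdio>
#include <string>
// Percent-encode a UTF-8 byte string for application/x-www-form-urlencoded
// data. Unreserved characters (RFC 3986) pass through; every other byte
// becomes %XX.
std::string percentEncode(const std::string& utf8)
{
    std::string out;
    for (unsigned char c : utf8)
    {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~')
            out += static_cast<char>(c);
        else
        {
            char buf[4];
            std::snprintf(buf, sizeof(buf), "%%%02X", c);
            out += buf;
        }
    }
    return out;
}
// Usage: string postData = "os=" + percentEncode(Utf8Encode(getOS()));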

Issues Converting wstring to TCHAR [duplicate]

This question already has answers here: How to convert std::wstring to a TCHAR*?
I'm fairly new to programming, and I'm trying to write a program where a user inputs a date, that date is appended to a file directory name, and that directory is then searched.
Here is what I'm working with below. I've searched online and tried doing the conversion a few different ways without really understanding it, so I left off with (what I know is incorrect) a static_cast.
Maybe I'm just not doing the conversion right... basically this will be passed back to a function that uses the WINAPI handler. Whether I can get that to work is a completely different story... Thanks in advance for any help!
wstring fDate;
wstring fileDin;
const TCHAR* s = _T (fileDin);
std::wstring(fDate);
std::wstring(fileDin) =L"Z:\\software\\A\\AC\\" + fDate;
wcout<< fileDin;
cout <<endl;
//wstring fileDin(&arc[1]);
fileDin = static_cast<TCHAR>(arc[1]);
dir(2, arc);
TCHAR can be either wchar_t (when you use Unicode) or char (when you use Multi-byte).
On the other hand std::wstring always contains characters of type wchar_t, so it's better if you use wchar_t* directly instead of TCHAR* (if possible).
Conversion from wchar_t* to std::wstring can then be done using a std::wstring constructor:
const wchar_t* wcstr = L"my string";
std::wstring wstr(wcstr);
and std::wstring to a wchar_t pointer by simply calling the c_str() method (note that it returns a const pointer):
const wchar_t* wcstr = wstr.c_str();
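If the API you call needs a writable TCHAR buffer rather than a const pointer, one option is to copy the string into a std::vector first. A minimal sketch, assuming a Unicode build where TCHAR is wchar_t:
// Obtain a writable, null-terminated wchar_t buffer from a wstring.
std::vector<wchar_t> toWritableBuffer(const std::wstring& s)
{
    std::vector<wchar_t> buf(s.begin(), s.end());
    buf.push_back(L'\0'); // C-style APIs expect null termination
    return buf;
}
// Usage (SomeApiThatWritesToBuffer is a hypothetical function):
// std::vector<wchar_t> buf = toWritableBuffer(fileDin);
// SomeApiThatWritesToBuffer(buf.data(), buf.size());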
Then sometimes you might need to convert between "wide" strings (std::wstrings holding wchar_t characters) and multi-byte strings (std::strings holding chars). I usually use the following helpers:
// multi-byte (UTF-8) to wide char:
std::wstring s2ws(const std::string& str)
{
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
    std::wstring wstrTo(size_needed, 0);
    MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
    return wstrTo;
}
// wide char to multi-byte (UTF-8):
std::string ws2s(const std::wstring& wstr)
{
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), (int)wstr.size(), NULL, 0, NULL, NULL);
    std::string strTo(size_needed, 0);
    WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
    return strTo;
}
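A quick round-trip check with these helpers (a sketch; the assertion holds because both directions use CP_UTF8):
std::wstring original = L"Z:\\software\\A\\AC\\";
std::string utf8 = ws2s(original);  // UTF-16 -> UTF-8
std::wstring back = s2ws(utf8);     // UTF-8 -> UTF-16
assert(original == back);           // true for any valid UTF-16 input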

How do I use MultiByteToWideChar?

I want to convert a normal string to a wstring. For this, I am trying to use the Windows API function MultiByteToWideChar.
But it does not work for me.
Here is what I have done:
string x = "This is c++ not java";
wstring Wstring;
MultiByteToWideChar( CP_UTF8 , 0 , x.c_str() , x.size() , &Wstring , 0 );
The last line produces the compiler error:
'MultiByteToWideChar' : cannot convert parameter 5 from 'std::wstring *' to 'LPWSTR'
How do I fix this error?
Also, what should be the value of the argument cchWideChar? Is 0 okay?
You must call MultiByteToWideChar twice:
The first call to MultiByteToWideChar is used to find the buffer size you need for the wide string. Look at Microsoft's documentation; it states:
If the function succeeds and cchWideChar is 0, the return value is the required size, in characters, for the buffer indicated by lpWideCharStr.
Thus, to make MultiByteToWideChar give you the required size, pass 0 as the value of the last parameter, cchWideChar. You should also pass NULL as the one before it, lpWideCharStr.
Obtain a non-const buffer large enough to accommodate the wide string, using the buffer size from the previous step. Pass this buffer to another call to MultiByteToWideChar. And this time, the last argument should be the actual size of the buffer, not 0.
A sketchy example:
int wchars_num = MultiByteToWideChar( CP_UTF8 , 0 , x.c_str() , -1, NULL , 0 );
wchar_t* wstr = new wchar_t[wchars_num];
MultiByteToWideChar( CP_UTF8 , 0 , x.c_str() , -1, wstr , wchars_num );
// do whatever with wstr
delete[] wstr;
Also, note the use of -1 as the cbMultiByte argument. This makes the resulting string null-terminated (the size returned by the first call already includes the terminator), saving you from handling the terminator yourself.
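The same two-call pattern can be written without manual new/delete by letting std::wstring own the buffer. A minimal sketch, not part of the original answer:
std::wstring widen(const std::string& x)
{
    // First call: measure. -1 means the input is null-terminated and the
    // returned count includes the terminator.
    int wchars_num = MultiByteToWideChar(CP_UTF8, 0, x.c_str(), -1, NULL, 0);
    if (wchars_num <= 0)
        return std::wstring();
    std::wstring result(wchars_num, 0);
    // Second call: convert directly into the string's buffer.
    MultiByteToWideChar(CP_UTF8, 0, x.c_str(), -1, &result[0], wchars_num);
    result.resize(wchars_num - 1); // drop the embedded null terminator
    return result;
}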
A few common conversions:
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <string>
std::string ConvertWideToANSI(const std::wstring& wstr)
{
    int count = WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), (int)wstr.length(), NULL, 0, NULL, NULL);
    std::string str(count, 0);
    WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), (int)wstr.length(), &str[0], count, NULL, NULL);
    return str;
}
std::wstring ConvertAnsiToWide(const std::string& str)
{
    int count = MultiByteToWideChar(CP_ACP, 0, str.c_str(), (int)str.length(), NULL, 0);
    std::wstring wstr(count, 0);
    MultiByteToWideChar(CP_ACP, 0, str.c_str(), (int)str.length(), &wstr[0], count);
    return wstr;
}
std::string ConvertWideToUtf8(const std::wstring& wstr)
{
    int count = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), (int)wstr.length(), NULL, 0, NULL, NULL);
    std::string str(count, 0);
    WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), (int)wstr.length(), &str[0], count, NULL, NULL);
    return str;
}
std::wstring ConvertUtf8ToWide(const std::string& str)
{
    int count = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), (int)str.length(), NULL, 0);
    std::wstring wstr(count, 0);
    MultiByteToWideChar(CP_UTF8, 0, str.c_str(), (int)str.length(), &wstr[0], count);
    return wstr;
}
You can try the solution below. I tested it; it works, it handles special characters (example: º ä ç á), and it runs on Windows 2000 with SP4 and later, Windows XP, 7, 8, 8.1 and 10.
Using std::wstring instead of new wchar_t[]/delete[] reduces the risk of resource leaks, buffer overflows and heap corruption.
dwFlags is set to MB_ERR_INVALID_CHARS so the call fails on illegal code points (the flag works on Windows 2000 with SP4 and later, and Windows XP). If the flag is not set, the function silently drops illegal code points.
std::wstring ConvertStringToWstring(const std::string &str)
{
    if (str.empty())
    {
        return std::wstring();
    }
    int num_chars = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, str.c_str(), (int)str.length(), NULL, 0);
    std::wstring wstrTo;
    if (num_chars)
    {
        wstrTo.resize(num_chars);
        if (MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, str.c_str(), (int)str.length(), &wstrTo[0], num_chars))
        {
            return wstrTo;
        }
    }
    return std::wstring();
}
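A usage sketch (assuming the system ANSI code page is Windows-1252, where the byte 0xE9 is 'é'; the failure branch can be inspected with GetLastError()):
// needs <iostream> and <Windows.h>
std::string input = "caf\xE9"; // "café" in Windows-1252
std::wstring wide = ConvertStringToWstring(input);
if (wide.empty())
    std::wcerr << L"conversion failed, error " << GetLastError() << L"\n";
else
    std::wcout << wide << L"\n"; // prints café on a Unicode-aware console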
Second question about this, this morning!
WideCharToMultiByte() and MultiByteToWideChar() are a pain to use. Each conversion requires two calls to the routines and you have to look after allocating/freeing memory and making sure the strings are correctly terminated. You need a wrapper!
I have a convenient C++ wrapper on my blog which you are welcome to use.
The function cannot take a pointer to a C++ string. It expects a pointer to a buffer of wide characters of sufficient size; you must allocate this buffer yourself.
string x = "This is c++ not java";
wstring Wstring;
Wstring.resize(x.size()); // a UTF-16 string never has more code units than its UTF-8 form has bytes
int c = MultiByteToWideChar(CP_UTF8, 0, x.c_str(), (int)x.size(), &Wstring[0], (int)Wstring.size());
Wstring.resize(c); // shrink to the number of characters actually written

call avio_open function with non-english filename is invalid

I have been writing a Unicode-based program with libav, and I want to create a file through libav with the filename "中.mp4".
This filename is not English. When I make the call, the function returns a positive integer (not a failure), but the file is created as "ѱ۰.mp4" instead of "中.mp4" (an invalid file name).
What's the matter?
char * szFilenameA = 0;
#ifdef _UNICODE
    CSHArray<char> aFilenameBuffer;
    aFilenameBuffer.Alloc(lstrlen(szFileName) * 2);
    ZeroMemory(aFilenameBuffer, aFilenameBuffer.GetSize());
    WideCharToMultiByte(CP_ACP, 0, szFileName, lstrlen(szFileName), aFilenameBuffer, aFilenameBuffer.GetSize(), NULL, NULL);
    szFilenameA = aFilenameBuffer;
#else
    szFilenameA = (TCHAR *)szFileName;
#endif
ZeroMemory(m_pOutputFormatCtx->filename, 1024);
_snprintf(m_pOutputFormatCtx->filename, strlen(szFilenameA), "%s", szFilenameA);
avio_open(&m_pOutputFormatCtx->pb, szFilenameA, AVIO_FLAG_WRITE)
Finally! It's because of the charset.
Convert the ANSI filename to UTF-8 and then it works fine.
int ANSIToUTF8(const char *pszCode, char *UTF8code)
{
    WCHAR Unicode[100] = {0};
    // ANSI -> UTF-16; cchWideChar is a count of wide characters (100), not bytes
    int nUnicodeSize = MultiByteToWideChar(CP_ACP, 0, pszCode, (int)strlen(pszCode), Unicode, 100);
    // UTF-16 -> UTF-8; the caller's UTF8code buffer is assumed to hold at least 100 bytes
    int nUTF8codeSize = WideCharToMultiByte(CP_UTF8, 0, Unicode, nUnicodeSize, UTF8code, 100, NULL, NULL);
    return nUTF8codeSize;
}
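A tighter alternative without fixed 100-byte buffers (a sketch, not from the original answer; AnsiToUtf8 is a name introduced here):
std::string AnsiToUtf8(const std::string& ansi)
{
    // ANSI -> UTF-16
    int wlen = MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), (int)ansi.size(), NULL, 0);
    std::wstring wide(wlen, 0);
    MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), (int)ansi.size(), &wide[0], wlen);
    // UTF-16 -> UTF-8
    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), wlen, NULL, 0, NULL, NULL);
    std::string utf8(ulen, 0);
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), wlen, &utf8[0], ulen, NULL, NULL);
    return utf8;
}
// Usage: avio_open(&m_pOutputFormatCtx->pb, AnsiToUtf8(szFilenameA).c_str(), AVIO_FLAG_WRITE);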

utf-8 to/from utf-16 problem

I based these two conversion functions on an answer on StackOverflow, but converting back and forth doesn't work:
std::wstring MultiByteToWideString(const char* szSrc)
{
    unsigned int iSizeOfStr = MultiByteToWideChar(CP_ACP, 0, szSrc, -1, NULL, 0);
    wchar_t* wszTgt = new wchar_t[iSizeOfStr];
    if(!wszTgt) assert(0);
    MultiByteToWideChar(CP_ACP, 0, szSrc, -1, wszTgt, iSizeOfStr);
    std::wstring wstr(wszTgt);
    delete[] wszTgt;
    return(wstr);
}
std::string WideStringToMultiByte(const wchar_t* wszSrc)
{
    int iSizeOfStr = WideCharToMultiByte(CP_ACP, 0, wszSrc, -1, NULL, 0, NULL, NULL);
    char* szTgt = new char[iSizeOfStr];
    if(!szTgt) return(NULL);
    WideCharToMultiByte(CP_ACP, 0, wszSrc, -1, szTgt, iSizeOfStr, NULL, NULL);
    std::string str(szTgt);
    delete[] szTgt;
    return(str);
}
[...]
// はてなブ in utf-16
wchar_t wTestUTF16[] = L"\u306f\u3066\u306a\u30d6\u306f\u306f";
// shows the text correctly
::MessageBoxW(NULL, wTestUTF16, L"Message", MB_OK);
// convert to UTF8, and back to UTF-16
std::string strUTF8 = WideStringToMultiByte(wTestUTF16);
std::wstring wstrUTF16 = MultiByteToWideString(strUTF8.c_str());
// this doesn't show the proper text. Should be same as first message box
::MessageBoxW(NULL, wstrUTF16.c_str(), L"Message", MB_OK);
Check the docs for WideCharToMultiByte(). CP_ACP converts using the current system code page, which is a very lossy conversion. You want CP_UTF8.
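With that change the round-trip works. A minimal sketch of the corrected pair (the same structure as the question's code, with CP_ACP replaced by CP_UTF8 and std::string/std::wstring managing the buffers):
std::wstring MultiByteToWideString(const char* szSrc)
{
    int size = MultiByteToWideChar(CP_UTF8, 0, szSrc, -1, NULL, 0); // includes the terminator
    if (size <= 0) return std::wstring();
    std::wstring wstr(size, 0);
    MultiByteToWideChar(CP_UTF8, 0, szSrc, -1, &wstr[0], size);
    wstr.resize(size - 1); // drop the extra null
    return wstr;
}
std::string WideStringToMultiByte(const wchar_t* wszSrc)
{
    int size = WideCharToMultiByte(CP_UTF8, 0, wszSrc, -1, NULL, 0, NULL, NULL);
    if (size <= 0) return std::string();
    std::string str(size, 0);
    WideCharToMultiByte(CP_UTF8, 0, wszSrc, -1, &str[0], size, NULL, NULL);
    str.resize(size - 1);
    return str;
}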