Why do I get a wrong char array when I use WideCharToMultiByte?


Windows 7, Visual Studio 2015.
#ifdef UNICODE
    char *buffer = NULL;
    int iBuffSize = WideCharToMultiByte(CP_ACP, 0, result_msg.c_str(),
        result_msg.size(), buffer, 0, NULL, NULL);
    buffer = static_cast<char*>(malloc(iBuffSize));
    ZeroMemory(buffer, iBuffSize);
    WideCharToMultiByte(CP_ACP, 0, result_msg.c_str(),
        result_msg.size(), buffer, iBuffSize, NULL, NULL);
    string result_msg2(buffer);
    free(buffer);
    throw runtime_error(result_msg2);
#else
    throw runtime_error(result_msg);
#endif
result_msg is std::wstring for unicode, and std::string for the multi-byte character set.
For Multi-byte character set:
For Unicode character set:

You specified the input string size as result_msg.size(), which does not include the terminating null character, so the output is not null-terminated either. But when you convert buffer to a string, you do not specify the size of buffer, so the string constructor expects a null terminator. Without that terminator, it keeps reading data from surrounding memory until it encounters a null byte (or triggers a memory access error).
Either use result_msg.size() + 1 for the input size, or specify -1 as the input size to let WideCharToMultiByte() determine the input size automatically. Either approach will include a null terminator in the output.
Or keep using result_msg.size() as the input size, and pass the value of iBuffSize when converting buffer to a string; then you don't need a null terminator:
string result_msg2(buffer, iBuffSize);
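To illustrate the last option without any Windows dependency, here is a minimal sketch of building a std::string from a buffer that has no terminator. The helper name from_counted_buffer is hypothetical; size plays the role of iBuffSize.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: builds a std::string from a byte buffer that is NOT
// null-terminated, using an explicit byte count (the role iBuffSize plays
// in the answer above).
std::string from_counted_buffer(const char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    // The (pointer, count) constructor copies exactly `size` bytes and
    // never looks for a terminating null.
    return size ? std::string(data, size) : std::string();
}
```

Calling from_counted_buffer(buffer, iBuffSize) would then copy exactly the bytes the conversion produced, whether or not a null byte follows them.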

Related

Converting to UTF-8 from ToUnicodeEx()

I get input using GetAsyncKeyState() which I then convert to unicode using ToUnicodeEx():
wchar_t character[1];
ToUnicodeEx(i, scanCode, keyboardState, character, 1, 0, layout);
I can write this to a file using wfstream like so:
wchar_t buffer[128]; // Will not print unicode without these 2 lines
file.rdbuf()->pubsetbuf(buffer, 128);
file.put(0xFEFF); // BOM needed since it's encoded using UCS-2 LE
file << character[0];
When I open this file in Notepad++ it is detected as UCS-2 LE, but I want it to be UTF-8. I believe ToUnicodeEx() returns UCS-2 LE, and it only works with wide chars. Is there any way to do this using either fstream or wfstream, by somehow converting to UTF-8 first? Thanks!

You might want to use the WideCharToMultiByte function.
For example:
wchar_t buffer[LEN]; // input buffer
char output_buffer[OUT_LEN]; // output buffer where the utf-8 string will be written
int num = WideCharToMultiByte(
    CP_UTF8,
    0,
    buffer,
    number_of_characters_in_buffer, // or -1 if buffer is null-terminated
    output_buffer,
    size_in_bytes_of_output_buffer,
    NULL,
    NULL);

The Windows API generally refers to UTF-16 as "Unicode", which is a little confusing. This means most Unicode Win32 functions operate on, or return, UTF-16 strings.
So ToUnicodeEx returns a UTF-16 string.
If you need it as UTF-8, you'll need to convert it using WideCharToMultiByte.
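For readers who want to see what that conversion does to the data, here is a portable sketch of a UTF-16 to UTF-8 encoder. It is not how WideCharToMultiByte is implemented, only an illustration of the encoding; the function name utf16_to_utf8 is hypothetical, and replacing unpaired surrogates with U+FFFD is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Portable illustration of UTF-16 -> UTF-8 encoding.
// Assumption of this sketch: unpaired surrogates are replaced by U+FFFD.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; // unpaired surrogate
        }
        if (cp < 0x80) {                       // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    // Sanity check: UTF-8 never needs fewer bytes than there are UTF-16 code units
    assert(out.size() >= in.size());
    return out;
}
```

The resulting std::string can be written through a plain (narrow) ofstream opened in binary mode, which avoids the wide stream and its BOM altogether.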

Thank you for all the help, I've managed to solve my problem with additional help from a blog post about WideCharToMultiByte() and UTF-8 here.
This function converts wide char arrays to a UTF-8 string:
// Takes a pointer to a wide char array and the number of wchar_t elements in it
std::string ConvertCharacters(const wchar_t* buffer, int len)
{
    // First call: ask for the required output size in bytes
    int nChars = WideCharToMultiByte(CP_UTF8, 0, buffer, len, NULL, 0, NULL, NULL);
    if (nChars == 0)
    {
        return std::string();
    }
    std::string newBuffer;
    newBuffer.resize(nChars);
    // Second call: convert directly into the string's own storage
    WideCharToMultiByte(CP_UTF8, 0, buffer, len, &newBuffer[0], nChars, NULL, NULL);
    return newBuffer;
}

WideCharToMultiByte - required size and bytes written are different for Shift-JIS codepage

I've got a Unicode string containing four Japanese characters and I'm using WideCharToMultiByte to convert it to a multi-byte string specifying the Shift-JIS codepage of 932. In order to get the size of the required buffer I'm calling WideCharToMultiByte first with the cbMultiByte parameter set to 0. This is returning 9 as expected, but then when I actually call WideCharToMultiByte again to do the conversion it's returning the number of bytes written as 13. An example is below, I'm currently hard coding my buffer size to 100:
BSTR value = SysAllocString(L"日経先物");
char *buffer = new char[100];
int sizeRequired = WideCharToMultiByte(932, 0, value, -1, NULL, 0, NULL, NULL);
// sizeRequired is 9 as expected
int bytesWritten = WideCharToMultiByte(932, 0, value, sizeRequired, buffer, 100, NULL, NULL);
// bytesWritten is 13
buffer[8] contains the string terminator \0 as expected. buffer[9-12] contains byte 63.
So if I set the size of my buffer to be sizeRequired it's too small and the second call to WideCharToMultiByte fails. Does anyone know why an extra 4 bytes are written each with a byte value of 63?

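You are passing the wrong arguments to WideCharToMultiByte in your second call: you pass the required size of the destination (sizeRequired) as the length of the source. The function therefore reads 9 wide characters from value: the four Japanese characters (8 bytes in Shift-JIS), the null terminator (1 byte), and four more wchar_t values from whatever memory follows the string. Those four evidently could not be represented in code page 932, so each came out as the default character ? (byte 63), giving 8 + 1 + 4 = 13 bytes. You need to change
int bytesWritten = WideCharToMultiByte(932, 0, value, sizeRequired, buffer, 100,
    NULL, NULL);
to
int bytesWritten = WideCharToMultiByte(932, 0, value, -1, buffer, sizeRequired,
    NULL, NULL);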

Set Registry Value to a Wide Character String (WCHAR) in C++

I'm trying to add a wide character string to registry in C++. The problem is that the RegSetValueEx() function does not support wide chars, it only supports BYTE type (BYTE = unsigned char).
WCHAR myPath[] = L"C:\\éâäà\\éâäà.exe";
RegSetValueExA(HKEY_CURRENT_USER, "MyProgram", 0, REG_SZ, myPath, sizeof(myPath)); // error: cannot convert argument 5 from WCHAR* to BYTE*
And please don't tell me I should convert WCHAR to BYTE because characters such as é and â can't be stored as 8 bit characters.
I'm sure this is possible because I tried opening regedit and adding a new key with value C:\\éâäà\\éâäà.exe and it worked. I wonder how other programs can add themselves to startup on a Russian or Chinese computer.
Is there another way to do so? Or is there a way to format wide character path using wildcards?
Edit: The Unicode version of the function RegSetValueExW() only changes the type of the second argument.

You are calling RegSetValueExA() when you should be calling RegSetValueExW() instead. But in either case, RegSetValueEx() writes bytes, not characters, that is why the lpData parameter is declared as BYTE*. Simply type-cast your character array. The REG_SZ value in the dwType parameter will let RegSetValueEx() know that the bytes represent a Unicode string. And make sure to include the null terminator in the value that you pass to the cbData parameter, per the documentation:
cbData [in]
The size of the information pointed to by the lpData parameter, in bytes. If the data is of type REG_SZ, REG_EXPAND_SZ, or REG_MULTI_SZ, cbData must include the size of the terminating null character or characters.
For example:
WCHAR myPath[] = L"C:\\éâäà\\éâäà.exe";
RegSetValueExW(HKEY_CURRENT_USER, L"MyProgram", 0, REG_SZ, (LPBYTE)myPath, sizeof(myPath));
Or:
LPCWSTR myPath = L"C:\\éâäà\\éâäà.exe";
RegSetValueExW(HKEY_CURRENT_USER, L"MyProgram", 0, REG_SZ, (const BYTE*)myPath, (lstrlenW(myPath) + 1) * sizeof(WCHAR));
That being said, you should not be writing values to the root of HKEY_CURRENT_USER itself. You should be writing to a subkey instead, eg:
WCHAR myPath[] = L"C:\\éâäà\\éâäà.exe";
HKEY hKey;
if (RegCreateKeyExW(HKEY_CURRENT_USER, L"Software\\MyProgram", 0, NULL, REG_OPTION_NON_VOLATILE, KEY_SET_VALUE, NULL, &hKey, NULL) == ERROR_SUCCESS)
{
    RegSetValueExW(hKey, L"MyValue", 0, REG_SZ, (LPBYTE)myPath, sizeof(myPath));
    RegCloseKey(hKey);
}
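The two ways of computing cbData shown above (sizeof on an array, or length plus one times the character size on a pointer) should give the same number. A small portable sketch of the pointer variant follows; the helper name reg_sz_byte_size is hypothetical, and note that wchar_t is 2 bytes on Windows but typically 4 bytes elsewhere, so the absolute values differ by platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Hypothetical helper: size in bytes of a null-terminated wide string,
// INCLUDING its terminator, which is what cbData expects for REG_SZ.
std::size_t reg_sz_byte_size(const wchar_t* text)
{
    assert(text != nullptr);
    return (std::wcslen(text) + 1) * sizeof(wchar_t);
}
```

For a WCHAR array such as myPath, reg_sz_byte_size(myPath) equals sizeof(myPath); for a pointer, sizeof would only give the pointer size, so the computed form is the one to use.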

It seems to me you're trying to use the narrow (non-wide-char) version of that function, which only supports ANSI strings in the current code page. How about trying RegSetValueExW? Maybe you should also look up how the Windows API tries to support ANSI and Unicode as transparently as possible.

Edit: The Unicode version of the function RegSetValueExW() only changes the type of the second argument.
No it does not.
REG_SZ: A null-terminated string. This will be either a Unicode or an ANSI string, depending on whether you use the Unicode or ANSI functions.
From here:
https://learn.microsoft.com/en-us/windows/win32/sysinfo/registry-value-types

Number of bytes of CString in C++

I have a Unicode string stored in a CString and I need to know the number of bytes this string takes in UTF-8 encoding. I know CString has a method GetLength(), but that returns the number of characters, not bytes.
I tried (among other things) converting to a char array, but I get (logically, I guess) only an array of wchar_t, so this doesn't solve my problem.
To be clear about my goal: for the input "aaa", say, I want "3" as output (since "a" takes one byte in UTF-8). But for the input "āaa", I'd like to see "4" (since ā is a two-byte character).
I think this has to be a quite common request, but even after 1.5 hours of searching and experimenting, I couldn't find the correct solution.
I have very little experience with Windows programming, so maybe I left out some crucial information. If you feel like that, please let me know, I'll add any information you request.

As your CString contains a series of wchar_t, you can just use WideCharToMultiByte with the output code page set to CP_UTF8. The function returns the number of bytes written to the output buffer, i.e. the length of the UTF-8 encoded string:
LPWSTR instr;
char outstr[MAX_OUTSTR_SIZE];
int utf8_len = WideCharToMultiByte(CP_UTF8, 0, instr, -1, outstr, MAX_OUTSTR_SIZE, NULL, NULL);
If you don't need the output string, you can simply set the output buffer size to 0:
cbMultiByte
Size, in bytes, of the buffer indicated by lpMultiByteStr. If this parameter is set to 0, the function returns the required buffer size for lpMultiByteStr and makes no use of the output parameter itself.
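In that case the function returns the number of bytes in UTF-8 without actually writing anything. Note that because -1 is passed as the input length, the returned count includes the terminating null byte; subtract 1, or pass the string's length in wchar_t units instead of -1, to get the count without it: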
int utf8_len = WideCharToMultiByte(CP_UTF8, 0, instr, -1, NULL, 0, NULL, NULL);
If your CString is really CStringA, i.e. _UNICODE is not defined, then you need to use MultiByteToWideChar to convert the string to UTF-16 first, and then convert from UTF-16 to UTF-8 with WideCharToMultiByte. See "How do I convert an ANSI string directly to UTF-8?". But new code should never be compiled without Unicode support anyway.
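To check the numbers from the question ("aaa" is 3 bytes, "āaa" is 4) without the Windows API, the UTF-8 length can also be computed by hand from the UTF-16 code units. This is only a portable sketch: the function name utf8_length is hypothetical, it uses std::u16string rather than CString, and counting an unpaired surrogate as 3 bytes (U+FFFD) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts the bytes a UTF-16 string occupies once encoded as UTF-8,
// excluding any terminator.
// Assumption of this sketch: an unpaired surrogate counts as U+FFFD (3 bytes).
std::size_t utf8_length(const std::u16string& s)
{
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (c < 0x80) {
            bytes += 1;                        // ASCII
        } else if (c < 0x800) {
            bytes += 2;                        // e.g. U+0101 (a with macron)
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.size()
                   && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            bytes += 4;                        // surrogate pair
            ++i;                               // skip the low surrogate
        } else {
            bytes += 3;                        // rest of the BMP
        }
    }
    // Sanity check: UTF-8 never needs fewer bytes than there are UTF-16 code units
    assert(bytes >= s.size());
    return bytes;
}
```

With these rules, utf8_length(u"aaa") is 3 and utf8_length(u"\u0101aa") is 4, matching the expectation in the question.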

FILE_NOTIFY_INFORMATION doesn't support Utf-8 file name

I am trying to watch a folder changes and notify the added filename so here is my code
bool FileWatcher::NotifyChange()
{
    // Read the asynchronous result of the previous call to ReadDirectory
    DWORD dwNumberbytes;
    GetOverlappedResult(hDir, &overl, &dwNumberbytes, FALSE);
    // Browse the list of FILE_NOTIFY_INFORMATION entries
    FILE_NOTIFY_INFORMATION *pFileNotify = (FILE_NOTIFY_INFORMATION *)buffer[curBuffer];
    // Switch the 2 buffers
    curBuffer = (curBuffer + 1) % (sizeof(buffer)/(sizeof(buffer[0])));
    SecureZeroMemory(buffer[curBuffer], sizeof(buffer[curBuffer]));
    // start a new asynchronous call to ReadDirectory in the alternate buffer
    ReadDirectoryChangesW(
        hDir, /* handle to directory */
        &buffer[curBuffer], /* read results buffer */
        sizeof(buffer[curBuffer]), /* length of buffer */
        FALSE, /* monitoring option */
        FILE_NOTIFY_CHANGE_FILE_NAME ,
        //FILE_NOTIFY_CHANGE_LAST_WRITE, /* filter conditions */
        NULL, /* bytes returned */
        &overl, /* overlapped buffer */
        NULL); /* completion routine */
    for (;;) {
        if (pFileNotify->Action == FILE_ACTION_ADDED)
        {
            qDebug()<<"in NotifyChange if ";
            char szAction[42];
            char szFilename[MAX_PATH] ;
            memset(szFilename,'\0',sizeof( szFilename));
            strcpy(szAction,"added");
            wcstombs( szFilename, pFileNotify->FileName, MAX_PATH);
            qDebug()<<"pFileNotify->FileName : "<<QString::fromWCharArray(pFileNotify->FileName)<<"\nszFilename : "<<QString(szFilename);
        }
        // step to the next entry if there is one
        if (!pFileNotify->NextEntryOffset)
            return false;
        pFileNotify = (FILE_NOTIFY_INFORMATION *)((PBYTE)pFileNotify + pFileNotify->NextEntryOffset);
    }
    pFileNotify=NULL;
    return true;
}
It works fine unless a file with an Arabic name is added, in which case I get
pFileNotify->FileName : "??? ???????.txt"
szFilename : ""
How can I support UTF-8 file names?
Any ideas, please?

Apart from FILE_NOTIFY_INFORMATION::FileName not being null-terminated, there's nothing wrong with it.
FileName:
A variable-length field that contains the file name relative to the directory handle. The file name is in the Unicode character format and is not null-terminated.
If there is both a short and long name for the file, the function will return one of these names, but it is unspecified which one.
FileNameLength: The size of the file name portion of the record, in bytes. Note that this value does not include the terminating null character.
You'll have to use FILE_NOTIFY_INFORMATION::FileNameLength / sizeof(WCHAR) to get the length of the string in wchars pointed to by FileName. So in your case, the proper way would be:
size_t cchFileNameLength = pFileNotify->FileNameLength / sizeof(WCHAR);
QString::fromWCharArray( pFileNotify->FileName, cchFileNameLength );
If you need to use a function that expects the string to be null-terminated (like wcstombs) you'd have to allocate a temporary buffer with the size of FILE_NOTIFY_INFORMATION::FileNameLength + sizeof(WCHAR) and null-terminate it yourself.
As for the empty szFilename and the question marks, that's just the result of converting a UTF-16 (NTFS) filename that contains unconvertible characters to ANSI. If no conversion is possible, wcstombs returns an error, and QDebug converts any unconvertible character to ?.
If wcstombs encounters a wide character it cannot convert to a multibyte character, it returns –1 cast to type size_t and sets errno to EILSEQ.
So if you need to support Unicode filenames, do not convert them to ANSI; handle them exclusively with functions that support Unicode.
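The length handling described above can be sketched in portable C++ as follows. The helper name_from_counted is hypothetical; its two parameters stand in for the FileName and FileNameLength fields, and the result is a std::wstring that owns a properly terminated copy of the name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: copies a wide-character name that is NOT
// null-terminated and whose length is given in BYTES (as FileNameLength is).
std::wstring name_from_counted(const wchar_t* name, std::size_t lengthInBytes)
{
    // A byte length that is not a multiple of the character size would be malformed
    assert(lengthInBytes % sizeof(wchar_t) == 0);
    const std::size_t cch = lengthInBytes / sizeof(wchar_t); // bytes -> characters
    // The (pointer, count) constructor copies exactly cch characters.
    return cch ? std::wstring(name, cch) : std::wstring();
}
```

The returned std::wstring is null-terminated via c_str(), so it could be passed to functions that require a terminator without touching the notification buffer itself.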
