Converting a string of multibyte characters to widechar's gives unexpected results - c++

I'm trying to read a web-page in UTF-8 encoding using WinInet library.
Here's some of my code:
HINTERNET hUrl = ::InternetOpenUrl(hInet, wurl.c_str(),NULL,NULL,NULL,NULL);
CHAR buffer[65536];
std::wstring full_content;
std::wstring read_content;
DWORD number_of_bytes_read=1;
while(number_of_bytes_read)
{
::InternetReadFile(hUrl, buffer, 65536, &number_of_bytes_read);
// ::InternetReadFileExW(hUrl, &buffersw, IRF_SYNC,NULL);
//((hUrl,buffer,65536,&number_of_bytes_read);
read_content.resize(number_of_bytes_read);
::MultiByteToWideChar(CP_ACP,MB_COMPOSITE,
&buffer[0],number_of_bytes_read,
&read_content[0],number_of_bytes_read);
full_content.append(read_content);
//readed_content.append(buffer,number_of_bytes_read);
}
I correctly see the english symbols, but instead of russian symbols I see a trash. What can it be?
Thanks in advance.

Your web page is UTF-8 and yet you decode it using ANSI code page (CP_ACP). Use CP_UTF8 instead

Change CP_ACP to CP_UTF8 and MB_COMPOSITE to 0
From the docs
For UTF-8 or code page 54936 (GB18030, starting with Windows Vista), dwFlags must be set to either 0 or MB_ERR_INVALID_CHARS. Otherwise, the function fails with ERROR_INVALID_FLAGS.

Do not convert at all. Keep it UTF-8 in memory. Convert to UTF-16 only when interacting with Windows API functions.
More info on this approach in http://utf8everywhere.org.

Related

Convert Japanese wstring to std::string

Can anyone suggest a good method to convert a Japanese std::wstring to std::string?
I used the below code. Japanese strings are not converting properly on an English OS.
std::string WstringTostring(std::wstring str)
{
size_t size = 0;
_locale_t lc = _create_locale(LC_ALL, "ja.JP.utf8");
errno_t err = _wcstombs_s_l(&size, NULL, 0, &str[0], _TRUNCATE, lc);
std::string ret = std::string(size, 0);
err = _wcstombs_s_l(&size, &ret[0], size, &str[0], _TRUNCATE, lc);
_free_locale(lc);
ret.resize(size-1);
return ret;
}
The wstring is "C\\files\\ブ種別.pdf".
The converted string is "C:\\files\\ブ種別.pdf".
It actually looks right to me.
That is the UTF-8-encoded version of your input (which presumably was UTF-16 before conversion), but shown in its ASCII-decoded form due to a mistake somewhere in your toolchain.
You just need to calibrate your file/terminal/display to render text output as if it were UTF-8 (which it is).
Also, remember that std::string is just a container of bytes, and does not inherently specify or imply any particular encoding. So your question is rather "how can I convert UTF-16 (containing Japanese characters) into UTF-8 in Windows" or, as it turns out, "how do I configure my terminal to display UTF-8?".
If your display for this string is the Visual Studio locals window (which you suggest is the case with your comment "I observed the value of the "ret" string in local window while debugging") you are out of luck, because VS has no idea what encoding your string is in (nor does it attempt to find out).
For other aspects of Visual Studio, though, such as the console output window, there are various approaches to work around this (example).
EDIT: some things first. Windows has the notion of the ANSI codepage. It's the default codepage of non-Unicode strings that Windows assumes. Every program that uses non-Unicode versions of Windows API, and doesn't specify the codepage explicitly, uses the ANSI codepage.
The ANSI codepage is driven by the "System default locale" setting in Control Panel. As of Windows 10 May 2020, it's under Region/Administrative/Change system locale. It takes admin rights to change that setting.
By default, Windows with the system default locale set to English uses codepage 1252 as the ANSI codepage. That codepage doesn't contain the Japanese characters. So using Japanese in Unicode unaware programs in that situation is hard or impossible.
It looks like the OP wants or has to use a piece of third part C++ code that uses multibyte strings (std::string and/or char*). That doesn't necessarily mean that it's Unicode unaware, but it might. What the OP is trying to do entirely depends on the way that third party library is coded. It might not be possible at all.
Looks like your problem is that some piece of third party code expects a file name in ANSI, and uses ANSI functions to open that file. In an English system with the default value of the system locale, Japanese can't be converted to ANSI, because the ANSI codepage (CP1252 in practice) doesn't contain the Japanese characters.
What I think you should do, you should get a short file name instead using GetShortPathNameW, convert that file path to ANSI, and pass that string. Like this:
std::string WstringFilenameTostring(std::wstring str)
{
wchar_t ShortPath[MAX_PATH+1];
DWORD dw = GetShortPathNameW(str.c_str(), ShortPath, _countof(ShortPath));
char AnsiPath[MAX_PATH+1];
int n = WideCharToMultiByte(CP_ACP, 0, ShortPath, -1, AnsiPath, _countof(AnsiPath), 0, 0);
return string(AnsiPath);
}
This code is for filenames only. For any other Japanese string, it will return nonsense. In my test, it converted "日本語.txt" to something unreadable but alphanumeric :)

Can't display 'ä' in GLFW window title

glfwSetWindowTitle(win, "Nämen");
Becomes "N?men", where '?' is in a little black, twisted square, indicating that the character could not be displayed.
How do I display 'ä'?
If you want to use non-ASCII letters in the window title, then the string has to be utf-8 encoded.
GLFW: Window title:
The window title is a regular C string using the UTF-8 encoding. This means for example that, as long as your source file is encoded as UTF-8, you can use any Unicode characters.
If you see a little black, twisted square then this indicates that the ä is encoded with some iso encoding that is not UTF-8, maybe something like latin1. To fix this you need to open it in the editor in which you can change the encoding of the file, change it to uft-8 (without BOM) and fix the ä in the title.
It seems like the GLFW implementation does not work according to the specification in this case. Probably the function still uses Latin-1 instead of UTF-8.
I had the same problem on GLFW 3.3 Windows 64 bit precompiled binaries and fixed it like this:
SetWindowTextA(glfwGetWin32Window(win),"Nämen")
The issue does not lie within GLFW but within the compiler. Encodings are handled by the major compilers as follows:
Good guy clang assumes that every file is encoded in UTF-8
Trusty gcc checks the system's settings1 and falls back on UTF-8, when it fails to determine one.
MSVC checks for BOM and uses the detected encoding; otherwise it assumes that file is encoded using the users current code page2.
You can determine your current code page by simply running chcp in your (Windows) console or PowerShell. For me, on a fresh install of Windows 11, it yields "850" (language: English; keyboard: German), which stands for Code Page 850.
To fix this issue you have several solutions:
Change your systems code page. Arguably the worst solution.
Prefix your strings with u8, escape all unicode literals and convert the string to wide char before passing Win32 functions; e.g.:
const char* title = u8"\u0421\u043b\u0430\u0432\u0430\u0020\u0423\u043a\u0440\u0430\u0457\u043d\u0456\u0021";
// This conversion is actually performed by GLFW; see footnote ^3
const int l = MultiByteToWideChar(CP_UTF8, 0, title, -1, NULL, 0);
wchar_t* buf = _malloca(l * sizeof(wchar_t));
MultiByteToWideChar(CP_UTF8, 0, title, -1, buf, l);
SetWindowTextW(hWnd, buf);
_freea(buf);
Save your source files with UTF-8 encoding WITH BOM. This allows you to write your strings without having to escape them. You'll still need to convert the string to a wide char string using the method seen above.
Specify the /utf-8 flag when compiling; this has the same effect as the previous solution, but you don't need the BOM anymore.
The solutions stated above still require you convert your good and nice string to a big chunky wide string.
Another option would be to provide a manifest4 with the activeCodePage set to UTF-8. This way all Win32 functions with the A-suffix (e.g. SetWindowTextA) now accept and properly handle UTF-8 strings, if the running system is at least or newer than Windows Version 1903.
TL;DR
Compile your application with the /utf-8 flag active.
IMPORTANT: This works for the Win32 APIs. This doesn't let you magically write Unicode emojis to the console like a hipster JS developer.
1 I suppose it reads the LC_ALL setting on linux. In the last 6 years I have never seen a distribution, that does NOT specify UTF-8. However, take this information with a grain of salt; I might be entirely wrong on how gcc handles this now.
2 If no byte-order mark is found, it assumes that the source file is encoded in the current user code page [...].
3 GLFW performs the conversion as seen here.
4 More about Win32 ANSI-APIs, Manifest and UTF-8 can be found here.

Get Strings in right encoding in c++

I asked a similar question before.
But I am still in trouble with encodings in c++.
I try to describe the problem as well as possible.
I have a c++ client, communicating with an c# service over TCP.
Now I need to display the Messages from the service in an Messagebox (Win32 API).
The Bytes, sended by the c# service are UTF-8 encoded.
Important to know, the c++ client will only be running on Windows Systems.
This is the code to receive the bytes and to display the Text:
char buffer[1024];
int receivedBytes = recv(socketHandle, buffer, sizeof(buffer) - 1, 0);
char str[receivedBytes];
for (int index = 0; index < receivedBytes; index++)
{
str[index] = buffer[index];
}
MessageBox(mainWindow, (LPCTSTR)str, (LPCTSTR) "Fehler", MB_OK|MB_ICONERROR);
If the Text contains chatacters like üäö, they are not shown in the Messagebox the correct way.
What can I do to receive the message as UTF-8 String in c++?
Is there a possibility to convert the char[] to an UTF-8 String?
Thx for helping
Tobi
If you want to display unicode characters in Windows, you need to translate UTF8 string into UTF16 (older UCS2), as this is the unicode standard Windows handles. You do that with MultiByteToWideChar function.
Also make sure that the #define UNICODE is set before you include Windows headers, so that MessageBox points to MessageBoxW or use MessageBoxW explicitly.

Storing and retrieving UTF-8 strings from Windows resource (RC) files

I created an RC file which contains a string table, I would like to use some special
characters: ö ü ó ú ő ű á é. so I save the string with UTF-8 encoding.
But when I call in my cpp file, something like this:
LoadString("hu.dll", 12, nn, MAX_PATH);
I get a weird result:
How do I solve this problem?
As others have pointed out in the comments, the Windows APIs do not provide direct support for UTF-8 encoded text. You cannot pass the MessageBox function UTF-8 encoded strings and get the output that you expect. It will, instead, interpret them as characters in your local code page.
To get a UTF-8 string to pass to the Windows API functions (including MessageBox), you need to use the MultiByteToWideChar function to convert from UTF-8 to UTF-16 (what Windows calls Unicode, or wide strings). Passing the CP_UTF8 flag for the first parameter is the magic that enables this conversion. Example:
std::wstring ConvertUTF8ToUTF16String(const char* pszUtf8String)
{
// Determine the size required for the destination buffer.
const int length = MultiByteToWideChar(CP_UTF8,
0, // no flags required
pszUtf8String,
-1, // automatically determine length
nullptr,
0);
// Allocate a buffer of the appropriate length.
std::wstring utf16String(length, L'\0');
// Call the function again to do the conversion.
if (!MultiByteToWideChar(CP_UTF8,
0,
pszUtf8String,
-1,
&utf16String[0],
length))
{
// Uh-oh! Something went wrong.
// Handle the failure condition, perhaps by throwing an exception.
// Call the GetLastError() function for additional error information.
throw std::runtime_error("The MultiByteToWideChar function failed");
}
// Return the converted UTF-16 string.
return utf16String;
}
Then, once you have a wide string, you will explicitly call the wide-string variant of the MessageBox function, MessageBoxW.
However, if you only need to support Windows and not other platforms that use UTF-8 everywhere, you will probably have a much easier time sticking exclusively with UTF-16 encoded strings. This is the native Unicode encoding that Windows uses, and you can pass these types of strings directly to any of the Windows API functions. See my answer here to learn more about the interaction between Windows API functions and strings. I recommend the same thing to you as I did to the other guy:
Stick with wchar_t and std::wstring for your characters and strings, respectively.
Always call the W variants of Windows API functions, including LoadStringW and MessageBoxW.
Ensure that the UNICODE and _UNICODE macros are defined either before you include any of the Windows headers or in your project's build settings.

to unicode or not to unicode

I'm getting a value from the registry. This value might have double byte characters in it.
I will later have to transfer this across the network to a C# client to display. C# is all unicode.
The function returns MBCS if you call it non-unicode.
What should I use?
string result = string(cbData);
RegQueryValueExA(h_sub_key, "DisplayName", NULL, NULL, (LPBYTE) &result[0], &cbData)
or
string result = string(cbData);
RegQueryValueExW(h_sub_key, L"DisplayName", NULL, NULL, (LPBYTE) &result[0], &cbData)
Using Unicode whenever possible will make your life easier. The registry contains Unicode natively and converts to MBCS on the fly when you use ReqQueryValueExA, why would you want to do an unneeded conversion?
Converting to UTF-8 from UTF-16 might make sense for information going over the network, but if you control both ends of the connection it wouldn't be necessary.
No, that's not the way it works. The string you get back from the first snippet is encoded according to the current system code page. Could be a double-byte encoding. Could be anything. Big problem of course, the C# code on the other end of that Internet connection has no way to guess what the code page might be.
So do not use the first snippet. The second one gets you the string in utf16, the native encoding used in Windows, result needs to be an std::wstring. Also the encoding used by C# so you could send the binary string. Although that's not typically a good idea, xml is popular. It is up to you to set the wire format.