Storing and retrieving UTF-8 strings from Windows resource (RC) files

I created an RC file that contains a string table, and I would like to use some special
characters: ö ü ó ú ő ű á é. So I saved the file with UTF-8 encoding.
But when I call something like this in my cpp file:
LoadString("hu.dll", 12, nn, MAX_PATH);
I get a weird result:
How do I solve this problem?

As others have pointed out in the comments, the Windows APIs do not provide direct support for UTF-8 encoded text. You cannot pass the MessageBox function UTF-8 encoded strings and get the output that you expect. It will, instead, interpret them as characters in your local code page.
To get a UTF-8 string to pass to the Windows API functions (including MessageBox), you need to use the MultiByteToWideChar function to convert from UTF-8 to UTF-16 (what Windows calls Unicode, or wide strings). Passing the CP_UTF8 flag for the first parameter is the magic that enables this conversion. Example:
std::wstring ConvertUTF8ToUTF16String(const char* pszUtf8String)
{
// Determine the size required for the destination buffer.
const int length = MultiByteToWideChar(CP_UTF8,
0, // no flags required
pszUtf8String,
-1, // automatically determine length
nullptr,
0);
// Allocate a buffer of the appropriate length.
std::wstring utf16String(length, L'\0');
// Call the function again to do the conversion.
if (!MultiByteToWideChar(CP_UTF8,
0,
pszUtf8String,
-1,
&utf16String[0],
length))
{
// Uh-oh! Something went wrong.
// Handle the failure condition, perhaps by throwing an exception.
// Call the GetLastError() function for additional error information.
throw std::runtime_error("The MultiByteToWideChar function failed");
}
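// The length reported for a null-terminated input includes the terminator,
// so drop the trailing null that was written into the buffer.
utf16String.resize(length - 1);
// Return the converted UTF-16 string.
return utf16String;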
}
Then, once you have a wide string, you will explicitly call the wide-string variant of the MessageBox function, MessageBoxW.
However, if you only need to support Windows and not other platforms that use UTF-8 everywhere, you will probably have a much easier time sticking exclusively with UTF-16 encoded strings. This is the native Unicode encoding that Windows uses, and you can pass these types of strings directly to any of the Windows API functions. See my answer here to learn more about the interaction between Windows API functions and strings. I recommend the same thing to you as I did to the other guy:
Stick with wchar_t and std::wstring for your characters and strings, respectively.
Always call the W variants of Windows API functions, including LoadStringW and MessageBoxW.
Ensure that the UNICODE and _UNICODE macros are defined either before you include any of the Windows headers or in your project's build settings.
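To make the UTF-8 to UTF-16 step less of a black box, here is a minimal portable sketch of what such a conversion does with the bytes. It is not how MultiByteToWideChar is implemented and is not a replacement for it; DecodeUtf8ToUtf16 is a hypothetical helper name, and the sketch assumes reasonably well-formed input (it rejects truncated sequences and bad continuation bytes, but does not reject overlong encodings).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a Windows API): decode UTF-8 bytes into UTF-16 code units.
std::u16string DecodeUtf8ToUtf16(const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes after the lead byte
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > utf8.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F); // each continuation byte carries 6 bits
        }
        i += extra + 1;

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("code point outside the Unicode scalar range");

        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            cp -= 0x10000; // characters beyond the BMP become a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For instance, the two UTF-8 bytes C5 91 (the letter ő from the question) come out as the single UTF-16 code unit 0x0151. On Windows itself, the API call shown above remains the sensible choice.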

Related

Convert Japanese wstring to std::string

Can anyone suggest a good method to convert a Japanese std::wstring to std::string?
I used the below code. Japanese strings are not converting properly on an English OS.
std::string WstringTostring(std::wstring str)
{
size_t size = 0;
_locale_t lc = _create_locale(LC_ALL, "ja.JP.utf8");
errno_t err = _wcstombs_s_l(&size, NULL, 0, &str[0], _TRUNCATE, lc);
std::string ret = std::string(size, 0);
err = _wcstombs_s_l(&size, &ret[0], size, &str[0], _TRUNCATE, lc);
_free_locale(lc);
ret.resize(size-1);
return ret;
}
The wstring is "C:\\files\\ブ種別.pdf".
The converted string is "C:\\files\\ãƒ–ç¨®åˆ¥.pdf".

It actually looks right to me.
That is the UTF-8-encoded version of your input (which presumably was UTF-16 before conversion), but shown as if each byte were a character in a single-byte code page such as Windows-1252, due to a mistake somewhere in your toolchain.
You just need to calibrate your file/terminal/display to render text output as if it were UTF-8 (which it is).
Also, remember that std::string is just a container of bytes, and does not inherently specify or imply any particular encoding. So your question is rather "how can I convert UTF-16 (containing Japanese characters) into UTF-8 in Windows" or, as it turns out, "how do I configure my terminal to display UTF-8?".
If your display for this string is the Visual Studio locals window (which you suggest is the case with your comment "I observed the value of the "ret" string in local window while debugging") you are out of luck, because VS has no idea what encoding your string is in (nor does it attempt to find out).
For other aspects of Visual Studio, though, such as the console output window, there are various approaches to work around this (example).
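To see why the converted string looks the way it does, here is a small sketch that deliberately misreads bytes as Windows-1252, one byte per character. It assumes the viewer in the question uses code page 1252 (the question does not say so); DecodeAsWindows1252 is a hypothetical helper name.

```cpp
#include <string>

// Hypothetical helper: interpret each byte as a Windows-1252 character and
// return the resulting Unicode code points.
std::u32string DecodeAsWindows1252(const std::string& bytes)
{
    // Windows-1252 differs from Latin-1 only in the range 0x80..0x9F.
    // Bytes with no assigned character (0x81, 0x8D, 0x8F, 0x90, 0x9D) are
    // passed through here as the C1 control code with the same value.
    static const char32_t kHigh[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
    };
    std::u32string out;
    for (const char ch : bytes)
    {
        const unsigned char b = static_cast<unsigned char>(ch);
        out.push_back((b >= 0x80 && b <= 0x9F) ? kHigh[b - 0x80]
                                               : static_cast<char32_t>(b));
    }
    return out;
}
```

The UTF-8 encoding of ブ is the three bytes E3 83 96. Read as Windows-1252, they become U+00E3 (ã), U+0192 (ƒ) and U+2013 (–), which is exactly the "ãƒ–" in the question. The bytes are correct; only the interpretation is wrong.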

EDIT: some things first. Windows has the notion of the ANSI codepage. It's the default codepage of non-Unicode strings that Windows assumes. Every program that uses non-Unicode versions of Windows API, and doesn't specify the codepage explicitly, uses the ANSI codepage.
The ANSI codepage is driven by the "System default locale" setting in Control Panel. As of Windows 10 May 2020, it's under Region/Administrative/Change system locale. It takes admin rights to change that setting.
By default, Windows with the system default locale set to English uses codepage 1252 as the ANSI codepage. That codepage doesn't contain the Japanese characters. So using Japanese in Unicode unaware programs in that situation is hard or impossible.
It looks like the OP wants or has to use a piece of third-party C++ code that uses multibyte strings (std::string and/or char*). That doesn't necessarily mean that it's Unicode unaware, but it might be. What the OP is trying to do depends entirely on the way that third-party library is coded. It might not be possible at all.
Looks like your problem is that some piece of third-party code expects a file name in ANSI, and uses ANSI functions to open that file. In an English system with the default value of the system locale, Japanese can't be converted to ANSI, because the ANSI codepage (CP1252 in practice) doesn't contain the Japanese characters.
What I think you should do is get a short file name using GetShortPathNameW, convert that file path to ANSI, and pass that string. Like this:
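std::string WstringFilenameTostring(std::wstring str)
{
// Ask for the 8.3 short path. This only works if the file exists and
// short names are enabled on the volume.
wchar_t ShortPath[MAX_PATH+1];
DWORD dw = GetShortPathNameW(str.c_str(), ShortPath, _countof(ShortPath));
if (dw == 0 || dw >= _countof(ShortPath))
return std::string(); // the call failed, or the buffer was too small
// Convert the short path to the ANSI code page.
char AnsiPath[MAX_PATH+1];
int n = WideCharToMultiByte(CP_ACP, 0, ShortPath, -1, AnsiPath, _countof(AnsiPath), 0, 0);
if (n == 0)
return std::string(); // the conversion failed
return std::string(AnsiPath);
}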
This code is for filenames only. For any other Japanese string, it will return nonsense. In my test, it converted "日本語.txt" to something unreadable but alphanumeric :)

Printing em-dash to console window using printf? [duplicate]

This question already has answers here:
Is it possible to cout an EM DASH on Linux and Windows? [duplicate]
(2 answers)
Closed 5 years ago.
A simple problem: I'm writing a chatroom program in C++ (but it's primarily C-style) for a class, and I'm trying to print, “#help — display a list of commands...” to the output window. While I could use two hyphens (--) to achieve roughly the same effect, I'd rather use an em-dash (—). printf(), however, doesn't seem to support printing em-dashes. Instead, the console just prints out the character, ù, in its place, despite the fact that entering em-dashes directly into the prompt works fine.
How do I get this simple Unicode character to show up?
Looking at Windows alt key codes, I find it interesting how alt+0151 is "—" and alt+151 is "ù". Is this related to my problem, or a simple coincidence?

Windows is a Unicode (UTF-16) system, and so is the console. If you want to print Unicode text, the most effective way is to use WriteConsoleW:
BOOL PrintString(PCWSTR psz)
{
DWORD n;
return WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), psz, (ULONG)wcslen(psz), &n, 0);
}
PrintString(L"—");
In this case the binary file will contain the wide character — (2 bytes, 0x2014), and the console prints it as is.
If an ANSI (multi-byte) function such as WriteConsoleA or WriteFile is used for console output, the console first translates the multi-byte string to Unicode via MultiByteToWideChar, using the code page returned by GetConsoleOutputCP. This translation is where problems can arise if you use characters >= 0x80.
First of all, the compiler can give you a warning: "The file contains a character that cannot be represented in the current code page (number). Save the file in Unicode format to prevent data loss." (C4819). But even after you save the source file in Unicode format, you can get the following:
wprintf(L"ù"); // no warning
printf("ù"); //warning C4566
Because L"ù" is stored as a wide-character string (as is) in the binary file, everything is fine here: no problems and no warning. But "ù" is stored as a char (single-byte) string. The compiler needs to convert the Unicode string "ù" from the source file to a multi-byte string in the binary (the .obj file, from which the linker then creates the PE). For this the compiler uses WideCharToMultiByte with CP_ACP (the current system default Windows ANSI code page).
So what happens if you call, say, printf("ù");?
First, at compile time, the Unicode string "ù" is converted to multi-byte with WideCharToMultiByte(CP_ACP, ...). The resulting multi-byte string is stored in the binary file.
Then, at run time, the console converts your multi-byte string to wide characters with MultiByteToWideChar(GetConsoleOutputCP(), ...) and prints that string.
So you get 2 conversions: unicode -> CP_ACP -> multi-byte -> GetConsoleOutputCP() -> unicode
By default GetConsoleOutputCP() == CP_OEMCP != CP_ACP, even if you run the program on the computer where you compiled it (and especially on another computer with a different CP_OEMCP).
The problem is incompatible conversions: different code pages are used. But even if you change the console code page to your CP_ACP, the conversion can still translate some characters incorrectly.
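As an illustration of the mismatch, here is a sketch that looks up the same byte in two code pages. It assumes the ANSI code page is 1252 and the OEM (console) code page is 437, which is common on US English systems but is not stated in the question. The function names are hypothetical and the tables are deliberately partial, covering only bytes 0x90..0x97.

```cpp
// Partial lookup tables covering only bytes 0x90..0x97. Any other byte
// returns 0, meaning "not covered by this sketch".

// Windows-1252 (assumed ANSI code page). 0x90 is unassigned in 1252 and is
// shown here as the C1 control code with the same value.
char32_t Cp1252ToUnicode(unsigned char b)
{
    static const char32_t kTable[8] = {
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014 };
    return (b >= 0x90 && b <= 0x97) ? kTable[b - 0x90] : 0;
}

// Code page 437 (assumed OEM/console code page).
char32_t Cp437ToUnicode(unsigned char b)
{
    static const char32_t kTable[8] = {
        0x00C9, 0x00E6, 0x00C6, 0x00F4, 0x00F6, 0x00F2, 0x00FB, 0x00F9 };
    return (b >= 0x90 && b <= 0x97) ? kTable[b - 0x90] : 0;
}
```

Byte 0x97 (decimal 151) is U+2014, the em-dash, in code page 1252, but U+00F9, ù, in code page 437. Under these assumptions, the compiler stores the em-dash as 0x97 and the console then reads 0x97 back as ù. That would also account for the Alt+0151 versus Alt+151 observation in the question.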
As for the CRT function wprintf, the situation is as follows:
wprintf first converts the given string from Unicode to multi-byte using its internal current locale (note that the CRT locale is independent of, and can differ from, the console code page). It then calls WriteFile with the multi-byte string, and the console converts this multi-byte string back to Unicode:
unicode -> current_crt_locale -> multi-byte -> GetConsoleOutputCP() -> unicode
So to use wprintf we first need to set the current CRT locale to GetConsoleOutputCP():
char sz[16];
sprintf(sz, ".%u", GetConsoleOutputCP());
setlocale(LC_ALL, sz);
wprintf(L"—");
Even so, on my computer I see - on screen instead of —. So the output will be -— if PrintString(L"—"); (which uses WriteConsoleW) is called right after this.
So the only reliable way to print any Unicode character (supported by Windows) is to use the WriteConsoleW API.

After going through the comments, I've found eryksun's solution to be the simplest (...and the most comprehensible):
#include <stdio.h>
#include <io.h>
#include <fcntl.h>
int main()
{
//other stuff
_setmode(_fileno(stdout), _O_U16TEXT);
wprintf(L"#help — display a list of commands...");
return 0;
}
Portability isn't a concern of mine, and this solves my initial problem—no more ù—my beloved em-dash is on display.
I acknowledge this question is essentially a duplicate of the one linked by sata300.de. Albeit, with printf in the place of cout, and unnecessary ramblings in the place of relevant information.

Convert wide CString to char*

There are lots of times this question has been asked and as many answers - none of which work for me and, it seems, many others. The question is about wide CStrings and 8bit chars under MFC. We all want an answer that will work in ALL cases, not a specific instance.
void Dosomething(CString csFileName)
{
char cLocFileNamestr[1024];
char cIntFileNamestr[1024];
// Convert from whatever version of CString is supplied
// to an 8 bit char string
cIntFileNamestr = ConvertCStochar(csFileName);
sprintf_s(cLocFileNamestr, "%s_%s", cIntFileNamestr, "pling.txt" );
m_KFile = fopen(LocFileNamestr, "wt");
}
This is an addition to existing code (by somebody else) for debugging.
I don't want to change the function signature, it is used in many places.
I cannot change the signature of sprintf_s, it is a library function.

You are leaving out a lot of details, or ignoring them. If you are building with UNICODE defined (which it seems you are), then the easiest way to convert to MBCS is like this:
CStringA strAIntFileNameStr = csFileName.GetString(); // uses default code page
CStringA is the 8-bit/MBCS version of CString.
However, characters in the Unicode string that are not in the default code page cannot be represented and will be replaced (typically with '?').
Instead of using fopen(), you could use _wfopen(), which opens a file with a Unicode filename. To build your file name, you would use swprintf_s().

an answer that will work in ALL cases, not a specific instance...
There is no such thing.
It's easy to convert "ABCD..." from wchar_t* to char*, but it doesn't work that way with non-Latin languages.
Stick to CString and wchar_t when your project is Unicode.
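The "easy" case can be sketched as follows. This is only an illustration of why no conversion works in ALL cases: TryNarrowAscii is a hypothetical helper that succeeds only when every character is plain ASCII and refuses everything else, rather than guessing at a code page.

```cpp
#include <string>

// Hypothetical helper: narrow a wide string only if every character is ASCII.
// Returns false (leaving 'out' empty) as soon as a non-ASCII character is seen.
bool TryNarrowAscii(const std::wstring& wide, std::string& out)
{
    out.clear();
    for (const wchar_t wc : wide)
    {
        if (static_cast<unsigned long>(wc) > 0x7F)
        {
            out.clear();
            return false; // no single char can hold this character
        }
        out.push_back(static_cast<char>(wc));
    }
    return true;
}
```

L"ABCD" narrows to "ABCD", while a string such as L"Россия" is rejected. To carry such text in char form you have to pick an encoding explicitly, as the UTF-8 conversion below does.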
If you need to upload data to a web page or something similar, then use CW2A and CA2W for UTF-8 and UTF-16 conversion.
CStringW unicode = L"Россия";
MessageBoxW(0,unicode,L"Russian",0);//should be okay
CStringA utf8 = CW2A(unicode, CP_UTF8);
::MessageBoxA(0,utf8,"format error",0);//WinApi doesn't get UTF-8
char buf[1024];
strcpy(buf, utf8);
::MessageBoxA(0,buf,"format error",0);//same problem
//send this buf to webpage or other utf-8 systems
//this should be compatible with notepad etc.
//text will appear correctly
ofstream f(L"c:\\stuff\\okay.txt");
f.write(buf, strlen(buf));
//convert utf8 back to utf16
unicode = CA2W(buf, CP_UTF8);
::MessageBoxW(0,unicode,L"okay",0);

Can I retrieve a path, containing other than Latin characters?

I call the GetModuleFileName function to retrieve the fully qualified path of a specified module, so that I can launch another .exe in the same folder via the Process::Start method.
However, the .exe cannot be launched when the path contains non-Latin characters (in my case Greek characters).
Is there any way I can fix this?
Code:
TCHAR path[1000];
GetModuleFileName(NULL, path, 1000) ; // Retrieves the fully qualified path for the file that
// contains the specified module.
PathRemoveFileSpec(path); // Removes the trailing file name and backslash from a path (TCHAR).
CHAR mypath[1000];
// Convert TCHAR to CHAR.
wcstombs(mypath, path, wcslen(path) + 1);
// Formatting the string: constructing a string by substituting computed values at various
// places in a constant string.
CHAR mypath2[1000];
sprintf_s(mypath2, "%s\\Client_JoypadCodesApplication.exe", mypath);
String^ result;
result = marshal_as<String^>(mypath2);
Process::Start(result);

Strings in .NET are encoded in UTF-16. The fact that you are calling wcstombs() means your app is compiled for Unicode and TCHAR maps to WCHAR, which is what Windows uses for UTF-16. So there is no need to call wcstombs() at all. Retrieve and format the path as UTF-16, then marshal it as UTF-16. Stop using TCHAR altogether (unless you need to compile for Windows 9x/ME):
WCHAR path[1000];
GetModuleFileNameW(NULL, path, 1000);
PathRemoveFileSpecW(path);
WCHAR mypath[1000];
swprintf_s(mypath, 1000, L"%s\\Client_JoypadCodesApplication.exe", path);
String^ result;
result = marshal_as<String^>(mypath);
Process::Start(result);
A better option would be to use a native .NET solution instead (untested):
String^ path = Application::StartupPath; // directory of the executable; uses GetModuleFileName() internally
// or:
//String^ path = Path::GetDirectoryName(Process::GetCurrentProcess()->MainModule->FileName);
Process::Start(path + L"\\Client_JoypadCodesApplication.exe");

You must use GetModuleFileNameW and store the result in a wchar_t string.
Most Win32 API functions have a "Unicode" variant, which takes/gives UTF-16 strings. Using the ANSI versions is highly discouraged.

_wfopen equivalent under Mac OS X

I'm looking for the equivalent of Windows _wfopen() under Mac OS X. Any ideas?
I need this in order to port a Windows library that uses wchar_t* for its File interface. As this is intended to be a cross-platform library, I am unable to rely on how the client application will get the file path and give it to the library.

The POSIX APIs in Mac OS X accept UTF-8 strings. To convert a wchar_t string to UTF-8, you can use the CoreFoundation framework from Mac OS X.
Here is a class that wraps a UTF-8 string generated from a wchar_t string.
class Utf8
{
public:
Utf8(const wchar_t* wsz): m_utf8(NULL)
{
// OS X uses 32-bit wchar
const int bytes = wcslen(wsz) * sizeof(wchar_t);
// comp_bLittleEndian is in the lib I use in order to detect PowerPC/Intel
CFStringEncoding encoding = comp_bLittleEndian ? kCFStringEncodingUTF32LE
: kCFStringEncodingUTF32BE;
CFStringRef str = CFStringCreateWithBytesNoCopy(NULL,
(const UInt8*)wsz, bytes,
encoding, false,
kCFAllocatorNull
);
const int bytesUtf8 = CFStringGetMaximumSizeOfFileSystemRepresentation(str);
m_utf8 = new char[bytesUtf8];
CFStringGetFileSystemRepresentation(str, m_utf8, bytesUtf8);
CFRelease(str);
}
~Utf8()
{
if( m_utf8 )
{
delete[] m_utf8;
}
}
public:
operator const char*() const { return m_utf8; }
private:
char* m_utf8;
};
Usage:
const wchar_t* wsz = L"Here is some Unicode content: éà€œæ";
const Utf8 utf8(wsz); // direct initialization, so the buffer is never copied
FILE* file = fopen(utf8, "r");
This will work for reading or writing files.

You just want to open a file handle using a path that may contain Unicode characters, right? Just pass the path in filesystem representation to fopen.
If the path came from the stock Mac OS X frameworks (for example, an Open panel whether Carbon or Cocoa), you won't need to do any conversion on it and will be able to use it as-is.
If you're generating part of the path yourself, you should create a CFStringRef from your path and then get that in filesystem representation to pass to POSIX APIs like open or fopen.
Generally speaking, you won't have to do a lot of that for most applications. For example, many applications may have auxiliary data files stored in the user's Application Support directory, but as long as the names of those files are ASCII, and you use standard Mac OS X APIs to locate the user's Application Support directory, you don't need to do a bunch of paranoid conversion of a path constructed with those two components.
Edited to add: I would strongly caution against arbitrarily converting everything to UTF-8 using something like wcstombs because filesystem encoding is not necessarily identical to the generated UTF-8. Mac OS X and Windows both use specific (but different) canonical decomposition rules for the encoding used in filesystem paths.
For example, they need to decide whether "é" will be stored as one or two code points (either LATIN SMALL LETTER E WITH ACUTE, or LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT). These result in two different byte sequences of different lengths, and both Mac OS X and Windows work to avoid putting multiple files with the same name (as the user perceives them) in the same directory.
The rules for how to perform this canonical decomposition can get pretty hairy, so rather than try to implement it yourself it's best to leave it to the functions the system frameworks have provided for you to do the heavy lifting.
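To make the "é" example concrete, here is a small sketch that encodes both spellings as UTF-8 so the byte sequences can be compared. EncodeUtf8 is a hypothetical helper. It performs no normalization at all (that is the part best left to the system frameworks) and does not validate its input.

```cpp
#include <string>

// Hypothetical helper: encode Unicode code points as UTF-8 bytes.
// No normalization and no validation; it encodes exactly what it is given.
std::string EncodeUtf8(const std::u32string& codePoints)
{
    std::string out;
    for (const char32_t cp : codePoints)
    {
        if (cp < 0x80)
        {
            out.push_back(static_cast<char>(cp));
        }
        else if (cp < 0x800)
        {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else if (cp < 0x10000)
        {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else
        {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

The precomposed form U+00E9 encodes to the two bytes C3 A9, while the decomposed form U+0065 U+0301 encodes to the three bytes 65 CC 81. A byte-wise comparison treats the two as different names even though they look identical to the user.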

@JKP:
Not all functions in MacOS X accept UTF8, but filenames and filepaths may be UTF8, thus all POSIX functions dealing with file access (open, fopen, stat, etc.) accept UTF8.
See here. Quote:
How a file name looks at the API level
depends on the API. Current Carbon
APIs handle file names as an array of
UTF-16 characters; POSIX ones handle
them as an array of UTF-8, which is
why UTF-8 works well in Terminal. How
it's stored on disk depends on the
disk format; HFS+ uses UTF-16, but
that's not important in most cases.
Some other POSIX functions handle UTF8 as well. E.g. functions dealing with user names, group names or user passwords use UTF8 to store the information (thus a user name can be Japanese and your password can be Chinese, no problem).
But not all handle UTF8. E.g. for all string functions a UTF8 string is just a normal C string, and characters above 127 have no special meaning. They don't understand the concept of multiple bytes (chars in C) forming a single Unicode character. How other APIs handle a char * pointer being passed to them differs from API to API. However, as a rule of thumb you can say:
Either the function only accepts C strings with pure ASCII characters (only in the range 0 to 127) or it will accept UTF8. It is unusual for a function to allow characters above 127 and interpret them in an encoding other than UTF8. If that really is the case, it is documented, and there must then be a way to pass the encoding along with the string.
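The point about string functions seeing only bytes can be sketched like this. CountUtf8CodePoints is a hypothetical helper that assumes well-formed UTF8 input. It counts code points, which is still not the same as user-perceived characters when combining marks are involved.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a well-formed UTF-8 string by
// counting every byte that is not a continuation byte (bit pattern 10xxxxxx).
std::size_t CountUtf8CodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (const char ch : utf8)
    {
        if ((static_cast<unsigned char>(ch) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the UTF8 bytes of "été" (C3 A9 74 C3 A9), strlen and std::string::size() report 5, while this function reports 3.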

If you're using Cocoa it's fairly easy with NSString. Just load the UTF16 data in using -initWithBytes:length:encoding: (or perhaps -initWithCString:encoding:) and then get a UTF8 version by calling UTF8String on the result. Then, just call fopen with your new UTF8 string as the param.
You can definitely call fopen with a UTF-8 string, regardless of language - can't help with C++ on OSX though - sorry.

I read a file name from a UTF8 configuration file through wifstream (which uses a wchar_t buffer).
The Mac implementation is different from Linux and Windows.
wifstream reads each byte of the file into a separate wchar_t cell in the buffer, so each cell has 3 empty bytes, whereas open requires a char string. The programmer can therefore use the wcstombs function to convert the wide-character string to a multi-byte string.
The API supports UTF8. For a better understanding, inspect your file with a memory watcher and a hex editor.
