Visualisation of UTF-8 (Polish) not working properly - C++

My software supports multiple languages (English, German, Polish, Russian, ...). For this reason I have language-specific files with the dialog texts in the respective language (encoded as UTF-8).
In my MFC application I open and read those files and insert the text into my AfxMessageBoxes and other UI windows.
// Get the codepage number. 65001 = UTF-8
// In the real code this is a parameter of the function I call (just for clarification)
LANGID languageID = 65001;
TCHAR szCodepage[10];
GetLocaleInfo (MAKELCID (languageID, SORT_DEFAULT), LOCALE_IDEFAULTANSICODEPAGE, szCodepage, 10);
int nAnsiCodePage = _ttoi (szCodepage);

// Open the file
CFile file;
CString filename = getName();
if (!file.Open(filename, CFile::modeRead, NULL))
{
    // Check if everything is fine, else break
}

// Read the file
CString inString;
int len = file.GetLength ();
UINT n = file.Read (inString.GetBuffer(len), len);
inString.ReleaseBuffer ();

int size = MultiByteToWideChar (CP_ACP, 0, strAllItems, -1, NULL, 0);
WCHAR *ubuf = new WCHAR[size + 1];
MultiByteToWideChar ((UINT) nAnsiCodePage, (nAnsiCodePage == CP_UTF8 ? 0 : MB_PRECOMPOSED), inString, -1, ubuf, (int) size);
outString = ubuf;

file.Close ();
Result:
This mechanism works fine for the special letters of Russian and German, but not for Polish. I already checked the UTF-8 chart (http://www.utf8-chartable.de/unicode-utf8-table.pl?number=1024) and the Polish characters are part of it.
I also checked the hex values of my CString and everything seems to be alright, but it is not visualized in the correct way. Just for testing I changed the used codepage from UTF-8 to 1250 (Eastern Europe, Polish included) and it also did not work.
What am I doing wrong?
EDIT:
When I use:
MultiByteToWideChar (CP_UTF8 , 0, inString, -1, ubuf, (int) size);
The hex values are shortened to the "best match" letters, meaning my result is: mezczyzna
I am using Windows 7 with the English language selected.

Well, you have two options:
A. Make your application Unicode. You don't tell us whether it actually is, but I conclude it's not. This is the "best" solution technically, but it may require a lot of effort, and it may even not be feasible at all (e.g. use of non-Unicode libraries).
B. If your app is non-Unicode, you have some limitations:
- Your application will only be capable of correctly displaying one codepage using the non-Unicode APIs & messages, and this unfortunately cannot be set per application; it's set globally in Windows with the "Language for non-Unicode programs" option and requires a reboot.
- To correctly display strings containing characters not in the default codepage, you need to convert them to Unicode and explicitly use the "wide" versions of APIs & messages to display them (e.g. MessageBoxW()). A little cumbersome, but doable, if the operation concerns only a small number of controls.
The machine you're working on has some Western European language as the "Language for non-Unicode programs", and I come to this conclusion because "This mechanism works fine for the special letters of Russian and German" and "Using MessageBoxA(0, "mężczyzna", 0, 0) does not work", as you said (though I'm not sure at all about Russian, as it's a different codepage).
Apart from this, as IInspectable said, int size = MultiByteToWideChar (CP_ACP, 0, strAllItems, -1, NULL, 0); makes no sense at all, as the string is known to be UTF-8 and not in the default codepage. You may also need to remove the UTF-8 BOM header, if your file contains it.
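For illustration, here is a minimal sketch of option B's conversion step, assuming MFC and a UTF-8 file as in the question (the helper name ReadUtf8AndShow is hypothetical, not the OP's code): read the raw bytes, skip an optional BOM, convert with CP_UTF8 rather than CP_ACP, and display through the explicit wide API.
#include <afxwin.h>   // CFile, CString/CStringA/CStringW

// Hypothetical helper: read a UTF-8 text file and show it via the wide API.
void ReadUtf8AndShow(const CString& fileName)
{
    CFile file;
    if (!file.Open(fileName, CFile::modeRead, NULL))
        return;

    const int len = (int) file.GetLength();
    CStringA utf8;                                    // narrow buffer for the raw bytes
    const int n = (int) file.Read(utf8.GetBuffer(len), len);
    utf8.ReleaseBuffer(n);
    file.Close();

    // Skip the UTF-8 BOM (EF BB BF) if the file starts with one.
    const char* p = utf8.GetString();
    if (n >= 3 && (unsigned char)p[0] == 0xEF && (unsigned char)p[1] == 0xBB && (unsigned char)p[2] == 0xBF)
        p += 3;

    // Ask for the required size first, then convert with CP_UTF8 (not CP_ACP).
    const int size = MultiByteToWideChar(CP_UTF8, 0, p, -1, NULL, 0);
    CStringW wide;
    MultiByteToWideChar(CP_UTF8, 0, p, -1, wide.GetBuffer(size), size);
    wide.ReleaseBuffer();

    MessageBoxW(NULL, wide, L"Dialog text", MB_OK);   // explicit wide API works even in a non-Unicode build
}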

Related

Convert Japanese wstring to std::string

Can anyone suggest a good method to convert a Japanese std::wstring to std::string?
I used the below code. Japanese strings are not converting properly on an English OS.
std::string WstringTostring(std::wstring str)
{
    size_t size = 0;
    _locale_t lc = _create_locale(LC_ALL, "ja.JP.utf8");
    errno_t err = _wcstombs_s_l(&size, NULL, 0, &str[0], _TRUNCATE, lc);
    std::string ret = std::string(size, 0);
    err = _wcstombs_s_l(&size, &ret[0], size, &str[0], _TRUNCATE, lc);
    _free_locale(lc);
    ret.resize(size - 1);
    return ret;
}
The wstring is "C:\\files\\ブ種別.pdf".
The converted string is "C:\\files\\ãƒ–ç¨®åˆ¥.pdf".
It actually looks right to me.
That is the UTF-8-encoded version of your input (which presumably was UTF-16 before conversion), but shown in its ASCII-decoded form due to a mistake somewhere in your toolchain.
You just need to calibrate your file/terminal/display to render text output as if it were UTF-8 (which it is).
Also, remember that std::string is just a container of bytes, and does not inherently specify or imply any particular encoding. So your question is rather "how can I convert UTF-16 (containing Japanese characters) into UTF-8 in Windows" or, as it turns out, "how do I configure my terminal to display UTF-8?".
If your display for this string is the Visual Studio locals window (which you suggest is the case with your comment "I observed the value of the "ret" string in local window while debugging") you are out of luck, because VS has no idea what encoding your string is in (nor does it attempt to find out).
For other aspects of Visual Studio, though, such as the console output window, there are various approaches to work around this (example).
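If the goal really is just UTF-16 to UTF-8 on Windows, a locale-free alternative to the _wcstombs_s_l approach (a sketch, not the questioner's code) is to call WideCharToMultiByte with CP_UTF8 directly:
#include <windows.h>
#include <string>

// Sketch: convert a UTF-16 std::wstring to a UTF-8 std::string.
std::string WideToUtf8(const std::wstring& w)
{
    if (w.empty()) return std::string();
    // First call computes the required buffer size in bytes.
    const int bytes = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), (int)w.size(), NULL, 0, NULL, NULL);
    std::string out(bytes, '\0');
    WideCharToMultiByte(CP_UTF8, 0, w.c_str(), (int)w.size(), &out[0], bytes, NULL, NULL);
    return out;   // raw UTF-8 bytes; how they render depends on the viewer, as noted above
}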
EDIT: some things first. Windows has the notion of the ANSI codepage. It's the default codepage of non-Unicode strings that Windows assumes. Every program that uses non-Unicode versions of Windows API, and doesn't specify the codepage explicitly, uses the ANSI codepage.
The ANSI codepage is driven by the "System default locale" setting in Control Panel. As of Windows 10 May 2020, it's under Region/Administrative/Change system locale. It takes admin rights to change that setting.
By default, Windows with the system default locale set to English uses codepage 1252 as the ANSI codepage. That codepage doesn't contain the Japanese characters. So using Japanese in Unicode unaware programs in that situation is hard or impossible.
It looks like the OP wants or has to use a piece of third-party C++ code that uses multibyte strings (std::string and/or char*). That doesn't necessarily mean that it's Unicode unaware, but it might be. What the OP is trying to do depends entirely on the way that third-party library is coded. It might not be possible at all.
It looks like your problem is that some piece of third-party code expects a file name in ANSI and uses ANSI functions to open that file. In an English system with the default value of the system locale, Japanese can't be converted to ANSI, because the ANSI codepage (CP1252 in practice) doesn't contain the Japanese characters.
What I think you should do is get a short file name instead, using GetShortPathNameW, convert that path to ANSI, and pass that string. Like this:
std::string WstringFilenameTostring(std::wstring str)
{
    // Get the short (8.3) form of the path, which is plain ASCII.
    wchar_t ShortPath[MAX_PATH + 1];
    DWORD dw = GetShortPathNameW(str.c_str(), ShortPath, _countof(ShortPath));
    if (dw == 0 || dw > _countof(ShortPath))
        return std::string();   // call failed or the buffer was too small

    // The short form converts to ANSI without loss.
    char AnsiPath[MAX_PATH + 1];
    int n = WideCharToMultiByte(CP_ACP, 0, ShortPath, -1, AnsiPath, _countof(AnsiPath), 0, 0);
    return n ? std::string(AnsiPath) : std::string();
}
This code is for filenames only. For any other Japanese string, it will return nonsense. In my test, it converted "日本語.txt" to something unreadable but alphanumeric :)
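As a usage sketch (assuming the third-party code ultimately calls a char*-based API such as fopen, which is not stated in the question), the converted short path can be handed over where a narrow file name is expected. Note that short (8.3) names exist only for files that already exist, and only if 8.3 name generation is enabled on the volume.
#include <cstdio>
#include <string>

// Hypothetical caller: pass the ANSI short path to narrow-string code.
void OpenWithNarrowApi(const std::wstring& widePath)
{
    const std::string ansiPath = WstringFilenameTostring(widePath);
    if (FILE* f = fopen(ansiPath.c_str(), "rb"))   // the narrow API can now find the file
        fclose(f);
}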

Can't display 'ä' in GLFW window title

glfwSetWindowTitle(win, "Nämen");
Becomes "N?men", where '?' is in a little black, twisted square, indicating that the character could not be displayed.
How do I display 'ä'?
If you want to use non-ASCII letters in the window title, then the string has to be utf-8 encoded.
GLFW: Window title:
The window title is a regular C string using the UTF-8 encoding. This means for example that, as long as your source file is encoded as UTF-8, you can use any Unicode characters.
If you see a little black, twisted square, this indicates that the ä is encoded with some ISO encoding that is not UTF-8, maybe something like Latin-1. To fix this, open the file in an editor in which you can change the encoding, change it to UTF-8 (without BOM), and fix the ä in the title.
It seems like the GLFW implementation does not work according to the specification in this case. Probably the function still uses Latin-1 instead of UTF-8.
I had the same problem on GLFW 3.3 Windows 64 bit precompiled binaries and fixed it like this:
SetWindowTextA(glfwGetWin32Window(win),"Nämen")
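For completeness, the native-access function needs the GLFW native header; a sketch of that workaround (assuming a Windows build of GLFW 3.x) might look like this:
#define GLFW_EXPOSE_NATIVE_WIN32   // ask glfw3native.h to expose the Win32 accessors
#include <GLFW/glfw3.h>
#include <GLFW/glfw3native.h>
#include <windows.h>

// Sketch: bypass glfwSetWindowTitle and set the title through the ANSI Win32 API.
void SetTitleAnsi(GLFWwindow* win, const char* title)
{
    SetWindowTextA(glfwGetWin32Window(win), title);   // e.g. "Nämen" in the source encoding
}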
The issue does not lie within GLFW but within the compiler. Encodings are handled by the major compilers as follows:
Good guy clang assumes that every file is encoded in UTF-8
Trusty gcc checks the system's settings1 and falls back on UTF-8 when it fails to determine one.
MSVC checks for a BOM and uses the detected encoding; otherwise it assumes that the file is encoded using the user's current code page2.
You can determine your current code page by simply running chcp in your (Windows) console or PowerShell. For me, on a fresh install of Windows 11, it yields "850" (language: English; keyboard: German), which stands for Code Page 850.
To fix this issue you have several solutions:
Change your systems code page. Arguably the worst solution.
Prefix your strings with u8, escape all Unicode literals, and convert the string to wide char before passing it to Win32 functions; e.g.:
const char* title = u8"\u0421\u043b\u0430\u0432\u0430\u0020\u0423\u043a\u0440\u0430\u0457\u043d\u0456\u0021";
// This conversion is actually performed by GLFW; see footnote ^3
const int l = MultiByteToWideChar(CP_UTF8, 0, title, -1, NULL, 0);
// _malloca returns void*, so cast it for C++
wchar_t* buf = static_cast<wchar_t*>(_malloca(l * sizeof(wchar_t)));
MultiByteToWideChar(CP_UTF8, 0, title, -1, buf, l);
SetWindowTextW(hWnd, buf);
_freea(buf);
Save your source files with UTF-8 encoding WITH BOM. This allows you to write your strings without having to escape them. You'll still need to convert the string to a wide char string using the method seen above.
Specify the /utf-8 flag when compiling; this has the same effect as the previous solution, but you don't need the BOM anymore.
The solutions stated above still require you to convert your good and nice string to a big chunky wide string.
Another option would be to provide a manifest4 with the activeCodePage set to UTF-8. This way all Win32 functions with the A-suffix (e.g. SetWindowTextA) accept and properly handle UTF-8 strings, provided the running system is Windows version 1903 or newer.
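A quick way to verify at run time that such a manifest actually took effect (a sketch, not part of the original answer) is to check the process ANSI code page:
#include <windows.h>
#include <cstdio>

// Sketch: with the UTF-8 manifest active, GetACP() reports 65001 (CP_UTF8)
// and the A-suffixed APIs treat narrow strings as UTF-8.
void CheckUtf8CodePage(HWND hWnd)
{
    if (GetACP() == CP_UTF8)
        SetWindowTextA(hWnd, "N\xC3\xA4men");   // the raw UTF-8 bytes of "Nämen"
    else
        printf("ANSI code page is %u, not UTF-8\n", GetACP());
}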
TL;DR
Compile your application with the /utf-8 flag active.
IMPORTANT: This works for the Win32 APIs. This doesn't let you magically write Unicode emojis to the console like a hipster JS developer.
1 I suppose it reads the LC_ALL setting on Linux. In the last 6 years I have never seen a distribution that does NOT specify UTF-8. However, take this information with a grain of salt; I might be entirely wrong on how gcc handles this now.
2 If no byte-order mark is found, it assumes that the source file is encoded in the current user code page [...].
3 GLFW performs the conversion as seen here.
4 More about Win32 ANSI-APIs, Manifest and UTF-8 can be found here.

Capture spawned process stdout as unicode

In my C++/WinAPI code, I want to run some commands and capture their output. To test non-ASCII output, I renamed my network connection to Ethérnét אבג БбГгДд and ran ipconfig. When running in a command prompt, the output comes out correctly (visible when using a supporting font like Courier New):
C:\>ipconfig
Windows IP Configuration
Ethernet adapter Ethérnét אבג БбГгДд:
(...)
I tried to redirect the output to a pipe, following the example in this answer. But the byte array returned from ReadFile() is not Unicode - it's encoded in CP_OEMCP (CP437 in my case), and so the Hebrew and Russian characters come out as '?'s. Since the characters are already lost, no further handling can restore them.
Obviously it's possible, since cmd in a console window does it. How can I do it?
It would seem that ipconfig produces Unicode output when it detects that the output device is the console, and ANSI output otherwise. This is likely to be a backwards-compatibility measure.
Most other built-in command-line tools are likely to either be ANSI-only or to behave in the same way as ipconfig, for the same reason. In Windows, command-line tools are meant, well, for use on the command line; programmers are discouraged from shelling out to them and parsing the output. Instead, you should use the corresponding APIs.
If you know which language you are expecting, you might be able to choose a code page that will preserve the content.
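To illustrate "use the corresponding APIs" (a sketch, not part of the original answer): adapter names like the one above can be read directly, already as UTF-16, through GetAdaptersAddresses instead of parsing ipconfig output:
#include <winsock2.h>
#include <iphlpapi.h>
#include <vector>
#include <cstdio>
#pragma comment(lib, "iphlpapi.lib")

// Sketch: enumerate adapter friendly names as UTF-16, with no console code page involved.
void PrintAdapterNames()
{
    ULONG size = 0;
    GetAdaptersAddresses(AF_UNSPEC, 0, NULL, NULL, &size);            // first call: query required size
    std::vector<unsigned char> buf(size);
    PIP_ADAPTER_ADDRESSES list = (PIP_ADAPTER_ADDRESSES) buf.data();
    if (GetAdaptersAddresses(AF_UNSPEC, 0, NULL, list, &size) == ERROR_SUCCESS)
    {
        for (PIP_ADAPTER_ADDRESSES a = list; a != NULL; a = a->Next)
            wprintf(L"%s\n", a->FriendlyName);                        // e.g. Ethérnét אבג БбГгДд
    }
}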
Added by #Jonathan: Undocumented: Turns out you can control the encoding of built-in commands using the environment variable OutputEncoding. I tested with ipconfig, but presumably it works with other built-in tools as well:
> for %e in ("" Unicode Ansi UTF8) do (set OutputEncoding=%~e& ipconfig >ipconfig-%~e.txt)
> (set OutputEncoding= & ipconfig 1>ipconfig-.txt )
> (set OutputEncoding=Unicode & ipconfig 1>ipconfig-Unicode.txt )
> (set OutputEncoding=Ansi & ipconfig 1>ipconfig-Ansi.txt )
> (set OutputEncoding=UTF8 & ipconfig 1>ipconfig-UTF8.txt )
And indeed, the ipconfig-*.txt files are encoded as expected! Note that this is undocumented, but it does work for me.
Addendum: as of Windows 10 v1809, another alternative is to create a pseudoconsole.
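If the goal is still to capture ipconfig through a pipe, one possible approach (a sketch building on the undocumented OutputEncoding trick above, so not guaranteed on every Windows build) is to set the variable in the parent before spawning the child, since the child inherits the environment:
#include <windows.h>

// Sketch: ask ipconfig for UTF-8 output before launching it with redirected stdout.
// The pipe/handle-inheritance plumbing is as in the answer linked from the question.
BOOL LaunchIpconfigUtf8(HANDLE hStdOutWrite, PROCESS_INFORMATION* pi)
{
    SetEnvironmentVariableW(L"OutputEncoding", L"UTF8");   // inherited by the child process

    STARTUPINFOW si = { sizeof(si) };
    si.dwFlags = STARTF_USESTDHANDLES;
    si.hStdOutput = hStdOutWrite;
    si.hStdError = hStdOutWrite;

    wchar_t cmd[] = L"ipconfig";                            // CreateProcessW may modify this buffer
    return CreateProcessW(NULL, cmd, NULL, NULL, TRUE, 0, NULL, NULL, &si, pi);
}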
A console application can produce output in different ways:
For a console handle it can use WriteConsoleW and write text that is already UNICODE.
If it wants to use WriteConsoleA or WriteFile on a console handle, it first needs to convert the UNICODE text to multi-byte with WideCharToMultiByte, using CodePage := GetConsoleOutputCP().
If the text is not UNICODE to begin with (say UTF-8 or ANSI), it first needs to be converted to UNICODE with MultiByteToWideChar (with CP_UTF8 or CP_ACP) and then converted back to multi-byte with WideCharToMultiByte(GetConsoleOutputCP(), ..).
Usually (by default) GetConsoleOutputCP() returns the same value as GetOEMCP(), so it has the same effect in MultiByteToWideChar and WideCharToMultiByte as CP_OEMCP (that constant is translated to GetOEMCP()).
When the output handle is redirected to a file, the application can only use WriteFile. However, it can write the data to the file in any format: UNICODE, ANSI (CP_ACP), UTF-8 (CP_UTF8), etc. Which format is used depends entirely on the concrete application; you cannot fully control this. Usually you will receive multi-byte output in the CP_OEMCP encoding, and then you have to decide how to process it. Most often you will want to convert it to UNICODE first and work with the UNICODE form; if you need ANSI, you need one more conversion.
For example, if you pass pipe output in the CP_OEMCP encoding straight to OutputDebugStringA, you get wrong (unreadable) output for non-English text, but after the two conversions CP_OEMCP -> UNICODE -> CP_ACP the text displays correctly with OutputDebugStringA. And because OutputDebugStringW exists, converting to UNICODE alone is enough.
Some applications also have special options that control the format of their file output. For example, ipconfig.exe looks for the "OutputEncoding" environment variable and, depending on its string value ("Unicode", "Ansi", "UTF-8"), produces different output; by default (if this environment variable does not exist or has an unknown value) CP_OEMCP is used.
Example of a pipe read procedure, assuming the input data is in the CP_OEMCP encoding:
void OnRead(PVOID buf, ULONG cbTransferred)
{
    if (cbTransferred)
    {
        if (int len = MultiByteToWideChar(CP_OEMCP, 0, (PSTR)buf, cbTransferred, 0, 0))
        {
            PWSTR pwz = (PWSTR)alloca((1 + len) * sizeof(WCHAR));
            if (len = MultiByteToWideChar(CP_OEMCP, 0, (PSTR)buf, cbTransferred, pwz, len))
            {
                if (g_bUseAnsi)
                {
                    if (cbTransferred = WideCharToMultiByte(CP_ACP, 0, pwz, len, 0, 0, 0, 0))
                    {
                        PSTR psz = (PSTR)alloca(cbTransferred + 1);
                        if (cbTransferred = WideCharToMultiByte(CP_ACP, 0, pwz, len, psz, cbTransferred, 0, 0))
                        {
                            DoPrint(psz, cbTransferred, OutputDebugStringA);
                        }
                    }
                }
                else
                {
                    DoPrint(pwz, len, OutputDebugStringW);
                }
            }
        }
    }
}
// The debugger can truncate the output of too big a buffer, so split it into small chunks
template<typename T> void DoPrint(T* p, ULONG len, void (WINAPI* fnOutput)(const T*))
{
    ULONG cb;
    T* q = p;
    do
    {
        cb = min(len, 256);
        q = p + cb;
        T c = *q;
        *q = 0;
        fnOutput(p);
        *q = c;
        p = q;
    } while (len -= cb);
}
About your concrete case: ipconfig.exe uses WriteConsoleW for console output, so it does not depend on the current system locale and can display multilanguage text correctly. But other tools, like route.exe, use WriteFile for output (both to console and to file) and first convert the UNICODE text to multi-byte with WideCharToMultiByte(CP_OEMCP, ..). As a result there will be problems if you try to display characters that do not exist in the CP_OEMCP code page (the current system locale). With CP437, the Hebrew and Russian characters are lost if UNICODE -> CP_OEMCP is used; the only way to keep them is direct UNICODE output to the console and file. Whether that is possible depends on the concrete application: for route.exe it is not possible; for ipconfig.exe it is, because it always writes to the console in UNICODE format and can also write to a file in UNICODE or UTF-8 if you set "OutputEncoding" to "Unicode" or "UTF-8".

c++ Lithuanian language, how to get more than ascii

I am trying to use Lithuanian in my C++ application, but every attempt is unsuccessful.
The multi-byte character set is used. I have tried everything I could think of; I am new to C++ and have never tried to do anything in Lithuanian before.
I have tried every setlocale variant: setlocale(LC_ALL, "en_US.utf8");, setlocale(LC_ALL, "Lithuanian");, ...
I researched for 2 hours and didn't find proper examples or a solution.
I have an average-sized project which needs a Lithuanian translation from a database, and it can't handle most of "ĄČĘĖĮŠŲŪąčęėįšųū".
Compiler - "Visual Studio 2013"
Database - sqlite3.
I can't even get simple strings (defined myself) to work and output as Lithuanian in a Win32 application.
In Windows use wide character strings (UTF-16 encoding1, wchar_t type) for internal text handling, and preferably UTF-8 for external text files and networking.
Note that Visual C++ will translate narrow text literals from the source encoding to Windows ANSI, which is a platform-dependent usually single-byte encoding (you can check which one via the GetACP API function), i.e., Visual C++ has the platform-specific Windows ANSI as its narrow C++ execution character set.
But also do note that for an app restricted to non-Windows platforms, i.e. Unix-land, it makes practical sense to do everything in UTF-8, based on char type.
For the database communication you may need to translate to and from the program's internal text representation.
This depends on what the database interface requires, which is not stated.
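Since the question mentions sqlite3, which stores and returns text as UTF-8 by default (the ...16 functions provide UTF-16 variants), a plausible sketch of that translation step, assuming the UTF-8 interface is used, would be:
#include <windows.h>
#include <string>
#include <sqlite3.h>

// Sketch: convert UTF-8 text coming from sqlite3_column_text into the
// program's internal UTF-16 representation.
std::wstring Utf8ToWide(const unsigned char* utf8)
{
    const char* s = reinterpret_cast<const char*>(utf8);
    const int len = MultiByteToWideChar(CP_UTF8, 0, s, -1, NULL, 0);   // count includes the terminator
    if (len <= 0) return std::wstring();
    std::wstring w(len, L'\0');                                        // leave room for the terminator
    MultiByteToWideChar(CP_UTF8, 0, s, -1, &w[0], len);
    w.resize(len - 1);                                                 // drop the embedded terminator
    return w;
}

// Hypothetical usage with a prepared statement 'stmt':
//   std::wstring text = Utf8ToWide(sqlite3_column_text(stmt, 0));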
Example for console output in Windows:
#include <iostream>
#include <fcntl.h>
#include <io.h>

auto main() -> int
{
    _setmode( _fileno( stdout ), _O_WTEXT );
    using namespace std;
    wcout << L"ĄČĘĖĮŠŲŪąčęėįšųū" << endl;
}
To make this compile by default with g++, the source code encoding needs to be UTF-8. Then, to make it produce correct results with Visual C++ the source code encoding needs to be UTF-8 with BOM, which happily is also accepted by modern versions of g++. For otherwise the Visual C++ compiler will assume the Windows ANSI encoding and produce an incorrect UTF-16 string.
Not coincidentally this is the default meaning of UTF-8 in Windows, e.g. in the Notepad editor, namely UTF-8 with BOM.
But note that while in Windows the problem is that the main system compiler requires a BOM for UTF-8, in Unix-land the problem is the opposite, that many old tools can't handle the BOM (for example, even MinGW g++ 4.9.1 isn't yet entirely up to speed: it sometimes includes the BOM bytes, then incorrectly interpreted, in error messages).
1) On other platforms wide character text can be encoded in other ways, e.g. with UTF-32. In fact the Windows convention is in direct conflict with the C and C++ standards which require that a single wchar_t should be able to encode any character in the extended character set. However, this requirement was, AFAIK, imposed after Windows adopted UTF-16, so the fault probably lies with the politics of the C and C++ standardization process, not yet another Microsoft'ism.
Complexity of internationalisation
There are several related but distinct topics that can cause mismatches between them, making a trial-and-error approach very tedious:
- type used for storing strings and chars: Windows uses wchar_t by default, but for most APIs you also have char-based equivalent functions
- character set encoding: this defines how the chars stored in the type are to be understood, for example Unicode (UTF-8, UTF-16, UTF-32), 7-bit ASCII, or an 8-bit ANSI code page. In Windows, by default it is UTF-16 for wchar_t and the ANSI/Windows code page for char
- locale: defines, among other things, the character set assumptions used when processing strings. This permits the use of language-independent functions like isalpha(i, loc), islower(i, loc), ispunct(i, loc) to find out whether a given character is alphabetic, lower case, or punctuation, for example to break user text down into words. C++ offers portable functions here (see the sketch after this list)
- output code page or font: used to show a character to the user. This assumes that the font used shows the characters using the same character set used in the code internals
- source code encoding: for example, your editor could assume an ANSI encoding with the Windows-1252 character set
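A small sketch of those locale-aware classification functions (the locale name below is an assumption; MSVC typically accepts names like "Lithuanian_Lithuania.1257"):
#include <locale>
#include <iostream>

// Sketch: classify characters using an explicit locale instead of the default "C" locale.
int main()
{
    std::locale loc("Lithuanian_Lithuania.1257");    // assumed MSVC locale name; throws if unknown
    const wchar_t ch = L'Č';
    std::wcout << std::boolalpha
               << std::isalpha(ch, loc) << L' '      // true: alphabetic in this locale
               << std::islower(ch, loc) << L' '      // false: it is upper case
               << std::ispunct(ch, loc) << L'\n';    // false: not punctuation
}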
Most typical errors
Problem no. 1 is Win32 console output, as Unicode is not well supported by the console. But this is not your problem here.
Another cause of mismatch is the encoding of your text editor. It might not be Unicode, but use a Windows code page. In this case, you type "Č", the editor displays it as such, but the editor might use the Windows-1257 encoding for Lithuanian and store 0xC8 in the file. If you then display this literal with a Windows Unicode function, it will interpret 0xC8 as "Latin E grave accent" and print something else, as the right Unicode code point for "Č" is 0x010C!
It can be even worse: the compiler may have its own assumption about the character set encoding used and convert your literals into Unicode using false assumptions (it happened to me when I used some exotic code generation switch).
How to proceed?
To figure out what goes wrong, proceed by elimination:
First, for plain Windows, use the native Unicode setting. OK, it's UTF-16 and wchar_t instead of UTF-8 and thus comes with some drawbacks, but it's native and well supported.
Then use explicit Unicode encoding in literals, for example TEXT("\u010C") instead of TEXT("Č"). This avoids editor and compiler mismatch.
If it's still not the right character, make sure that your font FULLY supports Unicode. The default system font, for instance, doesn't, while most others do. You can easily check with the Windows font panel (Windows key + R, fonts, then click on "search char") to display the character table of your font.
Set fonts explicitly in your code.
For example, a very tiny experiment:
...
case WM_PAINT:
{
    hdc = BeginPaint(hWnd, &ps);
    auto hf = CreateFont(24, 0, 0, 0, 0, TRUE, 0, 0, 0, 0, 0, 0, 0, L"Times New Roman");
    auto hfOld = SelectObject(hdc, hf); // if you comment this out, € and Č won't display
    TextOut(hdc, 50, 50, L"Test with éç € \u010C special chars", 30);
    SelectObject(hdc, hfOld);
    DeleteObject(hf);
    EndPaint(hWnd, &ps);
    break;
}

Why Non-Unicode apps system locale makes Unicode fonts with symbol charset displayed incorrectly?

I'm trying to display Unicode chars from the Wingdings font (it's a Unicode TrueType font supporting the symbol charset only).
It's displayed correctly on my Win7/64 system using corresponding regional OS settings:
Formats: Russian
Location: Russia
System locale (AKA Language for Non-Unicode applications): English
But if I switch System locale to Russian, Unicode characters with codes > 127 are displayed incorrectly (replaced with boxes).
My application is created as using Unicode Charset in Visual Studio, it calls only Unicode Windows API functions.
Also I noted that several Windows apps also display such chars incorrectly with symbol fonts (Symbol, Wingdings, Webdings etc), e.g. Notepad, Beyond Compare 3. But WordPad and MS Office apps aren't affected.
Here is minimal code snippet (resources cleanup skipped for brevity):
LOGFONTW lf = { 0 };
lf.lfCharSet = SYMBOL_CHARSET;
lf.lfHeight = 50;
wcscpy_s(lf.lfFaceName, L"Wingdings");
HFONT f = CreateFontIndirectW(&lf);
SelectObject(hdc, f);
// First two chars displayed OK, 3rd and 4th aren't (replaced with boxes) if
// Non-Unicode apps language is NOT English.
TextOutW(hdc, 10, 10, L"\x7d\x7e\x81\xfc");
So the question is: why the hell does the Non-Unicode apps language setting affect Unicode apps?
And what is the correct (and most simple) way to display SYMBOL_CHARSET fonts without dependency to OS system locale?
The root cause of the problem is that the Wingdings font is actually a non-Unicode font. It supports Unicode partially, so some symbols are still displayed correctly. See #Adrian McCarthy's answer for details about how it probably works under the hood.
Also see more info here: http://www.fileformat.info/info/unicode/font/wingdings
and here: http://www.alanwood.net/demos/wingdings.html
So what can we do to avoid such problems? I found several ways:
1. Quick & dirty
Fall back to ANSI version of API, as #user1793036 suggested:
TextOutA(hdc, 10, 10, "\x7d\x7e\x81\xfc"); // Displayed correctly!
2. Quick & clean
Use special Unicode range F0 (Private Use Area) instead of ASCII character codes. It's supported by Wingdings:
TextOutW(hdc, 10, 10, L"\xf07d\xf07e\xf081\xf0fc"); // Displayed correctly!
To explore which Unicode symbols are actually supported by a font, some font viewer can be used, e.g. dp4 Font Viewer
3. Slow & clean, but generic
But what to do if you don't know in advance which characters you have to display and which font will actually be used? Here is the most universal solution: draw text by glyphs to avoid any undesired translations:
void TextOutByGlyphs(HDC hdc, int x, int y, const CStringW& text)
{
    CStringW glyphs;
    GCP_RESULTSW gcpRes = {0};
    gcpRes.lStructSize = sizeof(GCP_RESULTS);
    gcpRes.lpGlyphs = glyphs.GetBuffer(text.GetLength());
    gcpRes.nGlyphs = text.GetLength();
    const DWORD flags = GetFontLanguageInfo(hdc) & FLI_MASK;
    GetCharacterPlacementW(hdc, text.GetString(), text.GetLength(), 0, &gcpRes, flags);
    glyphs.ReleaseBuffer(gcpRes.nGlyphs);
    ExtTextOutW(hdc, x, y, ETO_GLYPH_INDEX, NULL, glyphs.GetString(), glyphs.GetLength(), NULL);
}
TextOutByGlyphs(hdc, 10, 10, L"\x7d\x7e\x81\xfc"); // Displayed correctly!
Note the usage of the GetCharacterPlacementW() function. For some unknown reason the similar function GetGlyphIndicesW() would not work, returning 'unsupported' dummy values for chars > 127.
Here's what I think is happening:
The Wingdings font doesn't have Unicode mappings (a cmap table?). (You can see this by using charmap.exe: the Character set drop down control is grayed out.)
For fonts without Unicode mappings, I think Windows assumes that it depends on the "Language for Non-Unicode applications" setting.
When that's English, Windows (probably) uses code page 1252, and all the values map to themselves.
When that's Russian, Windows (probably) uses code page 1251, and then tries to remap them.
The '\x81' value in code page 1251 maps to U+0403, which obviously doesn't exist in the font, so you get a box. Similarly, the '\xFC' maps to U+044C.
I assumed that if you used ExtTextOutW with the ETO_GLYPH_INDEX flag, Windows wouldn't try to interpret the values at all and just treat them as glyph indexes into the font. But that assumption is wrong.
However, there is another flag called ETO_IGNORELANGUAGE, which is reserved, but, empirically, it seems to solve the problem.
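A sketch of that empirical workaround, assuming an hdc as in the question's snippet (ETO_IGNORELANGUAGE is documented as reserved, so treat this as use-at-your-own-risk rather than a guaranteed contract):
// Sketch: same Wingdings characters as in the question, drawn with ETO_IGNORELANGUAGE
// so GDI does not remap the values through the "Language for non-Unicode programs" code page.
LOGFONTW lf = { 0 };
lf.lfCharSet = SYMBOL_CHARSET;
lf.lfHeight = 50;
wcscpy_s(lf.lfFaceName, L"Wingdings");
HFONT f = CreateFontIndirectW(&lf);
HGDIOBJ old = SelectObject(hdc, f);
ExtTextOutW(hdc, 10, 10, ETO_IGNORELANGUAGE, NULL, L"\x7d\x7e\x81\xfc", 4, NULL);
SelectObject(hdc, old);
DeleteObject(f);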