Character encoding not consistent (Win32 API) - C++

I am developing a Win32 desktop application in C++ where the target language is Swedish. As a result, I rely on Unicode to properly format and display any given string. However, I have encountered a very peculiar issue where strings containing non-English characters (such as å, ä, ö) work in one part of the code but break in another.
More specifically, the code works in one file but not the other. If I move the code block to the other file, the text is displayed properly again. See the following example:
CreateWindowW(L"Button", L"Lägg till kund", WS_TABSTOP | WS_VISIBLE | WS_CHILD,
0, 0, 80, 25, windowHandle, (HMENU)0, NULL, NULL);
-Which is called from the Application.cpp file.
AppendMenu(m_MainMenuBar, MF_POPUP, (UINT_PTR)m_ArkivMenu, L"&Arkiv");
AppendMenu(m_MainMenuBar, MF_POPUP, (UINT_PTR)m_VisaMenu, L"&Visa");
AppendMenu(m_MainMenuBar, MF_POPUP, (UINT_PTR)m_ArkivMenu, L"&Hjälp");
AppendMenu(m_MainMenuBar, MF_POPUP, (UINT_PTR)m_ArkivMenu, L"&Sök");
-Which is called from the Window.cpp file.
All of this results in the following:
Note that the code called from Window.cpp produces text with invalid characters, which indicates that the symbol(s) couldn't be found. This is very strange, since the problem ceases to exist if I move the code from Window.cpp to Application.cpp.
Thus, the only logical conclusion is that the character encoding must differ between these two files, but why?

The problem is indeed, as the title states, "Character encoding not consistent".
And to be precise, that's the character encoding of Window.cpp compared to the character encoding of Application.cpp. I suspect the environment is Visual Studio, which can handle files in multiple encodings, but for source code that contains Unicode you probably want UTF-8 with "signature" (BOM).
This is available in Visual Studio 2017+ via "Save As": in that save dialog, choose "Save with Encoding" instead of plain "Save".
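As a hedged aside, the mismatch can also be eliminated at compile time; a minimal sketch, assuming MSVC (/utf-8 is a documented compiler switch, and the file names are taken from the question):
// Build every translation unit with an explicit source/execution charset:
//   cl /utf-8 /EHsc Application.cpp Window.cpp ...
// With each .cpp saved as UTF-8, a wide literal like this compiles to
// the expected UTF-16 code units in either file:
CreateWindowW(L"Button", L"Lägg till kund", WS_TABSTOP | WS_VISIBLE | WS_CHILD,
              0, 0, 80, 25, windowHandle, (HMENU)0, NULL, NULL);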

Related

Win32 Resource Dialog Text - UTF-8 - Only displays first character of each string

I'm looking to move from ASCII to UTF-8 everywhere in my Windows Desktop (Win32/MFC) application. This is as opposed to doing the usual move to UTF-16. The idea being, fewer changes will need to be made, and interfacing with external systems that talk in UTF-8 will require less work.
The problem is that the static control and button in the dialog box from the resource file only ever displays the first character of its kanji text. Should resource files work just fine using UTF-8?
Dialog illustrating the problem
UTF-8 strings appear to be read and displayed correctly coming from the String Table in the resource file, but not text directly on dialogs themselves.
I am testing using kanji characters.
How the dialog appears in the resource editor
I have:
Used a font in the dialog box that should support the kanji characters I'm trying to use.
Ensured the .rc file is in UTF-8 using Notepad++'s conversion function.
Added the #pragma code_page(65001) at the top of the .rc file as per The Resource Compiler defaults to CP_ACP, even in the face of subtle hints that the file is UTF-8
Added the manifest for UTF-8 support in "Manifest Tool/Input and Output/Additional Manifest Files" as per Use UTF-8 code pages in Windows apps
Added /utf-8 to "C/C++/Command Line/Additional Options" as per Set source and execution character sets to UTF-8
Using UTF-8 everywhere means std::string, CStringA, and the -A Win32 functions implicitly, by setting the "Advanced/Character Set" value to "Not Set". Additionally, the resource file is in UTF-8, including dialogs with their text, String Tables, etc. If I set it to "Use Unicode Character Set", my understanding is that UTF-16 and the -W functions will be the default everywhere - the standard Windows way of supporting Unicode historically.
The pragma appears to work, as the Resource Editor in Visual Studio does not clobber the .rc file into UTF-16LE. Also, the manifest appears to work as the MessageBox() (MessageBoxA) function displays text from the String Table correctly. Without the manifest, the MessageBox() displays question marks.
TCHAR buffer[512];
// Load the test string from the String Table resource.
LoadString(hInst, IDS_TESTKANJI, buffer, 512 - 1);
// With the UTF-8 manifest, MessageBoxA receives and renders UTF-8 text.
MessageBox(hWnd, buffer, _T("Caption"), MB_OK);
Successful message box
If I set the Character Encoding to "Use Unicode Character Set", everything appears to work as expected - all characters are displayed.
Dialog successfully showing kanji
My suspicion is that the encoding goes UTF-8 (.rc file) -> UTF-16 (internal representation) -> ASCII (dialog text loading?), meets a null byte from the UTF-16 representation, and stops after reading the first character.
If I call SetDlgItemText() on my static control using text from the String Table, the static control will show all the characters correctly:
case WM_COMMAND:
    if (LOWORD(wParam) == IDOK)
    {
        TCHAR buffer[512];
        LoadString(hInst, IDS_TESTKANJI, buffer, 512 - 1);
        SetDlgItemText(hDlg, IDC_STATIC, buffer);
        ...
Windows OS Build: 19044.2130
Visual Studio 2022 17.4.2
Windows SDK Version: 10.0.22621.0
Platform Toolset: Visual Studio 2022 (v143)
It seems like the current answer to displaying UTF-8 text on dialogs is to manually - in code - set the text using a function like SetDlgItemText() with the UTF-8 string, and not rely on the resource loading of the dialog creation code itself. With the UTF-8 manifest, the -A functions are called, and they'll set the UTF-8 text just fine.
One can also call a -W function explicitly and convert UTF-8 -> UTF-16 before calling. See UTF-8 text in MFC application that uses Multibyte character set.
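A minimal sketch of both routes, assuming a dialog procedure and a hypothetical control ID IDC_STATIC_LABEL (the literal stands in for text that would normally come from the String Table):
case WM_INITDIALOG:
{
    const char* utf8 = "漢字テスト";  // UTF-8 bytes, given the /utf-8 switch
    // Route A: with the UTF-8 activeCodePage manifest, the -A call
    // already interprets the bytes as UTF-8.
    SetDlgItemTextA(hDlg, IDC_STATIC_LABEL, utf8);
    // Route B: convert UTF-8 -> UTF-16 explicitly and call the -W function.
    wchar_t wide[256];
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, 256);
    SetDlgItemTextW(hDlg, IDC_STATIC_LABEL, wide);
    return (INT_PTR)TRUE;
}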
See also Microsoft CreateDialogIndirectA macro (winuser.h) which is unusually explicit in relation to this: "All character strings in the dialog box template, such as titles for the dialog box and buttons, must be Unicode strings."

Text on MFC Controls - Unicode Characters such as Japanese get cut off

Background
I'm working on a C++/MFC application, and we've been converting it to display Unicode characters to support foreign languages. For the most part this has been successful, and Unicode characters are displayed correctly. But I've encountered an issue where certain text on certain controls gets cut off.
Example
Here you can see a button that should display "ログアウト/終了" but gets cut off and displays an unknown character in its place.
But if I pad the string with spaces, it displays fine. The number of spaces needed varies by string: this string needed 4 spaces to display correctly, whereas another string with one fewer character needed 5 spaces; there doesn't seem to be any correlation or pattern in the number of spaces needed. And I don't want to pad strings randomly throughout the code, especially when other languages don't need this at all.
What I've tried (doesn't work)
Shrinking the font size
Resizing the control
Changing the font facename
Changing the font character set
Copying the control properties from another control in the application that does not have this issue
Adding extra null terminators
Padding with zero-width characters
Using SetWindowTextW
Changing source and execution character sets
Changing system locale
The only thing I've found that works is padding with an arbitrary amount of spaces which is certainly not an ideal solution.
Other info
I've only noticed this issue for Japanese characters, but have only tested English, German, and Japanese.
Japanese characters use 3 bytes of data in UTF-8, which I suspect has something to do with this, but I don't know what or why. English characters use 1 byte and certain German characters use 2 bytes.
A control (button/label/etc.) in one place may have the issue whereas a control in a different place containing the same text does not, even if both are buttons, etc.
When the text is cut off, it typically either displays a question-mark box (like the first image) or a random character/letter at the end. This character changes each time I run the application, but the question-mark box is the most common.
For my padding "fix", it doesn't matter if the spaces are at the beginning or end of the string, as long as the number of spaces is enough. It also doesn't need to be spaces, any non-zero-width character works.
Compiled using MBCS (Multibyte Character Set) with the Windows 10 UTF-8 Unicode support setting enabled. (As opposed to compiling with UNICODE defined, which isn't an option: large old codebase.)
EDIT: Here is an example on how the text is set
GetDlgItem(IDC_SOME_CTRL_ID)->SetWindowText(GetTranslation("Some String"));
Where GetTranslation() is our own function to look up the translation of "Some String" (basically a lookup table) and return a CString. Using a debugger I can see the returned CString always has the correct string value. I can replace GetTranslation with a hardcoded Japanese string and the issue will still happen.
EDIT 2: I got complaints that this code wasn't enough.
myapp.rc
// Microsoft Visual C++ generated resource script.
//
#include "resource.h"
#define APSTUDIO_READONLY_SYMBOLS
#include "afxres.h"
#undef APSTUDIO_READONLY_SYMBOLS
IDD_VIEW_MENU DIALOGEX 0, 0, 50, 232
STYLE DS_SETFONT | WS_CHILD
FONT 14, "Verdana", 0, 0, 0x1
BEGIN
CONTROL "btn0",IDC_BUTTON_MENU_0,"Button",BS_3STATE | BS_PUSHLIKE,12,38,25,13
END
#endif
resource.h
#define IDC_BUTTON_MENU_0 6040
ViewMenu.cpp
#include "stdafx.h"
#include "ViewMenu.h"
CViewMenu::CViewMenu() : CFormView(CViewMenu::IDD)
{
}
void CViewMenu::DoDataExchange(CDateExchange* pDX)
{
CFormView::DoDataExchange(pDX);
DDX_Control(pDX, IDC_BUTTON_MENU_0, m_ctrlMenuButton0);
}
void CViewMenu::OnInitialUpdate()
{
CFormView::OnInitialUpdate();
}
void CViewMenu::OnDraw(CDC* pDC)
{
CFormView::OnDraw(pDC);
GetDlgItem(IDC_BUTTON_MENU_0)->SetWindowText("ログアウト/終了");
return;
}
ViewMenu.h
#include "resource.h"
class CViewMenu : public CFormView
{
protected:
CViewMenu();
public:
enum { IDD = IDD_VIEW_MENU };
CButton m_ctrlMenuButton0;
}
The following should work in Windows 10 versions 1903 and later, regardless of the default system locale, and fulfills OP's requirements (string literals, MBCS build, no Unicode windows etc). It was verified to work in version 2004 set to En-US locale, without "Beta: Use Unicode UTF-8 for worldwide language support" checked, using VS 2019 16.7.5 to build.
Save source files containing characters outside the active codepage in UTF-8 encoding, with or without BOM.
Compile with _MBCS defined (in the IDE: Properties / Advanced / Character Set = MBCS).
Compile with the /utf-8 switch (C/C++ / Command Line / Additional Options = /utf-8).
Create a manifest file declaring UTF-8 as the target codepage for the process (per the activeCodePage documentation).
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1" xmlns:asmv3="urn:schemas-microsoft-com:asm.v3">
  <asmv3:application>
    <asmv3:windowsSettings xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">
      <activeCodePage>UTF-8</activeCodePage>
    </asmv3:windowsSettings>
  </asmv3:application>
</assembly>
Add the manifest file to the project (in the IDE: Manifest Tool / General / Input and Output / Additional Manifest Files = manifest file created at the previous step).
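With all four pieces in place, a narrow string literal is UTF-8 at compile time (/utf-8) and at run time (activeCodePage), so the -A window-text path displays it correctly. A minimal check against the question's own control:
// Should now render the kanji correctly in the MBCS build:
GetDlgItem(IDC_BUTTON_MENU_0)->SetWindowText("ログアウト/終了");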
This ain't Python. With C++ you need to know why your code works. Otherwise it doesn't.
GetDlgItem(IDC_BUTTON_MENU_0)->SetWindowText("ログアウト/終了");
That's where you and your compiler start to disagree. You think this should be UTF-8. Your compiler, on the other hand, trusts you, and assumes that you are using the source character set.
As long as you are unaware of a concept called the source character set, you will stay confused about something that should be the norm: garbage in, garbage out.
If you feel like fixing the "Garbage in" part (now, clearly, that is your job), read up on C++ string literals. In case you don't make it to the end, the quickest way to fix your ungodly workaround is to use a u8 prefix.
Seriously, though, the real solution is to use Windows' native character encoding, which, oddly, you seem to reject even though you could use it, given a string literal. I mean, it's not like you have to change anything global. Just call SetWindowTextW and use an L prefix.
Just saying, you know...
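A minimal sketch of that suggestion, assuming the question's MBCS MFC build (the raw Win32 -W function is called on the control's HWND, since the CWnd wrapper maps to the -A version under MBCS):
// Wide literal + explicit -W call: no run-time codepage conversion.
// The compiler must still see the literal correctly, e.g. source saved
// as UTF-8 with BOM or built with the /utf-8 switch.
::SetWindowTextW(GetDlgItem(IDC_BUTTON_MENU_0)->GetSafeHwnd(),
                 L"ログアウト/終了");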

C++ Win32api output Unicode from user's input

I realize there may be similar questions, but I have been really stuck for a while now. I'm operating on Windows 10 using C++ and the Win32 API.
I need to read a user textbox for a hexadecimal code, something like Ѱ or U+0470, although the 'U+' is not going to be present. After receiving this value, I need to take the code and print the appropriate symbol.
Using the code below I can print any Unicode symbol:
createWindow(L"Static", L"\u2230", WS_VISIBLE | WS_CHILD, 10,10, 190,40, hWnd, NULL,NULL,NULL);
I can also read the textbox using GetWindowText(). However, I cannot figure out how to take the user's input 0470 and print it as a Unicode character in the textbox.
You'll need to convert the string L"0470" into a number. It's a base-16 number, so you'd use std::wcstoul from <cwchar>. You can then cast this number to a WCHAR. I assume you know how to turn a single character into a string.
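A minimal sketch of that conversion, assuming hypothetical handles hEdit (the input textbox) and hStatic (the output control):
#include <cwchar>

wchar_t input[16];
GetWindowTextW(hEdit, input, 16);                       // e.g. user typed "0470"
unsigned long code = std::wcstoul(input, nullptr, 16);  // parse as base-16
wchar_t symbol[2] = { static_cast<wchar_t>(code), L'\0' };
SetWindowTextW(hStatic, symbol);                        // shows Ѱ for 0470
Note this covers only code points in the Basic Multilingual Plane; a value above 0xFFFF would need to be encoded as a UTF-16 surrogate pair.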

Use of word with accent give an error

I'm creating a simple program in C++ using the WinAPI; see the code below:
CreateWindowW(L"STATIC", L"Portão", WS_CHILD | WS_VISIBLE, 10, 10, 100, 20, hwnd, (HMENU)ID_LABEL1, NULL, NULL);
The above code creates a static control on the main form. The problem is that the 2nd parameter uses a Brazilian Portuguese word with an accent (Portão means Gate), and it gives an error:
C:\CBProjects\ListF\main.cpp|46|error: converting to execution character set: Invalid argument|
I'm using wide characters (wchar_t*), but if I replace "Portão" with "Portao" (without the accent), it works just fine. Why? How can I solve this?
I'm using the Code::Blocks IDE with the MinGW compiler.
C++ has notions of a source character set and an execution character set. Basically, the source character set concerns the characters in the file containing the code, while the execution character set concerns the compiler's internal string representation.
Please look at this Stack Overflow question for more details on this topic.
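As a hedged illustration of how those two character sets translate to MinGW practice (the options are standard GCC flags; the CP1252 value is an assumption about how the file was saved):
// If main.cpp is saved in Windows-1252, tell GCC explicitly:
//   g++ -finput-charset=CP1252 -fexec-charset=UTF-8 main.cpp
// Re-saving the file as UTF-8 usually resolves the error as well,
// after which the wide literal compiles as expected:
CreateWindowW(L"STATIC", L"Portão", WS_CHILD | WS_VISIBLE,
              10, 10, 100, 20, hwnd, (HMENU)ID_LABEL1, NULL, NULL);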

Why Non-Unicode apps system locale makes Unicode fonts with symbol charset displayed incorrectly?

I'm trying to display Unicode chars from the Wingdings font (it's a Unicode TrueType font supporting the symbol charset only).
It's displayed correctly on my Win7/64 system using corresponding regional OS settings:
Formats: Russian
Location: Russia
System locale (AKA Language for Non-Unicode applications): English
But if I switch the System locale to Russian, Unicode characters with codes > 127 are displayed incorrectly (replaced with boxes).
My application is created using the Unicode character set in Visual Studio and calls only Unicode Windows API functions.
I also noted that several Windows apps display such chars incorrectly with symbol fonts (Symbol, Wingdings, Webdings, etc.), e.g. Notepad and Beyond Compare 3, while WordPad and the MS Office apps aren't affected.
Here is minimal code snippet (resources cleanup skipped for brevity):
LOGFONTW lf = { 0 };
lf.lfCharSet = SYMBOL_CHARSET;
lf.lfHeight = 50;
wcscpy_s(lf.lfFaceName, L"Wingdings");
HFONT f = CreateFontIndirectW(&lf);
SelectObject(hdc, f);
// First two chars displayed OK, 3rd and 4th aren't (replaced with boxes) if
// Non-Unicode apps language is NOT English.
TextOutW(hdc, 10, 10, L"\x7d\x7e\x81\xfc");
So the question is: why the hell does the Non-Unicode apps language setting affect Unicode apps?
And what is the correct (and simplest) way to display SYMBOL_CHARSET fonts without a dependency on the OS system locale?
The root cause of the problem is that the Wingdings font is actually a non-Unicode font. It supports Unicode partially, so some symbols are still displayed correctly. See @Adrian McCarthy's answer for details about how this probably works under the hood.
Also see more info here: http://www.fileformat.info/info/unicode/font/wingdings
and here: http://www.alanwood.net/demos/wingdings.html
So what can we do to avoid such problems? I found several ways:
1. Quick & dirty
Fall back to the ANSI version of the API, as @user1793036 suggested:
TextOutA(hdc, 10, 10, "\x7d\x7e\x81\xfc"); // Displayed correctly!
2. Quick & clean
Use the special Unicode range U+F000-U+F0FF (Private Use Area) instead of the ASCII character codes. It's supported by Wingdings:
TextOutW(hdc, 10, 10, L"\xf07d\xf07e\xf081\xf0fc"); // Displayed correctly!
To explore which Unicode symbols a font actually supports, a font viewer can be used, e.g. dp4 Font Viewer.
3. Slow & clean, but generic
But what should you do if you don't know in advance which characters you have to display and which font will actually be used? Here is the most universal solution - draw the text by glyphs to avoid any undesired translations:
void TextOutByGlyphs(HDC hdc, int x, int y, const CStringW& text)
{
    CStringW glyphs;
    GCP_RESULTSW gcpRes = {0};
    gcpRes.lStructSize = sizeof(GCP_RESULTS);
    gcpRes.lpGlyphs = glyphs.GetBuffer(text.GetLength());
    gcpRes.nGlyphs = text.GetLength();
    const DWORD flags = GetFontLanguageInfo(hdc) & FLI_MASK;
    // Map the characters to glyph indices for the currently selected font.
    GetCharacterPlacementW(hdc, text.GetString(), text.GetLength(), 0,
                           &gcpRes, flags);
    glyphs.ReleaseBuffer(gcpRes.nGlyphs);
    // Draw by glyph index, bypassing any character-set translation.
    ExtTextOutW(hdc, x, y, ETO_GLYPH_INDEX, NULL, glyphs.GetString(),
                glyphs.GetLength(), NULL);
}
TextOutByGlyphs(hdc, 10, 10, L"\x7d\x7e\x81\xfc"); // Displayed correctly!
Note the use of the GetCharacterPlacementW() function. For some unknown reason, the similar function GetGlyphIndicesW() does not work, returning 'unsupported' dummy values for chars > 127.
Here's what I think is happening:
The Wingdings font doesn't have Unicode mappings (a cmap table?). (You can see this by using charmap.exe: the Character set drop-down control is grayed out.)
For fonts without Unicode mappings, I think Windows assumes that it depends on the "Language for Non-Unicode applications" setting.
When that's English, Windows (probably) uses code page 1252, and all the values map to themselves.
When that's Russian, Windows (probably) uses code page 1251, and then tries to remap them.
The '\x81' value in code page 1251 maps to U+0403, which obviously doesn't exist in the font, so you get a box. Similarly, '\xFC' maps to U+044C.
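That remapping is easy to observe directly; a small self-contained sketch using the standard conversion API:
#include <windows.h>
#include <cstdio>

int main()
{
    // Convert the question's bytes through code page 1251, as Windows
    // presumably does when the Non-Unicode apps language is Russian.
    const char bytes[] = "\x7d\x7e\x81\xfc";
    wchar_t wide[8] = {};
    MultiByteToWideChar(1251, 0, bytes, -1, wide, 8);
    for (int i = 0; wide[i]; ++i)
        wprintf(L"0x%02x -> U+%04X\n", (unsigned)(unsigned char)bytes[i], wide[i]);
    // Prints: 0x7d -> U+007D, 0x7e -> U+007E, 0x81 -> U+0403, 0xfc -> U+044C
}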
I assumed that if you used ExtTextOutW with the ETO_GLYPH_INDEX flag, Windows wouldn't try to interpret the values at all and just treat them as glyph indexes into the font. But that assumption is wrong.
However, there is another flag called ETO_IGNORELANGUAGE, which is reserved, but, empirically, it seems to solve the problem.
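And a hedged sketch of that workaround (the flag is documented only as reserved, so this relies on observed behavior rather than a contract):
// ETO_IGNORELANGUAGE empirically suppresses the locale-driven
// remapping described above:
ExtTextOutW(hdc, 10, 10, ETO_IGNORELANGUAGE, NULL,
            L"\x7d\x7e\x81\xfc", 4, NULL);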