How to use UTF-8 encoding in VC++ MFC application - c++

I'm trying to add a "static text" control (a simple label) with UTF-8 encoded text to my Windows application. If I use the designer tool in Visual Studio 2017 and set the text through the Properties window, everything looks just fine - but after opening the .rc file, the text is different (bad encoding).
I read that I need to change the encoding of the file to UTF-8 with BOM, but I have no such option there... If I change the file encoding to CP1252, the program doesn't compile - so I'm using "Unicode (UTF-8 with signature) - Codepage 65001" now.
SetWindowTextA("ěščřžýáíé");
GetDlgItem(IDC_SERIAL_NUMBER_TITLE)->SetWindowTextA("ěščřžýáíé");
this code gives two different results - the same string is rendered differently in the title and in the label.
and this code works only for the title & message boxes:
CStringA utf8 = CW2A(L"ěščřžýáíé", CP_UTF8);
CStringW utf16 = CA2W(utf8, CP_UTF8);
MessageBoxW(0, utf16, 0, 0);
Why is it so complicated? Is it not possible to normally use utf8 text?
Can anyone help me solve this problem? Thanks!

You have to save your .rc file encoding to Unicode - Codepage 1200
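Alternatively, if you'd rather keep the .rc file in UTF-8, the resource compiler can be told the encoding explicitly with a code-page pragma at the top of the file (a sketch; without it, RC assumes the system ANSI code page for files without a BOM):

```
// first lines of the .rc file
#pragma code_page(65001)   // interpret the rest of this file as UTF-8
```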

In fact, it was very simple.
// from:
SetWindowTextA("ěščřžýáíé");
// to:
SetWindowTextW(m_hWnd, L"ěščřžýáíé");
// also switch to the wide-character types everywhere:
// std::string -> std::wstring
// CString -> CStringW
// etc.
and that's it :D
Also, this was very helpful and good for understanding what was going on there!

Related

Win32 Resource Dialog Text - UTF-8 - Only displays first character of each string

I'm looking to move from ASCII to UTF-8 everywhere in my Windows Desktop (Win32/MFC) application. This is as opposed to doing the usual move to UTF-16. The idea being, fewer changes will need to be made, and interfacing with external systems that talk in UTF-8 will require less work.
The problem is that the static control and button in the dialog box from the resource file only ever display the first character of their kanji text. Should resource files work just fine using UTF-8?
Dialog illustrating the problem
UTF-8 strings appear to be read and displayed correctly coming from the String Table in the resource file, but not text directly on dialogs themselves.
I am testing using kanji characters.
How the dialog appears in the resource editor
I have:
Used a font in the dialog box that should support the kanji characters I'm trying to use.
Ensured the .rc file is in UTF-8 using Notepad++'s conversion function.
Added the #pragma code_page(65001) at the top of the .rc file as per "The Resource Compiler defaults to CP_ACP, even in the face of subtle hints that the file is UTF-8"
Added the manifest for UTF-8 support in "Manifest Tool/Input and Output/Additional Manifest Files" as per "Use UTF-8 code pages in Windows apps"
Added /utf-8 to "C/C++/Command Line/Additional Options" as per "Set source and execution character sets to UTF-8"
Using UTF-8 everywhere means std::string, CStringA and the -A Win32 functions implicitly by using the "Advanced/Character Set" value of "Not Set". Additionally, the resource file is in UTF-8, including dialogs with their text, String Tables etc. If I set it to "Use Unicode Character Set", my understanding is that UTF-16 and -W functions will be the default everywhere - the standard Windows way of supporting Unicode historically.
The pragma appears to work, as the Resource Editor in Visual Studio does not clobber the .rc file into UTF-16LE. Also, the manifest appears to work as the MessageBox() (MessageBoxA) function displays text from the String Table correctly. Without the manifest, the MessageBox() displays question marks.
TCHAR buffer[512];
LoadString(hInst, IDS_TESTKANJI, buffer, 512 - 1);
MessageBox(hWnd, buffer, _T("Caption"), MB_OK);
Successful message box
If I set the Character Encoding to "Use Unicode Character Set", everything appears to work as expected - all characters are displayed.
Dialog successfully showing kanji
My suspicion is that the encoding is going UTF-8(.rc file) -> UTF-16(internal representation) -> ASCII (Dialog text loading?), meets a null character from the UTF-16 representation, and stops after reading the first character.
If I call SetDlgItemText() on my static control using text from the String Table, the static control will show all the characters correctly:
case WM_COMMAND:
    if (LOWORD(wParam) == IDOK)
    {
        TCHAR buffer[512];
        LoadString(hInst, IDS_TESTKANJI, buffer, 512 - 1);
        SetDlgItemText(hDlg, IDC_STATIC, buffer);
        ...
Windows OS Build: 19044.2130
Visual Studio 2022 17.4.2
Windows SDK Version: 10.0.22621.0
Platform Toolset: Visual Studio 2022 (v143)
It seems like the current answer to displaying UTF-8 text on dialogs is to manually - in code - set the text using a function like SetDlgItemText() with the UTF-8 string, and not rely on the resource loading of the dialog creation code itself. With the UTF-8 manifest, the -A functions are called, and they'll set the UTF-8 text just fine.
You can also call a -W function explicitly and convert UTF-8 -> UTF-16 before the call. See UTF-8 text in MFC application that uses Multibyte character set.
See also Microsoft CreateDialogIndirectA macro (winuser.h) which is unusually explicit in relation to this: "All character strings in the dialog box template, such as titles for the dialog box and buttons, must be Unicode strings."
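To make the conversion step concrete, here is a portable sketch of what a UTF-8 -> UTF-16 conversion does (hand-rolled so the logic is visible, and assuming valid input; real Windows code would just call MultiByteToWideChar):

```cpp
#include <cstdint>
#include <string>

// Minimal UTF-8 -> UTF-16 converter (no validation; assumes well-formed input).
// Illustrative only; on Windows, use MultiByteToWideChar(CP_UTF8, ...) instead.
std::u16string Utf8ToUtf16(const std::string& in)
{
    std::u16string out;
    size_t i = 0;
    while (i < in.size())
    {
        unsigned char b = static_cast<unsigned char>(in[i]);
        uint32_t cp = 0;
        size_t len = 1;
        // Decode the lead byte: its high bits tell us the sequence length.
        if      (b < 0x80)           { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else                         { cp = b & 0x07; len = 4; }
        // Fold in the continuation bytes (6 payload bits each).
        for (size_t j = 1; j < len && i + j < in.size(); ++j)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + j]) & 0x3F);
        i += len;
        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Code points above U+FFFF need a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 | (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 | (cp & 0x3FF)));
        }
    }
    return out;
}
```

Code points above U+FFFF (e.g. emoji) don't fit in one char16_t, which is why they become a surrogate pair - and why treating UTF-16 as "one unit per character" eventually breaks.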

Can't display 'ä' in GLFW window title

glfwSetWindowTitle(win, "Nämen");
Becomes "N?men", where '?' is in a little black, twisted square, indicating that the character could not be displayed.
How do I display 'ä'?
If you want to use non-ASCII letters in the window title, then the string has to be utf-8 encoded.
GLFW: Window title:
The window title is a regular C string using the UTF-8 encoding. This means for example that, as long as your source file is encoded as UTF-8, you can use any Unicode characters.
If you see a little black, twisted square, that indicates the ä is encoded with some ISO encoding that is not UTF-8, maybe something like Latin-1. To fix this, open the file in an editor that can change the file's encoding, change it to UTF-8 (without BOM) and fix the ä in the title.
It seems like the GLFW implementation does not work according to the specification in this case. Probably the function still uses Latin-1 instead of UTF-8.
I had the same problem on GLFW 3.3 Windows 64 bit precompiled binaries and fixed it like this:
SetWindowTextA(glfwGetWin32Window(win), "Nämen");
The issue does not lie within GLFW but within the compiler. Encodings are handled by the major compilers as follows:
Good guy clang assumes that every file is encoded in UTF-8
Trusty gcc checks the system's settings[1] and falls back to UTF-8 when it fails to determine one.
MSVC checks for a BOM and uses the detected encoding; otherwise it assumes that the file is encoded using the user's current code page[2].
You can determine your current code page by simply running chcp in your (Windows) console or PowerShell. For me, on a fresh install of Windows 11, it yields "850" (language: English; keyboard: German), which stands for Code Page 850.
To fix this issue you have several solutions:
Change your system's code page. Arguably the worst solution.
Prefix your strings with u8, escape all Unicode literals and convert the string to wide chars before passing it to Win32 functions; e.g.:
const char* title = u8"\u0421\u043b\u0430\u0432\u0430\u0020\u0423\u043a\u0440\u0430\u0457\u043d\u0456\u0021";
// This conversion is actually performed by GLFW; see footnote [3]
const int len = MultiByteToWideChar(CP_UTF8, 0, title, -1, NULL, 0);
wchar_t* buf = static_cast<wchar_t*>(_malloca(len * sizeof(wchar_t)));
MultiByteToWideChar(CP_UTF8, 0, title, -1, buf, len);
SetWindowTextW(hWnd, buf);
_freea(buf);
Save your source files with UTF-8 encoding WITH BOM. This allows you to write your strings without having to escape them. You'll still need to convert the string to a wide char string using the method seen above.
Specify the /utf-8 flag when compiling; this has the same effect as the previous solution, but you don't need the BOM anymore.
The solutions stated above still require you to convert your good and nice string to a big chunky wide string.
Another option would be to provide a manifest[4] with the activeCodePage set to UTF-8. This way all Win32 functions with the A suffix (e.g. SetWindowTextA) accept and properly handle UTF-8 strings, provided the running system is Windows version 1903 or newer.
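For reference, that manifest fragment looks roughly like this, following the Microsoft "Use UTF-8 code pages in Windows apps" documentation (the assemblyIdentity attribute values here are placeholders):

```xml
<?xml version="1.0" encoding="utf-8"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <assemblyIdentity type="win32" name="YourApp" version="1.0.0.0"/>
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
```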
TL;DR
Compile your application with the /utf-8 flag active.
IMPORTANT: This works for the Win32 APIs. This doesn't let you magically write Unicode emojis to the console like a hipster JS developer.
[1] I suppose it reads the LC_ALL setting on Linux. In the last 6 years I have never seen a distribution that does NOT specify UTF-8. However, take this information with a grain of salt; I might be entirely wrong about how gcc handles this now.
[2] "If no byte-order mark is found, it assumes that the source file is encoded in the current user code page [...]."
[3] GLFW performs the conversion as seen here.
[4] More about Win32 ANSI APIs, manifests and UTF-8 can be found here.

VS2019 compiler misinterprets UTF8 without BOM file as ANSI

I used to compile my C++ wxWidgets-3.1.1 application (Win10x64) with VS2015 Express. I wanted to upgrade my IDE to VS2019 community, which seemed to work quite well.
My project files are partly from older projects, so their encoding differs (Windows-1252, UTF-8 without BOM, ANSI).
With VS2015 I was able to compile and give out messages (hardcoded in my .cpp files), which displayed unicode characters correctly.
The same app compiled with VS2019 community shows, for example, the German word "übergabe" as "Ã¼bergabe", which is uninterpreted UTF-8.
Saving the .cpp file that contains the Unicode explicitly as UTF-8 WITH BOM solves this issue. But I don't want to run through all files in all projects. Can I change the expected input for "without BOM" files to UTF-8 to get the same behaviour that VS2015 had?
[EDIT]
It seems there is no such option. As I said before, converting all .cpp/.h files to UTF-8-BOM is a solution.
Thus, so far the only suitable way is to loop through the directory, rewrite the files in UTF-8 while prepending the BOM.
Using C++ wxWidgets, this is (part of) my attempt to automate the process:
// Read in the file, convert its content to UTF-8 if necessary
wxFileInputStream fis(fileFullPath);
wxFile file(fileFullPath);
size_t dataSize = file.Length();
void* data = malloc(dataSize);
if (!fis.ReadAll(data, dataSize))
{
    wxString sErr;
    sErr << "Couldn't read file: " << fileFullPath;
    wxLogError(sErr);
}
else
{
    // Detect the BOM on the raw bytes, before any conversion
    wxBOM bomType = wxConvAuto::DetectBOM((char*)data, dataSize);
    wxString sData((char*)data, dataSize);
    wxString sUTF8Data;
    if (wxEmptyString == wxString::FromUTF8(sData))
    {
        // Not valid UTF-8 yet (e.g. ANSI) - convert it
        sUTF8Data = sData.ToUTF8();
    }
    else
    {
        sUTF8Data = sData;
    }
    if (wxBOM_UTF8 != bomType)
    {
        if (wxBOM_None == bomType)
        {
            // Rewrite the file as UTF-8, prepending the BOM
            wxFFileOutputStream out(fileFullPath);
            unsigned char utf8bom[] = { 0xEF, 0xBB, 0xBF };
            out.Write((char*)utf8bom, sizeof(utf8bom));
            const wxScopedCharBuffer buf = sUTF8Data.utf8_str();
            out.Write(buf.data(), buf.length());
        }
        else
        {
            wxLogError("File already contains a different BOM: " + fileFullPath);
        }
    }
}
free(data);
Note that this cannot convert all encodings; basically, AFAIK, it can only convert ANSI files or add the BOM to UTF-8 files without one. For all other encodings, I open the project in VS2019, select the file and go (freely translated into English, names might differ):
-> File -> XXX.cpp save as... -> Use the little arrow in the "Save" button -> Save with encoding... -> Replace? Yes! -> "Unicode (UTF-8 with signature) - Codepage 65001"
(Don't take "UTF-8 without signature" which is also Codepage 65001, though!)
The option /utf-8 specifies both the source character set and the execution character set as UTF-8.
Check the Microsoft docs
The C++ team blog that explains the charset problem

CStdioFile problems with encoding on read file

I can't read a file correctly using CStdioFile.
I open notepad.exe, type àèìòùáéíóú and save it twice, once setting the encoding to ANSI (which is really CP-1252) and once to UTF-8.
Then I try to read it from MFC with the following block of code
BOOL ReadAllFileContent(const CString &FilePath, CString *fileContent)
{
    CString sLine;
    BOOL isSuccess = false;
    CStdioFile input;
    isSuccess = input.Open(FilePath, CFile::modeRead);
    if (isSuccess) {
        while (input.ReadString(sLine)) {
            fileContent->Append(sLine);
        }
        input.Close();
    }
    return isSuccess;
}
When I call it with the ANSI file, I get the expected result àèìòùáéíóú,
but when I try to read the UTF-8 encoded file I get à èìòùáéíóú.
I would like my function to work with all files regardless of the encoding.
Why do I need to implement this myself?
EDIT:
Unfortunately, in the real app the files come from an external app, so changing the file encoding isn't an option. I must be able to read both UTF-8 and CP-1252 files.
Any file is valid ANSI; what Notepad calls ANSI is really the Windows-1252 encoding.
I've figured out a way to read UTF-8 and CP-1252 correctly based on the example provided here. Although it works, I need to pass the file encoding, which I don't know in advance.
Thanks!
I personally use the class as advertised here:
https://www.codeproject.com/Articles/7958/CTextFileDocument
It has excellent support for reading and writing text files in various encodings, including Unicode in its various flavours.
I have not had a problem with it.
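Since the encoding isn't known in advance, a common heuristic is: check for a UTF-8 BOM, otherwise test whether the raw bytes are well-formed UTF-8 (CP-1252 text containing accented letters almost never is), and fall back to CP-1252. A portable sketch of that check, assuming the actual decoding is then done elsewhere (e.g. with MultiByteToWideChar using CP_UTF8 or 1252):

```cpp
#include <string>

// Returns true if the buffer is well-formed UTF-8 (simplified check:
// overlong sequences and surrogate ranges are not rejected).
// CP-1252 text with accented letters is almost never valid UTF-8,
// so this is a reasonable basis for choosing the decoder.
bool LooksLikeUtf8(const std::string& buf)
{
    for (size_t i = 0; i < buf.size(); )
    {
        unsigned char b = static_cast<unsigned char>(buf[i]);
        size_t extra;
        if      (b < 0x80)           extra = 0; // ASCII
        else if ((b & 0xE0) == 0xC0) extra = 1; // 2-byte sequence
        else if ((b & 0xF0) == 0xE0) extra = 2; // 3-byte sequence
        else if ((b & 0xF8) == 0xF0) extra = 3; // 4-byte sequence
        else return false;                      // invalid lead byte
        if (i + extra >= buf.size() && extra != 0)
            return false;                       // truncated sequence
        for (size_t j = 1; j <= extra; ++j)
            if ((static_cast<unsigned char>(buf[i + j]) & 0xC0) != 0x80)
                return false;                   // bad continuation byte
        i += extra + 1;
    }
    return true;
}

// Returns true if the buffer starts with the UTF-8 BOM (EF BB BF).
bool HasUtf8Bom(const std::string& buf)
{
    return buf.size() >= 3 &&
           static_cast<unsigned char>(buf[0]) == 0xEF &&
           static_cast<unsigned char>(buf[1]) == 0xBB &&
           static_cast<unsigned char>(buf[2]) == 0xBF;
}
```

The function names are made up for illustration; the point is that a single byte like 0xE0 ("à" in CP-1252) on its own is an invalid UTF-8 lead-byte/continuation combination, which is what makes the heuristic reliable in practice.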

Reading file with cyrillic

I have to open a file with Cyrillic symbols. I've encoded the file as UTF-8. Here is an example:
en: Couldn't your family afford a
costume for you
ru: Не ваша семья
позволить себе костюм для вас
This is how I open the file:
std::ifstream readFile(fileData.c_str());
std::string buffer;
while (std::getline(readFile, buffer))
{
    ...
}
The first trouble: there is some symbol before the text 'en' (I saw this in the debugger):
"en: least"
And another trouble is the Cyrillic symbols:
" ru: наименьший"
What's wrong?
there is some symbol before text 'en'
That's a faux-BOM, the result of encoding a U+FEFF BYTE ORDER MARK character into UTF-8.
Since UTF-8 is an encoding that does not have a byte order, the faux-BOM shouldn't ever be used, but unfortunately quite a bit of existing software (especially in the MS world) does nonetheless. Load the messages file into a text editor and save it back out again as UTF-8, using a “UTF-8 without BOM” encoding if one is especially listed.
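If you can't control how the file is saved, you can also strip the faux-BOM defensively after reading. A minimal sketch (the helper name is made up for illustration):

```cpp
#include <string>

// Remove a leading UTF-8 encoded BOM (EF BB BF) if present.
// Call this on the first line read from the file.
void StripUtf8Bom(std::string& line)
{
    if (line.size() >= 3 &&
        static_cast<unsigned char>(line[0]) == 0xEF &&
        static_cast<unsigned char>(line[1]) == 0xBB &&
        static_cast<unsigned char>(line[2]) == 0xBF)
    {
        line.erase(0, 3); // drop the three BOM bytes
    }
}
```

After std::getline reads the first line, calling StripUtf8Bom(buffer) leaves "en: ..." without the invisible prefix.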
ru: наименьший
That's what you get when you've got a UTF-8 byte string (representing наименьший) and you print it as if it were a Code Page 1252 (Windows Western European) byte string. It's not an input problem; you have read in the string OK and have a UTF-8 byte string. But then, in code you haven't quoted, it gets output as cp1252.
If you're just printing it to the console, this is to be expected, as the console always uses the system default code page (1252 on a Western Windows install), and not UTF-8. If you need to send Unicode to the console you'll have to convert the bytes to native-Unicode wchar_t strings and write them from there. I don't know what the final destination for your strings is, though... if you're just going to write them to another file or something you could just keep them as bytes and not care about what encoding they're in.
I suppose that your OS is Windows. There are several simple ways:
Use wchar_t, wstring, wifstream, etc.
Use the ICU library
Use some other library (there are really many of them)
Note: for console printing you must use WinAPI functions to convert UTF-8 to cp866 (the DOS Cyrillic code page; my default Cyrillic Windows ANSI encoding is cp1251), because the Windows console supports only DOS encodings.
Note: for file printing you need to know what encoding your file uses.
Use libiconv to convert the text to a usable encoding after reading.
Use ICU to convert the text.
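As a sketch of the libiconv route (POSIX iconv, also built into glibc; error handling is mostly elided, and CP1251 is just an example target chosen because the question is about Cyrillic on Windows):

```cpp
#include <iconv.h>
#include <cassert>
#include <string>

// Convert a UTF-8 byte string to CP1251 using iconv.
// Minimal sketch; real code should loop and handle E2BIG/EILSEQ properly.
std::string Utf8ToCp1251(const std::string& in)
{
    iconv_t cd = iconv_open("CP1251", "UTF-8"); // (to-encoding, from-encoding)
    assert(cd != (iconv_t)-1);
    // CP1251 output is never longer than the UTF-8 input; pad a little anyway.
    std::string out(in.size() + 16, '\0');
    char* src = const_cast<char*>(in.data());
    size_t srcLeft = in.size();
    char* dst = &out[0];
    size_t dstLeft = out.size();
    size_t rc = iconv(cd, &src, &srcLeft, &dst, &dstLeft);
    iconv_close(cd);
    assert(rc != (size_t)-1);
    out.resize(out.size() - dstLeft); // trim to the bytes actually written
    return out;
}
```

The same call with the arguments to iconv_open swapped converts in the other direction.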