Unicode console output with wxwidgets - c++

I'm trying to use wxWidgets 3.0.3 with mingw-w64 on Windows.
wxPrintf is patched as in https://github.com/wxWidgets/wxWidgets/commit/06458cb89fb8449f377b0b782404b9a9cbe3ae2d#diff-9cf4eef4822377649a928c11237e38f6
The source code is saved in UTF-8.
I initialize wxLocale like this:
wxLocale m_locale;
m_locale.Init(wxLANGUAGE_RUSSIAN, wxLOCALE_DONT_LOAD_DEFAULT);
The console output uses an 8-bit encoding (CP866), which I can verify with GetConsoleOutputCP() and GetConsoleCP(). Output is therefore correct for Latin and Russian characters, but not for Greek (the Lucida Console font is used):
wxString s = L"Latin, Русский, \u03BE ρπξ\n\n";
or
wxString s = wxString::FromUTF8("Latin, Русский, \u03BE ρπξ\n\n");
wxPrintf(s.utf8_str()); // not correct output for Greek
If I force console output to be UTF-8:
SetConsoleCP(CP_UTF8);
SetConsoleOutputCP(CP_UTF8);
wxPrintf doesn't work correctly.
(I can get correct output once using std::cout << s.utf8_str().data(). Is it some memory leak?)
Calling SetConsoleOutputCP(CP_WINUNICODE); doesn't change the console encoding (it remains CP866).
Is there a way to get Unicode console output using standard wxWidgets means (wxPrintf and the stream classes provided by the wxWidgets library)?

Related

How do I use system("chcp 936") in my dialog based project?

The code below is supposed to convert the wstring "！" to a string and output it:
setlocale(LC_ALL, "Chinese_China.936");
//system("chcp 936");
std::wstring ws = L"！";
std::string as(ws.length() * sizeof(wchar_t), '-');
// wcstombs returns (size_t)-1 if a character cannot be converted
auto rs = wcstombs(&as[0], ws.c_str(), as.length());
as.resize(rs);
std::cout << rs << ":" << as << std::endl;
If you run it without system("chcp 936");, the converted string is displayed as "£¡" rather than "！". With system("chcp 936");, the result is correct in a console project.
But in my dialog-based project, system("chcp 936") is useless; even if it worked, I couldn't use it, because it would pop up a console.
PS: the IDE is Visual Studio 2019, and my source code is stored as UTF-8 with signature (BOM).
My operating system language is English, and the language for non-Unicode programs is English (United States).
Edit: interestingly, even with the "en-US" locale, "！" can be converted to an ASCII "!".
But I don't understand where the "£¡" in the dialog-based project comes from.
There are two distinct points to consider with locales:
you must tell the program what charset should be used when converting Unicode characters to plain bytes (this is the role of setlocale)
you must tell the terminal what charset it should render (this is the role of chcp in the Windows console)
The first point depends on the language and, optionally, the libraries that you use in your program (here the C++ language and its Standard Library).
The second point depends on the console application and the underlying system. The Windows console uses chcp, and you will find in that other post how you can configure xterm on a Unix-like system.
I found the cause: the wstring-to-string conversion was not the problem. The problem was that I used CA2T to convert the Chinese punctuation mark and it failed, so "£¡" ended up in the UI.
Using mbstowcs, the counterpart of wcstombs, works.

Why does LC_ALL setlocale setting affect cout output in Powershell?

I'm trying to understand some behavior I'm seeing.
I have this C++ program:
// Outputter.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <iostream>
int main()
{
// UTF-8 bytes for "日本語"
std::cout << (char)0xE6 << (char)0x97 << (char)0xA5 << (char)0xE6 << (char)0x9C << (char)0xAC << (char)0xE8 << (char)0xAA << (char)0x9E;
return 0;
}
If I run the following in Powershell:
[System.Console]::OutputEncoding = [System.Console]::InputEncoding = [System.Text.Encoding]::UTF8
.\print_it.exe # This is the above program ^
日本語 # This is the output as displayed in Powershell
Then 日本語 is printed and displayed correctly in Powershell.
However if I add setlocale(LC_ALL, "English_United States.1252"); to the code, like this:
int main()
{
setlocale(LC_ALL, "English_United States.1252");
// UTF-8 bytes for "日本語"
std::cout << (char)0xE6 << (char)0x97 << (char)0xA5 << (char)0xE6 << (char)0x9C << (char)0xAC << (char)0xE8 << (char)0xAA << (char)0x9E;
return 0;
}
The program now prints garbage to PowerShell (æ—¥æœ¬èªž to be precise, which is the code page 1252 misinterpretation of those bytes).
BUT if I pipe the output to a file and then cat the file, it looks fine:
.\print_it.exe > out.txt
cat out.txt
日本語 # It displays fine, like this, if I redirect to a file and cat the file.
Also, Git bash displays the output properly no matter what I setlocale to.
Could someone please help me understand why setlocale is affecting how the output is displayed in Powershell, even though the same bytes are being written to stdout? It seems like Powershell is somehow able to access the locale of the program and uses that to interpret output?
Powershell version is 5.1.17763.592.
It is all about encoding. You get the correct characters with the > redirect because > uses UTF-16LE by default, so your configured encoding 1252 is automatically converted to UTF-16.
Depending on your PowerShell version, you may or may not be able to change the encoding of the redirect.
If you use Out-File with the -Encoding switch, you can change the encoding of the destination file (again, depending on your PowerShell version).
I recommend reading mklement0's excellent SO post on this topic.
Edit based on comment
Taken from cppreference
std::setlocale
Defined in header <clocale>
char* setlocale( int category, const char* locale);
The setlocale function installs the specified system locale or its portion as the new C locale. The modifications remain in effect and influences the execution of all locale-sensitive C library functions until the next call to setlocale. If locale is a null pointer, setlocale queries the current C locale without modifying it.
The bytes you are sending to std::cout are the same, but std::cout is locale-sensitive, so it takes precedence over your PowerShell UTF-8 settings. If you leave out the setlocale() call, std::cout obeys the shell encoding.
If you have PowerShell 5.1 or above, > is effectively an alias for Out-File. You can set the encoding via $PSDefaultParameterValues, like this:
$PSDefaultParameterValues['Out-File:Encoding'] = 'UTF8'
Then you would get a UTF-8 file (with a BOM, which can be annoying!) instead of the default UTF-16LE.
Edit - adding some details as requested by OP
PowerShell uses the OEM code page, so by default you get whatever is configured in your Windows installation. I recommend reading an excellent post on encoding on Windows. The point is that without the UTF-8 setting in PowerShell, you are on whatever code page your system uses.
output.exe sets the locale to English_United States.1252 within the C++ program, and output_original.exe does not change it.
Here is the output without the UTF-8 PowerShell setting:
c:\t>.\output.exe
æ-¥æo¬èªz --> nonsense in the win1252 code page
c:\t>.\output.exe | hexdump
0000000 97e6 e6a5 ac9c aae8 009e --> both hex outputs are the same!
0000009
c:\t>.\output_original.exe
日本語 --> nonsense, but a different one! (depends on your locale setup; mine was English)
c:\t>.\output_original.exe | hexdump
0000000 97e6 e6a5 ac9c aae8 009e --> both hex outputs are the same!
0000009
So what happens here? Your program produces output based either on the locale set in the program itself or on the Windows one (which is OEM code page 1252 on my virtual machine). Notice that in both versions the hexdump is the same, but the displayed output (after decoding) is not.
If you set your PowerShell to UTF-8 with [System.Text.Encoding]::UTF8:
PS C:\t> [System.Console]::OutputEncoding = [System.Console]::InputEncoding = [System.Text.Encoding]::UTF8
PS C:\t> .\output.exe
日本語 --> the English locale 1252 is set within the program; notice that the output is similar to the one above (but the hexdump is different)
PS C:\t> .\output.exe | hexdump
0000000 bbef 3fbf 3f3f 0a0d -> again hex dump is same for both so they are producing the same output!
0000008
PS C:\t> .\output_original.exe
日本語 --> correct output due to the fact you have forced the PowerShell encoding to UTF8, thus removing the output dependence on the OEM code (windows)
PS C:\t> .\output_original.exe | hexdump
0000000 bbef 3fbf 3f3f 0a0d -> again hex dump is same for both so they are producing the same output!
0000008
What happens here? If you force the locale in your C++ application, std::cout output is formatted with that locale (1252), and those characters are then transformed into UTF-8 (which is why the first and second examples are a little different). When you do not force the locale in your C++ application, the PowerShell encoding is used, which is now UTF-8, and you get correct output.
One thing I found interesting: if you change your Windows system locale to a Chinese-compatible one (PRC, Macao, Taiwan, Hong Kong, etc.), you will get some Chinese characters when not forcing UTF-8, but different ones. That means that those bytes are Unicode only and thus it only works there. If you force UTF-8 in PowerShell, it works correctly even with the Chinese Windows system locale.
I hope this answers your question to a greater extent.
Rant:
It took me so long to investigate because my VS 2019 Community edition license expired (WTF, MS?) and I could not register it because the registration window was completely blank. Thanks MS, but no thanks.

printing Unicode characters C++

I'm trying to write a simple command line app to teach myself Japanese, but can't seem to get Unicode characters to print. What am I missing?
#include <iostream>
using namespace std;
int main()
{
wcout << L"こんにちは世界\n";
wcout << L"Hello World\n";
system("pause");
}
In this example only "Press any key to continue" is displayed. Tested on Visual C++ 2013.
This is not easy on Windows. Even when you manage to get the text to the Windows console you still need to configure cmd.exe to be able to display Japanese characters.
#include <iostream>
int main() {
std::cout << "こんにちは世界\n";
}
This works fine on any system where:
The compiler's source and execution encodings include the characters.
The output device (e.g., the console) expects text in the same encoding as the compiler's execution encoding.
A font with the appropriate characters is available (usually not a problem).
Most platforms these days use UTF-8 by default for all these encodings and so can support the entire Unicode range with code similar to the above. Unfortunately Windows is not one of these platforms.
wcout << L"こんにちは世界\n";
In this line the string literal data is (at compile time) converted from the source encoding to the execution wide encoding and then (at run time) wcout uses the locale it is imbued with to convert the wchar_t data to char data for output. Where things go wrong is that the default locale is only required to support characters from the basic source character set, which doesn't even include all ASCII characters, let alone non-ASCII characters.
So the conversion results in an error, putting wcout into a bad state. The error has to be cleared before wcout will function again, which is why the second print statement does not output anything.
You can work around this for a limited range of characters by imbuing wcout with a locale that will successfully convert the characters. Unfortunately the encoding that is needed to support the entire Unicode range this way is UTF-8; although Microsoft's implementation of streams supports other multibyte encodings, it very specifically does not support UTF-8.
For example:
wcout.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8_utf16<wchar_t>()));
SetConsoleOutputCP(CP_UTF8);
wcout << L"こんにちは世界\n";
Here wcout will correctly convert the string to UTF-8, and if the output were written to a file instead of the console then the file would contain the correct UTF-8 data. However the Windows console, even though configured here to accept UTF-8 data, simply will not accept UTF-8 data written in this way.
There are a few options:
Avoid the standard library entirely:
DWORD n;
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), L"こんにちは世界\n", 8, &n, nullptr);
Use non-standard magical incantation that will break standard code:
#include <fcntl.h>
#include <io.h>
_setmode(_fileno(stdout), _O_U8TEXT);
std::wcout << L"こんにちは世界\n";
After setting this mode std::cout << "Hello, World"; will crash.
Use a low level IO API along with manual conversion:
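#include <codecvt>
#include <cstdio>
#include <locale>
#include <string>
SetConsoleOutputCP(CP_UTF8);
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
// to_bytes returns a std::string, so pass its c_str() to the C I/O function
std::fputs(convert.to_bytes(L"こんにちは世界\n").c_str(), stdout);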
Using any of these methods, cmd.exe will display the correct text to the best of its ability, by which I mean it will display unreadable boxes. Seven little boxes, for the given string.
You can copy the text out of cmd.exe and into notepad.exe or whatever to see the correct glyphs.
There's a whole article about dealing with Unicode in Windows console
http://alfps.wordpress.com/2011/11/22/unicode-part-1-windows-console-io-approaches/
http://alfps.wordpress.com/2011/12/08/unicode-part-2-utf-8-stream-mode/
Basically, you may implement your own streambuf for std::cout (or std::wcout) in terms of WriteConsoleW and enjoy writing UTF-8 (or whatever Unicode encoding you want) to the Windows console without depending on locales or console code pages, and even without using wide characters.
It may not look very straightforward, but it's a convenient and reusable solution, which also lets you write portable, utf8-everywhere style user code. Please, don't beat me for my English :)
Or you can change Windows locale to Japanese.

std::wcout strange error: truncated output of std::wstring

I'm rather curious about this phenomenon: std::wcout can't output the whole content of a std::wstring. Am I missing something?
Here is my output:
F:\
F:\
My code snippet is as follows:
std::wstring ws(L"F:\\右旋不规则.pdf");
std::wcout << ws << std::endl;
std::wcout << ws.data() << std::endl;
There are already several threads on this topic:
Output unicode strings in Windows console app
Using Unicode font in C++ console app
Output Unicode to console Using C++, in Windows
The point is that you need the system to be able to display your Chinese characters (they are Chinese, right?). I don't think that the default fonts available for the console are able to do that. Lucida Console can be used for many Unicode characters, but I don't think it's able to display Chinese. If you have a font for that, you can add it to the console.
How to display japanese Kanji inside a cmd window under windows?
https://superuser.com/questions/5035/how-to-change-the-windows-xp-console-font

How can I display unicode characters in a linux terminal using C++?

I'm working on a chess game in C++ on a linux environment and I want to display the pieces using unicode characters in a bash terminal. Is there any way to display the symbols using cout?
An example that outputs a knight would be nice: ♞ = U+265E.
To output Unicode characters you just use output streams, the same way you would output ASCII characters. You can store the Unicode codepoint as a multi-character string:
std::string str = "\u265E";
std::cout << str << std::endl;
It may also be convenient to use wide character output if you want to output a single Unicode character with a codepoint above the ASCII range:
setlocale(LC_ALL, "en_US.UTF-8");
wchar_t codepoint = 0x265E;
std::wcout << codepoint << std::endl;
However, as others have noted, whether this displays correctly is dependent on a lot of factors in the user's environment, such as whether or not the user's terminal supports Unicode display, whether or not the user has the proper fonts installed, etc. This shouldn't be a problem for most out-of-the-box mainstream distros like Ubuntu/Debian with Gnome installed, but don't expect it to work everywhere.
Sorry, I misunderstood your question at first. This code prints a white king in the terminal (tested with KDE Konsole):
#include <iostream>
int main(int argc, char* argv[])
{
std::cout <<"\xe2\x99\x94"<<std::endl;
return 0;
}
Normally encoding is specified through a locale. Try to set environment variables.
In order to tell applications to use UTF-8 encoding, and assuming U.S. English is your preferred language, you could use the following command:
export LC_ALL=en_US.UTF-8
Are you using a "bare" terminal or something running under X-Server?