C++ - UTF 8 on windows 10 [duplicate]

C++ - UTF 8 on windows 10 [duplicate] - c++

If I want to make the following work on Windows, what is the correct locale and how do I detect that it is actually present:
Does this code work universaly, or is it just my system?

Although there isn't good support for named locales, Visual Studio 2010 does include the UTF-8 conversion facets required by C++11: std::codecvt_utf8 for UCS2 and std::codecvt_utf8_utf16 for UTF-16:
#include <fstream>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
void prepare_file()
{
// UTF-8 data
char utf8[] = {'\x7a', // latin small letter 'z' U+007a
'\xe6','\xb0','\xb4', // CJK ideograph "water" U+6c34
'\xf0','\x9d','\x84','\x8b'}; // musical sign segno U+1d10b
std::ofstream fout("text.txt");
fout.write(utf8, sizeof utf8);
}
void test_file_utf16()
{
std::wifstream fin("text.txt");
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
std::cout << "Read from file using UTF-8/UTF-16 codecvt\n";
for(wchar_t c; fin >> c; )
std::cout << std::hex << std::showbase << c << '\n';
}
void test_file_ucs2()
{
std::wifstream fin("text.txt");
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8<wchar_t>));
std::cout << "Read from file using UTF-8/UCS2 codecvt\n";
for(wchar_t c; fin >> c; )
std::cout << std::hex << std::showbase << c << '\n';
}
int main()
{
prepare_file();
test_file_utf16();
test_file_ucs2();
}
this outputs, on my Visual Studio 2010 EE SP1
Read from file using UTF-8/UTF-16 codecvt
0x7a
0x6c34
0xd834
0xdd0b
Read from file using UTF-8/UCS2 codecvt
0x7a
0x6c34
0xd10b
Press any key to continue . . .

Basically, you are out of luck: http://www.siao2.com/2007/01/03/1392379.aspx

In the past UTF-8 (and some other code pages) wasn't allowed as the system locale because
Microsoft said that a UTF-8 locale might break some functions as they were written to assume multibyte encodings used no more than 2 bytes per character, thus code pages with more bytes such as UTF-8 (and also GB 18030, cp54936) could not be set as the locale.
https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8
However Microsoft has gradually introduced UTF-8 locale support and started recommending the ANSI APIs (-A) again instead of the Unicode (-W) versions like before
Until recently, Windows has emphasized "Unicode" -W variants over -A APIs. However, recent releases have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. If the ANSI code page is configured for UTF-8, -A APIs operate in UTF-8. This model has the benefit of supporting existing code built with -A APIs without any code changes.
-A vs. -W APIs
Firstly they added a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox since Windows 10 insider build 17035 for setting the locale code page to UTF-8
To open that dialog box open start menu, type "region" and select Region settings > Additional date, time & regional settings > Change date, time, or number formats > Administrative
After enabling it you can call setlocal as normal:
Starting in Windows 10 build 17134 (April 2018 Update), the Universal C Runtime supports using a UTF-8 code page. This means that char strings passed to C runtime functions will expect strings in the UTF-8 encoding. To enable UTF-8 mode, use "UTF-8" as the code page when using setlocale. For example, setlocale(LC_ALL, ".utf8") will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
UTF-8 Support
You can also use this in older Windows versions
To use this feature on an OS prior to Windows 10, such as Windows 7, you must use app-local deployment or link statically using version 17134 of the Windows SDK or later. For Windows 10 operating systems prior to 17134, only static linking is supported.
Later in 2019 they added the ability for programs to use the UTF-8 locale without even setting the UTF-8 beta flag above. You can use the /execution-charset:utf-8 or /utf-8 options when compiling with MSVC or set the ActiveCodePage property in appxmanifest

Per MSDN, it would be named "english_us.65001". But code page 65001 is somewhat flaky on Windows.

Related

Using UTF-16 for I/O with Visual Studio instead of code pages

I have this working on Visual Studio 2019 using code pages:
#include <windows.h>
#include <iostream>
int main()
{
UINT oldcp = GetConsoleOutputCP();
SetConsoleOutputCP(932); //932 = Japanese.
//1200 for little-, 1201 big-, endian UTF-16
DWORD used;
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE),L"私の犬\n", 4,&used, 0);
std::cout << "Hit enter to end."; std::cin.get();
SetConsoleOutputCP(oldcp);
return 0;
}
But I am seeing from Microsoft that I should not be using code pages except to interface with legacy code -- use UTF-16 instead. I can find code pages for UTF-16 (little endian or big endian), but using them doesn't work and it's still using code pages.
So what can I use that accomplishes what my program does, but is up-to-date?

Set stdin and stdout to wide mode in Windows and use wcout and wcin with wide strings. You'll need to switch to a console font to support the characters and and IME to type them as well, which can be accomplished by installing the appropriate language support. You're getting that switch automatically by setting a code page, but the characters output correctly even in the "wrong" code page. If you select a font that supports the characters it will work.
#include <iostream>
#include <string>
#include <io.h>
#include <fcntl.h>
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
_setmode(_fileno(stdin), _O_WTEXT);
std::wcout << L"私の犬" << std::endl;
std::wstring a;
std::wcout << L"Type a string: ";
std::getline(std::wcin, a);
std::wcout << a << std::endl;
getwchar();
}
Output (terminal using code page 437 but NSimSun font):
私の犬
Type a string: 马克
马克

Technically every character encoding is a code page. To use UTF-16 you still have to specify the UTF-16 "code page". But you also need to _setmode first
Output unicode strings in Windows console app
How do I print Unicode to the output console in C with Visual Studio?
_setmode(_fileno(stdout), _O_U16TEXT);
std::cout << L"私の犬\n";
But is it up-to-date? No!!! The most reasonable way to print Unicode is to use the UTF-8 code page which will make your app cross-platform and is easier to maintain. See What is the Windows equivalent for en_US.UTF-8 locale? for details on this. Basically just
target Windows SDK v17134 or newer, or use static linking to work on older Windows versions
change the code page to UTF-8
use the -A Win32 APIs instead of -W ones if you're calling those directly (recommended by MS for portability, as everyone else was using UTF-8 for decades)
set the /execution-charset:utf-8 and/or /utf-8 flags while compiling
std::setlocale(LC_ALL, ".UTF8");
std::cout << "私の犬\n";
See also Is it possible to set "locale" of a Windows application to UTF-8?

How do I use system("chcp 936") in my dialog based project?

The code below is supposed to convert a wstring "！" to a string and output it,
setlocale(LC_ALL, "Chinese_China.936");
//system("chcp 936");
std::wstring ws = L"！";
string as((ws.length()) * sizeof(wchar_t), '-');
auto rs = wcstombs((char*)as.c_str(), ws.c_str(), as.length());
as.resize(rs);
cout << rs << ":" << as << endl;
If you run it without system("chcp 936");, the converted string is "£¡" rather than "！". If with system("chcp 936");, the result is correct in a console project.
But on my Dialog based project, system("chcp 936")is useless, even if it's workable, I can't use it, because it would popup a console.
PS: the IDE is Visual Studio 2019, and my source code is stored as in UTF-8 with signature.
My operation system language is English and language for non-unicode programs is English (United States).
Edit: it's interesting, even with "en-US" locale, "！" can be converted to an ASCII "!".
But I don't get where "£¡" I got in the dialog based project.

There are two distinct points to considere with locales:
you must tell the program what charset should be used when converting unicode characters to plain bytes (this is the role for setlocale)
you must tell the terminal what charset it should render (this is the role for chcp in Windows console)
The first point depends on the language and optionaly libraries that you use in your program (here the C++ language and Standard Library)
The second point depends on the console application and underlying system. Windows console uses chcp, and you will find in that other post how you can configure xterm in a Unix-like system.

I found out the cause, the wstring-to-string conversion is no problem, the problem was I used CA2T to convert the Chinese punctuation mark and it failed. So it showed "£¡" in the UI finally.
By means of mbstowcs, the counterpart of wcstombs, it would work.

C++ decimal points

I have this function:
#include <cstdlib>
#include <clocale>
double SimplifiedExample () {
setlocale (LC_ALL, "");
char txt [] = "3.14";
for (int x=0; txt [x]; x++)
if (txt [x] == '.')
txt [x] = localeconv ()->decimal_point [0];
char *endptr;
return strtod (txt, &endptr);
}
This works on every Windows 10 and Linux system I've tried it on. It fails on at least one Windows 7 machine (which reports "," for localeconv ()->decimal_point [0], but uses "." in both sprintf and strtod). Is the use of localeconv to supply the system decimal point to strtod correct, or not?
Note that I can't remove the setlocale() call. The application MUST respect locale settings.

Have you tried "man strtod"? I have Slackware linux and the strtod man page has strtod, which mentions locale for the decimal point, and says it lives in stdlib.h.

You're falling into a bog of implementation defined behaviour.
The problem might be in that the setlocale on Windows doesn't behave how it is proposed in standards, due to it being a library wrapper around OS's idiosincrazie, e.g. UTF-8 supported in C++ runtime only in Win10. More of, some C++ library implementations of sprintf\printf\strtod respects only std::locale::global. More of, some runtimes do not support anything but "C" and "" and formatted output only ANSI or IEEE-754 (hence , a dot). Old Win10 and older versions do not even respect IEEE-754.
E.g, an arbitrary MinGW64 on Windows-7 would respond to this code:
auto old_locale = std::locale::global(std::locale(""));
cout << old_locale.name() << "\n";
cout << std::locale("").name() << "\n";
with reporting an ANSI locale
C
C
output regardless of system locale and doesn't recognize OEM locale ".OCP" or any present locale name. This behaviour is hard-coded in library linked with application,not in some system-wide one.
Note that you can try check returned value of setlocale:
auto loc = setlocale(LC_ALL, "");
if (loc == NULL){
printf ("setlocale failed!\n");
}
else
printf ("setlocale %s!\n",loc );
And in my case it had returned... "English_United States.1252" while I have a US locale set for user and "Russian_Russia.1251" for russian user locale, which doesn't match more usual "En-US" format. Windows uses underscores.
There would be another problem to solve while interfacing with some database service which would be run as a different user. Existing databases already implement support in defining what locale the particular database uses.
More of output of all library functions didn't change with the change of locale.
Developers of multiplatform applications are urged to use proper library packages for formatted input\output which would eliminate the struggle with platform and implementation specific behaviour.

printing Unicode characters C++

I'm trying to write a simple command line app to teach myself Japanese, but can't seem to get Unicode characters to print. What am I missing?
#include <iostream>
using namespace std;
int main()
{
wcout << L"こんにちは世界\n";
wcout << L"Hello World\n"
system("pause");
}
In this example only "Press any key to continue" is displayed. Tested on Visual C++ 2013.

This is not easy on Windows. Even when you manage to get the text to the Windows console you still need to configure cmd.exe to be able to display Japanese characters.
#include <iostream>
int main() {
std::cout << "こんにちは世界\n";
}
This works fine on any system where:
The compiler's source and execution encodings include the characters.
The output device (e.g., the console) expects text in the same encoding as the compiler's execution encoding.
A font with the appropriate characters is available (usually not a problem).
Most platforms these days use UTF-8 by default for all these encodings and so can support the entire Unicode range with code similar to the above. Unfortunately Windows is not one of these platforms.
wcout << L"こんにちは世界\n";
In this line the string literal data is (at compile time) converted from the source encoding to the execution wide encoding and then (at run time) wcout uses the locale it is imbued with to convert the wchar_t data to char data for output. Where things go wrong is that the default locale is only required to support characters from the basic source character set, which doesn't even include all ASCII characters, let alone non-ASCII characters.
So the conversion results in an error, putting wcout into a bad state. The error has to be cleared before wcout will function again, which is why the second print statement does not output anything.
You can work around this for a limited range of characters by imbuing wcout with a locale that will successfully convert the characters. Unfortunately the encoding that is needed to support the entire Unicode range this way is UTF-8; Although Microsoft's implementation of streams supports other multibyte encodings it very specifically does not support UTF-8.
For example:
wcout.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8_utf16<wchar_t>()));
SetConsoleOutputCP(CP_UTF8);
wcout << L"こんにちは世界\n";
Here wcout will correctly convert the string to UTF-8, and if the output were written to a file instead of the console then the file would contain the correct UTF-8 data. However the Windows console, even though configured here to accept UTF-8 data, simply will not accept UTF-8 data written in this way.
There are a few options:
Avoid the standard library entirely:
DWORD n;
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), L"こんにちは世界\n", 8, &n, nullptr);
Use non-standard magical incantation that will break standard code:
#include <fcntl.h>
#include <io.h>
_setmode(_fileno(stdout), _O_U8TEXT);
std::wcout << L"こんにちは世界\n";
After setting this mode std::cout << "Hello, World"; will crash.
Use a low level IO API along with manual conversion:
#include <codecvt>
#include <locale>
SetConsoleOutputCP(CP_UTF8);
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
std::puts(convert.to_bytes(L"こんにちは世界\n"));
Using any of these methods, cmd.exe will display the correct text to the best of its ability, by which I mean it will display unreadable boxes. Seven little boxes, for the given string.
You can copy the text out of cmd.exe and into notepad.exe or whatever to see the correct glyphs.

There's a whole article about dealing with Unicode in Windows console
http://alfps.wordpress.com/2011/11/22/unicode-part-1-windows-console-io-approaches/
http://alfps.wordpress.com/2011/12/08/unicode-part-2-utf-8-stream-mode/
Basically, you may implement you own streambuf for std::cout (or std::wcout) in terms of WriteConsoleW and enjoy writing UTF-8 (or whatever Unicode you want) to Windows console without depending on locales, console code pages and even without using wide characters.
It may not look very straightforward, but it's convenient and reusable solution, which is also able to give you a portable utf8-everywhere style user code. Please, don't beat me for my English :)

Or you can change Windows locale to Japanese.

What is the Windows equivalent for en_US.UTF-8 locale?

If I want to make the following work on Windows, what is the correct locale and how do I detect that it is actually present:
Does this code work universaly, or is it just my system?

Although there isn't good support for named locales, Visual Studio 2010 does include the UTF-8 conversion facets required by C++11: std::codecvt_utf8 for UCS2 and std::codecvt_utf8_utf16 for UTF-16:
#include <fstream>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
void prepare_file()
{
// UTF-8 data
char utf8[] = {'\x7a', // latin small letter 'z' U+007a
'\xe6','\xb0','\xb4', // CJK ideograph "water" U+6c34
'\xf0','\x9d','\x84','\x8b'}; // musical sign segno U+1d10b
std::ofstream fout("text.txt");
fout.write(utf8, sizeof utf8);
}
void test_file_utf16()
{
std::wifstream fin("text.txt");
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
std::cout << "Read from file using UTF-8/UTF-16 codecvt\n";
for(wchar_t c; fin >> c; )
std::cout << std::hex << std::showbase << c << '\n';
}
void test_file_ucs2()
{
std::wifstream fin("text.txt");
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8<wchar_t>));
std::cout << "Read from file using UTF-8/UCS2 codecvt\n";
for(wchar_t c; fin >> c; )
std::cout << std::hex << std::showbase << c << '\n';
}
int main()
{
prepare_file();
test_file_utf16();
test_file_ucs2();
}
this outputs, on my Visual Studio 2010 EE SP1
Read from file using UTF-8/UTF-16 codecvt
0x7a
0x6c34
0xd834
0xdd0b
Read from file using UTF-8/UCS2 codecvt
0x7a
0x6c34
0xd10b
Press any key to continue . . .

Basically, you are out of luck: http://www.siao2.com/2007/01/03/1392379.aspx

Per MSDN, it would be named "english_us.65001". But code page 65001 is somewhat flaky on Windows.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js