Multiple calls to setlocale - c++

I am trying to figure out how Unicode is supported in C++.
When I want to output multilingual text to the console, I call std::setlocale. However, I noticed that the result depends on prior calls to setlocale.
Consider the following example. If run without arguments it calls setlocale once, otherwise it makes a prior call to setlocale to get the value of current locale and restore it at the end of the function.
#include <iostream>
#include <locale>
using namespace std;
int main(int argc, char** argv)
{
char *current_locale = 0;
if (argc > 1) {
current_locale = setlocale(LC_ALL, NULL);
wcout << L"Current output locale: " << current_locale << endl;
}
char* new_locale = setlocale(LC_ALL, "ru_RU.UTF8");
if (! new_locale)
wcout << L"failed to set new locale" << endl;
else
wcout << L"new locale: " << new_locale << endl;
wcout << L"Привет!" << endl;
if (current_locale) setlocale(LC_ALL, current_locale);
return 0;
}
The output is different:
:~> ./check_locale
new locale: ru_RU.UTF8
Привет!
:~> ./check_locale 1
Current output locale: C
new locale: ru_RU.UTF8
??????!
Is there something that setlocale(LC_ALL, NULL) does that needs to be taken care of in future setlocale calls?
The compiler is g++ 7.5.0 or clang++ 7.0.1. And the console is a linux console in a graphical terminal.
More details on the system config: OpenSUSE 15.1, linux 4.12, glibc 2.26, libstdc++6-10.2.1

Is there something that setlocale(LC_ALL, NULL) does that needs to be taken care of in future setlocale calls?
No, setlocale(..., NULL) does not modify the current locale. The following code is fine:
setlocale(LC_ALL, NULL);
setlocale(LC_ALL, "ru_RU.UTF8");
wprintf(L"Привет!\n");
However the following code will fail:
wprintf(L"anything"); // or even just `fwide(stdout, 1);`
setlocale(LC_ALL, "ru_RU.UTF8");
wprintf(L"Привет!\n");
The problem is that a stream has its own locale, which is determined at the point the stream's orientation is changed to wide.
// here stdout has no orientation and no locale associated with it
wprintf(L"anything");
// `stdout` stream orientation switches to wide stream
// current locale is used - `stdout` has C locale
setlocale(LC_ALL, "ru_RU.UTF8");
wprintf(L"Привет!\n");
// `stdout` is wide oriented
// current locale is ru_RU.UTF-8
// __but__ the locale of `stdout` is still C and cannot be changed!
The only documentation I found of this is the gnu.org manual section "Streams and I18N" (emphasis mine):
Since a stream is created in the unoriented state it has at that point no conversion associated with it. The conversion which will be used is determined by the LC_CTYPE category selected at the time the stream is oriented. If the locales are changed at the runtime this might produce surprising results unless one pays attention. This is just another good reason to orient the stream explicitly as soon as possible, perhaps with a call to fwide.
You can:
Use a separate locale for the C++ stream and the C FILE (see here):
std::ios_base::sync_with_stdio(false);
std::wcout.imbue(std::locale("ru_RU.utf8"));
Reopen stdout:
wprintf(L""); // stdout has C locale
char* new_locale = setlocale(LC_ALL, "ru_RU.UTF8");
freopen("/dev/stdout", "w", stdout); // stdout has no stream orientation
wprintf(L"Привет!\n"); // stdout is wide and ru_RU locale
I think (untested) that in glibc you can even reopen stdout with an explicit character set (see GNU opening streams):
freopen("/dev/stdout", "w,ccs=UTF-8", stdout);
std::wcout << L"Привет!\n"; // fine
In any case, try to set locale as soon as possible before doing anything else.

Related

Reading a file that contains chinese characters (C++)

I got issues reading a file that contains chinese characters. I know that the encoding of the file is Big5.
Here is my example file (test.txt), I can't include it here because of the chinese characters: https://gist.github.com/haruka98/974ca2c034ebd8fe7eeac4124739fc41
This is my minimal code example (main.cpp), the one I'm actually using breaks down each line and does things with the different fields.
#include <string>
#include <fstream>
#include <iostream>
int main(int argc, char* argv[]) {
setlocale(LC_ALL, "Chinese-traditional");
std::wstring wstr;
std::wifstream input_file("test.txt");
std::wofstream output_file("test_output.txt");
int counter = 0;
while(std::getline(input_file, wstr)) {
for(int i = 0; i < wstr.size(); i++) {
if(wstr[i] == L'|') {
counter++;
}
}
output_file << wstr << std::endl;
}
input_file.close();
output_file.close();
std::cout << counter << std::endl;
return 0;
}
To compile my program:
g++ -o test main.cpp -std=c++17
On Windows 10 I got my expected output. I got the entire file copied to "test_output.txt" and the 129 output in the terminal.
On Linux (Debian 9) I got the terminal output 4 and the file "test_output.txt" only contains the first line and the "1|" from the second.
Here is what I tried:
My first guess was the CR LF and LF issue when using both Windows and Linux. But testing both CR LF and LF with the file did not help.
Then I thought that the "Chinese-traditional" might not work on Linux. I replaced it with "zh_TW.BIG5" but did not get the expected result either.
First check you have the locale for "Chinese-traditional" installed. On Linux this is zh_TW.UTF-8. You can check using locale -a. If it's not listed, install it:
sudo locale-gen zh_TW.UTF-8
sudo update-locale
(There's a list of locales here with their names on Linux and Windows.)
Then use imbue with the input and output streams to set the locale of the streams.
By default, std::wcout is synchronized to the underlying stdout C stream, which uses an ASCII mapping and displays ? in place of Unicode characters it cannot handle. If you want to print Unicode characters to the terminal, you have to turn that synchronization off. You can do that with one line and then set the locale of std::wcout:
std::ios_base::sync_with_stdio(false);
std::wcout.imbue(loc);
Amended version of your code:
#include <string>
#include <locale>
#include <fstream>
#include <iostream>
int main(int argc, char* argv[])
{
auto loc = std::locale("zh_TW.utf8");
//Disable synchronisation with stdio & set locale
std::ios::sync_with_stdio(false);
std::wcout.imbue(loc);
//Set locale of input stream
std::wstring wstr;
std::wifstream input_file("test.txt");
input_file.imbue(loc);
//Set locale of output stream
std::wofstream output_file("test_output.txt");
output_file.imbue(loc);
int counter = 0;
while(std::getline(input_file, wstr)) {
for(int i = 0; i < wstr.size(); i++) {
if(wstr[i] == L'|') {
counter++;
}
}
std::wcout << wstr << std::endl;
output_file << wstr << std::endl;
}
input_file.close();
output_file.close();
std::wcout << counter << std::endl;
return 0;
}
setlocale affects the locale of your program.
It has no effect on the default encoding of the text displayed by the terminal window. The terminal window is an independent application, with its own locale.
Pretty much all modern Linux distributions default to UTF-8 as the encoding for the system console and the terminal windows (gnome-terminal, Konsole, xfce4-terminal, etc...).
Changing your program's locale only affects how your application interprets text, but the terminal still expects your application to produce UTF-8 output. The terminal window has no knowledge of the internal locale of the application running in the terminal window. Terminal windows expect applications to produce output using the system locale's character encoding.
It is theoretically possible for the C library to know the default system encoding and silently transcode all the output, however it does not work this way.
On Linux you will have to transcode Big5 to UTF-8 yourself, for example with the iconv library.
A cheap shortcut would be for your program to fork and run the iconv command line tool as a child process, pipe its output to it, and let iconv do the transcoding on the fly.
Use std::wcout to print a std::wstring instead of std::cout :-)

How to change Locale in MinGW

It seems like only the "C" locale is working with MinGW. I tried the example found here and no commas were added even though the system locale is set to Canada.
#include <iostream>
#include <locale>
int main()
{
std::wcout << "User-preferred locale setting is " << std::locale("").name().c_str() << '\n';
// on startup, the global locale is the "C" locale
std::wcout << 1000.01 << '\n';
// replace the C++ global locale as well as the C locale with the user-preferred locale
std::locale::global(std::locale(""));
// use the new global locale for future wide character output
std::wcout.imbue(std::locale());
// output the same number again
std::wcout << 1000.01 << '\n';
}
The output is
User-preferred locale setting is C
1000.01
1000.01
I tried std::locale("en-CA") and get locale::facet::_S_create_c_locale name not valid at run time. I compile from CMD using g++. I'm running 64bit Windows 10.
Also I tried compiling this program found in the accepted answer here and got the compiler error 'LOCALE_ALL' was not declared in this scope.
How can I set the locale in MinGW to the system default or something explicit?

`std::wcout << L"\u25a0" << std::endl;` outputs nothing, and anything <<'d to wcout thereafter also outputs nothing [duplicate]

Consider the following code snippet, compiled as a Console Application on MS Visual Studio 2010/2012 and executed on Win7:
#include "stdafx.h"
#include <iostream>
#include <string>
const std::wstring test = L"hello\xf021test!";
int _tmain(int argc, _TCHAR* argv[])
{
std::wcout << test << std::endl;
std::wcout << L"This doesn't print either" << std::endl;
return 0;
}
The first wcout statement outputs "hello" (instead of something like "hello?test!")
The second wcout statement outputs nothing.
It's as if 0xf021 (and other?) Unicode characters cause wcout to fail.
This particular Unicode character, 0xf021 (encoded as UTF-16), is part of the "Private Use Area" in the Basic Multilingual Plane. I've noticed that Windows Console applications do not have extensive support for Unicode characters, but typically each character is at least represented by a default character (e.g. "?"), even if there is no support for rendering a particular glyph.
What is causing the wcout stream to choke? Is there a way to reset it after it enters this state?
wcout, or to be precise, a wfilebuf instance it uses internally, converts wide characters to narrow characters, then writes those to the file (in your case, to stdout). The conversion is performed by the codecvt facet in the stream's locale; by default, that just does wctomb_s, converting to the system default ANSI codepage, aka CP_ACP.
Apparently, character '\xf021' is not representable in the default codepage configured on your system. So the conversion fails, and failbit is set in the stream. Once failbit is set, all subsequent calls fail immediately.
I do not know of any way to get wcout to successfully print arbitrary Unicode characters to console. wprintf works though, with a little tweak:
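#include <fcntl.h>
#include <io.h>
#include <stdio.h>
#include <tchar.h>
#include <string>
const std::wstring test = L"hello\xf021test!";
int _tmain(int argc, _TCHAR* argv[])
{
// Switch stdout to UTF-16 translation mode before the first write.
_setmode(_fileno(stdout), _O_U16TEXT);
// Pass the text as an argument rather than as the format string.
wprintf(L"%ls", test.c_str());
return 0;
}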
Setting the mode for stdout to _O_U16TEXT will allow you to write Unicode characters to the wcout stream as well as wprintf. (See Conventional wisdom is retarded, aka What the ##%&* is _O_U16TEXT?) This is the right way to make this work.
_setmode(_fileno(stdout), _O_U16TEXT);
std::wcout << L"hello\xf021test!" << std::endl;
std::wcout << L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd" << std::endl;
std::wcout << L"Now this prints!" << std::endl;
It shouldn't be necessary anymore but you can reset a stream that has entered an error state by calling clear:
if (std::wcout.fail())
{
std::wcout.clear();
}

Output unicode symbol π and ≈ in c++ win32 console application

I am fairly new to programming, but it seems like the π(pi) symbol is not in the standard set of outputs that ASCII handles.
I am wondering if there is a way to get the console to output the π symbol, so as to express exact answers regarding certain mathematical formulas.
I'm not really sure about any other methods (such as those that use the STL) but you can do this with Win32 using WriteConsoleW:
HANDLE hConsoleOutput = GetStdHandle(STD_OUTPUT_HANDLE);
LPCWSTR lpPiString = L"\u03C0";
DWORD dwNumberOfCharsWritten;
WriteConsoleW(hConsoleOutput, lpPiString, 1, &dwNumberOfCharsWritten, NULL);
The Microsoft CRT is not very Unicode-savvy, so it may be necessary to bypass it and use WriteConsole() directly. I'm assuming you already compile for Unicode; otherwise you need to call WriteConsoleW() explicitly.
I'm in the learning phase of this, so correct me if I get something wrong.
It seems like this is a three-step process:
Use wide versions of cout, cin, string and so on. So: wcout, wcin, wstring
Before using a stream, set it to a Unicode-friendly mode.
Configure the targeted console to use a Unicode-capable font.
You should now be able to rock those funky åäös.
Example:
#include <iostream>
#include <string>
#include <io.h>
#include <cstdio>  // stdout, stdin
#include <fcntl.h> // _O_WTEXT and the other translation-mode constants
// _O_U8TEXT (UTF-8 without BOM) may be useful if we want UTF-8 instead.
int main()
{
// To be able to write UTF-16 to stdout.
_setmode(_fileno(stdout), _O_WTEXT);
// To be able to read UTF-16 from stdin.
_setmode(_fileno(stdin), _O_WTEXT);
const wchar_t* hallå = L"Hallå, värld!";
std::wcout << hallå << std::endl;
// It's all Greek to me. Go UU!
std::wstring etabetapi = L"η β π";
std::wcout << etabetapi << std::endl;
std::wstring myInput;
std::wcin >> myInput;
std::wcout << myInput << L" has " << myInput.length() << L" characters." << std::endl;
// This character won't show using Consolas or Lucida Console
std::wcout << L"♔" << std::endl;
}

How to get current locale of my environment?

I tried the following code on Linux, but it always returns 'C' under different LANG settings.
#include <iostream>
#include <locale.h>
#include <locale>
using namespace std;
int main()
{
cout<<"locale 1: "<<setlocale(LC_ALL, NULL)<<endl;
cout<<"locale 2: "<<setlocale(LC_CTYPE, NULL)<<endl;
locale l;
cout<<"locale 3: "<<l.name()<<endl;
}
$ ./a.out
locale 1: C
locale 2: C
locale 3: C
$
$ export LANG=zh_CN.UTF-8
$ ./a.out
locale 1: C
locale 2: C
locale 3: C
What should I do to get the current locale setting on Linux (e.g. Ubuntu)?
Another question: is the locale obtained the same way on Windows?
From man 3 setlocale (New maxim: "When in doubt, read the entire manpage."):
If locale is "", each part of the locale that should be modified is set according to the environment variables.
So, we can read the environment variables by calling setlocale at the beginning of the program, as follows:
#include <iostream>
#include <locale.h>
using namespace std;
int main()
{
setlocale(LC_ALL, "");
cout << "LC_ALL: " << setlocale(LC_ALL, NULL) << endl;
cout << "LC_CTYPE: " << setlocale(LC_CTYPE, NULL) << endl;
return 0;
}
My system does not support the zh_CN locale, as the following output reveals:
$ ./a.out
LC_ALL: en_US.utf8
LC_CTYPE: en_US.utf8
$ export LANG=zh_CN.UTF-8
$ ./a.out
LC_ALL: C
LC_CTYPE: C
Windows: I have no idea about Windows locales. I suggest starting with an MSDN search, and then opening a separate Stack Overflow question if you still have questions.
Just figured out how to get the locale in C++: simply construct std::locale from an empty string "", which reads the same environment settings as setlocale(LC_ALL, "").
locale l("");
cout<<"Locale by C++: "<<l.name()<<endl;
This link describes in detail the differences between the C locale and the C++ locale.
For Windows use the following code:
LCID lcid = GetThreadLocale();
wchar_t name[LOCALE_NAME_MAX_LENGTH];
if (LCIDToLocaleName(lcid, name, LOCALE_NAME_MAX_LENGTH, 0) == 0)
error(GetLastError());
std::wcout << L"Locale name = " << name << std::endl;
This is going to print something like "en-US".
To purge sublanguage information use the following:
wchar_t parentLocateName[LOCALE_NAME_MAX_LENGTH];
if (GetLocaleInfoEx(name, LOCALE_SPARENT, parentLocateName, LOCALE_NAME_MAX_LENGTH) == 0)
error(GetLastError());
std::wcout << L"parentLocateName = " << parentLocateName << std::endl;
This will give you just "en".
A good alternative to consider instead of std::locale is boost::locale which is capable of returning more reliable information - see http://www.boost.org/doc/libs/1_52_0/libs/locale/doc/html/locale_information.html
boost::locale::info has the following member functions:
std::string name() -- the full name of the locale, for example en_US.UTF-8
std::string language() -- the ISO-639 language code of the current locale, for example "en".
std::string country() -- the ISO-3166 country code of the current locale, for example "US".
std::string variant() -- the variant of current locale, for example "euro".
std::string encoding() -- the encoding used for char based strings, for example "UTF-8"
bool utf8() -- a fast way to check whether the encoding is UTF-8.
The default constructor of std::locale creates a copy of the global C++ locale.
So to get the name of the current locale:
std::cout << std::locale().name() << '\n';