stdout and stderr character encoding - c++

I'm working on a C++ string library that has four main classes dealing with ASCII, UTF-8, UTF-16 and UTF-32 strings. Every class has a Print function that formats an input string and prints the result to stdout or stderr. My problem is that I don't know the default character encoding of those streams.
For now my classes work on Windows; later I'll add support for Mac and Linux, so if you know anything about those streams' encoding on those systems, I'd appreciate it.
So my question is: what is the default encoding for stdout and stderr, can I change that encoding later, and if so, what happens to data already written to them?
Thank you.

stdout and stderr use the "C" locale. The "C" locale is neutral, and on most systems it is translated into the current user's locale. You can force the program to use a specific locale with the setlocale function:
// Set all categories; returns e.g. "English_United States.1252"
setlocale( LC_ALL, "English" );
// Set only the LC_MONETARY category; returns e.g. "French_France.1252"
setlocale( LC_MONETARY, "French" );
// Pass NULL to query the current locale without changing it
setlocale( LC_ALL, NULL );
The locale strings supported are system and compiler specific. Only "C" and "" are required to be supported.
http://www.cplusplus.com/reference/clibrary/clocale/
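As a small illustrative sketch (the locale names shown are MSVC-flavoured; glibc would report names like "en_US.UTF-8"), you can print the locale the CRT starts with and the one the user actually configured:
#include <clocale>
#include <cstdio>

int main()
{
    // At startup the CRT is in the minimal "C" locale
    printf("startup locale: %s\n", setlocale(LC_ALL, NULL));
    // "" switches to the user's preferred locale and returns its name
    printf("user locale:    %s\n", setlocale(LC_ALL, ""));
}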

You might take a look at this SO answer (most upvoted answer).
This is not exactly your question, but it is surely related and gives a lot of useful information.
I'm no expert here, but I guess we can assume you should use std::cout whenever you use std::string and std::wcout whenever you use std::wstring.
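A trivial sketch of that rule of thumb (this only matches character widths; encodings are a separate issue, as discussed above, and a real program should stick to one orientation of stdout rather than mixing cout and wcout):
#include <iostream>
#include <string>

int main()
{
    std::wstring wide = L"wide text";
    std::wcout << wide << L'\n';      // wchar_t stream for std::wstring
    // With a std::string you would instead write:
    //   std::string narrow = "narrow text";
    //   std::cout << narrow << '\n';
}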

Related

What exactly is the expected format of the locale in setlocale()?

If I use fputws without setlocale, only ASCII letters get printed. It seems that setlocale is necessary, and according to this site, both setlocale(LC_CTYPE, "UTF-8") and setlocale(LC_CTYPE, "en_us.UTF-8") should work.
But when I tried it, setlocale(LC_CTYPE, "UTF-8") did not work, and it printed "Hello ". Only with setlocale(LC_CTYPE, "en_us.UTF-8") do I get "Hello ねこ 2". I read the comment on that page, and he also said "UTF-8" did not work (he used GCC, I used VC++). I read some manual pages, but they did not specifically say what the format should be. On cppreference, the example uses only formats like "en_US.UTF-8", but on Tutorials Point, the example uses formats like "en_GB".
Is setlocale(LC_CTYPE, "UTF-8") a correct call?
setlocale(LC_CTYPE, "UTF-8");
//setlocale(LC_CTYPE, "en_us.UTF-8");
FILE* output = _wfopen(L"output.txt", L"w");
fputws(L"Hello ねこ 2", output);
fclose(output);
The answer is system specific. To quote the cppreference page you linked to:
system-specific locale identifier. Can be "" for the user-preferred locale or "C" for the minimal locale
I think "UTF-8" is an invalid locale name generally (it needs the language identifier at minimum, not 100% on that though). However generally you will want to simply use "" in order to default to the user preference (which, I would hope, is correct for displaying their input).

How can I use std::imbue to set the locale for std::wcout?

I am trying to use the std::locale mechanism in C++11 to count words in different languages. Specifically, I have std::wstringstream which contains the title of a famous Russian novel ("Crime and Punishment" in English). What I want to do is to use the appropriate locale (ru_RU.utf8 on my Linux machine) to read the stringstream, count the words and print the results. I should also probably note that my system is set to use the en_US.utf8 locale.
The desired result is this:
0: "Преступление"
1: "и"
2: "наказание"
I counted 3 words.
and the last word was "наказание"
That all works when I set the global locale, but not when I attempt to imbue the wcout stream. When I try that, I get this result instead:
0: "????????????"
1: "?"
2: "?????????"
I counted 3 words.
and the last word was "?????????"
Also, when I attempt to use a solution suggested in the comments (which can be activated by changing #define USE_CODECVT 0 to #define USE_CODECVT 1), I get the error mentioned in this other question.
Those interested in experimenting with the code, or with compiler settings or both may wish to use this live code.
My questions
Why does that not work? Is it because wcout is already open?
Is there way to use imbue rather than setting the global locale to do what I want?
If it makes a difference, I'm using g++ 4.8.3. The full code is shown below.
getwords.cpp
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <locale>

#define USE_CODECVT 0
#define USE_IMBUE 1

#if USE_CODECVT
#include <codecvt>
#endif

using namespace std;

int main()
{
#if USE_CODECVT
    locale ru("ru_RU.utf8",
              new codecvt_utf8<wchar_t, 0x10ffff, consume_header>{});
#else
    locale ru("ru_RU.utf8");
#endif
#if USE_IMBUE
    wcout.imbue(ru);
#else
    locale::global(ru);
#endif
    wstringstream in{L"Преступление и наказание"};
    in.imbue(ru);
    wstring word;
    unsigned wordcount = 0;
    while (in >> word) {
        wcout << wordcount << ": \"" << word << "\"\n";
        ++wordcount;
    }
    wcout << "\nI counted " << wordcount << " words.\n"
          << "and the last word was \"" << word << "\"\n";
}
First I did some more tests using your code, and I can confirm that L"Преступление и наказание" is a correct UTF-16 string. I checked the codes of the individual characters, and they are correctly 0x41f, 0x440, 0x435, 0x441, 0x442, 0x443, 0x43f, 0x43b, 0x435, 0x43d, 0x438, 0x435, 0x20, 0x438, 0x20, 0x43d, 0x430, 0x43a, 0x430, 0x437, 0x430, 0x43d, 0x438, 0x435.
I could not find any reference about it, but it looks like simply calling imbue is not enough. imbue is a method from basic_ios, which is an ancestor of cout and wcout. It does act on numeric conversions, but in all my tests it had no effect on the charset used for output.
By default, the locale used in a C++ (or C) program is ... the C locale, which knows nothing about Unicode. All printable ASCII characters (below 128) are output as is, and others are replaced with a ?. That is exactly what your program does.
To make it work correctly, you have to select a locale that knows about unicode characters with setlocale. Once this is done, you can change the numeric conversion by calling imbue, and as you selected a unicode charset all will be fine.
So provided your current locale uses a UTF-8 charset, you only have to add
setlocale(LC_ALL, "");
as the first line of your program, and the output will be as expected:
0: "Преступление"
1: "и"
2: "наказание"
I counted 3 words.
and the last word was "наказание"
If your current locale does not use UTF-8, choose one that is installed on your system and that supports it. I used setlocale(LC_ALL, "fr_FR.UTF-8");, or even setlocale(LC_ALL, "en_US.UTF-8");, and both worked.
Edit :
In fact, the best way to correctly output Unicode to the screen is to use setlocale(LC_ALL, "");. It automatically adapts to the current charset. I tested with a stripped-down variant using the Latin-1 charset (my system natively speaks French, not Russian ...):
#include <iostream>
#include <locale>
using namespace std;

int main() {
    setlocale(LC_ALL, "");
    wchar_t ws[] = { 0xe8, 0xe9, 0 };
    wcout << ws << endl;
}
I tried it under Linux using the UTF-8 and ISO-8859-1 (Latin-1) charsets (resp. export LANG=fr_FR.UTF-8 and export LANG=fr_FR.ISO-8859-1) and correctly got èé in the proper charset. I also tried it under Windows XP, with code page 850 (OEM) and 1252 (ANSI) (resp. chcp 850 and chcp 1252, with the Lucida Console font), and got èé on the console too.
Edit 2 :
Of course, you can also set a global C++ locale with locale::global(locale("")); for the default locale, or locale::global(locale("ru_RU.UTF-8")); for the Russian locale, but it is more than simply calling setlocale. According to the documentation of the GNU implementation of the C++ Standard Library about locales, there is only one relation (of the C++ locale mechanism) to the C locale mechanism: the global C locale is modified if a named C++ locale object is set as the global locale; that is, std::locale::global(std::locale("")); affects the C functions as if the following call was made: std::setlocale(LC_ALL, "");. On the other hand, there is no vice versa: calling setlocale has no effect whatsoever on the C++ locale mechanism, in particular on the working of locale("").
So it really looks like there is an underlying C library mechanism that must first be enabled with setlocale for the imbue conversion to work correctly.
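A small sketch of that relationship (assumes a named UTF-8 locale is installed; only tested in my head against the glibc documentation quoted above): setting a named global C++ locale also updates the C locale, but not the other way round.
#include <clocale>
#include <cstdio>
#include <locale>

int main()
{
    std::printf("before: %s\n", std::setlocale(LC_ALL, NULL)); // "C"
    std::locale::global(std::locale(""));                      // named C++ locale becomes global
    std::printf("after:  %s\n", std::setlocale(LC_ALL, NULL)); // now e.g. "en_US.UTF-8"
}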
In this answer, I'm taking the questions in reverse order, and adding another (with answer) that came up along the way.
Is there way to use imbue rather than setting the global locale to do what I want?
Yes. By default, std::wcout is synchronized with the underlying stdout C stream. So std::wcout can use imbue if that synchronization is turned off, allowing the C++ stream to operate independently. So to modify the original code to use imbue and work as intended, only a single line needs to be added, calling std::ios_base::sync_with_stdio:
std::ios_base::sync_with_stdio(false);
std::wcout.imbue(ru);
Why didn't the original version work?
The standard (I'm referring to INCITS/ISO/IEC 14882-2011[2012]) says very little about the tie to the underlying stdio stream, but in 27.4.3 it says
The object wcout controls output to a stream buffer associated with the object stdout, declared in <cstdio>
Further, without explicitly setting a global locale, the locale is the "C" locale which is US English ASCII, so this appears to imply that stdout will, by default, have an ASCII mapping. Since no Cyrillic characters are represented in ASCII, the underlying stdout is what converts the proper Russian into a series of ? characters.
Why must the sync_with_stdio call precede imbue?
According to 27.5.3.4 of the standard:
If any input or output operation has occurred using the standard streams prior to the call,
the effect is implementation-defined. Otherwise, called with a false argument, it allows the standard streams to operate independently of the standard C streams.
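Putting those pieces together, a minimal sketch of the imbue-only approach might look like this (assuming the ru_RU.utf8 locale is installed and the terminal expects UTF-8):
#include <iostream>
#include <locale>

int main()
{
    // Must come before any standard I/O and before the imbue
    std::ios_base::sync_with_stdio(false);
    std::locale ru("ru_RU.utf8");
    std::wcout.imbue(ru);
    std::wcout << L"Преступление и наказание\n";
}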
I don't know what languages you're planning on supporting, but there are languages where your algorithm doesn't apply, e.g. Japanese. I suggest checking out the word iterators in International Components for Unicode. http://userguide.icu-project.org/boundaryanalysis
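As a rough, untested sketch of what that looks like with ICU4C (assumes ICU headers and libraries are installed and linked, and a UTF-8 source encoding; the sample sentence is only an illustration):
#include <unicode/brkiter.h>
#include <unicode/unistr.h>
#include <iostream>
#include <memory>

int main()
{
    UErrorCode status = U_ZERO_ERROR;
    icu::UnicodeString text = icu::UnicodeString::fromUTF8("吾輩は猫である");
    std::unique_ptr<icu::BreakIterator> it(
        icu::BreakIterator::createWordInstance(icu::Locale::getJapanese(), status));
    if (U_FAILURE(status))
        return 1;
    it->setText(text);
    // Count the segments between word boundaries (punctuation/space runs count too)
    int segments = 0;
    for (int32_t end = it->next(); end != icu::BreakIterator::DONE; end = it->next())
        ++segments;
    std::cout << "segments: " << segments << "\n";
}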

printing Unicode characters C++

I'm trying to write a simple command line app to teach myself Japanese, but can't seem to get Unicode characters to print. What am I missing?
#include <iostream>
#include <cstdlib>
using namespace std;

int main()
{
    wcout << L"こんにちは世界\n";
    wcout << L"Hello World\n";
    system("pause");
}
In this example only "Press any key to continue" is displayed. Tested on Visual C++ 2013.
This is not easy on Windows. Even when you manage to get the text to the Windows console you still need to configure cmd.exe to be able to display Japanese characters.
#include <iostream>

int main() {
    std::cout << "こんにちは世界\n";
}
This works fine on any system where:
The compiler's source and execution encodings include the characters.
The output device (e.g., the console) expects text in the same encoding as the compiler's execution encoding.
A font with the appropriate characters is available (usually not a problem).
Most platforms these days use UTF-8 by default for all these encodings and so can support the entire Unicode range with code similar to the above. Unfortunately Windows is not one of these platforms.
wcout << L"こんにちは世界\n";
In this line the string literal data is (at compile time) converted from the source encoding to the execution wide encoding and then (at run time) wcout uses the locale it is imbued with to convert the wchar_t data to char data for output. Where things go wrong is that the default locale is only required to support characters from the basic source character set, which doesn't even include all ASCII characters, let alone non-ASCII characters.
So the conversion results in an error, putting wcout into a bad state. The error has to be cleared before wcout will function again, which is why the second print statement does not output anything.
You can work around this for a limited range of characters by imbuing wcout with a locale that will successfully convert the characters. Unfortunately, the encoding needed to support the entire Unicode range this way is UTF-8; although Microsoft's implementation of streams supports other multibyte encodings, it very specifically does not support UTF-8.
For example:
// Requires <locale>, <codecvt> and <windows.h> (for SetConsoleOutputCP)
wcout.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8_utf16<wchar_t>()));
SetConsoleOutputCP(CP_UTF8);
wcout << L"こんにちは世界\n";
Here wcout will correctly convert the string to UTF-8, and if the output were written to a file instead of the console then the file would contain the correct UTF-8 data. However the Windows console, even though configured here to accept UTF-8 data, simply will not accept UTF-8 data written in this way.
There are a few options:
Avoid the standard library entirely:
DWORD n;
// 8 = number of wchar_t units in the literal, including the trailing L'\n'
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), L"こんにちは世界\n", 8, &n, nullptr);
Use non-standard magical incantation that will break standard code:
#include <fcntl.h>
#include <io.h>
_setmode(_fileno(stdout), _O_U8TEXT);
std::wcout << L"こんにちは世界\n";
After setting this mode std::cout << "Hello, World"; will crash.
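One way to limit the damage is to restore the previous mode as soon as the wide output is done; a sketch only, and not guaranteed across CRT versions:
#include <fcntl.h>
#include <io.h>
#include <cstdio>
#include <iostream>

int main()
{
    // Switch stdout to a Unicode text mode just for the wide output...
    int prev = _setmode(_fileno(stdout), _O_U8TEXT);
    std::wcout << L"こんにちは世界\n";
    std::wcout.flush();
    // ...then restore the previous mode before any narrow output
    _setmode(_fileno(stdout), prev);
}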
Use a low level IO API along with manual conversion:
#include <codecvt>
#include <locale>
SetConsoleOutputCP(CP_UTF8);
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
std::puts(convert.to_bytes(L"こんにちは世界\n").c_str()); // to_bytes returns std::string; note puts appends its own newline
Using any of these methods, cmd.exe will display the correct text to the best of its ability, by which I mean it will display unreadable boxes. Seven little boxes, for the given string.
You can copy the text out of cmd.exe and into notepad.exe or whatever to see the correct glyphs.
There's a whole series of articles about dealing with Unicode in the Windows console:
http://alfps.wordpress.com/2011/11/22/unicode-part-1-windows-console-io-approaches/
http://alfps.wordpress.com/2011/12/08/unicode-part-2-utf-8-stream-mode/
Basically, you can implement your own streambuf for std::cout (or std::wcout) in terms of WriteConsoleW and enjoy writing UTF-8 (or whatever Unicode you want) to the Windows console without depending on locales or console code pages, and even without using wide characters.
It may not look very straightforward, but it's a convenient and reusable solution, which is also able to give you portable utf8-everywhere style user code. Please, don't beat me for my English :)
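For a flavour of the idea, here is a lightly sketched, simplified wide-character variant (not the articles' UTF-8 narrow-stream version): a std::wstreambuf that hands everything straight to WriteConsoleW.
#include <windows.h>
#include <iostream>
#include <streambuf>

class ConsoleWBuf : public std::wstreambuf {
    HANDLE h_;
protected:
    std::streamsize xsputn(const wchar_t* s, std::streamsize n) override {
        DWORD written = 0;
        WriteConsoleW(h_, s, static_cast<DWORD>(n), &written, nullptr);
        return written;
    }
    int_type overflow(int_type ch) override {
        if (!traits_type::eq_int_type(ch, traits_type::eof())) {
            wchar_t wc = traits_type::to_char_type(ch);
            DWORD written = 0;
            WriteConsoleW(h_, &wc, 1, &written, nullptr);
        }
        return traits_type::not_eof(ch);
    }
public:
    ConsoleWBuf() : h_(GetStdHandle(STD_OUTPUT_HANDLE)) {}
};

int main()
{
    ConsoleWBuf buf;
    std::wstreambuf* old = std::wcout.rdbuf(&buf);   // redirect wcout to the console buffer
    std::wcout << L"こんにちは世界\n";
    std::wcout.rdbuf(old);                           // restore the original buffer
}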
Or you can change Windows locale to Japanese.

Why printf can display non-ASCII characters when "C" locale is used?

Note: I'm asking about implementation-defined behaviour, observed on Microsoft Visual C++ 2008 (possibly the same on 2005+). OS: a Simplified Chinese installation of Windows 7.
It surprised me when I was performing non-ASCII I/O with printf. E.g.
// This won't be necessary as it's the system default code page.
//system("chcp 936");
// NULL to show current locale, which is "C"
printf ("%s\n", setlocale(LC_ALL, NULL));
printf ("中\n");
printf ("%s\n", setlocale(LC_ALL, "English"));
printf ("中\n");
Output:
Active code page: 936
C
中
English_United States.1252
?D
The memory view in the debugger shows that "中" is encoded as two bytes, 0xD6 0xD0, which is the encoding of that character in code page 936, for Simplified Chinese. It shouldn't be in the character range of the "C" locale, which, most likely, is 0x00 ~ 0x7F.
Question:
Why can it still display the character correctly in the "C" locale? So I guessed that the locale had no bearing on printf. But then, why can't it display the character anymore after changing to the "English" locale, which is also different from 936?
Edit:
I redirected the standard output to a file and ran some tests. They show that whatever locale is set, the correct character "中" is saved in the file. This suggests that setlocale() is connected to the way the console displays the character, which contradicts my understanding of how it works: printf puts the bytes/code points into the console's input buffer, and the console interprets those bytes using its own code page (what chcp returns).
936 is a rather tricky code page: it allows two-byte characters (similar to what UTF-8 does). Cyrillic (866), for example, doesn't allow two-byte characters, and its behaviour would be the same as "English".
So when you use the default (936) code page, it knows how to process two-byte characters, while "English" deals with 0x00 ~ 0x7F only.
Let me also answer why wprintf(L"中") fails. There is a big difference between console applications and windowed Windows applications: they use different code pages.
The following table matches console (DOS) code pages with their Windows counterparts:
DOS | Windows
------+----------
850 | 1252
936 | 54936
866 | 1251
So if you would like to see the correct symbols in the console, use WideCharToMultiByte first; that provides the conversion the console expects when it is working with code page 936.
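A hedged sketch of that suggestion (Windows-only, untested here): convert the wide string to the console's current code page before writing it with printf.
#include <windows.h>
#include <cstdio>
#include <vector>

int main()
{
    const wchar_t* wide = L"中\n";
    UINT cp = GetConsoleOutputCP();   // e.g. 936 on a Simplified Chinese system
    int len = WideCharToMultiByte(cp, 0, wide, -1, NULL, 0, NULL, NULL);
    std::vector<char> narrow(len);
    WideCharToMultiByte(cp, 0, wide, -1, narrow.data(), len, NULL, NULL);
    std::printf("%s", narrow.data());
}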
The fact that the C locale prints out the string exactly as given is not surprising. That's what I would expect. What is surprising is that the English locale would do something different.
According to the locale documentation on MSDN, the only effect that the locale should have on printf is in determining the radix character for numeric values (i.e. the decimal point).
I suspect perhaps that it's a bug in Microsoft's Compiler. Or at the very least it's undocumented behaviour.
For what it's worth, on my compiler (Borland) the locale has no effect on the output of those strings. It does affect the radix, though.
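A quick illustration of that radix effect (locale names vary by platform: "French" is MSVC-style, glibc would want something like "fr_FR.UTF-8"; if the name isn't available, setlocale returns NULL and nothing changes):
#include <clocale>
#include <cstdio>

int main()
{
    printf("%f\n", 3.14);               // "C" locale: 3.140000
    setlocale(LC_NUMERIC, "French");    // switch only the numeric conventions
    printf("%f\n", 3.14);               // 3,140000 with a comma radix (if the locale was found)
}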
OK. For the default "C" locale, the CRT assumes that characters passed to printf don't need any conversion. That is reasonable, because ASCII characters almost always fall into the basic character set of the execution environment (shared among different Windows code pages). When switched to "English", it assumes the input is encoded in code page 1252, and thus tries to perform a conversion from "English" to "Chinese", which is the locale used by the console. But the CRT just cannot find the character 中 in code page 1252. That's why it outputs a question mark.
When the output is redirected to a file, the CRT knows it and won't do the conversion, because the console code page is no longer in use. It just passes the bytes through as-is. How those bytes are interpreted is up to the program you open the file with (e.g., whether it cares about a BOM or not).
Refer to this MSDN forum link: Why printf can display non-ASCII characters when “C” locale is used?

stupid bug with fscanf - comma instead of point

I have a stupid bug in one of the C++ source files of a project. In this part of the source I do I/O operations, and the bug appears where I print the value read by fscanf. The code is below.
Firstly, I don't read the correct value, and when I print a float value I get a decimal number with a comma ',' instead of a point '.' between the integer part and the fractional part.
FILE* file3;
file3=fopen("test.dat","r");
float test1;
fscanf(file3," %f ",&test1);
printf("here %f\n",test1);
float test3 = 1.2345;
printf("here %f\n",test3);
fclose(file3);
where the test.dat file contains "1.1234", and at execution I get:
here 1,000000
here 1,234500
So I wrote a simple test C program, compiled with g++:
#include <stdio.h>
#include <stdlib.h>

int main()
{
    FILE* file3;
    float test3;
    file3 = fopen("test.dat", "r");
    fscanf(file3, "%f", &test3);
    printf("here %f\n", test3);
    fclose(file3);
}
and it gives:
here 1.123400
This is the first time I've had a bug like this. Can anyone see what's wrong?
Is your C++ locale somehow set to use a European convention? They use commas where we use points, and points as thousands separators.
Have a look at the settings of these environment variables:
LANG
LC_CTYPE
LC_ALL
Try setting en_GB or en_US. Having established that it is a locale problem, next decide what behaviour makes sense. Is displaying 1224,45 a bug at all? The user has their locale set for a reason.
You're using code that uses the locale set for the program's environment. In some locales, such as French-speaking locales, the comma is the decimal separator. So this code is doing what its locale is presumably telling it to do.
In your simple test program, you have not initialised locale support, so this does not happen.
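If the goal is simply to always parse and print numbers with a '.', one possible fix (a sketch, not necessarily what the project should do) is to force the "C" conventions for the numeric category only, leaving the rest of the user's locale intact:
#include <clocale>
#include <cstdio>

int main()
{
    setlocale(LC_ALL, "");        // whatever locale setup the application already does
    setlocale(LC_NUMERIC, "C");   // '.' as decimal point for scanf/printf
    FILE* file3 = fopen("test.dat", "r");
    if (!file3)
        return 1;
    float test1 = 0.0f;
    fscanf(file3, "%f", &test1);
    printf("here %f\n", test1);   // prints "here 1.123400"
    fclose(file3);
}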
Assuming a Unix-like environment, what is the value of the environment variable LANG, and the various LC_* environment variables?
env | grep -e ^LANG -e ^LC_
For some background reading, try some of the GNU Libc manual (Locales and Internationalisation)
http://www.gnu.org/software/libc/manual/html_node/Locales.html#Locales
My guess is that the application is setting the locale to the user's preference, with std::locale::global( std::locale( "" ) ), which is what console applications should always do; they should also imbue std::cin, std::cout and std::cerr with this locale. Most linguistic communities use the comma for the decimal, rather than the point. (For daemon processes, it is often more appropriate to use the "C" locale, or the "Posix" locale under Unix, regardless of where the server is actually running. Since the "C" locale is the default, doing nothing is generally fine for daemons and servers.)
The global locale affects all C style input and output, which is yet another reason for using C++'s iostreams instead. With std::ifstream, just imbue the stream with the locale in which the file was written before doing your first input. For machine-generated files, the "C" locale is the usual default, so your code would become something like:
std::ifstream file3( "test.dat" );
if ( ! file3.is_open() ) {
    // error handling...
}
file3.imbue( std::locale( "C" ) );
float test1;
file3 >> test1;
// ...
For output to the terminal, expect the locale conventions to be followed, and set the environment variables to specify the locale you want to see.