Why can std::wcout not output all UCS-2 characters? - c++

#include <iostream>
#include <locale>
using namespace std;
int main()
{
wcout.imbue(/* What to place here? */);
for (wchar_t c = 0; c <= 0xFFFF; c++)
{
if (IsHumanReadable(c))
{
wcout << c; // c may be a Chinese or Arabic character.
}
}
}
My machine is Windows 7, which is unicode-based.
The code above doesn't output any Arabic characters, whereas the same characters are shown correctly in the source file, which proves my machine supports displaying Arabic characters.
Why are Arabic characters not displayed in the console window?

The first line should be
_setmode(_fileno(stdout), _O_WTEXT);
which is the Windows equivalent of the appropriate imbue() for wide character output (standard C++ still doesn't support Unicode output beyond what C++11 made a requirement).
_setmode is declared in <io.h>, _fileno in <stdio.h>, and _O_WTEXT in <fcntl.h>; check MSDN for details.
Also note that the font installed in the console window often lacks many glyphs that other Windows programs can display. When in doubt, redirect program output to a file and open that file with Wordpad etc.
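A minimal sketch of that first line with its headers, wrapped so the same source still compiles elsewhere. The function name use_wide_stdout is hypothetical, and it uses _O_U16TEXT (as other answers on this page do) rather than _O_WTEXT; on non-Windows platforms it is simply a no-op that reports false.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <fcntl.h>  // _O_U16TEXT
#include <io.h>     // _setmode
#endif

// Hypothetical helper: switches stdout to UTF-16 text mode on Windows.
// Returns true if the mode was changed, false otherwise (including on
// platforms that have no _setmode).
bool use_wide_stdout() {
#ifdef _WIN32
    std::fflush(stdout);  // flush pending narrow output before switching
    return _setmode(_fileno(stdout), _O_U16TEXT) != -1;  // -1 signals failure
#else
    return false;
#endif
}
```

Once this returns true, stick to std::wcout or wprintf for the rest of the program; narrow output on the same stream is not expected to work.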

Related

Using UTF-16 for I/O with Visual Studio instead of code pages

I have this working on Visual Studio 2019 using code pages:
#include <windows.h>
#include <iostream>
int main()
{
UINT oldcp = GetConsoleOutputCP();
SetConsoleOutputCP(932); //932 = Japanese.
//1200 for little-, 1201 big-, endian UTF-16
DWORD used;
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE),L"私の犬\n", 4,&used, 0);
std::cout << "Hit enter to end."; std::cin.get();
SetConsoleOutputCP(oldcp);
return 0;
}
But I am seeing from Microsoft that I should not be using code pages except to interface with legacy code -- use UTF-16 instead. I can find code pages for UTF-16 (little endian or big endian), but using them doesn't work and it's still using code pages.
So what can I use that accomplishes what my program does, but is up-to-date?
Set stdin and stdout to wide mode in Windows and use wcout and wcin with wide strings. You'll also need a console font that supports the characters and an IME to type them, both of which you can get by installing the appropriate language support. Setting a code page switches the font for you automatically, but the characters are output correctly even in the "wrong" code page as long as you select a font that supports them.
#include <iostream>
#include <string>
#include <io.h>
#include <fcntl.h>
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
_setmode(_fileno(stdin), _O_WTEXT);
std::wcout << L"私の犬" << std::endl;
std::wstring a;
std::wcout << L"Type a string: ";
std::getline(std::wcin, a);
std::wcout << a << std::endl;
getwchar();
}
Output (terminal using code page 437 but NSimSun font):
私の犬
Type a string: 马克
马克
Technically every character encoding is a code page. To use UTF-16 you still have to specify the UTF-16 "code page". But you also need to call _setmode first.
Output unicode strings in Windows console app
How do I print Unicode to the output console in C with Visual Studio?
_setmode(_fileno(stdout), _O_U16TEXT);
std::wcout << L"私の犬\n";
But is it up-to-date? No!!! The most reasonable way to print Unicode is to use the UTF-8 code page which will make your app cross-platform and is easier to maintain. See What is the Windows equivalent for en_US.UTF-8 locale? for details on this. Basically just
target Windows SDK v17134 or newer, or use static linking to work on older Windows versions
change the code page to UTF-8
use the -A Win32 APIs instead of -W ones if you're calling those directly (recommended by MS for portability, as everyone else was using UTF-8 for decades)
set the /execution-charset:utf-8 and/or /utf-8 flags while compiling
std::setlocale(LC_ALL, ".UTF8");
std::cout << "私の犬\n";
See also Is it possible to set "locale" of a Windows application to UTF-8?
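Whether the narrow-string route works depends on the compiler actually encoding literals as UTF-8, which is what /utf-8 or /execution-charset:utf-8 is for. A small sanity check, as a sketch (the function name narrow_literals_are_utf8 is made up here, and it only probes one character):

```cpp
#include <string>

// Returns true when the compiler encoded the narrow literal for U+2122
// (trade mark sign) as the three UTF-8 bytes E2 84 A2. With a legacy
// execution character set the literal comes out as something else.
bool narrow_literals_are_utf8() {
    const std::string tm = "\u2122";
    return tm == "\xE2\x84\xA2";
}
```

If this returns false, the bytes that std::cout sends will not match a UTF-8 console code page, whatever the locale is set to.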

Cannot wcout wide string other than English [duplicate]

I've read a bunch of articles and forums posts discussing this problem all of the solutions seem way too complicated for such a simple task.
Here's a sample code straight from cplusplus.com:
// reading a text file
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main () {
string line;
ifstream myfile ("example.txt");
if (myfile.is_open())
{
while ( myfile.good() )
{
getline (myfile,line);
cout << line << endl;
}
myfile.close();
}
else cout << "Unable to open file";
return 0;
}
It works fine as long as example.txt has only ASCII characters. Things get messy if I try to add, say, something in Russian.
In GNU/Linux it's as simple as saving the file as UTF-8.
In Windows, that doesn't work. Converting the file into UCS-2 Little Endian (what Windows seems to use by default) and changing all the functions into their wchar_t counterparts doesn't do the trick either.
Isn't there some kind of a "correct" way to get this done without doing all kinds of magic encoding conversions?
The Windows console supports unicode, sort of. It does not support right-to-left and "complex scripts". To print a UTF-16 file with Visual C++, use the following:
_setmode(_fileno(stdout), _O_U16TEXT);
And use wcout instead of cout.
There is no support for a "UTF8" code page so for UTF-8 you will have to use MultiByteToWideChar.
More on console support for unicode can be found in this blog
The right way to output to a console on Windows using cout is to first call GetConsoleOutputCP, and then convert the input you have into the console code page. Alternatively, use WriteConsoleW, passing a wchar_t*.
For reading UTF-8 or UTF-16 strings from a file, you can use the extended mode string of _wfopen_s and fgetws. I don't think there is a C++ interface for these extensions yet. The easiest way to print to the console is described in Michael Kaplan's blog:
#include <fcntl.h>
#include <io.h>
#include <stdio.h>
int main(void) {
_setmode(_fileno(stdout), _O_U16TEXT);
wprintf(L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd\n");
return 0;
}
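On the file-reading side, it helps to know which encoding the file is in before picking a read mode. A rough sketch that inspects the leading bytes for a byte order mark; the enum and function names are hypothetical, and a file without a BOM (common for UTF-8) simply reports none:

```cpp
#include <cstddef>
#include <string>

enum class bom_kind { none, utf8, utf16_le, utf16_be };

// Looks at the first bytes of a file's contents for a byte order mark.
// Note: a UTF-32 LE BOM (FF FE 00 00) also starts with FF FE and would be
// reported as utf16_le by this simple check.
bom_kind detect_bom(const std::string& bytes) {
    auto b = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF)
        return bom_kind::utf8;
    if (bytes.size() >= 2 && b(0) == 0xFF && b(1) == 0xFE)
        return bom_kind::utf16_le;
    if (bytes.size() >= 2 && b(0) == 0xFE && b(1) == 0xFF)
        return bom_kind::utf16_be;
    return bom_kind::none;
}
```

Read the first few bytes with a binary-mode std::ifstream and pass them in; what to do with the result (which reader to use) is left to the caller.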
Avoid GetConsoleOutputCP, it is only retained for compatibility with the 8-bit API.
While Windows console windows are UCS-2 based, they don't support UTF-8 properly.
You might make things work by setting the console window's active output code page to UTF-8 temporarily, using the appropriate API functions. Note that those functions distinguish between input code page and output code page. However, [cmd.exe] really doesn't like UTF-8 as active code page, so don't set that as a permanent code page.
Otherwise, you can use the Unicode console window functions.
Cheers & hth.,
#include <stdio.h>
int main (int argc, char *argv[])
{
// do chcp 65001 in the console before running this
printf ("γασσο γεο!\n");
}
Works perfectly if you do chcp 65001 in the console before running your program.
Caveats:
I'm using 64 bit Windows 7 with VC++ Express 2010
The code is in a file encoded as UTF-8 without BOM - I wrote it in a text editor, not using the VC++ IDE, then used VC++ to compile it.
The console has a TrueType font - this is important
Don't know if these things make too much difference...
Can't speak for chars off the BMP, give it a whirl and leave a comment.
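Just to be clear, some here have mentioned UTF-8. UTF-8 is a multibyte encoding of Unicode, not Unicode itself, although some documentation uses the terms interchangeably. What Windows documentation calls "Unicode" is UTF-16, which uses two-byte code units; characters outside the BMP take two of them (a surrogate pair).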
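For characters off the BMP, the number of 16-bit units and the number of characters differ. A short sketch that counts code points in a UTF-16 string (count_code_points is a name invented for this example; an unpaired surrogate is counted as one character):

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-16 string. A high surrogate
// (D800..DBFF) followed by a low surrogate (DC00..DFFF) forms one code point.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        const bool high = u >= 0xD800 && u <= 0xDBFF;
        if (high && i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            ++i;  // skip the low half of the pair
        ++n;
    }
    return n;
}
```

For example, u"\U0001F600" has size() 2 but counts as one code point here.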
I've used this previously posted solution with Visual Studio 2008. I don't know if it works with later versions of Visual Studio.
#include <iostream>
#include <fcntl.h>
#include <io.h>
#include <tchar.h>
<code omitted>
_setmode(_fileno(stdout), _O_U16TEXT);
std::wcout << _T("This is some text to print\n");
I used macros to switch between std::wcout and std::cout, and also to remove the _setmode call for ASCII builds, thus allowing compiling for either ASCII or UNICODE. This works. I have not yet tested using std::endl, but I think that might work with wcout and Unicode (not sure), i.e.
std::wcout << _T("This is some text to print") << std::endl;

printing Unicode characters C++

I'm trying to write a simple command line app to teach myself Japanese, but can't seem to get Unicode characters to print. What am I missing?
#include <iostream>
using namespace std;
int main()
{
wcout << L"こんにちは世界\n";
wcout << L"Hello World\n";
system("pause");
}
In this example only "Press any key to continue" is displayed. Tested on Visual C++ 2013.
This is not easy on Windows. Even when you manage to get the text to the Windows console you still need to configure cmd.exe to be able to display Japanese characters.
#include <iostream>
int main() {
std::cout << "こんにちは世界\n";
}
This works fine on any system where:
The compiler's source and execution encodings include the characters.
The output device (e.g., the console) expects text in the same encoding as the compiler's execution encoding.
A font with the appropriate characters is available (usually not a problem).
Most platforms these days use UTF-8 by default for all these encodings and so can support the entire Unicode range with code similar to the above. Unfortunately Windows is not one of these platforms.
wcout << L"こんにちは世界\n";
In this line the string literal data is (at compile time) converted from the source encoding to the execution wide encoding and then (at run time) wcout uses the locale it is imbued with to convert the wchar_t data to char data for output. Where things go wrong is that the default locale is only required to support characters from the basic source character set, which doesn't even include all ASCII characters, let alone non-ASCII characters.
So the conversion results in an error, putting wcout into a bad state. The error has to be cleared before wcout will function again, which is why the second print statement does not output anything.
You can work around this for a limited range of characters by imbuing wcout with a locale that will successfully convert the characters. Unfortunately the encoding that is needed to support the entire Unicode range this way is UTF-8; although Microsoft's implementation of streams supports other multibyte encodings, it very specifically does not support UTF-8.
For example:
wcout.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8_utf16<wchar_t>()));
SetConsoleOutputCP(CP_UTF8);
wcout << L"こんにちは世界\n";
Here wcout will correctly convert the string to UTF-8, and if the output were written to a file instead of the console then the file would contain the correct UTF-8 data. However the Windows console, even though configured here to accept UTF-8 data, simply will not accept UTF-8 data written in this way.
There are a few options:
Avoid the standard library entirely:
DWORD n;
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), L"こんにちは世界\n", 8, &n, nullptr);
Use non-standard magical incantation that will break standard code:
#include <fcntl.h>
#include <io.h>
_setmode(_fileno(stdout), _O_U8TEXT);
std::wcout << L"こんにちは世界\n";
After setting this mode std::cout << "Hello, World"; will crash.
Use a low level IO API along with manual conversion:
#include <codecvt>
#include <locale>
SetConsoleOutputCP(CP_UTF8);
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
std::puts(convert.to_bytes(L"こんにちは世界\n").c_str());
Using any of these methods, cmd.exe will display the correct text to the best of its ability, by which I mean it will display unreadable boxes. Seven little boxes, for the given string.
You can copy the text out of cmd.exe and into notepad.exe or whatever to see the correct glyphs.
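Since std::wstring_convert and <codecvt> were deprecated in C++17, the manual conversion step can also be written by hand. A sketch under some assumptions: the input is a std::u16string (on Windows a wchar_t string holds the same UTF-16 code units), the function names are made up for this example, and unpaired surrogates are replaced with U+FFFD, which is my choice rather than anything the answers above prescribe.

```cpp
#include <cstddef>
#include <string>

// Appends one code point to out as UTF-8 (1 to 4 bytes).
static void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts UTF-16 to UTF-8, combining surrogate pairs into one code point.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool low = cp >= 0xDC00 && cp <= 0xDFFF;
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // consumed the low surrogate as well
        } else if (high || low) {
            cp = 0xFFFD;  // unpaired surrogate: substitute replacement character
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The resulting std::string can be handed to a narrow API (std::puts, std::cout, a binary-mode std::ofstream) once the destination expects UTF-8.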
There's a whole article about dealing with Unicode in Windows console
http://alfps.wordpress.com/2011/11/22/unicode-part-1-windows-console-io-approaches/
http://alfps.wordpress.com/2011/12/08/unicode-part-2-utf-8-stream-mode/
Basically, you may implement your own streambuf for std::cout (or std::wcout) in terms of WriteConsoleW and enjoy writing UTF-8 (or whatever Unicode you want) to the Windows console without depending on locales or console code pages, and even without using wide characters.
It may not look very straightforward, but it's a convenient and reusable solution, which also gives you portable, utf8-everywhere style user code. Please, don't beat me for my English :)
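A rough outline of what such a streambuf could look like, kept portable by handing the collected bytes to a callback instead of calling WriteConsoleW directly. The class name sink_streambuf and the callback design are assumptions made for this sketch, not what the linked articles implement; on Windows the callback would convert the UTF-8 bytes with MultiByteToWideChar and pass the result to WriteConsoleW.

```cpp
#include <cstddef>
#include <functional>
#include <ostream>
#include <streambuf>
#include <string>
#include <utility>

// Collects bytes written to a narrow stream and passes them to a sink on
// flush. Buffering until flush keeps multi-byte UTF-8 sequences together.
class sink_streambuf : public std::streambuf {
public:
    using sink_type = std::function<void(const std::string&)>;
    explicit sink_streambuf(sink_type sink) : sink_(std::move(sink)) {}

protected:
    // Called for single characters because no put area is set up.
    int_type overflow(int_type ch) override {
        if (!traits_type::eq_int_type(ch, traits_type::eof()))
            pending_ += traits_type::to_char_type(ch);
        return traits_type::not_eof(ch);
    }
    // Called for bulk writes such as string insertion.
    std::streamsize xsputn(const char* s, std::streamsize n) override {
        pending_.append(s, static_cast<std::size_t>(n));
        return n;
    }
    // Called on flush: hand everything collected so far to the sink.
    int sync() override {
        if (!pending_.empty()) {
            sink_(pending_);
            pending_.clear();
        }
        return 0;
    }

private:
    sink_type sink_;
    std::string pending_;
};
```

Construct a std::ostream on top of it, or install it on std::cout with rdbuf() and restore the old buffer before exit. Nothing reaches the sink until the stream is flushed.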
Or you can change Windows locale to Japanese.

Windows Unicode C++ Stream Output Failure

I am currently writing an application which requires me to call GetWindowText on arbitrary windows and store that data to a file for later processing. Long story short, I noticed that my tool was failing on Battlefield 3, and I narrowed the problem down to the following character in its window title:
http://www.fileformat.info/info/unicode/char/2122/index.htm
So I created a little test app which just does the following:
std::wcout << L"\u2122";
Lo and behold, that breaks output to the console window for the remainder of the program.
Why is the MSVC STL choking on this character (and I assume others) when APIs like MessageBoxW etc display it just fine?
How can I get those characters printed to my file?
Tested on both VC10 and VC11 under Windows 7 x64.
Sorry for the poorly constructed post, I'm tearing my hair out here.
Thanks.
EDIT:
Minimal test case
#include <fstream>
#include <iostream>
int main()
{
{
std::wofstream test_file("test.txt");
test_file << L"\u2122";
}
std::wcout << L"\u2122";
}
Expected result: '™' character printed to console and file.
Observed result: File is created but is empty. No output to console.
I have confirmed that the font I'm using for my console is capable of displaying the character in question, and the file is definitely empty (0 bytes in size).
EDIT:
Further debugging shows that the 'failbit' and 'badbit' are set in the stream(s).
EDIT:
I have also tried using Boost.Locale and I am having the same issue even with the new locale imbued globally and explicitly to all standard streams.
To write into a file, you have to set the locale correctly, for example if you want to write them as UTF-8 characters, you have to add
const std::locale utf8_locale
= std::locale(std::locale(), new std::codecvt_utf8<wchar_t>());
test_file.imbue(utf8_locale);
You have to add these 2 include files
#include <codecvt>
#include <locale>
To write to the console you have to set the console in the correct mode (this is windows specific) by adding
_setmode(_fileno(stdout), _O_U8TEXT);
(in case you want to use UTF-8).
For this you have to add these 2 include files:
#include <fcntl.h>
#include <io.h>
Furthermore you have to make sure that you are using a font that supports Unicode (such as Lucida Console). You can change the font in the properties of your console window.
The complete program now looks like this:
#include <fstream>
#include <iostream>
#include <codecvt>
#include <locale>
#include <fcntl.h>
#include <io.h>
int main()
{
const std::locale utf8_locale = std::locale(std::locale(),
new std::codecvt_utf8<wchar_t>());
{
std::wofstream test_file("c:\\temp\\test.txt");
test_file.imbue(utf8_locale);
test_file << L"\u2122";
}
_setmode(_fileno(stdout), _O_U8TEXT);
std::wcout << L"\u2122";
}
Are you always using std::wcout or are you sometimes using std::cout? Mixing these won't work. Of course, the error description "choking" doesn't say at all what problem you are observing. I'd suspect that this is a different problem to the one using files, however.
As there is no real description of the problem it takes somewhat of a crystal ball followed by a shot in the dark to hit the problem... Since you want to get Unicode characters into your file, make sure that the file stream you are using uses a std::locale whose std::codecvt<...> facet actually converts to a suitable Unicode encoding.
I just tested GCC (versions 4.4 thru 4.7) and MSVC 10, which all exhibit this problem.
Equally broken is wprintf, which does as little as the C++ stream API.
I also tested the raw Win32 API to see if nothing else was causing the failure, and this works:
#include <windows.h>
int main()
{
HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE); // not named stdout, which <stdio.h> may define as a macro
DWORD n;
WriteConsoleW( out, L"\u03B2", 1, &n, NULL );
}
Which writes β to the console (if you set cmd's font to something like Lucida Console).
Conclusion: wchar_t output is horribly broken in both large C++ Standard library implementations.
Although the wide character streams take Unicode as input, that's not what they produce as output - the characters go through a conversion. If a character can't be represented in the encoding that it's converting to, the output fails.
