International UTF-32 strings output to console in Linux - c++

#include <stdio.h>
#include <iostream>
#include <locale>
int main()
{
const wchar_t *str = L"\u041F\u043E\u0440\u044F\u0434\u043E\u043A";
std::locale::global(std::locale(""));
std::wcout << str << std::endl;
}
Here's a piece of code that outputs a Russian phrase stored in a UTF-32 wchar_t string. The output is:
The correct one: Порядок when run from UTF-8 gnome terminal in Ubuntu 11.10
РџРѕСЂСЏРґРѕРє in Eclipse in the test run as above
45=B8D8:0B># in Eclipse in a real program (where I don't even know who does what and where, but I suppose someone does mess with locales)
??????? if I don't call locale
str is shown as Details:0x400960 L"\320\237\320\276\321\200\321\217\320\264\320\276\320\272" in Eclipse Watch window
is shown as ASCII only byte chars in Eclipse memory window (and there's no way to specify that this is UTF-32 string)
I believe this is a misconfiguration in either the Eclipse console or the program, because other people who just run my code in Eclipse do see the correct output.
Could someone shed some light on this confusion? What is the correct way to set up all the pieces (OS, gcc, terminal, Eclipse, sources...) to output international symbols that are stored in UTF-32 wchar_t strings?
And as a side note, why should I still care about all this when we have UTF-32 and that should be enough to know what is inside...

It turned out to be that other code changed locale.
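Since the culprit was other code changing the locale, one way to track down that kind of problem is to log the active locale names right before the output statement. This is only a minimal sketch: the helper name describe_locales is made up for illustration and is not part of the original program.

```cpp
#include <cassert>
#include <clocale>
#include <locale>
#include <string>

// Hypothetical diagnostic helper: reports the name of the C++ global locale
// and of the C runtime locale, without changing either of them.
std::string describe_locales() {
    // Passing nullptr only queries the current C locale.
    const char* c_name = std::setlocale(LC_ALL, nullptr);
    return "C++ global: " + std::locale().name() +
           ", C runtime: " + (c_name ? c_name : "(unknown)");
}
```

Printing the result before and after the suspect library calls shows whether (and where) the locale was changed behind your back.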

Related

C++ output Unicode in variable

I'm trying to output a string containing Unicode characters, which is received from a curl call. Therefore, I'm looking for something similar to the u8 and L prefixes for literal strings, but applicable to variables. E.g.:
const char *s = u8"\u0444";
However, since I have a string containing unicode characters, such as:
mit freundlichen Grüßen
When I want to print this string with:
cout << UnicodeString << endl;
it outputs:
mit freundlichen Gr??en
When I use wcout, it returns me:
mit freundlichen Gren
What am I doing wrong, and how can I achieve the correct output? I return the output with RapidJSON, which returns the string as:
mit freundlichen Gr��en
Important to note, the application is a CGI running on Ubuntu, replying on browser requests
If you are on Windows, what I would suggest is using Unicode UTF-16 at the Windows boundary.
It seems to me that on Windows with Visual C++ (at least up to VS2015) std::cout cannot output UTF-8-encoded-text, but std::wcout correctly outputs UTF-16-encoded text.
This compilable code snippet correctly outputs your string containing German characters:
#include <fcntl.h>
#include <io.h>
#include <iostream>
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
// ü : U+00FC
// ß : U+00DF
const wchar_t * text = L"mit freundlichen Gr\u00FC\u00DFen";
std::wcout << text << L'\n';
}
Note the use of a UTF-16-encoded wchar_t string.
On a more general note, I would suggest using the UTF-8 encoding (for example, storing text in std::strings) in the cross-platform portions of your C++ code, and converting to UTF-16-encoded text at the Windows boundary.
To convert between UTF-8 and UTF-16 you can use Windows APIs like MultiByteToWideChar and WideCharToMultiByte. These are C APIs, that can be safely and conveniently wrapped in C++ code (more details can be found in this MSDN article, and you can find compilable C++ code here on GitHub).
On my system the following produces the correct output. Try it on your system. I am confident that it will produce similar results.
#include <string>
#include <iostream>
using namespace std;
int main()
{
string s="mit freundlichen Grüßen";
cout << s << endl;
return 0;
}
If it is ok, then this points to the web transfer not being 8-bit clean.
Mike.
containing unicode characters
You didn't specify which Unicode encoding the string contains. There is the "narrow" UTF-8, which can be stored in a std::string and printed using std::cout, as well as wider variants, which can't. It is crucial to know which encoding you're dealing with. For the remainder of my answer, I'm going to assume you want to use UTF-8.
When I want to print this string with:
cout << UnicodeString << endl;
EDIT:
Important to note, the application is a CGI running on Ubuntu, replying on browser requests
The concerns here are slightly different from printing onto a terminal.
You need to set the Content-Type response header appropriately or else the client cannot know how to interpret the response. For example Content-Type: application/json; charset=utf-8.
You still need to make sure that the source string is in fact the correct encoding corresponding to the header. See the old answer below for overview.
The browser has to support the encoding. Most modern browsers have had support for UTF-8 a long time now.
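For the CGI case, the points above boil down to emitting the header before the body. A minimal sketch, assuming the JSON text is already valid UTF-8; the function name make_cgi_response is made up for illustration:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: prefixes a UTF-8 encoded JSON body with the header
// that tells the browser how to decode it. The blank line ends the headers.
std::string make_cgi_response(const std::string& json_utf8) {
    return "Content-Type: application/json; charset=utf-8\r\n\r\n" + json_utf8;
}
```

The CGI program would then write the returned string to standard output unchanged, e.g. with std::cout << make_cgi_response(body).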
Answer regarding printing to terminal:
Assuming that
1. UnicodeString indeed contains a UTF-8 encoded string,
2. the terminal uses UTF-8 encoding, and
3. the font that the terminal uses has the graphemes that you use,
the above should work.
it outputs:
mit freundlichen Gr??en
Then it appears that at least one of the above assumptions doesn't hold.
Whether 1. is true, you can verify by inspecting the numeric value of each code unit separately and comparing it to what you would expect of UTF-8. If 1. isn't true, then you need to figure out which encoding the string actually uses, and either convert the encoding or configure the terminal to use that encoding.
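Inspecting the code units can be as simple as dumping the bytes in hex. A minimal sketch (hex_bytes is a made-up helper name): in UTF-8, ü should show up as c3 bc and ß as c3 9f, whereas in ISO-8859-1 they would be the single bytes fc and df.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: renders every byte of s as two lowercase hex digits,
// separated by spaces, so the encoding can be checked by eye.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char b : s) {
        if (!out.empty()) out += ' ';
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(b));
        out += buf;
    }
    return out;
}
```

For a correctly encoded "Grüßen" this yields 47 72 c3 bc c3 9f 65 6e.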
The terminal typically, but not necessarily, uses the system native encoding. The first step of figuring out what encoding your terminal / system uses is to figure out what terminal / system you are using in the first place. The details are probably in a manual.
If the terminal doesn't use UTF-8, then you need to convert the UTF-8 string within your program into the character encoding that the terminal does use, unless that encoding doesn't have the graphemes that you want to print. Unfortunately, the standard library doesn't provide arbitrary character encoding conversion support (there is some support for converting between narrow and wide Unicode, but even that support is deprecated). You can find the Unicode standard here, although I would like to point out that using an existing conversion implementation can save a lot of work.
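To give an idea of what such a conversion involves, here is a hand-written sketch for one specific target, assuming the terminal uses ISO-8859-1 (Latin-1). The function name utf8_to_latin1 is made up, and a real program would more likely rely on an existing conversion library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-8 to ISO-8859-1. Characters outside
// U+0000..U+00FF and malformed sequences are replaced with '?'.
std::string utf8_to_latin1(const std::string& in) {
    auto is_continuation = [&](std::size_t i) {
        return i < in.size() &&
               (static_cast<unsigned char>(in[i]) & 0xC0) == 0x80;
    };
    std::string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        if (b < 0x80) {                       // plain ASCII
            out += static_cast<char>(b);
            ++i;
        } else if ((b & 0xE0) == 0xC0 && is_continuation(i + 1)) {
            // Two-byte sequence: covers U+0080..U+07FF.
            const unsigned cp = ((b & 0x1Fu) << 6) |
                                (static_cast<unsigned char>(in[i + 1]) & 0x3Fu);
            out += (cp >= 0x80 && cp <= 0xFF) ? static_cast<char>(cp) : '?';
            i += 2;
        } else {
            // Longer sequence or stray byte: one '?' for the whole thing.
            out += '?';
            ++i;
            while (is_continuation(i)) ++i;
        }
    }
    return out;
}
```

With this, "Grüßen" in UTF-8 becomes the Latin-1 bytes for the same text, while a character such as ™ (which Latin-1 lacks) degrades to a single '?'.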
If the character encoding of the terminal doesn't have the needed graphemes, or if you don't want to implement encoding conversion, the alternative is to reconfigure the terminal to use UTF-8. If the terminal / system can be configured to use UTF-8, there should be details in the manual.
You should be able to test whether the font itself has the required graphemes simply by typing the characters into the terminal and seeing whether they display correctly. This test will also fail if the terminal encoding does not have the graphemes, so check that first. The manual of your terminal should explain how to change the font, should that be necessary. That said, I would expect üß to exist in most fonts.

Cannot wcout wide string other than English [duplicate]

I've read a bunch of articles and forum posts discussing this problem, and all of the solutions seem way too complicated for such a simple task.
Here's a sample code straight from cplusplus.com:
// reading a text file
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main () {
string line;
ifstream myfile ("example.txt");
if (myfile.is_open())
{
while ( myfile.good() )
{
getline (myfile,line);
cout << line << endl;
}
myfile.close();
}
else cout << "Unable to open file";
return 0;
}
It works fine as long as example.txt has only ASCII characters. Things get messy if I try to add, say, something in Russian.
In GNU/Linux it's as simple as saving the file as UTF-8.
In Windows, that doesn't work. Converting the file into UCS-2 Little Endian (what Windows seems to use by default) and changing all the functions into their wchar_t counterparts doesn't do the trick either.
Isn't there some kind of a "correct" way to get this done without doing all kinds of magic encoding conversions?
The Windows console supports Unicode, sort of. It does not support right-to-left and "complex scripts". To print a UTF-16 file with Visual C++, use the following:
_setmode(_fileno(stdout), _O_U16TEXT);
And use wcout instead of cout.
There is no support for a "UTF8" code page, so for UTF-8 you will have to use MultiByteToWideChar.
More on console support for unicode can be found in this blog
The right way to output to a console on Windows using cout is to first call GetConsoleOutputCP, and then convert the input you have into the console code page. Alternatively, use WriteConsoleW, passing a wchar_t*.
For reading UTF-8 or UTF-16 strings from a file, you can use the extended mode string of _wfopen_s and fgetws. I don't think there is a C++ interface for these extensions yet. The easiest way to print to the console is described in Michael Kaplan's blog:
#include <fcntl.h>
#include <io.h>
#include <stdio.h>
int main(void) {
_setmode(_fileno(stdout), _O_U16TEXT);
wprintf(L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd\n");
return 0;
}
Avoid GetConsoleOutputCP, it is only retained for compatibility with the 8-bit API.
While Windows console windows are UCS-2 based, they don't support UTF-8 properly.
You might make things work by setting the console window's active output code page to UTF-8 temporarily, using the appropriate API functions. Note that those functions distinguish between input code page and output code page. However, [cmd.exe] really doesn't like UTF-8 as active code page, so don't set that as a permanent code page.
Otherwise, you can use the Unicode console window functions.
Cheers & hth.,
#include <stdio.h>
int main (int argc, char *argv[])
{
// do chcp 65001 in the console before running this
printf ("γασσο γεο!\n");
}
Works perfectly if you do chcp 65001 in the console before running your program.
Caveats:
I'm using 64 bit Windows 7 with VC++ Express 2010
The code is in a file encoded as UTF-8 without BOM - I wrote it in a text editor, not using the VC++ IDE, then used VC++ to compile it.
The console has a TrueType font - this is important
Don't know if these things make too much difference...
Can't speak for chars off the BMP, give it a whirl and leave a comment.
Just to be clear, some here have mentioned UTF8. UTF-8 is a multibyte encoding of Unicode. What some Windows documentation calls "Unicode" is actually UTF-16, which uses two bytes per code unit (and two code units for characters outside the BMP).
I've used this previously posted solution with Visual Studio 2008. I don't know if it works with later versions of Visual Studio.
#include <iostream>
#include <fcntl.h>
#include <io.h>
#include <tchar.h>
<code omitted>
_setmode(_fileno(stdout), _O_U16TEXT);
std::wcout << _T("This is some text to print\n");
I used macros to switch between std::wcout and std::cout, and also to remove the _setmode call for ASCII builds, thus allowing compilation for either ASCII or UNICODE. This works. I have not yet tested std::endl, but I think it might work with wcout and Unicode (not sure), i.e.:
std::wcout << _T("This is some text to print") << std::endl;

printing Unicode characters C++

I'm trying to write a simple command line app to teach myself Japanese, but can't seem to get Unicode characters to print. What am I missing?
#include <cstdlib>
#include <iostream>
using namespace std;
int main()
{
wcout << L"こんにちは世界\n";
wcout << L"Hello World\n";
system("pause");
}
In this example only "Press any key to continue" is displayed. Tested on Visual C++ 2013.
This is not easy on Windows. Even when you manage to get the text to the Windows console you still need to configure cmd.exe to be able to display Japanese characters.
#include <iostream>
int main() {
std::cout << "こんにちは世界\n";
}
This works fine on any system where:
The compiler's source and execution encodings include the characters.
The output device (e.g., the console) expects text in the same encoding as the compiler's execution encoding.
A font with the appropriate characters is available (usually not a problem).
Most platforms these days use UTF-8 by default for all these encodings and so can support the entire Unicode range with code similar to the above. Unfortunately Windows is not one of these platforms.
wcout << L"こんにちは世界\n";
In this line the string literal data is (at compile time) converted from the source encoding to the execution wide encoding and then (at run time) wcout uses the locale it is imbued with to convert the wchar_t data to char data for output. Where things go wrong is that the default locale is only required to support characters from the basic source character set, which doesn't even include all ASCII characters, let alone non-ASCII characters.
So the conversion results in an error, putting wcout into a bad state. The error has to be cleared before wcout will function again, which is why the second print statement does not output anything.
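The effect of the error state can be reproduced portably with a string stream. In this sketch the failure is set by hand as a stand-in for the failed conversion (with a real wcout, the failed conversion itself would set the error bits); the function name demo_failed_write is made up for illustration.

```cpp
#include <cassert>
#include <ios>
#include <sstream>
#include <string>

// Returns what actually reaches the stream when an error occurs between
// two writes, with and without clearing the error state afterwards.
std::wstring demo_failed_write(bool clear_error) {
    std::wostringstream out;
    out << L"first ";
    out.setstate(std::ios_base::failbit);  // stand-in for the failed conversion
    out << L"lost ";                       // ignored: the stream is in a failed state
    if (clear_error) out.clear();          // reset the error flags
    out << L"second";                      // only written if the state was cleared
    return out.str();
}
```

Without the clear() call everything after the failure is silently dropped, which matches the behaviour seen with the second wcout statement above.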
You can work around this for a limited range of characters by imbuing wcout with a locale that will successfully convert the characters. Unfortunately the encoding that is needed to support the entire Unicode range this way is UTF-8; although Microsoft's implementation of streams supports other multibyte encodings, it very specifically does not support UTF-8.
For example:
wcout.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8_utf16<wchar_t>()));
SetConsoleOutputCP(CP_UTF8);
wcout << L"こんにちは世界\n";
Here wcout will correctly convert the string to UTF-8, and if the output were written to a file instead of the console then the file would contain the correct UTF-8 data. However the Windows console, even though configured here to accept UTF-8 data, simply will not accept UTF-8 data written in this way.
There are a few options:
Avoid the standard library entirely:
DWORD n;
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), L"こんにちは世界\n", 8, &n, nullptr);
Use non-standard magical incantation that will break standard code:
#include <fcntl.h>
#include <io.h>
_setmode(_fileno(stdout), _O_U8TEXT);
std::wcout << L"こんにちは世界\n";
After setting this mode std::cout << "Hello, World"; will crash.
Use a low level IO API along with manual conversion:
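#include <codecvt>
#include <cstdio>
#include <locale>
SetConsoleOutputCP(CP_UTF8);
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
// to_bytes returns a std::string; fputs needs a C string and, unlike puts, adds no extra newline
std::fputs(convert.to_bytes(L"こんにちは世界\n").c_str(), stdout);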
Using any of these methods, cmd.exe will display the correct text to the best of its ability, by which I mean it will display unreadable boxes. Seven little boxes, for the given string.
You can copy the text out of cmd.exe and into notepad.exe or whatever to see the correct glyphs.
There's a whole article about dealing with Unicode in Windows console
http://alfps.wordpress.com/2011/11/22/unicode-part-1-windows-console-io-approaches/
http://alfps.wordpress.com/2011/12/08/unicode-part-2-utf-8-stream-mode/
Basically, you may implement your own streambuf for std::cout (or std::wcout) in terms of WriteConsoleW and enjoy writing UTF-8 (or whatever Unicode encoding you want) to the Windows console without depending on locales or console code pages, and even without using wide characters.
It may not look very straightforward, but it's a convenient and reusable solution, which can also give you portable, utf8-everywhere style user code. Please, don't beat me for my English :)
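The streambuf idea can be sketched in a portable way by letting the buffer hand decoded UTF-16 text to a callback. This is not the code from the linked articles: the names Utf16SinkBuf and utf8_to_utf16 are made up, and on Windows the callback is where a call to WriteConsoleW would go. The sketch assumes the stream is flushed at character boundaries.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <ostream>
#include <streambuf>
#include <string>
#include <utility>

// Decodes UTF-8 into UTF-16 code units. Malformed or truncated input becomes
// U+FFFD; overlong forms are not rejected, to keep the sketch short.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp = 0;
        int extra = 0;                                   // continuation bytes expected
        if (lead < 0x80) { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out += char16_t(0xFFFD); continue; }      // invalid lead byte
        bool ok = true;
        for (; extra > 0; --extra) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }
        if (!ok || cp > 0x10FFFF) { out += char16_t(0xFFFD); continue; }
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {                                         // encode as a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}

// Collects the UTF-8 bytes written to the stream and, on flush, passes them
// to the sink as UTF-16.
class Utf16SinkBuf : public std::streambuf {
public:
    using Sink = std::function<void(const std::u16string&)>;
    explicit Utf16SinkBuf(Sink sink) : sink_(std::move(sink)) {
        assert(sink_ != nullptr);
    }
protected:
    int_type overflow(int_type ch) override {
        if (!traits_type::eq_int_type(ch, traits_type::eof()))
            pending_ += traits_type::to_char_type(ch);
        return traits_type::not_eof(ch);
    }
    std::streamsize xsputn(const char* s, std::streamsize n) override {
        pending_.append(s, static_cast<std::size_t>(n));
        return n;
    }
    int sync() override {
        if (!pending_.empty()) {
            sink_(utf8_to_utf16(pending_));
            pending_.clear();
        }
        return 0;
    }
private:
    Sink sink_;
    std::string pending_;
};
```

An std::ostream constructed on top of such a buffer accepts ordinary UTF-8 std::strings, so the user code stays the same on every platform and only the sink differs.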
Or you can change Windows locale to Japanese.

Windows Unicode C++ Stream Output Failure

I am currently writing an application which requires me to call GetWindowText on arbitrary windows and store that data to a file for later processing. Long story short, I noticed that my tool was failing on Battlefield 3, and I narrowed the problem down to the following character in its window title:
http://www.fileformat.info/info/unicode/char/2122/index.htm
So I created a little test app which just does the following:
std::wcout << L"\u2122";
Lo and behold, that breaks output to the console window for the remainder of the program.
Why is the MSVC STL choking on this character (and I assume others) when APIs like MessageBoxW etc display it just fine?
How can I get those characters printed to my file?
Tested on both VC10 and VC11 under Windows 7 x64.
Sorry for the poorly constructed post, I'm tearing my hair out here.
Thanks.
EDIT:
Minimal test case
#include <fstream>
#include <iostream>
int main()
{
{
std::wofstream test_file("test.txt");
test_file << L"\u2122";
}
std::wcout << L"\u2122";
}
Expected result: '™' character printed to console and file.
Observed result: File is created but is empty. No output to console.
I have confirmed that the font I'm using for my console is capable of displaying the character in question, and the file is definitely empty (0 bytes in size).
EDIT:
Further debugging shows that the 'failbit' and 'badbit' are set in the stream(s).
EDIT:
I have also tried using Boost.Locale and I am having the same issue even with the new locale imbued globally and explicitly to all standard streams.
To write into a file, you have to set the locale correctly, for example if you want to write them as UTF-8 characters, you have to add
const std::locale utf8_locale
= std::locale(std::locale(), new std::codecvt_utf8<wchar_t>());
test_file.imbue(utf8_locale);
You have to add these 2 include files
#include <codecvt>
#include <locale>
To write to the console you have to set the console in the correct mode (this is windows specific) by adding
_setmode(_fileno(stdout), _O_U8TEXT);
(in case you want to use UTF-8).
For this you have to add these 2 include files:
#include <fcntl.h>
#include <io.h>
Furthermore, you have to make sure that you are using a font that supports Unicode (for example, Lucida Console). You can change the font in the properties of your console window.
The complete program now looks like this:
#include <fstream>
#include <iostream>
#include <codecvt>
#include <locale>
#include <fcntl.h>
#include <io.h>
int main()
{
const std::locale utf8_locale = std::locale(std::locale(),
new std::codecvt_utf8<wchar_t>());
{
std::wofstream test_file("c:\\temp\\test.txt");
test_file.imbue(utf8_locale);
test_file << L"\u2122";
}
_setmode(_fileno(stdout), _O_U8TEXT);
std::wcout << L"\u2122";
}
Are you always using std::wcout or are you sometimes using std::cout? Mixing these won't work. Of course, the error description "choking" doesn't say at all what problem you are observing. I'd suspect that this is a different problem to the one using files, however.
As there is no real description of the problem, it takes somewhat of a crystal ball followed by a shot in the dark to hit the problem... Since you want to get Unicode characters from your file, make sure that the file stream you are using uses a std::locale whose std::codecvt<...> facet actually converts to a suitable Unicode encoding.
I just tested GCC (versions 4.4 thru 4.7) and MSVC 10, which all exhibit this problem.
Equally broken is wprintf, which does as little as the C++ stream API.
I also tested the raw Win32 API to see if nothing else was causing the failure, and this works:
#include <windows.h>
int main()
{
HANDLE stdout = GetStdHandle(STD_OUTPUT_HANDLE);
DWORD n;
WriteConsoleW( stdout, L"\u03B2", 1, &n, NULL );
}
Which writes β to the console (if you set cmd's font to something like Lucida Console).
Conclusion: wchar_t output is horribly broken in both large C++ Standard library implementations.
Although the wide character streams take Unicode as input, that's not what they produce as output - the characters go through a conversion. If a character can't be represented in the encoding that it's converting to, the output fails.
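One way to sidestep the stream's conversion entirely is to do the encoding yourself and write the resulting bytes through a narrow stream. A minimal sketch of a code point to UTF-8 encoder follows; the name encode_utf8 is made up, and the function assumes a valid code point (it does not reject surrogates or values above U+10FFFF).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes a single Unicode code point as UTF-8 bytes.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                                   // 1 byte: ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                           // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                         // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                           // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Writing encode_utf8(0x2122) to a plain std::ofstream opened in binary mode produces the three bytes E2 84 A2, independent of any locale.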
