wcout does not output as desired - c++

I've been trying to write a C++ application for a project and I ran into this issue. Basically:
class OBSClass
{
public:
wstring ClassName;
uint8_t Credit;
uint8_t Level;
OBSClass() : ClassName(), Credit(), Level() {}
OBSClass(wstring name, uint8_t credit, uint8_t hyear)
: ClassName(name), Credit(credit), Level(hyear)
{}
};
In some other file:
vector<OBSClass> AllClasses;
...
AllClasses.push_back(OBSClass(L"Bilişim Sistemleri Mühendisliğine Giriş", 3, 1));
AllClasses.push_back(OBSClass(L"İş Sağlığı ve Güvenliği", 3, 1));
AllClasses.push_back(OBSClass(L"Türk Dili 1", 2, 1));
... (rest omitted; some of the entries have non-ASCII characters like 'ş' and 'İ')
I have a function that basically outputs everything in AllClasses; the problem is that wcout does not output as desired.
void PrintClasses()
{
for (size_t i = 0; i < AllClasses.size(); i++)
{
wcout << "Class: " << AllClasses[i].ClassName << "\n";
}
}
Output is 'Class: Bili' and nothing else. The program does not even try to output the other entries and just hangs. I am on Windows using G++ 6.3.0. I am not using Windows' cmd; I am using bash from MinGW, so encoding should not be a problem (or should it?). Any advice?
Edit: Also source code encoding is not a problem, just checked it is UTF8, default of VSCode
Edit: Also just checked to find out if problem is with string literals.
wstring test;
wcin >> test;
wcout << test;
Entered some non-ASCII characters like 'ö' and 'ş', it works perfectly. What is the problem with wide string literals?
Edit: Here you go
#include <iostream>
#include <string>
#include <vector>
using namespace std;
vector<wstring> testvec;
int main()
{
testvec.push_back(L"Bilişim Sistemleri Mühendisliğine Giriş");
testvec.push_back(L"ıiÖöUuÜü");
testvec.push_back(L"☺☻♥♦♣♠•◘○");
for (size_t i = 0; i < testvec.size(); i++)
wcout << testvec[i] << "\n";
return 0;
}
Compile with G++:
g++ file.cc -O3
This code only outputs 'Bili'. It must be something with the g++ screwing up binary encoding (?), since entering values with wcin then outputting them with wcout does not generate any problem.

The following code works for me, using MinGW-w64 7.3.0 in both MSYS2 Bash, and Windows CMD; and with the source encoded as UTF-8:
#include <iostream>
#include <locale>
#include <string>
#include <codecvt>
int main()
{
    std::ios_base::sync_with_stdio(false);
    std::locale utf8( std::locale(), new std::codecvt_utf8_utf16<wchar_t> );
    std::wcout.imbue(utf8);
    std::wstring w(L"Bilişim Sistemleri Mühendisliğine Giriş");
    std::wcout << w << '\n';
}
Explanation:
The Windows console doesn't support any sort of 16-bit output; it only has ANSI support and partial UTF-8 support. ANSI is the default for backwards-compatibility purposes, though Windows 10 1803 does add an option to set that to UTF-8. So you need to configure wcout to convert the output to UTF-8.
imbue with a codecvt_utf8_utf16 achieves this; however you also need to disable sync_with_stdio otherwise the stream doesn't even use the facet, it just defers to stdout which has a similar problem.
For writing to other files, I found that the same technique works to write UTF-8. For writing a UTF-16 file you need to imbue the wofstream with a UTF-16 facet and manually write a BOM.
Commentary: Many people just avoid trying to use wide iostreams completely, due to these issues.
You can write a UTF-8 file using a narrow stream, with function calls in your code to convert wstring to UTF-8 if you are using wstring internally; you can of course also use UTF-8 internally.
Of course you can also write a UTF-16 file using a narrow stream, just not with operator<< from a wstring.

If you have at least Windows 10 1903 (May 2019) and at least
Windows Terminal 0.3.2142 (Aug 2019), then set the OEM code page to UTF-8:
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage]
"OEMCP"="65001"
and restart. After that you can use this:
#include <iostream>
#include <string>
int main() {
    std::string a[] = {
        "Bilişim Sistemleri Mühendisliğine Giriş",
        "Türk Dili 1",
        "İş Sağlığı ve Güvenliği",
        "ıiÖöUuÜü",
        "☺☻♥♦♣♠•◘○"
    };
    // Iterate by const reference to avoid copying each string.
    for (const auto& s : a) {
        std::cout << s << std::endl;
    }
}

Related

Reading a file that contains chinese characters (C++)

I have issues reading a file that contains Chinese characters. I know that the encoding of the file is Big5.
Here is my example file (test.txt), I can't include it here because of the chinese characters: https://gist.github.com/haruka98/974ca2c034ebd8fe7eeac4124739fc41
This is my minimal code example (main.cpp), the one I'm actually using breaks down each line and does things with the different fields.
#include <string>
#include <fstream>
#include <iostream>
int main(int argc, char* argv[]) {
setlocale(LC_ALL, "Chinese-traditional");
std::wstring wstr;
std::wifstream input_file("test.txt");
std::wofstream output_file("test_output.txt");
int counter = 0;
while(std::getline(input_file, wstr)) {
for(int i = 0; i < wstr.size(); i++) {
if(wstr[i] == L'|') {
counter++;
}
}
output_file << wstr << std::endl;
}
input_file.close();
output_file.close();
std::cout << counter << std::endl;
return 0;
}
To compile my program:
g++ -o test main.cpp -std=c++17
On Windows 10 I got my expected output. I got the entire file copied to "test_output.txt" and the 129 output in the terminal.
On Linux (Debian 9) I got the terminal output 4 and the file "test_output.txt" only contains the first line and the "1|" from the second.
Here is what I tried:
My first guess was the CR LF and LF issue when using both Windows and Linux. But testing both CR LF and LF with the file did not help.
Then I thought that the "Chinese-traditional" might not work on Linux. I replaced it with "zh_TW.BIG5" but did not get the expected result either.
First check you have the locale for "Chinese-traditional" installed. On Linux this is zh_TW.UTF-8. You can check using locale -a. If it's not listed, install it:
sudo locale-gen zh_TW.UTF-8
sudo update-locale
(There's a list of locales here with their names on Linux and Windows.)
Then use imbue with the input and output streams to set the locale of the streams.
By default, std::wcout is synchronized to the underlying stdout C stream, which uses an ASCII mapping and displays ? in place of Unicode characters it cannot handle. If you want to print Unicode characters to the terminal, you have to turn that synchronization off. You can do that with one line and then set the locale of std::wcout:
std::ios_base::sync_with_stdio(false);
std::wcout.imbue(loc);
Amended version of your code:
#include <string>
#include <locale>
#include <fstream>
#include <iostream>
int main(int argc, char* argv[])
{
    auto loc = std::locale("zh_TW.utf8");
    // Disable synchronisation with stdio and set the locale of std::wcout
    std::ios::sync_with_stdio(false);
    std::wcout.imbue(loc);
    // Set locale of input stream
    std::wstring wstr;
    std::wifstream input_file("test.txt");
    input_file.imbue(loc);
    // Set locale of output stream
    std::wofstream output_file("test_output.txt");
    output_file.imbue(loc);
    int counter = 0;
    while(std::getline(input_file, wstr)) {
        // size() is unsigned, so use std::size_t for the index
        for(std::size_t i = 0; i < wstr.size(); i++) {
            if(wstr[i] == L'|') {
                counter++;
            }
        }
        std::wcout << wstr << std::endl;
        output_file << wstr << std::endl;
    }
    input_file.close();
    output_file.close();
    std::wcout << counter << std::endl;
    return 0;
}
setlocale affects the locale of your program.
It has no effect on the default encoding of the text displayed by the terminal window. The terminal window is an independent application, with its own locale.
Pretty much all modern Linux distributions default to UTF-8 as the encoding for the system console and the terminal windows (gnome-terminal, Konsole, xfce4-terminal, etc...).
Changing your program's locale only affects how your application interprets text, but the terminal still expects your application to produce UTF-8 output. The terminal window has no knowledge of the internal locale of the application running in the terminal window. Terminal windows expect applications to produce output using the system locale's character encoding.
It is theoretically possible for the C library to know the default system encoding and silently transcode all the output, however it does not work this way.
You will have to do all the work of transcoding big5 to UTF-8, using the iconv library, on Linux.
A low-cost shortcut would be for your program to fork and run the iconv command-line tool as a child process, pipe its output to it, and let iconv do the transcoding on the fly.
Use std::wcout instead of std::cout to print a std::wstring :-)

C++ How to write japanese characters in file with [duplicate]

I am using Visual Studio C++ 2008 (Express). When I run the below code, the wostream (both std::wcout, and std::wfstream) stops outputting at the first non-ASCII character (in this case Chinese) encountered. Plain ASCII characters print fine. However, in the debugger, I can see that the wstrings are in fact properly populated with Chinese characters, and the output << ... is in fact getting executed.
The project settings in the Visual Studio solution are set to "Use Unicode Character Set". Why is std::wostream failing to output Unicode characters outside of the ASCII range?
void PrintTable(const std::vector<std::vector<std::wstring>> &table, std::wostream& output) {
for (unsigned int i=0; i < table.size(); ++i) {
for (unsigned int j=0; j < table[i].size(); ++j) {
output << table[i][j] << L"\t";
}
//output << std::endl;
}
}
void TestUnicodeSingleTableChinesePronouns() {
FileProcessor p("SingleTableChinesePronouns.docx");
FileProcessor::iterator fileIterator;
std::wofstream myFile("data.bin", std::ios::out | std::ios::binary);
for(fileIterator = p.begin(); fileIterator != p.end(); ++fileIterator) {
PrintTable(*fileIterator, myFile);
PrintTable(*fileIterator, std::wcout);
std::cout<<std::endl<<"---------------------------------------"<<std::endl;
}
myFile.flush();
myFile.close();
}
By default the locale that std::wcout and std::wofstream use for certain operations is the "C" locale, which is not required to support non-ASCII characters (or any character outside C++'s basic character set). Change the locale to one that supports the characters you want to use.
The simplest thing to do on Windows is unfortunately to use legacy codepages, however you really should avoid that. Legacy codepages are bad news. Instead you should use Unicode, whether UTF-8, UTF-16, or whatever. Also you'll have to work around Windows' unfortunate console model that makes writing to the console very different from writing to other kinds of output streams. You might need to find or write your own output buffer that specifically handles the console (or maybe file a bug asking Microsoft to fix it).
Here's an example of console output:
#include <Windows.h>
#include <streambuf>
#include <iostream>
class Console_streambuf
    : public std::basic_streambuf<wchar_t>
{
    HANDLE m_out;
public:
    explicit Console_streambuf(HANDLE out) : m_out(out) {}
    // Called for every character, because no put area is ever set up.
    int_type overflow(int_type c = traits_type::eof()) override
    {
        if (traits_type::eq_int_type(c, traits_type::eof()))
            return traits_type::not_eof(c); // nothing to write
        wchar_t wc = traits_type::to_char_type(c);
        DWORD numberOfCharsWritten;
        BOOL res = WriteConsoleW(m_out, &wc, 1, &numberOfCharsWritten, NULL);
        return res ? c : traits_type::eof(); // report failure to the stream
    }
};
int main() {
    Console_streambuf out(GetStdHandle(STD_OUTPUT_HANDLE));
    auto old_buf = std::wcout.rdbuf(&out);
    std::wcout << L"привет, 猫咪!\n";
    // Restore the old buffer so that destruction can happen correctly.
    // FIXME: use RAII to do this in an exception-safe manner.
    std::wcout.rdbuf(old_buf);
}
You can do UTF-8 output to a file like this (although I'm not sure VS2008 supports codecvt_utf8_utf16):
#include <codecvt>
#include <fstream>
#include <locale>
int main() {
    std::wofstream myFile("data.bin", std::ios::out | std::ios::binary);
    // The locale takes ownership of the facet, so it must not be deleted manually.
    myFile.imbue(std::locale(myFile.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
    myFile << L"привет, 猫咪!";
}
Include the following header file:
#include <locale>
At the start of main, add the following line:
std::locale::global(std::locale("chinese"));
This helps to set the proper locale.

no output with wide streams

I have a problem with wide stream output. My primary concern is wofstream but wcout doesn't work properly either.
So it doesn't produce output besides Latin characters.
That is
#include <string>
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
wstring wstr = L"Андрей";
wofstream fout(L"C:\\Work\\report.htm");
wcout << wstr << L"Привет мир";
fout << wstr << L"Привет мир";
fout.close();
}
Produces no output, the file stays 0 byte long.
Mixing like wcout<<L"zuhщзг" prints just "zuh", ignores the rest.
I use MVS 2013 with Intel C++ Composer 14.0
EDIT:
Windows Unicode C++ Stream Output Failure describes similar problem. But I don't quite understand how the solution works.
MVS/Windows uses UTF-16 for wide strings, and I would like them to be written to the file as is, that is, as UTF-16, without any unnecessary conversion.

Stumped with Unicode, Boost, C++, codecvts

In C++, I want to use Unicode to do things. So after falling down the rabbit hole of Unicode, I've managed to end up in a train wreck of confusion, headaches and locales.
But in Boost I've had the unfortunate problem of trying to use Unicode file paths and trying to use the Boost program options library with Unicode input. I've read whatever I could find on the subjects of locales, codecvts, Unicode encodings and Boost.
My current attempt to get things to work is to have a codecvt that takes a UTF-8 string and converts it to the platform's encoding (UTF-8 on POSIX, UTF-16 on Windows); I've been trying to avoid wchar_t.
The closest I've actually gotten is trying to do this with Boost.Locale, to convert from a UTF-8 string to a UTF-32 string on output.
#include <string>
#include <boost/locale.hpp>
#include <locale>
int main(void)
{
std::string data("Testing, 㤹");
std::locale fromLoc = boost::locale::generator().generate("en_US.UTF-8");
std::locale toLoc = boost::locale::generator().generate("en_US.UTF-32");
typedef std::codecvt<wchar_t, char, mbstate_t> cvtType;
cvtType const* toCvt = &std::use_facet<cvtType>(toLoc);
std::locale convLoc = std::locale(fromLoc, toCvt);
std::cout.imbue(convLoc);
std::cout << data << std::endl;
// Output is unconverted -- what?
return 0;
}
I think I had some other kind of conversion working using wide characters, but I really don't know what I'm even doing. I don't know what the right tool for the job is at this point. Help?
Okay, after a long few months I've figured it out, and I'd like to help people in the future.
First of all, the codecvt thing was the wrong way of doing it. Boost.Locale provides a simple way of converting between character sets in its boost::locale::conv namespace. Here's one example (there are others not based on locales).
#include <boost/locale.hpp>
#include <iostream>
#include <locale>
#include <string>
namespace loc = boost::locale;
int main(void)
{
    loc::generator gen;
    std::locale blah = gen.generate("en_US.utf-32");
    std::string UTF8String = "Tésting!";
    // from_utf will also work with wide strings as it uses the character size
    // to detect the encoding.
    std::string converted = loc::conv::from_utf(UTF8String, blah);
    // Outputs a UTF-32 string.
    std::cout << converted << std::endl;
    return 0;
}
As you can see, if you replace the "en_US.utf-32" with "" it'll output in the user's locale.
I still don't know how to make std::cout do this all the time, but the translate() function of Boost.Locale outputs in the user's locale.
As for the filesystem using UTF-8 strings cross platform, it seems that that's possible, here's a link to how to do it.
std::cout.imbue(convLoc);
std::cout << data << std::endl;
This does no conversion, since it uses codecvt<char, char, mbstate_t> which is a no-op. The only standard streams that use codecvt are file-streams. std::cout is not required to perform any conversion at all.
To force Boost.Filesystem to interpret narrow-strings as UTF-8 on windows, use boost::filesystem::imbue with a locale with a UTF-8 ↔ UTF-16 codecvt facet. Boost.Locale has an implementation of the latter.
The Boost filesystem iostream replacement classes work fine with UTF-16 when used with Visual C++.
However, they do not work (in the sense of supporting arbitrary filenames) when used with g++ in Windows - at least as of Boost version 1.47. There is a code comment explaining that; essentially, the Visual C++ standard library provides non-standard wchar_t based constructors that Boost filesystem classes make use of, but g++ does not support these extensions.
A workaround is to use 8.3 short filenames, but this solution is a bit brittle since with old Windows versions the user can turn off automatic generation of short filenames.
Example code for using Boost filesystem in Windows:
#include "CmdLineArgs.h" // CmdLineArgs
#include "throwx.h" // throwX, hopefully
#include "string_conversions.h" // ansiWithFillersFrom( wstring )
#include <boost/filesystem/fstream.hpp> // boost::filesystem::ifstream
#include <iostream> // std::cout, std::cerr, std::endl
#include <stdexcept> // std::runtime_error, std::exception
#include <string> // std::string
#include <stdlib.h> // EXIT_SUCCESS, EXIT_FAILURE
using namespace std;
namespace bfs = boost::filesystem;
inline string ansi( wstring const& ws ) { return ansiWithFillersFrom( ws ); }
int main()
{
try
{
CmdLineArgs const args;
wstring const programPath = args.at( 0 );
hopefully( args.nArgs() == 2 )
|| throwX( "Usage: " + ansi( programPath ) + " FILENAME" );
wstring const filePath = args.at( 1 );
bfs::ifstream stream( filePath ); // Nice Boost ifstream subclass.
hopefully( !stream.fail() )
|| throwX( "Failed to open file '" + ansi( filePath ) + "'" );
string line;
while( getline( stream, line ) )
{
cout << line << endl;
}
hopefully( stream.eof() )
|| throwX( "Failed to list contents of file '" + ansi( filePath ) + "'" );
return EXIT_SUCCESS;
}
catch( exception const& x )
{
cerr << "!" << x.what() << endl;
}
return EXIT_FAILURE;
}