wostream fails to output wstring - c++

I am using Visual Studio C++ 2008 (Express). When I run the code below, the wostream (both std::wcout and std::wofstream) stops outputting at the first non-ASCII character it encounters (in this case Chinese). Plain ASCII characters print fine. However, in the debugger I can see that the wstrings are properly populated with Chinese characters, and that the output << ... statement is in fact executed.
The project settings in the Visual Studio solution are set to "Use Unicode Character Set". Why is std::wostream failing to output Unicode characters outside of the ASCII range?
void PrintTable(const std::vector<std::vector<std::wstring>> &table, std::wostream& output) {
for (unsigned int i=0; i < table.size(); ++i) {
for (unsigned int j=0; j < table[i].size(); ++j) {
output << table[i][j] << L"\t";
}
//output << std::endl;
}
}
void TestUnicodeSingleTableChinesePronouns() {
FileProcessor p("SingleTableChinesePronouns.docx");
FileProcessor::iterator fileIterator;
std::wofstream myFile("data.bin", std::ios::out | std::ios::binary);
for(fileIterator = p.begin(); fileIterator != p.end(); ++fileIterator) {
PrintTable(*fileIterator, myFile);
PrintTable(*fileIterator, std::wcout);
std::cout<<std::endl<<"---------------------------------------"<<std::endl;
}
myFile.flush();
myFile.close();
}

By default the locale that std::wcout and std::wofstream use for certain operations is the "C" locale, which is not required to support non-ASCII characters (or any character outside C++'s basic character set). Change the locale to one that supports the characters you want to use.
The simplest thing to do on Windows is unfortunately to use legacy codepages, but you really should avoid that. Legacy codepages are bad news. Instead you should use Unicode, whether UTF-8, UTF-16, or another encoding form. You will also have to work around Windows' unfortunate console model, which makes writing to the console very different from writing to other kinds of output streams. You might need to find or write your own output buffer that specifically handles the console (or maybe file a bug asking Microsoft to fix it).
Here's an example of console output:
#include <Windows.h>
#include <streambuf>
#include <iostream>
class Console_streambuf
: public std::basic_streambuf<wchar_t>
{
HANDLE m_out;
public:
Console_streambuf(HANDLE out) : m_out(out) {}
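virtual int_type overflow(int_type c = traits_type::eof())
{
// Nothing to write for eof; report success with a non-eof value.
if (traits_type::eq_int_type(c, traits_type::eof()))
return traits_type::not_eof(c);
wchar_t wc = traits_type::to_char_type(c);
DWORD numberOfCharsWritten = 0;
BOOL res = WriteConsoleW(m_out, &wc, 1, &numberOfCharsWritten, NULL);
// Signal failure so the stream sets badbit instead of silently dropping output.
if (!res || numberOfCharsWritten != 1)
return traits_type::eof();
return c;
}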
};
int main() {
Console_streambuf out(GetStdHandle(STD_OUTPUT_HANDLE));
auto old_buf = std::wcout.rdbuf(&out);
std::wcout << L"привет, 猫咪!\n";
std::wcout.rdbuf(old_buf); // replace old buffer so that destruction can happen correctly. FIXME: use RAII to do this in an exception safe manner.
}
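You can do UTF-8 output to a file like this. Note that <codecvt> is a C++11 header, so VS2008 most likely lacks it and you would need Visual Studio 2010 or later; the header was also deprecated in C++17: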
#include <codecvt>
#include <fstream>
int main() {
std::wofstream myFile("data.bin", std::ios::out | std::ios::binary);
myFile.imbue(std::locale(myFile.getloc(),new std::codecvt_utf8_utf16<wchar_t>));
myFile << L"привет, 猫咪!";
}

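Include the following header:
#include <locale>
Then add the following line at the start of main:
std::locale::global(std::locale("chinese"));
This sets the global locale to one that supports Chinese characters. The name "chinese" is specific to the Microsoft runtime; other platforms use different locale names.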
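Because locale names differ between platforms and std::locale throws std::runtime_error for a name it does not recognise, it can help to try several candidates. This is a minimal sketch; firstAvailableLocale is a hypothetical helper, and falling back to the classic "C" locale is an assumption about what the caller wants.

```cpp
#include <initializer_list>
#include <locale>
#include <stdexcept>

// Hypothetical helper: returns the first locale in `names` that the
// runtime can construct, or the classic "C" locale if none is available.
std::locale firstAvailableLocale(std::initializer_list<const char*> names) {
    for (const char* name : names) {
        try {
            return std::locale(name);
        } catch (const std::runtime_error&) {
            // This name is not known on the current platform; try the next one.
        }
    }
    return std::locale::classic();
}
```

A call such as std::locale::global(firstAvailableLocale({"chinese", "zh_CN.UTF-8"})); would then cover both the Microsoft name used above and a typical glibc name, provided the latter is installed.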

Related

Reading a file that contains chinese characters (C++)

I got issues reading a file that contains chinese characters. I know that the encoding of the file is Big5.
Here is my example file (test.txt), I can't include it here because of the chinese characters: https://gist.github.com/haruka98/974ca2c034ebd8fe7eeac4124739fc41
This is my minimal code example (main.cpp), the one I'm actually using breaks down each line and does things with the different fields.
#include <string>
#include <fstream>
#include <iostream>
int main(int argc, char* argv[]) {
setlocale(LC_ALL, "Chinese-traditional");
std::wstring wstr;
std::wifstream input_file("test.txt");
std::wofstream output_file("test_output.txt");
int counter = 0;
while(std::getline(input_file, wstr)) {
for(int i = 0; i < wstr.size(); i++) {
if(wstr[i] == L'|') {
counter++;
}
}
output_file << wstr << std::endl;
}
input_file.close();
output_file.close();
std::cout << counter << std::endl;
return 0;
}
To compile my program:
g++ -o test main.cpp -std=c++17
On Windows 10 I got my expected output. I got the entire file copied to "test_output.txt" and the 129 output in the terminal.
On Linux (Debian 9) I got the terminal output 4 and the file "test_output.txt" only contains the first line and the "1|" from the second.
Here is what I tried:
My first guess was the CR LF and LF issue when using both Windows and Linux. But testing both CR LF and LF with the file did not help.
Then I thought that the "Chinese-traditional" might not work on Linux. I replaced it with "zh_TW.BIG5" but did not get the expected result either.
First check you have the locale for "Chinese-traditional" installed. On Linux this is zh_TW.UTF-8. You can check using locale -a. If it's not listed, install it:
sudo locale-gen zh_TW.UTF-8
sudo update-locale
(There's a list of locales here with their names on Linux and Windows.)
Then use imbue with the input and output streams to set the locale of the streams.
By default, std::wcout is synchronized to the underlying stdout C stream, which uses an ASCII mapping and displays ? in place of Unicode characters it cannot handle. If you want to print Unicode characters to the terminal, you have to turn that synchronization off. You can do that with one line and set the locale of the terminal:
std::ios_base::sync_with_stdio(false);
std::wcout.imbue(loc);
Amended version of your code:
#include <string>
#include <locale>
#include <fstream>
#include <iostream>
int main(int argc, char* argv[])
{
auto loc = std::locale("zh_TW.utf8");
//Disable synchronisation with stdio & set locale
std::ios::sync_with_stdio(false);
std::wcout.imbue(loc);
//Set locale of input stream
std::wstring wstr;
std::wifstream input_file("test.txt");
input_file.imbue(loc);
//Set locale of output stream
std::wofstream output_file("test_output.txt");
output_file.imbue(loc);
int counter = 0;
while(std::getline(input_file, wstr)) {
for(int i = 0; i < wstr.size(); i++) {
if(wstr[i] == L'|') {
counter++;
}
}
std::wcout << wstr << std::endl;
output_file << wstr << std::endl;
}
input_file.close();
output_file.close();
std::wcout << counter << std::endl;
return 0;
}
setlocale affects the locale of your program.
It has no effect on the default encoding of the text displayed by the terminal window. The terminal window is an independent application, with its own locale.
Pretty much all modern Linux distributions default to UTF-8 as the encoding for the system console and the terminal windows (gnome-terminal, Konsole, xfce4-terminal, etc...).
Changing your program's locale only affects how your application interprets text, but the terminal still expects your application to produce UTF-8 output. The terminal window has no knowledge of the internal locale of the application running in the terminal window. Terminal windows expect applications to produce output using the system locale's character encoding.
It is theoretically possible for the C library to know the default system encoding and silently transcode all the output; however, it does not work this way.
On Linux, you will have to do the work of transcoding Big5 to UTF-8 yourself, using the iconv library.
A cheap shortcut would be for your program to fork and run the iconv command-line tool as a child process, pipe its output to it, and let iconv do the transcoding on the fly.
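The transcoding step might look like the sketch below, which uses the POSIX iconv API. The function name transcode is hypothetical, the encoding names accepted ("BIG5", "UTF-8") depend on the iconv implementation installed, and the sketch assumes stateless encodings, so it does not emit a final reset sequence.

```cpp
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from encoding `from` to encoding `to`.
// Throws std::runtime_error if the conversion is unsupported or the input
// contains an invalid or truncated sequence.
std::string transcode(const std::string& input, const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);  // note the argument order: target first
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("iconv_open: unsupported conversion");

    std::string output;
    char buffer[256];
    // iconv takes a non-const pointer but does not modify the input bytes.
    char* in = const_cast<char*>(input.data());
    std::size_t inLeft = input.size();

    while (inLeft > 0) {
        char* out = buffer;
        std::size_t outLeft = sizeof buffer;
        std::size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
        int err = errno;  // save before any other call can change it
        output.append(buffer, sizeof buffer - outLeft);
        // E2BIG only means the chunk buffer filled up; loop and continue.
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv: invalid or truncated input");
        }
    }
    iconv_close(cd);
    return output;
}
```

Each line read from the Big5 file through a narrow std::ifstream could then be passed through transcode(line, "BIG5", "UTF-8") before being written to the terminal.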
Use std::wcout to print std::wstring instead of std::cout :-)

wcout does not output as desired

I've been trying to write a C++ application for a project and I ran into this issue. Basically:
class OBSClass
{
public:
wstring ClassName;
uint8_t Credit;
uint8_t Level;
OBSClass() : ClassName(), Credit(), Level() {}
OBSClass(wstring name, uint8_t credit, uint8_t hyear)
: ClassName(name), Credit(credit), Level(hyear)
{}
};
In some other file:
vector<OBSClass> AllClasses;
...
AllClasses.push_back(OBSClass(L"Bilişim Sistemleri Mühendisliğine Giriş", 3, 1));
AllClasses.push_back(OBSClass(L"İş Sağlığı ve Güvenliği", 3, 1));
AllClasses.push_back(OBSClass(L"Türk Dili 1", 2, 1));
... (rest omitted, some of entries have non-ASCII characters like 'ş' and 'İ')
I have a function that basically outputs everything in AllClasses. The problem is that wcout does not output as desired.
void PrintClasses()
{
for (size_t i = 0; i < AllClasses.size(); i++)
{
wcout << "Class: " << AllClasses[i].ClassName << "\n";
}
}
The output is 'Class: Bili' and nothing else. The program does not even try to output the other entries; it just hangs. I am on Windows using G++ 6.3.0. I am not using Windows' cmd but bash from MinGW, so encoding should not be a problem (or is it?). Any advice?
Edit: Also source code encoding is not a problem, just checked it is UTF8, default of VSCode
Edit: Also just checked to find out if problem is with string literals.
wstring test;
wcin >> test;
wcout << test;
Entered some non-ASCII characters like 'ö' and 'ş', it works perfectly. What is the problem with wide string literals?
Edit: Here you go
#include <iostream>
#include <string>
#include <vector>
using namespace std;
vector<wstring> testvec;
int main()
{
testvec.push_back(L"Bilişim Sistemleri Mühendisliğine Giriş");
testvec.push_back(L"ıiÖöUuÜü");
testvec.push_back(L"☺☻♥♦♣♠•◘○");
for (size_t i = 0; i < testvec.size(); i++)
wcout << testvec[i] << "\n";
return 0;
}
Compile with G++:
g++ file.cc -O3
This code only outputs 'Bili'. It must be something with the g++ screwing up binary encoding (?), since entering values with wcin then outputting them with wcout does not generate any problem.
The following code works for me, using MinGW-w64 7.3.0 in both MSYS2 Bash, and Windows CMD; and with the source encoded as UTF-8:
#include <iostream>
#include <locale>
#include <string>
#include <codecvt>
int main()
{
std::ios_base::sync_with_stdio(false);
std::locale utf8( std::locale(), new std::codecvt_utf8_utf16<wchar_t> );
std::wcout.imbue(utf8);
std::wstring w(L"Bilişim Sistemleri Mühendisliğine Giriş");
std::wcout << w << '\n';
}
Explanation:
The Windows console doesn't support any sort of 16-bit output through the standard streams; it only handles ANSI, plus partial UTF-8 support. ANSI is the default for backwards compatibility purposes, though Windows 10 1803 does add an option to set the default to UTF-8 (ref). So you need to configure wcout to convert its output to UTF-8.
imbue with a codecvt_utf8_utf16 achieves this; however you also need to disable sync_with_stdio otherwise the stream doesn't even use the facet, it just defers to stdout which has a similar problem.
For writing to other files, I found the same technique works to write UTF-8. For writing a UTF-16 file you need to imbue the wofstream with a UTF-16 facet, see example here, and manually write a BOM.
Commentary: Many people just avoid trying to use wide iostreams completely, due to these issues.
You can write a UTF-8 file using a narrow stream; and have function calls in your code to convert wstring to UTF-8, if you are using wstring internally; you can of course use UTF-8 internally.
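A conversion function of that kind can be written by hand. The sketch below uses the hypothetical name toUtf8 and assumes the wstring holds UTF-16 code units (Windows) or UTF-32 code points (most other platforms); unpaired surrogates are not rejected and would yield invalid UTF-8.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes a wide string as UTF-8 bytes.
std::string toUtf8(const std::wstring& wide) {
    std::string out;
    for (std::size_t i = 0; i < wide.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(wide[i]);

        // Combine a UTF-16 surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < wide.size()) {
            std::uint32_t low = static_cast<std::uint32_t>(wide[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }

        if (cp < 0x80) {                      // 1 byte: plain ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {              // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {            // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                              // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result can be written with an ordinary std::ofstream, for example file << toUtf8(w) << '\n';, with no locale or facet involved.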
Of course you can also write a UTF-16 file using a narrow stream, just not with operator<< from a wstring.
If you have at least Windows 10 1903 (May 2019) and at least Windows Terminal 0.3.2142 (Aug 2019), set the OEM code page to UTF-8 (65001) in the registry:
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage]
"OEMCP"="65001"
and restart. After that you can use this:
#include <iostream>
int main() {
std::string a[] = {
"Bilişim Sistemleri Mühendisliğine Giriş",
"Türk Dili 1",
"İş Sağlığı ve Güvenliği",
"ıiÖöUuÜü",
"☺☻♥♦♣♠•◘○"
};
for (auto s: a) {
std::cout << s << std::endl;
}
}

Stumped with Unicode, Boost, C++, codecvts

In C++, I want to use Unicode to do things. So after falling down the rabbit hole of Unicode, I've managed to end up in a train wreck of confusion, headaches and locales.
But in Boost I've had the unfortunate problem of trying to use Unicode file paths and trying to use the Boost program options library with Unicode input. I've read whatever I could find on the subjects of locales, codecvts, Unicode encodings and Boost.
My current attempt to get things to work is to have a codecvt that takes a UTF-8 string and converts it to the platform's encoding (UTF-8 on POSIX, UTF-16 on Windows), I've been trying to avoid wchar_t.
The closest I've actually gotten is trying to do this with Boost.Locale, to convert from a UTF-8 string to a UTF-32 string on output.
#include <string>
#include <boost/locale.hpp>
#include <locale>
int main(void)
{
std::string data("Testing, 㤹");
std::locale fromLoc = boost::locale::generator().generate("en_US.UTF-8");
std::locale toLoc = boost::locale::generator().generate("en_US.UTF-32");
typedef std::codecvt<wchar_t, char, mbstate_t> cvtType;
cvtType const* toCvt = &std::use_facet<cvtType>(toLoc);
std::locale convLoc = std::locale(fromLoc, toCvt);
std::cout.imbue(convLoc);
std::cout << data << std::endl;
// Output is unconverted -- what?
return 0;
}
I think I had some other kind of conversion working using wide characters, but I really don't know what I'm even doing. I don't know what the right tool for the job is at this point. Help?
Okay, after a long few months I've figured it out, and I'd like to help people in the future.
First of all, the codecvt thing was the wrong way of doing it. Boost.Locale provides a simple way of converting between character sets in its boost::locale::conv namespace. Here's one example (there are others not based on locales).
#include <boost/locale.hpp>
namespace loc = boost::locale;
int main(void)
{
loc::generator gen;
std::locale blah = gen.generate("en_US.utf-32");
std::string UTF8String = "Tésting!";
// from_utf will also work with wide strings as it uses the character size
// to detect the encoding.
std::string converted = loc::conv::from_utf(UTF8String, blah);
// Outputs a UTF-32 string.
std::cout << converted << std::endl;
return 0;
}
As you can see, if you replace the "en_US.utf-32" with "" it'll output in the user's locale.
I still don't know how to make std::cout do this all the time, but the translate() function of Boost.Locale outputs in the user's locale.
As for the filesystem using UTF-8 strings cross platform, it seems that that's possible, here's a link to how to do it.
std::cout.imbue(convLoc);
std::cout << data << std::endl;
This does no conversion, since it uses codecvt<char, char, mbstate_t> which is a no-op. The only standard streams that use codecvt are file-streams. std::cout is not required to perform any conversion at all.
To force Boost.Filesystem to interpret narrow-strings as UTF-8 on windows, use boost::filesystem::imbue with a locale with a UTF-8 ↔ UTF-16 codecvt facet. Boost.Locale has an implementation of the latter.
The Boost filesystem iostream replacement classes work fine with UTF-16 when used with Visual C++.
However, they do not work (in the sense of supporting arbitrary filenames) when used with g++ in Windows - at least as of Boost version 1.47. There is a code comment explaining that; essentially, the Visual C++ standard library provides non-standard wchar_t based constructors that Boost filesystem classes make use of, but g++ does not support these extensions.
A workaround is to use 8.3 short filenames, but this solution is a bit brittle since with old Windows versions the user can turn off automatic generation of short filenames.
Example code for using Boost filesystem in Windows:
#include "CmdLineArgs.h" // CmdLineArgs
#include "throwx.h" // throwX, hopefully
#include "string_conversions.h" // ansiOrFillerFrom( wstring )
#include <boost/filesystem/fstream.hpp> // boost::filesystem::ifstream
#include <iostream> // std::cout, std::cerr, std::endl
#include <stdexcept> // std::runtime_error, std::exception
#include <string> // std::string
#include <stdlib.h> // EXIT_SUCCESS, EXIT_FAILURE
using namespace std;
namespace bfs = boost::filesystem;
inline string ansi( wstring const& ws ) { return ansiWithFillersFrom( ws ); }
int main()
{
try
{
CmdLineArgs const args;
wstring const programPath = args.at( 0 );
hopefully( args.nArgs() == 2 )
|| throwX( "Usage: " + ansi( programPath ) + " FILENAME" );
wstring const filePath = args.at( 1 );
bfs::ifstream stream( filePath ); // Nice Boost ifstream subclass.
hopefully( !stream.fail() )
|| throwX( "Failed to open file '" + ansi( filePath ) + "'" );
string line;
while( getline( stream, line ) )
{
cout << line << endl;
}
hopefully( stream.eof() )
|| throwX( "Failed to list contents of file '" + ansi( filePath ) + "'" );
return EXIT_SUCCESS;
}
catch( exception const& x )
{
cerr << "!" << x.what() << endl;
}
return EXIT_FAILURE;
}

How to write console data into a text file in C++?

I'm working on a file sharing application in C++. I want to write console output into a separate file and at the same time I want to see the output in console also. Can anybody help me...Thanks in advance.
Here we go...
#include <fstream>
using std::ofstream;
#include <iostream>
using std::cout;
using std::endl;
int main( int argc, char* argv[] )
{
ofstream file( "output.txt" ); // create output file stream to file output.txt
if( !file ) // check stream for error (check if it opened the file correctly)
cout << "error opening file for writing." << endl;
for( int i=0; i<argc; ++i ) // argc contains the number of arguments
{
file << argv[i] << endl; // argv contains the char arrays of commandline arguments
cout << argv[i] << endl;
}
file.close(); // always close a file stream when you're done with it.
return 0;
}
PS: OK, read your question wrong (console output/input mixup), but you still get the idea I think.
The idea is to create a class derived from std::streambuf that outputs data to both the file and cout. Then create an instance of it and install it with cout.rdbuf(...);
Here is the code (tested with MSVC++ 2010, should work on any compiler):
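#include <cassert>
#include <fstream>
#include <ios>
#include <iostream>
#include <stdexcept>
#include <streambuf>
#include <vector>
class StreambufDoubler : public std::streambuf {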
public:
StreambufDoubler(std::streambuf* buf1, std::streambuf* buf2) :
_buf1(buf1), _buf2(buf2), _buffer(128)
{
assert(_buf1 && _buf2);
setg(0, 0, 0);
// Standard setp takes only (begin, end); the three-argument form is an MSVC extension.
setp(_buffer.data(), _buffer.data() + _buffer.size());
}
~StreambufDoubler() {
sync();
}
void imbue(const std::locale& loc) {
_buf1->pubimbue(loc);
_buf2->pubimbue(loc);
}
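std::streampos seekpos(std::streampos sp, std::ios_base::openmode which) {
// An absolute position is an offset from the beginning, not from the current position.
return seekoff(std::streamoff(sp), std::ios_base::beg, which);
}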
std::streampos seekoff(std::streamoff off, std::ios_base::seekdir way, std::ios_base::openmode which) {
// Test the flag with &; the original | made this condition always true.
if (which & std::ios_base::in)
throw(std::runtime_error("Can't use this class to read data"));
// which one to return? good question
// anyway seekpos and seekoff should never be called
_buf1->pubseekoff(off, way, which);
return _buf2->pubseekoff(off, way, which);
}
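int overflow(int c) {
// Flush the full buffer to both targets first.
if (sync() != 0)
return EOF;
// Do not store EOF as if it were a character.
if (c != EOF)
sputc(static_cast<char>(c));
return c == EOF ? 0 : c;
}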
int sync() {
_buf1->sputn(pbase(), pptr() - pbase());
_buf2->sputn(pbase(), pptr() - pbase());
setp(_buffer.data(), _buffer.data() + _buffer.size());
return _buf1->pubsync() | _buf2->pubsync();
}
private:
std::streambuf* _buf1;
std::streambuf* _buf2;
std::vector<char> _buffer;
};
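int main() {
std::ofstream myFile("file.txt");
StreambufDoubler doubler(std::cout.rdbuf(), myFile.rdbuf());
std::streambuf* original = std::cout.rdbuf(&doubler);
// your code here
std::cout.flush();
// Restore the original buffer: std::cout outlives doubler and must not keep a dangling pointer.
std::cout.rdbuf(original);
return 0;
}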
However note that a better implementation would use templates, a list of streambufs instead of just two, etc. but I wanted to keep it as simple as possible.
What you want actually is to follow in real time the lines added to the log your application writes.
In the Unix world, there's a simple tool that has that very function, it's called tail.
Call tail -f your_file and you will see the file contents appearing in almost real time in the console.
Unfortunately, tail is not a standard tool in Windows (which I suppose you're using, according to your question's tags).
It can however be found in the GnuWin32 package, as well as MSYS.
There are also several native tools for Windows with the same functionality, I'm personally using Tail For Win32, which is licensed under the GPL.
So, to conclude, I think your program should not output the same data to different streams, as it might slow it down without real benefits, while there are established tools that have been designed specifically to solve that problem, without the need to develop anything.
I don't program in C++, but here is my advice: create a new class that takes an input stream (istream in C++, or something like it) and transfers every incoming byte to both standard output and the file.
I am sure there is a way to replace the standard output stream with the aforementioned class. As I remember, the standard output is some kind of property of cout.
Then again, I spent one week on C++ more than half a year ago, so there is a chance that everything I've said is garbage.