Reading a file that contains Chinese characters (C++)

I have trouble reading a file that contains Chinese characters. I know that the file is encoded in Big5.
Here is my example file (test.txt); I can't include it here because of the Chinese characters: https://gist.github.com/haruka98/974ca2c034ebd8fe7eeac4124739fc41
This is my minimal code example (main.cpp); the code I'm actually using breaks each line down and processes the individual fields.
#include <string>
#include <fstream>
#include <iostream>
int main(int argc, char* argv[]) {
setlocale(LC_ALL, "Chinese-traditional");
std::wstring wstr;
std::wifstream input_file("test.txt");
std::wofstream output_file("test_output.txt");
int counter = 0;
while(std::getline(input_file, wstr)) {
for(int i = 0; i < wstr.size(); i++) {
if(wstr[i] == L'|') {
counter++;
}
}
output_file << wstr << std::endl;
}
input_file.close();
output_file.close();
std::cout << counter << std::endl;
return 0;
}
To compile my program:
g++ -o test main.cpp -std=c++17
On Windows 10 I get the expected output: the entire file is copied to "test_output.txt" and 129 is printed in the terminal.
On Linux (Debian 9) the terminal output is 4, and "test_output.txt" contains only the first line and the "1|" from the second.
Here is what I tried:
My first guess was the CR LF and LF issue when using both Windows and Linux. But testing both CR LF and LF with the file did not help.
Then I thought that the "Chinese-traditional" might not work on Linux. I replaced it with "zh_TW.BIG5" but did not get the expected result either.

First check you have the locale for "Chinese-traditional" installed. On Linux this is zh_TW.UTF-8. You can check using locale -a. If it's not listed, install it:
sudo locale-gen zh_TW.UTF-8
sudo update-locale
(There's a list of locales here with their names on Linux and Windows.)
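The same check can be made from inside the program, because std::locale throws std::runtime_error when a named locale is not available. A minimal sketch, assuming a hypothetical helper called first_available_locale (the name and the fallback to the classic "C" locale are my own choices, not something the standard library provides):

```cpp
#include <initializer_list>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the first locale in `names` that the C++
// runtime can construct, or the classic "C" locale if none is available.
std::locale first_available_locale(std::initializer_list<const char*> names) {
    for (const char* name : names) {
        try {
            return std::locale(name);
        } catch (const std::runtime_error&) {
            // Locale not installed (or name not recognised); try the next one.
        }
    }
    return std::locale::classic();
}
```

For example, first_available_locale({"zh_TW.UTF-8", "zh_TW.utf8"}) could be used to build loc, and loc.name() == "C" would then signal that neither spelling is installed.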
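Then use imbue on the input and output streams to set each stream's locale. The locale imbued on the input stream has to match the encoding of the file: zh_TW.UTF-8 decodes UTF-8 input, so a file that is still Big5-encoded needs a Big5 locale such as zh_TW.BIG5 instead (or has to be converted to UTF-8 first).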
By default, std::wcout is synchronized with the underlying C stdout stream, which uses an ASCII mapping and displays ? in place of Unicode characters it cannot handle. If you want to print Unicode characters to the terminal, you have to turn that synchronization off and then imbue std::wcout with the locale:
std::ios_base::sync_with_stdio(false);
std::wcout.imbue(loc);
Amended version of your code:
#include <string>
#include <locale>
#include <fstream>
#include <iostream>
int main(int argc, char* argv[])
{
auto loc = std::locale("zh_TW.utf8");
//Disable synchronisation with stdio & set locale
std::ios::sync_with_stdio(false);
std::wcout.imbue(loc);
//Set locale of input stream
std::wstring wstr;
std::wifstream input_file("test.txt");
input_file.imbue(loc);
//Set locale of output stream
std::wofstream output_file("test_output.txt");
output_file.imbue(loc);
int counter = 0;
while(std::getline(input_file, wstr)) {
for(int i = 0; i < wstr.size(); i++) {
if(wstr[i] == L'|') {
counter++;
}
}
std::wcout << wstr << std::endl;
output_file << wstr << std::endl;
}
input_file.close();
output_file.close();
std::wcout << counter << std::endl;
return 0;
}

setlocale affects the locale of your program.
It has no effect on the default encoding of the text displayed by the terminal window. The terminal window is an independent application, with its own locale.
Pretty much all modern Linux distributions default to UTF-8 as the encoding for the system console and the terminal windows (gnome-terminal, Konsole, xfce4-terminal, etc...).
Changing your program's locale only affects how your application interprets text, but the terminal still expects your application to produce UTF-8 output. The terminal window has no knowledge of the internal locale of the application running in the terminal window. Terminal windows expect applications to produce output using the system locale's character encoding.
It is theoretically possible for the C library to know the default system encoding and silently transcode all the output, however it does not work this way.
You will have to do all the work of transcoding big5 to UTF-8, using the iconv library, on Linux.
A low cost, cheap shortcut, would be for your program to fork and run the iconv command line tool as a child process, and pipe its output to it, then let iconv do the transcoding on the fly.
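As a rough illustration of the iconv library route, the sketch below wraps the POSIX iconv API from <iconv.h> in a hypothetical helper named transcode. The name, the bool plus out-parameter shape and the 4096-byte chunk size are assumptions of mine. It also assumes the platform's iconv recognises the names "BIG5" and "UTF-8", and it skips the final flush call that stateful encodings would need.

```cpp
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <string>

// Hypothetical helper: converts `input` from encoding `from` to encoding `to`.
// Returns false if the conversion is unsupported or the input is malformed;
// `output` then holds whatever was converted before the error.
bool transcode(const std::string& input, const char* from, const char* to,
               std::string& output) {
    output.clear();
    iconv_t cd = iconv_open(to, from);  // note the order: target encoding first
    if (cd == (iconv_t)-1) {
        return false;  // this encoding pair is not available
    }

    std::string in_copy(input);  // iconv takes a non-const char**
    char* in_ptr = &in_copy[0];
    std::size_t in_left = in_copy.size();
    char chunk[4096];
    bool ok = true;

    while (in_left > 0) {
        char* out_ptr = chunk;
        std::size_t out_left = sizeof chunk;
        const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        const int err = errno;  // save before any other library call
        output.append(chunk, sizeof chunk - out_left);
        // E2BIG only means the chunk is full; anything else is a real error.
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            ok = false;
            break;
        }
    }
    iconv_close(cd);
    return ok;
}
```

With this, each line could be read from a narrow std::ifstream, passed through transcode(line, "BIG5", "UTF-8", utf8_line), and the result written to std::cout or a narrow std::ofstream.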

Use std::wcout rather than std::cout to print a std::wstring :-)

Related

Is there a proper way to receive input from console in UTF-8 encoding?

When getting input from std::cin in windows, the input is apparently always in the encoding windows-1252 (the default for the host machine in my case) despite all the configurations made, that apparently only affect to the output. Is there a proper way to capture input in windows in UTF-8 encoding?
For instance, let's check out this program:
#include <iostream>
#include <locale>
#include <string>
int main(int argc, char* argv[])
{
    std::cin.imbue(std::locale("es_ES.UTF-8"));
    std::cout.imbue(std::locale("es_ES.UTF-8"));
    std::cout << "ñeñeñe> ";
    std::string in;
    std::getline( std::cin, in );
    std::cout << in;
}
I compiled it with Visual Studio 2022 on a Windows machine with a Spanish locale. The source code is in UTF-8. When executing the resulting program (in a Windows PowerShell session, after running chcp 65001 to set the code page to UTF-8), I see the following:
PS C:\> .\test_program.exe
ñeñeñe> ñeñeñe
e e e
The first "ñeñeñe" is correct: the "ñ" character is displayed correctly on the console. So far, so good. The user input is echoed back to the console correctly: another good point. But when the program writes the captured string back to the output, the "ñ" character is replaced by an empty space.
When debugging this program, I see that the variable "in" has captured the input in an encoding that is not UTF-8: it uses only one byte for the "ñ", whereas in UTF-8 that character takes two. My conclusion is that the input is not affected by the chcp command. Am I doing something wrong?
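The byte-count observation can be turned into a quick diagnostic: "ñ" is the single byte 0xF1 in windows-1252 but the two bytes 0xC3 0xB1 in UTF-8, so a structural UTF-8 check on the captured string tells the two cases apart. A minimal sketch, where is_valid_utf8 is a hypothetical helper; it checks lead and continuation bytes only and does not reject every overlong or surrogate form:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: structural UTF-8 check (lead byte plus the right
// number of continuation bytes). Not a full validator: some overlong
// 3- and 4-byte forms and encoded surrogates are not rejected.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (lead < 0x80) {
            extra = 0;  // ASCII
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            extra = 1;  // 2-byte sequence
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            extra = 2;  // 3-byte sequence
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            extra = 3;  // 4-byte sequence
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        if (extra > s.size() - i - 1) {
            return false;  // sequence is cut off at the end of the string
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;  // not a continuation byte
            }
        }
        i += extra + 1;
    }
    return true;
}
```

Calling is_valid_utf8(in) right after std::getline would return false for input captured as windows-1252 containing "ñ", and true once the input really arrives as UTF-8.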
UPDATE
Somebody asked me to see what happens when changing to wcout/wcin:
std::wcout << u"ñeñeñe> ";
std::wstring in;
std::getline(std::wcin, in);
std::wcout << in;
Behaviour:
PS C:\> .\test.exe
0,000,7FF,6D1,B76,E30ñeñeñe
e e e
Another attempt (setting the string as L"ñeñeñe"):
ñeñeñe> ñeñeñe
e e e
Leaving it as is:
std::wcout << "ñeñeñe> ";
Result is:
eee>
This is the closest to the solution I've found so far:
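#include <fcntl.h>   // _O_WTEXT
#include <io.h>      // _setmode
#include <stdio.h>   // _fileno
#include <iostream>
#include <string>
int main(int argc, char* argv[])
{
    // Switch both standard streams to wide-text mode.
    _setmode(_fileno(stdout), _O_WTEXT);
    _setmode(_fileno(stdin), _O_WTEXT);
    std::wcout << L"ñeñeñe";
    std::wstring in;
    std::getline(std::wcin, in);
    std::wcout << in;
    return 0;
}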
The solution described here goes in the right direction. The catch is that stdin and stdout must both use the same mode, because the console echo rewrites the input. The remaining problem is having to write the string literals with \uXXXX codes. I am still working out how to avoid that, or whether to use #defines to make the text literals clearer.

Multiple calls to setlocale

I am trying to figure out how Unicode is supported in C++.
When I want to output multilingual text to console, I call std::setlocale. However I noticed that the result depends on prior calls to setlocale.
Consider the following example. If run without arguments it calls setlocale once, otherwise it makes a prior call to setlocale to get the value of current locale and restore it at the end of the function.
#include <iostream>
#include <locale>
using namespace std;
int main(int argc, char** argv)
{
char *current_locale = 0;
if (argc > 1) {
current_locale = setlocale(LC_ALL, NULL);
wcout << L"Current output locale: " << current_locale << endl;
}
char* new_locale = setlocale(LC_ALL, "ru_RU.UTF8");
if (! new_locale)
wcout << L"failed to set new locale" << endl;
else
wcout << L"new locale: " << new_locale << endl;
wcout << L"Привет!" << endl;
if (current_locale) setlocale(LC_ALL, current_locale);
return 0;
}
The output is different:
:~> ./check_locale
new locale: ru_RU.UTF8
Привет!
:~> ./check_locale 1
Current output locale: C
new locale: ru_RU.UTF8
??????!
Is there something that setlocale(LC_ALL, NULL) does that needs to be taken care of in future setlocale calls?
The compiler is g++ 7.5.0 or clang++ 7.0.1. And the console is a linux console in a graphical terminal.
More details on the system config: OpenSUSE 15.1, linux 4.12, glibc 2.26, libstdc++6-10.2.1
Is there something that setlocale(LC_ALL, NULL) does that needs to be taken care of in future setlocale calls?
No, setlocale(..., NULL) does not modify the current locale. The following code is fine:
setlocale(LC_ALL, NULL);
setlocale(LC_ALL, "ru_RU.UTF8");
wprintf(L"Привет!\n");
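One related detail for code that wants to restore the previous locale, as the question's example does: the C standard allows the string returned by setlocale to be overwritten by a later setlocale call, so it is safer to copy the name before changing the locale. A minimal sketch, assuming a hypothetical RAII helper called ScopedCLocale (my own name, not a library type). Restoring the locale this way does not change the conversion of a stream that has already been oriented.

```cpp
#include <clocale>
#include <string>

// Hypothetical RAII helper: switches the C locale for `category` and restores
// the previous one on destruction. The previous name is copied into a
// std::string because the pointer returned by setlocale may be overwritten
// by the next setlocale call.
class ScopedCLocale {
    int category_;
    std::string saved_;
    bool ok_;
public:
    ScopedCLocale(int category, const char* name)
        : category_(category), ok_(false) {
        const char* current = std::setlocale(category, nullptr);  // query only
        if (current != nullptr) {
            saved_ = current;
        }
        ok_ = std::setlocale(category, name) != nullptr;
    }
    ~ScopedCLocale() {
        if (!saved_.empty()) {
            std::setlocale(category_, saved_.c_str());
        }
    }
    ScopedCLocale(const ScopedCLocale&) = delete;
    ScopedCLocale& operator=(const ScopedCLocale&) = delete;

    // True if the requested locale was actually installed and selected.
    bool ok() const { return ok_; }
};
```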
However the following code will fail:
wprintf(L"anything"); // or even just `fwide(stdout, 1);`
setlocale(LC_ALL, "ru_RU.UTF8");
wprintf(L"Привет!\n");
The problem is that a stream has its own locale, which is determined at the point the stream's orientation is changed to wide.
// here stdout has no orientation and no locale associated with it
wprintf(L"anything");
// `stdout` stream orientation switches to wide stream
// current locale is used - `stdout` has C locale
setlocale(LC_ALL, "ru_RU.UTF8");
wprintf(L"Привет!\n");
// `stdout` is wide oriented
// current locale is ru_RU.UTF-8
// __but__ the locale of `stdout` is still C and cannot be changed!
The only documentation I found of this is at gnu.org, "Stream and I18N" (emphasis mine):
Since a stream is created in the unoriented state it has at that point no conversion associated with it. The conversion which will be used is determined by the LC_CTYPE category selected at the time the stream is oriented. If the locales are changed at the runtime this might produce surprising results unless one pays attention. This is just another good reason to orient the stream explicitly as soon as possible, perhaps with a call to fwide.
You can:
Use a separate locale for the C++ stream and the C FILE (see here):
std::ios_base::sync_with_stdio(false);
std::wcout.imbue(std::locale("ru_RU.utf8"));
Reopen stdout:
wprintf(L""); // stdout has C locale
char* new_locale = setlocale(LC_ALL, "ru_RU.UTF8");
freopen("/dev/stdout", "w", stdout); // stdout has no stream orientation
wprintf(L"Привет!\n"); // stdout is wide and ru_RU locale
I think (untested) that in glibc you can even reopen stdout with an explicit coded character set (see GNU opening streams); the mode option is spelled ccs and takes a character set name rather than a locale name:
freopen("/dev/stdout", "w,ccs=UTF-8", stdout);
std::wcout << L"Привет!\n"; // fine
In any case, try to set locale as soon as possible before doing anything else.
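The "set the locale first, then orient the stream explicitly" advice can be sketched as a small helper. set_locale_then_orient is a hypothetical name; the sketch uses only std::setlocale and std::fwide, and it deliberately leaves the stream untouched when the locale cannot be set, so that the "C" conversion is not locked in by accident.

```cpp
#include <clocale>
#include <cstdio>
#include <cwchar>

// Hypothetical helper: selects `locale_name` for the whole program and then
// immediately gives `stream` wide orientation, so the stream picks up the
// conversion of the new locale. Returns true only if both steps succeeded.
bool set_locale_then_orient(std::FILE* stream, const char* locale_name) {
    if (std::setlocale(LC_ALL, locale_name) == nullptr) {
        return false;  // stream left unoriented; caller may try another locale
    }
    // fwide with a positive mode requests wide orientation and returns a
    // positive value if the stream is wide-oriented afterwards.
    return std::fwide(stream, 1) > 0;
}
```

Calling set_locale_then_orient(stdout, "ru_RU.UTF8") as the first statement of main, before any output, would be the intended use. std::fwide(stream, 0) queries the orientation without changing it.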

C++ How to write japanese characters in file with [duplicate]

I am using Visual Studio C++ 2008 (Express). When I run the below code, the wostream (both std::wcout, and std::wfstream) stops outputting at the first non-ASCII character (in this case Chinese) encountered. Plain ASCII characters print fine. However, in the debugger, I can see that the wstrings are in fact properly populated with Chinese characters, and the output << ... is in fact getting executed.
The project settings in the Visual Studio solution are set to "Use Unicode Character Set". Why is std::wostream failing to output Unicode characters outside of the ASCII range?
void PrintTable(const std::vector<std::vector<std::wstring>> &table, std::wostream& output) {
for (unsigned int i=0; i < table.size(); ++i) {
for (unsigned int j=0; j < table[i].size(); ++j) {
output << table[i][j] << L"\t";
}
//output << std::endl;
}
}
void TestUnicodeSingleTableChinesePronouns() {
FileProcessor p("SingleTableChinesePronouns.docx");
FileProcessor::iterator fileIterator;
std::wofstream myFile("data.bin", std::ios::out | std::ios::binary);
for(fileIterator = p.begin(); fileIterator != p.end(); ++fileIterator) {
PrintTable(*fileIterator, myFile);
PrintTable(*fileIterator, std::wcout);
std::cout<<std::endl<<"---------------------------------------"<<std::endl;
}
myFile.flush();
myFile.close();
}
By default the locale that std::wcout and std::wofstream use for certain operations is the "C" locale, which is not required to support non-ascii characters (or any character outside C++'s basic character set). Change the locale to one that supports the characters you want to use.
The simplest thing to do on Windows is unfortunately to use legacy codepages, however you really should avoid that. Legacy codepages are bad news. Instead you should use Unicode, whether UTF-8, UTF-16, or whatever. Also you'll have to work around Windows' unfortunate console model that makes writing to the console very different from writing to other kinds of output streams. You might need to find or write your own output buffer that specifically handles the console (or maybe file a bug asking Microsoft to fix it).
Here's an example of console output:
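#include <Windows.h>
#include <streambuf>
#include <iostream>
// Unbuffered wide stream buffer that forwards each character to the Windows console.
class Console_streambuf
    : public std::basic_streambuf<wchar_t>
{
    HANDLE m_out;
public:
    Console_streambuf(HANDLE out) : m_out(out) {}
    virtual int_type overflow(int_type c = traits_type::eof())
    {
        // overflow(eof) is only a flush request; nothing is buffered here.
        if (traits_type::eq_int_type(c, traits_type::eof()))
            return traits_type::not_eof(c);
        wchar_t wc = traits_type::to_char_type(c);
        DWORD numberOfCharsWritten;
        BOOL res = WriteConsoleW(m_out, &wc, 1, &numberOfCharsWritten, NULL);
        // Report failure so the stream sets badbit instead of silently dropping output.
        return res ? c : traits_type::eof();
    }
};
int main() {
    Console_streambuf out(GetStdHandle(STD_OUTPUT_HANDLE));
    auto old_buf = std::wcout.rdbuf(&out);
    std::wcout << L"привет, 猫咪!\n";
    std::wcout.rdbuf(old_buf); // replace old buffer so that destruction can happen correctly. FIXME: use RAII to do this in an exception safe manner.
}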
You can do UTF-8 output to a file like this (although I'm not sure VS2008 supports codecvt_utf8_utf16; note also that <codecvt> has been deprecated since C++17):
#include <codecvt>
#include <fstream>
#include <locale>
int main() {
std::wofstream myFile("data.bin", std::ios::out | std::ios::binary);
myFile.imbue(std::locale(myFile.getloc(),new std::codecvt_utf8_utf16<wchar_t>));
myFile << L"привет, 猫咪!";
}
Include the following header file:
#include <locale>
At the start of main, add the following line:
std::locale::global(std::locale("chinese"));
This helps to set the proper locale.

wcout does not output as desired

I've been trying to write a C++ application for a project and I ran into this issue. Basically:
class OBSClass
{
public:
wstring ClassName;
uint8_t Credit;
uint8_t Level;
OBSClass() : ClassName(), Credit(), Level() {}
OBSClass(wstring name, uint8_t credit, uint8_t hyear)
: ClassName(name), Credit(credit), Level(hyear)
{}
};
In some other file:
vector<OBSClass> AllClasses;
...
AllClasses.push_back(OBSClass(L"Bilişim Sistemleri Mühendisliğine Giriş", 3, 1));
AllClasses.push_back(OBSClass(L"İş Sağlığı ve Güvenliği", 3, 1));
AllClasses.push_back(OBSClass(L"Türk Dili 1", 2, 1));
... (rest omitted, some of entries have non-ASCII characters like 'ş' and 'İ')
I have a function that basically outputs everything in AllClasses; the problem is that wcout does not output as desired.
void PrintClasses()
{
for (size_t i = 0; i < AllClasses.size(); i++)
{
wcout << "Class: " << AllClasses[i].ClassName << "\n";
}
}
The output is 'Class: Bili' and nothing else. The program does not even try to output the other entries and just hangs. I am on Windows using G++ 6.3.0. I am not using Windows' cmd; I am using bash from MinGW, so encoding should not be a problem (or is it?). Any advice?
Edit: Also source code encoding is not a problem, just checked it is UTF8, default of VSCode
Edit: Also just checked to find out if problem is with string literals.
wstring test;
wcin >> test;
wcout << test;
Entered some non-ASCII characters like 'ö' and 'ş', it works perfectly. What is the problem with wide string literals?
Edit: Here you go
#include <iostream>
#include <string>
#include <vector>
using namespace std;
vector<wstring> testvec;
int main()
{
testvec.push_back(L"Bilişim Sistemleri Mühendisliğine Giriş");
testvec.push_back(L"ıiÖöUuÜü");
testvec.push_back(L"☺☻♥♦♣♠•◘○");
for (size_t i = 0; i < testvec.size(); i++)
wcout << testvec[i] << "\n";
return 0;
}
Compile with G++:
g++ file.cc -O3
This code only outputs 'Bili'. It must be something with the g++ screwing up binary encoding (?), since entering values with wcin then outputting them with wcout does not generate any problem.
The following code works for me, using MinGW-w64 7.3.0 in both MSYS2 Bash, and Windows CMD; and with the source encoded as UTF-8:
#include <iostream>
#include <locale>
#include <string>
#include <codecvt>
int main()
{
std::ios_base::sync_with_stdio(false);
std::locale utf8( std::locale(), new std::codecvt_utf8_utf16<wchar_t> );
std::wcout.imbue(utf8);
std::wstring w(L"Bilişim Sistemleri Mühendisliğine Giriş");
std::wcout << w << '\n';
}
Explanation:
The Windows console doesn't support any sort of 16-bit output; it supports only ANSI and, partially, UTF-8. So you need to configure wcout to convert the output to UTF-8. ANSI is the default for backwards compatibility purposes, though Windows 10 1803 does add an option to set that to UTF-8 (ref).
imbue with a codecvt_utf8_utf16 achieves this; however you also need to disable sync_with_stdio otherwise the stream doesn't even use the facet, it just defers to stdout which has a similar problem.
For writing to other files, I found the same technique works to write UTF-8. For writing a UTF-16 file you need to imbue the wofstream with a UTF-16 facet, see example here, and manually write a BOM.
Commentary: Many people just avoid trying to use wide iostreams completely, due to these issues.
You can write a UTF-8 file using a narrow stream; and have function calls in your code to convert wstring to UTF-8, if you are using wstring internally; you can of course use UTF-8 internally.
Of course you can also write a UTF-16 file using a narrow stream, just not with operator<< from a wstring.
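For the "convert wstring to UTF-8 yourself and write through a narrow stream" route, here is a hand-rolled sketch. wide_to_utf8 is a hypothetical helper name. It assumes the wstring holds UTF-16 (as on Windows) or UTF-32 (as on Linux) code units and does no validation beyond combining surrogate pairs; an unpaired surrogate is passed through as a 3-byte sequence.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes a wide string as UTF-8. Works whether wchar_t
// is 16 bits (surrogate pairs are combined) or 32 bits. Performs no
// validation of the input code points.
std::string wide_to_utf8(const std::wstring& w) {
    std::string out;
    for (std::size_t i = 0; i < w.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(w[i]);
        // Combine a UTF-16 surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < w.size()) {
            const std::uint32_t low = static_cast<std::uint32_t>(w[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result can be written with a plain std::ofstream opened in binary mode, or with std::cout when the terminal expects UTF-8, without imbuing any locale.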
If you have at least Windows 10 1903 (May 2019) and at least Windows Terminal 0.3.2142 (Aug 2019), you can set the OEM code page to UTF-8 by importing this registry file:
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage]
"OEMCP"="65001"
and restart. After that you can use this:
#include <iostream>
#include <string>
int main() {
std::string a[] = {
"Bilişim Sistemleri Mühendisliğine Giriş",
"Türk Dili 1",
"İş Sağlığı ve Güvenliği",
"ıiÖöUuÜü",
"☺☻♥♦♣♠•◘○"
};
for (auto s: a) {
std::cout << s << std::endl;
}
}
