Stumped with Unicode, Boost, C++, codecvts - c++

In C++, I want to use Unicode to do things. So after falling down the rabbit hole of Unicode, I've managed to end up in a train wreck of confusion, headaches and locales.
But in Boost I've had the unfortunate problem of trying to use Unicode file paths and trying to use the Boost program options library with Unicode input. I've read whatever I could find on the subjects of locales, codecvts, Unicode encodings and Boost.
My current attempt to get things to work is to have a codecvt that takes a UTF-8 string and converts it to the platform's encoding (UTF-8 on POSIX, UTF-16 on Windows), I've been trying to avoid wchar_t.
The closest I've actually gotten is trying to do this with Boost.Locale, to convert from a UTF-8 string to a UTF-32 string on output.
#include <string>
#include <boost/locale.hpp>
#include <locale>
int main(void)
{
std::string data("Testing, 㤹");
std::locale fromLoc = boost::locale::generator().generate("en_US.UTF-8");
std::locale toLoc = boost::locale::generator().generate("en_US.UTF-32");
typedef std::codecvt<wchar_t, char, mbstate_t> cvtType;
cvtType const* toCvt = &std::use_facet<cvtType>(toLoc);
std::locale convLoc = std::locale(fromLoc, toCvt);
std::cout.imbue(convLoc);
std::cout << data << std::endl;
// Output is unconverted -- what?
return 0;
}
I think I had some other kind of conversion working using wide characters, but I really don't know what I'm even doing. I don't know what the right tool for the job is at this point. Help?

Okay, after a long few months I've figured it out, and I'd like to help people in the future.
First of all, the codecvt thing was the wrong way of doing it. Boost.Locale provides a simple way of converting between character sets in its boost::locale::conv namespace. Here's one example (there's others not based on locales).
#include <boost/locale.hpp>
namespace loc = boost::locale;
int main(void)
{
loc::generator gen;
std::locale blah = gen.generate("en_US.utf-32");
std::string UTF8String = "Tésting!";
// from_utf will also work with wide strings as it uses the character size
// to detect the encoding.
std::string converted = loc::conv::from_utf(UTF8String, blah);
// Outputs a UTF-32 string.
std::cout << converted << std::endl;
return 0;
}
As you can see, if you replace the "en_US.utf-32" with "" it'll output in the user's locale.
I still don't know how to make std::cout do this all the time, but the translate() function of Boost.Locale outputs in the user's locale.
As for the filesystem using UTF-8 strings cross platform, it seems that that's possible, here's a link to how to do it.

std::cout.imbue(convLoc);
std::cout << data << std::endl;
This does no conversion, since it uses codecvt<char, char, mbstate_t> which is a no-op. The only standard streams that use codecvt are file-streams. std::cout is not required to perform any conversion at all.
To force Boost.Filesystem to interpret narrow-strings as UTF-8 on windows, use boost::filesystem::imbue with a locale with a UTF-8 ↔ UTF-16 codecvt facet. Boost.Locale has an implementation of the latter.

The Boost filesystem iostream replacement classes work fine with UTF-16 when used with Visual C++.
However, they do not work (in the sense of supporting arbitrary filenames) when used with g++ in Windows - at least as of Boost version 1.47. There is a code comment explaining that; essentially, the Visual C++ standard library provides non-standard wchar_t based constructors that Boost filesystem classes make use of, but g++ does not support these extensions.
A workaround is to use 8.3 short filenames, but this solution is a bit brittle since with old Windows versions the user can turn off automatic generation of short filenames.
Example code for using Boost filesystem in Windows:
#include "CmdLineArgs.h" // CmdLineArgs
#include "throwx.h" // throwX, hopefully
#include "string_conversions.h" // ansiOrFillerFrom( wstring )
#include <boost/filesystem/fstream.hpp> // boost::filesystem::ifstream
#include <iostream> // std::cout, std::cerr, std::endl
#include <stdexcept> // std::runtime_error, std::exception
#include <string> // std::string
#include <stdlib.h> // EXIT_SUCCESS, EXIT_FAILURE
using namespace std;
namespace bfs = boost::filesystem;
inline string ansi( wstring const& ws ) { return ansiWithFillersFrom( ws ); }
int main()
{
try
{
CmdLineArgs const args;
wstring const programPath = args.at( 0 );
hopefully( args.nArgs() == 2 )
|| throwX( "Usage: " + ansi( programPath ) + " FILENAME" );
wstring const filePath = args.at( 1 );
bfs::ifstream stream( filePath ); // Nice Boost ifstream subclass.
hopefully( !stream.fail() )
|| throwX( "Failed to open file '" + ansi( filePath ) + "'" );
string line;
while( getline( stream, line ) )
{
cout << line << endl;
}
hopefully( stream.eof() )
|| throwX( "Failed to list contents of file '" + ansi( filePath ) + "'" );
return EXIT_SUCCESS;
}
catch( exception const& x )
{
cerr << "!" << x.what() << endl;
}
return EXIT_FAILURE;
}

Related

C++ How to write japanese characters in file with [duplicate]

I am using Visual Studio C++ 2008 (Express). When I run the below code, the wostream (both std::wcout, and std::wfstream) stops outputting at the first non-ASCII character (in this case Chinese) encountered. Plain ASCII characters print fine. However, in the debugger, I can see that the wstrings are in fact properly populated with Chinese characters, and the output << ... is in fact getting executed.
The project settings in the Visual Studio solution are set to "Use Unicode Character Set". Why is std::wostream failing to output Unicode characters outside of the ASCII range?
void PrintTable(const std::vector<std::vector<std::wstring>> &table, std::wostream& output) {
for (unsigned int i=0; i < table.size(); ++i) {
for (unsigned int j=0; j < table[i].size(); ++j) {
output << table[i][j] << L"\t";
}
//output << std::endl;
}
}
void TestUnicodeSingleTableChinesePronouns() {
FileProcessor p("SingleTableChinesePronouns.docx");
FileProcessor::iterator fileIterator;
std::wofstream myFile("data.bin", std::ios::out | std::ios::binary);
for(fileIterator = p.begin(); fileIterator != p.end(); ++fileIterator) {
PrintTable(*fileIterator, myFile);
PrintTable(*fileIterator, std::wcout);
std::cout<<std::endl<<"---------------------------------------"<<std::endl;
}
myFile.flush();
myFile.close();
}
By default the locale that std::wcout and std::wofstream use for certain operations is the "C" locale, which is not required to support non-ascii characters (or any character outside C++'s basic character set). Change the locale to one that supports the characters you want to use.
The simplest thing to do on Windows is unfortunately to use legacy codepages, however you really should avoid that. Legacy codepages are bad news. Instead you should use Unicode, whether UTF-8, UTF-16, or whatever. Also you'll have to work around Windows' unfortunate console model that makes writing to the console very different from writing to other kinds of output streams. You might need to find or write your own output buffer that specifically handles the console (or maybe file a bug asking Microsoft to fix it).
Here's an example of console output:
#include <Windows.h>
#include <streambuf>
#include <iostream>
class Console_streambuf
: public std::basic_streambuf<wchar_t>
{
HANDLE m_out;
public:
Console_streambuf(HANDLE out) : m_out(out) {}
virtual int_type overflow(int_type c = traits_type::eof())
{
wchar_t wc = c;
DWORD numberOfCharsWritten;
BOOL res = WriteConsoleW(m_out, &wc, 1, &numberOfCharsWritten, NULL);
(void)res;
return 1;
}
};
int main() {
Console_streambuf out(GetStdHandle(STD_OUTPUT_HANDLE));
auto old_buf = std::wcout.rdbuf(&out);
std::wcout << L"привет, 猫咪!\n";
std::wcout.rdbuf(old_buf); // replace old buffer so that destruction can happen correctly. FIXME: use RAII to do this in an exception safe manner.
}
You can do UTF-8 output to a file like this (although I'm not sure VS2008 supports codecvt_utf8_utf16):
#include <codecvt>
#include <fstream>
int main() {
std::wofstream myFile("data.bin", std::ios::out | std::ios::binary);
myFile.imbue(std::locale(myFile.getloc(),new std::codecvt_utf8_utf16<wchar_t>));
myFile << L"привет, 猫咪!";
}
Include the following header file
#include <locale>
at the start of main, add the following line.
std::locale::global(std::locale("chinese"));
This helps to set the proper locale.

wcout does not output as desired

I've been trying to write a C++ application for a project and I ran into this issue. Basically:
class OBSClass
{
public:
wstring ClassName;
uint8_t Credit;
uint8_t Level;
OBSClass() : ClassName(), Credit(), Level() {}
OBSClass(wstring name, uint8_t credit, uint8_t hyear)
: ClassName(name), Credit(credit), Level(hyear)
{}
};
In some other file:
vector<OBSClass> AllClasses;
...
AllClasses.push_back(OBSClass(L"Bilişim Sistemleri Mühendisliğine Giriş", 3, 1));
AllClasses.push_back(OBSClass(L"İş Sağlığı ve Güvenliği", 3, 1));
AllClasses.push_back(OBSClass(L"Türk Dili 1", 2, 1));
... (rest omitted, some of entries have non-ASCII characters like 'ş' and 'İ')
I have a function basically outputs everything in AllClasses, the problem is wcout does not output as desired.
void PrintClasses()
{
for (size_t i = 0; i < AllClasses.size(); i++)
{
wcout << "Class: " << AllClasses[i].ClassName << "\n";
}
}
Output is 'Class: Bili' and nothing else. Program does not even tries to output other entries and just hangs. I am on windows using G++ 6.3.0. And I am not using Windows' cmd, I am using bash from mingw, so encoding will not be problem (or isn't it?). Any advice?
Edit: Also source code encoding is not a problem, just checked it is UTF8, default of VSCode
Edit: Also just checked to find out if problem is with string literals.
wstring test;
wcin >> test;
wcout << test;
Entered some non-ASCII characters like 'ö' and 'ş', it works perfectly. What is the problem with wide string literals?
Edit: Here you go
#include <iostream>
#include <string>
#include <vector>
using namespace std;
vector<wstring> testvec;
int main()
{
testvec.push_back(L"Bilişim Sistemleri Mühendisliğine Giriş");
testvec.push_back(L"ıiÖöUuÜü");
testvec.push_back(L"☺☻♥♦♣♠•◘○");
for (size_t i = 0; i < testvec.size(); i++)
wcout << testvec[i] << "\n";
return 0;
}
Compile with G++:
g++ file.cc -O3
This code only outputs 'Bili'. It must be something with the g++ screwing up binary encoding (?), since entering values with wcin then outputting them with wcout does not generate any problem.
The following code works for me, using MinGW-w64 7.3.0 in both MSYS2 Bash, and Windows CMD; and with the source encoded as UTF-8:
#include <iostream>
#include <locale>
#include <string>
#include <codecvt>
int main()
{
std::ios_base::sync_with_stdio(false);
std::locale utf8( std::locale(), new std::codecvt_utf8_utf16<wchar_t> );
std::wcout.imbue(utf8);
std::wstring w(L"Bilişim Sistemleri Mühendisliğine Giriş");
std::wcout << w << '\n';
}
Explanation:
The Windows console doesn't support any sort of 16-bit output; it's only ANSI and a partial UTF-8 support. So you need to configure wcout to convert the output to UTF-8. This is the default for backwards compatibility purposes, though Windows 10 1803 does add an option to set that to UTF-8 (ref).
imbue with a codecvt_utf8_utf16 achieves this; however you also need to disable sync_with_stdio otherwise the stream doesn't even use the facet, it just defers to stdout which has a similar problem.
For writing to other files, I found the same technique works to write UTF-8. For writing a UTF-16 file you need to imbue the wofstream with a UTF-16 facet, see example here, and manually write a BOM.
Commentary: Many people just avoid trying to use wide iostreams completely, due to these issues.
You can write a UTF-8 file using a narrow stream; and have function calls in your code to convert wstring to UTF-8, if you are using wstring internally; you can of course use UTF-8 internally.
Of course you can also write a UTF-16 file using a narrow stream, just not with operator<< from a wstring.
If you have at least Windows 10 1903 (May 2019), and at least
Windows Terminal 0.3.2142 (Aug 2019). Then set Unicode:
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage]
"OEMCP"="65001"
and restart. After that you can use this:
#include <iostream>
int main() {
std::string a[] = {
"Bilişim Sistemleri Mühendisliğine Giriş",
"Türk Dili 1",
"İş Sağlığı ve Güvenliği",
"ıiÖöUuÜü",
"☺☻♥♦♣♠•◘○"
};
for (auto s: a) {
std::cout << s << std::endl;
}
}

How to use same wide char string format for windows and linux

Currently, I have some code that will running on Windows/Linux. Now I have a problem when output some information.
On windows, I use _vsnwprintf_s() to support variable parameter. So it support below format.
Log0(L"111");
Log1(L"%s", L"22");
Log2(L"%s, %d", L"33", 44);
On Linux, I can not use vswprintf to format string, but it need use %ls to format wide string.
Debug0(L"111");
Debug1(L"%ls", L"22");
Currently, I want to wrapper a unified function InfoX() to support cross-platform, so it internally will use LogX() or DebugX() base on current OS type.
As you can see, on windows, I will use %s to format wide string, but will use %ls on linux. I do not known how to input at ??? in Info2() function.
Info2(L"???", L"22");
Why don't you use stl library ? It is the better tool to answer to your problem.
You use wide char so std::wstring is the best solution.
To create your "wstring" value you use "wstringstream" and you don't need to format your charaters.
#include <sstream>
#include <iostream>
void LOG(std::wstring _sLog)
{
std::wcout << _sLog << '\n';
//Your LOG code
}
int main()
{
std::wstringstream oss;
std::wstring wideValue(L"my wide value :");
oss << wideValue << 101;
std::wstring s = oss.str();
LOG(s);
return 0;
}

no output with wide streams

I have a problem with wide stream output. My primary concern is wofstream but wcout doesn't work properly either.
So it doesn't produce output besides Latin characters.
That is
#include <string>
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
wstring wstr = L"Андрей";
wofstream fout(L"C:\\Work\\report.htm");
wcout << wstr << L"Привет мир";
fout << wstr << L"Привет мир";
fout.close();
}
Produces no output, the file stays 0 byte long.
Mixing like wcout<<L"zuhщзг" prints just "zuh", ignores the rest.
I use MVS 2013 with Intel C++ Composer 14.0
EDIT:
Windows Unicode C++ Stream Output Failure describes similar problem. But I don't quite understand how the solution works.
MVS/Windows use UTF-16 for wide strings. and I would like they to be written in the file, as is, that is utf-16, without any unnecessary conversion

wostream fails to output wstring

I am using Visual Studio C++ 2008 (Express). When I run the below code, the wostream (both std::wcout, and std::wfstream) stops outputting at the first non-ASCII character (in this case Chinese) encountered. Plain ASCII characters print fine. However, in the debugger, I can see that the wstrings are in fact properly populated with Chinese characters, and the output << ... is in fact getting executed.
The project settings in the Visual Studio solution are set to "Use Unicode Character Set". Why is std::wostream failing to output Unicode characters outside of the ASCII range?
void PrintTable(const std::vector<std::vector<std::wstring>> &table, std::wostream& output) {
for (unsigned int i=0; i < table.size(); ++i) {
for (unsigned int j=0; j < table[i].size(); ++j) {
output << table[i][j] << L"\t";
}
//output << std::endl;
}
}
void TestUnicodeSingleTableChinesePronouns() {
FileProcessor p("SingleTableChinesePronouns.docx");
FileProcessor::iterator fileIterator;
std::wofstream myFile("data.bin", std::ios::out | std::ios::binary);
for(fileIterator = p.begin(); fileIterator != p.end(); ++fileIterator) {
PrintTable(*fileIterator, myFile);
PrintTable(*fileIterator, std::wcout);
std::cout<<std::endl<<"---------------------------------------"<<std::endl;
}
myFile.flush();
myFile.close();
}
By default the locale that std::wcout and std::wofstream use for certain operations is the "C" locale, which is not required to support non-ascii characters (or any character outside C++'s basic character set). Change the locale to one that supports the characters you want to use.
The simplest thing to do on Windows is unfortunately to use legacy codepages, however you really should avoid that. Legacy codepages are bad news. Instead you should use Unicode, whether UTF-8, UTF-16, or whatever. Also you'll have to work around Windows' unfortunate console model that makes writing to the console very different from writing to other kinds of output streams. You might need to find or write your own output buffer that specifically handles the console (or maybe file a bug asking Microsoft to fix it).
Here's an example of console output:
#include <Windows.h>
#include <streambuf>
#include <iostream>
class Console_streambuf
: public std::basic_streambuf<wchar_t>
{
HANDLE m_out;
public:
Console_streambuf(HANDLE out) : m_out(out) {}
virtual int_type overflow(int_type c = traits_type::eof())
{
wchar_t wc = c;
DWORD numberOfCharsWritten;
BOOL res = WriteConsoleW(m_out, &wc, 1, &numberOfCharsWritten, NULL);
(void)res;
return 1;
}
};
int main() {
Console_streambuf out(GetStdHandle(STD_OUTPUT_HANDLE));
auto old_buf = std::wcout.rdbuf(&out);
std::wcout << L"привет, 猫咪!\n";
std::wcout.rdbuf(old_buf); // replace old buffer so that destruction can happen correctly. FIXME: use RAII to do this in an exception safe manner.
}
You can do UTF-8 output to a file like this (although I'm not sure VS2008 supports codecvt_utf8_utf16):
#include <codecvt>
#include <fstream>
int main() {
std::wofstream myFile("data.bin", std::ios::out | std::ios::binary);
myFile.imbue(std::locale(myFile.getloc(),new std::codecvt_utf8_utf16<wchar_t>));
myFile << L"привет, 猫咪!";
}
Include the following header file
#include <locale>
at the start of main, add the following line.
std::locale::global(std::locale("chinese"));
This helps to set the proper locale.