STL and UTF-8 file input/output. How to do it? - c++

I use wchar_t for internal strings and UTF-8 for storage in files. I need to use STL to input/output text to screen and also do it by using full Lithuanian charset.
It's all fine because I'm not forced to do the same for files, so the following example does the job just fine:
#include <io.h>
#include <fcntl.h>
#include <iostream>
_setmode (_fileno(stdout), _O_U16TEXT);
wcout << L"AaĄąfl" << endl;
But I became curious and attempted to do the same with files with no success. Of course I could use formatted input/output, but that is... discouraged.
FILE* fp;
_wfopen_s (&fp, L"utf-8_out_test.txt", L"w");
_setmode (_fileno (fp), _O_U8TEXT);
_fwprintf_p (fp, L"AaĄą\nfl");
fclose (fp);
_wfopen_s (&fp, L"utf-8_in_test.txt", L"r");
_setmode (_fileno (fp), _O_U8TEXT);
wchar_t text[256];
fseek (fp, 0, SEEK_SET);
fwscanf (fp, L"%s", text);
wcout << text << endl;
fwscanf (fp, L"%s", text);
wcout << text << endl;
fclose (fp);
This snippet works perfectly (although I am not sure how it handles malformed chars). So, is there any way to:
get FILE* or integer file handle from a std::basic_*fstream?
simulate _setmode () on it?
extend std::basic_*fstream so it handles UTF-8 I/O?
Yes, I am studying at a university and this is somewhat related to my assignments, but I am trying to figure this out for myself. It won't influence my grade or anything like that.

Use the std::codecvt facet to perform the conversion.
You may use the standard std::codecvt_byname, or a non-standard codecvt facet implementation.
#include <iostream>
#include <locale>
using namespace std;
typedef codecvt<wchar_t, char, mbstate_t> Cvt;
locale utf8locale(locale(), new codecvt_byname<wchar_t, char, mbstate_t> ("en_US.UTF-8"));
wcout.imbue(utf8locale);
wcout << L"Hello, wide to multibyte world!" << endl;
Beware that on some platforms codecvt_byname can only perform the conversion for locales that are installed on the system.
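Since a named locale may be missing, it can help to probe for it before imbuing a stream. The following is only a sketch: locale_available and first_available_locale are hypothetical helper names, and it relies on the std::locale constructor throwing std::runtime_error for names the system does not know.

```cpp
#include <cstddef>
#include <locale>
#include <stdexcept>
#include <string>

// Returns true if the named locale can be constructed on this system.
// std::locale's constructor throws std::runtime_error for unknown names.
inline bool locale_available(const std::string& name) {
    try {
        std::locale probe(name.c_str());
        (void)probe;
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}

// Picks the first constructible locale from a list of candidate names and
// falls back to the classic "C" locale, which is always present.
inline std::locale first_available_locale(const char* const* names, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        if (locale_available(names[i])) return std::locale(names[i]);
    }
    return std::locale::classic();
}
```

A caller could pass candidates such as "en_US.UTF-8" and "C.UTF-8" and imbue the stream with whatever comes back. Which names exist is system-dependent, so the fallback should be treated as a real possibility.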

Well, after some testing I figured out that FILE is accepted for _iobuf (in the w*fstream constructor). So, the following code does what I need.
#include <iostream>
#include <fstream>
#include <io.h>
#include <fcntl.h>
//For writing
FILE* fp;
_wfopen_s (&fp, L"utf-8_out_test.txt", L"w");
_setmode (_fileno (fp), _O_U8TEXT);
wofstream fs (fp);
fs << L"ąfl";
fclose (fp);
//And reading
FILE* fp;
_wfopen_s (&fp, L"utf-8_in_test.txt", L"r");
_setmode (_fileno (fp), _O_U8TEXT);
wifstream fs (fp);
wchar_t array[6];
fs.getline (array, 5);
wcout << array << endl;//For debug
fclose (fp);
This sample reads and writes legitimate UTF-8 files (without BOM) on Windows when compiled with Visual Studio 2008.
Can someone give any comments about portability? Improvements?

The easiest way would be to do the conversion to UTF-8 yourself before trying to output. You might get some inspiration from this question: UTF8 to/from wide char conversion in STL
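As a rough illustration of doing the conversion yourself, here is a minimal sketch of a hand-written encoder. The names append_utf8 and wide_to_utf8 are made up for this example. It assumes wchar_t holds either UTF-16 code units (as on Windows) or UTF-32 code points (as on most Unix-like systems), and it replaces invalid code points with U+FFFD.

```cpp
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of one Unicode code point to `out`.
// Lone surrogates and values above U+10FFFF are replaced with U+FFFD.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts a wide string to UTF-8, combining UTF-16 surrogate pairs
// into a single code point when they occur.
inline std::string wide_to_utf8(const std::wstring& w) {
    std::string out;
    for (std::size_t i = 0; i < w.size(); ++i) {
        char32_t cp = static_cast<char32_t>(w[i]);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < w.size()) {
            const char32_t lo = static_cast<char32_t>(w[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The resulting std::string can then be written to an ordinary std::ofstream opened in binary mode, with no locale involved.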

get FILE* or integer file handle from a std::basic_*fstream?
Answered elsewhere.

You can't make the STL work directly with UTF-8. The basic reason is that the STL indirectly forbids multi-char characters: each character has to be exactly one char or wchar_t.
Microsoft actually breaks the standard with their UTF-16 encoding, so maybe you can get some inspiration there.
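To make the "one character per element" point concrete: a std::string holding UTF-8 counts bytes, not characters. The sketch below uses a hypothetical helper, count_code_points, which assumes well-formed input and simply skips continuation bytes.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes the input is well-formed.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the two-character text "Aą" stored as UTF-8, size() reports 3 while this helper reports 2, which is why per-element operations on the string cannot be trusted to respect character boundaries.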

Related

Reading an input of mixed unicode characters and integers [duplicate]

C++ writing UTF-8 on Linux

I have the following code on Windows written in C++ with Visual Studio:
FILE* outFile = fopen(outFileName, "a,ccs=UTF-8");
fwrite(buffer.c_str(), buffer.getLength() * sizeof(wchar_t), 1, outFile);
std::wstring newLine = L"\n";
fwrite(newLine.c_str(), sizeof(wchar_t), 1, outFile);
fclose(outFile);
This correctly writes out the file in UTF-8.
When I compile and run the same code on Linux, the file is created, but it is zero length. If I change the fopen command as follows, the file is created and non-zero length, but all non-ASCII characters display as garbage:
FILE* outFile = fopen(outFileName, "a");
Does ccs=UTF-8 not work on Linux gcc?
No, the extensions made on Windows do not work on Linux, OS X, Android, iOS, or anywhere else. Microsoft adds those extensions, and the result is code that is incompatible with other platforms.
Convert your wide string to a byte string that contains UTF-8, then write the bytes to the file as usual.
There are a lot of ways to do it, but the most standard-compatible way is perhaps this:
#include <iostream>
#include <string>
#include <codecvt>
#include <locale>
using Converter = std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t>;
int main()
{
std::wstring wide = L"Öö Tiib 😛";
std::string u8 = Converter{}.to_bytes(wide);
// note: I just put the bytes out to cout, you want to write to file
std::cout << std::endl << u8 << std::endl;
}
Demo is there. It uses g++ 8.1.0 but g++ 4.9.x is also likely fine.
Note that it is rare for anyone to need wide strings on Linux; most code there uses UTF-8 only.
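If the text is already held as UTF-8 in a std::string, appending it to a file needs no conversion at all. A minimal sketch, with append_utf8_line as a made-up helper name:

```cpp
#include <fstream>
#include <string>

// Appends one UTF-8 encoded line to a file. The file is opened in binary
// mode so that no newline or character translation takes place.
inline bool append_utf8_line(const std::string& path, const std::string& utf8_text) {
    std::ofstream out(path, std::ios::app | std::ios::binary);
    if (!out) return false;
    out << utf8_text << '\n';
    return static_cast<bool>(out.flush());
}
```

This mirrors the "a" mode of the fopen call in the question, but it is the caller's job to make sure the bytes really are UTF-8.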

Can't write chinese character into textfile with wofstream

I'm using std::wofstream to write characters to a text file. My characters can come from very different languages (English to Chinese).
I want to print my vector<wstring> into that file.
If my vector contains only english characters I can print them without a problem.
But if I write Chinese characters, my file remains empty.
I browsed through Stack Overflow, and all the answers basically said to use functions from this library:
#include <codecvt>
I can't include that library, because I am using Dev-C++ in version 5.11.
I did: #define UNICODE in all my header files.
I guess there is a really simple solution for that problem.
It would be great, if someone could help me out.
My code:
#define UNICODE
#include <string>
#include <fstream>
using namespace std;
int main()
{
string Path = "D:\\Users\\\t\\Desktop\\korrigiert_RotCommon_zh_check_error.log";
wofstream Out;
wstring eng = L"hello";
wstring chi = L"程序";
Out.open(Path, ios::out);
//works.
Out << eng;
//fails
Out << chi;
Out.close();
return 0;
}
Kind Regards
Even if the name of the wofstream implies it's a wide char stream, it's not. It's still a char stream that uses a convert facet from a locale to convert the wchars to char.
Here is what cppreference says:
All file I/O operations performed through std::basic_fstream<CharT> use the std::codecvt<CharT, char, std::mbstate_t> facet of the locale imbued in the stream.
So you could either set the global locale to one that supports Chinese or imbue the stream. In both cases you'll get a single byte stream.
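One way to see the facet at work is to check the stream state after writing. This is only a sketch with a hypothetical helper name, write_wide. It assumes the stream picks up the global locale, which is the "C" locale unless the program changed it. Under that locale, ASCII text converts fine, while non-ASCII text typically leaves the stream in a failed state, although the exact behaviour depends on the implementation.

```cpp
#include <fstream>
#include <string>

// Writes a wide string through std::wofstream and reports whether the
// stream is still usable afterwards. No locale is imbued here, so the
// stream uses the codecvt facet of the current global locale.
inline bool write_wide(const std::string& path, const std::wstring& text) {
    std::wofstream out(path);
    if (!out) return false;
    out << text;
    out.flush();  // force the wchar_t -> char conversion to happen now
    return out.good();
}
```

A false result for text that contains Chinese characters would point to the conversion failing, which matches the empty file described in the question.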
#include <locale>
//...
const std::locale loc = std::locale(std::locale(), new std::codecvt_utf8<wchar_t>);
Out.open(Path, ios::out);
Out.imbue(loc);
Unfortunately, std::codecvt_utf8 is already deprecated [2]. This MSDN Magazine article explains how to do UTF-8 conversion using MultiByteToWideChar: C++ - Unicode Encoding Conversions with STL Strings and Win32 APIs.
Here is the Microsoft/vcpkg variant of a to_utf8 conversion:
std::string to_utf8(const CWStringView w)
{
const size_t size = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, nullptr, 0, nullptr, nullptr);
std::string output;
output.resize(size - 1);
WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, output.data(), size - 1, nullptr, nullptr);
return output;
}
Alternatively, you can use a normal binary stream and write the wstring data with write(). Note that this writes the raw wchar_t units (UTF-16 on Windows), not UTF-8.
std::ofstream Out(Path, ios::out | ios::binary);
const uint16_t bom = 0xFEFF;
Out.write(reinterpret_cast<const char*>(&bom), sizeof(bom)); // optional Byte order mark
Out.write(reinterpret_cast<const char*>(chi.data()), chi.size() * sizeof(wchar_t));
You forgot to tell your stream what locale to use:
Out.imbue(std::locale("zh_CN.UTF-8"));
You'll obviously need to include <locale> for this.

writing a string to file as a sequence of bytes

I want to write a wide string to a file as a sequence of bytes. I tried two ways, the first way:
std::wstring str = L"This is a test";
LPBYTE pBuf = (LPBYTE)str.c_str();
FILE* hFile = _wfopen( L"c:\\temp.txt", L"w" );
for( int i = 0; i<(str.length()*sizeof(wchar_t)); ++i)
fwprintf( hFile, L"%02X", pBuf[i] );
fclose(hFile);
The second way:
std::wstring str = L"This is a test";
LPBYTE pBuf = (LPBYTE)str.c_str();
HANDLE hFile = CreateFile( L"c:\\temp.txt", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL );
DWORD dwRet;
WriteFile( hFile, pBuf, str.length()*sizeof(wchar_t), &dwRet, NULL );
CloseHandle(hFile);
When I open the result file, in the first case the contents of the file are:
54006800690073002000690073002000610020007400650073007400
In the second case, the contents of the file are:
This is a test
Why doesn't the first way work as expected? It looks like both ways are equivalent.
In the first example, you used fwprintf to format the bytes as 2-digit hex strings so that is why you see hex in that file.
I suspect you should spend some time researching the ASCII code and UTF-16LE and looking at text using a hex editor.
Every file is just a sequence of bytes so your question is not well defined and makes me think you have some fundamental misunderstanding about bytes and encodings but I'm not sure what it is.
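To illustrate the difference between writing bytes and writing a textual description of bytes, here is a small sketch. hex_dump is a hypothetical helper that does what the fwprintf loop with %02X in the question does, but into a string.

```cpp
#include <cstddef>
#include <string>

// Formats each byte as two uppercase hex digits. The result is text that
// describes the bytes, and is twice as long as the data itself.
inline std::string hex_dump(const void* data, std::size_t size) {
    static const char digits[] = "0123456789ABCDEF";
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    std::string out;
    out.reserve(size * 2);
    for (std::size_t i = 0; i < size; ++i) {
        out += digits[bytes[i] >> 4];
        out += digits[bytes[i] & 0x0F];
    }
    return out;
}
```

The two bytes of "Th" come out as the four characters "5468". That is what ended up in the first file, whereas WriteFile stored the two bytes themselves.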
Assuming you want to write out the in-memory representation of the string:
#include <fstream>
#include <string>
int main (int argc,char *argv[]) {
std::wstring str = L"This is a test";
// Binary mode keeps the runtime from translating newline bytes on Windows.
std::ofstream fout(R"(c:\temp.txt)", std::ios::binary);
fout.exceptions(std::ios::badbit | std::ios::failbit);
fout.write(reinterpret_cast<const char*>(str.data()), sizeof(wchar_t) * str.size());
}
We use ofstream because this is C++ and it's better to use RAII types instead of having to manually call fclose or CloseHandle. We use a raw string for the filename so we don't have to deal with escaping the backslash. (On platforms that use a sensible path separator ; ) the raw string here is unnecessary.) We also turn on exceptions so that we don't have to explicitly check for errors.
Then we write out the bytes using the write member function. Note that the codecvt facet is still applied to the data written using this method. This is the reason we're using ofstream instead of wofstream; The default facet for ofstream does nothing, but the default facet for wofstream would convert the wchar_t to char using the default locale.
If you simply want to write UTF-16 data out then there are better ways than trying to write the raw bytes of a wchar_t string. (wchar_t isn't necessarily UTF-16. Some platforms just happen to use UTF-16.)
One way is to use the codecvt_utf16 facet:
#include <fstream>
#include <codecvt>
#include <locale>
#include <string>
int main(int argc, char *argv[]) {
std::wstring str = L"This is a test";
std::wofstream fout(R"(C:\temp.txt)");
fout.exceptions(std::ios::badbit | std::ios::failbit);
fout.imbue(std::locale(std::locale("C"), new std::codecvt_utf16<wchar_t>));
fout << str;
}
Here we write a wchar_t string normally, but we've imbued the wstream with codecvt_utf16, so that the wchar_t is converted to UTF-16. If you want little endian UTF-16, or you want to include U+FEFF at the beginning of the file (these are frequently done on Windows) then there are flags to enable that: std::codecvt_utf16<wchar_t, 0x10FFFF, static_cast<std::codecvt_mode>(std::generate_header | std::little_endian)>. (Also note that codecvt_utf16 will treat wchar_t as UCS-2 or UCS-4, never UTF-16. The upshot is that this only handles the BMP on Windows.)
Another option is to use normal streams and the wstring_convert facility:
#include <fstream>
#include <codecvt>
#include <locale>
#include <string>
int main(int argc, char *argv[]) {
std::wstring str = L"This is a test";
std::ofstream fout(R"(C:\temp.txt)");
fout.exceptions(std::ios::badbit | std::ios::failbit);
std::wstring_convert<std::codecvt_utf16<wchar_t>, wchar_t> convert;
fout << convert.to_bytes(str);
}
This is probably the option I would choose, since it allows one to almost completely avoid wchar_t.
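For toolchains where <codecvt> is missing or its deprecation is a concern, the UTF-16LE encoding can also be done by hand. The following is a minimal sketch, not a replacement for a tested library. to_utf16le is a made-up name, and the input is assumed to be valid code points held in a std::u32string.

```cpp
#include <string>

// Encodes Unicode code points as UTF-16LE bytes, optionally preceded by a
// byte order mark (U+FEFF). Assumes the input holds valid code points.
inline std::string to_utf16le(const std::u32string& text, bool with_bom) {
    std::string out;
    auto put_unit = [&out](char16_t unit) {
        out += static_cast<char>(unit & 0xFF);         // low byte first
        out += static_cast<char>((unit >> 8) & 0xFF);  // then high byte
    };
    if (with_bom) put_unit(0xFEFF);
    for (char32_t cp : text) {
        if (cp < 0x10000) {
            put_unit(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            put_unit(static_cast<char16_t>(0xD800 + (cp >> 10)));
            put_unit(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The returned bytes can be written with write() to a std::ofstream opened in binary mode. Unlike codecvt_utf16 with wchar_t on Windows, this sketch also covers code points outside the BMP.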

How can I cin and cout some unicode text?

I am asking for a code snippet that reads a Unicode text with cin, concatenates another Unicode text to it, and writes the result with cout.
P.S. This code will help me solve another, bigger problem with Unicode. But first, the key thing is to accomplish what I ask.
ADDED: BTW, I can't type any Unicode symbol on the command line when I run the executable file. How should I do that?
I had a similar problem in the past, in my case imbue and sync_with_stdio did the trick. Try this:
#include <iostream>
#include <locale>
#include <string>
using namespace std;
int main() {
ios_base::sync_with_stdio(false);
wcin.imbue(locale("en_US.UTF-8"));
wcout.imbue(locale("en_US.UTF-8"));
wstring s;
wstring t(L" la Polynésie française");
wcin >> s;
wcout << s << t << endl;
return 0;
}
It depends on what type of Unicode you mean. I assume you mean you are just working with std::wstring though. In that case use std::wcin and std::wcout.
For conversion between encodings you can use your OS functions, such as WideCharToMultiByte and MultiByteToWideChar on Win32, or you can use a library like libiconv.
Here is an example that shows four different methods, of which only the third (C conio) and the fourth (native Windows API) work (but only if stdin/stdout aren't redirected). Note that you still need a font that contains the character you want to show (Lucida Console supports at least Greek and Cyrillic). Note that everything here is completely non-portable, there is just no portable way to input/output Unicode strings on the terminal.
#ifndef UNICODE
#define UNICODE
#endif
#ifndef _UNICODE
#define _UNICODE
#endif
#define STRICT
#define NOMINMAX
#define WIN32_LEAN_AND_MEAN
#include <iostream>
#include <string>
#include <cstdlib>
#include <cstdio>
#include <conio.h>
#include <windows.h>
void testIostream();
void testStdio();
void testConio();
void testWindows();
int wmain() {
testIostream();
testStdio();
testConio();
testWindows();
std::system("pause");
}
void testIostream() {
std::wstring first, second;
std::getline(std::wcin, first);
if (!std::wcin.good()) return;
std::getline(std::wcin, second);
if (!std::wcin.good()) return;
std::wcout << first << second << std::endl;
}
void testStdio() {
wchar_t buffer[0x1000];
if (!_getws_s(buffer)) return;
const std::wstring first = buffer;
if (!_getws_s(buffer)) return;
const std::wstring second = buffer;
const std::wstring result = first + second;
_putws(result.c_str());
}
void testConio() {
wchar_t buffer[0x1000];
std::size_t numRead = 0;
if (_cgetws_s(buffer, &numRead)) return;
const std::wstring first(buffer, numRead);
if (_cgetws_s(buffer, &numRead)) return;
const std::wstring second(buffer, numRead);
const std::wstring result = first + second + L'\n';
_cputws(result.c_str());
}
void testWindows() {
const HANDLE stdIn = GetStdHandle(STD_INPUT_HANDLE);
WCHAR buffer[0x1000];
DWORD numRead = 0;
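// ReadConsoleW expects the buffer size in characters, not bytes.
if (!ReadConsoleW(stdIn, buffer, sizeof buffer / sizeof buffer[0], &numRead, NULL)) return;
const std::wstring first(buffer, numRead - 2);
if (!ReadConsoleW(stdIn, buffer, sizeof buffer / sizeof buffer[0], &numRead, NULL)) return;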
const std::wstring second(buffer, numRead);
const std::wstring result = first + second;
const HANDLE stdOut = GetStdHandle(STD_OUTPUT_HANDLE);
DWORD numWritten = 0;
WriteConsoleW(stdOut, result.c_str(), result.size(), &numWritten, NULL);
}
Edit 1: I've added a method based on conio.
Edit 2: I've messed around with _O_U16TEXT a bit as described in Michael Kaplan's blog, but that seemingly only had wgets interpret the (8-bit) data from ReadFile as UTF-16. I'll investigate this a bit further during the weekend.
If you have actual text (i.e., a string of logical characters), then insert to the wide streams instead. The wide streams will automatically encode your characters to match the bits expected by the locale encoding. (And if you have encoded bits instead, the streams will decode the bits, then re-encode them to match the locale.)
There is a lesser solution if you KNOW you have UTF-encoded bits (i.e., an array of bits intended to be decoded into a string of logical characters) AND you KNOW the target of the output stream is expecting that very same bit-format, then you can skip the decoding and re-encoding steps and write() the bits as-is. This only works when you know both sides use the same encoding format, which may be the case for small utilities not intended to communicate with processes in other locales.
It depends on the OS. If your OS understands UTF-8, you can simply send it UTF-8 sequences.