UTF-8 output on Windows console - C++

The following code shows unexpected behaviour on my machine (tested with Visual C++ 2008 SP1 on Windows XP and VS 2012 on Windows 7):
#include <iostream>
#include "Windows.h"
int main() {
SetConsoleOutputCP( CP_UTF8 );
std::cout << "\xc3\xbc";
int fail = std::cout.fail() ? '1': '0';
fputc( fail, stdout );
fputs( "\xc3\xbc", stdout );
}
I simply compiled with cl /EHsc test.cpp.
Windows XP: Output in a console window is
ü0ü (translated to codepage 1252; originally it shows some line-drawing characters in the default codepage, perhaps 437). When I change the settings of the console window to use the "Lucida Console" font and run my test.exe again, the output changes to 1ü, which means
the character ü can be written using fputs and its UTF-8 encoding C3 BC
std::cout does not work for whatever reason
the stream's failbit is set after trying to write the character
Windows 7: Output using Consolas is ��0ü. Even more interesting: the correct bytes are probably written (at least when redirecting the output to a file) and the stream state is OK, but the two bytes show up as separate characters.
I tried to raise this issue on "Microsoft Connect" (see here), but MS has not been very helpful. You may also want to look here, as something similar has been asked before.
Can you reproduce this problem?
What am I doing wrong? Shouldn't the std::cout and the fputs have the same
effect?
SOLVED: (sort of) Following mike.dld's idea I implemented a std::stringbuf doing the conversion from UTF-8 to Windows-1252 in sync() and replaced the streambuf of std::cout with this converter (see my comment on mike.dld's answer).
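For reference, a minimal sketch of such a converting buffer (not the exact code from that comment; the class name is made up and error handling is omitted):

#include <cstdio>
#include <iostream>
#include <sstream>
#include <string>
#include <windows.h>

// Collects UTF-8 bytes and, on sync(), converts them to Windows-1252
// (via UTF-16) and writes the result with fwrite().
class Utf8ToCp1252Buf : public std::stringbuf {
protected:
    int sync() override {
        std::string utf8 = str();
        if (!utf8.empty()) {
            int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
            std::wstring wide(wlen, L'\0');
            MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], wlen);
            int nlen = WideCharToMultiByte(1252, 0, wide.data(), wlen, NULL, 0, NULL, NULL);
            std::string narrow(nlen, '\0');
            WideCharToMultiByte(1252, 0, wide.data(), wlen, &narrow[0], nlen, NULL, NULL);
            fwrite(narrow.data(), 1, narrow.size(), stdout);
            str("");  // clear the collected bytes
        }
        return 0;
    }
};

int main() {
    Utf8ToCp1252Buf buf;
    std::streambuf* old = std::cout.rdbuf(&buf);  // swap the converter in
    std::cout << "\xc3\xbc" << std::flush;        // flush triggers sync() and the conversion
    std::cout.rdbuf(old);                         // restore before buf is destroyed
}

Swapping the original buffer back before buf goes out of scope matters, since std::cout would otherwise be left pointing at a destroyed streambuf.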

I understand the question is quite old, but if someone is still interested, below is my solution. I've implemented a quite simple std::streambuf descendant and attached it to each of the standard streams at the very beginning of program execution.
This allows you to use UTF-8 everywhere in your program. On input, data is taken from the console in Unicode, then converted and returned to you in UTF-8. On output the opposite is done: data is taken from you in UTF-8, converted to Unicode and sent to the console. No issues found so far.
Also note that this solution doesn't require any codepage modification, whether with SetConsoleCP, SetConsoleOutputCP, chcp, or anything else.
That's the stream buffer:
class ConsoleStreamBufWin32 : public std::streambuf
{
public:
ConsoleStreamBufWin32(DWORD handleId, bool isInput);
protected:
// std::basic_streambuf
virtual std::streambuf* setbuf(char_type* s, std::streamsize n);
virtual int sync();
virtual int_type underflow();
virtual int_type overflow(int_type c = traits_type::eof());
private:
HANDLE const m_handle;
bool const m_isInput;
std::string m_buffer;
};
ConsoleStreamBufWin32::ConsoleStreamBufWin32(DWORD handleId, bool isInput) :
m_handle(::GetStdHandle(handleId)),
m_isInput(isInput),
m_buffer()
{
if (m_isInput)
{
setg(0, 0, 0);
}
}
std::streambuf* ConsoleStreamBufWin32::setbuf(char_type* /*s*/, std::streamsize /*n*/)
{
return 0;
}
int ConsoleStreamBufWin32::sync()
{
if (m_isInput)
{
::FlushConsoleInputBuffer(m_handle);
setg(0, 0, 0);
}
else
{
if (m_buffer.empty())
{
return 0;
}
std::wstring const wideBuffer = utf8_to_wstring(m_buffer);
DWORD writtenSize;
::WriteConsoleW(m_handle, wideBuffer.c_str(), wideBuffer.size(), &writtenSize, NULL);
}
m_buffer.clear();
return 0;
}
ConsoleStreamBufWin32::int_type ConsoleStreamBufWin32::underflow()
{
if (!m_isInput)
{
return traits_type::eof();
}
if (gptr() >= egptr())
{
wchar_t wideBuffer[128];
DWORD readSize;
if (!::ReadConsoleW(m_handle, wideBuffer, ARRAYSIZE(wideBuffer) - 1, &readSize, NULL))
{
return traits_type::eof();
}
wideBuffer[readSize] = L'\0';
m_buffer = wstring_to_utf8(wideBuffer);
setg(&m_buffer[0], &m_buffer[0], &m_buffer[0] + m_buffer.size());
if (gptr() >= egptr())
{
return traits_type::eof();
}
}
return sgetc();
}
ConsoleStreamBufWin32::int_type ConsoleStreamBufWin32::overflow(int_type c)
{
if (m_isInput)
{
return traits_type::eof();
}
m_buffer += traits_type::to_char_type(c);
return traits_type::not_eof(c);
}
The usage then is as follows:
template<typename StreamT>
inline void FixStdStream(DWORD handleId, bool isInput, StreamT& stream)
{
if (::GetFileType(::GetStdHandle(handleId)) == FILE_TYPE_CHAR)
{
stream.rdbuf(new ConsoleStreamBufWin32(handleId, isInput));
}
}
// ...
int main()
{
FixStdStream(STD_INPUT_HANDLE, true, std::cin);
FixStdStream(STD_OUTPUT_HANDLE, false, std::cout);
FixStdStream(STD_ERROR_HANDLE, false, std::cerr);
// ...
std::cout << "\xc3\xbc" << std::endl;
// ...
}
The omitted wstring_to_utf8 and utf8_to_wstring functions can easily be implemented with the WideCharToMultiByte and MultiByteToWideChar WinAPI functions.
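For completeness, a minimal sketch of how those two helpers might look (an assumption on my part; error handling and invalid-input checks are omitted):

#include <string>
#include <windows.h>

// UTF-8 -> UTF-16 via MultiByteToWideChar
std::wstring utf8_to_wstring(const std::string& s)
{
    if (s.empty()) return std::wstring();
    int len = ::MultiByteToWideChar(CP_UTF8, 0, s.data(), static_cast<int>(s.size()), NULL, 0);
    std::wstring result(len, L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, s.data(), static_cast<int>(s.size()), &result[0], len);
    return result;
}

// UTF-16 -> UTF-8 via WideCharToMultiByte
std::string wstring_to_utf8(const std::wstring& s)
{
    if (s.empty()) return std::string();
    int len = ::WideCharToMultiByte(CP_UTF8, 0, s.data(), static_cast<int>(s.size()), NULL, 0, NULL, NULL);
    std::string result(len, '\0');
    ::WideCharToMultiByte(CP_UTF8, 0, s.data(), static_cast<int>(s.size()), &result[0], len, NULL, NULL);
    return result;
}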

Oi. Congratulations on finding a way to change the code page of the console from inside your program. I didn't know about that call, I always had to use chcp.
I'm guessing the C++ default locale is getting involved. By default it will use the code page provided by GetThreadLocale() to determine the text encoding of non-wstring stuff. This generally defaults to CP1252. You could try using SetThreadLocale() to get to UTF-8 (if it even does that, I can't recall), in the hope that std::locale defaults to something that can handle your UTF-8 encoding.

It's time to close this now. Stephan T. Lavavej says the behaviour is "by design", although I cannot follow this explanation.
My current knowledge is: Windows XP console in UTF-8 codepage does not work with C++ iostreams.
Windows XP is going out of fashion now, and so is VS 2008. I'd be interested to hear whether the problem still exists on newer Windows systems.
On Windows 7 the effect is probably due to the way the C++ streams output characters. As seen in an answer to Properly print utf8 characters in windows console, UTF-8 output fails with C stdio when printing one byte after another, as in putc('\xc3'); putc('\xbc');. Perhaps this is what the C++ streams do here.

I just followed mike.dld's answer in this question and added printf support for UTF-8 strings.
As mkluwe mentioned in his answer, by default printf outputs to the console byte by byte, and the console can't handle single bytes correctly. My method is quite simple: I use vsnprintf to print the whole content into an internal string buffer and then dump the buffer to std::cout.
Here is the full testing code:
#include <iostream>
#include <locale>
#include <windows.h>
#include <cstdlib>
#include <cstdio>   // FILE, vsnprintf
#include <cstdarg>  // va_list, va_copy
using namespace std;
// https://stackoverflow.com/questions/4358870/convert-wstring-to-string-encoded-in-utf-8
#include <codecvt>
#include <string>
// convert UTF-8 string to wstring
std::wstring utf8_to_wstring (const std::string& str)
{
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
return myconv.from_bytes(str);
}
// convert wstring to UTF-8 string
std::string wstring_to_utf8 (const std::wstring& str)
{
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
return myconv.to_bytes(str);
}
// https://stackoverflow.com/questions/1660492/utf-8-output-on-windows-console
// mike.dld's answer
class ConsoleStreamBufWin32 : public std::streambuf
{
public:
ConsoleStreamBufWin32(DWORD handleId, bool isInput);
protected:
// std::basic_streambuf
virtual std::streambuf* setbuf(char_type* s, std::streamsize n);
virtual int sync();
virtual int_type underflow();
virtual int_type overflow(int_type c = traits_type::eof());
private:
HANDLE const m_handle;
bool const m_isInput;
std::string m_buffer;
};
ConsoleStreamBufWin32::ConsoleStreamBufWin32(DWORD handleId, bool isInput) :
m_handle(::GetStdHandle(handleId)),
m_isInput(isInput),
m_buffer()
{
if (m_isInput)
{
setg(0, 0, 0);
}
}
std::streambuf* ConsoleStreamBufWin32::setbuf(char_type* /*s*/, std::streamsize /*n*/)
{
return 0;
}
int ConsoleStreamBufWin32::sync()
{
if (m_isInput)
{
::FlushConsoleInputBuffer(m_handle);
setg(0, 0, 0);
}
else
{
if (m_buffer.empty())
{
return 0;
}
std::wstring const wideBuffer = utf8_to_wstring(m_buffer);
DWORD writtenSize;
::WriteConsoleW(m_handle, wideBuffer.c_str(), wideBuffer.size(), &writtenSize, NULL);
}
m_buffer.clear();
return 0;
}
ConsoleStreamBufWin32::int_type ConsoleStreamBufWin32::underflow()
{
if (!m_isInput)
{
return traits_type::eof();
}
if (gptr() >= egptr())
{
wchar_t wideBuffer[128];
DWORD readSize;
if (!::ReadConsoleW(m_handle, wideBuffer, ARRAYSIZE(wideBuffer) - 1, &readSize, NULL))
{
return traits_type::eof();
}
wideBuffer[readSize] = L'\0';
m_buffer = wstring_to_utf8(wideBuffer);
setg(&m_buffer[0], &m_buffer[0], &m_buffer[0] + m_buffer.size());
if (gptr() >= egptr())
{
return traits_type::eof();
}
}
return sgetc();
}
ConsoleStreamBufWin32::int_type ConsoleStreamBufWin32::overflow(int_type c)
{
if (m_isInput)
{
return traits_type::eof();
}
m_buffer += traits_type::to_char_type(c);
return traits_type::not_eof(c);
}
template<typename StreamT>
inline void FixStdStream(DWORD handleId, bool isInput, StreamT& stream)
{
if (::GetFileType(::GetStdHandle(handleId)) == FILE_TYPE_CHAR)
{
stream.rdbuf(new ConsoleStreamBufWin32(handleId, isInput));
}
}
// some of the code is from this blog
// https://blog.csdn.net/witton/article/details/108087135
#define printf(fmt, ...) __fprint(stdout, fmt, ##__VA_ARGS__ )
int __vfprint(FILE *fp, const char *fmt, va_list va)
{
// https://stackoverflow.com/questions/7315936/which-of-sprintf-snprintf-is-more-secure
// fp is ignored: output goes through std::cout so the replaced stream buffer handles it
va_list va2;
va_copy(va2, va); // the first vsnprintf consumes the va_list, so keep a copy for the second pass
size_t nbytes = vsnprintf(NULL, 0, fmt, va) + 1; /* +1 for the '\0' */
char *str = (char*)malloc(nbytes);
vsnprintf(str, nbytes, fmt, va2);
va_end(va2);
std::cout << str;
free(str);
return (int)nbytes;
}
int __fprint(FILE *fp, const char *fmt, ...)
{
va_list va;
va_start(va, fmt);
int n = __vfprint(fp, fmt, va);
va_end(va);
return n;
}
int main()
{
FixStdStream(STD_INPUT_HANDLE, true, std::cin);
FixStdStream(STD_OUTPUT_HANDLE, false, std::cout);
FixStdStream(STD_ERROR_HANDLE, false, std::cerr);
// ...
std::cout << "\xc3\xbc" << std::endl;
printf("\xc3\xbc");
// ...
return 0;
}
The source code is saved in UTF-8 format, built under MSYS2's GCC, and run under Windows 7 64-bit. Here is the result:
ü
ü

Related

C++17 UTF8 std::string to std::wstring UTF32 using unicode.org code or C++ standard functions?

I'm looking for a working solution to the classic UTF-8 to UTF-32 conversion, based on stable and tested code.
I have the source of Unicode.org's C code:
https://android.googlesource.com/platform/external/id3lib/+/master/unicode.org/ConvertUTF.c
https://android.googlesource.com/platform/external/id3lib/+/master/unicode.org/ConvertUTF.h
License:
https://android.googlesource.com/platform/external/id3lib/+/master/unicode.org/readme.txt
I'm using the following C++, which interfaces with the C library code above:
std::wstring Utf8_To_wstring(const std::string& utf8string)
{
if (utf8string.length()==0)
{
return std::wstring();
}
size_t widesize = utf8string.length();
if (sizeof(wchar_t) == 2)
{
std::wstring resultstring;
resultstring.resize(widesize, L'\0');
const UTF8* sourcestart = reinterpret_cast<const UTF8*>(utf8string.c_str());
const UTF8* sourceend = sourcestart + widesize;
UTF16* targetstart = reinterpret_cast<UTF16*>(&resultstring[0]);
UTF16* targetend = targetstart + widesize;
ConversionResult res = ConvertUTF8toUTF16(&sourcestart, sourceend, &targetstart, targetend, strictConversion);
if (res != conversionOK)
{
return std::wstring(utf8string.begin(), utf8string.end());
}
*targetstart = 0;
return std::wstring(resultstring.c_str());
}
else if (sizeof(wchar_t) == 4)
{
std::wstring resultstring;
resultstring.resize(widesize, L'\0');
const UTF8* sourcestart = reinterpret_cast<const UTF8*>(utf8string.c_str());
const UTF8* sourceend = sourcestart + widesize;
UTF32* targetstart = reinterpret_cast<UTF32*>(&resultstring[0]);
UTF32* targetend = targetstart + widesize;
ConversionResult res = ConvertUTF8toUTF32(&sourcestart, sourceend, &targetstart, targetend, lenientConversion);
if (res != conversionOK)
{
return std::wstring(utf8string.begin(), utf8string.end());
}
*targetstart = 0;
if(!resultstring.empty() && resultstring.size() > 0) {
std::wstring result = std::wstring(resultstring.c_str());
return result;
} else {
return std::wstring();
}
}
else
{
assert(false);
return L"";
}
return L"";
}
This code initially works but crashes soon after, due to some issue in the interfacing code above. The interfacing code was adapted from open-source code found on GitHub, from a production project...
It crashes a few strings into the conversion, so I guess there's an overflow somewhere in this code.
Does anyone have a good replacement or example code for a simple C++11/C++17 solution to convert a std::string to a std::wstring holding UTF-32 Unicode values?
I have a working solution for UTF-8 to UTF-16 using the C++17 locale facilities:
This seems to do the job for me, converting to the right level of Unicode so that I can extract character codes as ints and load glyph codes correctly:
#include <locale>
#include <codecvt>
#include <string>
std::wstring Utf8_To_wstring(const std::string& utf8string)
{
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::wstring utf16;
try {
utf16 = converter.from_bytes(utf8string);
}
catch (const std::range_error&)
{
// log / handle the exception; 'utf16' stays empty on failure
}
return utf16;
}
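For the UTF-32 part of the question, a minimal sketch along the same lines, using std::u32string (std::wstring is only 16 bits per element on Windows) and the same deprecated-but-still-available codecvt facilities:

#include <codecvt>
#include <locale>
#include <string>

// UTF-8 -> UTF-32: each element of the result is one Unicode code point.
std::u32string Utf8_To_u32string(const std::string& utf8string)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    std::u32string utf32;
    try {
        utf32 = converter.from_bytes(utf8string);
    }
    catch (const std::range_error&)
    {
        // invalid UTF-8 input; log / handle as needed
    }
    return utf32;
}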

Coloring a specified char in C++

I have a minesweeper console game and I want to make it a bit more beautiful. I found some coloring libraries on the internet and used them.
The library is:
// ConsoleColor.h
#pragma once
#include <iostream>
#include <windows.h>
inline std::ostream& blue(std::ostream &s)
{
HANDLE hStdout = GetStdHandle(STD_OUTPUT_HANDLE);
SetConsoleTextAttribute(hStdout, FOREGROUND_BLUE
|FOREGROUND_GREEN|FOREGROUND_INTENSITY);
return s;
}
inline std::ostream& red(std::ostream &s)
{
HANDLE hStdout = GetStdHandle(STD_OUTPUT_HANDLE);
SetConsoleTextAttribute(hStdout,
FOREGROUND_RED|FOREGROUND_INTENSITY);
return s;
}
inline std::ostream& green(std::ostream &s)
{
HANDLE hStdout = GetStdHandle(STD_OUTPUT_HANDLE);
SetConsoleTextAttribute(hStdout,
FOREGROUND_GREEN|FOREGROUND_INTENSITY);
return s;
}
inline std::ostream& yellow(std::ostream &s)
{
HANDLE hStdout = GetStdHandle(STD_OUTPUT_HANDLE);
SetConsoleTextAttribute(hStdout,
FOREGROUND_GREEN|FOREGROUND_RED|FOREGROUND_INTENSITY);
return s;
}
inline std::ostream& white(std::ostream &s)
{
HANDLE hStdout = GetStdHandle(STD_OUTPUT_HANDLE);
SetConsoleTextAttribute(hStdout,
FOREGROUND_RED|FOREGROUND_GREEN|FOREGROUND_BLUE);
return s;
}
struct color {
color(WORD attribute):m_color(attribute){};
WORD m_color;
};
template <class _Elem, class _Traits>
std::basic_ostream<_Elem,_Traits>&
operator<<(std::basic_ostream<_Elem,_Traits>& i, color& c)
{
HANDLE hStdout=GetStdHandle(STD_OUTPUT_HANDLE);
SetConsoleTextAttribute(hStdout,c.m_color);
return i;
}
It works fine, but I have a question: is there any way to colorize a single character in C++? For example, I want to print every 'X' in my program in red. Is that possible?
Thanks for helping.
Yes, what you're asking for is entirely possible, though it appears that only a few people know much about the applicable parts of the library. Setting the highlight color with a manipulator adds a little bit more work, but not a whole lot. Code could look something like this:
#include <iostream>
#include <windows.h>
class attribute {
DWORD attrib;
public:
attribute(DWORD attrib) : attrib(attrib) {}
DWORD operator()() const { return attrib; }
};
class outbuf : public std::streambuf {
HANDLE h;
DWORD default_color = FOREGROUND_RED|FOREGROUND_GREEN|FOREGROUND_BLUE;
DWORD highlight_color = FOREGROUND_GREEN;
public:
outbuf(HANDLE h) : h(h) {
SetConsoleTextAttribute(h, default_color);
}
void set_highlight(DWORD color) { highlight_color = color; }
protected:
virtual int_type overflow(int_type c) override {
if (c != EOF) {
if (c == 'x') {
SetConsoleTextAttribute(h, highlight_color);
DWORD written;
WriteConsole(h, &c, 1, &written, nullptr);
SetConsoleTextAttribute(h, default_color);
}
else {
DWORD written;
WriteConsole(h, &c, 1, &written, nullptr);
}
}
return c;
}
};
std::ostream &operator<<(std::ostream &os, attribute a) {
outbuf *out = dynamic_cast<outbuf *>(os.rdbuf());
if (out) {
out->set_highlight(a());
}
return os;
}
int main() {
outbuf buf(GetStdHandle(STD_OUTPUT_HANDLE));
attribute red{FOREGROUND_RED};
attribute blue{FOREGROUND_BLUE | FOREGROUND_INTENSITY};
std::cout.rdbuf(&buf);
std::cout << "oxen\n" << red << "axis\n" << blue << "waxy";
}
Result: (screenshot of the colored console output omitted)
There are essentially two ways you can go about it:
Using the Console API (as illustrated in Jerry Coffin's answer).
Taking advantage of the console's ability to process Virtual Terminal Sequences [1].
The latter is easier to implement and a lot more versatile. For example, it allows you to use the full range of 24-bit colors, something that isn't available through the Console API. The following illustrates how to enable processing of virtual terminal sequences, and how to use them:
#include <Windows.h>
#include <iostream>
// Define commonly used formatting control sequences
auto const& reset { L"\x1b[0m" };
auto const& red { L"\x1b[31m" };
auto const& bright_red { L"\x1b[91m" };
int wmain()
{
// Enable processing of virtual terminal sequences
auto output_handle { ::GetStdHandle(STD_OUTPUT_HANDLE) };
DWORD mode {};
auto success { ::GetConsoleMode(output_handle, &mode) };
mode |= ENABLE_VIRTUAL_TERMINAL_PROCESSING;
success = ::SetConsoleMode(output_handle, mode);
std::wcout << red << L"Red Text\n"
<< bright_red << L"Bright Red Text\n"
<< reset << L"Normal Text\n";
}
This produces the following output: the three lines rendered in red, bright red, and the default color (screenshot omitted).
[1] I wasn't able to find information on when virtual terminal sequence processing was introduced into the Windows console.
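As an aside on the 24-bit colors mentioned above: they use the extended SGR sequences ESC[38;2;<r>;<g>;<b>m for the foreground and ESC[48;2;<r>;<g>;<b>m for the background. A minimal sketch, reusing the reset sequence and the console setup from the example above:

    auto const& orange { L"\x1b[38;2;255;165;0m" };  // 24-bit foreground color: R;G;B
    std::wcout << orange << L"Orange Text\n" << reset << L"Normal Text\n";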

Convert C++ std::string to UTF-16-LE encoded string

I've been searching for hours today and just can't find anything that works out for me. The one I've just had a look at, with no luck, is "How to convert UTF-8 encoded std::string to UTF-16 std::string".
My question is, with a brief explanation:
I want to make a valid NTLM hash in standard C++, and I'm using OpenSSL's library to create the hash using its MD4 routines. I know how to do that, but does anyone know how to convert a std::string into a UTF-16-LE encoded string which I can pass to the MD4 functions to get a correct digest?
So, can I take a std::string holding plain chars and convert it to a UTF-16-LE encoded, variable-length string type, whether that be std::u16string or std::wstring?
And would I use s.c_str() or s.data(), and would the length() function report correctly in both cases?
I think something like this should do the trick:
std::string utf16_to_utf8(std::u16string const& s)
{
std::wstring_convert<std::codecvt_utf8_utf16<char16_t, 0x10ffff,
std::codecvt_mode::little_endian>, char16_t> cnv;
std::string utf8 = cnv.to_bytes(s);
if(cnv.converted() < s.size())
throw std::runtime_error("incomplete conversion");
return utf8;
}
std::u16string utf8_to_utf16(std::string const& utf8)
{
std::wstring_convert<std::codecvt_utf8_utf16<char16_t, 0x10ffff,
std::codecvt_mode::little_endian>, char16_t> cnv;
std::u16string s = cnv.from_bytes(utf8);
if(cnv.converted() < utf8.size())
throw std::runtime_error("incomplete conversion");
return s;
}
Note that std::wstring_convert is deprecated in C++17, but I still favor using it over a non-standard library, given that it is portable, has no dependencies, and will no doubt remain available until it is replaced.
And if all else fails, you can reimplement these same functions with alternative code without changing any other part of the application.
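To answer the c_str()/data()/length() part: on a std::u16string both data() and c_str() point at the char16_t code units, and length() counts code units rather than bytes, so the byte count is length() * sizeof(char16_t) (and on a little-endian machine those in-memory bytes are already UTF-16-LE). A rough sketch of feeding that to OpenSSL's one-shot MD4() function, assuming <openssl/md4.h> is available (ntlm_hash is a made-up helper name, not part of the answer above):

#include <openssl/md4.h>
#include <string>

// NTLM hash = MD4 over the UTF-16-LE encoding of the password.
std::string ntlm_hash(std::string const& password_utf8)
{
    std::u16string const utf16 = utf8_to_utf16(password_utf8);   // conversion function from above
    unsigned char digest[MD4_DIGEST_LENGTH];
    MD4(reinterpret_cast<unsigned char const*>(utf16.data()),    // raw code units
        utf16.size() * sizeof(char16_t),                         // byte count, not length()
        digest);
    return std::string(reinterpret_cast<char const*>(digest), sizeof digest);
}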
Apologies firsthand... this will be an ugly reply with some long code. I ended up using the following function, effectively compiling iconv into my Windows application file by file :)
Hope this helps.
char* conver(const char* in, size_t in_len, size_t* used_len)
{
const int CC_MUL = 2; // 16 bit
setlocale(LC_ALL, "");
char* t1 = setlocale(LC_CTYPE, "");
char* locn = (char*)calloc(strlen(t1) + 1, sizeof(char));
if(locn == NULL)
{
return 0;
}
strcpy(locn, t1);
const char* enc = strchr(locn, '.') + 1;
#if _WINDOWS
std::string win = "WINDOWS-";
win += enc;
enc = win.c_str();
#endif
iconv_t foo = iconv_open("UTF-16LE", enc);
if(foo == (void*)-1)
{
if (errno == EINVAL)
{
fprintf(stderr, "Conversion from %s is not supported\n", enc);
}
else
{
fprintf(stderr, "Initialization failure:\n");
}
free(locn);
return 0;
}
size_t out_len = CC_MUL * in_len;
size_t saved_in_len = in_len;
iconv(foo, NULL, NULL, NULL, NULL);
char* converted = (char*)calloc(out_len, sizeof(char));
char *converted_start = converted;
char* t = const_cast<char*>(in);
size_t ret = iconv(foo,
&t,
&in_len,
&converted,
&out_len);
iconv_close(foo);
*used_len = CC_MUL * saved_in_len - out_len;
if(ret == (size_t)-1)
{
switch(errno)
{
case EILSEQ:
fprintf(stderr, "EILSEQ\n");
break;
case EINVAL:
fprintf(stderr, "EINVAL\n");
break;
}
perror("iconv");
free(locn);
return 0;
}
else
{
free(locn);
return converted_start;
}
}
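A brief usage sketch for the function above (the caller owns the returned malloc'd buffer and must free() it):

    size_t used = 0;
    char* utf16le = conver("hello", 5, &used);  // 'used' receives the number of output bytes
    if (utf16le != NULL)
    {
        // ... pass (utf16le, used) to the MD4 routine here ...
        free(utf16le);
    }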

Can't read file with cyrillic path in C++

I'm trying to read a file whose path contains Cyrillic characters, and I get ifstream.is_open() == false.
This is my code:
std::string ReadFile(const std::string &path) {
std::string newLine, fileContent;
std::ifstream in(path.c_str(), std::ios::in);
if (!in.is_open()) {
return std::string("isn't opened");
}
while (in.good()) {
getline(in, newLine);
fileContent += newLine;
}
in.close();
return fileContent;
}
int main() {
std::string path = "C:\\test\\документ.txt";
std::string content = ReadFile(path);
std::cout << content << std::endl;
return 0;
}
The specified file exists.
I've tried to find a solution on Google, but got nothing.
Here are the links I've already looked at:
I don't need wstring
The same as previous
no answer here
is not about C++
has no answer too
P.S. I need to get the file's content as a string, not a wstring.
These are the encoding settings of my IDE (CLion 2017.1); screenshot omitted.
You'll need an up-to-date compiler or Boost. std::filesystem::path can handle these names, but it's new in the C++17 standard. Your compiler may still have it as std::experimental::filesystem::path, or else you'd use the third-party boost::filesystem::path. The interfaces are pretty comparable as the Boost version served as the inspiration.
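A minimal sketch of that approach, assuming a C++17 compiler and a UTF-8 encoded source file (the u8 prefix and u8path make the encoding of the literal explicit):

#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // u8path interprets the narrow string as UTF-8 and stores the name in the
    // native wide encoding, which Windows can open directly.
    std::filesystem::path p = std::filesystem::u8path(u8"C:\\test\\документ.txt");
    std::ifstream in(p, std::ios::in);   // ifstream accepts a filesystem::path since C++17
    std::cout << (in.is_open() ? "opened" : "isn't opened") << std::endl;
}

The key point is that the path object carries the name in the native wide encoding, so no lossy narrow conversion happens when the file is opened.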
std::string is defined as std::basic_string<char>, so your Cyrillic characters are not stored as intended. At least try to use std::wstring to store your file path; you can then still read the file's content into a std::string.
First of all, set your project settings to use UTF-8 encoding instead of Windows-1251. Until the standard library gets really good (not any time soon), you basically cannot rely on it if you want to handle I/O properly. To make an input stream read from such files on Windows, you need to write your own custom input stream buffer that opens files using 2-byte wide characters, or rely on some third-party implementation of such routines. Here is an incomplete (but sufficient for your example) implementation:
// assuming that usual Windows SDK macros such as _UNICODE, WIN32_LEAN_AND_MEAN are defined above
#include <Windows.h>
#include <string>
#include <iostream>
#include <system_error>
#include <memory>
#include <utility>
#include <cstdlib>
#include <cstdio>
static_assert(2 == sizeof(wchar_t), "wchar_t size must be 2 bytes");
using namespace ::std;
class MyStreamBuf final: public streambuf
{
#pragma region Fields
private: ::HANDLE const m_file_handle;
private: char m_buffer; // typically buffer should be much bigger
#pragma endregion
public: explicit
MyStreamBuf(wchar_t const * psz_file_path)
: m_file_handle(::CreateFileW(psz_file_path, FILE_GENERIC_READ, FILE_SHARE_READ, nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL))
, m_buffer{}
{
if(INVALID_HANDLE_VALUE == m_file_handle)
{
auto const error_code{::GetLastError()};
throw(system_error(static_cast< int >(error_code), system_category(), "::CreateFileW call failed"));
}
}
public:
~MyStreamBuf(void)
{
auto const closed{::CloseHandle(m_file_handle)};
if(FALSE == closed)
{
auto const error_code{::GetLastError()};
//throw(::std::system_error(static_cast< int >(error_code), system_category(), "::CloseHandle call failed"));
// throwing in destructor is kinda wrong
// but if CloseHandle returned false then our program is in inconsistent state
// and must be terminated anyway
(void) error_code; // not used
abort();
}
}
private: auto
underflow(void) -> int_type override
{
::DWORD bytes_count_to_read{1};
::DWORD read_bytes_count{};
{
auto const succeeded{::ReadFile(m_file_handle, addressof(m_buffer), bytes_count_to_read, addressof(read_bytes_count), nullptr)};
if(FALSE == succeeded)
{
auto const error_code{::GetLastError()};
setg(nullptr, nullptr, nullptr);
throw(system_error(static_cast< int >(error_code), system_category(), "::ReadFile call failed"));
}
}
if(0 == read_bytes_count)
{
setg(nullptr, nullptr, nullptr);
return(EOF);
}
setg(addressof(m_buffer), addressof(m_buffer), addressof(m_buffer) + 1);
return(m_buffer);
}
};
string
MyReadFile(wchar_t const * psz_file_path)
{
istream in(new MyStreamBuf(psz_file_path)); // note that we create normal stream
string new_line;
string file_content;
while(in.good())
{
getline(in, new_line);
file_content += new_line;
}
return(::std::move(file_content));
}
int
main(void)
{
string content = MyReadFile(L"C:\\test\\документ.txt"); // note that path is a wide string
cout << content << endl;
return 0;
}
Change your code to use wstring and save your file using a Unicode encoding (a non-UTF-8 one; use UCS-2, UTF-16 or something like that). MSVC has a non-standard overload specifically for this reason, to be able to handle non-ASCII characters in filenames:
std::string ReadFile(const std::wstring &path)
{
std::string newLine, fileContent;
std::ifstream in(path.c_str(), std::ios::in);
if (!in)
return std::string("isn't opened");
while (getline(in, newLine))
fileContent += newLine;
return fileContent;
}
int main()
{
std::wstring path = L"C:\\test\\документ.txt";
std::string content = ReadFile(path);
std::cout << content << std::endl;
}
Also, note the corrected ReadFile code.

C++ iostream UTF-16 file I/O with CR LF translation

I want to read and write UTF-16 files which use CR LF line separators (L"\r\n"), using C++ (Microsoft Visual Studio 2010) iostreams. I want every L"\n" written to the stream to be translated to L"\r\n" transparently. Using the codecvt_utf16 locale facet requires opening the fstream in ios::binary mode, losing the usual text-mode \n to \r\n translation.
std::wofstream wofs;
wofs.open("try_utf16.txt", std::ios::binary);
wofs.imbue(
std::locale(
wofs.getloc(),
new std::codecvt_utf16<wchar_t, 0x10ffff, std::generate_header>));
wofs << L"Hi!\n"; // i want a '\r' to be inserted before the '\n' in the output file
wofs.close();
I want a solution without needing extra libraries like BOOST.
I think I've found a solution myself and want to share it. Your comments are welcome!
#include <iostream>
#include <fstream>
#include <cstring> // memset
class wcrlf_filebuf : public std::basic_filebuf<wchar_t>
{
typedef std::basic_filebuf<wchar_t> BASE;
wchar_t awch[128];
bool bBomWritten;
public:
wcrlf_filebuf()
: bBomWritten(false)
{ memset(awch, 0, sizeof awch); }
wcrlf_filebuf(const wchar_t *wszFilespec,
std::ios_base::open_mode _Mode = std::ios_base::out)
: bBomWritten(false)
{
memset(awch, 0, sizeof awch);
BASE::open(wszFilespec, _Mode | std::ios_base::binary);
pubsetbuf(awch, _countof(awch));
}
wcrlf_filebuf *open(const wchar_t *wszFilespec,
std::ios_base::open_mode _Mode = std::ios_base::out)
{
BASE::open(wszFilespec, _Mode | std::ios_base::binary);
pubsetbuf(awch, _countof(awch));
return this;
}
virtual int_type overflow(int_type ch = traits_type::eof())
{
if (!bBomWritten) {
bBomWritten = true;
int_type iRet = BASE::overflow(0xfeff);
if (iRet != traits_type::not_eof(0xfeff)) return iRet;
}
if (ch == '\n') {
int_type iRet = BASE::overflow('\r');
if (iRet != traits_type::not_eof('\r')) return iRet;
}
return BASE::overflow(ch);
}
};
class wcrlfofstream : public std::wostream
{
typedef std::wostream BASE;
public:
wcrlfofstream(const wchar_t *wszFilespec,
std::ios_base::open_mode _Mode = std::ios_base::out)
: std::wostream(new wcrlf_filebuf(wszFilespec, _Mode))
{}
wcrlf_filebuf* rdbuf()
{
return dynamic_cast<wcrlf_filebuf*>(std::wostream::rdbuf());
}
void close()
{
rdbuf()->close();
}
};
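A usage sketch mirroring the original snippet:

    wcrlfofstream wofs(L"try_utf16.txt");
    wofs << L"Hi!\n";   // the overflow override writes the BOM first and expands '\n' to "\r\n"
    wofs.close();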