Can't write Chinese characters into a text file with wofstream - c++

I'm using std::wofstream to write characters to a text file. My strings can contain characters from very different languages (English to Chinese).
I want to print my vector<wstring> into that file.
If my vector contains only English characters, I can print them without a problem.
But if I write Chinese characters, my file remains empty.
I browsed through Stack Overflow, and all the answers basically said to use functions from this library:
#include <codecvt>
I can't include that library because I am using Dev-C++ version 5.11.
I added #define UNICODE to all my header files.
I guess there is a really simple solution for that problem.
It would be great, if someone could help me out.
My code:
#define UNICODE
#include <string>
#include <fstream>
using namespace std;
int main()
{
string Path = "D:\\Users\\\t\\Desktop\\korrigiert_RotCommon_zh_check_error.log";
wofstream Out;
wstring eng = L"hello";
wstring chi = L"程序";
Out.open(Path, ios::out);
//works.
Out << eng;
//fails
Out << chi;
Out.close();
return 0;
}
Kind Regards

Even if the name of the wofstream implies it's a wide char stream, it's not. It's still a char stream that uses a convert facet from a locale to convert the wchars to char.
Here is what cppreference says:
All file I/O operations performed through std::basic_fstream<CharT> use the std::codecvt<CharT, char, std::mbstate_t> facet of the locale imbued in the stream.
So you could either set the global locale to one that supports Chinese or imbue the stream. In both cases you'll get a single byte stream.
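#include <codecvt> // std::codecvt_utf8 lives here
#include <locale>
//...
const std::locale loc = std::locale(std::locale(), new std::codecvt_utf8<wchar_t>);
Out.open(Path, ios::out);
Out.imbue(loc); // from here on, wide characters are written as UTF-8 bytes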
Unfortunately, std::codecvt_utf8 is deprecated as of C++17. This MSDN Magazine article explains how to do UTF-8 conversion using MultiByteToWideChar: C++ - Unicode Encoding Conversions with STL Strings and Win32 APIs.
Here is the Microsoft/vcpkg variant of a to_utf8 conversion:
std::string to_utf8(const CWStringView w)
{
const size_t size = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, nullptr, 0, nullptr, nullptr);
std::string output;
output.resize(size - 1);
WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, output.data(), size - 1, nullptr, nullptr);
return output;
}
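If neither <codecvt> nor the Win32 API is available (the question mentions an old Dev-C++), the conversion can also be written by hand. The following is only a sketch: wide_to_utf8 is a hypothetical helper name, and it assumes wchar_t holds UTF-16 (as on Windows) or UTF-32 (as on most Unix-like systems).

```cpp
#include <string>

// Hypothetical helper (not from the answers above): encode a wide string as
// UTF-8 by hand. Assumes wchar_t holds UTF-16 or UTF-32 code units.
// Unpaired surrogates are encoded as-is rather than rejected.
std::string wide_to_utf8(const std::wstring& w)
{
    std::string out;
    for (std::size_t i = 0; i < w.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(w[i]);
        // Combine a UTF-16 surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < w.size()) {
            const unsigned long low = static_cast<unsigned long>(w[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {                       // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result can then be written through a plain std::ofstream opened with ios::binary, for example Out << wide_to_utf8(chi);, so no locale facet is involved at all.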
Alternatively, you can use a normal binary stream and write the wstring data with write().
std::ofstream Out(Path, ios::out | ios::binary);
const uint16_t bom = 0xFEFF;
Out.write(reinterpret_cast<const char*>(&bom), sizeof(bom)); // optional Byte order mark
Out.write(reinterpret_cast<const char*>(chi.data()), chi.size() * sizeof(wchar_t));

You forgot to tell your stream what locale to use:
Out.imbue(std::locale("zh_CN.UTF-8"));
You'll obviously need to include <locale> for this.

Related

Is it possible to open the same file using wfstream and fstream

I have a requirement where I need to open the same file with a wfstream instance in one part of the code and with an fstream instance in another part. I need to access a file where the username is of type std::wstring and the password is of type std::string. How do I get the values of both variables in the same part of the code?
As you can see below, I need to get the values for username and password from the file and assign them to variables.
Type conversion cannot be done, so please do not suggest that solution.
......file.txt.......
username-amritha
password-rajeevan
The code is written as follows:
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
int main()
{
std::string y;
unsigned int l;
std::wstring username;
std::wstring x=L"username";
std::wstring q;
std::string password;
std::string a="password";
std::cout<<"enter the username:";
std::wcin>>username;
std::cout<<"enter the password:";
std::cin>>password;
std::wfstream fpp("/home/aricent/Documents/testing.txt",std::ios::in | std::ios::out );
std::getline(fpp,q);
if (q.find(x, 0) != std::string::npos) {
std::wstring z=q.substr(q.find(L"-") + 1) ;
std::wcout<<"the username is:"<<z;
fpp.seekg( 0, std::ios::beg );
fpp<<q.replace(x.length()+1, z.length(), username);
}
fpp.close();
std::fstream fp("/home/aricent/Documents/testing.txt",std::ios::in | std::ios::out );
std::getline(fp,y);
if (y.find(a, 0) != std::string::npos)
{
unsigned int len=x.length()+1;
unsigned int leng=username.length();
l=len+leng;
fp.seekg(l+1);
std::string b=y.substr(y.find("-") + 1) ;
fp<<y.replace(a.length()+1, b.length(), password);
}
fp.close();
}
It's not recommended to open multiple streams to the same file simultaneously. On the other hand, if you only read from the file and never write to it (and thus use ifstream and wifstream), that's probably safe.
Alternatively, you can simply open a wfstream, read the username, close the stream, open a fstream and read the password.
If you have the choice, avoid mixed encoding files entirely.
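The "one stream after the other" idea could look roughly like this. It is only a sketch under a few assumptions: the Credentials struct and read_credentials are made-up names, the file uses the username-... / password-... layout shown in the question, and the wide stream's default locale can decode the username (true for plain ASCII).

```cpp
#include <fstream>
#include <string>

// Hypothetical result type for this sketch.
struct Credentials {
    std::wstring username;
    std::string password;
};

// Reads the username through a wide stream, closes it, and only then opens
// a narrow stream for the password, so the two streams never overlap.
Credentials read_credentials(const std::string& path)
{
    Credentials c;
    {
        std::wifstream win(path.c_str());
        std::wstring line;
        while (std::getline(win, line)) {
            if (line.compare(0, 9, L"username-") == 0) {
                c.username = line.substr(9);
                break;
            }
        }
    } // win is closed here by its destructor
    {
        std::ifstream in(path.c_str());
        std::string line;
        while (std::getline(in, line)) {
            if (line.compare(0, 9, "password-") == 0) {
                c.password = line.substr(9);
                break;
            }
        }
    }
    return c;
}
```

No string conversion is needed here, because each value is read by the stream whose character type matches the target variable.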
You should not try to open the same file with two descriptors. Even if it worked (in read-only mode, for example), the two descriptors would not be synchronised, so you would read the first characters on one and then the same characters again on the second.
So IMHO, you should stick to a single solution. My advice is to use a narrow character stream to process the file, and use a codecvt facet to convert from a narrow string to a wide wstring when you need it.
An example conversion function could be (ref: cplusplus.com: codecvt::in):
std::wstring wconv(const std::string& str, const std::locale mylocale) {
// define a codecvt facet for the locale
typedef std::codecvt<wchar_t,char,std::mbstate_t> facet_type;
const facet_type& myfacet = std::use_facet<facet_type>(mylocale);
// define a mbstate to use in codecvt::in
std::mbstate_t mystate = std::mbstate_t();
size_t l = str.length();
const char * ix = str.data(), *next; // narrow character pointers
wchar_t *wc, *wnext; // wide character pointers
// use a wide char array of same length than the narrow char array to convert
wc = new wchar_t[str.length() + 1];
// conversion call
facet_type::result result = myfacet.in(mystate, ix, ix + l,
next, wc, wc + l, wnext);
// should test for error conditions
*wnext = 0; // ensure the wide char array is properly null terminated
std::wstring wstr(wc); // store it in a wstring
delete[] wc; // destroy the char array
return wstr;
}
This code should test for abnormal conditions, and use try catch to be immune to exceptions but it is left as exercise for the reader :-)
A variant of the above using codecvt::out could be used to convert from wide string to narrow string.
In above code, I would use (assuming nconv is the function using codecvt::out to convert from wide string to narrow string):
...
#include <locale>
...
std::cin>>password;
std::locale mylocale;
std::fstream fp("/home/aricent/Documents/testing.txt",std::ios::in | std::ios::out );
std::getline(fp,y);
q = wconv(y, mylocale);
...
fp<<nconv(q.replace(x.length()+1, z.length(), username));
}
std::getline(fp, y);
...
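For completeness, the nconv function assumed above might look like the sketch below. This is one possible shape, not the answerer's actual code: it mirrors wconv but calls codecvt::out, uses a std::vector as the output buffer so nothing leaks on an exception, and throws if the conversion does not complete.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Sketch of the wide-to-narrow counterpart of wconv, using codecvt::out.
std::string nconv(const std::wstring& wstr, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& facet = std::use_facet<facet_type>(loc);
    if (wstr.empty())
        return std::string();

    // Worst case: every wide character needs max_length() narrow chars.
    int per_char = facet.max_length();
    if (per_char < 1)
        per_char = 1;
    std::vector<char> buf(wstr.size() * static_cast<std::size_t>(per_char));

    std::mbstate_t state = std::mbstate_t();
    const wchar_t* from_next = 0;
    char* to_next = 0;
    const facet_type::result r = facet.out(
        state, wstr.data(), wstr.data() + wstr.size(), from_next,
        buf.data(), buf.data() + buf.size(), to_next);
    if (r != facet_type::ok)
        throw std::runtime_error("nconv: wide-to-narrow conversion failed");

    return std::string(buf.data(), to_next);
}
```

Called as nconv(q, mylocale), it only succeeds for characters that the narrow encoding of the given locale can represent.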

Store non-English string in std::string

I have a simple string in std::wstring
std::wstring tempStr = _T("F:\\Projects\\Current_자동_\\Cam.xml");
I want to store this string in a std::string.
I have tried the below code but the result is not the same as input string
std::wstring tempStr = _T("F:\\Projects\\Current_자동_\\Cam.xml");
//setup converter
typedef std::codecvt_utf8_utf16 <wchar_t> convert_type;
std::wstring_convert<convert_type, wchar_t> converter;
//use converter (.to_bytes: wstr->str, .from_bytes: str->wstr)
std::string converted_str = converter.to_bytes( tempStr );
The Korean string present in the input string is converted to "ìžë™".
Is there any way I can get the same string in std::string?
Expected result:
converted_str should contain F:\Projects\Current_자동_\Cam.xml
Below is a screenshot of debugging showing 3 values in 3 scenarios (conversion in 3 ways). But none of them gives the desired value.
Your conversion code is fine.
In fact, in UTF-8 (the encoding of the string you store in std::string), the characters 자동 correspond to:
자 (UTF-16 0xC790) ---> UTF-8: EC 9E 90
동 (UTF-16 0xB3D9) ---> UTF-8: EB 8F 99
If you run the following program, which just prints the converted UTF-8 bytes, you get this output:
ec 9e 90 eb 8f 99
#include <iomanip> // For std::hex
#include <iostream> // For console output
#include <string> // For STL strings
#include <codecvt> // For Unicode conversions
void print_char_hex(const char ch)
{
auto * p = reinterpret_cast<const unsigned char*>(&ch);
int i = *p;
std::cout << std::hex << i << ' ';
}
int main()
{
std::wstring utf16_str = L"\xC790\xB3D9";
// setup converter
typedef std::codecvt_utf8_utf16<wchar_t> convert_type;
std::wstring_convert<convert_type, wchar_t> converter;
// use converter (.to_bytes: wstr->str, .from_bytes: str->wstr)
std::string converted_str = converter.to_bytes( utf16_str );
// Output the converted bytes (UTF-8)
for (size_t i = 0; i < converted_str.length(); ++i)
{
print_char_hex(converted_str[i]);
}
std::cout << std::endl;
}
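A plausible explanation for the "ìžë™" in the question (my assumption, the question does not say which viewer produced it) is that the debugger shows those six UTF-8 bytes decoded as Windows-1252. The sketch below, with made-up names, maps each byte to the code point Windows-1252 assigns to it and drops the bytes that have no assignment:

```cpp
#include <string>
#include <vector>

// Code points that Windows-1252 assigns to bytes 0x80..0x9F (0 = unassigned).
static const unsigned long kCp1252High[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

// Hypothetical helper: the code points a viewer would show if it decoded
// `bytes` as Windows-1252 instead of UTF-8. Unassigned bytes are skipped.
std::vector<unsigned long> view_as_cp1252(const std::string& bytes)
{
    std::vector<unsigned long> shown;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        unsigned long cp = b; // outside 0x80..0x9F the byte value is the code point
        if (b >= 0x80 && b <= 0x9F)
            cp = kCp1252High[b - 0x80];
        if (cp != 0)
            shown.push_back(cp);
    }
    return shown;
}
```

For the bytes EC 9E 90 EB 8F 99 this yields U+00EC (ì), U+017E (ž), U+00EB (ë) and U+2122 (™), which matches what the question reports. In other words, the std::string holds the right bytes; only the display interprets them with the wrong code page.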
I think the best solution would be to use the wide-char APIs to open the file, e.g. CreateFileW(...);, because then you can use the wide-char file name directly.
If this is not possible, maybe the string should not be converted to UTF8, but to the system default ANSI code page.
I think this might work:
char out[200];
const wchar_t * in = L"F:\\Projects\\Current_자동_\\Cam.xml";
// -1 means the input is null-terminated; sizeof(out) is the output capacity in bytes
WideCharToMultiByte(CP_ACP, 0, in, -1, out, sizeof(out), 0, 0);
or maybe another Korean code page:
WideCharToMultiByte(949, 0, in, -1, out, sizeof(out), 0, 0);
WideCharToMultiByte(1361, 0, in, -1, out, sizeof(out), 0, 0);
WideCharToMultiByte(10003, 0, in, -1, out, sizeof(out), 0, 0);
WideCharToMultiByte(20833, 0, in, -1, out, sizeof(out), 0, 0);
WideCharToMultiByte(20949, 0, in, -1, out, sizeof(out), 0, 0);
WideCharToMultiByte(50225, 0, in, -1, out, sizeof(out), 0, 0);
WideCharToMultiByte(50933, 0, in, -1, out, sizeof(out), 0, 0);
WideCharToMultiByte(51949, 0, in, -1, out, sizeof(out), 0, 0);
The code page ids can be found here:
http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx
good luck :-)
This works. You can tell because the conversion back to UTF-16 is valid. If you write the UTF-8 string to a file, it will also display properly. That gives you two ways of validating that it works.
// UTF16ToUTF8.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <windows.h>
#include <iostream>
#include <codecvt>
std::wstring ToUTF16(const std::string &data)
{
return std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes(data);
}
std::string ToUTF8(const std::wstring &data)
{
return std::wstring_convert<std::codecvt_utf8<wchar_t>>().to_bytes(data);
}
int _tmain(int argc, _TCHAR* argv[])
{
std::wstring u16 = L"_자동_";
std::string u8 = ToUTF8(u16);
MessageBoxW(NULL, ToUTF16(u8).c_str(), L"", 0);
std::cin.get();
return 0;
}
You can store UTF-8 in std::string as a regular char sequence. Here's a library with some useful things you may want, such as length() and everything about indexing: http://utfcpp.sourceforge.net/.
For the Windows console, you need to set the code page to 65001 so that it becomes UTF-8.
Sadly or not, std::wstring and the whole wchar_t thing don't specify any particular encoding.
By the way, you're using Managed C++, why wouldn't you use .NET Framework's System::String^? There's no problems with encodings at all. http://msdn.microsoft.com/ru-ru/library/system.string(v=vs.110).aspx?cs-save-lang=1&cs-lang=cpp
The problem is not in your string conversion code. This is a typical source file encoding problem. Visual Studio does not use Unicode by default, so you should convert your source file's encoding to UTF-8 yourself. To make this conversion, you can open your file with Notepad++ and click Encoding->Convert to UTF-8.
Note1: In VS2010 and VS2012, if you write non-ASCII characters to a source file, Visual Studio now warns you and offers to make this conversion.
Note2: From your use of the _T() macro, I predict this targets only Windows. If you try to build UTF-8 encoded source files that contain a BOM with gcc, you may get different errors. In any case, the best approach would be to read your UTF-8 encoded text data from a file at run time.

writing a string to file as a sequence of bytes

I want to write a wide string to a file as a sequence of bytes. I tried two ways, the first way:
std::wstring str = L"This is a test";
LPBYTE pBuf = (LPBYTE)str.c_str();
FILE* hFile = _wfopen( L"c:\\temp.txt", L"w" );
for( int i = 0; i<(str.length()*sizeof(wchar_t)); ++i)
fwprintf( hFile, L"%02X", pBuf[i] );
fclose(hFile);
The second way:
std::wstring str = L"This is a test";
LPBYTE pBuf = (LPBYTE)str.c_str();
HANDLE hFile = CreateFile( L"c:\\temp.txt", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL );
DWORD dwRet;
WriteFile( hFile, pBuf, str.length()*sizeof(wchar_t), &dwRet, NULL );
CloseHandle(hFile);
When I open the result file, in the first case the contents of the file are:
54006800690073002000690073002000610020007400650073007400
In the second case, the contents of the file are:
This is a test
Why doesn't the first way work as expected? It looks like both ways are equivalent.
In the first example, you used fwprintf to format the bytes as 2-digit hex strings so that is why you see hex in that file.
I suspect you should spend some time researching the ASCII code and UTF-16LE and looking at text using a hex editor.
Every file is just a sequence of bytes so your question is not well defined and makes me think you have some fundamental misunderstanding about bytes and encodings but I'm not sure what it is.
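To make the difference concrete, here is a small sketch of what the first snippet effectively computes: every byte of the UTF-16LE representation is turned into two hex digits of text. The function name utf16le_hex is made up, and char16_t is used so the result does not depend on the platform's wchar_t size.

```cpp
#include <string>

// Hypothetical helper: format each UTF-16 code unit as little-endian bytes,
// two uppercase hex digits per byte, like the "%02X" loop in the question.
std::string utf16le_hex(const std::u16string& s)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned lo = s[i] & 0xFF;        // low byte comes first in LE
        const unsigned hi = (s[i] >> 8) & 0xFF; // high byte second
        out += digits[lo >> 4];
        out += digits[lo & 0xF];
        out += digits[hi >> 4];
        out += digits[hi & 0xF];
    }
    return out;
}
```

utf16le_hex(u"This is a test") produces the same 56-character string as the first file. The second snippet writes the 28 raw bytes themselves, which an editor that recognizes UTF-16 shows as the original text.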
Assuming you want to write out the in-memory representation of the string:
#include <fstream>
#include <string>
int main (int argc,char *argv[]) {
    std::wstring str = L"This is a test";
    // Binary mode keeps the runtime from translating any 0x0A byte into CR LF on Windows.
    std::ofstream fout(R"(c:\temp.txt)", std::ios::binary);
    fout.exceptions(std::ios::badbit | std::ios::failbit);
    fout.write(reinterpret_cast<const char*>(str.data()), sizeof(wchar_t) * str.size());
}
We use ofstream because this is C++ and it's better to use RAII types instead of having to manually call fclose or CloseHandle. We use a raw string for the filename so we don't have to deal with escaping the backslash. (On platforms that use a sensible path separator ; ) the raw string here is unnecessary.) We also turn on exceptions so that we don't have to explicitly check for errors.
Then we write out the bytes using the write member function. Note that the codecvt facet is still applied to the data written using this method. This is the reason we're using ofstream instead of wofstream; The default facet for ofstream does nothing, but the default facet for wofstream would convert the wchar_t to char using the default locale.
If you simply want to write UTF-16 data out then there are better ways than trying to write the raw bytes of a wchar_t string. (wchar_t isn't necessarily UTF-16. Some platforms just happen to use UTF-16.)
One way is to use the codecvt_utf16 facet:
#include <fstream>
#include <codecvt>
int main(int argc, char *argv[]) {
std::wstring str = L"This is a test";
std::wofstream fout(R"(C:\temp.txt)");
fout.exceptions(std::ios::badbit | std::ios::failbit);
fout.imbue(std::locale(std::locale("C"), new std::codecvt_utf16<wchar_t>));
fout << str;
}
Here we write a wchar_t string normally, but we've imbued the wstream with codecvt_utf16 so that the wchar_t is converted to UTF-16. If you want little-endian UTF-16, or you want to include U+FEFF at the beginning of the file (both are common on Windows), there are flags to enable that: std::codecvt_utf16<wchar_t, 0x10FFFF, static_cast<std::codecvt_mode>(std::generate_header | std::little_endian)>. (The cast is needed because combining the two enumerators with | yields an int.) Also note that codecvt_utf16 treats wchar_t as UCS-2 or UCS-4, never UTF-16; the upshot is that this only handles the BMP on Windows.
Another option is to use normal streams and the wstring_convert facility:
#include <fstream>
#include <codecvt>
int main(int argc, char *argv[]) {
std::wstring str = L"This is a test";
std::ofstream fout(R"(C:\temp.txt)");
fout.exceptions(std::ios::badbit | std::ios::failbit);
std::wstring_convert<std::codecvt_utf16<wchar_t>, wchar_t> convert;
fout << convert.to_bytes(str);
}
This is probably the option I would choose, since it allows one to almost completely avoid wchar_t.
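If <codecvt> is not an option (it is deprecated since C++17), the UTF-16 bytes can also be produced by hand. A rough sketch, assuming wchar_t holds either UTF-16 or UTF-32 code units; to_utf16le is a hypothetical name:

```cpp
#include <string>

// Hypothetical helper: serialize a wide string as UTF-16LE bytes, optionally
// preceded by a byte order mark. Works whether wchar_t is 16 or 32 bits wide.
std::string to_utf16le(const std::wstring& w, bool with_bom)
{
    std::string out;
    if (with_bom)
        out.append("\xFF\xFE", 2); // U+FEFF in little-endian byte order
    for (std::size_t i = 0; i < w.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(w[i]);
        if (cp > 0xFFFF) {
            // Only reachable with 32-bit wchar_t: split into a surrogate pair.
            cp -= 0x10000;
            const unsigned long hi = 0xD800 + (cp >> 10);
            const unsigned long lo = 0xDC00 + (cp & 0x3FF);
            out += static_cast<char>(hi & 0xFF);
            out += static_cast<char>(hi >> 8);
            out += static_cast<char>(lo & 0xFF);
            out += static_cast<char>(lo >> 8);
        } else {
            // BMP character or an existing UTF-16 surrogate: emit as-is.
            out += static_cast<char>(cp & 0xFF);
            out += static_cast<char>(cp >> 8);
        }
    }
    return out;
}
```

The returned bytes should be written through an ofstream opened with std::ios::binary, otherwise a 0x0A byte inside the data may be expanded to CR LF on Windows.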

Buffer size for reading a UTF-8-encoded file using ICU (ICU4C)

I am trying to read a UTF-8-encoded file using ICU4C on Windows with msvc11. I need to determine the size of the buffer to build a UnicodeString. Since there is no fseek-like function in the ICU4C API I thought I could use an underlying C-file:
#include <unicode/ustdio.h>
#include <stdio.h>
/*...*/
UFILE *in = u_fopen("utfICUfseek.txt", "r", NULL, "UTF-8");
FILE* inFile = u_fgetfile(in);
fseek(inFile, 0, SEEK_END); /* Access violation here */
int size = ftell(inFile);
auto uChArr = new UChar[size];
There are two problems with this code:
It "throws" access violation at the fseek() line for some reason (Unhandled exception at 0x000007FC5451AB00 (ntdll.dll) in test.exe: 0xC0000005: Access violation writing location 0x0000000000000024.)
The size returned by the ftell function will not be the size I want because UTF-8 can use up to 4 bytes for a code point (a u8"tю" string will be of length 3).
So the questions are:
How do I determine a buffer size for a UnicodeString if I know that the input file is UTF-8-encoded?
Is there a portable way to use iostream/fstream for both reading and writing ICU's UnicodeStrings?
Edit:
Here is the possible solution (tested on msvc11 and gcc 4.8.1) based on the first answer and C++11 Standard. A few things from ISO IEC 14882 2011:
"The fundamental storage unit in the C++ memory model is the byte. A
byte is at least large enough to contain any member of the basic
execution character set (2.3) and the eight-bit code units of the
Unicode UTF-8 encoding form..."
"The basic source character set consists of 96 characters...", - 7 bits needed already
"The basic execution character set and the basic execution
wide-character set shall each contain all the members of the basic
source character set..."
"Objects declared as characters (char) shall be large enough to
store any member of the implementation’s basic character set."
So, to make this portable for platforms where the implementation defined size of char is 1 byte = 8 bits (don't know where this isn't true) we can read Unicode characters into chars using unformatted input operation:
std::ifstream is;
is.open("utfICUfSeek.txt");
is.seekg(0, is.end);
int strSize = is.tellg();
auto inputCStr = new char[strSize + 1];
inputCStr[strSize] = '\0'; //add null-character at the end
is.seekg(0, is.beg);
is.read(inputCStr, strSize);
is.seekg(0, is.beg);
UnicodeString uStr = UnicodeString::fromUTF8(inputCStr);
is.close();
What troubles me is that I have to create an additional buffer for chars and only then convert them to the required UnicodeString.
This is an alternative to using ICU.
Using the standard std::fstream you can read the whole/ part of the file into a standard std::string then iterate over that with a unicode aware iterator. http://code.google.com/p/utf-iter/
std::string get_file_contents(const char *filename)
{
std::ifstream in(filename, std::ios::in | std::ios::binary);
if (in)
{
std::string contents;
in.seekg(0, std::ios::end);
contents.reserve(in.tellg());
in.seekg(0, std::ios::beg);
contents.assign((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
in.close();
return(contents);
}
throw(errno);
}
Then in your code
std::string myString = get_file_contents( "foobar" );
unicode::iterator< std::string, unicode::utf8 /* or utf16/32 */ > iter = myString.begin();
while ( iter != myString.end() )
{
...
++iter;
}
Well, either you want to read in the whole file at once for some kind of postprocessing, in which case icu::UnicodeString is not really the best container...
#include <iostream>
#include <fstream>
#include <sstream>
int main()
{
std::ifstream in( "utfICUfSeek.txt" );
std::stringstream buffer;
buffer << in.rdbuf();
in.close();
// ...
return 0;
}
...or what you really want is to read into icu::UnicodeString just like into any other string object but went the long way around...
#include <iostream>
#include <fstream>
#include <unicode/unistr.h>
#include <unicode/ustream.h>
int main()
{
std::ifstream in( "utfICUfSeek.txt" );
icu::UnicodeString uStr;
in >> uStr;
// ...
in.close();
return 0;
}
...or I am completely missing what your problem really is about. ;)
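On the buffer-size part of the question: icu::UnicodeString stores UTF-16, and well-formed UTF-8 never needs more UTF-16 code units than it has bytes (1 to 3 bytes become one unit, 4 bytes become two). So the byte count from ftell or tellg is a safe upper bound. If the exact number is wanted, it can be counted without ICU; the following is just a sketch with a made-up function name, and it assumes the input is well-formed UTF-8.

```cpp
#include <string>

// Hypothetical helper: number of UTF-16 code units needed to hold the given
// UTF-8 text. Assumes well-formed input; never exceeds utf8.size().
std::size_t utf16_units_for_utf8(const std::string& utf8)
{
    std::size_t units = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(utf8[i]);
        if ((b & 0xC0) == 0x80)
            continue;                 // continuation byte, already counted
        units += (b >= 0xF0) ? 2 : 1; // 4-byte sequences need a surrogate pair
    }
    return units;
}
```

For the "tю" example from the question this gives 2 code units for 3 bytes.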

How to read a UCS-2 file?

I'm writing a program to get the information from a *.rc file encoded in UCS-2 Little Endian.
int _tmain(int argc, _TCHAR* argv[]) {
wstring csvLine(wstring sLine);
wifstream fin("en.rc");
wofstream fout("table.csv");
wofstream fout_rm("temp.txt");
wstring sLine;
fout << "en\n";
while(getline(fin,sLine)) {
if (sLine.find(L"IDS") == -1)
fout_rm << sLine << endl;
else
fout << csvLine(sLine);
}
fout << flush;
system("pause");
return 0;
}
The first line in "en.rc" is #include <windows.h> but sLine shows as below:
[0] 255 L'ÿ'
[1] 254 L'þ'
[2] 35 L'#'
[3] 0
[4] 105 L'i'
[5] 0
[6] 110 L'n'
[7] 0
[8] 99 L'c'
. .
. .
. .
This program works correctly for UTF-8. How can I make it work for UCS-2?
Wide streams use a wide stream buffer to access the file. The Wide stream buffer reads bytes from the file and uses its codecvt facet to convert these bytes to wide characters. The default codecvt facet is std::codecvt<wchar_t, char ,std::mbstate_t> which converts between the native character sets for wchar_t and char (i.e., like mbstowcs() does).
You're not using the native char character set, so what you want is a codecvt facet that reads UCS-2 as a multibyte sequence and converts it to wide characters.
#include <fstream>
#include <string>
#include <locale>
#include <codecvt>
#include <iostream>
int main(int argc, char *argv[])
{
    std::wifstream fin("en.rc", std::ios::binary); // You need to open the file in binary mode
    // Imbue the file stream with a codecvt facet that uses UTF-16 as the external multibyte encoding
    fin.imbue(std::locale(fin.getloc(),
        new std::codecvt_utf16<wchar_t, 0xffff, std::consume_header>));
    // ^ We set 0xFFFF as the maxcode because that's the largest that will fit in a single wchar_t
    // We use consume_header to detect and use the UTF-16 'BOM'
    // The following is not really the correct way to write Unicode output, but it's easy
    std::wstring sLine;
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
    while (std::getline(fin, sLine))
    {
        std::cout << convert.to_bytes(sLine) << '\n';
    }
}
Note that there's an issue with UTF-16 here. The purpose of wchar_t is for one wchar_t to represent one codepoint. However Windows uses UTF-16 which represents some codepoints as two wchar_ts. This means that the standard API doesn't work very well with Windows.
The consequence here is that when the file contains a surrogate pair, codecvt_utf16 will read that pair, convert it to a single codepoint value greater than 16 bits and have to truncate the value to 16 bits to stick it in a wchar_t. This means this code really is limited to UCS-2. I've set the maxcode template parameter to 0xFFFF to reflect this.
There are a number of other problems with wchar_t, and you might want to just avoid it entirely: What's “wrong” with C++ wchar_t?
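If you do want to avoid wchar_t and the codecvt facets, one option is to read the file as raw bytes and assemble the 16-bit code units yourself. A minimal sketch, assuming the bytes are UTF-16LE (the encoding in the question); decode_utf16le is a hypothetical name:

```cpp
#include <string>

// Hypothetical helper: turn raw UTF-16LE bytes (as read from a binary stream)
// into code units, dropping a leading FF FE byte order mark if present.
// A trailing odd byte, if any, is ignored.
std::u16string decode_utf16le(const std::string& bytes)
{
    std::size_t i = 0;
    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFF &&
        static_cast<unsigned char>(bytes[1]) == 0xFE)
        i = 2; // skip the BOM
    std::u16string out;
    for (; i + 1 < bytes.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        out += static_cast<char16_t>(lo | (hi << 8));
    }
    return out;
}
```

Fed with the bytes from the question (255, 254, 35, 0, 105, 0, ...), this yields the code units for "#i..." without the BOM. Surrogate pairs are passed through unchanged, so the result is UTF-16 rather than strictly UCS-2.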
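#include <cstdio>
#include <filesystem>
#include <string>
namespace fs = std::filesystem;
const wchar_t* filename = L"myfile.txt";
FILE* f = _wfopen(filename, L"rb");
auto file_size = fs::file_size(filename);
std::wstring buf;
// buf is a template object in my code, so I use decltype(buf) to decide its element type.
buf.resize((size_t)file_size / sizeof(decltype(buf)::value_type));
fread(&buf[0], 1, 2, f); // skip the UCS-2 BOM
size_t got = fread(&buf[0], 1, (size_t)file_size - 2, f); // read the remaining bytes
buf.resize(got / sizeof(decltype(buf)::value_type)); // drop the slack left by the BOM
fclose(f);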