std::fstream different behavior on MSVC and g++ with UTF-8

std::string path("path.txt");
std::fstream f(path);
f.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
std::string lcpath;
f >> lcpath;
Reading UTF-8 text from path.txt with this code fails when compiled with MSVC on Windows, in the sense that lcpath does not end up holding the text as UTF-8.
The code below works correctly on Linux when compiled with g++.
std::string path("path.txt");
std::fstream ff;
ff.open(path.c_str());
std::string lcpath;
ff >> lcpath;
Does fstream on Windows (MSVC) assume ASCII only by default?
In the first snippet, if I replace string with wstring and fstream with wfstream, lcpath gets the correct value on Windows as well.
EDIT: If I convert the read lcpath using MultiByteToWideChar(), I get the correct representation. But why can't I directly read a UTF-8 string into std::string on Windows?
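For reference, a minimal sketch of that MultiByteToWideChar() step (the helper name fromUtf8 is illustrative, not part of the original code):
#include <windows.h>
#include <string>

std::wstring fromUtf8(const std::string &utf8)
{
    // First call computes the required length, including the terminator.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, NULL, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], len);
    wide.resize(len - 1); // drop the NUL terminator the API wrote
    return wide;
}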

Imbuing an already-opened file can be problematic:
http://www.cplusplus.com/reference/fstream/filebuf/imbue/
If loc is not the same locale as currently used by the file stream buffer, either the internal position pointer points to the beginning of the file, or its encoding is not state-dependent. Otherwise, it causes undefined behavior.
The problem here is that when a file is opened and the file has a BOM marker in it, the BOM will usually be read from the file by the currently installed locale. The position pointer is then no longer at the beginning of the file, and we have undefined behavior.
To make sure your locale is set correctly, you must imbue it before opening the file.
std::fstream f;
// Imbue before open, so the locale changes while the stream is at position 0.
// Note: std::locale::empty() is an MSVC extension, not standard C++.
f.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
std::string path("path.txt");
f.open(path);
std::string lcpath;
f >> lcpath;
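Note, though, that a plain std::fstream is a narrow stream: its buffer converts through the codecvt<char, char, mbstate_t> facet, so a wchar_t facet such as codecvt_utf8<wchar_t> is never consulted. That matches the observation above that switching to wstring/wfstream works. A minimal, more portable sketch using a wide stream (std::locale::empty() is an MSVC extension, so the stream's current locale is used as the base here):
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    std::wifstream f; // wide stream: reads bytes, produces wchar_t
    // Imbue before open; codecvt_utf8 decodes the file's UTF-8 bytes.
    f.imbue(std::locale(f.getloc(), new std::codecvt_utf8<wchar_t>));
    f.open("path.txt");
    std::wstring lcpath;
    f >> lcpath; // lcpath now holds the decoded characters
}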

Related

File Path Name and User Input

I am trying to have the user input the file path of a file to be read by my program. However, when I try to compile the code, it fails with the following error: no matching function for call to 'std::basic_ifstream<char>::open(std::string&)'. The code compiles with no errors when I hard-code the file name instead of reading it with getline or cin. I don't know what the problem is. Any suggestions?
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

void read_and_evaluate(ifstream &input_file); // defined elsewhere in my program

int main()
{
    ifstream input_file;
    string file_name;
    cout << "Please input file path to PostFix arithmetic expressions file\n";
    getline(cin, file_name);
    input_file.open(file_name); // this is the line the error points at
    read_and_evaluate(input_file);
}
You need to compile as C++11 in order to get an ifstream constructor or open member function that takes a std::string argument. With g++ use the -std=c++11 option.
In C++03 iostream constructors only supported C strings, which you can get via std::string::c_str().
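For example, a pre-C++11 version of the open call would look like this (a minimal sketch):
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream input_file;
    std::string file_name;
    std::getline(std::cin, file_name);
    input_file.open(file_name.c_str()); // the const char* overload exists in C++03
}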
Do note that regardless of whether you use a C string or a std::string, on Windows this will fail to open a file with non-ANSI characters in the path, unless you first shorten the path to DOS 8.3 path components (which are pure ASCII).

R6025 and the std::locale

I have been struggling with some very simple code to write a std::wstring to a file. From some research on Stack Overflow, it was recommended that I should set the locale for the file before reading and writing.
typedef std::codecvt_utf8<wchar_t> ConverterType;
ConverterType converter;
// open a file in read mode.
std::wifstream readFile;
// pass in the current locale of the file and the converter
std::locale wloc(readFile.getloc(), &converter); //"en_US.UTF-8");
// imbue the file with the locale
readFile.imbue(wloc);
readFile.open(m_absoluteFilePath, std::ios::in | std::ios::binary);
if (!readFile.is_open()) return false;
// read the data from the file
std::wstring readString;
readFile >> readString;
// close the opened file.
readFile.close();
The above causes an R6025 error: pure virtual function call. I found an online answer that suggested the following.
(http://forums.codeguru.com/showthread.php?457106-Unicode-text-file)
Copied and pasted from the page:
"I've got the message "Runtime Error! Program: ... R6025 - pure virtual function call"
The reason is that the stream's destructor accesses the facet again, which has already been destructed.
You can fix the code by shifting the creation of the facet before the creation of the stream."
...
null_wcodecvt wcodec(1);
std::locale wloc(std::locale::classic(), &wcodec);
std::wfstream file;
file.imbue(wloc);
I think this is exactly my problem, because when I remove the locale code entirely and just read and write to and from the file, there is no error. The problem is that I can't figure out how to move the wloc declaration before the wifstream declaration, since wloc is built using readFile.getloc(). This seems like a bit of a chicken-and-egg situation to me.
What is the correct way to do this? It also seems strange to me that this dependency on order does not seem to be well documented.
(On an additional note, I am reading and writing JSON strings to the file. My understanding is that std::ios::binary simplifies handling of all the escape characters in the JSON string? Perhaps this is untrue and a poor choice, as I have not been able to test it, but I wanted to explain my use of std::ios::binary above.)
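A minimal sketch of the fix the quote describes, assuming the classic locale is an acceptable base (the function signature is illustrative): allocate the facet with new so the locale owns and deletes it, which removes the destruction-order problem, and build the locale before the stream without touching readFile.getloc().
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

bool readJson(const std::string &path, std::wstring &readString)
{
    // Facet allocated with new (reference count 0): the locale takes
    // ownership and destroys it, so it cannot die before the stream does.
    std::locale wloc(std::locale::classic(), new std::codecvt_utf8<wchar_t>);
    std::wifstream readFile;
    readFile.imbue(wloc); // imbue before open
    readFile.open(path.c_str(), std::ios::in | std::ios::binary);
    if (!readFile.is_open()) return false;
    readFile >> readString;
    return true;
}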

Why getline is reading my entire unicode file

I have seen many threads, but none of the solutions given works for me, so if anyone can throw some light on this, that would be great.
I am reading a Unicode file, and when I try to scan it line by line using getline, it scans the entire file at once. Since the objects are wstring, getline does not let me pass the delimiter I want: it accepts only a wchar_t delimiter, into which my delimiter does not fit (L'\0' does not work, as I am reading in binary mode). The code snippet is below.
Platform: Windows, Visual Studio 2010
Unicode encoding: UTF-16
wifstream fin("profiles1.prd", ios_base::binary); //open a file
wofstream fout("DXout.txt",ios_base::binary); // this dumps the parsing ouput
fin.imbue(std::locale(fin.getloc(),new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
fout.imbue(std::locale(fin.getloc(),new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
wstring stream;
getline(fin,stream);
I am hopeful this is what you're looking for:
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff,
    std::codecvt_mode(std::little_endian | std::consume_header)>));
Windows is little-endian, so to both skip the BOM and read UTF-16 you need to combine the little_endian and consume_header flags, casting the result back to a single codecvt_mode value.
Hope it helps you out. I leave the writing side to you.
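Putting it together, a minimal sketch of the reading loop (file name taken from the question; error handling omitted):
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    std::wifstream fin("profiles1.prd", std::ios_base::binary);
    fin.imbue(std::locale(fin.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            std::codecvt_mode(std::little_endian | std::consume_header)>));
    std::wstring line;
    while (std::getline(fin, line)) {
        // each iteration now sees one line of the UTF-16LE file
    }
}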

c++ fstreams open file with utf-16 name

At first I built my project on Linux and it was built around streams.
When I started moving to Windows I ran into some problems.
The name of the file that I want to open is in UTF-16 encoding.
I try to do it using fstream:
QString source; // content of source is shown on image
char *op= (char *) source.data();
fstream stream(op, std::ios::in | std::ios::binary);
But the file cannot be opened. When I check it with
if(!stream.is_open())
{} // this is always true, but the file does indeed exist.
I tried to do it with wstream, but the result is the same, because wstream also accepts only a char *. As I understand it, this is because the string passed as char * is truncated at the first zero byte, so only the first character of the file's name is passed and the file is never found. I know that wfstream in Visual Studio can accept a wchar_t * as the name, but my compiler of choice is MinGW, and its streams don't have such a constructor signature.
Is there any way to do it with STL streams?
ADDITION
That string can contain not only ASCII symbols; it can contain Russian, German, and Chinese symbols simultaneously. I don't want to limit myself to ASCII or a local encoding only.
NEXT ADDITION
Also, the data can be anything, not only ASCII; otherwise I wouldn't bother with Unicode at all.
Thanks in advance!
Boost.Filesystem, especially its fstream.hpp header, may help.
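For instance, a sketch assuming Boost.Filesystem v3 (the file name is illustrative):
#include <boost/filesystem/fstream.hpp>
#include <boost/filesystem/path.hpp>

int main()
{
    // boost::filesystem::path accepts wide strings, so the UTF-16
    // name reaches the OS intact on Windows.
    boost::filesystem::path p(L"файл.txt");
    boost::filesystem::ifstream stream(p, std::ios::in | std::ios::binary);
    return stream.is_open() ? 0 : 1;
}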
If you are using MSVC and its implementation of the C++ standard library, something like this should work:
QString source; // content of source is shown on image
wchar_t *op = reinterpret_cast<wchar_t *>(source.data()); // QChar and wchar_t are both 16-bit on Windows
fstream stream(op, std::ios::in | std::ios::binary);
This works because the Microsoft C++ standard library has an extension that allows an fstream to be opened with a wide character string.
Convert the UTF-16 string using WideCharToMultiByte with CP_ACP before passing the filename to fstream.
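A sketch of that suggestion (Windows-only; the helper name toAnsi is mine). Note that CP_ACP can only represent characters of the user's current ANSI code page, so this still cannot handle arbitrary Unicode names:
#include <windows.h>
#include <fstream>
#include <string>

std::string toAnsi(const std::wstring &wide)
{
    int len = WideCharToMultiByte(CP_ACP, 0, wide.c_str(), -1,
                                  NULL, 0, NULL, NULL);
    std::string narrow(len, '\0');
    WideCharToMultiByte(CP_ACP, 0, wide.c_str(), -1,
                        &narrow[0], len, NULL, NULL);
    narrow.resize(len - 1); // drop the NUL terminator the API wrote
    return narrow;
}

// usage: std::fstream stream(toAnsi(wideName).c_str(), std::ios::in | std::ios::binary);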

Typographic apostrophe + wide string literal broke my wofstream (C++)

I’ve just encountered some strange behaviour when dealing with the ominous typographic apostrophe ( ’ ), not the typewriter apostrophe ( ' ). Used with a wide string literal, the apostrophe breaks wofstream.
This code works
ofstream file("test.txt");
file << "A’B" ;
file.close();
==> A’B
This code works
wofstream file("test.txt");
file << "A’B" ;
file.close();
==> A’B
This code fails
wofstream file("test.txt");
file << L"A’B" ;
file.close();
==> A
This code fails...
wstring test = L"A’B";
wofstream file("test.txt");
file << test ;
file.close();
==> A
Any idea?
You should "enable" locale before using wofstream:
std::locale::global(std::locale()); // Enable locale support
wofstream file("test.txt");
file << L"A’B";
So if you have the system locale en_US.UTF-8, then the file test.txt will contain UTF-8 encoded data (5 bytes; the apostrophe alone takes 3). If you have the system locale en_US.ISO8859-1, then the stream would try to write it in that 8-bit encoding (3 bytes), except that ISO 8859-1 actually lacks this character (U+2019), so the conversion would fail at that point.
wofstream file("test.txt");
file << "A’B" ;
file.close();
This code works because "A’B" in your (UTF-8 encoded) source file is already a UTF-8 byte sequence, and the narrow stream saves it to the file byte by byte with no conversion.
Note: I assume you are using a POSIX-like OS and that your default locale is something other than "C" (which is the default locale).
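An equivalent per-stream sketch, if you would rather not change the global locale (this assumes UTF-8 is the encoding you want):
#include <codecvt>
#include <fstream>
#include <locale>

int main()
{
    std::wofstream file;
    // Imbue only this stream with a UTF-8 conversion facet, before opening.
    file.imbue(std::locale(file.getloc(), new std::codecvt_utf8<wchar_t>));
    file.open("test.txt");
    file << L"A’B"; // written as the bytes 41, E2 80 99, 42
}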
Are you sure it's not your compiler's support for Unicode characters in source files that is "broken"? What if you use \x or similar to encode the character in the string literal? Is your source file even in an encoding that maps to wchar_t for your compiler?
Try wrapping the stream insertion in a try-catch block and tell us what exception, if any, it throws.
I am not sure what is going on here, but I'll hazard a guess anyway. The typographic apostrophe has a value that does not fit into one byte. That doesn't matter for "A’B", since the stream blindly copies bytes without bothering about the underlying encoding. However, with L"A’B", an implementation-dependent encoding conversion comes into play: the stream probably cannot find a proper narrow value to store for the UTF-16 (if you are on Windows) or UTF-32 (if you are on *nix/Mac) value of this particular character.