Typographic apostrophe + wide string literal broke my wofstream (C++)

I’ve just encountered some strange behaviour when dealing with the ominous typographic apostrophe ( ’ ) – not the typewriter apostrophe ( ' ). Used with a wide string literal, the apostrophe breaks wofstream.
This code works
ofstream file("test.txt");
file << "A’B" ;
file.close();
==> A’B
This code works
wofstream file("test.txt");
file << "A’B" ;
file.close();
==> A’B
This code fails
wofstream file("test.txt");
file << L"A’B" ;
file.close();
==> A
This code fails...
wstring test = L"A’B";
wofstream file("test.txt");
file << test ;
file.close();
==> A
Any idea?

You should "enable" locale before using wofstream:
std::locale::global(std::locale("")); // Enable locale support (use the environment's default locale)
wofstream file("test.txt");
file << L"A’B";
So if your system locale is en_US.UTF-8, the file test.txt will contain UTF-8 encoded data (the apostrophe alone takes 3 bytes, so the whole string is 5 bytes). If your system locale is en_US.ISO8859-1, the stream would use that 8-bit encoding (3 bytes) instead; however, ISO 8859-1 has no such character, so the conversion would fail at that point.
wofstream file("test.txt");
file << "A’B" ;
file.close();
This code works because "A’B" is actually a UTF-8 encoded narrow string, and you save that UTF-8 string to the file byte by byte.
Note: I assume you are using a POSIX-like OS and that your environment's locale is something other than "C" (which is the default).
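If you prefer not to touch the global locale, imbuing just the stream also works. A minimal sketch, assuming a POSIX system where the "en_US.UTF-8" locale is installed (the locale name here is an assumption; use whatever your system provides):
#include <fstream>
#include <locale>

int main() {
    std::wofstream file;
    // The imbued locale's codecvt facet decides how wchar_t is converted
    // to bytes on output; imbue before anything is written.
    file.imbue(std::locale("en_US.UTF-8")); // throws std::runtime_error if not installed
    file.open("test.txt");
    file << L"A\u2019B";                    // U+2019 is the typographic apostrophe
    file.close();
    return 0;
}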

Are you sure it's not your compiler's support for Unicode characters in source files that is "broken"? What if you use \x (or a \u universal character name) to encode the character in the string literal? Is your source file even in an encoding that your compiler maps correctly to wchar_t?
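For instance, a quick way to check what the compiler actually stored in the wide literal is to print the code points; if both arrays below show U+2019 in the middle position, the literal is fine and the problem lies in the stream's locale conversion rather than in the source encoding. A hypothetical minimal test, not taken from the question:
#include <cstddef>
#include <cstdio>

int main() {
    const wchar_t lit[] = L"A’B";       // the literal exactly as typed in the source file
    const wchar_t esc[] = L"A\u2019B";  // the same string spelled with a universal character name
    for (std::size_t i = 0; lit[i] != L'\0'; ++i)
        std::printf("lit[%zu] = U+%04X\n", i, static_cast<unsigned int>(lit[i]));
    for (std::size_t i = 0; esc[i] != L'\0'; ++i)
        std::printf("esc[%zu] = U+%04X\n", i, static_cast<unsigned int>(esc[i]));
    return 0;
}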

Try wrapping the stream insertion in a try-catch block and tell us what exception, if any, it throws.
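Note that streams don't throw unless you ask them to, so enable exceptions on the stream first. A minimal sketch of that experiment, assuming the same test.txt as above:
#include <fstream>
#include <iostream>

int main() {
    std::wofstream file("test.txt");
    // Streams report errors via state bits by default; opt in to exceptions first.
    file.exceptions(std::wofstream::failbit | std::wofstream::badbit);
    try {
        file << L"A\u2019B";
        file.close();
    } catch (const std::ios_base::failure& e) {
        std::cerr << "stream failure: " << e.what() << '\n';
    }
    return 0;
}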
I am not sure what is going on here, but I'll hazard a guess anyway. The typographic apostrophe probably has a value that fits into one byte. That works with "A’B", since the narrow stream blindly copies bytes without caring about the underlying encoding. With L"A’B", however, an implementation-dependent encoding conversion comes into play, and it probably doesn't find the proper UTF-16 (if you are on Windows) or UTF-32 (if you are on *nix/Mac) value to store for this particular character.

Related

Qt UTF-8 File to std::string Adds extra characters

I have a UTF-8 encoded text file, which has characters such as ², ³, Ç and ó. When I read the file using the code below, it appears to be read appropriately (at least according to what I can see in Visual Studio when inspecting the contents variable).
QFile file( filePath );
if ( !file.open( QFile::ReadOnly | QFile::Text ) ) {
    return;
}
QString contents;
QTextStream stream( &file );
contents.append( stream.readAll() );
file.close();
However, as soon as the contents get converted to an std::string, additional characters are added. For example, the ² gets converted to Â², when it should just be ². This appears to happen for every non-ANSI character: an extra Â is added, which, of course, means that when a new file is saved, the characters are not correct in the output file.
I have, of course, tried simply doing toStdString(), I've also tried toUtf8 and have even tried using the QTextCodec but each fails to give the proper values.
I do not understand why going from UTF-8 file, to QString, then to std::string loses the UTF-8 characters. It should be able to reproduce the exact file that was originally read, or am I completely missing something?
As Daniel Kamil Kozar mentioned in his answer, the QTextStream does not detect the encoding, and, therefore, does not actually read the file correctly. The QTextStream must have its codec set prior to reading the file in order to parse the characters properly. A comment was added to the code below to show the extra line needed.
QFile file( filePath );
if ( !file.open( QFile::ReadOnly | QFile::Text ) ) {
    return;
}
QString contents;
QTextStream stream( &file );
stream.setCodec( QTextCodec::codecForName( "UTF-8" ) ); // This is required.
contents.append( stream.readAll() );
file.close();
What you're seeing is actually the expected behaviour.
The string Â² consists of the bytes C3 82 C2 B2 when encoded as UTF-8. Assuming that QTextStream actually recognises UTF-8 correctly (which isn't all that obvious, judging from the documentation, which only mentions character encoding detection when there's a BOM present, and you haven't said anything about the input file having a BOM), we can assume that the QString which is returned by QTextStream::readAll actually contains the string Â².
QString::toStdString() returns a UTF-8 encoded variant of the string that the given QString represents, so the return value should contain the same bytes as the input file - namely C3 82 C2 B2.
Now, about what you're seeing in the debugger:
You've stated in one of the comments that "QString only has 0xC2 0xB2 in the string (which is correct)". This is only partially true: QString uses UTF-16LE internally, which means that its internal character array contains two 16-bit values: 0x00C2 0x00B2. These, in fact, map to the characters Â and ² when each is encoded as UTF-16, which proves that the QString is constructed correctly based on the input from the file. However, your debugger seems to be smart enough to know that the bytes which make up a QString are encoded in UTF-16 and thus renders the characters correctly.
You've also stated that the debugger shows the content of the std::string returned from QString::toStdString as Ã‚Â². Assuming that your debugger uses the dreaded "ANSI code page" for resolving bytes to characters when no encoding is stated explicitly, and you're using an English-language Windows which uses Windows-1252 as its default legacy code page, everything fits into place: the std::string actually contains the bytes C3 82 C2 B2, which map to the characters Ã‚Â² in Windows-1252.
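One way to take the debugger's rendering out of the equation is to dump the raw bytes of the std::string in hex and compare them against the expected sequence. A minimal sketch, assuming Qt 5 where toStdString() returns UTF-8 (the function name dumpUtf8Bytes is made up for illustration):
#include <QString>
#include <cstdio>
#include <string>

void dumpUtf8Bytes(const QString &contents) {
    const std::string bytes = contents.toStdString();
    for (unsigned char c : bytes)                       // print each byte, e.g. "C3 82 C2 B2"
        std::printf("%02X ", static_cast<unsigned>(c));
    std::printf("\n");
}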
Shameless self plug: I delivered a talk about character encodings at a conference last year. Perhaps watching it will help you understand some of these problems better.
One last thing: ANSI is not an encoding. It can mean a number of different encodings based on Windows' regional settings.

Writing wide string to a file in byte mode stopped

I am writing out Unicode text (stored as a wstring) into a file and I'm doing it in byte mode, but the string in the file ends prior to the "™" character being printed. Is "™" not Unicode, or am I doing something wrong?
wofstream output;
output.open("output.txt", ofstream::binary);
wstring a = L"ABC™";
output << a;
™ is definitely Unicode. ofstream and wofstream do not write the text in UTF-8 format. You have to encode the output buffer in UTF-8 in order to see the results you're expecting. So, try using WideCharToMultiByte.
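A rough sketch of that approach, assuming Windows and with error handling omitted: convert the wstring to UTF-8 with WideCharToMultiByte, then write the resulting bytes with a narrow ofstream.
#include <windows.h>
#include <fstream>
#include <string>

int main() {
    std::wstring a = L"ABC\u2122";  // U+2122 TRADE MARK SIGN
    // First call computes the required buffer size (the -1 length includes the NUL).
    int size = WideCharToMultiByte(CP_UTF8, 0, a.c_str(), -1, nullptr, 0, nullptr, nullptr);
    std::string utf8(size, '\0');
    WideCharToMultiByte(CP_UTF8, 0, a.c_str(), -1, &utf8[0], size, nullptr, nullptr);
    utf8.resize(size - 1);          // drop the trailing NUL
    std::ofstream output("output.txt", std::ofstream::binary);
    output.write(utf8.data(), utf8.size());
    return 0;
}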
There is a common misconception about the iostream binary mode: that it is for reading and writing binary files. The iostream library works only with text files; it only reads and writes text. The only thing the "binary" mode changes is how NL (new line) characters are handled. In binary mode, no transformation occurs. In non-binary mode, writing LF characters ('\n') to a stream converts them to the platform-specific new line sequence (Unix -> LF, Windows -> CR LF ("\r\n"), classic Mac -> CR), while on reading, the platform-specific new line sequence is converted back to a single LF ('\n') character.
For everything else, nothing changes, meaning a wofstream will always convert the Unicode wide character string to a single-byte or multi-byte character stream, depending on the locale used by your process. If you have a locale of "en_US.utf8" on Linux, for example, it will be converted to UTF-8. Now, if the current locale does not have a representation for the ™ symbol, then either nothing or a '?' will be written to the file.
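If you would rather stay within the standard library, you can also give the stream a UTF-8 conversion facet explicitly. A sketch using std::codecvt_utf8 from <codecvt> (added in C++11, deprecated since C++17 but still widely available):
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main() {
    std::wofstream output("output.txt", std::ofstream::binary);
    // Replace the stream's codecvt facet so wchar_t is written out as UTF-8.
    output.imbue(std::locale(output.getloc(), new std::codecvt_utf8<wchar_t>));
    std::wstring a = L"ABC\u2122";
    output << a;
    return 0;
}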

Reading From A File Which Contains Unicode Characters

I have this huge file which contains Unicode strings at the beginning (the first ~10,000 characters or so).
I don't care about the Unicode part; the parts I'm interested in aren't Unicode. But whenever I try to read those parts I get '=', and if I load the entire file into a char array and write it (without altering the data) to some temporary file with ofstream, all I get is a text file filled with Í. If I remove the Unicode part manually, everything works fine. So it seems ifstream cannot deal with streams which contain Unicode data, but if this assumption is true, is there any way to work with this file, perhaps by introducing a new library to my project?
Thanks,
EDIT: Here's some sample code; the program reads from this file, which contains characters (some, not all) that can't be represented in ASCII.
ifstream inFile("somefile");
inFile.seekg(0,ios_base::end);
size_t size = inFile.tellg();
inFile.seekg(0,ios_base::beg);
char *book = new char[size];
inFile.read(book,size);
for (int i = 0; i < size; i++) {
cout << book[i] << " " << i << endl; //book[i] will always be '='
}
ofstream outFile("TEST.txt");
outFile.write(book,size);
outFile.close();
Keith Thompson's question is very important. Depending on which Unicode encoding is involved, writing a small C routine that reads (and discards) the Unicode characters can be trivial or slightly more complex.
Supposing the encoding is UTF-8, you will have a problem determining when to stop discarding, because ASCII is a subset of UTF-8: any time you encounter an ASCII char, you might be tempted to say "this is it, we're back in ASCII land", and the next char might still be outside the ASCII range.
So you need to read the file and determine where the last character>127 is. Anything after that is plain ASCII -- hopefully.
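A rough sketch of that idea, reusing the file name from the question's code and assuming no bytes above 127 occur after the Unicode prefix: load the whole file, remember where the last non-ASCII byte is, and treat everything after it as the plain-ASCII part.
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main() {
    std::ifstream in("somefile", std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    std::size_t start = 0;
    for (std::size_t i = 0; i < data.size(); ++i)
        if (static_cast<unsigned char>(data[i]) > 127)
            start = i + 1;                  // just past the last non-ASCII byte seen
    std::string ascii = data.substr(start); // hopefully pure ASCII from here on
    std::cout << ascii;
    return 0;
}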
A text file is generally in just one encoding: UTF-8, UTF-16 (big or little endian), UTF-32 (big or little endian), ASCII, or some other ANSI code page. Mixing encodings is only possible in custom formats.
That said, you will have to read both the data that you need and the data that you don't in the same encoding. If you know the format is UTF-8 you could, depending on what you are going to do with the data, read the file as a binary file into a char buffer piece by piece. Then you could use API(s) like strnextc (on Windows; an equivalent API should be available on other platforms) to move character by character through the buffer. Once you reach the end, you could move the remainder to the front of the buffer and load the rest of the buffer from the file.
In fact, you could use the above approach in general for any encoding. But for UTF-16, you could try using wifstream, provided the endianness of the file and of the platform you are running on are the same. And you need to check whether your implementation of wifstream is good at handling a change in endianness and can take care of the BOM (byte order mark), the 2-byte sequence ("FE FF" or "FF FE") that is generally present at the beginning of such a file, let alone surrogate pairs.

Output data not the same as input data

I'm doing some file io and created the test below, but I thought testoutput2.txt would be the same as testinputdata.txt after running it?
testinputdata.txt:
some plain
text
data with
a number
42.0
testoutput2.txt (in some editors it's on separate lines, but in others it's all on one line)
some plain
਍ऀ琀攀砀琀ഀഀ
data with
਍ 愀  渀甀洀戀攀爀ഀഀ
42.0
int main()
{
    //Read plain text data
    std::ifstream filein("testinputdata.txt");
    filein.seekg(0, std::ios::end);
    std::streampos length = filein.tellg();
    filein.seekg(0, std::ios::beg);
    std::vector<char> datain(length);
    filein.read(&datain[0], length);
    filein.close();
    //Write data
    std::ofstream fileoutBinary("testoutput.dat");
    fileoutBinary.write(&datain[0], datain.size());
    fileoutBinary.close();
    //Read file
    std::ifstream filein2("testoutput.dat");
    std::vector<char> datain2;
    filein2.seekg(0, std::ios::end);
    length = filein2.tellg();
    filein2.seekg(0, std::ios::beg);
    datain2.resize(length);
    filein2.read(&datain2[0], datain2.size());
    filein2.close();
    //Write data
    std::ofstream fileout("testoutput2.txt");
    fileout.write(&datain2[0], datain2.size());
    fileout.close();
}
It's working fine on my side. I have run your program on VC++ 6.0 and checked the output in Notepad and MS Word. Can you specify the name of the editor where you are facing the problem?
You can't read Unicode text into a std::vector<char>. The char data type only works with narrow strings, and my guess is that the text file you're reading in (testinputdata.txt) is saved with either UTF-8 or UTF-16 encoding.
Try using the wchar_t type for your characters, instead. It is specifically designed to work with "wide" (or Unicode) characters.
Thou shalt verify thy input was successful! Although this would sort you out, you should also note that the number of bytes in the file has no direct relationship to the number of characters being read: there can be fewer characters than bytes (think of a Unicode character that needs multiple bytes when encoded in UTF-8) or vice versa (although the latter doesn't happen with any of the Unicode encodings). All you are experiencing is that read() couldn't read as many characters as you asked it to, but write() happily wrote the junk you gave it.
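Applied to the test program above, that advice amounts to checking gcount() after the read and only writing what was actually read. A minimal sketch; the short-read check is the important part:
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    std::ifstream filein("testinputdata.txt");
    filein.seekg(0, std::ios::end);
    std::streampos length = filein.tellg();
    filein.seekg(0, std::ios::beg);
    std::vector<char> datain(length);
    filein.read(&datain[0], length);
    std::streamsize expected = static_cast<std::streamsize>(length);
    std::streamsize got = filein.gcount();        // may be less than expected
    if (got != expected)
        std::cerr << "short read: " << got << " of " << expected << " bytes\n";
    datain.resize(static_cast<std::size_t>(got)); // only keep what was really read
    std::ofstream fileout("testoutput.dat", std::ios::binary);
    fileout.write(&datain[0], datain.size());
    return 0;
}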

UCS-2LE text file parsing

I have a text file which was created using some Microsoft reporting tool. The text file includes the BOM 0xFFFE at the beginning and then ASCII character output with nulls between characters (i.e. "F.i.e.l.d.1."). I can use iconv to convert this to UTF-8 using UCS-2LE as the input format and UTF-8 as the output format... it works great.
My problem is that I want to read lines from the UCS-2LE file into strings, parse out the field values, and then write them out to an ASCII text file (i.e. Field1 Field2). I have tried the string- and wstring-based versions of getline – while it reads the string from the file, functions like substr(start, length) interpret the string as 8-bit values, so the start and length values are off.
How do I read the UCS-2LE data into a C++ String and extract the data values? I have looked at boost and icu as well as numerous google searches but have not found anything that works. What am I missing here? Please help!
My example code looks like this:
wifstream srcFile;
srcFile.open(argv[1], ios_base::in | ios_base::binary);
..
..
wstring srcBuf;
..
..
while( getline(srcFile, srcBuf) )
{
wstring field1;
field1 = srcBuf.substr(12, 12);
...
...
}
So, if, for example, srcBuf contains "W.e. t.h.i.n.k. i.n. g.e.n.e.r.a.l.i.t.i.e.s." then the substr() above returns ".k. i.n. g.e" instead of "g.e.n.e.r.a.l.i.t.i.e.s.".
What I want is to read in the string and process it without having to worry about the multi-byte representation. Does anybody have an example of using boost (or something else) to read these strings from the file and convert them to a fixed width representation for internal use?
BTW, I am on a Mac using Eclipse and gcc. Is it possible my STL does not understand wide character strings?
Thanks!
Having spent some good hours tackling this question, here are my conclusions:
Reading a UTF-16 (or UCS-2LE) file is apparently manageable in C++11; see How do I write a UTF-8 encoded string to a file in Windows, in C++
Since equivalent conversion facets are now part of C++11 (std::codecvt_utf16 in <codecvt>), one can just use codecvt_utf16 (see the bullet below for code samples, and the sketch after this answer)
However, in older compilers (e.g. MSVC 2008), you can use locale and a custom codecvt facet/"recipe", as very nicely exemplified in this answer to Writing UTF16 to file in binary mode
Alternatively, one can also try this method of reading, though it did not work in my case. The output would be missing lines which were replaced by garbage chars.
I wasn't able to get this done with my pre-C++11 compiler and had to resort to scripting it in Ruby and spawning a process (it's only used in a test, so I think that kind of complication is OK there) to execute my task.
Hope this spares others some time, happy to help.
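For reference, a sketch of the std::codecvt_utf16 approach mentioned above (C++11; the facet lives in <codecvt> and is deprecated since C++17 but still available). The consume_header flag consumes the 0xFFFE BOM and little_endian matches the UCS-2LE input, so substr() then works on whole characters rather than bytes:
#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main(int argc, char* argv[]) {
    if (argc < 2) return 1;
    std::wifstream srcFile(argv[1], std::ios_base::in | std::ios_base::binary);
    // Decode UTF-16LE / UCS-2LE into wchar_t, skipping the BOM if present.
    srcFile.imbue(std::locale(srcFile.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            std::codecvt_mode(std::little_endian | std::consume_header)>));
    std::wstring srcBuf;
    while (std::getline(srcFile, srcBuf)) {
        std::wstring field1 = srcBuf.substr(12, 12); // indexes characters now, not bytes
        std::wcout << field1 << L'\n';
    }
    return 0;
}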
substr works fine for me on Linux with g++ 4.3.3. The program
#include <string>
#include <iostream>
using namespace std;
int main()
{
    wstring s1 = L"Hello, world";
    wstring s2 = s1.substr(3, 5);
    wcout << s2 << endl;
}
prints "lo, w" as it should.
However, the file reading probably does something different from what you expect. It converts the file from the locale encoding to wchar_t, which will cause each byte to become its own wchar_t. I don't think the standard library supports reading UTF-16 into wchar_t.