Qt UTF-8 File to std::string Adds extra characters - C++

I have a UTF-8 encoded text file, which has characters such as ², ³, Ç and ó. When I read the file using the code below, it appears to be read appropriately (at least according to what I can see in Visual Studio's editor when viewing the contents of the contents variable):
QFile file( filePath );
if ( !file.open( QFile::ReadOnly | QFile::Text ) ) {
    return;
}
QString contents;
QTextStream stream( &file );
contents.append( stream.readAll() );
file.close();
However, as soon as the contents get converted to an std::string, additional characters are added. For example, the ² gets converted to Â², when it should just be ². This appears to happen for every non-ANSI character: the extra Â is added, which, of course, means that when a new file is saved, the characters are not correct in the output file.
I have, of course, tried simply calling toStdString(); I've also tried toUtf8() and have even tried using QTextCodec, but each fails to give the proper values.
I do not understand why going from UTF-8 file, to QString, then to std::string loses the UTF-8 characters. It should be able to reproduce the exact file that was originally read, or am I completely missing something?

As Daniel Kamil Kozar mentioned in his answer, the QTextStream does not detect the encoding on its own and, therefore, does not actually read the file correctly. The QTextStream must have its codec set prior to reading the file in order to properly parse the characters. A comment has been added to the code below to show the extra line needed.
QFile file( filePath );
if ( !file.open( QFile::ReadOnly | QFile::Text ) ) {
    return;
}
QString contents;
QTextStream stream( &file );
stream.setCodec( QTextCodec::codecForName( "UTF-8" ) ); // This is required.
contents.append( stream.readAll() );
file.close();
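The same applies when writing: the output stream needs the UTF-8 codec as well, or the round trip will mangle the characters again. A minimal sketch of the write-back half, using the contents variable from above (outPath is hypothetical):

QFile outFile( outPath );
if ( outFile.open( QFile::WriteOnly | QFile::Text ) ) {
    QTextStream outStream( &outFile );
    outStream.setCodec( QTextCodec::codecForName( "UTF-8" ) ); // mirror the read side
    outStream << contents;
    outFile.close();
}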

What you're seeing is actually the expected behaviour.
The string Â² consists of the bytes C3 82 C2 B2 when encoded as UTF-8. Assuming that QTextStream actually recognises UTF-8 correctly (which isn't all that obvious, judging from the documentation, which only mentions character encoding detection when there's a BOM present, and you haven't said anything about the input file having a BOM), we can assume that the QString which is returned by QTextStream::readAll actually contains the string Â².
QString::toStdString() returns a UTF-8 encoded variant of the string that the given QString represents, so the return value should contain the same bytes as the input file - namely C3 82 C2 B2.
Now, about what you're seeing in the debugger:
You've stated in one of the comments that "QString only has 0xC2 0xB2 in the string (which is correct)". This is only partially true: QString uses UTF-16LE internally, which means that its internal character array contains two 16-bit values: 0x00C2 0x00B2. These, in fact, map to the characters Â and ² when each is encoded as UTF-16, which proves that the QString is constructed correctly based on the input from the file. However, your debugger seems to be smart enough to know that the bytes which make up a QString are encoded in UTF-16 and thus renders the characters correctly.
You've also stated that the debugger shows the content of the std::string returned from QString::toStdString as Ã‚Â². Assuming that your debugger uses the dreaded "ANSI code page" for resolving bytes to characters when no encoding is stated explicitly, and you're using an English-language Windows which uses Windows-1252 as its default legacy code page, everything fits into place: the std::string actually contains the bytes C3 82 C2 B2, which map to the characters Ã‚Â² in Windows-1252.
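If in doubt, it can help to dump the raw bytes instead of trusting any renderer. A small sketch (standard C++, assuming <cstdio> is included and the contents variable from the question):

const std::string s = contents.toStdString();
for ( unsigned char c : s ) {
    printf( "%02X ", c ); // prints C3 82 C2 B2 for the string Â²
}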
Shameless self plug : I delivered a talk about character encodings at a conference last year. Perhaps watching it will help you understand some of these problems better.
One last thing : ANSI is not an encoding. It can mean a number of different encodings based on Windows' regional settings.

Related

Why am I getting these invalid characters before my file data?

I am trying to read a file into a string, either with the getline function or with fileContents.assign( (istreambuf_iterator<char>(myFile)), (istreambuf_iterator<char>()) );
Either way gives me the output shown in the image above.
First way:
string fileContents;
ifstream myFile("textFile.txt");
while (getline(myFile, fileContents))
    cout << fileContents << endl;
Alternate way:
string fileContents;
ifstream myFile(fileName.c_str());
if (myFile.is_open())
{
    fileContents.assign( (istreambuf_iterator<char>(myFile)),
                         (istreambuf_iterator<char>()) );
    cout << fileContents;
}
The file begins with those characters, most likely a BOM to tell you what the encoding of the file is.
You probably are not able to see them in Windows Notepad because Notepad hides the encoding bytes. Get a decent text editor that lets you see the binary of the file and you will see those characters.
Your file starts with a UTF-8 BOM (bytes 0xEF 0xBB 0xBF). You are reading the file's raw bytes as-is and outputting them to a display that is using an OEM font for code page 437. To handle text files properly, especially Unicode-encoded text files, you need to read the first few bytes and check for a BOM (there are several you can look for). If one is detected, seek past the BOM and interpret the remaining bytes of the file in the encoding it indicates, in this case UTF-8.
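A sketch of that approach in standard C++ (the function name is made up; it checks for the UTF-8 BOM only and returns the remaining bytes unchanged):

#include <fstream>
#include <iterator>
#include <string>

std::string readSkippingBom( const char* path )
{
    std::ifstream in( path, std::ios::binary );
    char bom[3] = { 0 };
    in.read( bom, 3 );
    if ( !( in.gcount() == 3 && bom[0] == '\xEF' && bom[1] == '\xBB' && bom[2] == '\xBF' ) ) {
        in.clear();    // a short read may have set the eof/fail bits
        in.seekg( 0 ); // no UTF-8 BOM: rewind and keep everything
    }
    return std::string( std::istreambuf_iterator<char>( in ),
                        std::istreambuf_iterator<char>() );
}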

std::wstring to QString conversion with Hiragana

I'm trying to convert text containing Hiragana from a wstring to a QString, so that it can be used on a label's text property. However my code is not working and I'm not sure why that is.
The following conversion obviously tells me that I did something wrong:
std::wstring myWString = L"Some Hiragana: あ い う え お";
ui->label->setText(QString::fromStdWString(myWString));
Output: Some Hiragana: ã‚ ã„ ã† ãˆ ãŠ
I can print Hiragana on a label if I put them in a string directly:
ui->label->setText("Some Hiragana: あ い う え お");
Output: Some Hiragana: あ い う え お
That means I can avoid this problem by simply using std::string instead of std::wstring, but I'd like to know why this is happening.
VS is interpreting the file as Windows-1252 instead of UTF-8.
As an example, 'あ' in UTF-8 is E3 81 82, but the compiler reads each byte as a single Windows-1252 char before converting it to the respective UTF-16 code points 00E3 and 201A, which works out as 'ã‚' (81 is either ignored by VS, as it is reserved in Windows-1252, or not printed by Qt if VS happens to convert it to the respective C1 control character).
The direct version works because the compiler doesn't perform any conversions and leaves the string as E3 81 82.
To fix your issue you will need to inform VS that the file is UTF-8; according to other posts, one way is to ensure the file has a UTF-8 BOM.
The only portable way of fixing this is to use escape sequences instead:
L"Some Hiragana: \u3042 \u3044 \u3046 \u3048 \u304A"

Output data not the same as input data

I'm doing some file I/O and created the test below, but I thought testoutput2.txt would be the same as testinputdata.txt after running it?
testinputdata.txt:
some plain
text
data with
a number
42.0
testoutput2.txt (in some editors it's on separate lines, but in others it's all on one line)
some plain
਍ऀ琀攀砀琀ഀഀ
data with
਍ 愀  渀甀洀戀攀爀ഀഀ
42.0
#include <fstream>
#include <vector>

int main()
{
    // Read plain text data
    std::ifstream filein("testinputdata.txt");
    filein.seekg(0, std::ios::end);
    std::streampos length = filein.tellg();
    filein.seekg(0, std::ios::beg);
    std::vector<char> datain(length);
    filein.read(&datain[0], length);
    filein.close();

    // Write data
    std::ofstream fileoutBinary("testoutput.dat");
    fileoutBinary.write(&datain[0], datain.size());
    fileoutBinary.close();

    // Read file
    std::ifstream filein2("testoutput.dat");
    std::vector<char> datain2;
    filein2.seekg(0, std::ios::end);
    length = filein2.tellg();
    filein2.seekg(0, std::ios::beg);
    datain2.resize(length);
    filein2.read(&datain2[0], datain2.size());
    filein2.close();

    // Write data
    std::ofstream fileout("testoutput2.txt");
    fileout.write(&datain2[0], datain2.size());
    fileout.close();
}
It's working fine on my side. I have run your program on VC++ 6.0 and checked the output in Notepad and MS Word. Can you specify the name of the editor where you are facing the problem?
You can't read Unicode text into a std::vector<char>. The char data type only works with narrow strings, and my guess is that the text file you're reading in (testinputdata.txt) is saved with either UTF-8 or UTF-16 encoding.
Try using the wchar_t type for your characters, instead. It is specifically designed to work with "wide" (or Unicode) characters.
Thou shalt verify thy input was successful! Although this would sort you out, you should also note that the number of bytes in the file has no direct relationship to the number of characters being read: there can be fewer characters than bytes (think of a Unicode character that needs multiple bytes in UTF-8), or vice versa (although the latter doesn't happen with any of the Unicode encodings). All you are experiencing is that read() couldn't read as many characters as you asked it to, but write() happily wrote the junk you gave it.
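A minimal sketch of that check against the program above: capture gcount() after the read and write only what was actually read (on Windows, text-mode CR/LF translation can make this less than what tellg() reported).

filein.read( &datain[0], length );
std::streamsize actuallyRead = filein.gcount(); // may be less than length
fileoutBinary.write( &datain[0], actuallyRead );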

Size of UTF-8 string in bytes

I am using QString to store strings, and now I need to store these strings (converted to UTF-8 encoding) in POD structures, which look like this:
template < int N >
struct StringWrapper
{
    char theString[N];
};
To convert raw data from the QString, I do it like this:
QString str1( "abc" );
StringWrapper< 20 > str2;
strcpy( str2.theString, str1.toUtf8().constData() );
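Note that strcpy happily overruns theString whenever the UTF-8 data needs N or more bytes, so a length check is advisable; a hedged sketch:

QByteArray utf8 = str1.toUtf8();
if ( utf8.size() < 20 ) { // leave room for the terminating '\0' in StringWrapper< 20 >
    strcpy( str2.theString, utf8.constData() );
}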
Now the question. I noticed that if I convert from a normal string, it works fine:
QString str( "abc" );
std::cout<< std::string( str.toUtf8().constData() ) << std::endl;
will produce as the output:
abc
but if I use some special characters, for example:
QString str( "Schöne Grüße" );
std::cout<< std::string( str.toUtf8().constData() ) << std::endl;
I get garbage like this:
Gr\xC3\x83\xC2\xBC\xC3\x83\xC2\x9F
I am obviously missing something, but what exactly is wrong?
ADDITIONAL QUESTION
What is the maximum size of a UTF-8 encoded character? I read here that it is 4 bytes.
The first question you need to answer is: what is the encoding of your source files? QString's const char* constructor assumes it's Latin-1 unless you change that with QTextCodec::setCodecForCStrings(). So if your sources are in anything other than Latin-1 (say, UTF-8), you get a wrong result at this point:
QString str( "Schöne Grüße" );
Now, if your sources are in UTF-8, you need to replace it with:
QString str = QString::fromUtf8( "Schöne Grüße" );
Or, better yet, use QObject::trUtf8() wherever possible, as it gives you i18n capabilities as a free bonus.
The next thing to check is what the encoding of your console is. You are trying to print a UTF-8 string to it, but does it support UTF-8? If it's a Windows console, it probably doesn't. If it's something xterm-compatible using a Unicode font on a *nix system with a *.UTF-8 locale, it should be fine.
To your edited question:
I don't see any reason not to trust Wikipedia, especially when it refers to a particular standard. It also mentions that UTF-8 used to allow characters of up to 6 bytes, though. In my experience, 3 bytes is the maximum you get with reasonable native-language characters like Latin/Cyrillic/Hebrew/Chinese/Japanese. 4 bytes are probably used for something much more exotic; you can always check the standards if you are really curious.
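As a quick illustration of those byte counts (a sketch, assuming QDebug is included; the code points are picked arbitrarily, and the 4-byte case is built from a surrogate pair to stay portable across Qt versions):

qDebug() << QString( "a" ).toUtf8().size();                                   // 1 byte  (U+0061)
qDebug() << QString( QChar( 0x00FC ) ).toUtf8().size();                       // 2 bytes (U+00FC, ü)
qDebug() << QString( QChar( 0x3042 ) ).toUtf8().size();                       // 3 bytes (U+3042, あ)
qDebug() << ( QString( QChar( 0xD834 ) ) + QChar( 0xDD1E ) ).toUtf8().size(); // 4 bytes (U+1D11E)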
The first thing that goes wrong is your stated assumption. QString doesn't store UTF-8; it stores Unicode strings (as UTF-16). That's why you need to call str1.toUtf8(): it creates a temporary UTF-8 string.
The second part is just how UTF-8 works. It's a multi-byte extension of ASCII. ü and ß aren't ASCII characters, so you should expect both of them to get a multi-byte representation. std::cout apparently doesn't expect UTF-8; this depends on the std::locale used.

Typographic apostrophe + wide string literal broke my wofstream (C++)

I’ve just encountered some strange behaviour when dealing with the ominous typographic apostrophe ( ’ ) – not the typewriter apostrophe ( ' ). Used in a wide string literal, the apostrophe breaks wofstream.
This code works
ofstream file("test.txt");
file << "A’B" ;
file.close();
==> A’B
This code works
wofstream file("test.txt");
file << "A’B" ;
file.close();
==> A’B
This code fails
wofstream file("test.txt");
file << L"A’B" ;
file.close();
==> A
This code fails...
wstring test = L"A’B";
wofstream file("test.txt");
file << test ;
file.close();
==> A
Any idea ?
You should "enable" locale before using wofstream:
std::locale::global(std::locale()); // Enable locale support
wofstream file("test.txt");
file << L"A’B";
So if you have the system locale en_US.UTF-8, then the file test.txt will contain UTF-8 encoded data (5 bytes: the apostrophe alone takes 3); if you have the system locale en_US.ISO8859-1, then it would be written in that 8-bit encoding (3 bytes), unless ISO 8859-1 lacks such a character.
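Alternatively, you can imbue just this one stream instead of changing the global locale (same assumption of a POSIX system whose environment locale is, say, en_US.UTF-8):

std::wofstream file("test.txt");
file.imbue(std::locale("")); // use the environment's locale for this stream only
file << L"A’B";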
wofstream file("test.txt");
file << "A’B" ;
file.close();
This code works because "A’B" is actually a UTF-8 string, and you save the UTF-8 string to the file byte by byte.
Note: I assume you are using a POSIX-like OS, and that your locale is something other than "C" (the default).
Are you sure it's not your compiler's support for Unicode characters in source files that is "broken"? What if you use \x or similar to encode the character in the string literal? Is your source file even in an encoding that might map to wchar_t for your compiler?
Try wrapping the stream insertion in a try-catch block and tell us what exception, if any, it throws.
I am not sure what is going on here, but I'll hazard a guess anyway. The typographic apostrophe probably has a value that doesn't fit into one byte. This works with "A’B" since it blindly copies bytes without bothering about the underlying encoding. However, with L"A’B", an implementation-dependent encoding factor comes into play. It probably doesn't find the proper UTF-16 (if you are on Windows) or UTF-32 (if you are on *nix/Mac) value to store for this particular character.