qDebug outputs QString non-ASCII (UTF-8) symbols as \uXXXX escapes - C++

I am trying to convert a Unicode string (QString) to UTF-8.
qDebug prints the string like this:
"Fault code soap:Client: \u041F\u043E\u043B\u044C\u0437\u043E\u0432\u0430\u0442\u0435\u043B\u044C \u0441 \u0438\u0434\u0435\u043D\u0442\u0438\u0444\u0438\u043A\u0430\u0442\u043E\u0440\u043E\u043C \u00AB16163341545811\u00BB \u043D\u0435 \u043D\u0430\u0439\u0434\u0435\u043D"
I have tried using QTextCodec like this, but it outputs the same unreadable string:
QTextCodec *codec = QTextCodec::codecForName("UTF-8");
QString readableStr = QString::fromUtf8(codec->fromUnicode(str));
What am I doing wrong?
EDIT:
I wonder what is going on, but it happens whenever qDebug prints a QString...
The following code
qDebug() << QString::fromUtf8("тест") << "тест" << QString::fromUtf8("тест").toUtf8().data();
prints out:
"\u0442\u0435\u0441\u0442" тест тест

I don't know the exact thread on the Qt mailing list, but this behaviour was introduced recently: qDebug is originally meant to dump an object's internal state. Non-ASCII characters are now escaped like this, which most people seem to dislike, but the developer or maintainer responsible wants to keep it this way.

I assume that the variable str has type QString. Your readableStr has the same contents as str. UTF-8 is an encoding of Unicode strings that uses 8-bit code units, which can be stored in a QByteArray. qDebug uses special functions to display strings in a console or debug buffer so that you can inspect their contents. If you put the QString into any GUI element, you will see the expected readable content.

Related

QTextBrowser not displaying non-english characters

I'm developing a Qt GUI application to parse a custom Windows binary file that stores Unicode text as wchar_t (UTF-16 by default). I've constructed a QString using QString::fromWCharArray and passed it to QTextBrowser::insertPlainText like this:
wchar_t *p = /* pointer to a wchar_t string in the binary file */;
QString t = QString::fromWCharArray(p);
ui.logBrowser->insertPlainText(t);
The displayed text shows ASCII characters correctly, but non-ASCII characters are displayed as rectangular boxes instead. I've followed the code in a debugger: p points to a valid wchar_t string, and the constructed QString t is also a valid string matching the wchar_t string. The problem happens when it is displayed in the QTextBrowser.
How do I fix this?
First of all, read the documentation: depending on the system, wchar_t holds a different encoding, UCS-4 or UTF-16. What is the size of wchar_t on your platform?
Secondly, there is an alternative API: try QString::fromUtf16.
Finally, what kind of characters are you using? Hebrew/Cyrillic/Japanese/...? Are you sure those characters are supported by the font you are using?

Check if UTF-8 string is valid in Qt

In Qt, is there a way to check if a byte array is a valid UTF-8 sequence?
It seems that QString::fromUtf8() silently suppresses or replaces invalid sequences without notifying the caller that there were any. This is from its documentation:
However, invalid sequences are possible with
UTF-8 and, if any such are found, they will be replaced with one or
more "replacement characters", or suppressed.
Try QTextCodec::toUnicode, passing a ConverterState instance. ConverterState has members such as invalidChars. They are not documented via Doxygen, but I assume they are public API, as they are mentioned in the QTextCodec documentation.
Sample code:
QTextCodec::ConverterState state;
QTextCodec *codec = QTextCodec::codecForName("UTF-8");
const QString text = codec->toUnicode(byteArray.constData(), byteArray.size(), &state);
if (state.invalidChars > 0) {
    qDebug() << "Not a valid UTF-8 sequence.";
}
The ConverterState approach, already reported here by Frank Osterfeld, works even if the text has no BOM (Byte Order Mark) (*).
(*) Unlike QTextCodec::codecForUtfText(), which needs a BOM in the text in order to detect that it's UTF-8.

Can't read unicode (japanese) from a file

Hi, I have a file containing Japanese text, saved as a Unicode file.
I need to read from the file and display the information to the standard output.
I am using Visual Studio 2008.
int main()
{
    wstring line;
    wifstream myfile("D:\sample.txt"); // file containing Japanese characters, saved as a Unicode file
    //myfile.imbue(locale("Japanese_Japan"));
    if(!myfile)
        cout << "While opening a file an error is encountered" << endl;
    else
        cout << "File is successfully opened" << endl;
    //wcout.imbue(locale("Japanese_Japan"));
    while ( myfile.good() )
    {
        getline(myfile, line);
        wcout << line << endl;
    }
    myfile.close();
    system("PAUSE");
    return 0;
}
This program generates some random output and I don't see any japanese text on the screen.
Oh boy. Welcome to the Fun, Fun world of character encodings.
The first thing you need to know is that your console is not Unicode on Windows. The only way you'll ever see Japanese characters in a console application is if you set your non-Unicode (ANSI) locale to Japanese. That will also make backslashes look like yen symbols and break paths containing European accented characters for programs using the ANSI Windows API (which was supposed to have been deprecated when Windows XP came around, but is still used to this day...).
So first thing you'll want to do is build a GUI program instead. But I'll leave that as an exercise to the interested reader.
Second, there are a lot of ways to represent text. You first need to figure out the encoding in use. Is it UTF-8? UTF-16 (and if so, little or big endian)? Shift-JIS? EUC-JP? You can only use a wstream to read the file directly if it is in little-endian UTF-16, and even then you need to futz with the stream's internal buffer. Anything other than UTF-16 and you'll get unreadable junk. And all of this is only the case on Windows; other OSes may have a different wstream representation. It's best not to use wstreams at all, really.
So, let's assume it's not UTF-16 (for full generality). In this case you must read it as a char stream - not using a wstream. You must then convert this character string into UTF-16 (assuming you're using windows! Other OSes tend to use UTF-8 char*s). On windows this can be done with MultiByteToWideChar. Make sure you pass in the right code page value, and CP_ACP or CP_OEMCP are almost always the wrong answer.
Now, you may be wondering how to determine which code page (ie, character encoding) is correct. The short answer is you don't. There is no prima facie way of looking at a text string and saying which encoding it is. Sure, there may be hints - eg, if you see a byte order mark, chances are it's whatever variant of unicode makes that mark. But in general, you have to be told by the user, or make an attempt to guess, relying on the user to correct you if you're wrong, or you have to select a fixed character set and don't attempt to support any others.
Someone here had the same problem with Russian characters (he's using basic_ifstream<wchar_t>, which should be the same as wifstream according to this page). The comments on that question also link to this, which should help you further.
If I understood everything correctly, it seems that wifstream reads the characters correctly, but your program then tries to convert them to whatever locale it is running in.
Two errors:
std::wifstream(L"D:\\sample.txt");
And do not mix cout and wcout.
Also check that your file is encoded in UTF-16 Little-Endian; if not, you will have trouble reading it.
wfstream uses wfilebuf for the actual reading and writing of the data. wfilebuf defaults to using a char buffer internally which means that the text in the file is assumed narrow, and converted to wide before you see it. Since the text was actually wide, you get a mess.
The solution is to replace the wfilebuf buffer with a wide one.
You probably also need to open the file as binary.
const size_t bufsize = 128;
wchar_t buffer[bufsize];
wifstream myfile("D:\\sample.txt", ios::binary);
myfile.rdbuf()->pubsetbuf(buffer, bufsize);
Make sure the buffer outlives the stream object!
See details here: http://msdn.microsoft.com/en-us/library/tzf8k3z8(v=VS.80).aspx

Size of UTF-8 string in bytes

I am using QString to store strings, and now I need to store these strings (converted to UTF-8 encoding) in POD structures, which look like this:
template < int N >
struct StringWrapper
{
char theString[N];
};
To extract the raw data from the QString, I do it like this:
QString str1( "abc" );
StringWrapper< 20 > str2;
strcpy( str2.theString, str1.toUtf8().constData() );
Now the question. I noticed that if I convert a plain ASCII string, it works fine:
QString str( "abc" );
std::cout<< std::string( str.toUtf8().constData() ) << std::endl;
will produce as the output :
abc
but if I use some special characters, like for example :
QString str( "Schöne Grüße" );
std::cout<< std::string( str.toUtf8().constData() ) << std::endl;
I get garbage like this:
Gr\xC3\x83\xC2\xBC\xC3\x83\xC2\x9F
I am obviously missing something, but what exactly is wrong?
ADDITIONAL QUESTION
What is the maximum size of a UTF-8 encoded character? I read here that it is 4 bytes.
The first question you need to answer is: what is the encoding of your source files? The QString constructor taking a const char* assumes Latin-1 unless you change that with QTextCodec::setCodecForCStrings(). So if your sources are in anything other than Latin-1 (say, UTF-8), you get a wrong result at this point:
QString str( "Schöne Grüße" );
Now, if your sources are in UTF-8, you need to replace it with:
QString str = QString::fromUtf8( "Schöne Grüße" );
Or, better yet, use QObject::trUtf8() wherever possible, as it gives you i18n capabilities as a free bonus.
The next thing to check is the encoding of your console. You are trying to print a UTF-8 string to it, but does it support UTF-8? If it's a Windows console, it probably doesn't. If it's something xterm-compatible using a Unicode font on a *nix system with a *.UTF-8 locale, it should be fine.
To your edited question:
I don't see any reason not to trust Wikipedia, especially when it refers to a particular standard. It also mentions that UTF-8 used to allow up to 6-byte characters, though. In my experience, 3 bytes is the maximum you get with common native-language characters like Latin/Cyrillic/Hebrew/Chinese/Japanese; 4 bytes are used for much more exotic characters. You can always check the standard if you are really curious.
The first thing that goes wrong is your stated assumption: QString doesn't store UTF-8, it stores Unicode strings (UTF-16 internally). That's why you need to call str1.toUtf8(), which creates a temporary UTF-8 byte array.
The second part is just how UTF-8 works. It's a multi-byte extension of ASCII: ü and ß aren't ASCII characters, so you should expect both of them to get a multi-byte representation. std::cout apparently doesn't expect UTF-8; this depends on the std::locale in use.

UCS-2LE text file parsing

I have a text file which was created by a Microsoft reporting tool. The file starts with the BOM bytes 0xFF 0xFE and then contains ASCII characters with NUL bytes between them (i.e. "F.i.e.l.d.1."). I can use iconv to convert this to UTF-8, with UCS-2LE as the input format and UTF-8 as the output format, and it works great.
My problem is that I want to read lines from the UCS-2LE file into strings, parse out the field values, and then write them out to an ASCII text file (i.e. "Field1 Field2"). I have tried the string and wstring-based versions of getline: while they read the string from the file, functions like substr(start, length) interpret the string as 8-bit values, so the start and length values are off.
How do I read the UCS-2LE data into a C++ string and extract the data values? I have looked at Boost and ICU, as well as numerous Google searches, but have not found anything that works. What am I missing here? Please help!
My example code looks like this:
wifstream srcFile;
srcFile.open(argv[1], ios_base::in | ios_base::binary);
..
..
wstring srcBuf;
..
..
while( getline(srcFile, srcBuf) )
{
    wstring field1;
    field1 = srcBuf.substr(12, 12);
    ...
    ...
}
So, if, for example, srcBuf contains "W.e. t.h.i.n.k. i.n. g.e.n.e.r.a.l.i.t.i.e.s." then the substr() above returns ".k. i.n. g.e" instead of "g.e.n.e.r.a.l.i.t.i.e.s.".
What I want is to read in the string and process it without having to worry about the multi-byte representation. Does anybody have an example of using boost (or something else) to read these strings from the file and convert them to a fixed width representation for internal use?
BTW, I am on a Mac using Eclipse and gcc.. Is it possible my STL does not understand wide character strings?
Thanks!
Having spent some good hours tackling this question, here are my conclusions:
Reading a UTF-16 (or UCS-2LE) file is manageable in C++11; see How do I write a UTF-8 encoded string to a file in Windows, in C++
Since the codecvt_utf16 conversion facet is part of C++11, one can just use it (see the bullet below for code samples)
However, on older compilers (e.g. MSVC 2008), you can use locale and a custom codecvt facet/"recipe", as very nicely exemplified in this answer to Writing UTF16 to file in binary mode
Alternatively, one can also try this method of reading, though it did not work in my case: the output was missing lines, which were replaced by garbage characters.
I wasn't able to get this done with my pre-C++11 compiler and had to resort to scripting the task in Ruby and spawning a process (it's just for a test, so I think that kind of complication is OK there).
Hope this spares others some time, happy to help.
substr works fine for me on Linux with g++ 4.3.3. The program
#include <string>
#include <iostream>
using namespace std;
int main()
{
wstring s1 = L"Hello, world";
wstring s2 = s1.substr(3,5);
wcout << s2 << endl;
}
prints "lo, w" as it should.
However, the file reading probably does something different from what you expect: it converts the file from the locale encoding to wchar_t, which will cause each byte to become its own wchar_t. I don't think the standard library supports reading UTF-16 into wchar_t by default.