Size of UTF-8 string in bytes - c++

I am using QString to store strings, and now I need to store these strings (converted to UTF-8 encoding) in POD structures, which look like this:
template < int N >
struct StringWrapper
{
char theString[N];
};
To convert the raw data from the QString, I do it like this:
QString str1( "abc" );
StringWrapper< 20 > str2;
strcpy( str2.theString, str1.toUtf8().constData() );
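One caveat worth adding here: strcpy() will overflow theString whenever the UTF-8 data plus the terminating '\0' is longer than N, so a length check is advisable. A minimal sketch using the StringWrapper above:
QByteArray utf8 = str1.toUtf8();
if ( utf8.size() + 1 <= int( sizeof( str2.theString ) ) ) {
    qstrcpy( str2.theString, utf8.constData() ); // qstrcpy copies the terminating '\0' as well
} else {
    // too long for the wrapper: truncate, report an error, ...
}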
Now the question. I noticed that if I convert from a normal string, it works fine:
QString str( "abc" );
std::cout<< std::string( str.toUtf8().constData() ) << std::endl;
will produce the output:
abc
but if I use some special characters, for example:
QString str( "Schöne Grüße" );
std::cout<< std::string( str.toUtf8().constData() ) << std::endl;
I get garbage like this:
Gr\xC3\x83\xC2\xBC\xC3\x83\xC2\x9F
I am obviously missing something, but what exactly is wrong?
ADDITIONAL QUESTION
What is the maximum size of a UTF-8 encoded character? I read here that it is 4 bytes.

The first question you need to answer is: what is the encoding of your source files? The QString constructor taking a const char* assumes it's Latin-1 unless you change that with QTextCodec::setCodecForCStrings(). So if your sources are in anything other than Latin-1 (say, UTF-8), you already get a wrong result at this point:
QString str( "Schöne Grüße" );
Now, if your sources are in UTF-8, you need to replace it with:
QString str = QString::fromUtf8( "Schöne Grüße" );
Or, better yet, use QObject::trUtf8() wherever possible, as it gives you i18n capabilities as a free bonus.
The next thing to check is the encoding of your console. You are trying to print a UTF-8 string to it, but does it support UTF-8? If it's a Windows console, it probably doesn't. If it's something xterm-compatible using a Unicode font on a *nix system with a *.UTF-8 locale, it should be fine.
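If it is a Windows console, one workaround (an aside, not from the original answer) is to switch its output code page to UTF-8; whether the glyphs actually render still depends on the console font. Reusing str from above:
#ifdef _WIN32
#include <windows.h>
#endif

// ... before printing:
#ifdef _WIN32
SetConsoleOutputCP( CP_UTF8 ); // ask the console to interpret the program's output bytes as UTF-8
#endif
std::cout << str.toUtf8().constData() << std::endl;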
To your edited question:
I don't see any reason not to trust Wikipedia, especially when it refers to a particular standard. It also mentions that UTF-8 used to allow up to 6-byte characters, though. In my experience, 3 bytes is the maximum you get with common native-language characters like Latin/Cyrillic/Hebrew/Chinese/Japanese. 4 bytes are used for something much more exotic; you can always check the standard if you are really curious.
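A quick way to see those sizes for yourself (a sketch; it assumes the source file is saved as UTF-8 so that fromUtf8() receives the right bytes):
#include <QDebug>
#include <QString>

qDebug() << QString::fromUtf8( "a" ).toUtf8().size();    // 1 byte  (ASCII)
qDebug() << QString::fromUtf8( "ß" ).toUtf8().size();    // 2 bytes
qDebug() << QString::fromUtf8( "€" ).toUtf8().size();    // 3 bytes
qDebug() << QString::fromUtf8( "😀" ).toUtf8().size();   // 4 bytes (outside the Basic Multilingual Plane)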

The first thing that goes wrong is your stated assumption. QString doesn't store UTF-8; it stores Unicode strings. That's why you need to call str1.toUtf8(): it creates a temporary UTF-8 byte array.
The second part is just how UTF-8 works. It's a multi-byte extension of ASCII: ü and ß aren't ASCII characters, so you should expect both of them to get a multi-byte representation. std::cout apparently doesn't expect UTF-8; this depends on the std::locale used.


Qt UTF-8 File to std::string Adds extra characters

I have a UTF-8 encoded text file, which has characters such as ², ³, Ç and ó. When I read the file using the code below, it appears to be read appropriately (at least according to what I can see in Visual Studio's editor when viewing the contents of the contents variable):
QFile file( filePath );
if ( !file.open( QFile::ReadOnly | QFile::Text ) ) {
return;
}
QString contents;
QTextStream stream( &file );
contents.append( stream.readAll() );
file.close();
However, as soon as the contents get converted to an std::string, additional characters are added. For example, the ² gets converted to Ã‚Â², when it should just be ². This appears to happen for every non-ANSI character: the extra Ã‚Â is added, which, of course, means that when a new file is saved, the characters are not correct in the output file.
I have, of course, tried simply doing toStdString(); I've also tried toUtf8() and have even tried using QTextCodec, but each fails to give the proper values.
I do not understand why going from a UTF-8 file, to QString, then to std::string loses the UTF-8 characters. It should be possible to reproduce the exact file that was originally read, or am I completely missing something?
As Daniel Kamil Kozar mentioned in his answer, QTextStream does not detect the encoding and therefore does not actually read the file correctly. The QTextStream must have its codec set prior to reading the file in order to properly parse the characters. I've added a comment to the code below to mark the extra line needed.
QFile file( filePath );
if ( !file.open( QFile::ReadOnly | QFile::Text ) ) {
return;
}
QString contents;
QTextStream stream( &file );
stream.setCodec( QTextCodec::codecForName( "UTF-8" ) ); // This is required.
contents.append( stream.readAll() );
file.close();
What you're seeing is actually the expected behaviour.
The string Â² consists of the bytes C3 82 C2 B2 when encoded as UTF-8. Assuming that QTextStream actually recognises UTF-8 correctly (which isn't all that obvious, judging from the documentation, which only mentions character encoding detection when there's a BOM present, and you haven't said anything about the input file having a BOM), we can assume that the QString which is returned by QTextStream::readAll actually contains the string Â².
QString::toStdString() returns a UTF-8 encoded variant of the string that the given QString represents, so the return value should contain the same bytes as the input file - namely C3 82 C2 B2.
Now, about what you're seeing in the debugger:
You've stated in one of the comments that "QString only has 0xC2 0xB2 in the string (which is correct)". This is only partially true: QString uses UTF-16LE internally, which means that its internal character array contains two 16-bit values: 0x00C2 0x00B2. These, in fact, map to the characters Â and ² when each is encoded as UTF-16, which proves that the QString is constructed correctly based on the input from the file. However, your debugger seems to be smart enough to know that the bytes which make up a QString are encoded in UTF-16 and thus renders the characters correctly.
You've also stated that the debugger shows the content of the std::string returned from QString::toStdString as Ã‚Â². Assuming that your debugger uses the dreaded "ANSI code page" for resolving bytes to characters when no encoding is stated explicitly, and you're using an English-language Windows which uses Windows-1252 as its default legacy code page, everything fits into place: the std::string actually contains the bytes C3 82 C2 B2, which map to the characters Ã‚Â² in Windows-1252.
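To make that concrete, here is a small sketch (hypothetical code, not from the original post) that reproduces the doubled encoding by decoding the UTF-8 bytes of ² with an 8-bit codec and re-encoding them:
#include <QDebug>

QByteArray raw( "\xC2\xB2" );                // the UTF-8 bytes of "²"
QString wrong = QString::fromLatin1( raw );  // an 8-bit codec turns them into "Â²"
QString right = QString::fromUtf8( raw );    // "²"
qDebug() << wrong.toUtf8().toHex();          // "c382c2b2" - the four bytes discussed above
qDebug() << right.toUtf8().toHex();          // "c2b2"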
Shameless self plug : I delivered a talk about character encodings at a conference last year. Perhaps watching it will help you understand some of these problems better.
One last thing : ANSI is not an encoding. It can mean a number of different encodings based on Windows' regional settings.

qDebug outputs QString UTF-8 non-ASCII symbols as \uXXXX

I am trying to convert a Unicode string (QString) to UTF-8.
qDebug prints the string like this:
"Fault code soap:Client: \u041F\u043E\u043B\u044C\u0437\u043E\u0432\u0430\u0442\u0435\u043B\u044C \u0441 \u0438\u0434\u0435\u043D\u0442\u0438\u0444\u0438\u043A\u0430\u0442\u043E\u0440\u043E\u043C \u00AB16163341545811\u00BB \u043D\u0435 \u043D\u0430\u0439\u0434\u0435\u043D"
I have tried using QTextCodec like this, but it outputs the same unreadable string:
QTextCodec *codec = QTextCodec::codecForName("UTF-8");
QString readableStr = QString::fromUtf8(codec->fromUnicode(str));
What am I doing wrong?
EDIT:
I wonder what is going on, but it happens whenever qDebug prints a QString...
The following code
qDebug() << QString::fromUtf8("тест") << "тест" << QString::fromUtf8("тест").toUtf8().data();
prints out:
"\u0442\u0435\u0441\u0442" тест тест
I don't know the exact thread on the Qt mailing list, but this behaviour was introduced relatively recently, as qDebug is originally meant to debug an object's internal state. Non-ASCII characters are now output like this, which most people seem to dislike, but the developer or maintainer responsible wants to keep it this way.
I assume that the variable str has type QString. Your readableStr has the same contents as str. UTF-8 is an encoding of Unicode strings that uses 8-bit units, which can be stored in a QByteArray. qDebug uses some special functions to display strings in a console or debugging buffer to help you understand the contents of the string. If you put a QString into any GUI element, you will see the expected readable content.
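If the goal is simply readable console output, a few common workarounds (a sketch; noquote() and qUtf8Printable() assume Qt 5.4 or later):
qDebug().noquote() << str;              // disables the quoting/escaping applied to QString
qDebug() << str.toUtf8().constData();   // hand qDebug the raw UTF-8 bytes instead of a QString
qDebug( "%s", qUtf8Printable( str ) );  // printf-style overload with Qt's qUtf8Printable macro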

Why does my colon character disappear when I go from char[] to string?

In an old Windows application I'm working on I need to get a path from an environment variable and then append onto it to build a path to a file. So the code looks something like this:
static std::string PathRoot; // Private variable stored in class' header file
char EnvVarValue[1024];
if (! GetEnvironmentVariable(L"ENV_ROOT", (LPWSTR) EnvVarValue, 1024))
{
cout << "Could not retrieve the ROOT env variable" << endl;
return;
}
else
{
PathRoot = EnvVarValue;
}
// Added just for testing purposes - Returning -1
int foundAt = PathRoot.find_first_of(':');
std::string FullFilePath = PathRoot;
FullFilePath.append("\\data\\Config.xml");
The environment value for ENV_ROOT is set to "c:\RootDir" in the Windows System Control Panel. But when I run the program I keep ending up with a string in FullFilePath that is missing the colon char and anything that followed in the root folder. It looks like this: "c\data\Config.xml".
Using the Visual Studio debugger I looked at EnvVarValue after passing the GetEnvironmentVariable line and it shows me an array that seems to have all the characters I'd expect, including the colon. But after it gets assigned to PathRoot, mousing over PathRoot only shows the C and drilling down it says something about a bad ptr. As I noted the find_first_of() call doesn't find the colon char. And when the append is done it only keeps the initial C and drops the rest of the RootDir value.
So there seems to be something about the colon character that is confusing the string constructor. Yes, there are a number of ways I could work around this by leaving the colon out of the env variable and adding it later in the code. But I'd prefer to find a way to have it read and used properly from the environment variable as it is.
You cannot simply cast a char* to a wchar_t* (by casting to LPWSTR) and expect things to work. The two are fundamentally distinct types, and in Windows they signify different encodings.
You obviously have WinAPI defines set such that GetEnvironmentVariable resolves to GetEnvironmentVariableW, which uses UTF-16 to encode the string. In practice, this means a 0 byte follows every ASCII character in memory.
You then construct a std::string out of this; it takes the first 0 byte (at char index 1) as the string terminator, so you get just "c".
You have several options:
Use std::wstring and wchar_t EnvVarValue[1024];
Call GetEnvironmentVariableA() (which uses char and the ANSI code page)
Use wchar_t EnvVarValue[1024]; and convert the returned value to a std::string using something like wcstombs.
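A minimal sketch of the third option (illustrative only; it narrows with the current C locale, which is fine for ASCII-only values such as c:\RootDir):
#include <cstdlib> // wcstombs

wchar_t EnvVarValue[1024];
if ( !GetEnvironmentVariableW( L"ENV_ROOT", EnvVarValue, 1024 ) )
{
    cout << "Could not retrieve the ROOT env variable" << endl;
    return;
}
char Narrow[1024];
wcstombs( Narrow, EnvVarValue, sizeof( Narrow ) ); // convert the wide string to a narrow one
PathRoot = Narrow;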
It seems you are building with wide-character functions (as indicated by your cast to LPWSTR). This means that the string in EnvVarValue is a wide-character string, and you should be using wchar_t and std::wstring instead.
I would guess that the contents of the array after the GetEnvironmentVariable call are actually the bytes 0x43 0x00 0x3a 0x00 0x5c 0x00 etc. (that is the wide-char representation of "C:\"). The first zero acts as the string terminator for a narrow-character string, which is why the narrow-character string PathRoot only contains the 'C'.
The problem might be that EnvVarValue is not a wchar_t array. Try using wchar_t and std::wstring.

UCS-2LE text file parsing

I have a text file which was created using some Microsoft reporting tool. The text file includes the BOM 0xFFFE at the beginning and then ASCII characters with nulls between them (i.e. "F.i.e.l.d.1."). I can use iconv to convert this to UTF-8, using UCS-2LE as the input format and UTF-8 as the output format; it works great.
My problem is that I want to read lines from the UCS-2LE file into strings, parse out the field values, and then write them out to an ASCII text file (i.e. Field1 Field2). I have tried the string- and wstring-based versions of getline; while they read the string from the file, functions like substr(start, length) interpret the string as 8-bit values, so the start and length values are off.
How do I read the UCS-2LE data into a C++ string and extract the data values? I have looked at Boost and ICU, as well as doing numerous Google searches, but have not found anything that works. What am I missing here? Please help!
My example code looks like this:
wifstream srcFile;
srcFile.open(argv[1], ios_base::in | ios_base::binary);
..
..
wstring srcBuf;
..
..
while( getline(srcFile, srcBuf) )
{
wstring field1;
field1 = srcBuf.substr(12, 12);
...
...
}
So, if, for example, srcBuf contains "W.e. t.h.i.n.k. i.n. g.e.n.e.r.a.l.i.t.i.e.s." then the substr() above returns ".k. i.n. g.e" instead of "g.e.n.e.r.a.l.i.t.i.e.s.".
What I want is to read in the string and process it without having to worry about the multi-byte representation. Does anybody have an example of using boost (or something else) to read these strings from the file and convert them to a fixed width representation for internal use?
BTW, I am on a Mac using Eclipse and gcc. Is it possible my STL does not understand wide-character strings?
Thanks!
Having spent some good hours tackling this question, here are my conclusions:
Reading a UTF-16 (or UCS-2LE) file is apparently manageable in C++11; see How do I write a UTF-8 encoded string to a file in Windows, in C++
Since C++11, codecvt_utf16 is part of the standard library (boost::locale is no longer needed for this), so one can just use it (see the bullets below, and the sketch after this answer, for code samples)
However, in older compilers (e.g. MSVC 2008), you can use locale and a custom codecvt facet/"recipe", as very nicely exemplified in this answer to Writing UTF16 to file in binary mode
Alternatively, one can also try this method of reading, though it did not work in my case. The output would be missing lines which were replaced by garbage chars.
I wasn't able to get this done with my pre-C++11 compiler and had to resort to scripting it in Ruby and spawning a process to execute my task (it's just in a test, so I think that kind of complication is OK there).
Hope this spares others some time, happy to help.
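For reference, here is a minimal sketch of the codecvt_utf16 route from the first bullet above (assuming a C++11 standard library; std::codecvt_utf16 is deprecated since C++17 but still widely available):
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
using namespace std;

int main( int argc, char* argv[] )
{
    if ( argc < 2 ) return 1;
    wifstream srcFile( argv[1], ios_base::in | ios_base::binary );
    // Attach a UTF-16 little-endian facet that also consumes the BOM.
    srcFile.imbue( locale( srcFile.getloc(),
        new codecvt_utf16< wchar_t, 0x10FFFF,
                           codecvt_mode( little_endian | consume_header ) > ) );
    wstring srcBuf;
    while ( getline( srcFile, srcBuf ) )
    {
        // One wchar_t per character now, so substr() counts characters.
        // Note: lines may keep a trailing L'\r' from Windows line endings.
        wstring field1 = srcBuf.substr( 12, 12 );
    }
}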
substr works fine for me on Linux with g++ 4.3.3. The program
#include <string>
#include <iostream>
using namespace std;
int main()
{
wstring s1 = L"Hello, world";
wstring s2 = s1.substr(3,5);
wcout << s2 << endl;
}
prints "lo, w" as it should.
However, the file reading probably does something different from what you expect. It converts the file from the locale encoding to wchar_t, which will cause each byte to become its own wchar_t. I don't think the standard library supports reading UTF-16 into wchar_t.

How can I embed unicode string constants in a source file?

I'm writing some unit tests which are going to verify our handling of various resources that use character sets other than the normal Latin alphabet: Cyrillic, Hebrew, etc.
The problem I have is that I cannot find a way to embed the expectations in the test source file: here's an example of what I'm trying to do...
///
/// Protected: TestGetHebrewConfigString
///
void CPrIniFileReaderTest::TestGetHebrewConfigString()
{
prwstring strHebrewTestFilePath = GetTestFilePath( strHebrewTestFileName );
CPrIniFileReader prIniListReader( strHebrewTestFilePath.c_str() );
prIniListReader.SetCurrentSection( strHebrewSubSection );
CPPUNIT_ASSERT( prIniListReader.GetConfigString( L"דונדארןמע" ) == L"דונהשךוק" );
}
This quite simply doesn't work. Previously I worked around this using a macro which calls a routine to transform a narrow string into a wide string (we use towstring all over the place in our applications, so it's existing code):
#define UNICODE_CONSTANT( CONSTANT ) towstring( CONSTANT )
wstring towstring( LPCSTR lpszValue )
{
wostringstream os;
os << lpszValue;
return os.str();
}
The assertion in the test above then became:
CPPUNIT_ASSERT( prIniListReader.GetConfigString( UNICODE_CONSTANT( "דונדארןמע" ) ) == UNICODE_CONSTANT( "דונהשךוק" ) );
This worked OK on OS X, but now I'm porting to Linux and I'm finding that the tests are all failing; it all feels rather hackish as well. Can anyone tell me if they have a nicer solution to this problem?
A tedious but portable way is to build your strings using numeric escape codes. For example:
wchar_t *string = L"דונדארןמע";
becomes:
wchar_t *string = L"\x05d3\x05d5\x05e0\x05d3\x05d0\x05e8\x05df\x05de\x05e2";
You have to convert all your Unicode characters to numeric escapes. That way your source code becomes encoding-independent.
You can use online tools for conversion, such as this one. It outputs the JavaScript escape format \uXXXX, so just search & replace \u with \x to get the C format.
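Note that C++ also accepts the \uXXXX form directly, as universal character names in string literals, so the JavaScript-style output can often be used as-is in a wide literal:
wchar_t const *string = L"\u05d3\u05d5\u05e0\u05d3\u05d0\u05e8\u05df\u05de\u05e2";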
You have to tell GCC which encoding your source file uses for those characters.
Use the option -finput-charset=charset, for example -finput-charset=UTF-8. Then you need to tell it about the encoding used for those string literals at runtime, which determines the values of the wchar_t items in the strings. You set that encoding using -fwide-exec-charset=charset, for example -fwide-exec-charset=UTF-32. Beware that the size of the encoding (UTF-32 needs 32 bits, UTF-16 needs 16 bits) must not exceed the size of wchar_t that gcc uses.
You can adjust that width, too. The option is called -fshort-wchar, and wchar_t will then most likely be 16 bits instead of 32 bits, its usual width for gcc on Linux; it is mainly useful for compiling programs for Wine, which are designed to be compatible with Windows.
Those options are described in more detail in man gcc, the gcc manpage.
#define UNICODE_CONSTANT( CONSTANT ) towstring( CONSTANT )
wstring towstring( LPCSTR lpszValue ) {
wostringstream os;
os << lpszValue;
return os.str();
}
This does not actually convert between Unicode encodings at all; that requires a dedicated routine. You need to keep your source code and data encodings unified (most people use UTF-8) and then convert that to the OS-specific encoding if necessary (such as UTF-16 on Windows).
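If you do need such a routine and have C++11 available, here is one possible sketch (the helper name utf8_to_wstring is illustrative, and std::wstring_convert has been deprecated since C++17, but it shows the idea):
#include <codecvt>
#include <locale>
#include <string>

// Decodes a UTF-8 byte string into a wide string.
// On Windows (16-bit wchar_t), std::codecvt_utf8_utf16<wchar_t> is the better choice.
std::wstring utf8_to_wstring( const std::string &utf8 )
{
    std::wstring_convert< std::codecvt_utf8< wchar_t > > conv;
    return conv.from_bytes( utf8 );
}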