std::wstring to QString conversion with Hiragana - c++

I'm trying to convert text containing Hiragana from a wstring to a QString, so that it can be used on a label's text property. However my code is not working and I'm not sure why that is.
The following conversion obviously tells me that I did something wrong:
std::wstring myWString = L"Some Hiragana: あ い う え お";
ui->label->setText(QString::fromStdWString(myWString));
Output: Some Hiragana: ã‚ ã„ ã† ãˆ ãŠ
I can print Hiragana on a label if I put them in a string directly:
ui->label->setText("Some Hiragana: あ い う え お");
Output: Some Hiragana: あ い う え お
That means I can avoid this problem by simply using std::string instead of std::wstring, but I'd like to know why this is happening.

VS is interpreting the file as Windows-1252 instead of UTF-8.
As an example, 'あ' in UTF-8 is E3 81 82, but the compiler is reading each byte as a single Windows-1252 char before converting it to the respective UTF-16 code points 00E3 and 201A, which works out as 'ã‚' (81 is either ignored by VS as it is reserved in Windows-1252, or not printed by Qt if VS happens to convert it to the respective C1 control character).
The direct version works because the compiler doesn't perform any conversions and leaves the string as E3 81 82.
To fix your issue you will need to inform VS that the file is UTF-8; according to other posts, one way is to ensure the file has a UTF-8 BOM.
The only portable way of fixing this is to use escape sequences instead:
L"Some Hiragana: \u3042 \u3044 \u3046 \u3048 \u304A"
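For example, a minimal sketch (my own illustration, reusing the label from the question) of the fully escaped version:
// Universal character names keep the source file pure ASCII, so the
// compiler's guess about the file's encoding no longer matters.
std::wstring myWString = L"Some Hiragana: \u3042 \u3044 \u3046 \u3048 \u304A";
ui->label->setText(QString::fromStdWString(myWString));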

Related

Issue when converting utf16 wide std::wstring to utf8 narrow std::string for rare characters

Why do some UTF-16 encoded wide strings, when converted to UTF-8 encoded narrow strings using this commonly found conversion function, produce hex values that don't appear to be correct?
std::string convert_string(const std::wstring& str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(str);
}
Hello. I have a C++ app on Windows which takes some user input on the command line. I'm using the wide-character main entry point to get the input as a UTF-16 string, which I'm converting to a UTF-8 narrow string using the above function.
This function can be found in many places online and works in almost all cases. I have, however, found a few examples where it doesn't seem to work as expected.
For example, if I input an emoji character "🤢" as a string literal (in my UTF-8 encoded cpp file) and write it to disk, the file (FILE-1) contains the following data (which are the correct UTF-8 hex values specified here https://www.fileformat.info/info/unicode/char/1f922/index.htm):
0xF0 0x9F 0xA4 0xA2
However, if I pass the emoji to my application on the command line and convert it to a UTF-8 string using the conversion function above and then write it to disk, the file (FILE-2) contains different raw bytes:
0xED 0xA0 0xBE 0xED 0xB4 0xA2
While the second file seems to indicate the conversion has produced the wrong output, if you copy and paste the hex values (in Notepad++ at least) it produces the correct emoji. Also, WinMerge considers the two files to be identical.
So to conclude, I would really like to know the following:
how the incorrect-looking converted hex values map correctly to the right utf8 character in the example above
why the conversion function converts some characters to this form while almost all other characters produce the expected raw bytes
As a bonus I would like to know if it is possible to modify the conversion function to stop it from outputting these rare characters in this form
I should note that I already have a workaround function below which uses WinAPI calls; however, using standard library calls only is the dream :)
std::string convert_string(const std::wstring& wstr)
{
    if (wstr.empty())
        return std::string();
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
    std::string strTo(size_needed, 0);
    WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
    return strTo;
}
The problem is that std::wstring_convert<std::codecvt_utf8<wchar_t>> converts from UCS-2, not from UTF-16. Characters inside the BMP (U+0000..U+FFFF) have identical encodings in both UCS-2 and UTF-16 and so will work, but characters outside the BMP (U+10000..U+10FFFF), such as your emoji, do not exist in UCS-2 at all. This means the conversion doesn't understand the character and produces incorrect UTF-8 bytes (technically, it has converted each half of the UTF-16 surrogate pair into a separate UTF-8 character).
You need to use std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> instead.
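For illustration, a minimal sketch of the fixed function (my adaptation of the code from the question, not the answerer's; note that std::wstring_convert and the codecvt facets are deprecated since C++17 but still usable):
#include <codecvt>
#include <locale>
#include <string>

std::string convert_string(const std::wstring& str)
{
    // codecvt_utf8_utf16 treats the wchar_t input as UTF-16, so a surrogate
    // pair such as the one for U+1F922 becomes a single 4-byte UTF-8 sequence.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(str);
}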
There is already an accepted answer here, but for the record, here is some additional information.
The encoding of the nauseated face emoji was introduced in Unicode in 2016. It is 4 UTF-8 bytes (0xF0 0x9F 0xA4 0xA2) or 2 UTF-16 code units (0xD83E 0xDD22).
The surprising byte sequence 0xED 0xA0 0xBE 0xED 0xB4 0xA2 in fact corresponds to a UTF-16 surrogate pair, with each surrogate encoded as if it were a character of its own:
0xED 0xA0 0xBE is the UTF-8 encoding of the high surrogate 0xD83E according to this conversion table.
0xED 0xB4 0xA2 is the UTF-8 encoding of the low surrogate 0xDD22 according to this table.
So basically, your first encoding is the direct UTF-8. The second is the UTF-8 encoding of a UCS-2 sequence that corresponds to the UTF-16 encoding of the desired character.
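As a side note, a minimal sketch (my own, for the record) of the arithmetic that maps the surrogate pair back to the code point:
// U+1F922 recovered from its UTF-16 surrogate pair (high 0xD83E, low 0xDD22).
unsigned int high = 0xD83E, low = 0xDD22;
unsigned int codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
// codepoint == 0x1F922, the nauseated face emoji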
As the accepted answer rightly pointed out, std::codecvt_utf8<wchar_t> is the culprit, because it deals with UCS-2 and not UTF-16.
It's quite astonishing to still find this obsolete encoding in standard libraries nowadays, but I suspect it is a remnant of Microsoft's lobbying in the standard committee, dating back to the old Windows support for Unicode with UCS-2.

Qt UTF-8 File to std::string Adds extra characters

I have a UTF-8 encoded text file which has characters such as ², ³, Ç and ó. When I read the file using the code below, it appears to be read appropriately (at least according to what I can see in Visual Studio's editor when viewing the contents of the contents variable)
QFile file( filePath );
if ( !file.open( QFile::ReadOnly | QFile::Text ) ) {
    return;
}
QString contents;
QTextStream stream( &file );
contents.append( stream.readAll() );
file.close();
However, as soon as the contents get converted to an std::string, additional characters are added. For example, the ² gets converted to Â², when it should just be ². This appears to happen for every non-ANSI character: the extra Â is added, which, of course, means that when a new file is saved, the characters are not correct in the output file.
I have, of course, tried simply doing toStdString(); I've also tried toUtf8 and have even tried using QTextCodec, but each fails to give the proper values.
I do not understand why going from a UTF-8 file to QString and then to std::string loses the UTF-8 characters. It should be able to reproduce the exact file that was originally read, or am I completely missing something?
As Daniel Kamil Kozar mentioned in his answer, the QTextStream does not detect the encoding and, therefore, does not actually read the file correctly. The QTextStream must have its codec set prior to reading the file in order to properly parse the characters. I added a comment to the code below to show the extra line needed.
QFile file( filePath );
if ( !file.open( QFile::ReadOnly | QFile::Text ) ) {
    return;
}
QString contents;
QTextStream stream( &file );
stream.setCodec( QTextCodec::codecForName( "UTF-8" ) ); // This is required.
contents.append( stream.readAll() );
file.close();
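(If you are on Qt 6 - an assumption beyond the original post - QTextCodec and QTextStream::setCodec are no longer available; there the stream defaults to UTF-8, and the explicit equivalent should be the following.)
stream.setEncoding( QStringConverter::Utf8 ); // Qt 6 replacement for setCodec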
What you're seeing is actually the expected behaviour.
The string Â² consists of the bytes C3 82 C2 B2 when encoded as UTF-8. Assuming that QTextStream actually recognises UTF-8 correctly (which isn't all that obvious, judging from the documentation, which only mentions character encoding detection when there's a BOM present, and you haven't said anything about the input file having a BOM), we can assume that the QString which is returned by QTextStream::readAll actually contains the string Â².
QString::toStdString() returns a UTF-8 encoded variant of the string that the given QString represents, so the return value should contain the same bytes as the input file - namely C3 82 C2 B2.
Now, about what you're seeing in the debugger:
You've stated in one of the comments that "QString only has 0xC2 0xB2 in the string (which is correct)". This is only partially true: QString uses UTF-16LE internally, which means that its internal character array contains two 16-bit values: 0x00C2 0x00B2. These, in fact, map to the characters Â and ² when each is encoded as UTF-16, which proves that the QString is constructed correctly based on the input from the file. However, your debugger seems to be smart enough to know that the bytes which make up a QString are encoded in UTF-16, and thus renders the characters correctly.
You've also stated that the debugger shows the content of the std::string returned from QString::toStdString as Ã‚Â². Assuming that your debugger uses the dreaded "ANSI code page" for resolving bytes to characters when no encoding is stated explicitly, and you're using an English-language Windows which uses Windows-1252 as its default legacy code page, everything fits into place: the std::string actually contains the bytes C3 82 C2 B2, which map to the characters Ã‚Â² in Windows-1252.
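To make the byte-level reasoning concrete, here is a small sketch (mine, assuming the Qt 5 behaviour of toStdString described above):
// The bytes C3 82 C2 B2, decoded as UTF-8, become the two characters
// U+00C2 U+00B2; re-encoding them as UTF-8 yields the same four bytes.
QString s = QString::fromUtf8("\xC3\x82\xC2\xB2");
std::string u = s.toStdString();
qDebug() << s.size();  // 2 (UTF-16 code units 0x00C2 0x00B2)
qDebug() << u.size();  // 4 (bytes C3 82 C2 B2)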
Shameless self plug: I delivered a talk about character encodings at a conference last year. Perhaps watching it will help you understand some of these problems better.
One last thing: ANSI is not an encoding. It can mean a number of different encodings based on Windows' regional settings.

Rendering font with UTF8 in SDL_ttf

I am trying to render characters using the TTF_RenderUTF8_Blended method provided by the SDL_ttf library. I implemented user input (keyboard), and pressing 'ä' or 'ß' for example works fine. These are special characters of the German language; in this case they are even in the extended 8-bit ASCII code. But even when I copy and paste some Greek letters, for example, the fonts get rendered correctly using UTF-8. (However, I am not able to render every single Unicode glyph you can find here (http://unicode-table.com/), as I noticed during testing, but I guess that is normal because the Arial font might not have every glyph. Anyway, most of the Unicode glyphs work fine.)
My problem is that when strings are passed as a parameter (const char*), the characters beyond ASCII aren't rendered correctly. So entering 'Ä', 'ß', or some other Unicode chars with the keyboard at runtime works, but passing them as a parameter to set, let's say, a title for my game inside the code like this does not work:
font_srf = TTF_RenderUTF8_Blended(font, "Hällö", font_clr);
I don't really understand why this is happening. What I get on the screen is:
H_ll_
And I am using _ to represent the typical vertical rectangle that the guy who gave the following speech used as a funny way of an introduction:
https://www.youtube.com/watch?v=MW884pluTw8
Ironically, when I use TTF_RenderText_Blended(font, "Hällö", font_clr); it works, because 'ä' and 'ö' are 8-bit extended ASCII encoded, but what I want is Unicode support, so that does not help.
Edit & Semi-Solution
I kind of (though not really well) fixed the problem. Because my input works fine, I just checked what values I get as input when I press 'ä', 'ß', ... on my keyboard, using the following code:
const char* c = input.c_str();
for (int i = 0; i < input.length(); i++)
{
    std::cout << int(c[i]) << " ";
}
Then I printed those characters in the following way:
const char char_array[] = {-61, -74, -61, -97, '\0'};
const char* char_pointer = char_array;
-61, -74 is 'ö' and -61, -97 is 'ß'.
This does fit the UTF-8 encoding, right?
U+00F6 | ö | C3 B6 (from UTF8 data table)
256-61=195 which is C3
and 256-74=182 which is B6
const char char_array[] = {0xC3, 0xB6};
This code works fine as well in case some of you were wondering. And I think this is what I will keep doing for now. Looking up the Hex-code for some Unicode glyphs isn't that hard.
But what I still can't figure out is how to get to the extended ASCII integer value of 246. Plus, isn't there a more human-friendly solution to my problem?
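(Side note, as a minimal sketch of the arithmetic rather than anything from the original post: the value 246 falls straight out of the two UTF-8 bytes.)
// Decoding a two-byte UTF-8 sequence by hand:
unsigned char b0 = 0xC3, b1 = 0xB6;
unsigned int cp = ((b0 & 0x1F) << 6) | (b1 & 0x3F); // (0x03 << 6) | 0x36 = 0xF6 = 246, i.e. 'ö'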
If you have non-ASCII characters in a source file, the character encoding of that source code file matters. So in your text editor or IDE, you need to set the character set (e.g. UTF-8) when you save it.
Alternatively, you can use the \xNN or \uNNNN escape formats to specify non-ASCII characters using only ASCII characters, so the source file encoding doesn't matter.
Microsoft doc, but not MS-specific:
https://msdn.microsoft.com/en-us/library/6aw8xdf2.aspx
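A minimal sketch of that approach applied to the call from the question (an assumption on my part; it needs C++11 or later, and in C++20 the u8 literal becomes char8_t-based and would need a cast to const char*):
// "Hällö" spelled with universal character names: U+00E4 = ä, U+00F6 = ö.
// The u8 prefix guarantees a UTF-8 encoded literal regardless of the source
// file's encoding.
font_srf = TTF_RenderUTF8_Blended(font, u8"H\u00E4ll\u00F6", font_clr);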

LPTSTR contains only one letter

I'm creating a DLL for an application. The application calls the DLL and receives a string from 8 to 50 characters in length.
The problem I'm having is that only the first letter of any message the application receives is shown.
Below is the GetMethodVersion function.
#include "stdafx.h"

STDAPI_(void) GetMethodVersion(LPTSTR out_strMethodVersion, int in_intSize)
{
    if ((int)staticMethodVersion.length() > in_intSize)
        return;

    _tcscpy_s(out_strMethodVersion, 12, _T("Test"));
    // staticMethodVersion should be used instead of _T("Test")
}
The project settings are set to Unicode.
I believe, after some research, that there is a problem with the Unicode format and how it functions. Thanks for any help you can give.
You wrote in your question that the project settings are Unicode: is this true for both the DLL and the calling EXE? Make sure that they both match.
In Unicode builds, the ugly TCHAR macros become:
LPTSTR --> wchar_t*
_tcscpy_s --> wcscpy_s
_T("Test") --> L"Test"
So you have:
STDAPI_(void) GetMethodVersion(wchar_t* out_strMethodVersion,
int in_intSize)
{
...
wcscpy_s(out_strMethodVersion, 12, L"Test");
}
Are you sure the "magic number" 12 is correct? Is the destination string buffer pointed to by out_strMethodVersion of size at least 12 wchar_ts (including the terminating NUL)?
Then, have a look at the call site (which you haven't shown).
How do you print the returned string? Maybe you are using an ANSI char function, so the returned string is misinterpreted as a char* ANSI string; the first 0x00 byte of the Unicode UTF-16 string is then misinterpreted as a NUL terminator at the call site, and the string gets truncated at the first character when printed.
Text:           T     e     s     t     NUL
UTF-16 bytes:   54 00 65 00 73 00 74 00 00 00    (hex)
                   ^
                   |
    First 00 byte misinterpreted as a NUL terminator
    in a char* ANSI string, so only 'T' (the first
    character) gets printed.
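A hypothetical call site (my assumption, since the real one wasn't shown) that would reproduce exactly this truncation:
wchar_t buffer[12];
GetMethodVersion(buffer, 12);
// Wrong: an ANSI print function stops at the first 00 byte and shows only "T".
printf("%s\n", (const char*)buffer);
// Right: a wide-character function prints the whole string, "Test".
// wprintf(L"%ls\n", buffer);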
EDIT
The fact that you clarified in the comments that:
I switched the DLL to ANSI, the EXE apparently was that as well, though the exe was documented as Unicode.
makes me think that the EXE assumes the UTF-8 Unicode encoding.
Just as in ANSI strings, a 0x00 byte in UTF-8 is a string NUL terminator, so the previous analysis of a UTF-16 0x00 byte (in a wchar_t) being misinterpreted as a string NUL terminator applies here as well.
Note that pure ASCII is a proper subset of UTF-8: so your code may work if you just use pure ASCII characters (like in "Test") and pass them to the EXE.
However, if the EXE is documented to be using Unicode UTF-8, you may want to Do The Right Thing and return a UTF-8 string from the DLL.
The string is returned via char* (as for ANSI strings), but it's important that you make sure that UTF-8 is the encoding used by the DLL to return that string, to avoid subtle bugs in the future.
While the general terminology used in Windows APIs and Visual Studio is "Unicode", it actually means the UTF-16 Unicode encoding in those contexts.
However, UTF-16 is not the only Unicode encoding available. For example, to exchange text on the Internet, the UTF-8 encoding is widely used. In your case, it sounds like your EXE is expecting a Unicode UTF-8 string.
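A minimal sketch of what that could look like (an assumption on my part, reusing the WideCharToMultiByte approach from the workaround shown earlier on this page, not code from the original answer):
// Return the version string to the EXE as UTF-8 in a plain char buffer.
STDAPI_(void) GetMethodVersion(char* out_strMethodVersion, int in_intSize)
{
    const std::wstring version = L"Test";   // stands in for staticMethodVersion
    // Passing -1 makes WideCharToMultiByte include the terminating NUL;
    // the call fails (returns 0) if the destination buffer is too small.
    WideCharToMultiByte(CP_UTF8, 0, version.c_str(), -1,
                        out_strMethodVersion, in_intSize, NULL, NULL);
}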
It is too late to #define UNICODE after #include "stdafx.h". It should be defined before the first #include in stdafx.h itself. But the proper way is to set it in the project properties (menu Project > Properties > Configuration Properties > General > Character Set > "Use Unicode Character Set").

QTextBrowser not displaying non-English characters

I'm developing a Qt GUI application to parse a custom Windows binary file that stores Unicode text as wchar_t (default UTF-16 encoding). I've constructed a QString using QString::fromWCharArray and passed it to QTextBrowser::insertPlainText like this:
wchar_t *p = ...; // pointer to a wchar_t string in the binary file
QString t = QString::fromWCharArray(p);
ui.logBrowser->insertPlainText(t);
The displayed text displays ASCII characters correctly, but non-ASCII characters are displayed as a rectangular box instead. I've followed the code in a debugger and p points to a valid wchar_t string and the constructed QString t is also a valid string matching the wchar_t string. The problem happens when printing it out on a QTextBrowser.
How do I fix this?
First of all, read the documentation. Depending on the system, wchar_t uses a different encoding, UCS-4 or UTF-16! What is the size of wchar_t on your platform?
Secondly, there is an alternative API: try QString::fromUtf16.
Finally, what kind of characters are you using? Hebrew/Cyrillic/Japanese/...? Are you sure those characters are supported by the font you are using?
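For example, a minimal sketch of the QString::fromUtf16 alternative mentioned above (assuming the data in the file really is UTF-16 and NUL-terminated; rawBytesFromFile is a hypothetical name for the pointer into the file data):
// char16_t is always 16 bits, so the decoding no longer depends on the
// platform's wchar_t size.
const char16_t *p16 = reinterpret_cast<const char16_t*>(rawBytesFromFile);
QString t = QString::fromUtf16(p16);
ui.logBrowser->insertPlainText(t);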