QTextBrowser not displaying non-english characters - c++

I'm developing a Qt GUI application to parse a custom Windows binary file that stores Unicode text using wchar_t (UTF-16 by default on Windows). I've constructed a QString using QString::fromWCharArray and passed it to QTextBrowser::insertPlainText like this:
wchar_t *p = ; // pointer to a wchar_t string in the binary file
QString t = QString::fromWCharArray(p);
ui.logBrowser->insertPlainText(t);
ASCII characters are displayed correctly, but each non-ASCII character is displayed as a rectangular box. I've stepped through the code in a debugger: p points to a valid wchar_t string, and the constructed QString t matches it. The problem appears only when the text is printed in the QTextBrowser.
How do I fix this?

First of all, read the documentation for QString::fromWCharArray: depending on the system, the input is interpreted as either UCS-4 or UTF-16. What is the size of wchar_t on your platform?
Secondly, there is an alternative API: try QString::fromUtf16.
Finally, what kind of characters are you using (Hebrew, Cyrillic, Japanese, ...)? Are you sure the font you are using supports those characters?
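The point about the size of wchar_t can be made concrete without Qt. The following is a minimal sketch in standard C++, assuming the input is either UTF-16 (2-byte wchar_t) or UCS-4 (4-byte wchar_t); the helper name toUtf16 is hypothetical and not part of Qt. It produces UTF-16 code units, which is the form QString::fromUtf16 expects (possibly with a cast, depending on the Qt version).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: copy a wchar_t string into UTF-16 code units,
// whatever the size of wchar_t is on the current platform.
std::u16string toUtf16(const std::wstring& in)
{
    std::u16string out;
    for (wchar_t wc : in) {
        std::uint32_t cp = static_cast<std::uint32_t>(wc);
        if (sizeof(wchar_t) == 2 || cp < 0x10000) {
            // A 2-byte wchar_t already holds a UTF-16 code unit,
            // and any BMP code point fits in a single unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // 4-byte wchar_t holding a supplementary code point:
            // split it into a high/low surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

If the converted text still renders as boxes, the data is probably fine and the font is the more likely culprit.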

Related

Convert utf8 wstring to string on windows in C++

I am representing folder paths with boost::filesystem::path, which uses wstring on Windows, and I would like to convert a path to std::string with the following method:
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv1;
shared_dir = conv1.to_bytes(temp.wstring());
but unfortunately the result for the following text is this:
"c:\git\myproject\bin\árvíztűrőtükörfúrógép" ->
"c:\git\myproject\bin\árvíztűrőtükörfúrógép"
What am I doing wrong?
#include <string>
#include <locale>
#include <codecvt>

int main()
{
    // wide character data
    std::wstring wstr = L"árvíztűrőtükörfúrógép";
    // wide to UTF-8
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv1;
    std::string str = conv1.to_bytes(wstr);
}
I was checking the value of the variable in visual studio debug mode.
The code is fine.
You're taking a wstring that stores UTF-16 encoded data, and creating a string that stores UTF-8 encoded data.
"I was checking the value of the variable in visual studio debug mode."
Visual Studio's debugger has no idea that your string stores UTF-8. A string just contains bytes. Only you (and people reading your documentation!) know that you put UTF-8 data inside it. You could have put something else inside it.
So, in the absence of anything more sensible to do, the debugger just renders the string as ASCII*. What you're seeing is the ASCII* representation of the bytes in your string.
Nothing is wrong here.
If you were to output the string like std::cout << str, and if you were running the program in a command line window set to UTF-8, you'd get your expected result. Furthermore, if you inspect the individual bytes in your string, you'll see that they are encoded correctly and hold your desired values.
You can push the IDE to decode the string as UTF-8, though, on an as-needed basis: in the Watch window type str,s8; or, in the Command window, type ? &str[0],s8. These techniques are explored by Giovanni Dicanio in his article "What's Wrong with My UTF-8 Strings in Visual Studio?".
* It's not even really ASCII; it'll be some 8-bit encoding decided by your system, most likely the code page Windows-1252 given the platform. ASCII only defines the lower 7 bits. Historically, the various 8-bit code pages have been colloquially (if incorrectly) called "extended ASCII" in various settings. But the point is that the multi-byte nature of the data is not at all considered by the component rendering the string to your screen, let alone specifically its UTF-8-ness.
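To see both halves of this argument, one can inspect the bytes directly and then mimic a viewer that treats every byte as one character. The sketch below is illustrative only: hexBytes and asSingleByteView are hypothetical names, and the second function assumes ISO-8859-1 (Latin-1) rather than Windows-1252, which differs from it only in the range 0x80 to 0x9F.

```cpp
#include <cstdio>
#include <string>

// Render each byte of a string as two hex digits, separated by spaces.
std::string hexBytes(const std::string& s)
{
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

// Treat every byte as one Latin-1 character and re-encode the result as
// UTF-8, mimicking a viewer that assumes a single-byte encoding.
std::string asSingleByteView(const std::string& s)
{
    std::string out;
    for (unsigned char c : s) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

For the UTF-8 encoding of "á" (bytes c3 a1), hexBytes returns "c3 a1", which is the correct encoding, while asSingleByteView yields the two characters "Ã¡", which is the kind of output a byte-per-character viewer shows.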

Converting UTF16(Windows wchar_t) to UTF8 in C++ Non-English letters corrupted(Korean)

I'm trying to make a multiplatform app. On the Windows Store App (WinRT) side, I open a file and read its path as a Platform::String, which is based on wchar_t (UTF-16 on Windows).
Since my core logic is platform-independent and only uses standard C++ data types, I've converted the path into a UTF-8 std::string via this code:
Platform::String^ copyPath = copy->Path;
std::wstring source(copyPath->Data());
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t >, wchar_t > convert;
std::string u8CopyPath = convert.to_bytes(source);
However, when I check u8CopyPath in the debugger, it shows corrupted letters for non-English characters. As far as I know, UTF-8 is perfectly capable of encoding non-English languages, since it can use multiple bytes for a single letter. Is there something in the conversion that corrupts the non-English letters?
It turns out it's just a debugger thing. Once I wrote the string to a file and examined it, it was printed correctly.

Convert wide CString to char*

This question has been asked many times, with as many answers, none of which work for me or, it seems, for many others. The question is about wide CStrings and 8-bit chars under MFC. We all want an answer that will work in ALL cases, not a specific instance.
void Dosomething(CString csFileName)
{
char cLocFileNamestr[1024];
char cIntFileNamestr[1024];
// Convert from whatever version of CString is supplied
// to an 8 bit char string
cIntFileNamestr = ConvertCStochar(csFileName);
sprintf_s(cLocFileNamestr, "%s_%s", cIntFileNamestr, "pling.txt" );
m_KFile = fopen(cLocFileNamestr, "wt");
}
This is an addition to existing code (by somebody else) for debugging.
I don't want to change the function signature, it is used in many places.
I cannot change the signature of sprintf_s, it is a library function.
You are leaving out a lot of details, or ignoring them. If you are building with UNICODE defined (which it seems you are), then the easiest way to convert to MBCS is like this:
CStringA strAIntFileNameStr = csFileName.GetString(); // uses default code page
CStringA is the 8-bit/MBCS version of CString.
However, the result will contain garbage characters if the Unicode string you are converting contains characters that are not in the default code page.
Instead of using fopen(), you could use _wfopen(), which opens a file with a Unicode filename. To build the file name, you would use swprintf_s().
"an answer that will work in ALL cases, not a specific instance..."
There is no such thing.
It's easy to convert "ABCD..." from wchar_t* to char*, but it doesn't work that way with non-Latin languages.
Stick to CString and wchar_t when your project is Unicode.
If you need to upload data to a web page or something similar, then use CW2A and CA2W to convert between UTF-16 and UTF-8.
CStringW unicode = L"Россия";
MessageBoxW(0, unicode, L"Russian", 0); // should be okay

// Convert UTF-16 to UTF-8
CStringA utf8 = CW2A(unicode, CP_UTF8);
::MessageBoxA(0, utf8, "format error", 0); // garbled: the ANSI WinAPI doesn't expect UTF-8

char buf[1024];
strcpy(buf, utf8);
::MessageBoxA(0, buf, "format error", 0); // same problem

// buf can be sent to a web page or other UTF-8 systems.
// Written to a file, it is compatible with Notepad etc. and the text will appear correctly.
std::ofstream f(L"c:\\stuff\\okay.txt"); // needs <fstream>; the wide-path constructor is an MSVC extension
f.write(buf, strlen(buf));

// Convert UTF-8 back to UTF-16
unicode = CA2W(buf, CP_UTF8);
::MessageBoxW(0, unicode, L"okay", 0);
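For readers without ATL at hand, here is a rough, portable illustration of what a UTF-16 to UTF-8 conversion such as CW2A(unicode, CP_UTF8) has to do. It is a hand-written sketch, not the ATL or Win32 implementation; the name utf16ToUtf8 is hypothetical, and replacing unpaired surrogates with U+FFFD is an assumption made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encode UTF-16 code units as UTF-8 bytes.
// Unpaired surrogates are replaced with U+FFFD (an assumption of this sketch).
std::string utf16ToUtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high/low surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; // unpaired surrogate
        }
        if (cp < 0x80) {                       // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Each of the six Cyrillic letters in "Россия" takes two bytes in UTF-8, so the result is 12 bytes long. That is why an API expecting one byte per character in the ANSI code page shows it as garbage.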

Why a Windows console with Chinese code page set can show a UTF-16 encoded character?

Per MSDN:
"For the Microsoft C/C++ compiler, the source and execution character sets are both ASCII."
C++03
2.1 Phases of translation
"..Any source file character not in the basic source character set
(2.2) is replaced by the universal-character-name that designates that
character. (An implementation may use any internal encoding, so long
as an actual extended character encountered in the source file, and
the same extended character expressed in the source file as a
universal-character-name (i.e. using the \uXXXX notation), are handled
equivalently.)"
2.13.2 Character literals
"A universal-character-name is translated to the encoding, in the
execution character set, of the character named. If there is no such
encoding, the universal-character-name is translated to an
implementation-defined encoding."
To test which execution character set is used by MSVC++, I wrote the following code:
wchar_t *str = L"中";
unsigned char *p = reinterpret_cast<unsigned char*>(str);
for (int i = 0; i < sizeof(L"中"); ++i)
{
printf ("%x ", *(p + i));
}
The output is 2d 4e 0 0, and 0x4e2d is the UTF-16 encoding of this Chinese character. So I conclude that MSVC uses UTF-16 as the execution character set (my version: 2012 4.5.50709).
Next, I tried to print this character to a Windows console. Since the default locale is "C", I set the locale to code page 936, which represents simplified Chinese characters.
// use the execution environment locale setting, which is 936
wchar_t *str = L"中";
char* locale = setlocale(LC_ALL, "");
wprintf (L"%ls\n", str);
Which outputs:
中
What I'm curious about is: how can a character encoded in UTF-16 be decoded by a Windows console whose locale (decoder) is set to a non-UTF-16 encoding (MS code page 936)? How can that happen?
"how can a character encoded in UTF-16 be decoded by a Windows console whose locale(decoder) is set to non-UTF-16"
There are two ways you can write text to the console. The byte way, using the Win32 API WriteConsoleA, gives you characters from bytes interpreted using the console's code page ("ANSI"). The Unicode way, WriteConsoleW, receives a UTF-16LE string and writes the characters to the console directly without having to worry about what code page it is using.
The stdio function printf uses WriteConsoleA when the output is an interactive console. The wprintf function, from VS 2005 on at least, calls WriteConsoleW.
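The "LE" in UTF-16LE also explains the byte dump in the question: the code unit 0x4E2D is stored low byte first, giving 2d 4e, and the trailing 0 0 is the terminating null character that sizeof includes. A small sketch that serializes code units in little-endian order regardless of the host byte order (the name utf16leBytes is hypothetical):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Serialize UTF-16 code units as little-endian bytes,
// independent of the byte order of the machine running the code.
std::vector<std::uint8_t> utf16leBytes(const std::u16string& s)
{
    std::vector<std::uint8_t> out;
    for (char16_t u : s) {
        out.push_back(static_cast<std::uint8_t>(u & 0xFF));        // low byte first
        out.push_back(static_cast<std::uint8_t>((u >> 8) & 0xFF)); // then high byte
    }
    return out;
}
```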
I think I get it.
In Microsoft C++ 2008 (probably 2005+), CRT facilities such as wprintf and wcout are implemented so that, under the hood, they convert a wide string such as L"中", encoded in UTF-16, to match the current locale/code page setting. So what happens here is that L"中" is converted to the bytes D6 D0 in code page 936 for simplified Chinese.
I was wrong that setlocale sets the console code page. It only sets the current program's code page, which the CRT functions use during the conversion. To change the console code page, use the chcp command or the Win32 API SetConsoleOutputCP().
Since my console's default code page is 936, that character is shown correctly without any problem.

String encoding VB6 / C++ dll

I am having a problem with some characters in 2 strings that my program uses.
String #1 is filled using VB code that gets data from a 3rd party application.
String #2 gets similar data from the same 3rd party application, but it gets it with a C++ dll and sends it to VB.
The data has some weird symbols in it.
I don't know a whole lot about encoding and different character sets, but I'll try to explain it the best I can.
I will use "Т" as my example character.
"Т" (note this isn't a normal capital T) is Unicode decimal value 1058
http://www.unicodemap.org/details/0x0422/index.html
When this character appears in String #1 during runtime it appears as "?", which I believe is just what VB6 does to show some unicode characters. When I use AscW on the character it returns the correct value of 1058.
When I output the string to a text file, it appears as "?".
The same character in String #2 from the C++ DLL appears as 2 characters "Т"
When I output that string to a text file, the character appears properly as "Т".
I was only outputting things to text files for testing purposes. I only need the 2 strings to be encoded / appear the same during run time.
Any idea what's going on here? Is there any way for me to get weird characters to appear the same in both strings?
Thanks
edit: also, the C++ DLL is built with the multi-byte character set and sends the data in a BSTR string
CODE IN C++ DLL
allChat is a CString
BSTR Message;
int len = allChat.GetLength();
Message = SysAllocStringByteLen ((LPCTSTR)allChat,len+1);
Message is returned to the VB app, and nothing happens to the string after that.
String #1 is just a regular VB string
From the way Cyrillic "T" becomes "Т", it looks like you get your string as UTF-8 (I verified that with Notepad++ by switching encodings). You need to convert it to Unicode (UTF-16) before sending it to your VB app. Note that your VB app needs to be Unicode, not ASCII.
You can convert UTF-8 to std::wstring with this function:
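#include <windows.h>
#include <string>
#include <vector>

std::wstring utf8to16(const char* src)
{
    // First call: ask for the required length in wchar_t units, including the terminating null.
    int len = MultiByteToWideChar(CP_UTF8, 0, src, -1, nullptr, 0);
    if (len <= 0)
        return std::wstring(); // conversion failed
    std::vector<wchar_t> buffer(len);
    MultiByteToWideChar(CP_UTF8, 0, src, -1, &buffer[0], len);
    return &buffer[0]; // null-terminated because the input length was given as -1
}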
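To check by hand that a pair of bytes really is the UTF-8 form of the expected character, the bytes can be decoded back to a code point. The sketch below is a simplified illustration in standard C++, not a replacement for MultiByteToWideChar: decodeUtf8 is a hypothetical name, and the function assumes well-formed input and does not diagnose malformed sequences.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode well-formed UTF-8 into Unicode code points.
// Malformed input is not diagnosed; this is an illustration, not a validator.
std::vector<std::uint32_t> decodeUtf8(const std::string& s)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::uint32_t cp;
        std::size_t extra; // number of continuation bytes that follow the lead byte
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (; extra > 0 && i < s.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}
```

The two bytes D0 A2 decode to the single code point U+0422, decimal 1058, which matches the value AscW reports for String #1.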