QML word wrap with nbsp on Linux - C++

I have the following problem:
When I build my application on Windows, QML texts wrap correctly with respect to the non-breaking space character (U+00A0, I believe). On my Raspberry Pi with Raspbian, however, the nbsp seems to be ignored and the text is wrapped as if it were a normal space.
There are several things that may have some importance here:
On Windows I have Qt 5.4, whereas the Raspberry Pi has Qt 5.2.
I think it may have something to do with encoding. I remember it worked before I forced the g++ compiler on the Pi to treat the input files as CP1250 (I added QMAKE_CXXFLAGS += -finput-charset=CP1250 to the project file). I had to make this tweak because of the diacritics in some of the string literals (otherwise those texts are completely broken on the Raspberry). So, as I said, I think word wrap worked before I changed this compiler switch.
Still, nothing else displays incorrectly; the only problem is that the texts are broken where they shouldn't be. Note that there is no "random" character in place of the nbsp, just a regular-looking space. That is strange, because it suggests the problem is not with encoding but with the word-wrapping algorithm itself. Yet, as I said, it used to work when the compiler assumed the string literals were in whatever the default on Linux is (UTF-8, I guess).
As for the QML side, these strings are taken from a C array and assigned to the QML Text item using QObject::setProperty, if that is of any importance.
Also note that I probably cannot change the encoding of my sources to UTF-8, because the file with the strings is shared with an embedded project on the other side of the communication link, and that one has to stay CP1250 because of its IDE.
Thanks in advance
EDIT:
I have some additional information: if I step through one of the affected string literals on Windows, it is in fact shorter than the same literal compiled on the Raspberry, even when the source encoding is set to CP1250. For example, the nbsp is encoded as a single byte on Windows (160 decimal) but as two bytes on the Raspberry (194, 160 decimal). That's strange, isn't it? I'd expect that after telling g++ the source code is encoded in CP1250, it would encode the literals the same way. Or maybe not, because that is the in-memory encoding of the string, which differs by default between Windows and Linux. Still, I don't see where the problem is.
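For reference, a small check along these lines (the literal is hypothetical, not one of the shared strings) shows the bytes each compiler actually stores and explains the difference observed above: MSVC keeps the literal in the ANSI execution character set, while g++ re-encodes the CP1250 input into its default UTF-8 execution charset.

#include <cstdio>

// Hypothetical literal: \u00A0 is the nbsp. The compiler converts it to its
// execution character set, exactly as it does for a real nbsp typed into a
// CP1250-encoded source file.
static const char *label = "10\u00A0kg";

int main() {
    // Dump the raw bytes the compiler stored for the literal.
    // MSVC (ANSI execution charset, CP1250):            49 48 160 107 103
    // g++ with -finput-charset=CP1250 (UTF-8 exec set): 49 48 194 160 107 103
    for (const unsigned char *p = (const unsigned char *)label; *p; ++p)
        std::printf("%u ", *p);
    std::printf("\n");
    return 0;
}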

As suggested by Kevin Krammer,
QString::fromLocal8Bit()
was the solution.
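A minimal sketch of that fix (function and parameter names are illustrative, not from the project): decode the 8-bit literal with the locale's codec before handing it to QML, so the nbsp arrives as a proper U+00A0 code point.

#include <QObject>
#include <QString>

void setQmlText(QObject *qmlTextItem, const char *rawLiteral)
{
    // On the Pi (UTF-8 locale and UTF-8 execution charset) and on Windows
    // (CP1250 local 8-bit encoding) this picks the matching codec either way,
    // so word wrap sees a real non-breaking space.
    const QString decoded = QString::fromLocal8Bit(rawLiteral);
    qmlTextItem->setProperty("text", decoded);
}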

Related

Why are ANSI escape codes not displaying properly?

I am implementing a Python console application, which uses ANSI escape codes to colorize various things. I develop on Pop OS (an Ubuntu derivative), and colorization works as designed.
I just tried the app on a Centos machine, and while the colors come out correctly, there is additional text (tiny boxes containing digits, stacked vertically), surrounding the colorized text, that apparently corresponds to the escape codes.
The escape codes are all specified in this bit of Python:
style = ('\033[1m\033[3m' if bold and italic else
         '\033[1m' if bold else
         '\033[3m' if italic else
         '\033[0m')
return f'\001{style}\002\001\033[38;5;{color.code}m\002{s}\001\033[0m\002'
(The project I'm working on is https://github.com/geophile/marcel, and the above code comes from marcel.util.colorize().)
What's really odd is that in some cases the extra characters aren't there, and in other cases they are. Also, if I ssh from my Pop OS machine to my CentOS machine, the text is colorized correctly in all cases.
What explains this difference in behavior -- something in .bashrc? Something about X configuration?
That \002 is not an "ANSI" escape code. Some programs (not terminals) may interpret it, and depending on how the string is used, that may bypass the programs which were intended to process the extra escapes. (Some terminals may of course provide their own interpretation for \002, etc., but you're unlikely to find that documented anywhere except in their source code.)

C++: Qt 5.3 fails to display UTF-8 character

I am trying to display a unicode character (Euro sign) on a button using Qt and C++ in Visual Studio 2013. I tried the following code:
_rotateLeftButton->setText("\u20AC");
and
_rotateLeftButton->setText("€");
and
_rotateLeftButton->setText(QString::fromUtf8("\u20AC"));
and
_rotateLeftButton->setText(QString::fromUtf8("€"));
However, none of those lines displays the Euro sign correctly on the button.
All my code files are UTF-8 encoded, except for the moc files (.cxx); for whatever reason the moc executable does not generate them as UTF-8. Still, I was not able to get this Unicode symbol displayed correctly. I also tried setting a font other than the default one, without success. Does anyone know what the problem could be?
Thank you for your help.
QString::fromUtf8("€")
will work if the file really is handled as UTF-8. As @n.m. commented, VS requires some help from a faux-BOM to ensure this.
QString::fromUtf8("\u20AC")
A \u escape in a narrow (byte) string literal is encoded in the compiler's execution character set, which here is not UTF-8, so fromUtf8 won't see the bytes it expects. You could spell the UTF-8 encoding explicitly with \x byte escapes:
QString::fromUtf8("\xE2\x82\xAC")
Or use a wide string literal:
QString::fromWCharArray(L"\u20AC")
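Putting it together, a minimal sketch (using a plain QPushButton in place of the question's _rotateLeftButton) that avoids relying on the source file's encoding entirely:

#include <QApplication>
#include <QPushButton>
#include <QString>

int main(int argc, char *argv[])
{
    QApplication app(argc, argv);
    QPushButton button;
    // Either spelling yields the Euro sign regardless of the file's encoding.
    button.setText(QString::fromUtf8("\xE2\x82\xAC"));
    // button.setText(QString::fromWCharArray(L"\u20AC"));
    button.show();
    return app.exec();
}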

Unicode, UTF-8, UTF-16 and UTF-32 questions [closed]

I have read a lot about Unicode, ASCII, code pages, all the history, the invention of UTF-8, UTF-16 (UCS-2) and UTF-32 (UCS-4), who uses them, and so on, but I still have some questions that I tried hard to find answers to without success, and I hope you can help me.
1 - Unicode is a standard for encoding characters; it assigns a code point to each character, something like U+0000 (for example). Imagine I have a file containing those code points (\u0000): at which point of my application am I going to use them?
This might be a silly question, but I really don't know at which point of my application I'm going to use it.
I'm creating an application that can read a file containing those code points written with the \u escape, and I know that I can read and decode it, but now comes the next question.
2 - To which character set (code page) do I need to convert it? I saw some C++ libraries that use names like utf8_to_unicode or utf8_to_utf16, and also just utf8_decode, and this is what confuses me.
I don't know whether answers like this will appear, but some might say: you need to convert it into the code pages that you are going to use. But what if my application needs to be internationalized?
3 - I was wondering: in C++, if I try to display non-ASCII characters on the terminal, I get some garbled output. The question is: is it the fonts that determine how the characters are displayed?
#include <iostream>

int main()
{
    std::cout << "ö" << std::endl;
    return 0;
}
The output (Windows):
├Â
4 - In which part of that process does the encoding come in? Does it encode, take the code point and try to find the matching glyph in the font?
5 - WebKit is an engine for rendering web pages in web browsers. If you specify the charset as UTF-8 it works nicely with all characters, but if I specify another charset it doesn't, no matter which font I use. What happens?
<html>
  <head>
    <meta charset="iso-8859-1">
  </head>
  <body>
    <p>ö</p>
  </body>
</html>
The output:
Ã¶
Works using:
<meta charset="utf-8">
6 - Imagine now that I read the file, I decode it, and I have all the code points, and I need to save the file again. Do I need to save it with the escapes (\u0000), or do I need to convert back into characters first and then save?
7 - Why is the word "unicode" a bit overloaded and sometimes understood to mean UTF-16? (source)
That's all for now. Thanks in advance.
I'm creating an application that can read a file containing those code points written with the \u escape, and I know that I can read and decode it, but now comes the next question.
If you're writing a program that processes some kind of custom escapes, such as \uXXXX, it's entirely up to you when to convert these escapes into Unicode code points.
To which character set (code page) do I need to convert it?
That depends on what you want to do. If you're using some other library that requires a specific code page then it's up to you to convert data from one encoding into the encoding required by that library. If you don't have any hard requirements imposed by such third party libraries then there may be no reason to do any conversion.
I was wondering: in C++, if I try to display non-ASCII characters on the terminal, I get some garbled output.
This is because various layers of the technology stack use different encodings. From the sample output you give, "├Â", I can see what's happening: your compiler is encoding the string literal as UTF-8, but the console is using Windows codepage 850. Normally, when there are encoding problems with the console, you can fix them by setting the console output codepage to the correct value; unfortunately, passing UTF-8 through std::cout currently has some unique problems. Using printf instead worked for me in VS2012:
#include <cstdio>
#include <Windows.h>

int main() {
    SetConsoleOutputCP(CP_UTF8);
    std::printf("%s\n", "ö");
}
Hopefully Microsoft fixes the C++ libraries if they haven't already done so in VS 14.
In which part of that process does the encoding come in? Does it encode, take the code point and try to find the matching glyph in the font?
Bytes of data are meaningless unless you know the encoding. So the encoding matters in all parts of the process.
I don't understand the second question here.
If you specify the charset as UTF-8 it works nicely with all characters, but if I specify another charset it doesn't, no matter which font I use. What happens?
What's going on here is that when you write charset="iso-8859-1" you also have to actually convert the document to that encoding. You're not doing that; you're leaving the document encoded as UTF-8.
As a little exercise, say I have a file that contains the following two bytes:
0xC3 0xB6
Using information on UTF-8 encoding and decoding, what codepoint do the bytes decode to?
Now using this 8859-1 codepage, what do the same bytes decode to?
As another exercise, save two copies of your HTML document, one using charset="iso-8859-1" and one with charset="utf-8". Now use a hex editor and examine the contents of both files.
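For the first of these exercises, a small sketch of the decoding arithmetic (just the two-byte UTF-8 case, which is enough for these bytes):

#include <cstdio>

int main() {
    const unsigned char bytes[] = { 0xC3, 0xB6 };

    // UTF-8: a 110xxxxx 10xxxxxx pair combines into one code point.
    unsigned cp = ((bytes[0] & 0x1F) << 6) | (bytes[1] & 0x3F);
    std::printf("UTF-8:      U+%04X\n", cp);                    // U+00F6, "ö"

    // ISO 8859-1: every byte is its own code point.
    std::printf("ISO 8859-1: U+%04X U+%04X\n",
                (unsigned)bytes[0], (unsigned)bytes[1]);        // "Ã" and "¶"
    return 0;
}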
Imagine now that I read the file, I decode it, and I have all the code points, and I need to save the file again. Do I need to save it with the escapes (\u0000), or do I need to convert back into characters first and then save?
This depends on the program that will need to read the file. If the program expects all non-ASCII characters to be escaped like that then you have to save the file that way. But escaping characters with \u is not a normal thing to do. I only see this done in a few places, such as JSON data and C++ source code.
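If you did need to produce such escapes yourself (for JSON, say), it is only a formatting step. A rough sketch for code points in the Basic Multilingual Plane:

#include <cstdio>

// Write a code point either as plain ASCII or as a JSON-style \uXXXX escape.
void write_escaped(unsigned cp) {
    if (cp < 0x80)
        std::printf("%c", (char)cp);
    else if (cp <= 0xFFFF)
        std::printf("\\u%04X", cp);
    // Code points above U+FFFF would need a surrogate pair in JSON.
}

int main() {
    write_escaped('A');     // prints: A
    write_escaped(0x00F6);  // prints: \u00F6
    std::printf("\n");
    return 0;
}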
Why the word "unicode" is a bit overloaded and is sometimes understood to mean utf-16?
Largely because Microsoft uses the term this way. They do so for historical reasons: When they added Unicode support they named all their options and setting "Unicode" but the only encoding they supported was UTF-16.

what locale does wstring support?

In my program I used wstring to print out the text I needed, but it gave me random garbled characters (presumably due to a different encoding scheme). For example, I have this block of code.
wstring text;
text.append(L"Some text");
Then I use DirectX to render it on screen. I used to use wchar_t, but I heard it has portability problems, so I switched to wstring. wchar_t worked fine, but from what I can tell it only took English characters (the printout simply ignored any non-English character entered), which was fine, until I switched to wstring: then I only got random characters that looked like Chinese and Korean mixed together. Interestingly, my computer's locale for non-Unicode text is Chinese. Based on what I saw, I suspected it would render Chinese characters correctly, so I tried that, and it does display the character correctly, but with a square in front (which is still a kind of incorrect display).
I then guessed that the encoding might depend on the language locale, so I switched the locale to English (US) (I use Win8). After a restart I saw that the Chinese test character in my source file had turned into random garbage (my file is not saved in a Unicode format, since all the texts are English), so I tried with English characters, but no luck: the display looked exactly the same and seemed to have nothing to do with the locale. I don't understand why it doesn't display correctly and looks like Asian characters (even with the English locale).
Is there some conversion I should be doing, or should I save my file in a different encoding format? The real problem is that I want to display English characters correctly, which should be the default.
In the absence of code that demonstrates your problem, I will give you a correspondingly general answer.
You are trying to display English characters but see Chinese characters. That is what happens when you pass 8-bit ANSI text to an API that expects UTF-16 text. Look for somewhere in your program where you cast from char* to wchar_t*.
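A sketch of that failure mode (names are made up): the cast merely relabels the bytes, so two 8-bit characters get read as one 16-bit code unit, which is why the output looks like CJK text.

#include <string>

int main() {
    const char *ansi = "Hello";                    // 8-bit text
    // Wrong: no conversion happens, the bytes are just reinterpreted.
    const wchar_t *bogus = reinterpret_cast<const wchar_t *>(ansi);
    // Right: actually widen/convert (this simple form is fine for plain ASCII).
    std::wstring converted(ansi, ansi + 5);
    (void)bogus;
    (void)converted;
    return 0;
}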
First of all, what type of file are you trying to store the text in? Normal .txt files are ANSI by default (so is Excel). So if you try to print a Unicode character to an ANSI file, it will print junk. There are two ways of overcoming this problem (see the sketch after this answer):
open the file in UTF-8 or UTF-16 mode and then write, or
convert the Unicode string to ANSI before writing to the file. If you are using Windows, the Win32 API provides functions (documented on MSDN) to do Unicode-to-ANSI conversion and vice versa. If you are using Linux, search for a Unicode-to-ANSI conversion routine; there are plenty of solutions out there.
Hope this helps!!!
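A sketch of that Windows conversion, using the Win32 functions WideCharToMultiByte and MultiByteToWideChar (error handling omitted):

#include <windows.h>
#include <string>

std::string toAnsi(const std::wstring &wide) {
    // First call asks for the required buffer size (including the terminator).
    int len = WideCharToMultiByte(CP_ACP, 0, wide.c_str(), -1,
                                  nullptr, 0, nullptr, nullptr);
    std::string out(len, '\0');
    WideCharToMultiByte(CP_ACP, 0, wide.c_str(), -1,
                        &out[0], len, nullptr, nullptr);
    out.resize(len - 1);              // drop the trailing '\0'
    return out;
}

std::wstring toWide(const std::string &ansi) {
    int len = MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), -1, nullptr, 0);
    std::wstring out(len, L'\0');
    MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), -1, &out[0], len);
    out.resize(len - 1);
    return out;
}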
std::wstring does not have any locale/internationalisation support at all. It is just a container for storing sequences of wchar_t.
The problem with wchar_t is that its encoding is unspecified. It might be Unicode UTF-16, or Unicode UTF-32, or Shift-JIS, or something completely different. There is no way to tell from within a program.
You will have the best chances of getting things to work if you ensure that the encoding of your source code is the same as the encoding used by the locale under which the program will run.
But, the use of third-party libraries (like DirectX) can place additional constraints due to possible limitations in what encodings those libraries expect and support.
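A quick illustration of the point above about wchar_t: even its width differs between platforms, let alone its encoding.

#include <cstdio>

int main() {
    // Typically 2 bytes on Windows (UTF-16 code units) and 4 on Linux (UTF-32),
    // but the C++ standard guarantees neither.
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    return 0;
}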
Bug solved; it turned out to be a casting problem (not a rendering problem as previously stated).
The garbled text is an intermediate product of some internal conversion using wstringstream (which I forgot to mention); the code is as follows:
wstringstream wss;
wstring textToGenerate;
textToGenerate.append(L"some text");
wss << timer->getTime();
textToGenerate.append(wss.str());
Right after this step the debugger shows the text as a bunch of random characters, but later it somehow converts back so that it's readable. The real problem appears at the rendering stage with DirectX: I had left in a cast to wchar_t*, which resulted in the incorrect rendering.
old:
LPCWSTR lpcwstrText = (LPCWSTR)textToDraw->getText();
new:
LPCWSTR lpcwstrText = (*textToDraw->getText()).c_str();
Changing that solved the problem.
So this was caused by a bad cast, as some kind people pointed out in correcting my statement.

Encoding issue using XZIP

I wrote a C++ program that needs to zip files as part of its work. For creating these zip files I used the XZip library. During development the program ran on a Win7 machine and worked fine.
Now the program should be used on a WindowsXP machine. The issue I run into is:
If I let XZip create the zip archive "ü.zip" and add the file "ü.txt" to it, on Win7 it works as intended. On Windows XP, however, I end up with the "ü.zip" file containing "³.txt" inside it.
The "³" => "ü" thing is of course an encoding issue between UTF8 and Ascii (ü = 252 in UTF8 and 252 = ³ in Ascii), BUT I can't really imagine how this could affect the creation of the internal zip structure differently depending on the OS.
//EDIT to clear it up:
the problem is that I run a test with XZip on Win7 and get the archive "ü.zip" containing the file with name "ü.txt".
When I run that test on an XP machine I get the archive "ü.zip" containing the file "³.txt".
//Edit2:
The thing that makes me wonder is what exactly causes the zip to change between XP and Win7. The fact that it does change means that either a Windows function behaves differently or XZip has OS-specific behaviour built in.
From a quick look at XZip I can't see that it changes the encoding flag on the zip archives. This question can of course only be answered by people who have looked more closely at this exact problem before.
As a general rule, if you want any sort of portability between locales, OSs (including different versions) and what have you, you should limit your filenames to the usual 26 letters, the 10 digits, and perhaps '_' and '-' (and I'm not even sure about the latter), plus one '.', no more than three characters from the end. Once you start using letters beyond the original ASCII character set, you're at the mercy of the various programs which interpret the character set.
Also, 252 isn't anything in ASCII, since ASCII only uses character codes in the range 0...127. And in UTF-8, 252 would be the first byte of a six-byte character, something that doesn't exist in Unicode: in UTF-8, LATIN SMALL LETTER U WITH DIAERESIS is the two-byte sequence 0xC3, 0xBC. 252 is the encoding of LATIN SMALL LETTER U WITH DIAERESIS in ISO 8859-1, otherwise known as Latin-1; it's also its code-unit value in UTF-16 and UTF-32.
None of this, of course, should affect what is in the file.
Maybe you are building your Win32 program (or the library) as ANSI (not as UNICODE). It may help to build your Win32 applications with the UNICODE configuration setting (you can change it in your Visual Studio project settings).
It is impossible to say what happened in your program without seeing your code. Maybe your library or the archive format is not UNICODE-aware, maybe your program's code is not UNICODE-aware, maybe you don't handle strings carefully enough, or maybe you just have to change your project setting to UNICODE. Your Windows setting for the language for non-Unicode programs (the system ANSI code page) also matters if you don't use UNICODE strings.
As for 252, UTF-8 and ASCII, read the post by James Kanze. It is more or less safe to use ASCII file names with no ':', '?', '*', '/', '\' characters. Using non-ASCII characters may lead to encoding problems if you are not using UNICODE-based programs and file systems.
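A minimal sketch of what the UNICODE project setting changes (the strings here are placeholders): TCHAR-based code compiles against either the ANSI ("...A") or the wide ("...W") Win32 entry points, which is exactly where file names such as "ü.txt" can get re-encoded through the ANSI code page.

#include <windows.h>
#include <tchar.h>

int main() {
    // Expands to MessageBoxW (UTF-16 text) when UNICODE is defined,
    // and to MessageBoxA (current ANSI code page) when it is not.
    MessageBox(nullptr, _T("encoding test"), _T("demo"), MB_OK);
    return 0;
}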