Unicode troubles in FreeType - c++

So, I've got an implementation that parses an XML file that contains, among other things, the positions and strings of Wikipedia's main page. The parsing is done with rapidxml, after which the strings are converted from UTF-8 to UTF-32 by http://utfcpp.sourceforge.net/. The UTF-32 code point is then passed to FreeType:
unsigned long c = FT_Get_Char_Index(face,*p);
FT_Load_Glyph(face,c,FT_LOAD_RENDER);
where *p is the UTF-32 character code. This glyph is then rendered in OpenGL.
Now, I can't seem to get Cyrillic characters to work, nor any Chinese, Japanese, or Vietnamese ones. I am sure that *p corresponds to the correct code, and I would be thankful for any pointers I can get.
For these fonts Microsoft's arial.ttf is used, from the Arch Linux package, and from what I've seen in font-viewing programs, it should contain the characters that I want.

Two things to suggest:
First, have you called FT_Select_Charmap to specify you're using a Unicode encoding?
FT_Select_Charmap(face , ft_encoding_unicode);
Second, not all Arial fonts have all characters, and some font viewers (on Windows, anyway) can mislead by automatically substituting glyphs from different faces. Try ArialUni.ttf if you can find it.

Do not forget to set the font size right after loading the face.
FT_Error err = FT_Set_Pixel_Sizes(face, (width), (height));

Related

Issue with compound character font rendering in openGL using FTGL

In my application, FTGL renders Unicode characters of TrueType fonts well, except for compound characters. A compound character is a combination of a Unicode consonant and a vowel sound.
For example, in Tamil, an Indic language, the following input string
"கா கி கீ கு கூ கெ கே கை கொ கோ கௌ"
is displayed incorrectly in the OpenGL viewer: the output of some characters is not the same as the input.
Can anybody help with this?
My code snippet:
std::ifstream fontFile("latha.ttf", std::ios::binary);//tamil unicode font
if (fontFile.fail())
return NULL;
fontFile.seekg(0, std::ios::end);
std::fstream::pos_type fontFileSize = fontFile.tellg();
fontFile.seekg(0);
unsigned char *fontBuffer = new unsigned char[fontFileSize];
fontFile.read((char *)fontBuffer, fontFileSize);
FTBitmapFont* m_pFTFont = new FTBitmapFont(fontBuffer, fontFileSize);
m_pFTFont->Render("கா கி கீ கு கூ கெ கே கை கொ கோ கௌ");
Welcome to the wonderful world of "complex text layout". Enjoy your stay ;)
To put it simply, a script that uses complex text layout does not have a simple 1:1 mapping between Unicode codepoints and the glyph to be rendered. And Tamil is such a script.
FTGL only handles scripts that have simple layout, because complex text layout is complex (very much so). So it's not going to be able to reproduce Tamil script correctly.
You will need to use a layout engine like HarfBuzz to lay out your text. FTGL can still render the script, but you'll need HarfBuzz to tell you which glyphs in the font to render.
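To make the "no 1:1 mapping" point concrete, here is a toy sketch (not a shaper, and no substitute for HarfBuzz) that reorders a handful of Tamil vowel signs into the order they are drawn. The function name toy_visual_order is made up for this illustration, and the code assumes the vowel sign directly follows a single base consonant, which real text does not guarantee. It relies only on the Unicode decompositions U+0BCA = U+0BC6 + U+0BBE, U+0BCB = U+0BC7 + U+0BBE and U+0BCC = U+0BC6 + U+0BD7.

```cpp
#include <vector>

// Toy illustration only: puts left-side Tamil vowel signs before the
// preceding consonant and splits two-part vowel signs into their halves.
// Assumes each vowel sign directly follows one base consonant.
std::vector<char32_t> toy_visual_order(const std::vector<char32_t>& text) {
    std::vector<char32_t> out;
    for (char32_t cp : text) {
        char32_t left = 0, right = 0;
        switch (cp) {
            case 0x0BC6: case 0x0BC7: case 0x0BC8:  // signs drawn on the left
                left = cp;
                break;
            case 0x0BCA: left = 0x0BC6; right = 0x0BBE; break;  // two-part O
            case 0x0BCB: left = 0x0BC7; right = 0x0BBE; break;  // two-part OO
            case 0x0BCC: left = 0x0BC6; right = 0x0BD7; break;  // two-part AU
            default:
                out.push_back(cp);  // everything else keeps its position
                continue;
        }
        if (!out.empty()) {
            out.insert(out.end() - 1, left);  // draw before the consonant
        } else {
            out.push_back(left);
        }
        if (right != 0) {
            out.push_back(right);  // second half goes after the consonant
        }
    }
    return out;
}
```

For கொ (U+0B95 U+0BCA) this yields three pieces in a different order than the two input code points, which is why a renderer that draws one glyph per code point gets it wrong. A real layout engine also handles conjuncts, ligatures and font-specific substitutions that this sketch ignores.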

how to display extended ascii character in QTextEdit

All the ASCII codes greater than 127 are replaced by a diamond (�) symbol. How can I display those characters? I have an unsigned char buffer[1024] which contains values from 0 to 255.
Use the QString class's fromAscii() method. By default this will treat characters above 127 as Latin-1. To change this behavior, use the QTextCodec::setCodecForCStrings method to set the correct codec for your usage.
I believe Qt 5 may have taken out the setCodecForCStrings method.
EDIT: Adnan supplied the Qt 5 alternative to the setCodecForCStrings method; adding it to the answer for completeness.
Qt5 alternative for setCodecForCStrings is QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
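For reference, treating bytes as Latin-1 simply means that byte value N is Unicode code point U+00NN. The sketch below does that conversion by hand in plain C++ and produces UTF-8; the helper name latin1_to_utf8 is made up for this example. Within Qt itself, QString::fromLatin1 should give the same result without any manual work.

```cpp
#include <cstddef>
#include <string>

// Converts Latin-1 (ISO 8859-1) bytes to UTF-8.
// Each Latin-1 byte N maps to code point U+00NN, so bytes >= 0x80
// become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const unsigned char* data, std::size_t len) {
    std::string out;
    out.reserve(len * 2);
    for (std::size_t i = 0; i < len; ++i) {
        unsigned char c = data[i];
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));  // plain ASCII, unchanged
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // trail byte
        }
    }
    return out;
}
```

This only helps if the buffer really is Latin-1. If the bytes come from another 8-bit code page, the mapping to code points is different and a codec or lookup table for that code page is needed instead.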
This is a rabbit hole with no end. Qt does not fully support printing ASCII > 127, as it is not well defined. The current method is to use fromLocal8Bit(), which will take a char array and transform it into the "right" Unicode string (the only thing Qt supports printing).
QTextCodec::setCodecForLocale can be used to identify the character set you wish to transform from. Many codecs are supported, but for some reason IBM437 (the character set used by IBM PCs in the US for decades) is not supported, where several other codecs used by Europe, etc. are. Probably some characters in IBM437 were never assigned proper code points in Unicode, so transforming it isn't possible?
What's frustrating is that there are fonts with all 256 ascii code points, but it is simply not possible to display these in Qt as they only work with Unicode strings. There are a handful of glyphs they don't support, and it seems to grow with newer versions of Qt. Currently I know of 9, 10, 12, 13, and 173. Some of these are for obvious reasons (usually you don't want to print a carriage return glyph, though it did exist in DOS), but others used to work in Qt and now do not.
In my application, I resorted to creating a new font that has copies of the unprintable glyphs in higher unicode codepoints, and translate them before printing them on the screen. It's quite silly but Qt gave up on ascii many years ago, so it's the best option I could find.

what locale does wstring support?

In my program I used wstring to print out the text I needed, but it gave me random characters (presumably due to a different encoding scheme). For example, I have this block of code:
wstring text;
text.append(L"Some text");
Then I use DirectX to render it on screen. I used to use wchar_t, but I heard it has portability problems, so I switched to wstring. wchar_t worked fine, although as far as I can tell it only accepted English characters (the output simply ignored any non-English characters entered). That was fine until I switched to wstring: now I only get random characters that look like Chinese and Korean mixed together.
Interestingly, my computer's locale for non-Unicode text is Chinese. Based on what I saw, I suspected that it would render Chinese characters correctly, so I tried, and it does display the character correctly, but with a square in front (which is still an incorrect display). I then guessed the encoding might depend on the language locale, so I switched the locale to English (US) (I use Windows 8) and restarted. My Chinese test character in the source file had become random stuff (my file is not saved in a Unicode format, since all the text is English). Then I tried with English characters, but no luck: the display seemed exactly the same and had nothing to do with the locale. I don't understand why it doesn't display correctly and looks like Asian characters, even when I use the English locale.
Is there some conversion that should be done, or should I save my file in a different encoding? The problem is that I want to display English characters correctly, which is the default.
In the absence of code that demonstrates your problem, I will give you a correspondingly general answer.
You are trying to display English characters, but see Chinese characters. That is what happens when you pass 8 bit ANSI text to an API that receives UTF-16 text. Look for somewhere in your program where you cast from char* to wchar_t*.
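A small sketch of why the result looks Chinese: when an API reads 8-bit text as little-endian UTF-16, every pair of bytes becomes one 16-bit code unit, and pairs of ASCII letters tend to land in the CJK ideograph block (U+4E00 to U+9FFF). The function names below are made up for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Pairs up the bytes of 8-bit text the way a little-endian UTF-16 reader
// would see them if handed a char* that was cast to wchar_t*.
std::vector<char16_t> misread_as_utf16le(const std::string& narrow) {
    std::vector<char16_t> units;
    for (std::size_t i = 0; i + 1 < narrow.size(); i += 2) {
        unsigned lo = static_cast<unsigned char>(narrow[i]);
        unsigned hi = static_cast<unsigned char>(narrow[i + 1]);
        units.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    return units;
}

// True if the code unit falls in the CJK Unified Ideographs block.
bool is_cjk_ideograph(char16_t u) {
    return u >= 0x4E00 && u <= 0x9FFF;
}
```

For the bytes of "Some", the pairs 'S','o' and 'm','e' become U+6F53 and U+656D, both of which are CJK ideographs.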
First of all, what type of file are you trying to store the text in? Plain .txt files are stored as ANSI by default (so are Excel files). So when you try to write a Unicode character to an ANSI file, it will print junk. There are two ways of overcoming this problem:
Open the file in UTF-8 or UTF-16 mode and then write.
Convert Unicode to ANSI before writing to the file. If you are using Windows, MSDN documents an API for converting Unicode to ANSI and vice versa. If you are using Linux, search for Unicode-to-ANSI conversion; there are lots of solutions out there.
Hope this helps!
std::wstring does not have any locale/internationalisation support at all. It is just a container for storing sequences of wchar_t.
The problem with wchar_t is that its encoding is unspecified. It might be Unicode UTF-16, or Unicode UTF-32, or Shift-JIS, or something completely different. There is no way to tell from within a program.
You will have the best chances of getting things to work if you ensure that the encoding of your source code is the same as the encoding used by the locale under which the program will run.
But, the use of third-party libraries (like DirectX) can place additional constraints due to possible limitations in what encodings those libraries expect and support.
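The standard does not say how wchar_t is encoded, but its width at least narrows the guesses. The sketch below reports a hint based on sizeof(wchar_t) only; the function name is made up, and the returned strings describe common platform conventions, not guarantees.

```cpp
#include <string>

// Returns a hint about the likely wchar_t encoding, judged by width alone.
// This is a platform convention, not something the C++ standard promises.
const char* wchar_width_hint() {
    switch (sizeof(wchar_t)) {
        case 2:  return "16-bit wchar_t: usually UTF-16 (typical on Windows)";
        case 4:  return "32-bit wchar_t: usually UTF-32 (typical on Linux)";
        default: return "unusual wchar_t width: encoding unknown";
    }
}
```

Since DirectX is a Windows API, its wide-string functions expect 16-bit UTF-16 text, so the first case is the relevant one there.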
Bug solved: it turned out to be a casting problem (not a rendering problem, as previously said).
The garbled text is an intermediate product of an internal conversion using wstringstream (which I forgot to mention). The code is as follows:
wstringstream wss;
wstring text;
text.append(L"some text");
wss << timer->getTime();
text.append(wss.str());
Right after this process, the debugger shows the text as a bunch of random stuff, but later it somehow converts back so that it's readable. The problem, however, appears at the rendering stage using DirectX: I had left in a cast to wchar_t*, which results in the incorrect rendering.
old:
LPCWSTR lpcwstrText = (LPCWSTR)textToDraw->getText();
new:
LPCWSTR lpcwstrText = (*textToDraw->getText()).c_str();
Changing that solves the problem.
So this was caused by a bad cast, as some kind people pointed out in correcting my statement.
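A minimal sketch of the difference, using const wchar_t* in place of LPCWSTR so that it compiles outside Windows. TextItem is a hypothetical stand-in for whatever type textToDraw points to; the only assumption taken from the post is that getText() returns a pointer to a std::wstring.

```cpp
#include <string>

// Hypothetical stand-in for the poster's text object.
struct TextItem {
    std::wstring text;
    std::wstring* getText() { return &text; }
};

// Returns a pointer suitable for an API that wants a wide C string.
const wchar_t* text_for_api(TextItem& item) {
    // Wrong: (const wchar_t*)item.getText() would reinterpret the bytes of
    // the std::wstring object itself (its internal pointers and sizes) as
    // characters, which shows up as garbage on screen.
    // Right: ask the string for its character buffer.
    return item.getText()->c_str();
}
```

The returned pointer stays valid only as long as the string is neither modified nor destroyed, so it should be used right away for the draw call.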

Unicode character for superscript shows a square box: ࠚ

Using the following code to create a Unicode string:
wchar_t HELLO[20];
wsprintf(HELLO, TEXT("%c"), 0x2074);
When I display this in a Win32 control like a text box or a button, it shows up as a square box [].
How do I fix this?
I tried compiling with both Eclipse (MinGW) and Microsoft Visual C++ (2010).
Also, UNICODE is defined at the top.
Edit:
I think it might be something to do with my system, because when I visit: http://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts
some of the unicode characters don't appear.
The font you are using does not contain a glyph for that character. You will likely need to install some new fonts to overcome this deficiency.
The character you have picked out is 'SAMARITAN MODIFIER LETTER EPENTHETIC YUT' (U+081A), because decimal 2074 is hex 0x81A. Perhaps you were after 'SUPERSCRIPT FOUR' (U+2074). You need hex for that: 0x2074.
Note you changed the question to read 0x2074 but the original version read 2074. Either way, if you see a box that indicates your font is missing that glyph.
The characters you are getting from Wikipedia are expressed in hexadecimal, so your code should be:
wchar_t HELLO[20];
wsprintf(HELLO, TEXT("%c"), (wchar_t)0x2074); // or TEXT('\x2074')
If it still doesn't work, it's a font problem; if you need a pan-Unicode font, it seems that Code2000 is one of the most complete out there.
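The decimal versus hexadecimal mix-up is easy to check at compile time. The constant names here are made up for the illustration:

```cpp
// Decimal 2074 and hexadecimal 0x2074 are different code points.
constexpr wchar_t kDecimal2074 = 2074;    // what the original code passed
constexpr wchar_t kHex2074     = 0x2074;  // SUPERSCRIPT FOUR

static_assert(kDecimal2074 == 0x081A, "decimal 2074 is U+081A, not U+2074");
static_assert(kHex2074 != kDecimal2074, "the two literals differ");
```

Unicode charts and Wikipedia always give code points in hexadecimal, so the 0x prefix (or a \x escape) is needed when copying them into source code.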
Fun fact: the character that has the decimal code 2074 (i.e. hex 81A) seems to actually be a box (or it's such a strange beast that even the image outline at FileFormat.Info is wrong). :)
For the curious: it turns out that U+081A is this thing: ࠚ

Problem rendering non-English unicode text using freetype font on OpenGL

I am currently following NeHe tutorial lesson 43 ( http://nehe.gamedev.net/data/lessons/lesson.asp?lesson=43). The code works satisfactorily only for English text, not for other Unicode languages. Fortunately, I followed a link from NeHe lesson 43 to http://www.cs.northwestern.edu/~sco590/fonts_tutorial.html and found another, nearly identical tutorial sample with only one difference: it uses wchar_t, and the site claims that it can handle languages other than English.
So I gave it a try:
freetype::print(our_font, 320, 200, (unsigned short*)L"Active FreeType Text หกโด้กี่ดุ öáæé おはよ。- %7.2f", cnt1);
The print function of namespace freetype takes a const unsigned short* as its 4th argument, so I cast the string to that type. I also put an L in front of the double-quoted string to make it a wide-character string, and put in some Asian characters for testing purposes.
The result is that all the English text is displayed just fine, but all the Thai characters become "[]B[]I[]5H[]8", where [] is a square box. From what I understand, this implies that the font does not have glyphs for the specified language, so I tried other fonts, but all the other Thai fonts give the same square boxes. For the Japanese font it is the same: all boxes, along with some English characters next to them. The substring öáæé is rendered just fine.
Am I forgetting something here? How can we display non-English Unicode text here?
Fortunately, the author has uploaded a modified version of his tutorial to his website (given in the question), and it uses wchar_t (in the original version, the print function takes a const unsigned short* argument), which allows non-English languages.
It looks like print() in lesson 43 is nowhere near Unicode capable. All NeHe is doing is creating 256 display lists for the first 256 character codes; it does not accept a UTF-8 string and convert it to UTF-32 for FreeType.
Transliterating this into C++ has worked quite well for me.
Also, grab a copy of the GNU Unifont to make sure you have glyphs for all of the BMP.
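For readers who want to see what "converting UTF-8 to UTF-32" involves before handing code points to FT_Get_Char_Index, here is a minimal decoder sketch. The name utf8_to_utf32 is made up; the code replaces malformed bytes with U+FFFD but does not reject overlong encodings or surrogates, so a tested library such as utfcpp is the safer choice in real code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decodes UTF-8 into UTF-32 code points. Malformed input yields U+FFFD.
// Simplified: overlong forms and surrogate code points are not rejected.
std::vector<char32_t> utf8_to_utf32(const std::string& in) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte

        if (extra > in.size() - i - 1) {  // sequence cut off at end of input
            out.push_back(0xFFFD);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (b & 0x3F);  // append 6 payload bits
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Each resulting code point can then be looked up in the font individually, with the caveat that FT_Get_Char_Index returning 0 means the face has no glyph for that code point.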