Convert Glyph indices to Unicode character - c++

I am working on a printer driver sample. In this sample i am hooking to DrvTextOut() call. when the call back is called i get the text as Glyph indices. i want to convert these Glyph indices to Unicode character.
please let me know how to convert it.

In general the answer is "you cannot". In PDFs, for instance, you might have an embedded character map that lets you look up the characters corresponding to the glyphs (e.g. if you used the cmap package with pdfLaTeX to make the the PDF), but glyphs are private to a font, and there may be many glyphs that get used for the same character and vice versa, thanks to the magic of the GSUB tables.
If you're really desperate and have access to the font in question, you could try to build a character map yourself from the font file, but you better know which font you are currently looking at.
Edit: I think your question is tagged poorly; are you referring to this function? Perhaps the FONTOBJ structure that you already own exposes some sort of character map of the font, I wouldn't know.

If you mean you need to handle the case where STROBJ->flAccel has SO_GLYPHINDEX_TEXTOUT set, then see this answer here from Microsoft's Bobby Mattappally:
http://www.winvistatips.com/glyph-handles-drvtextout-t183048.html
There is not always a 1:1 mapping
between glyph indices and character
codes and vice versa. This is
expecially true with international
character sets like Hebrew, Arabic
etc.
So once you get this as
SO_GLYPHINDEX_TEXTOUT, your driver
should handle the glyph itself instead
of trying to convert it back to
Unicode.

Related

Convert \xc3\xd8\xe8\xa7\xc3\xb4\xd to human readable format

I am having trouble converting '\xc3\xd8\xe8\xa7\xc3\xb4\xd' (which is a Thai text) to a readable format. I get this value from a smart card, and it basically was working for Windows but not in Linux.
If I print in my Python console, I get:
����ô
I tried to follow some google hints but I am unable to accomplish my goal.
Any suggestion is appreciated.
Your text does not seem to be a Unicode text. Instead, it looks like it is in one of Thai encodings. Hence, you must know the encoding before printing the text.
For example, if we assume your data is encoded in TIS-620 (and the last character is \xd2 instead of \xd) then it will be "รุ่งรดา".
To work with the non-Unicode strings in Python, you may try: myString.decode("tis-620") or even sys.setdefaultencoding("tis-620")

how to display extended ascii character in QTextEdit

All the ASCII codes greater than 127 are replaced by Diamond? symbol. How can I display those characters. I have an unsigned char buffer[1024] which contains values from 0 to 256.
Use the QString class's fromAscii() method. By default this will treat Ascii chars above 128 as Latin-1 chars. To change this behavior use QTextCodec::setCodecForCStrings method to set the correct codec for your usage.
I believe QT5 may have taken out the setCodecForCStrings method.
EDIT: Adnan supplied the QT5 alternative to setCodecForCStrings method, adding to answer for completeness.
Qt5 alternative for setCodecForCStrings is QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
This is a rabbit hole with no end. Qt does not fully support printing ascii > 127 as it is not well defined. The current method is to use "fromLocal8bit()" which will take a char array and transform it into the "right" Unicode string (the only thing Qt supports printing).
QTextCodec::setCodecForLocale can be used to identify the character set you wish to transform from. Many codecs are supported, but for some reason IBM437 (the character set used by IBM PCs in the US for decades) is not supported, where several other codecs used by Europe, etc. are. Probably some characters in IBM437 were never assigned proper code points in Unicode, so transforming it isn't possible?
What's frustrating is that there are fonts with all 256 ascii code points, but it is simply not possible to display these in Qt as they only work with Unicode strings. There are a handful of glyphs they don't support, and it seems to grow with newer versions of Qt. Currently I know of 9, 10, 12, 13, and 173. Some of these are for obvious reasons (usually you don't want to print a carriage return glyph, though it did exist in DOS), but others used to work in Qt and now do not.
In my application, I resorted to creating a new font that has copies of the unprintable glyphs in higher unicode codepoints, and translate them before printing them on the screen. It's quite silly but Qt gave up on ascii many years ago, so it's the best option I could find.

Unicode troubles in FreeType

So, I've got an implementation that parses an xml that, among other things, positions and strings of Wikipedia's main page. The parsing is done with rapidxml after which the strings are converted from UTF-8 to UTF-32 by http://utfcpp.sourceforge.net/. The UTF-32 code is then used in freetype's:
unsigned long c = FT_Get_Char_Index(face,*p);
FT_Load_Glyph(face,c,FT_LOAD_RENDER);
where *p is the UTF-32 char code. This glyph is then rendered in OpenGL.
Now, I can't seem to get cryllic characters to work, nor any chinese or japanese or viet, I am sure that *p corresponds to the correct code, and I would be thankful for any pointers I can get.
For these fonts Microsofts arial.ttf is used, from the Arch linux package and from what I've seen in fontviewing programs, it should contain the characters that I want.
Two things to suggest:
First, have you called FT_Select_Charmap to specify you're using a Unicode encoding?
FT_Select_Charmap(face , ft_encoding_unicode);
Second, not all Arial fonts have all characters, and some font viewers (on Windows, anyway) can mislead by automatically substituting glyphs from different faces. Try ArialUni.ttf if you can find it.
Do not forget to set the font size right after loading the face.
FT_Error err = FT_Set_Pixel_Sizes(face, (width), (height));

Unicode character for superscript shows a square box: ࠚ

Using the following code to create a Unicode string:
wchar_t HELLO[20];
wsprintf(HELLO, TEXT("%c"), 0x2074);
When I display this onto a Win32 Control like a Text box or a button it gets mapped to a [] Square.
How do I fix this ?
I tried compiling with both Eclipse(MinGW) and Microsoft Visual C++ (2010).
Also, UNICODE is defined at the top
Edit:
I think it might be something to do with my system, because when I visit: http://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts
some of the unicode characters don't appear.
The font you are using does not contain a glyph for that character. You will likely need to install some new fonts to overcome this deficiency.
The character you have picked out is 'SAMARITAN MODIFIER LETTER EPENTHETIC YUT' (U+081A). Perhaps you were after U+2074, i.e. 'SUPERSCRIPT FOUR' (U+2074). You need hex for that: 0x2074.
Note you changed the question to read 0x2074 but the original version read 2074. Either way, if you see a box that indicates your font is missing that glyph.
The characters you are getting from Wikipedia are expressed in hexadecimal, so your code should be:
wchar_t HELLO[20];
wsprintf(HELLO, TEXT("%c"), (wchar_t)0x2074); // or TEXT('\x2074')
If it still doesn't work, it's a font problem; if you need a pan-Unicode font, it seems that Code2000 is one of the most complete out there.
Funny fact: the character that has the decimal code 2074 (i.e. hex 81a) seems to actually be a box (or it's such a strange beast that even the image outline at FileFormat.Info is wrong). :)
For the curious ones: it turns out that 0x081a is this thing:

How can I recognize RTL strings in C++

I need to know the direction of my text before printing.
I'm using Unicode Characters.
How can I do that in C++?
If you don't want to use ICU, you can always manually parse the unicode database (.e.g., with a python script). It's a semicolon-separated text file, with each line representing a character code point. Look for the fifth record in each line - that's the character class. If it's R or AL, you have an RTL character, and 'L' is an LTR character. Other classes are weak or neutral types (like numerals), which I guess you'd want to ignore. Using that info, you can generate a lookup table of all RTL characters and then use it in your C++ code. If you really care about code size, you can minimize the size the lookup table takes in your code by using ranges (instead of an entry for each character), since most characters come in blocks of their BiDi class.
Now, define a function called GetCharDirection(wchar_t ch) which returns an enum value (say: Dir_LTR, Dir_RTL or Dir_Neutral) by checking the lookup table.
Now you can define a function GetStringDirection(const wchar_t*) which runs through all characters in the string until it encounters a character which is not Dir_Neutral. This first non-neutral character in the string should set the base direction for that string. Or at least that's how ICU seems to work.
You could use the ICU library, which has a functions for that (ubidi_getDirection ubidi_getBaseDirection).
The size of ICU can be reduced, by recompiling the data library (which is normally about 15MB big), to include only the converters/locals which are needed for the project.
The section Reducing the Size of ICU's Data: Conversion Tables of the site http://userguide.icu-project.org/icudata, contains information how you can reduce the size of the data library.
If only need support for the most common encodings (US-ASCII, ISO-8859-1, UTF-7/8/16/32, SCSU, BOCU-1, CESU-8), the data library wont be needed anyway.
From Boaz Yaniv said before, maybe something like this will easier and faster than parsing the whole file:
int aft_isrtl(int c){
if (
(c==0x05BE)||(c==0x05C0)||(c==0x05C3)||(c==0x05C6)||
((c>=0x05D0)&&(c<=0x05F4))||
(c==0x0608)||(c==0x060B)||(c==0x060D)||
((c>=0x061B)&&(c<=0x064A))||
((c>=0x066D)&&(c<=0x066F))||
((c>=0x0671)&&(c<=0x06D5))||
((c>=0x06E5)&&(c<=0x06E6))||
((c>=0x06EE)&&(c<=0x06EF))||
((c>=0x06FA)&&(c<=0x0710))||
((c>=0x0712)&&(c<=0x072F))||
((c>=0x074D)&&(c<=0x07A5))||
((c>=0x07B1)&&(c<=0x07EA))||
((c>=0x07F4)&&(c<=0x07F5))||
((c>=0x07FA)&&(c<=0x0815))||
(c==0x081A)||(c==0x0824)||(c==0x0828)||
((c>=0x0830)&&(c<=0x0858))||
((c>=0x085E)&&(c<=0x08AC))||
(c==0x200F)||(c==0xFB1D)||
((c>=0xFB1F)&&(c<=0xFB28))||
((c>=0xFB2A)&&(c<=0xFD3D))||
((c>=0xFD50)&&(c<=0xFDFC))||
((c>=0xFE70)&&(c<=0xFEFC))||
((c>=0x10800)&&(c<=0x1091B))||
((c>=0x10920)&&(c<=0x10A00))||
((c>=0x10A10)&&(c<=0x10A33))||
((c>=0x10A40)&&(c<=0x10B35))||
((c>=0x10B40)&&(c<=0x10C48))||
((c>=0x1EE00)&&(c<=0x1EEBB))
) return 1;
return 0;
}
If you are using Windows GDI, it would seem that GetFontLanguageInfo(HDC) returns a DWORD; if GCP_REORDER is set, the language requires reordering for display, for example, Hebrew or Arabic.