Arabic: 'source' Unicode to final display Unicode - c++

simple question:
this is the final display string I am looking for
لعبة ديدة
now below is each of the separate characters, before being 'glued' together (so I've put a space between each of them to stop the joining)
ل ع ب ة د ي د ة
note how they are NOT the same characters, there is some magical transform that melds them together and converts them to new Unicode characters.
and then in that above, the characters are actually appearing right to left (in memory, they are left to right)
so my simple question is this: where do I get a platform independent c/c++ function that will take my source 16 bit Unicode string, and do the transform on it to result in the Unicode string that will create the one first quoted above? doing the RTL conversion, and the joining?
that's all I want, one function that does that.
UPDATE:
ok, yes, I know that the 'characters' are the same in the two above examples, they are the same 'letters' but (viewing in chrome, or latest IE) anyone can CLEARLY see that the glyphs are different. now I'm fairly confident that this transform that needs to be done can be done on the unicode level, because my font file, and the unicode standard, seems to specify the different glyphs for both the separate, and various joined versions of the characters/letters. (unicode.org/charts/PDF/UFB50.pdf unicode.org/charts/PDF/UFE70.pdf)
so, can I just put my unicode into a function and get the transformed unicode out?

The joining and RTL conversion don't happen at the level of Unicode characters.
In other words: the order of the characters and the actual unicode codepoints are not changed during this process.
In fact, the merging and handling RTL/LTR transitions is handled by the text rendering engine.
This quote from the Wikipedia article on the Arabic alphabet explains it quite nicely:
Finally, the Unicode encoding of Arabic is in logical order, that is, the characters are entered, and stored in computer memory, in the order that they are written and pronounced without worrying about the direction in which they will be displayed on paper or on the screen. Again, it is left to the rendering engine to present the characters in the correct direction, using Unicode's bi-directional text features. In this regard, if the Arabic words on this page are written left to right, it is an indication that the Unicode rendering engine used to display them is out-of-date.

The processing you're looking for is called contextual shaping (joining, including ligatures). Unlike most Latin-script text, where you can simply put one glyph after another to render the text, joined forms are fundamental in Arabic. The substitution is done in the text rendering engine, and the shaping information is generally stored in font files.
note how they are NOT the same characters
They are the same for an Arabic reader. It is still readable.
There is no transform to apply to your UTF-16 source text. You must pass the whole string to your text renderer. In C/C++, since you want to stay platform independent, you can use Pango for rendering.
Note: perhaps you meant to write لعبة جديدة (i.e. "new game")? The example you gave has no meaning in Arabic.

I realise this is an old question, but what you're looking for is FriBidi, the GNU implementation of the Unicode bidirectional algorithm.
This library does the glyph selection that was asked about in the question, as well as handling bidirectional text (a mixture of right-to-left and left-to-right text).

What you are looking for is an Arabic script synthesis algorithm. I'm not aware of one that exists as open source. If you find one, please post it.
Some points:
At the storage level, there is no Unicode transform. There is an abstract representation of the string as pointed out by other answers.
At the rendering level, you could choose to use Unicode Presentation Forms, but you could also choose to use other forms. Unicode Presentation Forms are not a standard for what presentation output encoding should be - rather they are just one example of presentation codes that can be output by the rendering engine using script synthesis.
To make it clearer: There wouldn't be a single standard transform (ie synthesis algorithm) that would transform from A to B, where A is standard Unicode Arabic page, and B is standard Unicode Arabic Presentation Forms. Rather, there would be different transformations that can vary in complexity and can have different encoding systems for B, but one of the encodings that can be used for B is the Unicode Presentation Forms.
For example, a simple typewriter style would require a simple rendering algorithm that would not require Presentation Forms. Indeed, there do exist modern writing styles (not in common usage though) where A and B are actually identical, only that a different font page would be used to do the rendering. On the other hand, the transform to render typesetting or traditional calligraphic forms would be more complex and require something similar to the Unicode Presentation Forms.
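To make the "one possible transform from A to B" idea concrete, here is a minimal sketch of a joining pass that maps base Arabic letters to Arabic Presentation Forms-B. It is not a standard algorithm and not what any rendering engine actually does: the names `Forms`, `form_table` and `to_presentation_forms` are made up for this example, the table only covers the letters from the question, and diacritics, lam-alef ligatures and ZWJ/ZWNJ are ignored.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Presentation Forms-B code points for one letter; 0 means "no such form".
struct Forms {
    char16_t isolated;
    char16_t final_form;
    char16_t initial;  // 0 for letters that never connect to the next letter
    char16_t medial;   // 0 for the same letters
};

// Hypothetical, deliberately tiny table (assumption: only these letters occur).
const std::unordered_map<char16_t, Forms>& form_table() {
    static const std::unordered_map<char16_t, Forms> table = {
        {0x0628, {0xFE8F, 0xFE90, 0xFE91, 0xFE92}},  // BEH
        {0x0629, {0xFE93, 0xFE94, 0, 0}},            // TEH MARBUTA
        {0x062C, {0xFE9D, 0xFE9E, 0xFE9F, 0xFEA0}},  // JEEM
        {0x062F, {0xFEA9, 0xFEAA, 0, 0}},            // DAL
        {0x0639, {0xFEC9, 0xFECA, 0xFECB, 0xFECC}},  // AIN
        {0x0644, {0xFEDD, 0xFEDE, 0xFEDF, 0xFEE0}},  // LAM
        {0x064A, {0xFEF1, 0xFEF2, 0xFEF3, 0xFEF4}},  // YEH
    };
    return table;
}

// True if `c` is a known letter that can connect to the letter after it.
bool joins_forward(char16_t c) {
    auto it = form_table().find(c);
    return it != form_table().end() && it->second.initial != 0;
}

// True if `c` is a known letter (every letter in the table can connect
// to the letter before it).
bool joins_backward(char16_t c) {
    return form_table().count(c) != 0;
}

// Replaces each known letter by its isolated/initial/medial/final form.
// The result stays in logical order; no right-to-left reordering is done.
std::u16string to_presentation_forms(const std::u16string& text) {
    std::u16string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        auto it = form_table().find(text[i]);
        if (it == form_table().end()) {  // unknown character: copy unchanged
            out.push_back(text[i]);
            continue;
        }
        const Forms& f = it->second;
        const bool prev = i > 0 && joins_forward(text[i - 1]);
        const bool next = f.initial != 0 && i + 1 < text.size() &&
                          joins_backward(text[i + 1]);
        char16_t shaped = f.isolated;
        if (prev && next)  shaped = f.medial;
        else if (prev)     shaped = f.final_form;
        else if (next)     shaped = f.initial;
        out.push_back(shaped);
    }
    return out;
}
```

Calling `to_presentation_forms(u"\u0644\u0639\u0628\u0629")` on the first word of the question yields initial LAM, medial AIN, medial BEH and final TEH MARBUTA. Displaying the result on a renderer without bidi support would still need a separate reordering step.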
Here are a couple of pointers for more information on the topic:
http://unicode.org/faq/ligature_digraph.html#Pf1
http://www.decotype.com/publications/unicode-tutorial.pdf

Please see: http://www.fileformat.info/info/unicode/block/arabic_presentation_forms_b/list.htm and have a look at this repo: https://github.com/Accorpa/Arabic-Converter-From-and-To-Arabic-Presentation-Forms-B

Related

Checking if the content language uses the right to left direction?

Is there a built-in method in Qt or another way to check if the content language uses the Right-to-Left direction?
    QFile fileHandle("c:/file.txt");
    if (!fileHandle.open(QFile::ReadOnly | QFile::Text))
        return;
    QTextStream fileContent(&fileHandle);
    fileContent.setCodec("UTF-8");
    fileContent.setGenerateByteOrderMark(false);
    ui->plainTextEdit->setPlainText(fileContent.readAll());
    fileHandle.close();
I haven't worked much with right-to-left languages, but I hope these suggestions help:
If you know your content is in Unicode you can check out this answer (use QTextCodec::codecForUtfText) to detect the exact encoding. Then, classify the symbols to detect the dominant subset (left-to-right: English, Cyrillic..., right-to-left: Arabic, Hebrew...); a histogram will probably be enough. You could use a language detection framework instead, but I think you only need the type of language, not the language itself (which is far more complex).
Search for the right-to-left mark (RLM) (a non-printed character commonly used to indicate bi-directional text). If you create the content you can add the RLM at the beginning of the file (the opposite (LRM) also exists).
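The two suggestions above could be sketched roughly like this in plain C++, working on already-decoded code points. This is only an approximation under stated assumptions: the block ranges in `is_rtl` and `is_ltr` are coarse and hand-picked, and `dominant_direction` and `starts_with_rlm` are names invented for this example. Qt's documentation also lists a `QString::isRightToLeft()` member, which may already be enough for the original question.

```cpp
#include <cstddef>
#include <string>

enum class Direction { LeftToRight, RightToLeft, Neutral };

// Coarse assumption: treat the Hebrew/Arabic-family blocks and their
// presentation forms as right-to-left. Not a full bidi-class lookup.
bool is_rtl(char32_t c) {
    return (c >= 0x0590 && c <= 0x08FF)    // Hebrew, Arabic, Syriac, Thaana, ...
        || (c >= 0xFB1D && c <= 0xFDFF)    // Hebrew/Arabic presentation forms
        || (c >= 0xFE70 && c <= 0xFEFC);   // Arabic Presentation Forms-B
}

// Coarse assumption: ASCII letters plus the Latin/Greek/Cyrillic/Armenian range.
bool is_ltr(char32_t c) {
    return (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z')
        || (c >= 0x00C0 && c <= 0x058F);
}

// Two-bucket histogram: whichever direction has more characters wins.
Direction dominant_direction(const std::u32string& text) {
    std::size_t rtl = 0, ltr = 0;
    for (char32_t c : text) {
        if (is_rtl(c)) ++rtl;
        else if (is_ltr(c)) ++ltr;
    }
    if (rtl > ltr) return Direction::RightToLeft;
    if (ltr > rtl) return Direction::LeftToRight;
    return Direction::Neutral;
}

// True if the text begins with U+200F RIGHT-TO-LEFT MARK.
bool starts_with_rlm(const std::u32string& text) {
    return !text.empty() && text[0] == 0x200F;
}
```

With Qt, the `QString` read from the file could be converted with `toStdU32String()` (Qt 5.5 or later, as far as I know) before being passed to these functions.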

How to display characters of any language on the screen using opengl

My requirement is to display string of any language on the screen.
Currently we are using opengl to display English characters.
Same APIs are not working for other languages. Instead of characters, boxes are displayed on screen.
Can someone help me understand OpenGL and find appropriate APIs to display characters of any language?
Currently we are using opengl to display English characters.
No, you're not using OpenGL. How do I know this? Because OpenGL does not do text rendering. All it does is draw points, lines and triangles.
What you're using is some library that knows how to draw characters with points, lines and triangles and then uses OpenGL to get that job done. And the particular library you're using apparently doesn't know how to deal with characters outside of the ASCII character set.
Of course, that's not all that matters. Encoding matters as well. The most recent versions of C++ support Unicode in program sources (so that you can write Unicode in string literals), but that does not automatically give you Unicode support in your program: the compiler knows how to deal with it, but that knowledge does not automatically carry over into the compiled program.
So far there is only one operating system in which Unicode support is so deeply ingrained that no extra work is required; in fact a particular way of encoding Unicode was invented for it, but unfortunately this is one of the most niche OS projects there is around: Plan9
Apart from Unicode, there are also many other character encoding schemes, all incompatible with each other, each for a particular kind of writing. This means it is impossible to mix characters from different writing systems in text encoded with such localized character sets. Hence a universal encoding scheme, Unicode, was invented.
You're most likely on Windows, Linux, BSD, Solaris or MacOS X. And in all of them making non-ASCII-characters work means extra work for you, the programmer. MacOS X is probably the one OS with the least barrier of entry.
So here are the questions you have to answer for yourself:
what character encoding is used (hopefully Unicode)?
does the text renderer library used support code points in that encoding?
does the text renderer library come with a layout engine (the thing that positions characters) or does this have to be supplied extra?
Among the existing text renderers that can draw to OpenGL, currently Freetype-GL is the most capable; it has support for Unicode
https://github.com/rougier/freetype-gl
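Whatever renderer is used, a common first step is turning the encoded bytes into code points before looking up glyphs. Below is a minimal sketch of that step, assuming the input is UTF-8; `decode_utf8` is a name made up for this example and is not part of Freetype-GL. It replaces malformed bytes with U+FFFD and, to stay short, does not reject overlong sequences or surrogates.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 bytes into Unicode code points.
// Malformed or truncated sequences produce U+FFFD (replacement character).
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else {                         // stray continuation or invalid lead byte
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        bool ok = i + extra < bytes.size();  // enough bytes left?
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;        // not a continuation byte
            else cp = (cp << 6) | (b & 0x3F);
        }
        if (ok) { out.push_back(cp); i += extra + 1; }
        else    { out.push_back(0xFFFD); ++i; }
    }
    return out;
}
```

Each resulting code point can then be handed to the font library to find a glyph. If the font has no glyph for it, a box is what typically gets drawn, which may be what the question is describing.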

Using General Unicode Properties

I am trying to take advantage of the regex functionality : \p{UNICODE PROPERTY NAME}
However, I am struggling to understand the mapping of those property names.
I went directly to the Unicode.org website ( http://www.unicode.org/Public/UCD/latest/ucd/) and downloaded the file 'UnicodeData.txt', which has the category listed... but this only shows 27,268 character values.
But I understand there are 65k characters in UTF-8 or UCS-2 .... so I am confused why the Unicode.org download only has about 27k rows.
... am I missing a point here somewhere ?
I am sure I'm just being blind to something simple here ... if someone can help me understand.... I'd be grateful !
Everything is fine so far. The characters you see are all of them except the CJK (Chinese-Japanese-Korean) ones. The Unicode Consortium left those out of the main UnicodeData file to keep it at a reasonable size.
If you want to look up properties for single characters only (and not in bulk), you can use websites that prepare that data for you, like Graphemica, FileFormat or (my own) Codepoints.net.
If, however, you need bulk lookups, Unicode also provides the data as an XML file with a specific syntax, that groups codepoints together. That might be the best choice for processing the data.
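For reference, each line of UnicodeData.txt is a list of semicolon-separated fields: the code point in hex, the character name, then the general category (the two-letter value such as Lu that \p{Lu} refers to). As far as I know, large blocks such as the CJK ideographs appear only as a pair of range markers named like "<CJK Ideograph, First>" and "<CJK Ideograph, Last>", which is why the row count is so low. A minimal parsing sketch follows; `UcdRecord`, `parse_ucd_line` and `is_range_start` are names invented for this example, and only the first three fields are read.

```cpp
#include <cstdint>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

struct UcdRecord {
    std::uint32_t code_point;
    std::string name;
    std::string general_category;  // e.g. "Lu", "Ll", "Lo", "Nd"
};

// Splits a line on ';', keeping empty fields in the middle.
std::vector<std::string> split_fields(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, ';')) fields.push_back(field);
    return fields;
}

// Parses one UnicodeData.txt line; returns false if it does not look valid.
bool parse_ucd_line(const std::string& line, UcdRecord& out) {
    const std::vector<std::string> fields = split_fields(line);
    if (fields.size() < 3) return false;
    try {
        out.code_point =
            static_cast<std::uint32_t>(std::stoul(fields[0], nullptr, 16));
    } catch (const std::exception&) {
        return false;  // first field was not a hex number
    }
    out.name = fields[1];
    out.general_category = fields[2];
    return true;
}

// True for the first line of a range pair, e.g. "<CJK Ideograph, First>".
bool is_range_start(const UcdRecord& r) {
    const std::string suffix = ", First>";
    return r.name.size() >= suffix.size() &&
           r.name.compare(r.name.size() - suffix.size(), suffix.size(),
                          suffix) == 0;
}
```

A loader built on this would treat every code point between a "First" and the matching "Last" line as having the category given on those two lines.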

How to correctly display characters from different languages?

I am finishing application in Visual C++/Windows API and I am using MySql C Connector.
Whole application code uses ANSI, MySql C Connector is in ANSI too.
This program will be used on Polish and German computers with Windows XP/Vista/7 or 8.
I want to correctly display German umlauts and Polish accented characters on:
DialogBox controls (strings are loaded from language files)
Generated XHTML documents
Strings retrieved from MySql database displayed on controls and in XHTML documents
I have heard about MultiByteToWideChar and Unicode functions (MessageBoxW etc.), but application code is nearly finished, converting is a lot of work...
How can I get the character encoding right with the least work and time?
Maybe changing system code page for non-Unicode program?
First, of course: what code set is MySQL returning? Or perhaps:
what code set was used when writing the data into the database?
Other than that, I don't think you'll be able to avoid using
either wide characters or multibyte characters: for single-byte
characters, German would use ISO 8859-1 (or its close Windows
relative, code page 1252) or ISO 8859-15, and Polish ISO 8859-2
(or Windows code page 1250). But what are
you doing with the characters in your own code? You may be able
to get away with UTF-8 (code page 65001), without many changes.
The real question is where the characters originally come from
(although it might not be too difficult to translate them into
UTF-8 immediately at the source); I don't think that Windows
respects the code page for input.
Although it doesn't help you much to know it, you're dealing
with an almost impossible problem, since so much depends on
things outside your program: things like the encoding of the
display font, or the keyboard driver, for example. In fact,
it's not rare for programs to display one thing on the screen,
and something different when outputting to the printer, or to
display one thing on the screen, but something different if the
data is written to a file, and read with another program. The
situation is improving—modern Unix and the Internet are
gradually (very gradually) standardizing on UTF-8, everywhere
and for everything, and Windows normally uses UTF-16 for
everything that is pure Windows (but needs to support UTF-8 for
the Internet). But even using the platform standard won't help
if the human client has installed (and is using) fonts which
don't have the characters you need.
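As an illustration of "translating into UTF-8 immediately at the source", here is a minimal portable sketch for the simplest case. It assumes the input really is ISO 8859-1, where every byte value equals its Unicode code point. It is not correct for code page 1252 (bytes 0x80 to 0x9F differ) or for Polish text in ISO 8859-2 or code page 1250, which would need a lookup table or, on Windows, MultiByteToWideChar followed by WideCharToMultiByte. The name `latin1_to_utf8` is made up for this example.

```cpp
#include <string>

// Converts ISO 8859-1 bytes to UTF-8.
// Assumption: the input is strictly ISO 8859-1, so byte value == code point.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string utf8;
    utf8.reserve(latin1.size() * 2);
    for (char ch : latin1) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            utf8.push_back(static_cast<char>(c));  // ASCII is unchanged
        } else {
            // Code points U+0080..U+00FF take two bytes: 110xxxxx 10xxxxxx
            utf8.push_back(static_cast<char>(0xC0 | (c >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return utf8;
}
```

For example, the single byte 0xFC (ü in ISO 8859-1) becomes the two bytes 0xC3 0xBC.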

What is the native narrow string encoding on Windows?

The Subversion API has a number of functions for converting from "natively-encoded" strings to strings that are encoded in UTF-8. My question is: what is this native encoding on Windows? Does it depend on locale?
"Natively encoded" strings are strings written in whatever code page the user is using. That is, they are numbers that are translated to the appropriate glyphs based on the correct code page. Assuming the file was saved that way and not as a UTF-8 file.
This is a candidate question for Joel's article on Unicode.
Specifically:
Eventually this OEM free-for-all got
codified in the ANSI standard. In the
ANSI standard, everybody agreed on
what to do below 128, which was pretty
much the same as ASCII, but there were
lots of different ways to handle the
characters from 128 and on up,
depending on where you lived. These
different systems were called code
pages. So for example in Israel DOS
used a code page called 862, while
Greek users used 737. They were the
same below 128 but different from 128
up, where all the funny letters
resided. The national versions of
MS-DOS had dozens of these code pages,
handling everything from English to
Icelandic and they even had a few
"multilingual" code pages that could
do Esperanto and Galician on the same
computer! Wow! But getting, say,
Hebrew and Greek on the same computer
was a complete impossibility unless
you wrote your own custom program that
displayed everything using bitmapped
graphics, because Hebrew and Greek
required different code pages with
different interpretations of the high
numbers.
Windows 1252. Jukka Korpela has an excellent page on character encodings, with an extensive discussion of the Windows character set.
From the header svn_string.h you can see that the relevant svn_strings are just plain old const char* + a length element.
I would guess that the "natively encoded" svn strings are interpreted according to your system locale (I do not know this for sure, but this is the convention). On Windows 7 you can check your locale by selecting "Start-->Control Panel-->Region and Language-->Administrative-->Change system locale" where any value of English would probably entail the character encoding Windows 1252. However, a different system locale, for example Hebrew (Israel), would entail a different character encoding (Windows 1255 for the case of Hebrew).
Sadly, the MSVC version of the C library does not support UTF-8 and uses legacy code pages only, but Cygwin provides a UTF-8 locale as part of its emulation layer. If your svn is built on Cygwin, you should be able to use UTF-8 just fine.