Checking if the content language uses the right-to-left direction (C++)

Is there a built-in method in Qt or another way to check if the content language uses the Right-to-Left direction?
QFile fileHandle("c:/file.txt");
if (!fileHandle.open(QFile::ReadOnly | QFile::Text))
    return;
QTextStream fileContent(&fileHandle);
fileContent.setCodec("UTF-8");
fileContent.setGenerateByteOrderMark(false);
ui->plainTextEdit->setPlainText(fileContent.readAll());
fileHandle.close();

I haven't worked much with right-to-left languages, but I hope these suggestions help:
If you know your content is Unicode, you can check out this answer (use QTextCodec::codecForUtfText) to detect the exact encoding. Then classify the characters to find the dominant subset (left-to-right: English, Cyrillic...; right-to-left: Arabic, Hebrew...); a histogram will probably be enough. You could use a language detection framework instead, but I think you only need the type of language, not the language itself (which is far more complex).
Search for the right-to-left mark (RLM, U+200F), a non-printing character commonly used to mark bidirectional text. If you create the content yourself, you can add the RLM at the beginning of the file (its opposite, the LRM, also exists).
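The histogram idea above can be sketched in plain C++ without Qt. This is only a sketch under stated assumptions: the text is already decoded to UTF-32 code points, and the listed ranges cover just the main Hebrew, Arabic, Latin, Greek and Cyrillic blocks plus the RLM/LRM marks, not the full Unicode Bidi_Class data. The names Direction, classify and dominantDirection are hypothetical. In a Qt program, QString::isRightToLeft() and QChar::direction() may already be enough.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Direction { LeftToRight, RightToLeft, Neutral };

// Rough classification of one code point. Assumption: only a handful of
// blocks are listed; the real Bidi_Class data in the UCD is far more detailed.
Direction classify(char32_t c) {
    if (c == 0x200F) return Direction::RightToLeft;  // RIGHT-TO-LEFT MARK
    if (c == 0x200E) return Direction::LeftToRight;  // LEFT-TO-RIGHT MARK
    const bool rtl = (c >= 0x0590 && c <= 0x08FF) ||  // Hebrew, Arabic, Syriac, Thaana, ...
                     (c >= 0xFB1D && c <= 0xFDFF) ||  // Hebrew and Arabic presentation forms
                     (c >= 0xFE70 && c <= 0xFEFC);    // Arabic Presentation Forms-B
    if (rtl) return Direction::RightToLeft;
    const bool ltr = (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z') ||
                     (c >= 0x00C0 && c <= 0x024F) ||  // Latin-1 letters, Latin Extended (approximate)
                     (c >= 0x0370 && c <= 0x052F);    // Greek, Cyrillic
    if (ltr) return Direction::LeftToRight;
    return Direction::Neutral;  // digits, punctuation, whitespace, everything else
}

// Histogram over the whole text: whichever strong direction has more
// code points wins; a tie (or no strong characters) yields Neutral.
Direction dominantDirection(const std::u32string& text) {
    std::size_t ltr = 0, rtl = 0;
    for (char32_t c : text) {
        switch (classify(c)) {
            case Direction::LeftToRight: ++ltr; break;
            case Direction::RightToLeft: ++rtl; break;
            case Direction::Neutral: break;
        }
    }
    if (rtl > ltr) return Direction::RightToLeft;
    if (ltr > rtl) return Direction::LeftToRight;
    return Direction::Neutral;
}
```

Counting every strong character, rather than looking only at the first one, keeps a stray Latin word at the start of an Arabic file from flipping the result.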

Related

Using General Unicode Properties

I am trying to take advantage of the regex functionality \p{UNICODE PROPERTY NAME}.
However, I am struggling to understand the mapping of those property names.
I went directly to the Unicode.org website (http://www.unicode.org/Public/UCD/latest/ucd/) and downloaded the file 'UnicodeData.txt', which has the category listed... but it only shows 27,268 character values.
But I understand there are 65k characters in UTF-8 or UCS-2... so I am confused why the Unicode.org download only has 24k rows.
... am I missing a point here somewhere?
I am sure I'm just being blind to something simple here... if someone can help me understand, I'd be grateful!
Everything is fine so far. The file lists every character except the CJK ones (Chinese-Japanese-Korean). The Unicode Consortium left those out of the main UnicodeData file to keep it at a reasonable size.
If you only want to look up properties for single characters (and not in bulk), you can use websites that prepare that data for you, like Graphemica, FileFormat or (my own) Codepoints.net.
If, however, you need bulk lookups, Unicode also provides the data as an XML file with a specific syntax that groups codepoints together. That might be the best choice for processing the data.
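For bulk processing of UnicodeData.txt itself, note that large blocks such as the CJK ideographs appear there as a pair of marker lines whose names end in ", First>" and ", Last>" (for example <CJK Ideograph, First>) instead of one line per character. A minimal sketch of folding those pairs into ranges follows; CategoryRange, parseUnicodeData and categoryOf are hypothetical names, and the parser assumes well-formed input.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct CategoryRange {
    char32_t first;
    char32_t last;
    std::string category;  // General_Category abbreviation, e.g. "Lu" or "Lo"
};

static bool endsWith(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Reads UnicodeData.txt-style lines ("code;name;category;...") and folds
// each "<..., First>" / "<..., Last>" pair into a single range.
// Assumption: lines are well formed; std::stoul throws on a bad code field.
std::vector<CategoryRange> parseUnicodeData(std::istream& in) {
    std::vector<CategoryRange> ranges;
    std::string line;
    char32_t rangeStart = 0;
    bool inRange = false;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string code, name, category;
        if (!std::getline(fields, code, ';') || !std::getline(fields, name, ';') ||
            !std::getline(fields, category, ';')) {
            continue;  // blank or truncated line
        }
        const char32_t cp = static_cast<char32_t>(std::stoul(code, nullptr, 16));
        if (endsWith(name, ", First>")) {
            rangeStart = cp;  // remember where the range begins
            inRange = true;
        } else if (inRange && endsWith(name, ", Last>")) {
            ranges.push_back({rangeStart, cp, category});
            inRange = false;
        } else {
            ranges.push_back({cp, cp, category});  // ordinary single-character line
        }
    }
    return ranges;
}

// Linear lookup; code points that are not listed default to "Cn" (unassigned).
std::string categoryOf(const std::vector<CategoryRange>& ranges, char32_t cp) {
    for (const CategoryRange& r : ranges) {
        if (cp >= r.first && cp <= r.last) return r.category;
    }
    return "Cn";
}
```

A linear scan keeps the sketch short; since the file is sorted by code point, a binary search over the ranges would be the natural next step for heavy use.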

Effective and simple way of storing multiline strings of text?

I'm trying to find a way to do this.
My game is multilingual.
I have English.txt, French.txt, etc..
I'm wondering what would be a good way to store it in the file for example:
<sendbutton.tooltip>
Use this button to send text.
The text can be as long as you like!
</sendbutton.tooltip>
or
sendbutton.tooltip = Use this button to send text.\n\nThe text can be as long as you like!
I then will map these strings to their element name for runtime use.
Other than using a standard like XML, what is usually done to do this?
Thanks
Usually this kind of localization task is done with GNU gettext.
It depends on when you are going to load the file.
For standard translation work, I recommend you take a look at gettext. It provides translation tools and an easy way to integrate them. You can store the English text as C strings wrapped in a translation macro, _() or T() or whatever, and gettext will extract the strings that need translating. It also tracks the translations that need to be updated when the original English text changes. You store all translations for a specific language in separate files.
I'm not sure about C++, but maybe use resource files: create a separate file for each language containing key/value pairs for the lookup, with each language using the same key but the message text in the correct language, e.g.
resources.de
LOGOUT, Abmelden
resources.en
LOGOUT, Logout
Then, depending on the user's language choice, you load the appropriate resource file to display the correct text. I think you would just store the multiline strings with \n as in your second example.
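A minimal sketch of loading the second format from the question ("key = value" lines with \n escapes) into a lookup table is shown here. The function names, the '#' comment convention and the handling of unknown escapes are assumptions made for illustration, not an established file format.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Strip leading and trailing spaces, tabs and carriage returns.
std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    const std::size_t begin = s.find_first_not_of(ws);
    if (begin == std::string::npos) return "";
    const std::size_t end = s.find_last_not_of(ws);
    return s.substr(begin, end - begin + 1);
}

// Turn the two-character sequences \n and \\ into a newline and a backslash.
std::string unescape(const std::string& raw) {
    std::string out;
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] == '\\' && i + 1 < raw.size()) {
            const char next = raw[++i];
            if (next == 'n') out += '\n';
            else if (next == '\\') out += '\\';
            else { out += '\\'; out += next; }  // unknown escape: keep as written
        } else {
            out += raw[i];
        }
    }
    return out;
}

// Parse "key = value" lines; blank lines and lines starting with '#'
// (an assumed comment convention) are skipped.
std::map<std::string, std::string> loadStrings(std::istream& in) {
    std::map<std::string, std::string> table;
    std::string line;
    while (std::getline(in, line)) {
        const std::string trimmed = trim(line);
        if (trimmed.empty() || trimmed[0] == '#') continue;
        const std::size_t eq = trimmed.find('=');
        if (eq == std::string::npos) continue;  // not a key/value line
        table[trim(trimmed.substr(0, eq))] = unescape(trim(trimmed.substr(eq + 1)));
    }
    return table;
}
```

Passing a std::ifstream opened on English.txt or French.txt would give one table per language, keyed by element name as the question describes.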

Arabic: 'source' Unicode to final display Unicode

simple question:
this is the final display string I am looking for
لعبة ديدة
now below is each of the separate characters, before being 'glued' together (so I've put a space between each of them to stop the joining)
ل ع ب ة د ي د ة
note how they are NOT the same characters, there is some magical transform that melds them together and converts them to new Unicode characters.
and then in that above, the characters are actually appearing right to left (in memory, they are left to right)
so my simple question is this: where do I get a platform independent c/c++ function that will take my source 16 bit Unicode string, and do the transform on it to result in the Unicode string that will create the one first quoted above? doing the RTL conversion, and the joining?
that's all I want, one function that does that.
UPDATE:
ok, yes, I know that the 'characters' are the same in the two above examples, they are the same 'letters' but (viewing in chrome, or latest IE) anyone can CLEARLY see that the glyphs are different. now I'm fairly confident that this transform that needs to be done can be done on the unicode level, because my font file, and the unicode standard, seems to specify the different glyphs for both the separate, and various joined versions of the characters/letters. (unicode.org/charts/PDF/UFB50.pdf unicode.org/charts/PDF/UFE70.pdf)
so, can I just put my unicode into a function and get the transformed unicode out?
The joining and RTL conversion don't happen at the level of Unicode characters.
In other words: the order of the characters and the actual unicode codepoints are not changed during this process.
In fact, the merging and handling RTL/LTR transitions is handled by the text rendering engine.
This quote from the Wikipedia article on the Arabic alphabet explains it quite nicely:
Finally, the Unicode encoding of Arabic is in logical order, that is, the characters are entered, and stored in computer memory, in the order that they are written and pronounced without worrying about the direction in which they will be displayed on paper or on the screen. Again, it is left to the rendering engine to present the characters in the correct direction, using Unicode's bi-directional text features. In this regard, if the Arabic words on this page are written left to right, it is an indication that the Unicode rendering engine used to display them is out-of-date.
The processing you're looking for is ligature substitution (more generally, contextual shaping). Unlike many Latin-based languages, where you can simply put one character after another to render the text, ligatures are fundamental in Arabic. The substitution is done in the text rendering engine, and the ligature information is generally stored in font files.
note how they are NOT the same characters
They are the same for an Arabic reader. It is still readable.
There is no transform to do on your 16-bit Unicode source text. You must provide the whole string to your text renderer. In C/C++, and as you are going the platform-independent way, you can use Pango for rendering.
Note: Perhaps you wanted to write لعبة جديدة (i.e. new game)? What you give as an example has no meaning in Arabic.
I realise this is an old question, but what you're looking for is FriBidi, the GNU implementation of the Unicode bidirectional algorithm.
This program does the glyph selection that was asked about in the question, as well as handling bidirectional text (mixture of right-to-left and left-to-right text).
What you are looking for is an Arabic script synthesis algorithm. I'm not aware of one that exists as open source. If you find one, please post it.
Some points:
At the storage level, there is no Unicode transform. There is an abstract representation of the string as pointed out by other answers.
At the rendering level, you could choose to use Unicode Presentation Forms, but you could also choose to use other forms. Unicode Presentation Forms are not a standard for what presentation output encoding should be - rather they are just one example of presentation codes that can be output by the rendering engine using script synthesis.
To make it clearer: There wouldn't be a single standard transform (ie synthesis algorithm) that would transform from A to B, where A is standard Unicode Arabic page, and B is standard Unicode Arabic Presentation Forms. Rather, there would be different transformations that can vary in complexity and can have different encoding systems for B, but one of the encodings that can be used for B is the Unicode Presentation Forms.
For example, a simple typewriter style would require a simple rendering algorithm that would not require Presentation Forms. Indeed, there do exist modern writing styles (not in common usage though) where A and B are actually identical, only that a different font page would be used to do the rendering. On the other hand, the transform to render typesetting or traditional calligraphic forms would be more complex and require something similar to the Unicode Presentation Forms.
Here are a couple of pointers for more information on the topic:
http://unicode.org/faq/ligature_digraph.html#Pf1
http://www.decotype.com/publications/unicode-tutorial.pdf
Please see http://www.fileformat.info/info/unicode/block/arabic_presentation_forms_b/list.htm and have a look at this repo: https://github.com/Accorpa/Arabic-Converter-From-and-To-Arabic-Presentation-Forms-B

C++ screen partition

I want to partition the output screen into two parts (just like frames do it in HTML). So that one part may remain fixed and display some content which is updated based on input received from the other part.
I do not wish to venture into GUI stuff therefore OpenGL, SDL etc are ruled out (I wish to do it in command line mode). I have Borland C++ with graphics.h support, but it is just too old to carry on.
What alternatives do I have at my disposal? (If not C++, a solution in C will also be OK.)
You may want to take a look at curses-like libraries like PDCurses.
Other than that, you may use ANSI terminal escape sequences to control the cursor in a text window. This may be quicker if what you are doing is simple; otherwise use PDCurses and it will handle the escape sequences for you.
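As a rough sketch of the escape-sequence route: the sequence ESC [ row ; col H moves the cursor and ESC [ 2 K erases the current line, so a fixed pane can be redrawn without disturbing the input line. This assumes a terminal that understands ANSI/VT100 sequences; moveCursor, kClearLine and drawPanes are hypothetical names.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Build the "cursor position" sequence; rows and columns are 1-based.
std::string moveCursor(int row, int col) {
    assert(row >= 1 && col >= 1);
    return "\x1b[" + std::to_string(row) + ";" + std::to_string(col) + "H";
}

const std::string kClearScreen = "\x1b[2J";  // erase the whole screen
const std::string kClearLine = "\x1b[2K";    // erase the line the cursor is on

// Redraw the fixed status pane on row 1, then park the cursor on the
// input row with a prompt, so typing never overwrites the status line.
void drawPanes(std::ostream& out, const std::string& status, int inputRow) {
    out << moveCursor(1, 1) << kClearLine << status
        << moveCursor(inputRow, 1) << kClearLine << "> " << std::flush;
}
```

A program would typically write kClearScreen once at startup and then call drawPanes(std::cout, ...) after each line of input; anything beyond two simple panes is where a curses library starts to pay off.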
Check out Curses / NCurses.

How can I highlight different types of file in dired mode in Emacs?

In a nutshell, I want to have different faces for some types of file in dired mode. I don't think it matters, but I am using Aquamacs.
The example I will use here is .tex files. If I can do it for .tex, then I can just apply the same structure to create other faces for other types of files.
From what I understand, I have to create a variable, write a regular expression, then apply a hook. I read a bit about regex and so far I have
^(.+)\.tex$
I think my structure and regular expression are not really correct. I am not a programmer (though I have an interest in it), and I have only been using Emacs for 2 weeks or so, so any help would be greatly appreciated.
What I need is at least the basic structure of what I have to do. I understand there may be modes already created that do something similar (such as maybe Wdired and Dired-X), and I would not complain if someone told me about them, but what I really want is elisp code (either already written or that I can work on), as I plan on learning a bit of elisp to be able to write my own customisations and this would be a way to learn.
Thank you!
Since you want to learn how to do it, try checking out the extension dired+.el. This mode does a lot more than what you want, but it does add new faces. Specifically, look for the variable diredp-font-lock-keywords-1 and how it is used. That should get you going.
Other SO questions that seem relevant are:
Match regular expression as keyword in define-generic-mode
Highlighting correctly in an emacs major mode
A hello world example for a major mode in emacs?