Latin-based Japanese input - C++

I'm working on a Japanese input system for a device with a 10-digit numeric keypad.
That's why I'm looking for a reasonably simple input method that can be implemented in C++.
I've already implemented the Pinyin method for Chinese input: each Chinese character can be reached by some combination of Latin letters and then chosen from a list. For example, if the user types "ca" I show the list of Chinese characters "擦拆礤嚓".
Is there something similar for Japanese?
I have found a table of Japanese transliterations: あa でde がga じji まm のno たta わwa 阿ae ... 芭bac 八bap 捌bat 覇bax 冊cek 測ces 策cez 癌aib. I am considering basing the Japanese input on this information: for example, the user types "bac" and the device shows the possible replacement "芭".
Is there a commonly used input method that is based on Latin-letter input?
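For reference, the kind of lookup I have in mind is roughly this (a simplified sketch; the table entries are just the ones quoted above, and in practice the table would be much larger):

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    int main() {
        // Candidate table in the spirit of the Pinyin implementation: a Latin
        // key maps to the characters it can produce, and the user picks one.
        std::map<std::string, std::vector<std::string>> candidates = {
            {"de",  {"で"}},
            {"ta",  {"た"}},
            {"bac", {"芭"}},
            {"bat", {"捌"}},
        };

        std::string typed = "bac";              // keys entered so far
        auto it = candidates.find(typed);
        if (it != candidates.end()) {
            for (const auto& option : it->second)
                std::cout << option << ' ';     // show the replacement options
            std::cout << '\n';
        }
        return 0;
    }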

Related

I am building a program for Urdu language analysis, so how can I make my program accept a text file in Urdu in C++?

I am building a language analysis program. I have a program which counts the words in a text and gives the ratio of every word in the text as output, but this program cannot work on a file containing Urdu text. How can I make it work?
Encoding
Urdu may be presented in two¹ forms: Unicode and Code Page 868. This is convenient for you because the two ranges do not overlap. It is inconvenient because the Unicode range is U+0600 – U+06FF, which means encoding is an issue:
CP-868 will encode each one as a single-byte value in the range 128–252
UTF-8 will encode each one as a two-byte sequence with bits 110x xxxx and 10xx xxxx
UTF-16 encodes every character as two-byte entities
UTF-32 encodes every character as four-byte entities
This means that you should be aware of encoding issues, and for an easy life, use UTF-16 internally (std::u16string), and accept files as (default) UTF-8 / CP-868, or as UTF-16/32 if there is a BOM indicating such.
Your other option is to simply require all input to be UTF-8 / CP-868.
¹ AFAIK. There may be other ways of storing Urdu text. (Update: three forms; see comments below.)
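A minimal BOM sniff along those lines might look like this (a sketch only; the enum and function names are illustrative, and the caller still needs to rewind past any BOM that was found):

    #include <fstream>
    #include <istream>

    enum class Encoding { Utf8OrCp868, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

    // Peek at the first bytes and decide how to decode the rest of the file,
    // defaulting to UTF-8 / CP-868 when no BOM is present.
    Encoding detect(std::istream& in) {
        unsigned char b[4] = {0, 0, 0, 0};
        in.read(reinterpret_cast<char*>(b), 4);
        std::streamsize n = in.gcount();
        if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return Encoding::Utf32LE;
        if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return Encoding::Utf32BE;
        if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return Encoding::Utf16LE;
        if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return Encoding::Utf16BE;
        // A UTF-8 BOM (EF BB BF) can also be skipped here if one is present.
        return Encoding::Utf8OrCp868;
    }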
Word separation
As you know, the end of a word is generally marked with a special letter form.
So all you need is a table of end-of-word letters, covering both the CP-868 range and the Unicode Arabic range.
Then, every time you find a space or a letter in that table you know you have found the end of a word.
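Sketched in code (the set contents are left for you to fill in from both encodings):

    #include <set>

    // A space or any letter in the word-final table marks the end of a word.
    bool ends_word(char16_t c) {
        static const std::set<char16_t> word_final = {
            // fill in the word-final letters: the CP-868 values (mapped into
            // your internal UTF-16) and the Unicode Arabic-block ones
        };
        return c == u' ' || word_final.count(c) != 0;
    }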
Histogram
As you read words, store them in a histogram. For C++ a map <u16string, size_t> will do. The actual content of each word does not matter.
After that you have all the information necessary to print stats about the text.
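In code, the counting step is simply (a sketch):

    #include <map>
    #include <string>

    std::map<std::u16string, std::size_t> histogram;

    // Called once per word found; the first sighting starts the count at zero.
    void count_word(const std::u16string& word) {
        ++histogram[word];
    }

    // Afterwards: ratio of word w = histogram[w] / double(total_words).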
Edit
The approach presented above is designed to be simple at the cost of some correctness. If you are doing something for the workplace, for example, and assuming it matters, you should also consider:
Normalizing word forms
For example, the same word may be presented in standard Arabic text codes or using the Urdu-specific codes. If you do not convert to the Urdu equivalent characters then you will have two words that should compare equal but do not.
Use something internally consistent. I recommend UZT, as it is the most complete Urdu text representation. You will also need an additional lookup for the original text representation from the UZT representation.
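If a full UZT round-trip is more than you need, even a small Unicode-level folding helps. A sketch, with two common Urdu-preferred substitutions shown purely as illustration:

    #include <string>
    #include <unordered_map>

    // Fold generic Arabic code points onto the Urdu-preferred letters so that
    // variant spellings of the same word compare equal. Extend as needed.
    std::u16string normalize(std::u16string word) {
        static const std::unordered_map<char16_t, char16_t> fold = {
            {u'\u064A', u'\u06CC'},   // ARABIC LETTER YEH -> FARSI YEH
            {u'\u0643', u'\u06A9'},   // ARABIC LETTER KAF -> KEHEH
        };
        for (char16_t& c : word) {
            auto it = fold.find(c);
            if (it != fold.end()) c = it->second;
        }
        return word;
    }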
Dictionaries
Get as complete a dictionary of Urdu words (as an unordered_set<u16string>) as you can.
This is how it is done with languages like Japanese, for example, to find breaks between words.
Then use the dictionary to find all the words you can, and fall back on letterform recognition and/or spaces for what remains.
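A greedy longest-match pass is the simplest version of that (a sketch; real segmenters use smarter scoring, and max_word_len is an arbitrary cap):

    #include <algorithm>
    #include <string>
    #include <unordered_set>
    #include <vector>

    // Take the longest dictionary word starting at each position; fall back
    // to a single letter (handled later by letterform/space rules) otherwise.
    std::vector<std::u16string> segment(const std::u16string& text,
                                        const std::unordered_set<std::u16string>& dict,
                                        std::size_t max_word_len = 8) {
        std::vector<std::u16string> words;
        std::size_t i = 0;
        while (i < text.size()) {
            std::size_t best = 1;
            for (std::size_t len = std::min(max_word_len, text.size() - i); len > 1; --len) {
                if (dict.count(text.substr(i, len)) != 0) { best = len; break; }
            }
            words.push_back(text.substr(i, best));
            i += best;
        }
        return words;
    }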

Reverse of an Arabic string in C++

How can an Arabic string be reversed using C++? For instance, the reverse of كلمة is ةملك. The shape of an Arabic letter differs according to its position in the word (initial, medial, or final). Are there other rules for concatenating Arabic letters?
As Petesh says, and according to the references I can find such as Wikipedia, the rendering engine should take care of using the appropriate glyphs for you. Quoting the article:
For example, many Arabic letters are represented by a different glyph when the letter appears at the end of a word than when the letter appears at the beginning of a word. Unicode's approach prefers to have these letters mapped to the same character for ease of internal machine text processing and storage. To complement this approach, the text software must select different glyph variants for display of the character based on its context
A quick experiment with an online Unicode converter seems to confirm that:
كلمة
in hex code points is:
0643 0644 0645 0629
while:
ةملك
is:
0629 0645 0644 0643
which is the exact reverse of the previous code points.
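So reversing by code point is enough for a case like this. A sketch in C++, going through UTF-32 (std::wstring_convert is deprecated since C++17 but is still the shortest standard-only route; combining marks would need extra handling):

    #include <algorithm>
    #include <codecvt>
    #include <locale>
    #include <string>

    // Convert UTF-8 to UTF-32, reverse the code points, convert back.
    std::string reverse_utf8(const std::string& utf8) {
        std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
        std::u32string cps = conv.from_bytes(utf8);
        std::reverse(cps.begin(), cps.end());
        return conv.to_bytes(cps);
    }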

Tokenizing a Japanese string and converting to hiragana

I am using the string tokenizer and transform APIs to convert kanji characters to hiragana.
The code in the linked question (What is the replacement for Language Analysis framework's Morpheme analysis deprecated APIs) converts most kanji characters to hiragana, but these APIs fail to convert kanji words of 3-4 characters.
For example:
a) 現人神 is converted to Latin 'gen ren shen' and to hiragana 'げんじんしん',
whereas it should be Latin 'Arahitogami' and hiragana 'あらひとがみ'.
b) 安本丹 is converted to Latin 'an ben dan' and to hiragana 'やすもとまこと',
whereas it should be Latin 'Yasumoto makoto' and hiragana 'あんぽんたん'.
My main purpose is to obtain the ruby text for given Japanese text. I can't use the Language Analysis framework as it's unavailable in 64-bit.
Any suggestions? Are there other APIs to perform such string conversion?
So in both cases your API uses onyomi but shouldn't. I assume it just guesses: "3 or more characters? Onyomi is more appropriate in most cases, so use it." It sounds like your problem needs an actual dictionary, which you can download.
Names (for b)) will still be a problem, though. I don't see how a computer could get the correct name from kanji, as even native Japanese speakers sometimes fail at it. jisho.org doesn't even find a single name for 安本丹.
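If you go the dictionary route, the lookup itself is simple; the two entries below are just the words from your question, and a real table would be built from a downloadable dictionary (JMdict/KANJIDIC-style data, for example):

    #include <string>
    #include <unordered_map>

    // Look the whole word up first; fall back to per-character conversion
    // only when the word is not in the dictionary.
    std::u16string reading_for(const std::u16string& word) {
        static const std::unordered_map<std::u16string, std::u16string> dict = {
            {u"現人神", u"あらひとがみ"},
            {u"安本丹", u"あんぽんたん"},
        };
        auto it = dict.find(word);
        return it != dict.end() ? it->second : std::u16string();
    }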
(By the way, you mixed up the hiragana in b) and the Latin for 'あんぽんたん'. I can't write comments yet with my rep, so I'm leaving this here.)

hyphen character and apostrophe character - the same ASCII code in different languages?

I need to specify a regex for validation of user input that allows the user to enter a hyphen character or apostrophe character on Windows or Mac OS X desktop operating systems.
The user may have configured for the following languages:
English
French
Spanish
Portuguese
Hawaiian
I want to understand whether, if I use a standard ASCII regex for hyphen and apostrophe (e.g. ['-]), that will catch the hyphen or apostrophe keys typed by the user in most cases. I appreciate my definition is quite loose, as there are many different keyboard layouts, OS versions, and language definitions (e.g. fr_FR, ca_FR).
I have checked the following resources and generally searched on Google, but could not find anything definitive saying that the code generated by a hyphen key or apostrophe key will always be ASCII 45 and ASCII 39 respectively.
http://en.wikipedia.org/wiki/Keyboard_layout
http://en.wikipedia.org/wiki/Hyphen
http://en.wikipedia.org/wiki/Apostrophe
NOTE: If you feel this question is badly worded, please add a comment to help me improve it.
You're mixing up a couple of things:
keyboard layout is what determines what value gets assigned to a scancode.
localization settings determine in what language you should address the user, and whether the user expects a decimal point or comma.
character encoding is how a glyph is encoded into bits in memory and, in reverse, how bits are decoded into glyphs.
If you're validating user input, you shouldn't be interested in scancodes. A DVORAK layout user on a QWERTY keyboard will be pressing the Q key to input an '. And you shouldn't mess with that. So you have no business dealing with keyboard layouts.
The existence of this keyboard should remind you that what keys do is not your headache, but up to the user.
The localization settings will matter to you, but not for your regex. They will, however, tell you in what language you should put your error message, in case the user input is invalid. A good coding practice is to use a library like gettext to manage this.
What matters most when you are validating input is just those two things: what is valid, and what is the input.
You (or your domain expert) decide what is valid: whether a hyphen-minus is just as acceptable as a hyphen or an en dash.
The input will be encoded; computers work with bits, not strings of glyphs. It could be ASCII, but I'd steer towards Unicode if I could help it.
As for your real concern, if I may rephrase it: "Can all users easily enter ' and -?" I guess they probably can. Many important programming languages use those glyphs to denote strings and as a subtraction operator, respectively. And if your application needs to (dis)allow certain glyphs, you can put Unicode code points or categories in your regex.
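For instance, something like the following widens your ['-] class with a few typographic variants that smart-quote or autocorrect features may produce (a sketch; the function name and the chosen code points are illustrative, and the \uXXXX escapes are resolved by the compiler, so the character class holds literal code points):

    #include <regex>
    #include <string>

    // True if the input contains an apostrophe-like or hyphen-like character:
    // ASCII ' and -, right single quote U+2019, hyphen U+2010, en dash U+2013.
    bool contains_apostrophe_or_hyphen(const std::wstring& input) {
        static const std::wregex cls(L"['\u2019\u2010\u2013-]");
        return std::regex_search(input, cls);
    }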

How do I remove words from multilingual text?

I have two versions of the same document (D, say) containing multilingual text (English and others):
I. One is encoded in ASCII with Unicode code-points represented as character entity references (i.e. Unicode characters are of the form &#N, where N is the decimal equivalent of the Unicode hex value)
II. The other is UTF-8 encoding.
Q 1:
I have a separate list of words (encoded in UTF-8, and in more than one language), that I have to remove from the document D. How should I proceed?
Can I use regex to clean D? For doc type I, I believe I have to specify the whole &#N patterns for each word in the list when I form the regex.
Should the task be easier for doc type II, now that I can specify the non-English characters directly in the regex (my emacs is configured to use these non-English fonts) ?
Q 2:
I have a huge collection of such documents D. What would be the best algorithm to remove words from each of these documents? A table look-up is straightforward but probably the slowest. Should I regex through each?
I suggest processing the entities first so that the two sorts of files look the same. When you’re done removing, put the first set back into their encoded form.
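A sketch of that first step, assuming the references use the usual &#N; form with a terminating semicolon: the decoder turns them into UTF-8 so both document types can be cleaned by the same word-removal pass, and the inverse mapping re-encodes anything non-ASCII afterwards.

    #include <regex>
    #include <string>

    // Decode decimal character references into UTF-8.
    std::string decode_entities(const std::string& in) {
        static const std::regex ent(R"(&#(\d+);)");
        std::string out;
        std::size_t last = 0;
        for (std::sregex_iterator it(in.begin(), in.end(), ent), end; it != end; ++it) {
            out.append(in, last, it->position() - last);
            unsigned long cp = std::stoul((*it)[1].str());
            if (cp < 0x80) {                        // 1-byte sequence
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {                // 2-byte sequence
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {              // 3-byte sequence
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {                                // 4-byte sequence
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
            last = static_cast<std::size_t>(it->position() + it->length());
        }
        out.append(in, last, std::string::npos);
        return out;
    }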