I need to know the direction of my text before printing.
I'm using Unicode Characters.
How can I do that in C++?
If you don't want to use ICU, you can always manually parse the unicode database (.e.g., with a python script). It's a semicolon-separated text file, with each line representing a character code point. Look for the fifth record in each line - that's the character class. If it's R or AL, you have an RTL character, and 'L' is an LTR character. Other classes are weak or neutral types (like numerals), which I guess you'd want to ignore. Using that info, you can generate a lookup table of all RTL characters and then use it in your C++ code. If you really care about code size, you can minimize the size the lookup table takes in your code by using ranges (instead of an entry for each character), since most characters come in blocks of their BiDi class.
Now, define a function called GetCharDirection(wchar_t ch) which returns an enum value (say: Dir_LTR, Dir_RTL or Dir_Neutral) by checking the lookup table.
Now you can define a function GetStringDirection(const wchar_t*) which runs through all characters in the string until it encounters a character which is not Dir_Neutral. This first non-neutral character in the string should set the base direction for that string. Or at least that's how ICU seems to work.
You could use the ICU library, which has a functions for that (ubidi_getDirection ubidi_getBaseDirection).
The size of ICU can be reduced, by recompiling the data library (which is normally about 15MB big), to include only the converters/locals which are needed for the project.
The section Reducing the Size of ICU's Data: Conversion Tables of the site http://userguide.icu-project.org/icudata, contains information how you can reduce the size of the data library.
If only need support for the most common encodings (US-ASCII, ISO-8859-1, UTF-7/8/16/32, SCSU, BOCU-1, CESU-8), the data library wont be needed anyway.
From Boaz Yaniv said before, maybe something like this will easier and faster than parsing the whole file:
int aft_isrtl(int c){
if (
(c==0x05BE)||(c==0x05C0)||(c==0x05C3)||(c==0x05C6)||
((c>=0x05D0)&&(c<=0x05F4))||
(c==0x0608)||(c==0x060B)||(c==0x060D)||
((c>=0x061B)&&(c<=0x064A))||
((c>=0x066D)&&(c<=0x066F))||
((c>=0x0671)&&(c<=0x06D5))||
((c>=0x06E5)&&(c<=0x06E6))||
((c>=0x06EE)&&(c<=0x06EF))||
((c>=0x06FA)&&(c<=0x0710))||
((c>=0x0712)&&(c<=0x072F))||
((c>=0x074D)&&(c<=0x07A5))||
((c>=0x07B1)&&(c<=0x07EA))||
((c>=0x07F4)&&(c<=0x07F5))||
((c>=0x07FA)&&(c<=0x0815))||
(c==0x081A)||(c==0x0824)||(c==0x0828)||
((c>=0x0830)&&(c<=0x0858))||
((c>=0x085E)&&(c<=0x08AC))||
(c==0x200F)||(c==0xFB1D)||
((c>=0xFB1F)&&(c<=0xFB28))||
((c>=0xFB2A)&&(c<=0xFD3D))||
((c>=0xFD50)&&(c<=0xFDFC))||
((c>=0xFE70)&&(c<=0xFEFC))||
((c>=0x10800)&&(c<=0x1091B))||
((c>=0x10920)&&(c<=0x10A00))||
((c>=0x10A10)&&(c<=0x10A33))||
((c>=0x10A40)&&(c<=0x10B35))||
((c>=0x10B40)&&(c<=0x10C48))||
((c>=0x1EE00)&&(c<=0x1EEBB))
) return 1;
return 0;
}
If you are using Windows GDI, it would seem that GetFontLanguageInfo(HDC) returns a DWORD; if GCP_REORDER is set, the language requires reordering for display, for example, Hebrew or Arabic.
Related
I am building a language analysis program I have a program which counts the words in text and give the ratio of every word in text as a output, but this program can not work on file containing Urdu text. how can I make it work
Encoding
Urdu may be presented in two¹ forms: Unicode and Code Page 868. This is convenient to you because the two ranges do not overlap. It is inconvenient because the Unicode code range is U+0600 – U+06FF, which means encoding is an issue:
CP-868 will encode each one as a single-byte value in the range 128–252
UTF-8 will encode each one as a two-byte sequence with bits 110x xxxx and 10xx xxxx
UTF-16 encodes every character as two-byte entities
UTF-32 encodes every character as four-byte entities
This means that you should be aware of encoding issues, and for an easy life, use UTF-16 internally (std::u16string), and accept files as (default) UTF-8 / CP-868, or as UTF-16/32 if there is a BOM indicating such.
Your other option is to simply require all input to be UTF-8 / CP-868.
¹ AFAIK. There may be other ways of storing Urdu text.
Three forms. See comments below.
Word separation
As you know, the end of a word is generally marked with a special letter form.
So, all you need is a table of end-of-word letters listing letters in both the CP-868 range and the Unicode Arabic text range.
Then, every time you find a space or a letter in that table you know you have found the end of a word.
Histogram
As you read words, store them in a histogram. For C++ a map <u16string, size_t> will do. The actual content of each word does not matter.
After that you have all the information necessary to print stats about the text.
Edit
The approach presented above is designed to be simple at the cost of some correctness. If you are doing something for the workplace, for example, and assuming it matters, you should also consider:
Normalizing word forms
For example, the same word may be presented in standard Arabic text codes or using the Urdu-specific codes. If you do not convert to the Urdu equivalent characters then you will have two words that should compare equal but do not.
Use something internally consistent. I recommend UZT, as it is the most complete Urdu text representation. You will also need an additional lookup for the original text representation from the UZT representation.
Dictionaries
As complete a dictionary (as an unordered_set <u16string>) of words in Urdu as you can get.
This is how it is done with languages like Japanese, for example, to find breaks between words.
Then use the dictionary to find all the words you can, and fall back on letterform recognition and/or spaces for what remains.
I want to detect if the user has inputted a non-ASCII (otherwise incorrectly known as Unicode) character (for example, り) in a file save dialog box. As I am using Qt, any non-ASCII characters are properly saved in a QString, but I can't figure out how to determine if any of the characters in that string are non-ASCII before converting the string to ASCII. That character above ends up getting written to the filesystem as ã‚Š.
There is no such a built-in feature in my understanding.
About 1-2 years ago, I was proposing an isAscii() method for QString/QChar to wrap the low-level Unix isacii() and the corresponding Windows function, but it was rejected. You could have written then something like this:
bool isUnicode = !myString.at(3).isAcii();
I still think this would be a handy feature if you can convince the maintainer. :-)
Other than that, you would need to check against the ascii boundary yourself, I am afraid. You can do this yourself as follows:
bool isUnicode = myChar.unicode() > 127;
See the documentation for details:
ushort QChar::unicode () const
This is an overloaded function.
The simplest way is to check every charachter's code (QChar::unicode()) to be below 128 if you need pure 7-bit ASCII.
To write it in compact way without loop, you can use regular expression:
bool containsNonASCII = myString.contains(QRegularExpression(QStringLiteral("[^\\x{0000}-\\x{007F}]")));
this works for me :
isLetterOrNumber()
ot_id += QChar((short) b.to_ulong()).isLetterOrNumber() ? QChar((short) b.to_ulong()) : QString("");
All the ASCII codes greater than 127 are replaced by Diamond? symbol. How can I display those characters. I have an unsigned char buffer[1024] which contains values from 0 to 256.
Use the QString class's fromAscii() method. By default this will treat Ascii chars above 128 as Latin-1 chars. To change this behavior use QTextCodec::setCodecForCStrings method to set the correct codec for your usage.
I believe QT5 may have taken out the setCodecForCStrings method.
EDIT: Adnan supplied the QT5 alternative to setCodecForCStrings method, adding to answer for completeness.
Qt5 alternative for setCodecForCStrings is QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
This is a rabbit hole with no end. Qt does not fully support printing ascii > 127 as it is not well defined. The current method is to use "fromLocal8bit()" which will take a char array and transform it into the "right" Unicode string (the only thing Qt supports printing).
QTextCodec::setCodecForLocale can be used to identify the character set you wish to transform from. Many codecs are supported, but for some reason IBM437 (the character set used by IBM PCs in the US for decades) is not supported, where several other codecs used by Europe, etc. are. Probably some characters in IBM437 were never assigned proper code points in Unicode, so transforming it isn't possible?
What's frustrating is that there are fonts with all 256 ascii code points, but it is simply not possible to display these in Qt as they only work with Unicode strings. There are a handful of glyphs they don't support, and it seems to grow with newer versions of Qt. Currently I know of 9, 10, 12, 13, and 173. Some of these are for obvious reasons (usually you don't want to print a carriage return glyph, though it did exist in DOS), but others used to work in Qt and now do not.
In my application, I resorted to creating a new font that has copies of the unprintable glyphs in higher unicode codepoints, and translate them before printing them on the screen. It's quite silly but Qt gave up on ascii many years ago, so it's the best option I could find.
I need to convert a bunch of bytes in ISO-2022-JP and ISO-2022-JP-2 (and other variations of ISO-2022) into Unicode. I am trying to use ICU (link text), but the following code doesn't work.
std::string input = "\x1B\x28\x4A" "ABC\xA6\xA7"; //the first 3 chars are escape sequence to use JIS_X201 character set in GL/GR
UErrorCode status = U_ZERO_ERROR;
UConverter *conv;
// set up the converter
conv = ucnv_open("ISO-2022-JP", &status);
if (status != U_ZERO_ERROR) return false; //couldn't find character set
UChar * convDest = new UChar[2*input.length()]; //ucnv_toUChars will use up to 2*length
// convert to Unicode
int resultLen = (int)ucnv_toUChars(conv, convDest, 2*input.length(), input.c_str(), input.length(), &status);
This doesn't work. The result contains '?' charcters for anything I put in that was above ASCII. The status has no error. What am I doing wrong?
On top of that I was having trouble compiling the library ver 4.4 as the MSVC 9 project would not convert to MSVC 10 project.
I am also aware of libiconv open source library. I couldn't compile that one on windows. If anyone has any advice on a different library, that's also welcome.
Thanks.
EDIT
The escape sequence I originally used was wrong. So now ICU takes the string, strips out the escape sequence - which is a step in the right direction. But the result still contains '?' chars.
EDIT2 The reason I couldn't convert to MSVC 10 project was because x64 platform wasn't installed (it isn't by default). Alternatively I could open all the projects in text editor and remove all mention of x64 target.
This doesn't resemble an ISO 2022 encoding. The high bits are supposed to be zero. The escape sequence looks somewhat recognizable, but it starts with ESC. 0x1b, not 0xb0. No idea what those byte values really mean.
(This question looks familiar, Hi again.)
A minor, minor nit: You want to check the error status with if(U_FAILURE(status)) (or conversely, U_SUCCESS(status)).
I couldn't get the conversion to work for JIS_X201 character set in ISO-2022-JP encoding. And I couldn't generate a "valid" one using any tools at my disposal - tried Java (ICU and non ICU implementation of ISO2022) and C++.
So I basically just wrote a function to do a code lookup and convert to Unicode using this table: wikipedia.
EDIT
As I started filling out the bug report I wanted to include the RFC for ISO-2022-JP. Then I found this line in the RFC "The Kana set of JIS X 0201 is not used in ISO-2022-JP messages." link text. So it appears that the standard doesn't actually define the upper bits. The ISO-2022-JP-3 WILL map the upper bits, but to lower plane. So I have to take each byte and subtract 0x80 from it, and pass it through ISO-2022-JP-3, and take the other bytes < 128 and pass them through ISO-2022-JP converter for full JIS_X201 character set. Well it's a lot easier to just do it myself.
So strictly speaking I would say it's not a bug. It's a huge headache though.
P.S. the whole messed up stream that I'm trying to decode comes from DICOM. See pdf page 107 to see what they consider acceptable.
I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:
preg_match('/\m(.{1})ori/i',$page_title)
Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/ preg_replace see characters like "ā" and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for utf-8 mode in regexes,
You're better of on a whole with doing an iconv('utf-8','ascii//TRANSLIT',$string) on both name & search and comparing those.
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's the byte string option, you're going to have to do double dots there to catch it instead, or {1,4}. And even then you're going to have to verify the up to four bytes you grab between the M and the o form a singular valid UTF-8 character. This is all moot if PHP does unicode right, I haven't used it in years so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways; one as a single character (U+0101) and one as TWO unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely just only going to ever get the former, but be aware that the latter is also possible.
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds on insane modifiers for internationalized text in regexps.