I am somewhat new to Unicode and Unicode strings. I'm trying to determine the difference between a "fullwidth" symbol and a normal one.
Take these two for example:
Normal: http://www.fileformat.info/info/unicode/char/20a9/index.htm
Fullwidth: http://www.fileformat.info/info/unicode/char/ffe6/index.htm
I notice that the fullwidth one is described in terms of U+20A9, and coincidentally 20A9 is the normal one. So what is the value of U?
When using libraries like ICU, is there a way to specify that the normal form should always be returned instead of the fullwidth one?
Thanks,
U+number is a notational convention for a Unicode code point. There is no 'value' of U.
U+0020, for example, is a space. The value in memory is 32 decimal, 20 hex.
Full width characters are a whole other story.
Back in the days of the 3270, Hanzi took up two positions in display memory, so they also took up two columns on the screen. To make things line up neatly, IBM defined a set of 'full-width' (better would have been 'double-width') letters and numbers.
If some ICU API is delivering full-width, you can use the Normalizer to get rid of it. You might also post a ticket to their ticket system, this seems odd.
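For example, NFKC (compatibility) normalization folds fullwidth forms back to their ordinary code points. A minimal sketch using ICU's Normalizer2 (available since ICU 49); not the only way to do it:

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>

int main() {
    UErrorCode status = U_ZERO_ERROR;

    // NFKC folds compatibility variants, e.g. U+FFE6 FULLWIDTH WON SIGN -> U+20A9 WON SIGN.
    const icu::Normalizer2* nfkc = icu::Normalizer2::getNFKCInstance(status);
    if (U_FAILURE(status)) return 1;

    icu::UnicodeString input((UChar32)0xFFE6);             // fullwidth won sign
    icu::UnicodeString output = nfkc->normalize(input, status);

    std::cout << std::hex << output.char32At(0) << "\n";   // prints 20a9
    return 0;
}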
The 'U' in "U+20A9" just denotes that "20A9" is a Unicode code point, in this case the code point of the won sign in the Unicode codespace. It's a notation used in the Unicode Standard. The "U+" shall be followed by a hexadecimal number, using at least 4 digits, such as "U+1234" or "U+10FFFD".
U+20A9 (₩) is the WON SIGN
U+FFE6 (₩) is the FULLWIDTH WON SIGN
This is a legacy of older character encodings. The "width" affected layout. The Unicode spec says:
Compatibility variants are a subset of compatibility characters, and have the further characteristic that they represent variants of existing, ordinary, Unicode characters. For example, compatibility variants might represent various presentation or styled forms of basic letters: superscript or subscript forms, variant glyph shapes, or vertical presentation forms. They also include halfwidth or fullwidth characters from East Asian character encoding standards, Arabic contextual form glyphs from pre-existing Arabic code pages, Arabic ligatures and ligatures from other scripts, and so on. Compatibility variants also include CJK compatibility ideographs, many of which are minor glyph variants of an encoded unified CJK ideograph.
Including these forms in Unicode allows the conversion of text from (and to) the older encodings without loss of meaning.
References:
General Structure
Southeast Asian Scripts
Annex #11: East Asian Width
Related
I am building a language analysis program. I have a program which counts the words in a text and gives the ratio of every word in the text as output, but this program cannot work on a file containing Urdu text. How can I make it work?
Encoding
Urdu may be presented in two¹ forms: Unicode and Code Page 868. This is convenient to you because the two ranges do not overlap. It is inconvenient because the Unicode code range is U+0600 – U+06FF, which means encoding is an issue:
CP-868 will encode each one as a single-byte value in the range 128–252
UTF-8 will encode each one as a two-byte sequence with the bit patterns 110x xxxx and 10xx xxxx (see the decode sketch below)
UTF-16 encodes each of these characters as a single two-byte code unit
UTF-32 encodes every character as a four-byte unit
This means that you should be aware of encoding issues, and for an easy life, use UTF-16 internally (std::u16string), and accept files as (default) UTF-8 / CP-868, or as UTF-16/32 if there is a BOM indicating such.
Your other option is to simply require all input to be UTF-8 / CP-868.
¹ AFAIK. There may be other ways of storing Urdu text.
Three forms. See comments below.
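To make the two-byte UTF-8 case above concrete, here is a minimal decode sketch (my own illustration, not part of the original answer); it assumes well-formed input and only handles ASCII plus two-byte sequences, which is enough for the Arabic block:

#include <cstdint>
#include <string>

// Decode ASCII and two-byte UTF-8 sequences (enough for U+0000..U+07FF,
// which includes the Arabic block U+0600..U+06FF) into UTF-16 code units.
std::u16string decode_basic_utf8(const std::string& bytes) {
    std::u16string out;
    for (std::size_t i = 0; i < bytes.size(); ) {
        std::uint8_t b0 = static_cast<std::uint8_t>(bytes[i]);
        if (b0 < 0x80) {                                   // 0xxx xxxx: ASCII
            out.push_back(static_cast<char16_t>(b0));
            i += 1;
        } else if ((b0 & 0xE0) == 0xC0 && i + 1 < bytes.size()) {
            std::uint8_t b1 = static_cast<std::uint8_t>(bytes[i + 1]);
            // 110x xxxx 10xx xxxx: 5 + 6 payload bits
            out.push_back(static_cast<char16_t>(((b0 & 0x1F) << 6) | (b1 & 0x3F)));
            i += 2;
        } else {
            out.push_back(u'\uFFFD');                      // anything else: replacement character
            i += 1;
        }
    }
    return out;
}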
Word separation
As you know, the end of a word is generally marked with a special letter form.
So, all you need is a table of end-of-word letters listing letters in both the CP-868 range and the Unicode Arabic text range.
Then, every time you find a space or a letter in that table you know you have found the end of a word.
Histogram
As you read words, store them in a histogram. For C++ a map <u16string, size_t> will do. The actual content of each word does not matter.
After that you have all the information necessary to print stats about the text.
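A minimal sketch of the counting step (my own illustration; it splits on whitespace only, whereas a real version would also split on the end-of-word letter forms from the table above):

#include <cstddef>
#include <iostream>
#include <map>
#include <string>

// Build a word histogram and print each word's count and ratio of the total.
void print_word_ratios(const std::u16string& text) {
    std::map<std::u16string, std::size_t> histogram;
    std::size_t total = 0;

    std::u16string word;
    auto flush = [&]() {
        if (!word.empty()) { ++histogram[word]; ++total; word.clear(); }
    };
    for (char16_t ch : text) {
        if (ch == u' ' || ch == u'\n' || ch == u'\t') flush();
        else word += ch;
    }
    flush();

    for (const auto& entry : histogram) {
        // Converting the u16string key back to the console encoding is
        // platform-specific, so only the numbers are printed here.
        std::cout << "count: " << entry.second
                  << "  ratio: " << double(entry.second) / double(total) << "\n";
    }
}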
Edit
The approach presented above is designed to be simple at the cost of some correctness. If you are doing something for the workplace, for example, and assuming it matters, you should also consider:
Normalizing word forms
For example, the same word may be presented in standard Arabic text codes or using the Urdu-specific codes. If you do not convert to the Urdu equivalent characters then you will have two words that should compare equal but do not.
Use something internally consistent. I recommend UZT, as it is the most complete Urdu text representation. You will also need an additional lookup for the original text representation from the UZT representation.
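If you stay with Unicode internally instead of UZT, a fold could be as simple as a lookup table from generic Arabic letters to their Urdu-preferred counterparts. A sketch of my own (the two pairs shown are common Arabic-vs-Urdu letter variants; you would need to complete the table yourself):

#include <string>
#include <unordered_map>

// Fold generic Arabic code points to the Urdu-preferred equivalents so that
// visually identical words compare equal. Extend the table as needed.
std::u16string fold_to_urdu_forms(std::u16string word) {
    static const std::unordered_map<char16_t, char16_t> fold = {
        { u'\u064A', u'\u06CC' },  // ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
        { u'\u0643', u'\u06A9' },  // ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    };
    for (char16_t& ch : word) {
        auto it = fold.find(ch);
        if (it != fold.end()) ch = it->second;
    }
    return word;
}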
Dictionaries
Get as complete a dictionary (as an unordered_set<u16string>) of Urdu words as you can.
This is how it is done with languages like Japanese, for example, to find breaks between words.
Then use the dictionary to find all the words you can, and fall back on letterform recognition and/or spaces for what remains.
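A sketch of what that could look like, assuming a greedy longest-match strategy (one common choice, not necessarily the best for Urdu); anything the dictionary cannot match is passed through a character at a time for the fallback to handle:

#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Greedy longest-match segmentation: at each position, take the longest
// dictionary word that matches; otherwise emit a single character and
// let the caller fall back on letterform/space-based splitting.
std::vector<std::u16string> segment(const std::u16string& text,
                                    const std::unordered_set<std::u16string>& dict,
                                    std::size_t max_word_len = 16) {
    std::vector<std::u16string> words;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t best = 0;
        std::size_t limit = std::min(max_word_len, text.size() - pos);
        for (std::size_t len = limit; len >= 1; --len) {
            if (dict.count(text.substr(pos, len))) { best = len; break; }
        }
        if (best == 0) best = 1;                 // no match: advance one character
        words.push_back(text.substr(pos, best));
        pos += best;
    }
    return words;
}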
I have custom print functions I use to print numbers. I made an ASCII version and a UTF-16LE version. The UTF-16LE version uses the fullwidth codes/characters for 0-9 and A-F for hexadecimal. When debugging my functions I noticed the characters looked a little different in Visual Studio from the ASCII characters. While this didn't bother me, it got me thinking about it, so I did a quick Google search for "Unicode halfwidth vs fullwidth"
... and I found several pages that talk about the "Fullwidth" form referring to the visual width of the characters, while I thought "Fullwidth" referred to the width of the encoding (2 bytes or more)...
Here are a few pages and quotes from them:
https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms
ICU Unicode Normal vs Fullwidth
To make things line up neatly, IBM defined a set of 'full-width' (better would have been 'double-width') letters and numbers.
https://en.wikipedia.org/wiki/Half-width_kana
Half-width kana are katakana characters displayed at half their normal width (a 1:2 aspect ratio), instead of the usual square (1:1) aspect ratio. For example, the usual (full-width) form of the katakana ka is カ while the half-width form is ｶ.
It doesn't make sense to me that "Fullwidth" would refer to the visual width, when we have different Fonts for size and alignment.
Why does "Fullwidth" refer to the visual width? Where in the Unicode UTF-16 spec does it say this?
Would having the choice to output as Halfwidth or Fullwidth using flags be desirable?
Half-width kana, as you've found, is just a subset of the Halfwidth and Fullwidth Forms, and it's a property of the codepoint/glyph, not of the encoding. UTF-16 is just one of the encodings for Unicode.
The reason for the existence of those characters is that Unicode was designed for lossless back-and-forth conversion between legacy character sets. If you look closer at the Unicode blocks you'll see there are a lot of redundant characters like Ⅶ Ⅷ Ⅸ ㎆ ㎇ ㎎ ㎏ ㎐ Dz dz NJ.... They're all there purely for compatibility purposes, because they've been used in some character sets.
See also What issues lead people to use Japanese-specific encodings rather than Unicode?
As a Developer/Programmer, would having the choice to output as Halfwidth or Fullwidth using flags be desirable?
Personally I see no reason for using them except in some rare cases, like displaying characters on a square grid. What's worse is that those Japanese characters are often rendered without ClearType and antialiasing (in small sizes), so they're a pain in the eyes to read. If you're in Japan you'll notice some forms that require the use of halfwidth or fullwidth characters without automatic conversion, which is bad.
You found your own answers about the origin of fullwidth vs. halfwidth, so I won't get into that. Yes, the designation refers to the visual width of the characters. Sorry, but I don't have any official reference for that.
One of the goals of Unicode is to handle round-trip conversions from/to any legacy character set without loss. Since there are legacy character sets with fullwidth characters, they must also be part of Unicode or they would get converted incorrectly.
I find it hard to imagine a circumstance in modern code where you would want a choice between normal and fullwidth characters. It's really only for legacy support.
I need to specify a regex for validation of user input that allows the user to enter a hyphen character or apostrophe character on Windows Desktop operating systems or Mac OS/X desktop operating systems.
The user may have configured for the following languages:
English
French
Spanish
Portuguese
Hawaiian
I want to understand whether, if I use a standard ASCII regex for hyphen and apostrophe (e.g. ['-]), that will catch the hyphen or apostrophe keys typed by the user in most cases. I appreciate my definition is quite loose, as there are many different keyboard layouts, OS versions, and language definitions (e.g. fr_FR, ca_FR).
I have checked the following resources and generally searched on Google, but could not find anything in particular saying that the character generated by a hyphen key or apostrophe key will always be ASCII code 45 or ASCII code 39 respectively.
http://en.wikipedia.org/wiki/Keyboard_layout
http://en.wikipedia.org/wiki/Hyphen
http://en.wikipedia.org/wiki/Apostrophe
NOTE: If you feel this question is badly worded, please add a comment to help me improve it.
You're mixing up a couple of things:
keyboard layout is what determines what value gets assigned to a scancode.
localization settings determine in what language you should address the user, and whether the user expects a decimal point or a comma.
character encoding is how a glyph is encoded into bits in memory and, in reverse, how those bits are decoded back into glyphs.
If you're validating user input, you shouldn't be interested in scancodes. A Dvorak layout user on a QWERTY keyboard will be pressing the Q key to input a '. And you shouldn't mess with that. So you have no business dealing with keyboard layouts.
The existence of this keyboard should remind you that what the keys do is not your headache, but up to the user.
The localization settings will matter to you, but not for your regex. They will, however, tell you in what language you should put your error message, in case the user input is invalid. A good coding practice is to use a library like gettext to manage this.
What matters most when you are validating input is just these two things: what is valid, and what the input is.
You (or your domain expert) decide what is valid: whether a hyphen-minus is just as acceptable as a hyphen or an en dash.
The input will be encoded; computers work with bits, not strings of glyphs. It could be ASCII, but I'd steer towards Unicode if I could help it.
As for your real concern, if I may rephrase it: "Can all users easily enter ' and -?". I guess they probably can. Many important programming languages use those glyphs to delimit strings and as a subtraction operator, respectively. And if your application needs to (dis)allow certain glyphs, you can put Unicode code points or categories in your regex.
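For instance, a character class that accepts the ASCII apostrophe and hyphen-minus as well as the typographic U+2019 RIGHT SINGLE QUOTATION MARK and U+2010 HYPHEN might look like the sketch below (my own example; the exact set of accepted code points is your call):

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Accept ' (U+0027), RIGHT SINGLE QUOTATION MARK (U+2019),
    // HYPHEN (U+2010) and - (U+002D). The trailing '-' is literal
    // because it is the last character in the class.
    std::wregex allowed_punct(L"['\u2019\u2010-]");

    std::wstring input = L"O\u2019Brien-Smith";   // typographic apostrophe
    std::wcout << std::regex_search(input, allowed_punct) << L"\n";   // prints 1
    return 0;
}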
All the ASCII codes greater than 127 are replaced by a diamond "?" symbol. How can I display those characters? I have an unsigned char buffer[1024] which contains values from 0 to 255.
Use the QString class's fromAscii() method. By default this will treat ASCII chars above 127 as Latin-1 chars. To change this behavior, use the QTextCodec::setCodecForCStrings method to set the correct codec for your usage.
I believe Qt 5 may have removed the setCodecForCStrings method.
EDIT: Adnan supplied the Qt 5 alternative to the setCodecForCStrings method; adding it to the answer for completeness.
Qt5 alternative for setCodecForCStrings is QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
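As an illustration (my own sketch, for Qt 5, and assuming the buffer really is Latin-1; pick whichever codec matches how the bytes were produced):

#include <QDebug>
#include <QString>
#include <QTextCodec>

int main() {
    unsigned char buffer[] = { 'H', 'i', ' ', 0xE9, 0 };   // 0xE9 is 'é' in Latin-1

    // Convert the raw bytes explicitly with the codec they were encoded in.
    QTextCodec* codec = QTextCodec::codecForName("ISO 8859-1");
    QString text = codec->toUnicode(reinterpret_cast<const char*>(buffer));

    qDebug() << text;   // prints "Hi é"
    return 0;
}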
This is a rabbit hole with no end. Qt does not fully support printing ASCII > 127, as it is not well defined. The current method is to use QString::fromLocal8Bit(), which will take a char array and transform it into the "right" Unicode string (the only thing Qt supports printing).
QTextCodec::setCodecForLocale can be used to identify the character set you wish to transform from. Many codecs are supported, but for some reason IBM437 (the character set used by IBM PCs in the US for decades) is not supported, whereas several other codecs used in Europe, etc. are. Probably some characters in IBM437 were never assigned proper code points in Unicode, so transforming it isn't possible?
What's frustrating is that there are fonts with glyphs for all 256 code points of these old 8-bit sets, but it is simply not possible to display them in Qt, as it only works with Unicode strings. There are a handful of glyphs it doesn't support, and the list seems to grow with newer versions of Qt. Currently I know of 9, 10, 12, 13, and 173. Some of these are unsupported for obvious reasons (usually you don't want to print a carriage return glyph, though it did exist in DOS), but others used to work in Qt and now do not.
In my application, I resorted to creating a new font that has copies of the unprintable glyphs at higher Unicode code points, and I translate them before printing them on the screen. It's quite silly, but Qt gave up on extended ASCII many years ago, so it's the best option I could find.
I'm writing a terminal (console) application that is supposed to wrap arbitrary Unicode text.
Terminals usually use a monospaced (fixed-width) font, so wrapping text is barely more than counting characters, watching whether a word fits into a line or not, and acting accordingly.
The problem is that there are fullwidth characters in the Unicode table that take up the width of 2 characters in a terminal.
Counting these sees 1 Unicode character, but the printed character is 2 "normal" (halfwidth) characters wide, breaking the wrapping routine, as it is not aware of chars that take up twice the width.
As an example, this is a fullwidth character (U+3004, the JIS symbol)
〄
12
It does not take up the full width of 2 characters here although it's preformatted, but it does use twice the width of a western character in a terminal.
To deal with this, I have to distinguish between fullwidth and halfwidth characters, but I cannot find a way to do so in C++. Is it really necessary to know all fullwidth characters in the Unicode table to get around the problem?
You should use ICU u_getIntPropertyValue with the UCHAR_EAST_ASIAN_WIDTH property.
For example:
#include <unicode/uchar.h>

// A code point occupies two terminal cells when its East Asian Width
// property is Fullwidth or Wide.
bool is_fullwidth(UChar32 c) {
    int width = u_getIntPropertyValue(c, UCHAR_EAST_ASIAN_WIDTH);
    return width == U_EA_FULLWIDTH || width == U_EA_WIDE;
}
Note that if your graphics library supports combining characters then you'll have to consider those as well when determining how many cells a sequence uses; for example e followed by U+0301 COMBINING ACUTE ACCENT will only take up 1 cell.
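One way to fold both rules into a single helper (a rough sketch of my own, along the lines of wcwidth; control characters and other special cases are ignored here):

#include <unicode/uchar.h>

// Approximate terminal cell width of a single code point:
// 0 for combining (non-spacing/enclosing) marks, 2 for wide/fullwidth
// East Asian characters, 1 for everything else.
int cell_width(UChar32 c) {
    int8_t category = u_charType(c);
    if (category == U_NON_SPACING_MARK || category == U_ENCLOSING_MARK)
        return 0;
    int ea = u_getIntPropertyValue(c, UCHAR_EAST_ASIAN_WIDTH);
    if (ea == U_EA_FULLWIDTH || ea == U_EA_WIDE)
        return 2;
    return 1;
}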
There's no need to build tables yourself; people have already done that based on the Unicode data:
http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
The same code is used in terminal-emulating software such as xterm, konsole, and quite likely others...