I'm writing a terminal (console) application that is supposed to wrap arbitrary Unicode text.
Terminals usually use a monospaced (fixed-width) font, so wrapping text is little more than counting characters, checking whether a word still fits on the line, and acting accordingly.
The problem is that there are fullwidth characters in the Unicode table that take up the width of 2 characters in a terminal.
Counting these sees 1 Unicode character, but the printed character is 2 "normal" (halfwidth) characters wide. This breaks the wrapping routine, which is not aware of characters that take up twice the width.
As an example, this is a fullwidth character (U+3004, the JIS symbol)
〄
12
It does not take up the full width of 2 characters here although it's preformatted, but it does use twice the width of a western character in a terminal.
To deal with this, I have to distinguish between fullwidth and halfwidth characters, but I cannot find a way to do so in C++. Is it really necessary to know all fullwidth characters in the Unicode table to get around the problem?
You should use ICU's u_getIntPropertyValue with the UCHAR_EAST_ASIAN_WIDTH property.
For example:
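#include <unicode/uchar.h>  // u_getIntPropertyValue, UCHAR_EAST_ASIAN_WIDTH

// True if c has East Asian Width "Fullwidth" or "Wide" (two terminal cells).
bool is_fullwidth(UChar32 c) {
    int width = u_getIntPropertyValue(c, UCHAR_EAST_ASIAN_WIDTH);
    return width == U_EA_FULLWIDTH || width == U_EA_WIDE;
}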
Note that if your graphics library supports combining characters then you'll have to consider those as well when determining how many cells a sequence uses; for example e followed by U+0301 COMBINING ACUTE ACCENT will only take up 1 cell.
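The cell counting described above can be sketched without ICU so that the example stays self-contained. The names is_wide, is_combining and display_width are illustrative only, and the range checks are a deliberately abbreviated stand-in (an assumption, not the full Unicode tables) for a real property lookup such as the ICU call above.

```cpp
#include <cstddef>
#include <string>

// Abbreviated stand-in for an East Asian Width lookup. Assumption: only a
// few well-known wide ranges are listed; a real program should query ICU
// or a complete table instead.
bool is_wide(char32_t c) {
    return (c >= 0x1100 && c <= 0x115F) ||                 // Hangul Jamo initial consonants
           (c >= 0x2E80 && c <= 0xA4CF && c != 0x303F) ||  // CJK radicals through Yi
           (c >= 0xAC00 && c <= 0xD7A3) ||                 // Hangul syllables
           (c >= 0xF900 && c <= 0xFAFF) ||                 // CJK compatibility ideographs
           (c >= 0xFF00 && c <= 0xFF60) ||                 // fullwidth forms
           (c >= 0xFFE0 && c <= 0xFFE6);                   // fullwidth signs
}

// Abbreviated stand-in: only the Combining Diacritical Marks block.
bool is_combining(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Number of terminal cells the text occupies under the assumptions above.
std::size_t display_width(const std::u32string& text) {
    std::size_t cells = 0;
    for (char32_t c : text) {
        if (is_combining(c)) continue;  // drawn in the previous character's cell
        cells += is_wide(c) ? 2 : 1;
    }
    return cells;
}
```

With these stand-ins, U+3004 counts as 2 cells, and e followed by U+0301 counts as 1.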
There's no need to build the tables yourself; Markus Kuhn has already derived them from the Unicode data:
http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
The same code is used in terminal emulating software such as xterm, konsole and quite likely others...
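Once a per-character cell width is available (from wcwidth-style tables or from ICU), the wrapping itself could look roughly like the sketch below. The function name wrap and its parameters are hypothetical, and it assumes the supplied width function returns 0, 1 or 2; a lookup that can return a negative value for control characters would have to be filtered by the caller first.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Greedy wrap: split on spaces and measure each word in terminal cells
// using the caller-supplied width function.
std::vector<std::u32string> wrap(const std::u32string& text,
                                 std::size_t max_cells,
                                 const std::function<int(char32_t)>& cell_width) {
    std::vector<std::u32string> lines;
    std::u32string line, word;
    std::size_t line_cells = 0, word_cells = 0;

    auto flush_word = [&] {
        if (word.empty()) return;
        // Start a new line if the word plus a separating space does not fit.
        if (!line.empty() && line_cells + 1 + word_cells > max_cells) {
            lines.push_back(line);
            line.clear();
            line_cells = 0;
        }
        if (!line.empty()) {
            line += U' ';
            ++line_cells;
        }
        line += word;
        line_cells += word_cells;
        word.clear();
        word_cells = 0;
    };

    for (char32_t c : text) {
        if (c == U' ') {
            flush_word();
            continue;
        }
        word += c;
        word_cells += static_cast<std::size_t>(cell_width(c));
    }
    flush_word();
    if (!line.empty()) lines.push_back(line);
    return lines;
}
```

A word wider than max_cells is placed on a line of its own rather than split; whether that is acceptable depends on the application.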
Related
I have custom print functions I use to print numbers. I made an ASCII version and a UTF-16LE version. The UTF-16LE version uses the Fullwidth codes/characters for 0-9 and A-F for hexadecimal. When debugging my functions I noticed the characters looked a little different in Visual Studio than the ASCII characters, and while this didn't bother me, it got me thinking about it. So I decided to do a quick google search for "Unicode halfwidth vs fullwidth"
... And I found several pages that talk about the "Fullwidth" form referring to the Visual width of the characters, while I thought "Fullwidth" referred to the width of the encoding (2 Bytes or more)...
Here are a few pages and quotes from them:
https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms
ICU Unicode Normal vs Fullwidth
To make things line up neatly, IBM defined a set of 'full-width' (better would have been 'double-width') letters and numbers.
https://en.wikipedia.org/wiki/Half-width_kana
Half-width kana are katakana characters displayed at half their normal width (a 1:2 aspect ratio), instead of the usual square (1:1) aspect ratio. For example, the usual (full-width) form of the katakana ka is カ while the half-width form is ｶ.
It doesn't make sense to me that "Fullwidth" would refer to the visual width, when we have different Fonts for size and alignment.
Why does "Fullwidth" refer to the visual width? Where in the Unicode UTF-16 spec does it say this?
Would having the choice to output as Halfwidth or Fullwidth using flags be desirable?
Half-width Kana as you've found is just a subset of Halfwidth and fullwidth forms, and it's a property of the codepoint/glyph, not of the encoding. UTF-16 is one of the encodings for Unicode.
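To illustrate that the width is independent of the encoding, here is a small sketch (utf16_units is a hypothetical helper name) that counts how many 16-bit code units UTF-16 needs for a code point. Both 'A' (U+0041) and its fullwidth counterpart (U+FF21) need exactly one unit, so "fullwidth" cannot be about the encoded size.

```cpp
#include <cstddef>

// Number of 16-bit code units UTF-16 uses for one code point.
// Code points outside the Basic Multilingual Plane need a surrogate pair.
std::size_t utf16_units(char32_t c) {
    return c < 0x10000 ? 1 : 2;
}
```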
Those characters exist because Unicode was designed for lossless back-and-forth conversion between legacy character sets. If you look closer at the Unicode blocks you'll see there are a lot of redundant characters like Ⅶ Ⅷ Ⅸ ㎆ ㎇ ㎎ ㎏ ㎐ ǲ ǳ Ǌ.... They're all purely for compatibility purposes because they've been used in some character sets.
See also What issues lead people to use Japanese-specific encodings rather than Unicode?
As a Developer/Programmer, would having the choice to output as Halfwidth or Fullwidth using flags be desirable?
Personally I see no reason for using them except in some rare cases, like displaying characters on a square grid. What's worse is that those Japanese characters are often rendered without ClearType and antialiasing (in small sizes), so they are painful to read. If you're in Japan you'll notice some forms that require the use of halfwidth or fullwidth characters without automatic conversion, which is bad.
You found your own answers to the origination of fullwidth vs. halfwidth so I won't get into that. Yes, the designation refers to the visual width of the characters. Sorry but I don't have any official reference for that.
One of the goals of Unicode is to handle round-trip conversions from/to any legacy character set without loss. Since there are legacy character sets with fullwidth characters, they must also be part of Unicode or they would get converted incorrectly.
I find it hard to imagine a circumstance in modern code where you would want a choice between normal and fullwidth characters. It's really only for legacy support.
In my c++ textbook, there is an "ASCII Table of Printable Characters."
I noticed a few odd things that I would appreciate some clarification on:
Why do the values start with 32? I tested a simple program with the following lines of code: char ch = 1; std::cout << ch << "\n"; and nothing printed out. So I am curious as to why the values start at 32.
I noticed the last value, 127, was "Delete." What is this for, and what does it do?
I thought char can store 256 values, why is there only 127? (Please let me know if I have this wrong.)
Thanks in advance!
The printable characters start at 32. Below 32 there are non-printable characters (or control characters), such as BELL, TAB, NEWLINE etc.
DEL (code 127) is a non-printable control character. It dates from punched paper tape, where punching all seven holes marked a character as deleted, and it produces no visible output.
char can indeed store 256 values, but its signedness is implementation-defined. If you need to store values from 0 to 255, explicitly use unsigned char; similarly, for -128 to 127, use signed char.
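The points above can be sketched in a few lines. Both helper names are hypothetical, and the sketch assumes the usual 8-bit char.

```cpp
// True for the printable ASCII range, 32 (space) through 126 ('~').
// Codes 0..31 and 127 (DEL) are control characters.
bool is_printable_ascii(unsigned char c) {
    return c >= 32 && c <= 126;
}

// Numeric value of a char as 0..255, regardless of whether plain char
// is signed or unsigned on this implementation.
int byte_value(char c) {
    return static_cast<unsigned char>(c);
}
```

Going through unsigned char before widening to int avoids negative values on platforms where plain char is signed.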
EDIT
The so-called extended ASCII characters with codes >127 are not part of the ASCII standard. Their representation depends on the "code page" chosen by the operating system. For example, MS-DOS used such extended ASCII characters for drawing directory trees, window borders etc. If you changed the code page, you could also use them to display non-English characters etc.
It's a mapping between integers and characters plus other "control" "characters" like space, line feed and carriage return interpreted by display devices (possibly virtual). As such it is arbitrary, but they are organized by binary values.
32 is a power of 2, and the printable characters start there.
Delete is the signal from your keyboard delete key.
At the time the code was designed only 7 bits were standard. Not all bytes (parts of words) were 8 bits.
All the ASCII codes greater than 127 are replaced by a diamond question-mark symbol. How can I display those characters? I have an unsigned char buffer[1024] which contains values from 0 to 255.
Use the QString class's fromAscii() method. By default this treats chars above 127 as Latin-1 chars. To change this behavior, use the QTextCodec::setCodecForCStrings method to set the correct codec for your usage.
I believe Qt 5 may have taken out the setCodecForCStrings method.
EDIT: Adnan supplied the Qt 5 alternative to the setCodecForCStrings method; adding it to the answer for completeness.
Qt5 alternative for setCodecForCStrings is QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
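As a Qt-independent illustration of what "treating bytes above 127 as Latin-1" amounts to, here is a minimal sketch that converts a Latin-1 buffer to UTF-8 by hand. The function name latin1_to_utf8 is hypothetical and this is not how Qt implements it; it only relies on the fact that each Latin-1 byte value equals its Unicode code point.

```cpp
#include <cstddef>
#include <string>

// Convert Latin-1 (ISO 8859-1) bytes to UTF-8. Bytes below 0x80 are plain
// ASCII; bytes 0x80..0xFF become two-byte UTF-8 sequences.
std::string latin1_to_utf8(const unsigned char* data, std::size_t size) {
    std::string out;
    for (std::size_t i = 0; i < size; ++i) {
        unsigned char b = data[i];
        if (b < 0x80) {
            out += static_cast<char>(b);
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));    // lead byte
            out += static_cast<char>(0x80 | (b & 0x3F));  // continuation byte
        }
    }
    return out;
}
```

If the buffer actually uses a different code page, the bytes above 127 need that code page's own mapping table instead.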
This is a rabbit hole with no end. Qt does not fully support printing ASCII > 127 as it is not well defined. The current method is to use fromLocal8Bit(), which takes a char array and transforms it into the "right" Unicode string (the only thing Qt supports printing).
QTextCodec::setCodecForLocale can be used to identify the character set you wish to transform from. Many codecs are supported, but for some reason IBM437 (the character set used by IBM PCs in the US for decades) is not supported, where several other codecs used by Europe, etc. are. Probably some characters in IBM437 were never assigned proper code points in Unicode, so transforming it isn't possible?
What's frustrating is that there are fonts with all 256 ascii code points, but it is simply not possible to display these in Qt as they only work with Unicode strings. There are a handful of glyphs they don't support, and it seems to grow with newer versions of Qt. Currently I know of 9, 10, 12, 13, and 173. Some of these are for obvious reasons (usually you don't want to print a carriage return glyph, though it did exist in DOS), but others used to work in Qt and now do not.
In my application, I resorted to creating a new font that has copies of the unprintable glyphs in higher unicode codepoints, and translate them before printing them on the screen. It's quite silly but Qt gave up on ascii many years ago, so it's the best option I could find.
I am trying to print out the following string using std::cout :
"Encryptor –pid1 0x34f –pid2"
the '-' characters appear as u's with a circumflex above them (I'm not sure how to type this).
How do I print out the hyphen as intended?
That was not a hyphen.
It was an en dash, which will render differently across consoles depending on their encoding settings.
The hyphen key is usually on the number row of your keyboard, on Western layouts.
Make sure your terminal's idea of the character encoding matches that of your source code. How to do this, of course, depends on your operating system, which terminal emulator (assuming it's an emulator at all) you're using, and so on, none of which you state.
Also, that's not a hyphen in your example; it's too long. It's probably an en dash.
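One way to check what the source file really contains is to dump the bytes of the string. The helper below is a hypothetical sketch. An ASCII hyphen-minus is the single byte 2d, while an en dash (U+2013) is e2 80 93 in UTF-8 or 96 in Windows-1252. Byte 0x96 happens to be û in code page 437, which would match the symptom described, though the question does not confirm which encodings are involved.

```cpp
#include <cstdio>
#include <string>

// Render each byte of s as two lowercase hex digits, separated by spaces.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char b : s) {
        if (!out.empty()) out += ' ';
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(b));
        out += buf;
    }
    return out;
}
```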
I am somewhat new to unicode and unicode strings. I'm trying to determine the difference between "fullwidth" symbol and a normal one.
Take these two for example:
Normal: http://www.fileformat.info/info/unicode/char/20a9/index.htm
Fullwidth: http://www.fileformat.info/info/unicode/char/ffe6/index.htm
I notice that the fullwidth is defined as U+20A9 and coincidentally 20A9 is the normal one. So what is the value of U?
When using libraries like ICU, is there a way to specify that they always return the normal form rather than the fullwidth one?
Thanks,
U+number is a notational convention for a Unicode code point. There is no 'value' of U.
U+0020, for example, is a space. The value in memory is 32 decimal, 20 hex.
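As a small sketch of the notation (format_code_point is a hypothetical helper name), the following formats a code point value as "U+" followed by its hexadecimal value, padded to at least four digits:

```cpp
#include <cstdio>
#include <string>

// Format a code point in the conventional U+XXXX notation:
// uppercase hexadecimal, zero-padded to at least four digits.
std::string format_code_point(char32_t c) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "U+%04lX", static_cast<unsigned long>(c));
    return buf;
}
```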
Full width characters are a whole other story.
Back in the days of the 3270, Hanzi took up two positions in memory in the display. So they also took up two columns on the screen. To make things line up neatly, IBM defined a set of 'full-width' (better would have been 'double-width') letters and numbers.
If some ICU API is delivering full-width, you can use the Normalizer to get rid of it. You might also post a ticket to their ticket system, this seems odd.
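For a sense of what such normalization does to fullwidth forms, here is a hand-rolled sketch that does not use ICU. The function name fold_fullwidth is hypothetical, and it covers only the fullwidth ASCII variants (which sit at a fixed offset of 0xFEE0 from their ASCII counterparts), the ideographic space and the fullwidth won sign. Compatibility normalization (NFKC) in a real library handles many more characters.

```cpp
// Map a few fullwidth characters to their ordinary counterparts.
// Assumption: a partial stand-in for compatibility normalization,
// not a replacement for it.
char32_t fold_fullwidth(char32_t c) {
    if (c >= 0xFF01 && c <= 0xFF5E) return c - 0xFEE0;  // fullwidth '!' .. '~'
    if (c == 0x3000) return 0x0020;                     // ideographic space
    if (c == 0xFFE6) return 0x20A9;                     // fullwidth won sign
    return c;                                           // everything else unchanged
}
```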
The 'U' in "U+20A9" just denotes that "20A9" is a Unicode code point, the value of the Won character in the Unicode codespace. It's a notation used in the Unicode Standard. The "U+" is followed by a hexadecimal number, using at least 4 digits, such as "U+1234" or "U+10FFFD".
U+20A9 (₩) is the WON SIGN
U+FFE6 (￦) is the FULLWIDTH WON SIGN
This is a legacy of older character encodings. The "width" affected layout. The Unicode spec says:
Compatibility variants are a subset of compatibility characters, and have the further characteristic that they represent variants of existing, ordinary, Unicode characters. For example, compatibility variants might represent various presentation or styled forms of basic letters: superscript or subscript forms, variant glyph shapes, or vertical presentation forms. They also include halfwidth or fullwidth characters from East Asian character encoding standards, Arabic contextual form glyphs from pre-existing Arabic code pages, Arabic ligatures and ligatures from other scripts, and so on. Compatibility variants also include CJK compatibility ideographs, many of which are minor glyph variants of an encoded unified CJK ideograph.
Including these forms in Unicode allows the conversion of text from (and to) the older encodings without loss of meaning.
References:
General Structure
Southeast Asian Scripts
Annex #11: East Asian Width