unicode char value - c++

Question: What is the correct order of Unicode extended symbols by value?
If I excel sort a list of Unicode chars the order is different than if I use the excel "=code()" and sort by those values. The purpose is that I want to measure the distance between chars, for example a-b = 1 and &-% = 1; when sorted with the excel sort function, two chars that are ordered within three appear to have values that are 134 away.
Also, some char symbols are blank in excel and several are found twice with 'find' and are two different symbols - and a couple are not found at all. Please explain the details of these 'special' chars.
http://en.wikipedia.org/wiki/List_of_Unicode_characters
sample code:
int charDist = abs(alpha[index] - code[0]);
EDIT:
To figure out the UNICODE values in c++ vs2008 I ran each code as a comparison from code 1 to code 255 against code 1
cout << mem << " code " << key << " is " << abs(key[0] - '') << " from " << endl;
In the brackets is a black happy face that this website does not have the font for but the command window does, in vs2008 it looks like a half-post | with the right half of a T. Excel leaves a blank.
The following Unicodes are not handled in c++ vs2008 with the std library and #include
9, 10, 13, 26, 34, 44,
And, the numerical 'distance' for codes 1 through 127 are correct, but at 128 the distance skips an extra and is one further away for some reason. Then from 128 to 255 the distance reverses and becomes closer; 255 is 2 away from 1 ''
It'd be nice if these followed something more logical and were just 1 through 255 without hiccups or skips and reversals, and 255-1 = 254 but hey, what do I know.
EDIT2: I found it - without the absolute - the collation for UNIFORMAT is 128 to 255 then 1 to 127 and yields 1 to 255 with the 6 skips for 9, 10, 13, 26, 34, 44 that are garbage. That was not intuitive. In the new order 128->255,1->127 the strange skip from 127 to 128 is clearer, it is because there is no 0 so the value is missing between 255 and 1.
SOLUTION: make my own hashtable with values for each symbol and do not rely on c++ std library or vs2008 to provide the UNIFORMAT values since they are not correct for measuring the char distance outside of several specific subsets of UNIFORMAT.

Unicode doesn't have a defined sort (or collation) order. When Excel sorts, it's using tables based on the currently selected language. For example, someone using Excel in English mode may get different sorting results that someone using Excel in Portuguese.
There are also issues of normalization. With Unicode, one "character" doesn't necessarily correspond to one value. Some characters can be represented in different ways. For example, a capital omega can be coded as a Greek letter or as a symbol for representing units of electrical resistance. In some languages, a single character may be composed from several consecutive values.
The blank values probably correspond to glyphs that you don't have any font coverage for. Some systems use so-called "Unicode fonts" which have a large percentage of the glyphs you need for every script. Windows tends to switch fonts on the fly when the current font doesn't have a necessary glyph. Neither approach will have every glyph necessary. Also, some Unicode values don't encode to a visible glyph (e.g., there are many different kinds of spaces in Unicode), some values act more like ASCII-style controls codes (e.g., paragraph separator or bidi controls), and some values only make sense when they combine with another character, like many of the "combining" accents.
So there's not an answer you're going to be satisfied with. Perhaps if you gave more information about what you're ultimately trying to do, we could suggest a different approach.

I don't think you can do what you want to do in Excel without limiting your approach significantly.
By experimentation, the Code function will never return a value higher than 255. If you use any unicode text that cannot be generated via this VBA Code, it will be interpreted as a question mark (?) or 63.
For x = 1 To 255
Cells(x, 1).Value = Chr(x)
Next
You should be able to determine the difference using Code. But if the character doesn't fall in that realm, you'll need to go outside of Excel, because even VBA will convert any other Unicode characters to the question mark(?) or 63.

Related

Meaning of 3F7.1 in Fortran data format

I am trying to create an MDM file using HLM 7 Student version, but since I don't have access to SPSS I am trying to import my data using ASCII input. As part of this process I am required to input the data format Fortran style. Try as I might I have not been able to understand this step. Could someone familiar with Fortran (or even better HLM itself) explain to me how this works? Here is my current understanding
From the example EG3.DAT they give
(A4,1X,3F7.1)
I think
A4 signifies that the ID is 4 characters long.
1X means skip a space.
F.1 means that it should read 1 decimal places.
I am very confused about what 3F7 might mean.
EG3.DAT
2020 380.0 40.3 12.5
2040 502.0 83.1 18.6
2180 777.0 96.6 44.4
Below are examples from the help documents.
Rules for format statement
Format statement example
EG1 data format
EG2 data format
EG3 data format
One similar question is Explaining Fortran Write Format. Unfortunately it does not explicitly treat the F descriptor.
3F7.1 means 3 floating point numbers, each printed over 7 characters, each with one decimal number behind the decimal point. Leading characters are blanks.
For reading you don't need the .1 info at all, just read a floating point number from those 7 characters.
You guessed the meaning of A4 (string of four characters) and 1X (one blank) correctly.
In Fortran, so-called data edit descriptors (which format the input or output of data) may have repeat specifications.
In the format (A4,1X,3F7.1) the data edit descriptors are A4 and F7.1. Only F7.1 has a repeat specification (the number before the F). This simply means that the format is as though the descriptor appeared repeated: like F7.1, F7.1, F7.1. With a repeat specification of 1, or not given, there is just the single appearance.
The format of the question, then, is like
(A4,1X,F7.1,F7.1,F7.1)
This format is one that is covered by the rules provided in one of the images of the question. In particular, the aspect of repeat specification is given in rule 2 with the corresponding example of rule 3.
Further, in Fortran proper, a repeat count specifier may also be * as special case: that's like an exceptionally large repeat count. *(F7.1) would be like F7.1, F7.1, F7.1, .... I see no indication that this is supported by HLM but if this is needed a very large repeat count may be given instead.
In 1X the 1 isn't a repeat specification but an integral, and necessary, part of the position edit descriptor.
Procedure for making MDM file from excel for HLM:
-Make sure ALL the characters in ALL the columns line up
Select a column, then right click and select Format Cells
Then click on 'Custom' and go to the 'Type' box and enter the number
of 0s you need to line everything up
-Remove all the tabs from the document and replace them with spaces.
Open the document in word and use find and replace
-To save the document as .dat
First save it as .txt
Then open it in Notepad and save it as .dat
To enter the data format (FORTRAN-Style)
The program wants to read the data file space by space, so you have to specify it perfectly so that it reads the whole set properly.
If something is off, even by a single space, then your descriptive stats will be wonky compared to if you check them in another program.
Enclose the code with brackets ()
Divide the entries with commas ,
-Need ID column for all levels
ID column needs to be sorted so that it is in order from smallest to
largest
Use A# with # being the number of characters in the ID
Use an X1 to
move from the ID to the next column
-Need to say how many characters are needed in each column
Use F
After F is the number of characters needed for that column -Use F# (#= number)
There need to be enough character spaces to provide one 'gap' space
between each column
There need to be enough to character spaces to allow for the decimal
As part of the F you need to specify the number of decimal places
You do this by adding a decimal point after the F number and then a
number to represent the spaces you need -F#.#
You can use a number in front of the F so as to 'repeat' it. Not
necessary though. -#F#.#
All in all, it should look something like this:
(A4,X1,F4.0,F5.1)
Helpful links:
https://books.google.de/books?id=VdmVtz6Wtc0C&pg=PA78&lpg=PA78&dq=data+format+fortran+style+hlm&source=bl&ots=kURJ6USN5e&sig=fdtsmTGSKFxn04wkxvRc2Vw1l5Q&hl=en&sa=X&ved=0ahUKEwi_yPurjYrYAhWIJuwKHa0uCuAQ6AEIPzAC#v=onepage&q&f=false
http://www.ssicentral.com/hlm/help6/error/Problems_creating_MDM_files.pdf
http://www.ssicentral.com/hlm/help7/faq/FAQ_Format_specifications_for_ASCII_data.pdf

How to split emojis that contain flags without the flag breaking into 2 characters in Google Sheets

This is my initial string:
πŸ‡¦πŸ‡ΊπŸ…πŸ‰
I used a not so elegant way to break up the emojis.
=if(len(I88) = 4, REGEXEXTRACT(I88,"(.+?)\s*(.+?)"),if(len(I88) = 6, REGEXEXTRACT(I88,"(.+?)\s*(.+?)\s*(.+?)"),if(len(I88) = 8, REGEXEXTRACT(I88,"(.+?)\s*(.+?)\s*(.+?)\s*(.+?)"),if(len(I88) = 10, REGEXEXTRACT(I88,"(.+?)\s*(.+?)\s*(.+?)\s*(.+?)\s*(.+?)"), REGEXEXTRACT(I88,"\s*(.+?)" )))))
The result is 4 columns instead of 3: this is what it looks like
πŸ‡¦ | πŸ‡Ί | πŸ… | πŸ‰
I left the pipes to indicate a separate column
What I want is this:
πŸ‡¦πŸ‡Ί | πŸ… | πŸ‰
Short answer
To correctly separate the three emoticons we need to use a custom function. Fortunaly there are JavaScript libraries that could be used for this like the one shared in the answer by Orlin Giorgiev to Get grapheme character count in javascript strings?
Explanation
The OP formula is returning four elements instead of three because Google Sheets built-in functions requires four "characters" (actually they are code points) that need more than 4 hexadecimal digits to represent them. Each set of "characters" to represent emoticons are called "astral code points".
From https://mathiasbynens.be/notes/javascript-unicode
Astral code points are pretty easy to recognize: if you need more than 4 hexadecimal digits to represent the code point, it’s an astral code point.
Internally, JavaScript [as well Google Sheets built-in functions] represents astral symbols as surrogate pairs, and it exposes the separate surrogate halves as separate β€œcharacters”. If you represent the symbols using nothing but ECMAScript 5-compatible escape sequences, you’ll see that two escapes are needed for each astral symbol. This is confusing, because humans generally think in terms of Unicode symbols or graphemes instead.
Custom function
function SPLITGRAPHEMES(string) {
var splitter = new GraphemeSplitter();
return splitter.splitGraphemes(string);
}
NOTE: Don't forget to include the referred JavaScript library
Syntax
Assume that A1 contains . To split the three emoticons in a 1 x 3 array use the following formula:
=TRANSPOSE(SPLITGRAPHEMES(A1))
Note: In Windows the emoticons (πŸ‡¦πŸ‡ΊπŸ…πŸ‰) in this Q&A doesn't look the same as in Chrome OS, so a image was used in the above paragraph.

C++ - A few quetions about my textbook's ASCII table

In my c++ textbook, there is an "ASCII Table of Printable Characters."
I noticed a few odd things that I would appreciate some clarification on:
Why do the values start with 32? I tested out a simple program and it has the following lines of code: char ch = 1; std::cout << ch << "\n"; of code and nothing printed out. So I am kind of curious as to why the values start at 32.
I noticed the last value, 127, was "Delete." What is this for, and what does it do?
I thought char can store 256 values, why is there only 127? (Please let me know if I have this wrong.)
Thanks in advance!
The printable characters start at 32. Below 32 there are non-printable characters (or control characters), such as BELL, TAB, NEWLINE etc.
DEL is a non-printable character that is equivalent to delete.
char can indeed store 256 values, but its signed-ness is implementation defined. If you need to store values from 0 to 255 then you need to explicitly specify unsigned char. Similarly from -128 to 127, have to specify signed char.
EDIT
The so called extended ASCII characters with codes >127 are not part of the ASCII standard. Their representation depends on the so called "code page" chosen by the operating system. For example, MS-DOS used to use such extended ASCII characters for drawing directory trees, window borders etc. If you changed the code page, you could have also used to display non-English characters etc.
It's a mapping between integers and characters plus other "control" "characters" like space, line feed and carriage return interpreted by display devices (possibly virtual). As such it is arbitrary, but they are organized by binary values.
32 is a power of 2 and an alphabet starts there.
Delete is the signal from your keyboard delete key.
At the time the code was designed only 7 bits were standard. Not all bytes (parts words) were 8 bits.

determine whether a unicode character is fullwidth or halfwidth in C++

I'm writing a terminal (console) application that is supposed to wrap arbitrary unicode text.
Terminals are usually using a monospaced (fixed width) font, so to wrap a text, it's barely more than counting characters and watching whether a word fits into a line or not and act accordingly.
Problem is that there are fullwidth characters in the Unicode table that take up the width of 2 characters in a terminal.
Counting these would see 1 unicode character, but the printed character is 2 "normal" (halfwidth) characters wide, breaking the wrapping routine as it is not aware of chars that take up twice the width.
As an example, this is a fullwidth character (U+3004, the JIS symbol)
〄
12
It does not take up the full width of 2 characters here although it's preformatted, but it does use twice the width of a western character in a terminal.
To deal with this, I have to distinguish between fullwidth or halfwidth characters, but I cannot find a way to do so in C++. Is it really necessary to know all fullwidth characters in the unicode table to get around the problem?
You should use ICU u_getIntPropertyValue with the UCHAR_EAST_ASIAN_WIDTH property.
For example:
bool is_fullwidth(UChar32 c) {
int width = u_getIntPropertyValue(c, UCHAR_EAST_ASIAN_WIDTH);
return width == U_EA_FULLWIDTH || width == U_EA_WIDE;
}
Note that if your graphics library supports combining characters then you'll have to consider those as well when determining how many cells a sequence uses; for example e followed by U+0301 COMBINING ACUTE ACCENT will only take up 1 cell.
There's no need to build tables, people from Unicode have already done that:
http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
The same code is used in terminal emulating software such as xterm[1], konsole[2] and quite likely others...

Cocos2d custom font file - issue with character id

I am developing cocos2d game, which supports multiple languages. I created a font file(.png and .fnt) with all supported characters.
The issue is some of the character id's are in range of 917505-917631. So I set kCCBMFontMaxChars = 917632. But this is taking lot of memory.
Can anyone please tell me how to handle this situation.
kCCBMFontMaxChars = 0xffff; // 65k
This should suffice for all Unicode characters. It certainly works for all the asian and cyrillic languages. The memory usage will be exactly 2 MB.
Don't worry about the ID, I believe they are offsets into the BMFont char array and not indexes. Each entry is 32 Bytes. 917632 divided by 32 gives you 28676, which if it is an index fits within the unicode character range.