C++ - A few questions about my textbook's ASCII table

In my c++ textbook, there is an "ASCII Table of Printable Characters."
I noticed a few odd things that I would appreciate some clarification on:
Why do the values start at 32? I tested a simple program with the following lines of code: char ch = 1; std::cout << ch << "\n"; and nothing printed out. So I am curious as to why the values start at 32.
I noticed the last value, 127, was "Delete." What is this for, and what does it do?
I thought char could store 256 values, so why are there only 127? (Please let me know if I have this wrong.)
Thanks in advance!

The printable characters start at 32. Below 32 there are non-printable characters (or control characters), such as BELL, TAB, NEWLINE etc.
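You can see this for yourself; a minimal sketch in standard C++ that dumps just the printable range (codes below 32 produce no visible glyph, which is why your char ch = 1 test printed nothing):
#include <iostream>
int main() {
    // 32 (space) through 126 ('~') are the printable ASCII characters;
    // 0-31 and 127 are control characters with no visible glyph.
    for (int c = 32; c < 127; ++c)
        std::cout << static_cast<char>(c);
    std::cout << "\n";
}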
DEL (127) is also a non-printable control character; historically it was used to erase a character on punched paper tape by punching out all seven holes.
char can indeed store 256 values, but its signedness is implementation-defined. If you need to store values from 0 to 255, you have to explicitly specify unsigned char; similarly, for -128 to 127, you have to specify signed char.
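If you want to check what your implementation chose, the limits in <climits> spell it out; a minimal sketch:
#include <climits>
#include <iostream>
int main() {
    // CHAR_MIN is 0 if plain char is unsigned, negative if it is signed.
    std::cout << "char:          " << CHAR_MIN  << " to " << CHAR_MAX  << "\n";
    std::cout << "signed char:   " << SCHAR_MIN << " to " << SCHAR_MAX << "\n";
    std::cout << "unsigned char: 0 to " << UCHAR_MAX << "\n";
}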
EDIT
The so-called extended ASCII characters with codes >127 are not part of the ASCII standard. Their representation depends on the "code page" chosen by the operating system. For example, MS-DOS used such extended ASCII characters for drawing directory trees, window borders etc. If you changed the code page, you could also display non-English characters and so on.

It's a mapping between integers and characters, plus "control characters" like line feed and carriage return that are interpreted by display devices (possibly virtual). As such the assignment is arbitrary, but the characters are organized by their binary values.
32 is a power of 2, and the printable characters (starting with space) begin there.
Delete (127) is the signal sent by your keyboard's delete key.
At the time the code was designed, only 7 bits were standard; not all bytes (or parts of words) were 8 bits.

Related

Fortran formatted IO and the Null character

I wonder how Fortran's I/O is expected to behave in the case of a NULL character, ACHAR(0).
The actual task is to fill an ASCII file with blocks of precisely eight characters. The strings are read from a binary file and may contain non-printing characters.
I tried gfortran 4.8, 8.1 and f2c. If there is a NULL character in the string, the format specifier FORMAT(A8) does not appear to write eight characters.
Give the following F77 code a try:
c Print a string of eight characters surrounded by dashes
100 FORMAT('-',A8,'-')
c Works fine if empty or any other combination of printing chars
write(*,100) ''
c In case of a short string, blanks are padded
write(*,100) '345678'
c A NULL character does something I did not expect
write(*,100) '123'//ACHAR(0)//'4567'
c Not even position editing helps
101 FORMAT('-',A8,T10,'x')
write(*,101) '123'//ACHAR(0)//'4567'
end
My output is:
-        -
-  345678-
-1234567-
-1234567x
Is this expected behavior? Any idea how to get the output eight characters wide in any case?
When using the edit descriptor A8, the field width is eight; for output, eight characters will be written.
In the case of the example, it isn't the writing of the characters that is contrary to your expectations, but how they are displayed by your terminal.
You can examine the output further with tools like hexdump or you can write to an internal file and look at arbitrary substrings.
Yes, that is expected. If there is a null character, the printing of the string on the screen can stop there. The characters are still sent, but the string does not have to be rendered in full on the screen.
Note that C uses the NUL character to delimit strings, and the OS may interpret the strings it receives with the same convention. This allows non-printable characters to be interpreted in processor-specific ways, where "the processor" includes the whole complex of the compiler, the executing environment (the OS and programs in the OS) and the hardware.
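You can reproduce the same effect from C++; a minimal sketch (the terminal, not the language, decides what is displayed):
#include <cstdio>
int main() {
    // All ten bytes, including the embedded NUL, are sent to stdout;
    // most terminals simply render the NUL as nothing.
    std::fwrite("-123\0" "4567-", 1, 10, stdout);
    std::fputc('\n', stdout);
    // Piping through a hex viewer (e.g. ./a.out | hexdump -C) shows
    // the 00 byte really is in the output.
}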

C++ get the size (in bytes) of EOL

I am reading an ASCII text file. It is defined by the size of each field, in bytes. E.g. each row consists of 10 bytes for some string, 8 bytes for a floating point value, 5 bytes for an integer and so on.
My problem is reading the newline character, which has a variable size depending on the OS (usually 2 bytes on Windows and 1 byte on Linux, I believe).
How can I get the size of the EOL character in C++?
For example, in python I can do:
len(os.linesep)
The time-honored way to do this is to read a line.
Now, the last char should be \n. Strip it. Then look at the previous character: it will either be \r or something else. If it's \r, strip it too.
For Windows [ASCII] text files, there aren't any other possibilities.
This works even if the file is mixed (e.g. some lines are \r\n and some are just \n).
You can tentatively do this on a few lines, just to be sure you're not dealing with something weird.
After that, you know what to expect for most of the file. But the strip method is the generally reliable way: on Windows, you could have a file imported from Unix (or vice versa).
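A minimal C++ sketch of that strip method (eolSize is a hypothetical name, not a standard function):
#include <string>
// std::getline already removes the trailing '\n'; if a '\r' is still
// at the end, the line used a CRLF (2-byte) ending, otherwise LF (1 byte).
std::size_t eolSize(std::string& line) {
    if (!line.empty() && line.back() == '\r') {
        line.pop_back();  // strip the '\r' as described above
        return 2;
    }
    return 1;
}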
I'm not sure that the translation occurs where you think it is. Look at the following code:
#include <cstring>
#include <sstream>
int main() {
    std::ostringstream buf;
    buf << std::endl;
    std::string s = buf.str();
    int i = std::strlen(s.c_str());
}
After this, running on Windows, i == 1. So the end-of-line as the standard library represents it in memory is one character; as others have commented, this is the "\n" character. The translation to "\r\n" happens later, in the file stream layer, not in std::endl itself.

How to find whether a byte read is Japanese or English?

I have an array which contains Japanese and ASCII characters.
I am trying to find whether the characters read are English characters or Japanese characters.
In order to solve this, I proceeded as follows:
Read the first byte; if the multibyte character width is not equal to one, move the pointer to the next byte.
Now display the two bytes together and report that a Japanese character has been read.
If the multibyte character width is equal to one, display the byte and report that English has been read.
The above algorithm works fine but fails for the half-width forms of Japanese, e.g. シ, ァ, etc., as they are only one byte wide.
How can I find out whether the characters are Japanese or English?
Note: what I tried
I read on the web that the first byte will tell whether it is Japanese or not, which I have covered in step 1 of my algorithm. But it won't work for half-width characters.
EDIT:
In the problem I was solving, I include the control character 0x80 at the start and end of my characters to mark the string:
cntlchar.....(my characters, can be Japanese).....cntlchar
I wrote the following to identify the end of the control characters:
if ((buf[*p+1] & 0x80) && (mbMBCS_charWidth(&buf[*p]) == 1))
    // end of control characters reached
else
    // *p++
It worked fine for English but didn't work for Japanese half-width.
How can I handle this?
Your data must be using Windows Codepage 932. That is a guess, but examining the codepoints shows what you are describing.
The codepage shows that characters in the range 00 to 7F are "English" (a better description is "7-bit ASCII"), the characters in the ranges 81 to 9F and E0 to FF are the first byte of a multibyte code, and everything between A1 and DF are half-width Kana characters.
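A sketch of that classification in C++, assuming exactly the ranges quoted above (the enum and function names here are mine, not from any library):
#include <cstdint>
enum class Cp932Byte { Ascii, LeadByte, HalfWidthKana, Other };
Cp932Byte classify(std::uint8_t b) {
    if (b <= 0x7F)                              return Cp932Byte::Ascii;          // 7-bit ASCII
    if ((b >= 0x81 && b <= 0x9F) || b >= 0xE0)  return Cp932Byte::LeadByte;       // first of two bytes
    if (b >= 0xA1 && b <= 0xDF)                 return Cp932Byte::HalfWidthKana;  // one-byte Kana
    return Cp932Byte::Other;                    // 0x80, 0xA0: unassigned
}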
For individual bytes this is impractical to impossible. For larger sets of data you could do statistical analysis on the bytes and see whether they match known English or Japanese patterns. For example, vowels are very common in English text, while Japanese text produces its own characteristic byte-frequency patterns.
Things get more complicated than testing bits if your data includes accented characters.
If you're dealing with Shift-JIS data and Windows-1252 encoded text, ideally you just remap it to UTF-8. There's no standard way to identify text encoding within a text file, although things like MIME can help if added on externally as metadata.

Unicode char value

Question: What is the correct order of Unicode extended symbols by value?
If I sort a list of Unicode chars in Excel, the order is different than if I use Excel's =CODE() function and sort by those values. The purpose is that I want to measure the distance between chars, for example a-b = 1 and &-% = 1; yet when sorted with the Excel sort function, two chars that are ordered within three of each other can have values that are 134 apart.
Also, some char symbols are blank in Excel, several are found twice with 'find' as two different symbols, and a couple are not found at all. Please explain the details of these 'special' chars.
http://en.wikipedia.org/wiki/List_of_Unicode_characters
sample code:
int charDist = abs(alpha[index] - code[0]);
EDIT:
To figure out the Unicode values in C++ (VS2008), I ran each code, from 1 to 255, as a comparison against code 1:
cout << mem << " code " << key << " is " << abs(key[0] - '') << " from " << endl;
Between the quotes is a black happy face that this website does not have the font for, but the command window does; in VS2008 it looks like a half-post | with the right half of a T. Excel leaves a blank.
The following codes are not handled in C++ VS2008 with the std library and #include:
9, 10, 13, 26, 34, 44
Also, the numerical 'distance' for codes 1 through 127 is correct, but at 128 the distance skips an extra and is one further away for some reason. Then from 128 to 255 the distance reverses and becomes closer; 255 is 2 away from 1 ''.
It would be nice if these followed something more logical and were just 1 through 255 without hiccups, skips or reversals, so that 255-1 = 254, but hey, what do I know.
EDIT2: I found it. Without the absolute value, the collation for UNIFORMAT is 128 to 255 and then 1 to 127, which yields 1 to 255 with the 6 skips for 9, 10, 13, 26, 34, 44 that are garbage. That was not intuitive. In the new order 128->255, 1->127, the strange skip from 127 to 128 is clearer: it is because there is no 0, so a value is missing between 255 and 1.
SOLUTION: make my own hash table with values for each symbol, and do not rely on the C++ std library or VS2008 to provide the UNIFORMAT values, since they are not correct for measuring the char distance outside of several specific subsets of UNIFORMAT.
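A minimal C++ sketch of that hash-table workaround (the table contents and the names ordinal/charDist are placeholders; you would fill in your own symbol ordering):
#include <cstdlib>
#include <unordered_map>
// Assign your own ordinal to every symbol you care about, then measure
// distance through the table instead of through raw character codes.
std::unordered_map<char32_t, int> ordinal = {
    { U'a', 1 }, { U'b', 2 }  /* ... the rest of your symbol set ... */
};
int charDist(char32_t x, char32_t y) {
    return std::abs(ordinal.at(x) - ordinal.at(y));
}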
Unicode doesn't have a defined sort (or collation) order. When Excel sorts, it's using tables based on the currently selected language. For example, someone using Excel in English mode may get different sorting results than someone using Excel in Portuguese.
There are also issues of normalization. With Unicode, one "character" doesn't necessarily correspond to one value. Some characters can be represented in different ways. For example, a capital omega can be coded as a Greek letter or as a symbol for representing units of electrical resistance. In some languages, a single character may be composed from several consecutive values.
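As a small C++ illustration of that omega point, the two code points look identical on screen but are numerically far apart:
#include <iostream>
int main() {
    char32_t omega = U'\u03A9';  // GREEK CAPITAL LETTER OMEGA
    char32_t ohm   = U'\u2126';  // OHM SIGN, visually the same glyph
    std::cout << "distance: " << (ohm - omega) << "\n";  // 7549, not 0
}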
The blank values probably correspond to glyphs that you don't have any font coverage for. Some systems use so-called "Unicode fonts" which have a large percentage of the glyphs you need for every script. Windows tends to switch fonts on the fly when the current font doesn't have a necessary glyph. Neither approach will have every glyph necessary. Also, some Unicode values don't encode to a visible glyph (e.g., there are many different kinds of spaces in Unicode), some values act more like ASCII-style control codes (e.g., paragraph separator or bidi controls), and some values only make sense when combined with another character, like many of the "combining" accents.
So there's not an answer you're going to be satisfied with. Perhaps if you gave more information about what you're ultimately trying to do, we could suggest a different approach.
I don't think you can do what you want to do in Excel without limiting your approach significantly.
By experimentation, the CODE function will never return a value higher than 255. If you use any Unicode character that cannot be generated via the VBA code below, it will be interpreted as a question mark (?), or 63.
For x = 1 To 255
Cells(x, 1).Value = Chr(x)
Next
You should be able to determine the difference using CODE. But if the character doesn't fall in that range, you'll need to go outside of Excel, because even VBA will convert any other Unicode character to the question mark (?), or 63.

C/C++: How to convert 6bit ASCII to 7bit ASCII

I have a set of 6 bits that represent a 7-bit ASCII character. How can I get the correct 7-bit ASCII code out of the 6 bits I have? Just append a zero and do a bitwise OR?
Thanks for your help.
Lennart
ASCII is inherently a 7-bit character set, so what you have is not "6-bit ASCII". What characters make up your character set? The simplest decoding approach is probably something like:
char From6Bit( char c6 ) {
    // array of all 64 characters that appear in your 6-bit set
    static const char SixBitSet[] = { 'A', 'B', ... };
    return SixBitSet[ c6 ];
}
A footnote: 6-bit character sets were quite popular on old DEC hardware, some of which, like the DEC-10, had a 36-bit architecture where 6-bit characters made some sense.
You must tell us what your 6-bit set of characters looks like; I don't think there is a standard.
The easiest way to do the reverse mapping would probably be to just use a lookup table, like so:
static const char sixToSeven[] = { ' ', 'A', 'B', ... };
This assumes that space is encoded as (binary) 000000, capital A as 000001, and so on.
You index into sixToSeven with one of your six-bit characters and get the corresponding 7-bit character back.
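A usage sketch under that same assumption (space = 000000, 'A' = 000001, ...); the table here is only partially filled, for illustration:
#include <iostream>
static const char sixToSeven[64] = { ' ', 'A', 'B', 'C' /* ... 64 entries ... */ };
int main() {
    const unsigned char packed[] = { 0x01, 0x02, 0x00, 0x03 };
    for (unsigned char c6 : packed)
        std::cout << sixToSeven[c6 & 0x3F];  // mask keeps the index in 0..63
    std::cout << "\n";                       // prints "AB C"
}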
I can't imagine why you'd be getting old DEC-10/20 SIXBIT, but if that's what it is, then just add 32 (decimal). SIXBIT took the ASCII characters starting with space (32), so just add 32 to the SIXBIT character to get the ASCII character.
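In C++, that rule is a one-liner; a sketch (sixbitToAscii is a made-up name):
// DEC SIXBIT maps 0..63 onto ASCII 32..95, so decoding is just an offset.
char sixbitToAscii(unsigned char c6) {
    return static_cast<char>(c6 + 32);
}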
The only recent 6-bit code I'm aware of is base64. This uses four 6-bit printable characters to store three 8-bit values (6x4 = 8x3 = 24 bits).
The 6-bit values are drawn from the characters:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
which are the values 0 thru 63. Four of these (say UGF4) are used to represent three 8-bit values.
UGF4 = 010100 000110 000101 111000
= 01010000 01100001 01111000
= Pax
If this is how your data is encoded, there are plenty of snippets around that will tell you how to decode it (and many languages have an encoder and decoder built in, or in an included library). Wikipedia has a good article on base64.
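A minimal C++ sketch of that 6-bit-to-8-bit regrouping (decodeQuartet is a made-up name, with no padding or error handling; real code should use a proper base64 library):
#include <cstdint>
#include <cstring>
#include <iostream>
#include <string>
static const char* kAlphabet =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
// Pack four 6-bit digits into 24 bits, then split into three 8-bit bytes.
std::string decodeQuartet(const char* in) {
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i)  // assumes all four chars are in kAlphabet
        v = (v << 6) | static_cast<std::uint32_t>(
                std::strchr(kAlphabet, in[i]) - kAlphabet);
    return { static_cast<char>(v >> 16),
             static_cast<char>((v >> 8) & 0xFF),
             static_cast<char>(v & 0xFF) };
}
int main() {
    std::cout << decodeQuartet("UGF4") << "\n";  // prints "Pax", as above
}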
If it's not base64, then you'll need to find out the encoding scheme. Some older schemes used shift-in/shift-out (SI/SO) codes for choosing a page within character sets, but I think that was more for selecting extended (e.g., Japanese DBCS) characters than for normal ASCII characters.
If I were to give you the value of a single bit, and I claimed it was taken from Windows XP, could you reconstruct the entire OS?
You can't. You've lost information. There is no way to reconstruct it unless you have some knowledge about what was lost. If you know that, say, the most significant bit was chopped off, then you can set it to zero, and you've reconstructed at least half the characters correctly.
If you know how 'a' and 'z' are represented in your 6-bit encoding, you might be able to guess at what was removed by comparing them to their 7-bit representations.