What is the order followed by comparisons between characters in C++? I have noticed that 'z' > '1'. I am trying to find a link to the order rationale C++ follows for all characters, or any generic material reference in case this is a widely known order (similar to alphabetical order for lowercase letters).
Each character is compared by its numeric code value, which on most systems is its ASCII code. Check this table; it will solve your doubts about how characters are evaluated: https://www.ascii-code.com/
Only '0' to '9' are guaranteed to be encoded consecutively. The values of other characters are, according to the language specification, implementation-defined. However, almost all x86 C compilers use ASCII.
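To see the ordering in action, here is a minimal sketch; the printed numbers assume an ASCII-based execution character set, which is what virtually every desktop compiler uses:

#include <iostream>

int main() {
    // Characters compare by their numeric code values in the
    // execution character set; in ASCII 'z' is 122 and '1' is 49.
    std::cout << "'z' = " << static_cast<int>('z') << '\n';
    std::cout << "'1' = " << static_cast<int>('1') << '\n';
    std::cout << std::boolalpha << ('z' > '1') << '\n';   // true here
}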
Related
This is a homework question I just can't seem to get correct. It is related to a C++ programming course. The question is given below, fill in the blank.
The char and int data types have a direct equivalence, with the value of the int being based on the _________ coding scheme.
The exam has asked a poorly-worded question! char and int do NOT have a direct equivalence - but a char can be interpreted as an int, usually using the "ASCII" coding scheme ("American Standard Code for Information Interchange"). But even that's not universal - there's also EBCDIC and others.
But try "ASCII".
Edit
According to the C standard, the character encoding doesn't have to be ASCII. But there are rules it has to follow:
The representations for '0' to '9' must be consecutive and in that order, to make calculations easy when converting to int (see the sketch after this list).
The representations for 'A' to 'Z' must be ascending, to make calculations for sorting easy (note not necessarily consecutive - for example in EBCDIC they're not).
The representations for 'a' to 'z' must also follow the above rule; in addition, the difference between the upper-case and lower-case form must be the same for every letter (note that lower case could come before upper case).
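That digit rule is what makes the usual conversion idiom portable. A minimal sketch, relying only on what the standard guarantees about the digits:

#include <iostream>

int digit_value(char c) {
    // Valid for '0'..'9' on every conforming implementation, because
    // the digit characters must be consecutive and in order.
    return c - '0';
}

int main() {
    std::cout << digit_value('7') << '\n';   // prints 7
}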
So I realize that assuming ASCII encoding can get you in trouble, but I'm never really sure how much trouble you can have when subtracting characters. I'd like to know which relatively common scenarios can cause any of the following to evaluate to false.
Given:
std::string test = "B";
char m = 'M';
A) (m-'A')==12
B) (test[0]-'D') == -2
Also, does the answer change for lowercase values (changing the 77 to 109, of course)?
Edit: Digit subtraction answers this question for char digits: the standard says '2'-'0'==2 must hold for all digits 0-9. But I want to know whether the same holds for a-z and A-Z, which section 2.3 of the standard is unclear on in my reading.
Edit 2: Removed ASCII-specific content to focus the question more clearly (sorry #πάντα-ῥεῖ for a content-changing edit, but I feel it is necessary). Essentially the standard seems to imply some ordering of characters for the basic set, but some encodings do not maintain that ordering, so what's the overriding principle?
In other words, when are chars in C/C++ not stored in ASCII?
The C and C++ languages don't have any notion of the actual character coding table used by the target system. The only convention is that character literals like 'A' take their values from the current (execution) encoding.
You could just as well deal with EBCDIC-encoded characters, and the code would look the same as for ASCII characters.
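For example, code that only compares against character literals stays encoding-agnostic, because each literal takes whatever value the execution character set assigns to it. A small sketch:

#include <iostream>

bool is_plus_sign(char c) {
    // '+' has whatever value the execution character set gives it
    // (0x2B in ASCII, 0x4E in EBCDIC); the source code is identical.
    return c == '+';
}

int main() {
    std::cout << std::boolalpha << is_plus_sign('+') << '\n';   // true on any encoding
}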
Quote from C++03 2.2 Character sets:
"The basic execution character set and the basic execution
wide-character set shall each contain all the members of the basic
source character set..The values of the members of the execution
character sets are implementation-defined, and any additional members
are locale-specific."
According to this, the value of 'A', which belongs to the execution character set, is implementation-defined. So it's not necessarily 65 (the ASCII code of 'A' in decimal)?!
// Not always 65?
printf ("%d", 'A');
Or do I have a misunderstanding about the value of a character in the execution character set?
Of course it can be ASCII's 65, if the execution character set is ASCII or a superset (such as UTF-8).
It doesn't say "it can't be ASCII", it says that it is something called "the execution character set".
So the standard allows the "execution character set" to be something other than ASCII or ASCII derivatives. One example would be the EBCDIC character set that IBM used for a long time (there are probably still machines around using EBCDIC, but I suspect anything built in the last 10-15 years wouldn't be using it). The encoding of characters in EBCDIC is completely different from ASCII.
So expecting, in code, that the value of 'A' is any particular value is not portable. There is also a whole heap of other "common assumptions" that will fail: that there are no "holes" between 'A' and 'Z', and that 'a' - 'A' == 32; both are false in EBCDIC. At least the characters A-Z are in the correct order! ;)
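If code does rely on such assumptions, they can at least be verified at compile time. A small sketch; the checks are standard C++, even though the first two assumptions they test are not guaranteed:

// Passes on ASCII-based implementations; the first two assertions fail under EBCDIC.
static_assert('a' - 'A' == 32, "constant case offset of 32 assumed");
static_assert('J' == 'I' + 1, "contiguous upper-case letters assumed");
static_assert('1' == '0' + 1, "this one is actually guaranteed by the standard");

int main() {}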
Coming from a discussion started here, does the standard specify values for characters? So, is '0' guaranteed to be 48? That's what ASCII would tell us, but is it guaranteed? If not, have you seen any compiler where '0' isn't 48?
No. There's no requirement for either the source or execution character sets to use an encoding with an ASCII subset. I haven't seen any non-ASCII implementations, but I know someone who knows someone who has. (It is required that '0' through '9' have contiguous integer values, but that's a duplicate question somewhere else on SO.)
The encoding used for the source character set controls how the bytes of your source code are interpreted into the characters used in the C++ language. The standard describes the members of the execution character set as having values. It is the encoding that maps these characters to their corresponding values that determines the integer value of '0'.
Although at least all of the members of the basic source character set plus some control characters and a null character with value zero must be present (with appropriate values) in the execution character set, there is no requirement for the encoding to be ASCII or to use ASCII values for any particular subset of characters (other than the null character).
No, the Standard is very careful not to specify what the source character encoding is.
C and C++ compilers run on EBCDIC computers too, you know, where '0' != 0x30.
However, I believe it is required that '1' == '0' + 1.
It's 0xF0 in EBCDIC. I've never used an EBCDIC compiler, but I'm told that they were all the rage at IBM for a while.
There's no requirement in the C++ standard that the source or execution encodings are ASCII-based. It is guaranteed that '0' == '1' - 1 (and in general that the digits are contiguous and in order). It is not guaranteed that the letters are contiguous, and indeed in EBCDIC 'J' != 'I' + 1 and 'S' != 'R' + 1.
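Because of that, letter arithmetic such as c - 'A' is not portable to such encodings; an explicit lookup works regardless of how the letters are laid out. A sketch (the function name is just for illustration):

#include <cstring>

// Returns 0-25 for an upper-case letter, -1 otherwise, without assuming
// that 'A'..'Z' are contiguous in the execution character set.
int upper_letter_index(char c) {
    static const char letters[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const char* p = (c != '\0') ? std::strchr(letters, c) : nullptr;
    return p ? static_cast<int>(p - letters) : -1;
}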
According to the C++11 standard draft N3225:
The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.
In short, the character set is not required to be mapped to the ASCII table, even though I've never heard of a different implementation.
I need to compile an old Fortran program that previously used a Compaq Fortran compiler. I can't seem to figure out what a constant that begins with a '#' is. gfortran says it's a syntax error, and I can't seem to find many answers.
CHAR2 = IATA(KK) - #20202030
CHAR3 = IATA(KK+1) - #20202030
What kind of constant is #20202030? According to the comments this code should take two ASCII characters in IATA and convert them to binary. Can someone explain this?
Further down:
IF (IATA(KK+1) .EQ. #2020202C) THEN
Now there is a 'C' at the end. What does that mean?
How can I port this over to gfortran? It feels like I'm missing something obvious. Please enlighten me.
Thanks!
What you are looking at is non-standard Fortran. In Compaq Fortran the # is used to prefix a hexadecimal constant, as one of the comments suggests. As the other comment suggests, the standard prefix for hexadecimal constants is Z and the digits should be enclosed in '' marks. So the non-standard #2020202C should translate to the standard Z'2020202C'.
As for the trailing C, I think that's just a hexadecimal digit.
Just a comment:
Besides being hexadecimal literals in non-standard notation, these are also ASCII strings fitted into 32-bit integer values. When stored in memory, #20202030 is '___0' or '0___' depending on the endianness of the architecture, while #2020202C is '___,' or ',___' (underscores represent blanks). Padding with blanks is standard Fortran behaviour, and storing 8-bit characters into 32-bit types padded with blanks instead of NULs, e.g. using #20202030 instead of #00000030, should come as no surprise to Fortran programmers.
In C and C++, subtracting '0' from another character is a very common way to convert characters like '0', '1', '2' and so on to their numeric equivalents (which absolutely fails to work with special Unicode digit symbols). E.g. '9' - '0' gives 9, since the ASCII code of '9' is 0x39 (57) while the ASCII code of '0' is 0x30 (48). Fortran does not treat CHARACTER values as integers the way C and C++ do, and one has to use ICHAR() or IACHAR() to convert them to their character codes, but this code still works much like its C/C++ counterpart would.
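To see why the constant is #20202030 rather than #00000030, the blank-padded packing can be reproduced in C++; a sketch that assumes an ASCII execution character set and builds the word arithmetically, so endianness does not matter:

#include <cstdint>
#include <iostream>

int main() {
    // Pack "   7" (three blanks plus the digit 7) into a 32-bit word, the way
    // blank-padded Fortran CHARACTER data ends up stored in an INTEGER.
    std::uint32_t word = (0x20u << 24) | (0x20u << 16) | (0x20u << 8) | std::uint32_t('7');

    // Subtracting 0x20202030 strips the blanks and the '0' offset in one go,
    // leaving the numeric value of the digit.
    std::cout << (word - 0x20202030u) << '\n';   // prints 7
}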
How is the IATA array defined? How are values assigned to its elements?