Unexpected results when looking at ASCII codes in C++ - c++

The bit of code below is extracting ASCII codes from characters.
When I convert characters in the normal ASCII region I get the value I expect.
When I convert £ and € from the extened region I get a load of 1's padding the INT that I'm storing the character in.
e.g. the output of the below is:
45 (ascii E as expected)
FFFFFF80 (extended ascii € as expected but padded with ones)
It's not causing me an issue but I'm just wondering why this happens.
Here's the code...
unsigned int asciichar[3];
string cTextToEncode = "E€";
for (unsigned int i = 0; i < cTextToEncode.length(); i++)
{
asciichar[i] = (unsigned int)cTextToEncode[i];
cout << hex << asciichar[i] << "\n";
}
Can anyone explain why this is?
Thanks

depending on the implementation a char can be either signed or unsigned. In your case they appear to be signed, so 0x80 is interpreted as -128 instead of 128, hence when cast to an integer it becomes 0xffffff80.
btw, this has nothing at all to do with ASCII

First, there's no € in ASCII (extended or otherwise) because the euro didn't exist when ASCII was created. However, several ASCII-friendly 8-bit encodings do support the € character, but the conversion is done by your source code editor (the compiler merely sees a byte which happens to represent € in your editor, but might be something else entirely on, say, a computer in Israel).
Second, (unsigned int) casts do not extract the ASCII encoding of a character. They merely convert the value of the underlying numeric char type to an unsigned integer. This causes strange things to happen when the converted value is negative - on your compiler, char happens to be signed char and thus characters with an ASCII value larger than 127 end up being negative char values.
You should convert to an unsigned char first, and then to an unsigned int.

You should be careful when promoting signed values.
When promoting signed char to signed int a first bit (sign bit) is taken into account. The algorithm is roughly look like this:
1) If you have 1X-XX-XX-XX (char in binary, X - any binary digit) then int will be (starts with 24 ones) 1...1-1X-XX-XX-XX (binary) -> 0xFFFFFFYY (hex)
2) if you have 0X-XX-XX-XX (binary), then you'll have (starts with 24 zeroes) 0...0-0X-XX-XX-XX (binary) -> 0x000000YY (hex).
In your case you want to force rule #2 all the time. In order to do this, you need to tell compiler to ignore first bit (sign bit). For this you need to use unsigned char.

Related

Difference between converting int to char by (char) and by ASCII

I have an example:
int var = 5;
char ch = (char)var;
char ch2 = var+48;
cout << ch << endl;
cout << ch2 << endl;
I had some other code. (char) returned wrong answer, but +48 didn't. When I changed ONLY (char) to +48, then my code got corrected.
What is the difference between converting int to char by using (char) and +48 (ASCII) in C++?
char ch=(char)var; has the same effect as char ch=var; and assigns the numeric value 5 to ch. You're using ASCII (supported by all modern systems) and ASCII character code 5 represents Enquiry 'ENQ' an old terminal control code. Perhaps some old timer has a clue what it did!
char ch2 = var+48; assigns the numeric value 53 to ch2 which happens to represent the ASCII character for the digit '5'. ASCII 48 is zero (0) and the digits all appear in the ASCII table in order after that. So 48+5 lands on 53 (which represents the character '5').
In C++ char is a integer type. The value is interpreted as representing an ASCII character but it should be thought of as holding a number.
Its numeric range is either [-128,127] or [0,255]. That's because C++ requires sizeof(char)==1 and all modern platforms have 8 bit bytes.
NB: C++ doesn't actually mandate ASCII, but again that will be the case on all modern platforms.
PS: I think its an unfortunate artifact of C (inherited by C++) that sizeof(char)==1 and there isn't a separate fundamental type called byte.
A char is simply the base integral denomination in c++. Output statements, like cout and printf map char integers to the corresponding character mapping. On Windows computers this is typically ASCII.
Note that the 5th in ASCII maps to the Enquiry character which has no printable character, while the 53rd character maps to the printable character 5.
A generally accepted hack to store a number 0-9 in a char is to do: const char ch = var + '0' It's important to note the shortcomings here:
If your code is running on some non-ASCII character mapping then characters 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 may not be laid out in order in which case this wouldn't work
If var is outside the 0 - 9 range this var + '0' will map to something other than a numeric character mapping
A guaranteed way to get the most significant digit of a number independent of 1 or 2 is to use:
const auto ch = to_string(var).front()
Generally char represents a number as int does. Casting an int value to char doesn't provide it's ASCII representation.
The ASCII codes as numbers for digits range from 48 (== '0') to 58 (== '9'). So to get the printable digit you have to add '0' (or 48).
The difference is that casting to char (char) explicitly converts the digit to a char and adding 48 do not.
Its important to note that an int is typically 32 bit and char is typically 8 bit. This means that the number you can store in a char is from -127 to +127(or 0 to 255-(2^8-1) if you use unsigned char) and in an int from −2,147,483,648 (−231) to 2,147,483,647 (231 − 1)(or 0 to 2^32 -1 for unsigned).
Adding 48 to a value is not changing the type to char.

How to convert single char to double

I try to create a program that can evaluate simple math expression like "4+4". The expression is given from the user.
The program saves it in a char* and then searches for binary operation (+,-,*,:) and does the operation.
The problem is that I can't figure out how to convert the single char into a double value.
I know there is the atof function but I want to convert single char.
There is a way to do that without creating a char*?
A char usually represents a character. However, a single char is simply an integer in range of at least [-127,+127] (signed version) or at least [0,255] (unsigned version).
If you obtained a character looking as a digit, the value stored in it is an ASCII number representing it. Digits start at code 48 (for zero) and go up incrementally till code 57 (for nine). Thus, if you take the code and subtract 48, you get the integer value. From there, converting it to double is a matter of casting.
Thus:
char digit = ...
double value = double(digit - 48);
or even better, for convenience:
char digit = ...
double value = double(digit - '0'); //'0' has a built-in value 48
There is a way to do that without creating a char* ???
Sure. You can extract the digit number from a single char as follows:
char c = '4';
double d = c - '0';
// ^^^^^^^ this expression results in a numeric value that can be converted
// to double
This uses the circumstance that certain character tables like ASCII or EBCDIC encode the digits in a continuous set of values starting at '0'.

printf escaped unicode character from integer

I'm doing a rewrite of this question.
I want to create a string with a unicode escaped character such as "\u03B1" using an integer constant. For example, this string is the greek letter alpha.
const char *alpha = "\u03B1"
I want to construct the same string using a call to printf using the integer value 0x03B1. For this example it can be done like this but I'm not sure to get those two numbers from 0x03B1.
printf("%c%c", 206, 177);
This link explains what to do but I'm not sure how to do it.
http://www.fileformat.info/info/unicode/utf8.htm
For characters equal to or below 2047 (hex 0x07FF), the UTF-8
representation is spread across two bytes. The first byte will have
the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The
second byte will have the top bit set and the second bit clear (i.e.
0x80 to 0xBF).
NOTE: I do not want to create the string "\\u03B1" with a backslash. This is different than "\u03B1" which is an escaped unicode character.
It appears that even the most recent C and C++ standards are a bit disappointing in their handling of Unicode.
For those who are confused about the example in the question, like I was:
const char *alpha = "\u03B1"
In C99, this will store a pointer to the string "α" (U+03B1) in alpha. In C89, this is invalid syntax.
I could not find a way to use the \u syntax with a variable or integer constant, like what the question was requesting. You may be better off using a library to add better Unicode support to your program. I have not used the ICU library, but it sounds promising.
How to convert a Unicode code point to characters in C++ using ICU?: possibly an answer to your question
Unicode Processing in C++: a related Stack Overflow question
I figured it out.
The first byte contains the 5 upper bits 0x7c0 is 11111000000 and the second byte contains the lower 5 bits 0x3f is 00000111111 of the unicode value.
The first byte uses the mask 0xc0 is 11000000 to set the two high bits and the second byte uses 0x80 is 10000000 to set the first high bit.
int alpha = 0x03B1; // 945
char byte1 = 0xc0 | ((alpha & 0x7c0) >> 6); // 206
char byte2 = 0x80 | (alpha & 0x3f); // 177
printf("%c%c", byte1, byte2);

why does ascii value vary for signed and unsigned character in c++?

I have tried these 2 following codes:
int main()
{
int val=-125;
char code=val;
cout<<"\t"<<code<<" "<<(int)code;
getch();
}
The output i got is a^ -125
The second code is:
int main()
{
int val=-125;
unsigned char code=val;
cout<<"\t"<<code<<" "<<(int)code;
getch();
}
The output i got is: a^ 131
after trying both the codes is it safe to conclude that a character can have 2 ASCII values or my approach to find ASCII value(s) is flawed?
P.S.-
I was unable to upload the pictures of my output, so I am forced to type the output where the character I got isn't present in the standard keyboard.
In both examples 'code' has the same bitwise value. The first bit is 1, because it was a negativ number. Since both 'codes' have the same value the output character is the same (converting from number->character treats the number as an unsigned value).
After that you convert your character back to a (signed) interger. This conversion respects the type and the sign of you char.
->unsigned char -> int -> int always positiv
->char -> int -> int has the same sign as the char (and because the first bit was 1 it's negativ here)
unsigned integers in C++ have modulo 2n behavior, where n is the number of value bits.
that means if your char has 8 bits, then unsigned char has modulo 256 behavior.
this behavior is as if the values 0 through 255 were placed on a clockface. any operation that produces a result that goes past the 0-255 divide just effectively wraps around. just like arithmetic with hours on a clockface.
which means that assigning the value -125 yields the corresponding value in the range 0 through 255, namely -125 + 256 = 131.

Understanding Endianness - a variable value

I'm using a piece of code (found else where on this site) that checks endianness at runtime.
static bool isLittleEndian()
{
short int number = 0x1;
char *numPtr = (char*)&number;
std::cout << numPtr << std::endl;
std::cout << *numPtr << std::endl;
return (numPtr[0] == 1);
}
When in debug mode, the value numPtr looks like this: 0x7fffffffe6ee "\001"
I assume the first hexadecimal part is the pointer's memory address, and the second part is the value it holds. I'm know that \0 is null termination in old-style C++, but why is it at the front? Is it to do with endianness?
On a little-endian machine: 01 the first byte and therefore least significant (byte place 0), and \0 the second byte/final byte (byte place 1)?
In addition, the cout statements do not print the pointer address or it's value. Reasons for this?
The others have given you a clear answer to what "\000" means, so this is an answer to your question:
On a little-endian machine: 01 the first byte and therefore least significant (byte place 0), and \0 the second byte/final byte (byte place 1)?
Yes, this is correct. Of you look at value like 0x1234, it consists of two bytes, the high part 0x12 and the low part 0x34. The term "little endian" means that the low part is stored first in memory:
addr: 0x34
addr+1: 0x12
Did you known that the term "endian" predated the computer industry? It was originally used by Jonathan Swift in his book Gulliver's Travels, where it described if people were eating the egg from the pointy or the round end.
the easiest way to check for endianness is to let the system do it for you:
if (htonl(0xFFFF0000)==0xFFFF0000) printf("Big endian");
else printf("Little endian");
That's not a \0 followed by "01", it's the single character \001, which represents the number 1 in octal. That's the only byte "in" your string. There's another byte after it with the value zero, but you don't see that since it's treated as the string terminator.
For starters: this type of function is totally worthless: on a machine
where sizeof(int) is 4, there are 24 possible byte orders. Most, of
course, don't make sense, but I've seen at least three. And endianness
isn't the only thing which affects integer representation. If you have
an int, and you want to get the low order 8 bits, use intValue &
0xFF, for the next 8 bits, (intValue >> 8) & 0xFF.
With regards to your precise question: I presume what you are describing
as "looks like this" is what you see in the debugger, when you break at
the return. In this case, numPtr is a char* (a unsigned char
const* would make more sense), so the debugger assumes a C style
string. The 0x7fffffffe6ee is the address; what follows is what the
compiler sees as a C style string, which it displays as a string, i.e.
"...". Presumably, your platform is a traditional little-endian
(Intel); the pointer to the C style string sees the sequence (numeric
values) of 1, 0. The 0 is of course the equivalent of '\0', so it
considers this a one character string, with that one character having
the encoding of 1. There is no printable character with an encoding of
one, and it doesn't correspond to any of the normal escape sequences
(e.g. '\n', '\t', etc.) either. So the debugger outputs it using
the octal escape sequence, a '\' followed by 1 to 3 octal digits.
(The traditional '\0' is just a special case of this; a '\' followed
by a single octal digit.) And it outputs 3 digits, because (probably)
it doesn't want to look ahead to ensure that the next character isn't an
octal digit. (If the sequence were the two bytes 1, 49, for example,
49 is '1' in the usual encodings, and if it output only a single byte
for the octal encoding of 1, the results would be "\11", which is a
single character string—corresponding in the usual encodings to
'\t'.) So you get " this is a string, \001 with first character
having an encoding of 1 (and no displayable representation), and "
that's the end of the string.
The "\001" you are seeing is just one byte. It's probably octal notation, which needs three digits to properly express the (decimal) values 0 to 255.
The \0 isn't a NUL, the debugger is showing you numPtr as a string, the first character of which is \001 or control-A in ASCII. The second character is \000, which isn't displayed because NULs aren't shown when displaying strings. The two character string version of 'number' would appear as "\000\001" on a big-endian machine, instead of "\001\000" as it appears on little-endian machines.
In addition, the cout statements do not print the pointer address or
it's value. Reasons for this?
Because chars and char pointers are treated differently than integers when it comes to printing.
When you print a char, it prints the character from whatever character set is being used. Usually, this is ASCII, or some superset of ASCII. The value 0x1 in ASCII is non-printing.
When you print a char pointer, it doesn't print the address, it prints it as a null-terminated string.
To get the results you desire, cast your char pointer to a void pointer, and cast your char to an int.
std::cout << (void*)numPtr << std::endl;
std::cout << (int)*numPtr << std::endl;