I'm using a piece of code (found elsewhere on this site) that checks endianness at runtime.
static bool isLittleEndian()
{
short int number = 0x1;
char *numPtr = (char*)&number;
std::cout << numPtr << std::endl;
std::cout << *numPtr << std::endl;
return (numPtr[0] == 1);
}
When in debug mode, the value of numPtr looks like this: 0x7fffffffe6ee "\001"
I assume the first hexadecimal part is the pointer's memory address, and the second part is the value it holds. I know that \0 is the null terminator in old-style C++ strings, but why is it at the front? Is it to do with endianness?
On a little-endian machine: is 01 the first and therefore least significant byte (byte place 0), and \0 the second and final byte (byte place 1)?
In addition, the cout statements do not print the pointer address or its value. What are the reasons for this?
The others have given you a clear answer to what "\001" means, so this is an answer to your other question:
On a little-endian machine: is 01 the first and therefore least significant byte (byte place 0), and \0 the second and final byte (byte place 1)?
Yes, this is correct. If you look at a value like 0x1234, it consists of two bytes, the high part 0x12 and the low part 0x34. The term "little endian" means that the low part is stored first in memory:
addr: 0x34
addr+1: 0x12
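A minimal sketch of making that layout visible, assuming a platform with 8-bit bytes; it copies the object representation of a 16-bit value into a byte array and prints the bytes:
#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
    std::uint16_t value = 0x1234;
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);  // copy the object representation
    // A little-endian machine prints "34 12"; a big-endian machine prints "12 34".
    std::cout << std::hex << int(bytes[0]) << ' ' << int(bytes[1]) << '\n';
}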
Did you know that the term "endian" predates the computer industry? It was originally used by Jonathan Swift in his book Gulliver's Travels, where it described whether people cracked their eggs at the pointy end or the round end.
The easiest way to check for endianness is to let the system do it for you (htonl is declared in <arpa/inet.h> on POSIX systems):
if (htonl(0xFFFF0000)==0xFFFF0000) printf("Big endian");
else printf("Little endian");
That's not a \0 followed by "01", it's the single character \001, which represents the number 1 in octal. That's the only byte "in" your string. There's another byte after it with the value zero, but you don't see that since it's treated as the string terminator.
For starters: this type of function is totally worthless: on a machine where sizeof(int) is 4, there are 24 possible byte orders. Most, of course, don't make sense, but I've seen at least three. And endianness isn't the only thing which affects integer representation. If you have an int and you want to get the low-order 8 bits, use intValue & 0xFF; for the next 8 bits, (intValue >> 8) & 0xFF.
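A short sketch of what that looks like in practice (using a fixed-width unsigned type for illustration); the result is the same regardless of the byte order in memory:
#include <cstdint>
#include <iostream>

int main()
{
    std::uint32_t intValue = 0x11223344;
    unsigned lowByte  = intValue & 0xFF;         // low-order 8 bits: 0x44
    unsigned nextByte = (intValue >> 8) & 0xFF;  // next 8 bits: 0x33
    std::cout << std::hex << lowByte << ' ' << nextByte << '\n';
}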
With regard to your precise question: I presume what you are describing as "looks like this" is what you see in the debugger when you break at the return. In this case, numPtr is a char* (an unsigned char const* would make more sense), so the debugger assumes a C-style string. The 0x7fffffffe6ee is the address; what follows is what the debugger sees as a C-style string, which it displays as a string, i.e. "...". Presumably, your platform is a traditional little-endian one (Intel); the bytes the pointer sees are, in sequence, the numeric values 1 and 0. The 0 is of course the equivalent of '\0', so the debugger considers this a one-character string, with that one character having the encoding 1.
There is no printable character with an encoding of 1, and it doesn't correspond to any of the normal escape sequences ('\n', '\t', etc.) either, so the debugger outputs it using the octal escape sequence: a '\' followed by 1 to 3 octal digits. (The traditional '\0' is just a special case of this: a '\' followed by a single octal digit.) It outputs 3 digits, probably because it doesn't want to look ahead to ensure that the next character isn't an octal digit. (If the sequence were the two bytes 1, 49, for example, 49 is '1' in the usual encodings, and if the debugger output only a single digit for the octal encoding of 1, the result would be "\11", which is a single-character string corresponding in the usual encodings to '\t'.) So what you see is: " (this is a string), \001 (its one character has the encoding 1 and no displayable representation), and " (that's the end of the string).
The "\001" you are seeing is just one byte. It's probably octal notation, which needs three digits to properly express the (decimal) values 0 to 255.
The \0 you see isn't a separate NUL; the debugger is showing you numPtr as a string, the first character of which is \001, or Control-A in ASCII. The second character is \000, which isn't displayed because NULs aren't shown when displaying strings. The two-character string version of 'number' would appear as "\000\001" on a big-endian machine, instead of "\001\000" as it appears on little-endian machines.
In addition, the cout statements do not print the pointer address or its value. What are the reasons for this?
Because chars and char pointers are treated differently than integers when it comes to printing.
When you print a char, it prints the character from whatever character set is being used. Usually, this is ASCII, or some superset of ASCII. The value 0x1 in ASCII is non-printing.
When you print a char pointer, it doesn't print the address; it prints what it points to as a null-terminated string.
To get the results you desire, cast your char pointer to a void pointer, and cast your char to an int.
std::cout << (void*)numPtr << std::endl;
std::cout << (int)*numPtr << std::endl;
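As a side note (a small alternative, not from the answer above), the unary plus operator also promotes a char to int, so the value can be printed numerically without a named cast:
std::cout << +*numPtr << std::endl;  // unary plus promotes the char to int; prints 1 here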
Related
I've seen that the program below takes only 7 bits of memory to store the character, but everywhere I've studied says that a char occupies 1 byte of memory, i.e. 8 bits.
Does a single character require 8 bits or 7 bits?
If it requires 8 bits, what will be stored in the other bit?
#include <iostream>
using namespace std;
int main()
{
char ch = 'a';
int val = ch;
while (val > 0)
{
(val % 2)? cout<<1<<" " : cout<<0<<" ";
val /= 2;
}
return 0;
}
Output:
1 0 0 0 0 1 1
The code below shows the memory gap between the characters, i.e. 7 bits:
9e9 <-> 9f0 <->......<-> a13
#include <iostream>
using namespace std;
int main()
{
char arr[] = {'k','r','i','s','h','n','a'};
for(int i=0;i<7;i++)
cout<<&arr+i<<endl;
return 0;
}
Output:
0x7fff999019e9
0x7fff999019f0
0x7fff999019f7
0x7fff999019fe
0x7fff99901a05
0x7fff99901a0c
0x7fff99901a13
Your first code sample doesn't print leading zero bits; as ASCII characters all have the upper bit set to zero, you'll only get at most seven bits printed if using ASCII characters. Extended ASCII characters or UTF-8 use the upper bit for characters outside the basic ASCII character set.
Your second example is actually printing addresses seven bytes apart, as if each character were seven bytes long, which is obviously incorrect. If you change the size of the array you are using to not be seven characters long you'll see different results.
&arr + i is equivalent to (&arr) + i: &arr is a pointer to char[7], which has a size of 7, so the +i adds 7 * i bytes to the pointer. (&arr) + 1 points just past the end of the array, so if you try printing the values these pointers point to you'll get junk or a crash: **(&arr + i).
Your code should be static_cast<void*>(&arr[i]); you'll then see the pointer going up by one for each iteration. The cast to void* is necessary to stop the standard library from trying to print the pointer as a null-terminated string.
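For reference, a corrected version of the loop from the question might look like this (a sketch, keeping the original array):
#include <iostream>
using namespace std;

int main()
{
    char arr[] = {'k','r','i','s','h','n','a'};
    for (int i = 0; i < 7; i++)
        cout << static_cast<void*>(&arr[i]) << endl;  // consecutive addresses, one byte apart
    return 0;
}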
It has nothing to do with the space assigned to a char. You are simply converting the ASCII representation of the char into binary.
ASCII is a 7-bit character set. In C it is normally represented by an 8-bit char. If the highest bit in an 8-bit byte is set, it is not an ASCII character; the eighth bit was used for parity, to communicate information between computers using different encodings.
ASCII stands for American Standard Code for Information Interchange, with the emphasis on American. The character set could not represent things like Arabic letters, or Latin letters with umlauts for example.
Vendors "extended" the ASCII set to use the extra 128 values that became available by using all 8 bits, which caused compatibility problems. Eventually, Unicode came along, which can represent every character, but the 8-bit char remained the standard.
I do not understand what's going on here. This is compiled with the GCC 10.2.0 compiler. Printing out the whole string gives different output than printing out each character.
#include <iostream>
int main(){
char str[] = "“”";
std::cout << str << std::endl;
std::cout << str[0] << str[1] << std::endl;
}
Output
“”
��
Why are not the two outputted lines the same? I would expect the same line twice. Printing out alphanumeric characters does output the same line twice.
Bear in mind that, on almost all systems, the maximum value a (signed) char can hold is 127. So, more likely than not, your two 'special' characters are actually being encoded as multi-byte combinations.
In such a case, passing the string pointer to std::cout will keep feeding data from that buffer until a zero (nul-terminator) byte is encountered. Further, it appears that, on your system, the std::cout stream can properly interpret multi-byte character sequences, so it shows the expected characters.
However, when you pass the individual char elements, as str[0] and str[1], there is no possibility of parsing those arguments as components of multi-byte characters: each is interpreted 'as-is', and those values do not correspond to valid, printable characters, so the 'weird' � symbol is shown, instead.
"“”" contains more bytes than you think. It's usually encoded as utf8. To see that, you can print the size of the array:
std::cout << sizeof str << '\n';
Prints 7 in my testing. UTF-8 is a multi-byte encoding, which means a character can be encoded in multiple bytes. Here you're printing individual bytes of a UTF-8 encoded string, which are not printable characters themselves. That's why you get � when you try to print them.
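For what it's worth, a small sketch (assuming the source file is saved as UTF-8) that prints each byte of the array in hex, nul terminator included:
#include <cstddef>
#include <iostream>

int main()
{
    char str[] = "“”";   // with UTF-8 source encoding: e2 80 9c e2 80 9d plus the nul terminator
    for (std::size_t i = 0; i < sizeof str; ++i)
        std::cout << std::hex << static_cast<int>(static_cast<unsigned char>(str[i])) << ' ';
    std::cout << '\n';   // under that assumption, prints: e2 80 9c e2 80 9d 0
}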
I have an example:
int var = 5;
char ch = (char)var;
char ch2 = var+48;
cout << ch << endl;
cout << ch2 << endl;
I had some other code. (char) returned the wrong answer, but +48 didn't. When I changed ONLY (char) to +48, my code worked.
What is the difference between converting int to char by using (char) and +48 (ASCII) in C++?
char ch = (char)var; has the same effect as char ch = var; and assigns the numeric value 5 to ch. You're using ASCII (supported by all modern systems), and ASCII character code 5 represents Enquiry ('ENQ'), an old terminal control code. Perhaps some old-timer has a clue what it did!
char ch2 = var+48; assigns the numeric value 53 to ch2 which happens to represent the ASCII character for the digit '5'. ASCII 48 is zero (0) and the digits all appear in the ASCII table in order after that. So 48+5 lands on 53 (which represents the character '5').
In C++, char is an integer type. The value is interpreted as representing an ASCII character, but it should be thought of as holding a number.
Its numeric range is either [-128,127] or [0,255]. That's because C++ requires sizeof(char)==1 and all modern platforms have 8 bit bytes.
NB: C++ doesn't actually mandate ASCII, but again that will be the case on all modern platforms.
PS: I think it's an unfortunate artifact of C (inherited by C++) that sizeof(char)==1 and there isn't a separate fundamental type called byte.
A char is simply the smallest integral denomination in C++. Output statements like cout and printf map char values to the corresponding character mapping; on Windows computers this is typically ASCII.
Note that value 5 in ASCII maps to the Enquiry character, which has no printable representation, while value 53 maps to the printable character '5'.
A generally accepted hack to store a number 0-9 in a char is to do: const char ch = var + '0'; It's important to note the shortcomings here:
If your code is running on some non-ASCII character mapping, the characters '0' through '9' may not be laid out in order, in which case this wouldn't work
If var is outside the 0-9 range, var + '0' will map to something other than a digit character
A way to get the most significant digit of a number that is not subject to shortcomings 1 or 2 is to use:
const auto ch = to_string(var).front();
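A small sketch contrasting the two approaches (assuming an ASCII-compatible character set):
#include <iostream>
#include <string>

int main()
{
    int var = 5;
    char digit = var + '0';                    // '5' on ASCII-compatible systems
    char first = std::to_string(var).front();  // first character of the decimal representation
    std::cout << digit << ' ' << first << '\n';  // prints: 5 5
}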
Generally a char represents a number just as an int does. Casting an int value to char doesn't give you its ASCII digit representation.
The ASCII codes for the digits range from 48 (== '0') to 57 (== '9'). So to get the printable digit you have to add '0' (or 48).
The difference is that casting with (char) explicitly converts the value to a char, while adding 48 changes the value to that of the printable digit without changing the type.
It's important to note that an int is typically 32 bits and a char is typically 8 bits. This means that the number you can store in a char ranges from -128 to +127 (or 0 to 255, i.e. 2^8 - 1, if you use unsigned char), and in an int from -2,147,483,648 (-2^31) to 2,147,483,647 (2^31 - 1) (or 0 to 2^32 - 1 for unsigned).
Adding 48 to a value is not changing the type to char.
I have a valid UTF-8 encoded string in a std::string. I have a limit in bytes. I would like to truncate the string and add "..." at MAX_SIZE - 3 - x, where x is the value that will prevent a UTF-8 character from being cut.
Is there a function that could determine x based on MAX_SIZE without the need to start from the beginning of the string?
If you have a location in a string, and you want to go backwards to find the start of a UTF-8 character (and therefore a valid place to cut), this is fairly easily done.
You start from the last byte in the sequence. If the top two bits of the last byte are 10, then it is part of a UTF-8 sequence, so keep backing up until the top two bits are not 10 (or until you reach the start).
The way UTF-8 works is that a byte can be one of three things, based on the upper bits of the byte. If the topmost bit is 0, then the byte is an ASCII character, and the next 7 bits are the Unicode Codepoint value itself. If the topmost bit is 10, then the 6 bits that follow are extra bits for a multi-byte sequence. But the beginning of a multibyte sequence is coded with 11 in the top 2 bits.
So if the top bits of a byte are not 10, then it's either an ASCII character or the start of a multibyte sequence. Either way, it's a valid place to cut.
Note however that, while this algorithm will break the string at codepoint boundaries, it ignores Unicode grapheme clusters. This means that combining characters can be cut away from the base characters that they combine with; accents can be lost from characters, for example. Doing proper grapheme cluster analysis would require access to the Unicode table that says whether a codepoint is a combining character.
But it will at least be a valid Unicode UTF-8 string. So that's better than most people do ;)
The code would look something like this (in C++14):
#include <cassert>
#include <cstddef>
#include <string>

std::size_t FindCutPosition(const std::string &str, std::size_t max_size)
{
    assert(str.size() >= max_size && "Make sure stupidity hasn't happened.");
    assert(str.size() > 3 && "Make sure stupidity hasn't happened.");
    max_size -= 3; //Leave room for the "..."
    for(std::size_t pos = max_size; pos > 0; --pos)
    {
        unsigned char byte = static_cast<unsigned char>(str[pos]); //Perfectly valid
        if((byte & 0xC0) != 0x80) //Not a UTF-8 continuation byte, so a valid place to cut
            return pos;
    }
    unsigned char byte = static_cast<unsigned char>(str[0]); //Perfectly valid
    if((byte & 0xC0) != 0x80)
        return 0;
    //If your first byte isn't even a valid UTF-8 starting point, then something terrible has happened.
    throw bad_utf8_encoded_text(...);
}
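Usage might look something like this; truncate_utf8 is a hypothetical wrapper, not part of the answer above:
std::string truncate_utf8(const std::string &str, std::size_t max_size)
{
    if (str.size() <= max_size)
        return str;                                    // already short enough, nothing to cut
    std::size_t cut = FindCutPosition(str, max_size);  // a codepoint boundary at or before max_size - 3
    return str.substr(0, cut) + "...";                 // total length stays within max_size
}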
The bit of code below is extracting ASCII codes from characters.
When I convert characters in the normal ASCII region I get the value I expect.
When I convert £ and € from the extended region I get a load of 1's padding the INT that I'm storing the character in.
e.g. the output of the below is:
45 (ascii E as expected)
FFFFFF80 (extended ascii € as expected but padded with ones)
It's not causing me an issue but I'm just wondering why this happens.
Here's the code...
unsigned int asciichar[3];
string cTextToEncode = "E€";
for (unsigned int i = 0; i < cTextToEncode.length(); i++)
{
asciichar[i] = (unsigned int)cTextToEncode[i];
cout << hex << asciichar[i] << "\n";
}
Can anyone explain why this is?
Thanks
Depending on the implementation, a char can be either signed or unsigned. In your case they appear to be signed, so 0x80 is interpreted as -128 instead of 128; hence when cast to an integer it becomes 0xffffff80.
BTW, this has nothing at all to do with ASCII.
First, there's no € in ASCII (extended or otherwise) because the euro didn't exist when ASCII was created. However, several ASCII-friendly 8-bit encodings do support the € character, but the conversion is done by your source code editor (the compiler merely sees a byte which happens to represent € in your editor, but might be something else entirely on, say, a computer in Israel).
Second, (unsigned int) casts do not extract the ASCII encoding of a character. They merely convert the value of the underlying numeric char type to an unsigned integer. This causes strange things to happen when the converted value is negative: on your compiler, char happens to be signed, and thus characters with a code larger than 127 end up as negative char values.
You should convert to an unsigned char first, and then to an unsigned int.
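Applied to the loop from the question, the assignment might become something like:
asciichar[i] = static_cast<unsigned int>(static_cast<unsigned char>(cTextToEncode[i]));  // no sign extension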
You should be careful when promoting signed values.
When promoting a signed char to a signed int, the first bit (the sign bit) is taken into account. The algorithm roughly looks like this:
1) If you have 1X-XX-XX-XX (the char in binary, X being any binary digit), then the int will be 1...1-1X-XX-XX-XX (binary, starting with 24 ones) -> 0xFFFFFFYY (hex)
2) If you have 0X-XX-XX-XX (binary), then you'll have 0...0-0X-XX-XX-XX (binary, starting with 24 zeroes) -> 0x000000YY (hex)
In your case you want to force rule #2 all the time. In order to do that, you need to tell the compiler to ignore the first bit (the sign bit). For this you need to use unsigned char.
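A small sketch showing the two rules side by side (assuming plain char is signed on the platform, as it appears to be here):
#include <iostream>

int main()
{
    char c = static_cast<char>(0x80);  // assuming plain char is signed, this holds -128
    unsigned char u = 0x80;
    std::cout << std::hex
              << (unsigned int)c << '\n'   // rule #1: sign-extended, prints ffffff80
              << (unsigned int)u << '\n';  // rule #2: zero-extended, prints 80
}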