I came across this example here
#include <iostream>

int main()
{
    int i = 7;
    char* p = reinterpret_cast<char*>(&i);
    if (p[0] == '\x7') // POINT OF INTEREST
        std::cout << "This system is little-endian\n";
    else
        std::cout << "This system is big-endian\n";
}
What I'm confused about is the if statement. How do the escape sequences behave here? I get the same result with p[0] == '\07' (\x being hexadecimal escape sequence). How would checking if p[0] == '\x7' tell me if the system is little or big endian?
The layout of a (32-bit) integer in memory is;
Big endian:
+-----+-----+-----+-----+
| 0 | 0 | 0 | 7 |
+-----+-----+-----+-----+
^ pointer to int points here
Little endian:
+-----+-----+-----+-----+
| 7 | 0 | 0 | 0 |
+-----+-----+-----+-----+
^ pointer to int points here
What the code basically does is read the first char that the integer pointer points to, which in the little-endian case is \x7, and in the big-endian case is \x0.
Hex 7 and octal 7 happen to be the same value, as does decimal 7.
The check is intended to try to determine if the value ends up in the first or last byte of the int.
A little endian system will store the bytes of the value in "reverse" order, with the lower part first
07 00 00 00
A big endian system would store the "big" end first
00 00 00 07
By reading the first byte, the code will see if the 7 ends up there, or not.
7 in decimal is the same as 7 in hexadecimal and 7 in octal, so it doesn't matter if you use '\x7', '\07', or even just 7 (numeric literal, not a character one).
As for the endianness test: the value of i is 7, meaning it will have the number 7 in its least significant byte, and 0 in all other bytes. The cast char* p = reinterpret_cast<char*>(&i); makes p point to the first byte in the representation of i. The test then checks whether that byte's value is 7. If so, it's the least significant byte, implying a little-endian system. If the value is not 7, it's not a little-endian system. The code assumes that it's big-endian, although that's not strictly established (I believe there were exotic systems with some sort of mixed endianness as well, although the code will probably not run on such in practice).
Related
I want to merge the bytes from two unsigned long parameters: the half that starts with the least significant byte taken from the second param, and the rest taken from the first param.
For example:
x = 0x89ABCDEF12893456
y = 0x76543210ABCDEF19
result_merged = 0x89ABCDEFABCDEF19
First, I need to check whether the system that I work on is little endian or big endian. I already wrote a function that checks that, called is_big_endian().
Now I know that char *c = (char*) &y will give me the "first" (MSB) or "last" (LSB) byte of y (depending on whether the system is big endian or not).
Now, I want to use the bitwise AND (&) operator to merge the bytes of x and y; the question is how I can get only half of the bytes, starting from the LSB.
I mean, I can use a "for" loop to go over sizeof and then split by 2, but I'm confused how exactly I should do it.
And I also thought about "masking" the bytes, because I already know for sure that the given parameters are "long", which I assumed here means 64 bits. So maybe I can mask them in the following way?
I want to be able to use it both on 32- and 64-bit systems, which means my code is wrong because I'm using a fixed size of 64-bit long here, although I don't know what system the code runs on.
I thought about using an array to store all the bits or maybe use shifting?
unsigned long merge_bytes(unsigned long x, unsigned long int y)
{
    if (is_big_endian() == 0) {
        // little endian system
        return (y & 0xFFFFFFFF00000000) | (x & 0xFFFFFFFFFFFF);
    }
    else
    {
        return (y & 0x00000000FFFFFFFF) | (x & 0xFFFFFFFFFFFF);
    }
}
I have "masked" the right side of the bits if that's a little endian system because the LSB there is the furthest to the left bit.
And did the opposite if this is a big endian system.
Any help would be appreciated.
Your code is almost correct. You want this:
merged = (y & 0x00000000ffffffff) | (x & 0xffffffff00000000);
There is no need to distinguish between big and little endian. The high bits of a value are the high bits of the value.
The difference is only the representation in memory.
Example: storage of the value 0x12345678 at memory location 0x0000
Little endian:
Address byte
-------------
0000 78
0001 56
0002 34
0003 12
Big endian:
Address byte
-------------
0000 12
0001 34
0002 56
0003 78
From the book Stroustrup, Programming: Principles and Practice Using C++. In §17.3, about memory, addresses and pointers, it is supposed to be allowed to assign a char* to an int*:
char ch1 = 'a';
char ch2 = 'b';
char ch3 = 'c';
char ch4 = 'd';
int* pi = &ch3; // point to ch3, a char-size piece of memory
*pi = 12345; // write to an int-size piece of memory
*pi = 67890;
Quoting from the source:
Had the compiler allowed the code, we would have been writing 12345 to the memory starting at &ch3. That would definitely have changed the value of some nearby memory, such as ch2 or ch4, or we would have overwritten part of pi itself.
In that case, the next assignment *pi = 67890 would place 67890 in some completely different part of memory.
I don't understand why the next assignment would place it "in some completely different part of memory". The address stored in int *pi is still &ch3, so that assignment would just overwrite the content at that address, i.e. 12345. Why isn't that so?
Please, can you help me? Many thanks!
char ch3 = 'c';
int* pi = &ch3;
it is supposed to be allowed to assign a char* to int*:
Not quite - there is an alignment concern. It is undefined behavior (UB) when
If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined. C17dr § 6.3.2.3 7
Example: Some processors require an int * to be even, and if &ch3 was odd, storing the address might fail, and de-referencing the address certainly fails: bus error.
The next is certainly UB as the destination is outside the memory of ch3.
ch1, ch2, ch4 might be nearby and provide some reasonable undefined behavior, but the result is UB.
// undefined behavior
*pi = 12345; // write to an int-size piece of memory
When code attempts to write outside its bounds - it is UB, anything may happen, including writing into neighboring data.
The address stored in int *pi is still &ch3
Maybe, maybe not. UB has occurred.
why the next assignment would place it: in some completely different part of memory?
The abusive code suggests that pi itself is overwritten by *pi = 12345;. This might happen, it might not. It is UB. A subsequent use of *pi is simply more UB.
Recall with UB you might get what you hope for, you might not - it is not defined by C.
You seem to have skipped part of the explanation you quoted:
or we would have overwritten part of pi itself
Think of it this way: since ints are larger than chars, if an int* points to an address location that stores a char, memory will overflow when you attempt to assign an integer value to that location; you only have a single byte of memory allocated but are assigning 4 bytes worth of data. You cannot fit 4 bytes of data into one, so the other 3 bytes will go somewhere.
Assume then that the overflowing bytes partially change the value stored in pi. Now the next assignment will go to a random memory location.
Let's assume the memory address layout is:
0 1 2 3 4 5 6 7
Bytes 0, 1, 2 and 3 hold the four characters; bytes 4, 5, 6 and 7 hold the int* pi.
The values in each byte in hex may be:
61 62 63 64 02 00 00 00
Note how the first four are ASCII values and the last four are the address of ch3. Writing *pi = 12345; changes the values like so:
61 62 39 30 00 00 00 00
The byte sequence 39 30 00 00 is 12345 (0x3039) stored little-endian.
The next write *pi = 67890; would start from memory address 00 00 00 00, not 02 00 00 00 as one might expect.
Firstly, you have to understand that everything is a number i.e., a char, int, int* all contain numbers. Memory addresses are also numbers. Let's assume the current example compiles and we have memory like following:
--------------------------
Address | Variable | Value
--------------------------
0x01    | ch1      | a
0x02    | ch2      | b
0x03    | ch3      | c
0x04    | ch4      | d
0x05    | pi       | &ch3 = 0x03
Now let's dereference pi and reassign a new value to ch3:
*pi = 12345;
Let's assume int is 4 bytes. Since pi is an int pointer, it will write a 4 byte value to the location pointed by pi. Now, char can only contain one byte values so, what would happen if we try to write 4 bytes to that location? Strictly speaking, this is undefined behaviour but I will try to explain what the author means.
Since char cannot contain values larger than 1 byte, *pi = 12345 will overflow ch3. When this overflow happens, the remaining 3 of the 4 bytes may get written into the nearby memory locations. What memory locations do we have nearby? ch4 and pi itself! ch4 can only contain 1 byte as well, which leaves us with 2 bytes, and the next location is pi itself. Meaning pi will overwrite its own value!
--------------------------
Address | Variable | Value
--------------------------
0x01    | ch1      | a
0x02    | ch2      | b
0x03    | ch3      | 12   // 12 ended up here
0x04    | ch4      | 34   // 34 ended up here
0x05    | pi       | 5    // 5 gets written here, clobbering 0x03
As you can see, pi is now pointing to some other memory address, which is definitely not ch3.
I know this is not a good way to use void pointers, but I am curious about this behavior.
#include <iostream>
using namespace std;

int main()
{
    void* ptr;
    void* next;
    ptr = ::operator new(8);
    next = (void*)((char*)ptr + 1);
    *(int*)next = 90;
    *(int*)ptr = 100;
    cout << "Ptr = " << ptr << endl;
    cout << "Ptr-> " << *(int*)ptr << endl;
    cout << "Next = " << next << endl;
    cout << "Next-> " << *(int*)next << endl;
}
---------------
Result :
Ptr = 0x234e010
Ptr-> 100
Next = 0x234e011
Next-> 0
---------------
On both 64-bit (x86_64) and 32-bit (x86) platforms (Linux and Windows), when I add more than 4 bytes to ptr, the 90 remains where next points, but when I add fewer than 4, *next suddenly becomes 0 when *ptr = 100 is executed. Same result with calloc and malloc.
Why does changing the content of where ptr points change the content of where ptr + 1 points? And why does this not happen when we go more than 4 bytes out? And if it is about memory alignment, why the same behavior on both 32-bit and 64-bit platforms?
Also the same result in Delphi (Windows 32-bit/Embarcadero).
Thank you and sorry for my bad English
As you're on an X86 architecture, alignment won't be a problem: the processor can access unaligned memory, even if it's slower.
Intel processors store data with the least significant byte FIRST. Google about endianness to get a more in-depth explanation. What happens is
  vvvvvvvvvvvvvvv     this is Ptr
+===+===+===+===+===+===+===+===+
|   | 90| 0 | 0 | 0 |   |   |   |
+===+===+===+===+===+===+===+===+
      ^^^^^^^^^^^^^^^     this is next, after you say '*next = 90'
now when you store 100 to *Ptr, this changes to
  vvvvvvvvvvvvvvv     this is Ptr
+===+===+===+===+===+===+===+===+
|100| 0 | 0 | 0 | 0 |   |   |   |
+===+===+===+===+===+===+===+===+
      ^^^^^^^^^^^^^^^     this is next
so next becomes 0, because its first 3 bytes get overwritten by 0, and the last byte was 0 anyway.
ints are (typically) 4 bytes. You're only adding one byte to ptr for next rather than 4. This means that ptr and next overlap, so your assignment of 100 clobbers the low bytes of the 90 (which happens to zero it out).
You need:
next = (void*)((char*)ptr + sizeof(int));
And really you need to just use properly typed pointers.
What appears to be happening is that the eight bytes of memory have the following contents after each operation:
*(int*)next = 90; // 90 == 0x00 00 00 5A
--> 0x?? 5A 00 00 00 ?? ?? ??
*(int*)ptr = 100; // 100 == 0x00 00 00 64
--> 0x64 00 00 00 00 ?? ?? ??
As a result, *(int*)next is now equal to zero. Note that you can't count on this behavior, since it depends on the underlying hardware.
This has nothing to do with allocation or void*, and everything to do with your making assumptions about how values are stored in memory.
You wrote an int, a unit occupying 4 bytes on your system, to address 101 and then wrote another to address 100, overwriting 3 bytes of the first value.
Unfortunately, the system you used doesn't use the endianness/byte ordering you assumed, whereby the LSB of the first value would occupy address 104. The result was that one of the four zero bytes from your second write made the first value zero.
This is because you invoked undefined behavior using char offsets to bit twiddle int values.
When ints become 8 bytes, anyone doing this kind of hand manipulation of pointers will have the same problem.
Incidentally, you should be able to tell from this something about the byte order of your system's numeric representation.
What I must do is open a file in binary mode that contains stored data that is intended to be interpreted as integers. I have seen other examples such as Stackoverflow-Reading “integer” size bytes from a char* array. but I want to try taking a different approach (I may just be stubborn, or stupid :/). I first created a simple binary file in a hex editor that reads as follows.
00 00 00 47 00 00 00 17 00 00 00 41
This (should) equal 71, 23, and 65 if the 12 bytes were divided into 3 integers.
After opening this file in binary mode and reading 4 bytes into an array of chars, how can I use bitwise operations to make char[0] bits be the first 8 bits of an int and so on until the bits of each char are part of the int.
My integer = 00      00      00      00
             ^       ^       ^       ^
Chars:       Char[0] Char[1] Char[2] Char[3]
             00      00      00      47

So my integer (hex) = 00 00 00 47 = numerical value 71
Also, I don't know how the endianness of my system comes into play here, so is there anything that I need to keep in mind?
Here is a code snippet of what I have so far, I just don't know the next steps to take.
std::fstream myfile;
myfile.open("C:\\Users\\Jacob\\Desktop\\hextest.txt", std::ios::in | std::ios::out | std::ios::binary);
if(myfile.is_open() == false)
{
std::cout << "Error" << std::endl;
}
char* mychar;
std::cout << myfile.is_open() << std::endl;
mychar = new char[4];
myfile.read(mychar, 4);
I eventually plan on dealing with reading floats from a file and maybe a custom data type eventually, but first I just need to get more familiar with using bitwise operations.
Thanks.
You want the bitwise left shift operator:
typedef unsigned char u8; // in case char is signed by default on your platform
unsigned num = ((u8)chars[0] << 24) | ((u8)chars[1] << 16) | ((u8)chars[2] << 8) | (u8)chars[3];
What it does is shift the left argument a specified number of bits to the left, adding zeros from the right as stuffing. For example, 2 << 1 is 4, since 2 is 10 in binary and shifting one to the left gives 100, which is 4.
This can be written in a more general loop form:
unsigned num = 0;
for (int i = 0; i != 4; ++i) {
    num |= (u8)chars[i] << (24 - i * 8); // += could have also been used
}
The endianness of your system doesn't matter here; you know the endianness of the representation in the file, which is constant (and therefore portable), so when you read in the bytes you know what to do with them. The internal representation of the integer in your CPU/memory may be different from that of the file, but the logical bitwise manipulation of it in code is independent of your system's endianness; the least significant bits are always at the right, and the most at the left (in code). That's why shifting is cross-platform -- it operates at the logical bit level :-)
Have you thought of using Boost.Spirit to make a binary parser? You might hit a bit of a learning curve when you start, but if you want to expand your program later to read floats and structured types, you'll have an excellent base to start from.
Spirit is very well-documented and is part of Boost. Once you get around to understanding its ins and outs, it's really mind-boggling what you can do with it, so if you have a bit of time to play around with it, I'd really recommend taking a look.
Otherwise, if you want your binary to be "portable" - i.e. you want to be able to read it on a big-endian and a little-endian machine, you'll need some sort of byte-order mark (BOM). That would be the first thing you'd read, after which you can simply read your integers byte by byte. Simplest thing would probably be to read them into a union (if you know the size of the integer you're going to read), like this:
union U
{
    unsigned char uc_[4];
    unsigned long ui_;
};
Read the data into the uc_ member, swap the bytes around if you need to change endianness, and read the value from the ui_ member. There's no shifting etc. to be done - except for the swapping, if you want to change endianness.
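A sketch of that union approach with a fixed 4-byte type (the helper name is mine). Note that reading a union member other than the one last written is technically undefined behaviour in C++, though widely supported by compilers; the shift-based decoding in the other answer avoids the issue:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>

// 4-byte view of the same storage: raw bytes on one side, integer on the other.
union U {
    unsigned char uc_[4];
    std::uint32_t ui_;
};

std::uint32_t load(const unsigned char* bytes, bool swap_bytes) {
    U u;
    std::copy(bytes, bytes + 4, u.uc_); // read the data into the uc_ member
    if (swap_bytes) {                   // change endianness if required
        std::swap(u.uc_[0], u.uc_[3]);
        std::swap(u.uc_[1], u.uc_[2]);
    }
    return u.ui_;                       // read the value from the ui_ member
}
```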
This isn't cross-platform code... everything is being performed on the same platform (i.e. the endianness is the same: little-endian).
I have this code:
unsigned char array[4] = {'t', 'e', 's', 't'};
unsigned int out = ((array[0]<<24)|(array[1]<<16)|(array[2]<<8)|(array[3]));
std::cout << out << std::endl;
unsigned char buff[4];
memcpy(buff, &out, sizeof(unsigned int));
std::cout << buff << std::endl;
I'd expect the output of buff to be "test" (with a garbage trailing character because of the lack of '\0'), but instead the output is "tset". Obviously, changing the order of the characters that I'm shifting (3, 2, 1, 0 instead of 0, 1, 2, 3) fixes the problem, but I don't understand the problem. Is memcpy not acting the way I expect?
Thanks.
This is because your CPU is little-endian. In memory, the array is stored as:
+----+----+----+----+
array | 74 | 65 | 73 | 74 |
+----+----+----+----+
This is represented with increasing byte addresses to the right. However, the integer is stored in memory with the least significant bytes at the left:
+----+----+----+----+
out | 74 | 73 | 65 | 74 |
+----+----+----+----+
This happens to represent the integer 0x74657374. Using memcpy() to copy that into buff reverses the bytes from your original array.
You're running this on a little-endian platform.
On a little-endian platform, a 32-bit int is stored in memory with the least significant byte in the lowest memory address. So bits 0-7 are stored at address P, bits 8-15 in address P + 1, bits 16-23 in address P + 2 and bits 24-31 in address P + 3.
In your example: bits 0-7 = 't', bits 8-15 = 's', bits 16-23 = 'e', bits 24-31 = 't'
So that's the order that the bytes are written to memory: "tset"
If you address the memory then as separate bytes (unsigned chars), you'll read them in the order they are written to memory.
On a little-endian platform the output should be tset. The original sequence was test from lower addresses to higher addresses. Then you put it into an unsigned int with first 't' going into the most significant byte and the last 't' going into the least significant byte. On a little-endian machine the least significant byte is stored at lower address. This is how it will be copied to the final buf. This is how it is going to be output: from the last 't' to the first 't', i.e. tset.
On a big-endian machine you would not observe the reversal.
You have written a test for platform byte order, and it has concluded: little endian.
How about adding a '\0' to your buff?