I came across this syntax for reading a BMP file in C++
#include <fstream>

int main() {
    std::ifstream in("filename.bmp", std::ifstream::binary);
    in.seekg(0, in.end);
    std::streamsize size = in.tellg();
    in.seekg(0);
    unsigned char * data = new unsigned char[size];
    in.read((char *)data, size);
    int width = *(int*)&data[18];
    // omitted remainder for minimal example
}
and I don't understand what the line
int width = *(int*)&data[18];
is actually doing. Why doesn't a simple cast to int, int width = (int)data[18];, work?
Note
As @user4581301 indicated in the comments, this depends on the implementation and will fail in many instances. And as @NathanOliver-ReinstateMonica and @ChrisMM pointed out, this is undefined behavior and the result is not guaranteed.
According to the bitmap header format, the width of the bitmap in pixels is stored as a signed 32-bit integer beginning at byte offset 18. The syntax
int width = *(int*)&data[18];
reads the four bytes at offsets 18 through 21, inclusive (assuming a 32-bit int), and interprets them as a single integer.
How?
&data[18] gets the address of the unsigned char at index 18
(int*) casts the address from unsigned char* to int*, so that dereferencing it reads sizeof(int) bytes rather than a single byte
*(int*) dereferences the address to get the referred int value
So basically, it takes the address of data[18] and reads the bytes at that address as if they were an integer.
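Since the Note above flags the pointer cast as implementation-dependent and formally undefined behavior, a portable alternative (a minimal sketch, not part of the original code) is to assemble the width from the individual bytes, which works because the BMP format stores the value little-endian:
// Assemble the little-endian 32-bit width from bytes 18..21, one byte at a time.
// This does not depend on the host's byte order or on alignment.
int width = int(static_cast<unsigned int>(data[18])
              | static_cast<unsigned int>(data[19]) << 8
              | static_cast<unsigned int>(data[20]) << 16
              | static_cast<unsigned int>(data[21]) << 24);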
Why doesn't a simple cast to `int` work?
sizeof(data[18]) is 1, because unsigned char is a single byte (values 0-255). sizeof(int), on the other hand, is typically 4 bytes on 32- and 64-bit systems (the standard allows it to be as small as 2 bytes, as on some 16-bit systems, but 4 is the common case). Casting the address to (int*) and dereferencing it therefore reads sizeof(int) bytes, in this case the 4 bytes at offsets 18 through 21, inclusive. A simple cast from unsigned char to int also produces a 4-byte int, but it carries only one byte of the information from data. This is illustrated by the following example:
#include <iostream>
#include <bitset>
#include <string>

int main() {
    unsigned char data[22] = {};  // stand-in for the BMP buffer
    // Populate offsets 18-21 with a recognizable pattern for demonstration
    std::bitset<8> _bits(std::string("10011010"));
    unsigned long bits = _bits.to_ulong();
    for (int ii = 18; ii < 22; ii++) {
        data[ii] = static_cast<unsigned char>(bits);
    }
    std::cout << "data[18] -> 1 byte "
              << std::bitset<32>(data[18]) << std::endl;
    std::cout << "*(unsigned short*)&data[18] -> 2 bytes "
              << std::bitset<32>(*(unsigned short*)&data[18]) << std::endl;
    std::cout << "*(int*)&data[18] -> 4 bytes "
              << std::bitset<32>(*(int*)&data[18]) << std::endl;
}
data[18] -> 1 byte 00000000000000000000000010011010
*(unsigned short*)&data[18] -> 2 bytes 00000000000000001001101010011010
*(int*)&data[18] -> 4 bytes 10011010100110101001101010011010
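If the goal is simply to avoid the formally undefined pointer cast while keeping the same byte-for-byte read, std::memcpy is a common alternative (a sketch; like the cast, the result follows the host's byte order, which matches the BMP file on little-endian machines):
#include <cstring>
// Copy the 4 bytes at offset 18 into an int; well-defined, no alignment concerns.
int width;
std::memcpy(&width, &data[18], sizeof width);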
Related
I'm trying to read a binary file and simply convert the data to usable unsigned integers. The code below works for 2-byte reads, for certain file locations, and correctly prints the unsigned integer. When I use the 4-byte code, though, my value turns out to be much larger than it is supposed to be. I believe the issue lies within the read function; it seems as though I am getting the wrong character/decimal number (101 for example), which when bit-shifted becomes a number much larger than it should be (~6662342). (When the program runs it occasionally throws a "stack around the variable buf" runtime error #2 in Visual Studio.) Any ideas? It may be that my fundamental understanding of how the data is stored in the char array is affecting my output.
working 2-byte code
unsigned char buf[2];
file.seekg(3513);
uint64_t value = readBufferLittleEndian(buf, &file);
printf("%i", value);
system("PAUSE");
return 0;
}
uint64_t readBufferLittleEndian(unsigned char buf[], std::ifstream *file)
{
file->read((char*)(&buf[0]), 2);
return (buf[1] << 8 | buf[0]);
}
broken 4-byte code
unsigned char buf[8 + 1]; //= { 0, 2 , 0 , 0 , 0 , 0 , 0, 0, 0 };
uint64_t buf1[9];
file.seekg(3213);
uint64_t value = readBufferLittleEndian(buf, &file, buf1);
std::cout << value;
system("PAUSE");
return 0;
}
uint64_t readBufferLittleEndian(unsigned char buf[], std::ifstream *file, uint64_t buf1[])
{
file->read((char*)(&buf[0]), 4);
for (int index = 0; index < 4; index++)
{
buf1[index] = buf[index];
}
buf1[0];
buf1[1];
buf1[2];
buf1[3];
//return (buf1[7] << 56 | buf1[6] << 48 | buf1[5] << 40 | buf1[4] << 32 | buf1[3] << 24 | buf1[2] << 16 | buf1[1] << 8 | buf1[0]);
return (buf1[3] << 24 | buf1[2] << 16 | buf1[1] << 8 | buf1[0]);
//return (buf1[1] << 8 | buf1[0]);
}
Please correct me if I got the endianness reversed.
The code is C++ except for the printf line.
You have to cast before shifting: you cannot shift a char left by 56 bits, because it is only promoted to int before the shift.
i.e., do ((uint64_t)buf[n] << NN)
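Putting that together, a minimal sketch of such a read for any width up to 8 bytes (the function name and signature here are illustrative, assuming the same std::ifstream setup as in the question):
#include <cstdint>
#include <fstream>

// Read `count` bytes (1..8) from the stream and assemble them little-endian.
// Casting each byte to uint64_t before shifting avoids the int promotion
// problem, so shifts of 32 bits and more stay well-defined.
uint64_t readLittleEndian(std::ifstream &file, int count)
{
    unsigned char buf[8] = {};
    file.read(reinterpret_cast<char *>(buf), count);
    uint64_t value = 0;
    for (int i = 0; i < count; ++i)
        value |= static_cast<uint64_t>(buf[i]) << (8 * i);
    return value;
}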
seekg(0) = byte 1, seekg(3212) = byte 3213. Not entirely sure why I was getting a zero in byte 3214 before, considering I now get 220 (indicating big-endianness). Getting 220 would have indicated that I was misinterpreting the functionality of seekg(). Oh well, it is solved where it matters now anyway.
I am trying to convert 4 bytes to an integer using C++.
This is my code:
int buffToInteger(char * buffer)
{
int a = (int)(buffer[0] << 24 | buffer[1] << 16 | buffer[2] << 8 | buffer[3]);
return a;
}
The code above works in almost all cases, for example:
When my buffer is: "[\x00, \x00, \x40, \x00]" the code will return 16384 as expected.
But when the buffer is filled with: "[\x00, \x00, \x3e, \xe3]", the code won't work as expected and will return "ffffffe3".
Does anyone know why this happens?
Your buffer contains signed characters. So, actually, buffer[3] == -29, which upon conversion to int gets sign-extended to 0xffffffe3, and in turn (0x3e << 8) | 0xffffffe3 == 0xffffffe3.
You need to ensure your individual buffer bytes are interpreted as unsigned, either by declaring buffer as unsigned char *, or by explicitly casting:
int a = int((unsigned char)(buffer[0]) << 24 |
            (unsigned char)(buffer[1]) << 16 |
            (unsigned char)(buffer[2]) << 8 |
            (unsigned char)(buffer[3]));
In the expression buffer[0] << 24, the integral promotions are applied, so buffer[0] is converted to an int before the shift is performed.
On your system a char is apparently signed, and will then be sign extended when converted to int.
There's an implicit promotion to a signed int in your shifts.
That's because char is (apparently) signed on your platform (the common case) and << promotes its operands to int implicitly. In fact, none of this would work otherwise, because << 8 (and higher) would shift all of your bits out of a plain char!
If you're stuck with using a buffer of signed chars this will give you what you want:
#include <iostream>
#include <iomanip>

int buffToInteger(char * buffer)
{
    int a = static_cast<int>(static_cast<unsigned char>(buffer[0]) << 24 |
                             static_cast<unsigned char>(buffer[1]) << 16 |
                             static_cast<unsigned char>(buffer[2]) << 8 |
                             static_cast<unsigned char>(buffer[3]));
    return a;
}

int main(void) {
    char buff[4] = {0x0, 0x0, 0x3e, static_cast<char>(0xe3)};
    int a = buffToInteger(buff);
    std::cout << std::hex << a << std::endl;
    return 0;
}
Be careful about bit shifting on signed values. Promotions don't just add bytes; they may change the value through sign extension.
For example, a gotcha here is that you can't use static_cast<unsigned int>(buffer[1]) (etc.) directly, because that first sign-extends the signed char value to a signed int and then reinterprets that value as unsigned.
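A quick standalone check of that gotcha (a sketch, not from the original answer):
#include <iostream>

int main() {
    char c = static_cast<char>(0xe3);  // -29 where char is signed
    // Direct conversion sign-extends first: prints ffffffe3.
    std::cout << std::hex << static_cast<unsigned int>(c) << std::endl;
    // Going through unsigned char keeps just the byte value: prints e3.
    std::cout << std::hex << static_cast<unsigned int>(static_cast<unsigned char>(c)) << std::endl;
    return 0;
}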
If you ask me, all implicit numeric conversions are bad. No program should have so many of them that writing them out explicitly would become a chore. It's a softness in C++, inherited from C, that causes all sorts of problems far exceeding its value.
It's even worse in C++ because they make the already confusing overloading rules even more confusing.
I think this could also be done with memcpy:
#include <cstring>

int buffToInteger(char* buffer)
{
    int a;
    std::memcpy(&a, buffer, sizeof(int));
    return a;
}
This can be faster than the example mentioned in the original post, because it just copies the bytes "as is" and there is no need for any operations such as bit shifts.
It also doesn't cause any signed/unsigned issues. Note, however, that the result follows the host's byte order, whereas the shift version above always treats buffer[0] as the most significant byte, so the two only agree on a big-endian machine.
char buffer[4];
int a;
a = *(int*)&buffer;
This takes the address of the buffer, casts it to an int pointer, and then dereferences it.
int buffToInteger(char * buffer)
{
return *reinterpret_cast<int*>(buffer);
}
This conversion is simple and fast. We simply tell the compiler to treat the byte array in memory as a single integer.
I have the following simple program that uses a union to convert between a 64-bit integer and its corresponding byte array:
#include <cstdint>
#include <iostream>

using namespace std;

union u
{
    uint64_t ui;
    char c[sizeof(uint64_t)];
};

int main(int argc, char *argv[])
{
    u test;
    test.ui = 0x0123456789abcdefLL;

    for (unsigned int idx = 0; idx < sizeof(uint64_t); idx++)
    {
        cout << "test.c[" << idx << "] = 0x" << hex << +test.c[idx] << endl;
    }

    return 0;
}
What I would expect as output is:
test.c[0] = 0xef
test.c[1] = 0xcd
test.c[2] = 0xab
test.c[3] = 0x89
test.c[4] = 0x67
test.c[5] = 0x45
test.c[6] = 0x23
test.c[7] = 0x1
But what I actually get is:
test.c[0] = 0xffffffef
test.c[1] = 0xffffffcd
test.c[2] = 0xffffffab
test.c[3] = 0xffffff89
test.c[4] = 0x67
test.c[5] = 0x45
test.c[6] = 0x23
test.c[7] = 0x1
I'm seeing this on Ubuntu LTS 14.04 with GCC.
I've been trying to get my head around this for some time now. Why are the first 4 elements of the char array displayed as 32 bit integers, with 0xffffff prepended to them? And why only the first 4, why not all of them?
Interestingly enough, when I use the array to write to a stream (which was the original purpose of the whole thing), the correct values are written. But comparing the array char by char obviously leads to problems, since the first 4 chars do not compare equal to 0xef, 0xcd, and so on.
Using char is not the right thing to do since it could be signed or unsigned. Use unsigned char.
union u
{
uint64_t ui;
unsigned char c[sizeof(uint64_t)];
};
char gets promoted to an int because of the prepended unary + operator. Since your chars are signed, any element with the highest bit set to 1 is interpreted as a negative number and promoted to an integer with the same negative value. There are a few different ways to solve this:
Drop the +: ... << test.c[idx] << .... This will print the char as a character rather than a number, so it is probably not a good solution.
Declare c as unsigned char. This will promote it to an unsigned int.
Explicitly cast to unsigned char before the promotion: ... << +(unsigned char)test.c[idx] << ...
Set the upper bytes of the integer to zero using binary &: ... << (+test.c[idx] & 0xFF) << ... (the parentheses are needed because << binds more tightly than &). This will display only the lowest-order byte no matter how the char is promoted, as shown in the sketch below.
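A minimal sketch of the loop with that mask applied (assuming the same union u and test variable as in the question):
for (unsigned int idx = 0; idx < sizeof(uint64_t); idx++)
{
    // Mask after promotion so only the low-order byte is displayed.
    cout << "test.c[" << idx << "] = 0x" << hex << (+test.c[idx] & 0xFF) << endl;
}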
Use either unsigned char, or test.c[idx] & 0xff, to avoid sign extension when a char value > 0x7f is converted to int.
It is an unsigned char vs. signed char issue, and how each converts to an integer.
The unary plus causes the char to be promoted to an int (integral promotion). Because your chars are signed, the value is sign-extended, and the upper bytes reflect that.
It is not true that only the first four are ints; they all are. You just don't see it in the representation because leading zeroes are not shown.
Either use unsigned chars, or mask with & 0xff after the promotion, to get the desired result.
I have a string of 256*4 bytes of data. These 256*4 bytes need to be converted into 256 unsigned integers. The order in which they come is little endian, i.e. the first four bytes in the string are the little-endian representation of the first integer, the next 4 bytes are the little-endian representation of the next integer, and so on.
What is the best way to parse through this data and merge these bytes into unsigned integers? I know I have to use bitshift operators but I don't know in what way.
Hope this helps you
unsigned int arr[256];
char ch[256*4] = "your string";
for (int i = 0, k = 0; i < 256*4; i += 4, k++)
{
    // Cast each byte to unsigned char first so sign extension can't set the upper bits.
    arr[k] = (unsigned char)ch[i]
           | (unsigned char)ch[i+1] << 8
           | (unsigned char)ch[i+2] << 16
           | (unsigned int)(unsigned char)ch[i+3] << 24;
}
Alternatively, we can use C/C++ casting to interpret the char buffer as an array of unsigned int. This avoids the shifting, but note that the result then depends on the host's endianness; it gives the intended values here only because the data is little-endian and so is the machine.
#include <stdio.h>

int main()
{
    char buf[256*4] = "abcd";
    unsigned int *p_int = (unsigned int *)buf;
    unsigned short idx = 0;
    unsigned int val = 0;

    for (idx = 0; idx < 256; idx++)
    {
        val = *p_int++;
        printf("idx = %d, val = %d \n", idx, val);
    }
}
This would print out 256 values, the first one is
idx = 0, val = 1684234849
(and all remaining numbers = 0).
As a side note, "abcd" converts to 1684234849 because it's run on x86 (little endian), where "abcd" is read as 0x64636261 ('a' is 0x61 and 'd' is 0x64; in little endian, the LSB is at the lowest address). So 0x64636261 = 1684234849.
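As a quick check of that arithmetic: 0x64636261 = 0x64 * 2^24 + 0x63 * 2^16 + 0x62 * 2^8 + 0x61 = 1677721600 + 6488064 + 25088 + 97 = 1684234849.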
Note also, if using C++, reinterpret_cast should be used in this case:
const char *p_buf = "abcd";
const unsigned int *p_int = reinterpret_cast< const unsigned int * >( p_buf );
To assemble each value from 4 little-endian bytes, shift each byte into place and OR them together; because this operates on values rather than on memory layout, it works regardless of the host's byte order:
unsigned char bytes[4] = { /* 4 bytes of little-endian data */ };
unsigned int i = bytes[0] | (bytes[1] << 8) | (bytes[2] << 16) | ((unsigned int)bytes[3] << 24);
(If the data were big-endian instead, you would just change the indexes of bytes[] from 0-3 to 3-0.)
If your PC is little-endian you don't even need the shifting: just copy the whole char array onto the int array
#define LEN 256
char bytes[LEN*4] = "blahblahblah";
unsigned int uint[LEN];
memcpy(uint, bytes, sizeof bytes);
That said, the best way is to avoid copying at all and use the same array for both types
union
{
char bytes[LEN*4];
unsigned int uint[LEN];
} myArrays;
// copy data to myArrays.bytes[], do something with those bytes if necessary
// after populating myArrays.bytes[], get the ints by myArrays.uint[i]
I scan through the byte representation of an int variable and get somewhat unexpected result.
If I do
int a = 127;
cout << (unsigned int) *((char *)&a);
I get 127 as expected. If I do
int a = 256;
cout << (unsigned int) *((char *)&a + 1);
I get 1 as expected. But if I do
int a = 128;
cout << (unsigned int) *((char *)&a);
I have 4294967168 which is, well… quite fancy.
The question is: is there a way to get 128 when looking at first byte of an int variable which value is 128?
For the same reason that (unsigned int)(char)128 is 4294967168: char is signed by default on most commonly used systems. 128 cannot fit in a signed 8-bit quantity, so when you cast it to char, you get -128 (0x80 in hex).
Then, when you cast -128 to an unsigned int, you get 2^32 - 128, which is 4294967168.
If you want to get +128, then use an unsigned char instead of char.
char is signed here, so in your second example, *((char *)&a + 1) reads the second byte of a = 256 (0x00000100), which is 1; converted to unsigned int that is 0b00000000000000000000000000000001, i.e., 1.
In your third example, *((char *)&a) reads the first byte of a = 128, which on a little-endian machine is 0x80; as a signed char that is -128, and converting -128 to unsigned int gives 0b11111111111111111111111110000000, i.e., 2^32 - 128 = 4294967168.
As the comments have pointed out, it looks like what's happening here is that you are running into an oddity of two's complement. In your last cast, since you are not using an unsigned char, the highest-order bit of the byte is being used to indicate positive or negative values. You then only have 7 bits out of the full 8 to represent your value, giving you a range of 0-127 for positive numbers (-128 to 127 overall).
If you exceed this range, then it wraps, and you get -128, which when cast back to an unsigned int results in that abnormally large value.
int a = 128;
cout << (unsigned int) *((unsigned char *)&a);
Also, all of your code depends on running on a little-endian machine.
Here's how you should probably be doing these things:
int a = 127;
cout << (unsigned)(unsigned char)(0xFF & a);
int a = 256;
cout << (unsigned)(unsigned char)(0xFF & (a>>8));
int a = 128;
cout << (unsigned)(unsigned char)(0xFF & a);