What I want to do: read a series of 4 bytes, e.g. 00000000 00000011 00000001 00000011 (a random example), from a binary file and represent it as an integer in my program. What is the best way to do this?
EDIT (SOLUTION): I had overlooked this part of the spec for the PNG file format here; hopefully this is useful to anyone who finds this question.
I am experimenting with the PNG image format and am having trouble extracting a 4-byte number. I have succeeded in opening and printing the binary representation of the file, so I know that the data I am working with isn't corrupted or malformed.
I have reviewed questions like Reading 16-bit integers from binary file c++, and the 32-bit equivalent(s), but I cannot discern whether they are reading integers stored in a binary file, e.g. 00000000 72 00000000, or reading bytes as integers, which is my goal.
As an example, the first four bytes of the first chunk are 00000000 00000000 00000000 00001101 or 13.
Following the example of questions like the one above, this should == 13:
int test;
img.read( (char*) &test, sizeof(test));
yet it outputs 218103808
I also tried the approach of using a union with a character array and integer data member, and got the same output of 218103808
Also, on my system sizeof(int) is equal to 4.
And lastly, just to be sure it wasn't a malformed PNG (which I am rather sure it wasn't), I used GIMP to import it and then export it as a new file, so it was created natively on my system.
EDIT
As I mentioned, after seekg(8) the next four bytes are 00000000 00000000 00000000 00001101, but when I decided to test the read function using
bitset<32> num;
img.read( (char*) &num, sizeof(int) );
it outputs 00001101 00000000 00000000 00000000
I am simply confused by this part: it's as if the bytes are reversed here, and this string of bytes equates to 218103808.
Any insight would be appreciated
Notice that 218103808 is 0x0D000000 in hex. You might want to read about endianness.
That means the data you are reading is in big endian format, while your platform uses little endian.
Basically you need to reverse the 4 bytes (and you likely want to use unsigned integers) so that you get 0x0000000D (13 decimal), which you can do like this:
#define BSWAPUINT(x) ((((x) & 0x000000ff) << 24) |\
                      (((x) & 0x0000ff00) <<  8) |\
                      (((x) & 0x00ff0000) >>  8) |\
                      (((x) & 0xff000000) >> 24))
unsigned int test;
img.read( (char*) &test, sizeof(test));
test = BSWAPUINT(test);
The above code will only work if the code runs on a little endian platform though.
To make your code independent of whether your platform is big or little endian, you can assemble the bytes into an integer yourself. Given that you know the data format is big endian, you can do:
unsigned char buf[4];
unsigned int test;
img.read( (char*) buf, sizeof(buf) );
test  = (unsigned int)buf[0] << 24;
test |= buf[1] << 16;
test |= buf[2] << 8;
test |= buf[3];
Or, on Unix systems you can #include <arpa/inet.h> and use ntohl():
test = ntohl(test);
(When dealing with data in this manner, you are also better off using fixed-width types such as uint32_t from <stdint.h> instead of int/unsigned int.)
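Putting those pieces together, here is a minimal self-contained sketch of reading the first chunk length of a PNG portably (the helper name read_be32 and the file name image.png are my own, for illustration):

#include <cstdint>
#include <fstream>
#include <iostream>

// Assemble a big-endian 32-bit unsigned integer from the stream,
// independent of the host's endianness.
uint32_t read_be32(std::istream& in)
{
    unsigned char buf[4];
    in.read(reinterpret_cast<char*>(buf), sizeof(buf));
    return (uint32_t(buf[0]) << 24) | (uint32_t(buf[1]) << 16) |
           (uint32_t(buf[2]) <<  8) |  uint32_t(buf[3]);
}

int main()
{
    std::ifstream img("image.png", std::ios::binary);
    img.seekg(8);                      // skip the 8-byte PNG signature
    uint32_t length = read_be32(img);  // length field of the first chunk
    std::cout << length << '\n';       // should print 13 for the IHDR chunk
}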
I've recently needed to convert the MNIST data-set to images and labels. It is binary, and its structure is described in the previous link. I did a little research and, as a fan of C++, I read up on binary I/O in C++; after that I found this link on Stack Overflow. That code works well, but it has no comments and no explanation of the algorithm, so I got confused, and that raised some questions which I need a professional C++ programmer to answer.
1. What is the algorithm to convert the data-set in C++ with the help of ifstream?
I've figured out how to read a file as binary with file.read and move to the next record, but in C we would define a struct and map it over the file contents; I can't see any struct in the C++ program, for example to read this:
[offset] [type]          [value]           [description]
0000     32 bit integer  0x00000803(2051)  magic number
0004     32 bit integer  60000             number of images
0008     32 bit integer  28                number of rows
0012     32 bit integer  28                number of columns
0016     unsigned byte   ??                pixel
How can we go to a specific offset, for example 0004, read a 32-bit integer there, and put it into an integer variable?
2. What is the function ReverseInt doing? (It is obviously not simply reversing the digits of an integer.)
int ReverseInt (int i)
{
    unsigned char ch1, ch2, ch3, ch4;
    ch1 = i & 255;          // lowest byte
    ch2 = (i >> 8) & 255;
    ch3 = (i >> 16) & 255;
    ch4 = (i >> 24) & 255;  // highest byte
    return ((int)ch1 << 24) + ((int)ch2 << 16) + ((int)ch3 << 8) + ch4;
}
I did a little debugging with cout, and when it reversed, for example, 270991360 it returned 10000, and I cannot find any relation between the two. I understand that it ANDs the number with 255 and shifts (multiplying/dividing by powers of two), but why?
PS:
1. I already have the MNIST images converted; I want to understand the algorithm.
2. I've already unzipped the .gz files, so the file is pure binary.
1. What is the algorithm to convert the data-set in C++ with the help of ifstream?
This function reads the file (t10k-images-idx3-ubyte.gz) as follows:
Read the magic number and adjust endianness
Read the number of images and adjust endianness
Read the number of rows and adjust endianness
Read the number of columns and adjust endianness
Read all of the images x rows x columns pixel bytes (but discard them)
The function uses plain int and always switches endianness; that means it targets a very specific architecture and is not portable.
How can we go to a specific offset, for example 0004, read a 32-bit integer there, and put it into an integer variable?
ifstream provides a function to seek to a given position:
file.seekg( posInBytes, std::ios_base::beg);
At the given position, you could read the 32-bit integer:
int32_t val;
file.read((char*)&val, sizeof(int32_t));
2. What is the function ReverseInt doing?
This function reverses the order of the bytes of an int value:
Considering a 32-bit integer like aaaaaaaabbbbbbbbccccccccdddddddd, it returns the integer ddddddddccccccccbbbbbbbbaaaaaaaa.
This is useful for normalizing endianness; however, it is probably not very portable, as int might not be 32-bit (it could be, e.g., 16-bit or 64-bit). It also explains your debugging result: 270991360 is 0x10270000 in hex, and reversing its bytes gives 0x00002710, which is 10000 in decimal.
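Putting the pieces together, a minimal sketch of reading the header with fixed-width types and explicit big-endian decoding could look like this (the helper name readBigEndian32 is mine, and it assumes the file has already been unzipped):

#include <cstdint>
#include <fstream>
#include <iostream>

// Assemble a 32-bit value from 4 big-endian bytes; works on any host,
// so no ReverseInt-style byte swapping is needed afterwards.
uint32_t readBigEndian32(std::istream& in)
{
    unsigned char b[4];
    in.read(reinterpret_cast<char*>(b), sizeof(b));
    return (uint32_t(b[0]) << 24) | (uint32_t(b[1]) << 16) |
           (uint32_t(b[2]) <<  8) |  uint32_t(b[3]);
}

int main()
{
    std::ifstream file("t10k-images-idx3-ubyte", std::ios::binary);
    uint32_t magic   = readBigEndian32(file); // offset 0000: should be 0x00000803 (2051)
    uint32_t nImages = readBigEndian32(file); // offset 0004
    uint32_t nRows   = readBigEndian32(file); // offset 0008
    uint32_t nCols   = readBigEndian32(file); // offset 0012
    std::cout << magic << ' ' << nImages << ' ' << nRows << ' ' << nCols << '\n';
}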
I am writing code in C++ to read in a WAVE file. I am following the WAVE file specification I found here.
In the following code I am reading in the chunk size, which is stored in bytes 4, 5, 6, and 7.
According to the specification, this int is stored in little-endian order in these 4 bytes.
So if these 4 bytes held the unsigned value 2, I would think they would be as follows...
4 5 6 7
00000010 00000000 00000000 00000000
So if I am trying to read these 4 bytes as an int on Windows, I don't need to do anything, correct? Since Windows is little endian. So this is what I did...
unsigned int chunk_size = (hbytes[4] << 24) + (hbytes[5] << 16) + (hbytes[6] << 8) + hbytes[7];
but that didn't work; it gave me an incorrect value. When I swapped the order of the bytes, it did work:
unsigned int chunk_size = (hbytes[7] << 24) + (hbytes[6] << 16) + (hbytes[5] << 8) + hbytes[4];
Is this information I have about WAVE files correct? Is this int stored as little endian? Or are my assumptions about endianness incorrect?
You got everything right except the procedure to convert a little-endian stream.
Your diagram is correct: if the 4-byte field holds a 2, then the first byte (hbytes[4]) is 2 and the remaining bytes are 0. Why would you then want to left shift that byte by 24? The byte you want to left shift by 24 is the high-order byte, hbytes[7].
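To make that concrete, here is a small sketch of a helper (read_le32 is my own name for it) that assembles a 32-bit value from four little-endian bytes, with casts to unsigned to avoid sign-extension surprises:

#include <cstdint>

// Assemble a 32-bit value from 4 little-endian bytes:
// the byte at the lowest offset is the least significant.
uint32_t read_le32(const unsigned char* p)
{
    return (uint32_t(p[3]) << 24) | (uint32_t(p[2]) << 16) |
           (uint32_t(p[1]) <<  8) |  uint32_t(p[0]);
}

// Usage for the WAVE header: unsigned int chunk_size = read_le32(hbytes + 4);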
I'm having trouble with this one part: I want to take a 32-bit number and shift its bytes (1 byte = 8 bits) from big-endian to little-endian form. For example:
Lets say I have the number 1.
In 32 bits this is what it would look like:
1st byte 2nd byte 3rd byte 4th byte
00000000 00000000 00000000 00000001
I want it so that it looks like this:
4th byte 3rd byte 2nd byte 1st byte
00000001 00000000 00000000 00000000
so that the byte with the least significant value appears first. I was thinking you could use a for loop, but I'm not exactly sure how to shift bits/bytes in C++. For example, if a user entered 1 and I had to shift its bits like the above example, I'm not sure how I would convert 1 into bits and then shift. Could anyone point me in the right direction? Thanks!
<< and >> are the bitwise shift operators in C and most other C-style languages.
One way to do what you want is:
int value = 1;
unsigned int x = (unsigned int)value;
int valueShifted =
    ( x << 24)               | // Move 4th byte to 1st
    ((x <<  8) & 0x00ff0000) | // Move 3rd byte to 2nd
    ((x >>  8) & 0x0000ff00) | // Move 2nd byte to 3rd
    ( x >> 24);                // Move 1st byte to 4th
#include <algorithm>
#include <cassert>
#include <cstdint>

uint32_t n = 0x00000001;
std::reverse( (char*)&n, (char*)(&n + 1) ); // reverse the 4 bytes in place
assert( n == 0x01000000 );
Shifting is done with the << and >> operators. Together with the bit-wise AND (&) and OR (|) operators you can do what you want:
int value = 1;
int shifted = (value << 24)
            | ((value & 0x0000ff00) << 8)
            | ((value & 0x00ff0000) >> 8)
            | ((value & 0xff000000) >> 24);
I was working on making a random number generator. I am using a union to access bytes.
typedef unsigned int uint;
typedef unsigned char uchar;
union
{
uint intBits;
uchar charBits[4];
};
// Yes I know ints are not guaranteed to be 4 but ignore that.
So if the number 1 was stored in this union it would look like
00000000 00000000 00000000 00000001
right?
Would an int of -1 look like
00000000 00000000 00000000 00000001
or
10000000 00000000 00000000 00000001
so really the address of the uint is the byte holding the bit that is 1, right? And the address of charBits[0] is that same byte, right? The confusing thing is this: charBits[1] would have to move to the left to be here:

                  !
00000000 00000000 00000000 00000001

so do memory addresses get bigger right to left or left to right?
EDIT:
I am on a 64-bit Windows 7 system with an Intel i7 CPU.
It depends on the machine architecture. If your CPU is big endian then it will work as you seem to expect:
int(4) => b3 b2 b1 b0
But if your CPU is little endian then the bytes are in the opposite direction:
int(4) => b0 b1 b2 b3
Note that bit orders within bytes are always written from left (most significant) to right (least significant).
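Since the i7 is little endian, a quick way to see this for yourself is to print the bytes through your union (a small sketch using your type names; note that reading a union member other than the one last written is technically unspecified in C++, though compilers support it in practice):

#include <cstdio>

typedef unsigned int uint;
typedef unsigned char uchar;

int main()
{
    union
    {
        uint intBits;
        uchar charBits[4];
    } u;

    u.intBits = 1;
    // On a little-endian CPU such as the i7 this prints: 01 00 00 00
    // charBits[0] has the lowest address and holds the least significant byte.
    std::printf("%02x %02x %02x %02x\n",
                u.charBits[0], u.charBits[1], u.charBits[2], u.charBits[3]);
    return 0;
}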
There is absolutely no need to do this at all. You can easily compose a 32-bit integer from 8-bit values like this:
int myInt = byte1 | (byte2 << 8) | (byte3 << 16) | (byte4 << 24);
And you can easily decompose a 32-bit integer into 8-bit values like this:
byte1 = myInt & 0xff;
byte2 = (myInt >> 8) & 0xff;
byte3 = (myInt >> 16) & 0xff;
byte4 = (myInt >> 24);
So there's no reason to write non-portable, hard to understand code that relies on internal representation details of the CPU or platform. Just write code that clearly does what you actually want.
I was searching for a way to efficiently pack my data in order to send it over a network.
I found a topic which suggested a way : http://www.sdltutorials.com/cpp-tip-packing-data
And I've also seen it being used in commercial applications. So I decided to give it a try, but the results weren't what I expected.
First of all, the whole point of "packing" your data is to save bytes, but I don't think the algorithm mentioned above saves bytes at all.
Without packing, the server would send 4 bytes (Data); after packing, the server sends a character array that is also 4 bytes long... so it's pointless.
Aside from that, why would someone add 0xFF? It doesn't do anything at all.
The code snippet found in the tutorial mentioned above:
unsigned char Buffer[3];
unsigned int Data = 1024;
unsigned int UpackedData;
Buffer[0] = (Data >> 24) & 0xFF;
Buffer[1] = (Data >> 12) & 0xFF;
Buffer[2] = (Data >> 8) & 0xFF;
Buffer[3] = (Data ) & 0xFF;
UnpackedData = (Buffer[0] << 24) | (Buffer[1] << 12) | (Buffer[2] << 8) | (Buffer[3] & 0xFF);
Result:
0040 // 4 bytes long character
1024 // 4 bytes long
The & 0xFF is to make sure it's between 0 and 255.
I wouldn't place too much credence in that posting; aside from your objection, the code contains an obvious mistake: Buffer is only 3 elements long, but the code stores data in 4 elements (and the shifts by 12 should presumably be 16).
For integers, a simple method I have often found useful is BER encoding. Basically, for an unsigned integer you write 7 bits in each byte, using the 8th bit to mark whether another byte is needed:
#include <vector>

void berPack(unsigned x, std::vector<unsigned char>& out)
{
    while (x >= 128)
    {
        out.push_back(128 + (x & 127)); // write 7 bits, 8th=1 -> more needed
        x >>= 7;
    }
    out.push_back(x); // write last bits (8th=0 -> this ends the number)
}
For a signed integer you encode the sign in the least significant bit and then use the same encoding as before:
void berPack(int x, std::vector<unsigned char>& out)
{
    // note: -x overflows for INT_MIN; fine for every other value
    if (x < 0) berPack((unsigned(-x) << 1) + 1, out);
    else       berPack((unsigned(x) << 1), out);
}
With this approach small numbers use less space. Another advantage is that this encoding is already architecture-neutral (i.e. the data will be understood correctly regardless of the endianness of the system), and the same format can handle different integer sizes: you can send data from a 32-bit system to a 64-bit system without problems (assuming, of course, that the values themselves do not overflow).
The price to pay is that, for example, unsigned values from 268435456 (1 << 28) to 4294967295 ((1 << 32) - 1) require 5 bytes instead of the 4 bytes of standard fixed 4-byte packing.
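For completeness, a matching decoder might look like this (berUnpack is my name for it, under the same assumptions as berPack above):

#include <cstddef>
#include <vector>

// Decode an unsigned value produced by berPack, starting at in[pos];
// pos is advanced past the bytes that were consumed.
unsigned berUnpack(const std::vector<unsigned char>& in, std::size_t& pos)
{
    unsigned x = 0;
    int shift = 0;
    while (in[pos] & 128)              // 8th bit set -> more bytes follow
    {
        x |= unsigned(in[pos++] & 127) << shift;
        shift += 7;
    }
    x |= unsigned(in[pos++]) << shift; // final byte has the 8th bit clear
    return x;
}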
Another reason for packing is to enforce a consistent structure, so that data written by one machine can be reliably read by another.
It's not "adding"; it's performing a bitwise AND in order to keep only the LSB (least-significant byte) and discard the rest. But it doesn't look necessary here.