How to understand an MNIST binary converter in C++?

I recently needed to convert the MNIST data-set to images and labels. It is binary, and its structure is described in the previous link, so I did a little research, and since I'm a fan of C++ I read up on binary I/O in C++. After that I found this link on Stack Overflow. That link works well, but the code has no comments and no explanation of the algorithm, so I got confused, which raised some questions that I need a professional C++ programmer to answer.
1 - What is the algorithm to convert the data-set in C++ with the help of ifstream?
I know how to read a file as binary with file.read and move to the next record, but in C we would define a struct and move it over the file, and I can't see any struct in the C++ program. For example, how do we read this:
[offset]  [type]           [value]            [description]
0000      32 bit integer   0x00000803 (2051)  magic number
0004      32 bit integer   60000              number of images
0008      32 bit integer   28                 number of rows
0012      32 bit integer   28                 number of columns
0016      unsigned byte    ??                 pixel
How can we go to a specific offset, for example 0004, read a 32-bit integer there, and put it into an integer variable?
2 - What is the function reverseInt doing? (It is obviously not simply reversing an integer.)
int ReverseInt(int i)
{
    unsigned char ch1, ch2, ch3, ch4;
    ch1 = i & 255;            // lowest byte
    ch2 = (i >> 8) & 255;
    ch3 = (i >> 16) & 255;
    ch4 = (i >> 24) & 255;    // highest byte
    // Reassemble the four bytes in the opposite order
    return ((int)ch1 << 24) + ((int)ch2 << 16) + ((int)ch3 << 8) + ch4;
}
I did a little debugging with cout, and when it reversed, for example, 270991360 it returned 10000, and I cannot find any relation between the two. I understand that it ANDs the shifted number with 255, but why?
PS:
1 - I already have the converted MNIST images; I just want to understand the algorithm.
2 - I've already unzipped the .gz files, so the file is pure binary.

1 - What is the algorithm to convert the data-set in C++ with the help of ifstream?
The function reads a file (t10k-images-idx3-ubyte.gz) as follows:
Read the magic number and adjust its endianness
Read the number of images and adjust its endianness
Read the number of rows and adjust its endianness
Read the number of columns and adjust its endianness
Read all images × rows × columns pixel bytes (but then throw them away)
The function uses a plain int and always switches endianness; that means it targets a very specific architecture and is not portable. A more portable sketch of the same steps is shown below.
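As a minimal sketch of the algorithm above (not the original poster's code; it assumes the file has already been decompressed to t10k-images-idx3-ubyte), reading the header and the pixel data could look like this:

#include <cstdint>
#include <fstream>
#include <vector>

int main()
{
    std::ifstream file("t10k-images-idx3-ubyte", std::ios::binary);
    if (!file) return 1;

    // IDX integers are stored big-endian; assembling the bytes by hand
    // yields the right value regardless of the host's endianness.
    auto readInt32 = [&file]() {
        unsigned char b[4];
        file.read(reinterpret_cast<char*>(b), 4);
        return (int32_t(b[0]) << 24) | (int32_t(b[1]) << 16)
             | (int32_t(b[2]) << 8)  |  int32_t(b[3]);
    };

    int32_t magic  = readInt32();   // offset 0000: should be 2051 (0x00000803)
    int32_t images = readInt32();   // offset 0004: number of images
    int32_t rows   = readInt32();   // offset 0008: 28
    int32_t cols   = readInt32();   // offset 0012: 28

    // offset 0016 onwards: one unsigned byte per pixel, image after image
    std::vector<unsigned char> pixels(std::size_t(images) * rows * cols);
    file.read(reinterpret_cast<char*>(pixels.data()), pixels.size());
}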
How can we go to a specific offset, for example 0004, and read a 32-bit integer into an integer variable?
ifstream provides a function to seek to a given position:
file.seekg(posInBytes, std::ios_base::beg);
At the given position, you could read the 32-bit integer:
int32_t val;
file.read((char*)&val, sizeof(int32_t));
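For example, to jump straight to offset 0004 and read the number of images (a sketch; on a little-endian machine the value still needs its byte order fixed, e.g. with the ReverseInt from the question):

file.seekg(4, std::ios_base::beg);   // offset 0004: number of images
int32_t numImages;
file.read(reinterpret_cast<char*>(&numImages), sizeof(numImages));
numImages = ReverseInt(numImages);   // IDX files store integers big-endian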
2 - What is the function reverseInt doing?
This function reverses the order of the bytes of an int value:
Considering a 32-bit integer with bytes aaaaaaaa bbbbbbbb cccccccc dddddddd, it returns the integer dddddddd cccccccc bbbbbbbb aaaaaaaa.
This is useful for normalizing endianness; however, it is probably not very portable, as int might not be 32 bits (it could be, e.g., 16 or 64 bits). It also explains the numbers seen while debugging: 10000 is 0x00002710, and reversing its bytes gives 0x10270000, which is 270991360.
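A quick self-check of that byte swap (a sketch; it assumes the ReverseInt definition from the question is linked in):

#include <cassert>

int ReverseInt(int i);   // definition as shown in the question

int main()
{
    // 10000 is 0x00002710; swapping its bytes gives 0x10270000 = 270991360,
    // exactly the pair observed while debugging.
    assert(ReverseInt(270991360) == 10000);
    assert(ReverseInt(10000) == 270991360);   // the swap is symmetric
}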

Related

Can I store 2 * 12-bit unsigned numbers in a file as 3 bytes in C++ using bit fields?

I'm working on an LZW compression app in C++. Since there is no data type that can store 12-bit numbers (for representing table elements up to 4095), I thought I could store 2 of those numbers as 3 bytes in a file and then read them back as a struct with 2 unsigned short members. Is there a way to do that, or should I just use unsigned short? This is what I have tried, but it stores 4 bytes, as there are 2 unsigned short members.
#include <cstdio>

#define BITS 12

struct LZWStruct {
    unsigned short code1 : BITS;
    unsigned short code2 : BITS;
};

int main() {
    LZWStruct test;
    test.code1 = 144;
    test.code2 = 233;
    FILE* f = fopen("binary.bin", "wb");
    fwrite(&test, sizeof(test), 1, f);   // writes sizeof(test) bytes (4 here), not 3
    fclose(f);
}
Your question title and question body are two different questions with different answers.
No, you absolutely cannot store 3 * 12-bit unsigned numbers (36 bits) in four bytes (32 bits).
Yes, you can store two 12-bit numbers (24 bits) in three bytes (24 bits).
The bit fields in C++, inherited from C, that you are trying to use do not guarantee exactly how the bits are packed in the structure, so you cannot know which three bytes of the structure hold your data. You should simply use the shift and or operators to put the values into an integer. Then you will know exactly which three bytes to write to the file.
Then, to be portable, and in particular not dependent on the endianness of the machine, you should also write the bytes out of the integer using the shift operator. If you write using a pointer to the integer, it won't be portable.
In your example, you could have tried fwrite(&test, 3, 1, f), and it might work, if the compiler put the codes in the low bits of test and your machine is little-endian. Otherwise, no.
So to do it reliably:
Put in an integer:
unsigned short code1;
unsigned short code2;
uint32_t test = (code1 & 0xfff) | ((uint32_t)(code2 & 0xfff) << 12);
Write to a file:
putc(test, f);
putc(test >> 8, f);
putc(test >> 16, f);
You can skip the intermediate step if you like:
putc(code1, f);
putc(((code1 >> 8) & 0xf) | (code2 << 4), f);
putc(code2 >> 4, f);
(In the above I am ensuring that I only keep the low 12 bits of each code with the & operators, in case the bits above the low 12 are not zero. If you know for certain that the code values are less than 4096, then you can remove the & operations.)
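Reading the two codes back is the mirror image of the combined write above (a sketch following the same byte layout):

int b0 = getc(f);
int b1 = getc(f);
int b2 = getc(f);
unsigned short code1 = b0 | ((b1 & 0xf) << 8);   // low byte, plus the low nibble of the middle byte
unsigned short code2 = (b1 >> 4) | (b2 << 4);    // high nibble of the middle byte, plus the top byte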
From here: multiple adjacent bit fields are usually packed together. The special unnamed bit field of size zero can be used to force a break in the packing; it specifies that the next bit field begins at the beginning of its allocation unit. Use sizeof to verify the size of your structure.
The exact packing, however, may depend on the platform and compiler. This may be less of a problem if the data are later loaded by the same program, or a closely related one, but it can be an issue for a generic interchange format.
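For example (a sketch; both sizes are implementation-defined, which is exactly why the sizeof check matters):

#include <cstdio>

struct TwoCodes {
    unsigned short a : 12;
    unsigned short b : 12;
};

struct TwoCodesSplit {
    unsigned short a : 12;
    unsigned short   : 0;    // zero-width bit field: force b to start at a new allocation unit
    unsigned short b : 12;
};

int main() {
    // Verify the layout before relying on it.
    std::printf("%zu %zu\n", sizeof(TwoCodes), sizeof(TwoCodesSplit));
}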

16-bit to 10-bit conversion code explanation

I came across the following code to convert 16-bit numbers to 10-bit numbers and store them inside an integer. Could anyone explain to me what exactly is happening with the AND 0x03?
// Convert the data to 10 bits
int xAccl = ((data[1] & 0x03) * 256) + data[0];
if (xAccl > 511) {
    xAccl -= 1024;
}
Link to where I got the code: https://www.instructables.com/id/Measurement-of-Acceleration-Using-ADXL345-and-Ardu/
The bitwise operator & applies a mask; in this case it clears the 6 highest bits of the byte, keeping only the 2 lowest.
Basically, this code computes the value modulo 1024 (for unsigned values).
data[1] takes the 2nd byte; & 0x03 masks that byte with binary 11, so it keeps 2 bits; * 256 is the same as << 8, i.e. it pushes those 2 bits into the 9th and 10th positions; adding data[0] combines the two bytes (personally I'd have used |, not +).
So xAccl now holds the full 10 bits, with data[0] as the low byte, i.e. little-endian byte ordering.
The > 511 is a sign check; essentially it says "if the 10th bit is set, treat the entire thing as a negative integer, as though we'd used 10-bit two's-complement rules".
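Putting it together (a sketch with hypothetical sample bytes, low byte first as the sensor sends them):

#include <cstdio>

int main()
{
    unsigned char data[2] = { 0x00, 0x02 };          // raw 10-bit reading 0x200
    int xAccl = ((data[1] & 0x03) * 256) + data[0];  // assemble the 10 bits -> 512
    if (xAccl > 511)                                 // bit 9 (the sign bit) is set
        xAccl -= 1024;                               // apply 10-bit two's complement
    std::printf("%d\n", xAccl);                      // prints -512
}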

C++ - Reading number of bits per pixel from BMP file

I am trying to get the number of bits per pixel in a BMP file. According to Wikipedia, it is supposed to be at the 28th byte. So after opening the file:
// Seek to the byte where the number of bits per pixel is stored
plik.seekg(28, ios::beg);
// Read the number of bits used per pixel
int liczbaBitow;
plik.read((char*)&liczbaBitow, 2);
cout << "liczba bitow " << liczbaBitow << endl;
But liczbaBitow (the variable that is supposed to hold the number of bits per pixel) is -859045864. I don't know where it comes from... I'm pretty lost.
Any ideas?
To clarify @TheBluefish's answer, this code has the bug:
// Read the number of bits used per pixel
int liczbaBitow;
plik.read((char*)&liczbaBitow, 2);
When you use (char*)&liczbaBitow, you're taking the address of a 4-byte integer and telling the code to put 2 bytes there.
The other two bytes of that integer are unspecified and uninitialized. In this case they're 0xCC, because that's the fill value the compiler's debug runtime uses for uninitialized stack memory.
But if you're calling this from another function, or repeatedly, you can expect the stack to contain other bogus values.
If you initialize the variable, you'll get the value you expect.
But there's another bug: byte order matters here too. This code assumes that the machine's native byte order exactly matches the byte order in the file specification. There are a number of different bitmap formats, but your reference, the Wikipedia article, says:
All of the integer values are stored in little-endian format (i.e. least-significant byte first).
That happens to match your machine, which is evidently little-endian x86. Other formats aren't defined to be little-endian, so as you proceed to decode the image, you'll have to watch for it.
Ideally, you'd read into a byte array and put the bytes where they belong.
See Convert Little Endian to Big Endian
int liczbaBitow;
unsigned char bpp[2];
plik.read(reinterpret_cast<char*>(bpp), 2);   // read the two raw bytes
liczbaBitow = bpp[0] | (bpp[1] << 8);         // assemble as little-endian
-859045864 can be represented in hexadecimal as 0xCCCC0018.
Reading only the lower two bytes gives us 0x0018 = 24 bpp.
What is most likely happening here is that liczbaBitow is being initialized to 0xCCCCCCCC, while your plik.read is only writing the lower 16 bits and leaving the upper 16 bits unchanged. Changing that line should fix the issue:
int liczbaBitow = 0;
Though, especially with something like this, it's best to use a datatype that exactly matches your data:
int16_t liczbaBitow = 0;
This can be found in <cstdint>.
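Combining both fixes into a minimal sketch (the file name is hypothetical): an exactly-sized variable, plus explicit little-endian assembly of the two bytes:

#include <cstdint>
#include <fstream>
#include <iostream>

int main()
{
    std::ifstream plik("image.bmp", std::ios::binary);
    plik.seekg(28, std::ios::beg);                   // offset of the bits-per-pixel field

    unsigned char bpp[2];
    plik.read(reinterpret_cast<char*>(bpp), 2);

    uint16_t liczbaBitow = bpp[0] | (bpp[1] << 8);   // BMP integers are little-endian
    std::cout << "liczba bitow " << liczbaBitow << '\n';
}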

Splitting a char array into a sequence of ints and floats

I'm writing a program in C++ to listen to a stream of TCP messages from another program that gives tracking data from a webcam. I have the socket connected and I'm getting all the information in, but I'm having difficulty splitting it up into the data I want.
Here's the format of the data coming in:
8 byte header:
4 character string,
integer
32 byte message:
integer,
float,
float,
float,
float,
float
This is all being put into a char array called buffer. I need to be able to parse the bytes out into the primitives I need. I have tried making smaller sub-arrays, such as headerString, filled by looping through and copying the first 4 elements of the buffer array, and I do get the correct header ('CCV ') printed out. But when I try the same thing with the next four elements (to get the integer) and print it out, I get weird ASCII characters. I've also tried converting the headerInt array to an integer with the atoi method from stdlib.h, but it always prints zero.
I've already done this in Python using the excellent unpack method; is there any alternative in C++?
Any help greatly appreciated,
Jordan
Links
CCV packet structure
Python unpack method
The buffer only contains the raw image of what you read over the network. You'll have to convert the bytes in the buffer to whatever format you want. The string is easy:
std::string s(buffer + sOffset, 4);
(Assuming, of course, that the internal character encoding is the same as in the file: probably some extension of ASCII.)
The others are more complicated, and depend on the format of the external data. From the description of the header, I gather that the integers are four bytes, but that still doesn't tell me anything about their representation. Depending on the case, either:
int getInt(unsigned char* buffer, int offset)
{
    // Big-endian: the first byte is the most significant
    return (buffer[offset    ] << 24)
         | (buffer[offset + 1] << 16)
         | (buffer[offset + 2] << 8)
         | (buffer[offset + 3]);
}
or
int getInt(unsigned char* buffer, int offset)
{
    // Little-endian: the first byte is the least significant
    return (buffer[offset + 3] << 24)
         | (buffer[offset + 2] << 16)
         | (buffer[offset + 1] << 8)
         | (buffer[offset    ]);
}
will probably do the trick. (Other four-byte representations of integers are possible, but they are exceedingly rare. Similarly, the conversion of the unsigned result of the shifts and ors into an int is implementation-defined, but in practice the above will work almost everywhere.)
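For instance, to pull the integer out of the 8-byte header (a usage sketch; 'CCV ' occupies bytes 0 through 3, so the integer starts at offset 4):

int headerInt = getInt(reinterpret_cast<unsigned char*>(buffer), 4);   // bytes 4..7 of the header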
The only hint you give concerning the representation of the floats is in the message format: 32 bytes, minus a 4-byte integer, leaves 28 bytes for 5 floats; but 28 isn't divisible by 5, so I cannot even guess at the length of the floats (except that there must be some padding in there somewhere). Converting floating point can be more or less complicated, depending on whether the external format is exactly like the internal format.
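If the floats turn out to be 4-byte IEEE 754 in the same byte order as the host (an assumption that has to be checked against the sender), one common conversion is to memcpy the bytes into a float:

#include <cstring>

float getFloat(const unsigned char* buffer, int offset)
{
    // Assumes the external format matches the host's own 4-byte float layout.
    float f;
    std::memcpy(&f, buffer + offset, sizeof(f));
    return f;
}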
Something like this may work:
struct Header {
    char string[4];
    int integers[2];
    float floats[5];
};

Header* header = (Header*)buffer;
You should check that sizeof(Header) == 32.

What is the correct way to output hex data to a file?

I've read about [ostream] << hex << 0x[hex value], but I have some questions about it
(1) I defined my file stream, output, to be a hex output file stream using output.open("BWhite.bmp", ios::binary);. Since I did that, does the hex parameter in the output << operation become redundant?
(2)
If I have an integer value I wanted to store in the file, and I used this:
int i = 0;
output << i;
would i be stored in little-endian or big-endian form? Will the endianness change based on which computer the program is executed or compiled on?
Does the size of this value depend on the computer it's run on? Would I need to use the hex parameter?
(3) Is there a way to output raw hex digits to a file? If I want the file to have the hex digit 43, what should I use?
output << 0x43 and output << hex << 0x43 both output ASCII 4, then ASCII 3.
The purpose of outputting these hex digits is to make the header for a .bmp file.
The formatted output operator << is for just that: formatted output. It's for strings.
As such, the std::hex stream manipulator tells streams to output numbers as strings formatted as hex.
If you want to output raw binary data, use the unformatted output functions only, e.g. basic_ostream::put and basic_ostream::write.
You could output an int like this:
int n = 42;
output.write(reinterpret_cast<const char*>(&n), sizeof(n));
The endianness of this output will depend on the architecture. If you wish to have more control, I suggest the following:
int32_t n = 42;
char data[4];
data[0] = static_cast<char>(n & 0xFF);
data[1] = static_cast<char>((n >> 8) & 0xFF);
data[2] = static_cast<char>((n >> 16) & 0xFF);
data[3] = static_cast<char>((n >> 24) & 0xFF);
output.write(data, 4);
This sample will output a 32 bit integer as little-endian regardless of the endianness of the platform. Be careful converting that back if char is signed, though.
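Reading the value back is symmetric (a sketch assuming a corresponding input stream, e.g. a std::ifstream named input; assembling from unsigned bytes avoids sign-extension surprises):

unsigned char data[4];
input.read(reinterpret_cast<char*>(data), 4);
int32_t n = int32_t(data[0])
          | (int32_t(data[1]) << 8)
          | (int32_t(data[2]) << 16)
          | (int32_t(data[3]) << 24);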
You say
"Is there a way to output raw hex digits to a file? If I want the file to have the hex digit 43, what should I use? "
"Raw hex digits" will depend on the interpretation you do on a collection of bits. Consider the following:
Binary  : 0 1 0 0 1 0 1 0
Hex     : 4 A
Octal   : 1 1 2
Decimal : 7 4
ASCII   : J
All of the above represent the same numeric quantity; we just interpret it differently.
So you simply need to store the data in binary format, that is, the exact bit pattern that represents the number.
EDIT 1:
When you open a file in text mode and write a number to it, say 74 (as in the example above), it will be stored as the two ASCII characters '7' and '4'. To avoid this, open the file in binary mode (ios::binary) and write it with write(). Check http://courses.cs.vt.edu/~cs2604/fall00/binio.html#write
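For instance (a sketch contrasting the two modes; the file names are hypothetical):

#include <fstream>

int main()
{
    std::ofstream text("as_text.txt");               // text mode
    text << 74;                                      // stores the two characters '7' and '4'

    std::ofstream bin("as_binary.bin", std::ios::binary);
    char byte = 74;                                  // the bit pattern 01001010 (0x4A)
    bin.write(&byte, 1);                             // stores the single byte 0x4A
}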
The purpose of outputting these hex digits is to make the header for a .bmp file.
You seem to have a large misconception of how files work.
The stream operator << generates text (human-readable output). The .bmp file format is a binary format that is not human readable (well, it is, but it's not nice, and I would not read it without tools).
What you really want to do is generate binary output and place it in the file:
char x = 0x43;
output.write(&x, sizeof(x));
This will write one byte of data with the hex value 0x43 to the output stream. This is the binary representation you want.
would i be stored in little-endian or big-endian form? Will the endianness change based on which computer the program is executed or compiled on?
Neither; you are again outputting text (not binary data).
int i = 0;
output.write(reinterpret_cast<char*>(&i), sizeof(i)); // Writes the binary representation of i
Here you do need to worry about the endianness (and size) of the integer value, and these will vary depending on the hardware that you run your application on. For the value 0 there is not much to worry about regarding endianness, but you should still worry about the size of the integer.
I would stick some asserts into my code to validate that the architecture is OK for the code, and then let people worry about it if their architecture does not match the requirements:
int test = 0x12345678;
assert((sizeof(test) * CHAR_BIT == 32) && "BMP uses 32 bit ints");
assert((((char*)&test)[0] == 0x78) && "BMP uses little endian");
There is a family of functions that will help you with endianness and size.
http://www.gnu.org/s/hello/manual/libc/Byte-Order.html
Function: uint32_t htonl (uint32_t hostlong)
This function converts the uint32_t integer hostlong from host byte order to network byte order.
// Writing to a file
uint32_t hostValue = 0x12345678;
uint32_t network = htonl(hostValue);
output.write(reinterpret_cast<const char*>(&network), sizeof(network));

// Reading from a file
uint32_t network;
input.read(reinterpret_cast<char*>(&network), sizeof(network));
uint32_t hostValue = ntohl(network); // convert back to the platform-specific value

// Unfortunately the BMP format was designed with Intel in mind,
// so all of its integers are little-endian, while network byte order
// (as used by htonl() and family) is big-endian.
// So this may not be of much use to you here.
One last thing: when you open a file in binary format, output.open("BWhite.bmp", ios::binary), it does nothing to the stream apart from how it treats the end-of-line sequence. When the file is in binary format, the output is not modified: what you put in the stream is what is written to the file. If you leave the stream in text mode, then '\n' characters are converted to the end-of-line sequence (an OS-specific set of characters that defines the end of a line). Since you are writing a binary file, you definitely do not want any interference with the characters you write, so binary is the correct format. But it does not affect any other operation that you perform on the stream.