Bit reading puzzle (reading a binary file in C++)

Bit reading puzzle (reading a binary file in C++) - c++

I am trying to read the file 'train-images-idx3-ubyte', which can be found here along with the corresponding file format description (at the bottom of the webpage). When I look at the bytes with od -t x1 train-images-idx3-ubyte | less (hexadecimal, bytewise), I get the following output:
adress bytes
0000000 00 00 08 03 00 00 ea 60 00 00 00 1c 00 00 00 1c
0000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
...
This is what I expected according to 1. But when I try to read the data with C++ I've got a problem. What I do is this:
std::fstream trainingData("minst/train-images-idx3-ubyte",
std::ios::in | std::ios::binary);
int8_t zero = 0, encoding = 0, dimension = 0;
int32_t samples = -1;
trainingData >> zero >> zero >> encoding >> dimension;
trainingData >> samples;
debugLogger << "training set image file, encoding = "
<< (int) encoding << ", dimension = "
<< (int) dimension << ", items = " << (int) samples << "\n";
But the output of these few lines of code is:
training set image file, encoding = 8, dimension = 3, items = 0
Everything but the number of instances (items, samples) is correct. I tried reading the next 4 bytes as int8_t and that gave me at least the same result as od. I cannot imagine how samples can be 0. What I actually wanted to read here was 10,000. Maybe you've got a clue?

As mentioned in other answers, you need to use unformatted input, i.e. istream::read(...) instead of operator>>. Translating your code above to use read yields:
trainingData.read(reinterpret_cast<char*>(&zero), sizeof(zero));
trainingData.read(reinterpret_cast<char*>(&zero), sizeof(zero));
trainingData.read(reinterpret_cast<char*>(&encoding), sizeof(encoding));
trainingData.read(reinterpret_cast<char*>(&dimension), sizeof(dimension));
trainingData.read(reinterpret_cast<char*>(&samples), sizeof(samples));
Which gets you most of the way there - but 00 00 ea 60 looks like it's in Big-endian format, so you'll have to pass it through ntohl to make sense of it if you're running on an intel-based machine:
samples = ntohl(samples);
which gives encoding = 8, dimension = 3, items = 60000.

The input is formatted, which will result in you reading wrong results from the file. Reading from an unformatted input will provide the correct results.

Related

converting a string read from binary file to integer

I have a binary file. i am reading 16 bytes at a time it using fstream.
I want to convert it to an integer. I tried atoi. but it didnt work.
In python we can do that by converting to byte stream using stringobtained.encode('utf-8') and then converting it to int using int(bytestring.hex(),16). Should we follow such an elloborate steps as done in python or is there a way to convert it directly?
ifstream file(binfile, ios::in | ios::binary | ios::ate);
if (file.is_open())
{
size = file.tellg();
memblock = new char[size];
file.seekg(0, ios::beg);
while (!file.eof())
{
file.read(memblock, 16);
int a = atoi(memblock); // doesnt work 0 always
cout << a << "\n";
memset(memblock, 0, sizeof(memblock));
}
file.close();
Edit:
This is the sample contents of the file.
53 51 4C 69 74 65 20 66 6F 72 6D 61 74 20 33 00
04 00 01 01 00 40 20 20 00 00 05 A3 00 00 00 47
00 00 00 2E 00 00 00 3B 00 00 00 04 00 00 00 01
I need to read it as 16 byte i.e. 32 hex digits at a time.(i.e. one row in the sample file content) and convert it to integer.
so when reading 53 51 4C 69 74 65 20 66 6F 72 6D 61 74 20 33 00, i should get, 110748049513798795666017677735771517696
But i couldnt do it. I always get 0 even after trying strtoull. Am i reading the file wrong, or what am i missing.

You have a number of problems here. First is that C++ doesn't have a standard 128-bit integer type. You may be able to find a compiler extension, see for example Is there a 128 bit integer in gcc? or Is there a 128 bit integer in C++?.
Second is that you're trying to decode raw bytes instead of a character string. atoi will stop at the first non-digit character it runs into, which 246 times out of 256 will be the very first byte, thus it returns zero. If you're very unlucky you will read 16 valid digits and atoi will start reading uninitialized memory, leading to undefined behavior.
You don't need atoi anyway, your problem is much simpler than that. You just need to assemble 16 bytes into an integer, which can be done with shifting and or operators. The only complication is that read wants a char type which will probably be signed, and you need unsigned bytes.
ifstream file(binfile, ios::in | ios::binary);
char memblock[16];
while (file.read(memblock, 16))
{
uint128_t a = 0;
for (int i = 0; i < 16; ++i)
{
a = (a << 8) | (static_cast<unsigned int>(memblock[i]) & 0xff);
}
cout << a << "\n";
}
file.close();

It the number is binary what you want is:
short value ;
file.read(&value, sizeof (value));
Depending upon how the file was written and your processor, you may have to reverse the bytes in value using bit operations.

Having trouble getting the output of Linux 'dd' command in C++ program

I'm trying to use the sudo dd if=/dev/sda ibs=1 count=64 skip=446 command to get the partition table information from the master boot record in order to parse it I'm basically trying to read the output to a string in order to parse it, but all I'm getting is the following: � !. What I'm expecting is:
80 01 01 00 83 FE 3F 01 3F 00 00 00 43 7D 00 00
00 00 01 02 83 FE 3F 0D 82 7D 00 00 0C F1 02 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
My current code looks like this, and is just taken from here: How to execute a command and get output of command within C++ using POSIX?
#include <iostream>
#include <stdexcept>
#include <stdio.h>
#include <string>
using namespace std;
string exec(const char* cmd) {
char buffer[128];
string result = "";
FILE* pipe = popen(cmd, "r");
if (!pipe) throw std::runtime_error("popen() failed!");
try {
while (!feof(pipe)) {
if (fgets(buffer, 128, pipe) != NULL)
result += buffer;
}
} catch (...) {
pclose(pipe);
throw;
}
pclose(pipe);
return result;
}
int main() {
string s = exec("sudo dd if=/dev/sda ibs=1 count=64 skip=446");
cout << s;
}
Obviously I'm doing something wrong, but I can't figure out the problem. How do I get the proper output into my string?

while (!feof(pipe)) {
This is your first bug.
result += buffer;
This is your second bug. buffer is a char array, which decays to a char * in this context. As you know, a char * in a string context gets typically interpreted as a C-style string that's terminated by a '\0' byte.
You might've noticed that you expect to get a bunch of 00 bytes read. Well, after the char array gets decayed to a char *, everything up to the first 00 byte is going to get appended to your result, rather than the 128 bytes, exactly. And if there were no 00 bytes in those 128 bytes, you'll probably end up getting some random garbage, as an extra bonus, with a small possibility of a crash.
if (fgets(buffer, 128, pipe) != NULL)
This is your third bug. If the read data happens to include a 0A byte, an '\n' character, this is not going to read 128 bytes.
cout << s;
This is your fourth bug. Since the data will (after all the other bugs are fixed) presumably contain binary stuff, your terminal is inlikely to have much success displaying various bytes, especially bytes 00 through 1F.
To fix your code you will need to:
Correctly handle the end-of-file condition.
Correctly read binary data. fgets(), et al, are completely unsuitable for the task. If you insist on using C file structures, your only reasonable option is to use fread().
Correctly assemble a std::string from a blob of binary data. Merely appending a char buffer to it, crossing your fingers, and hoping for the best, will not work. You will most likely need to use the two-argument std::string constructor, that takes a beginning and an ending iterator value as parameters.
Display binary data correctly, instead of just dumping the entire blob to std::cout, just like that. The most common approach is a std::hex manipulator, and diligent up-conversion of each char to an int, as an unsigned value.

Weird hexadecimal format

Following hexdump shows some data made by device i have on my hands. It stores year, month, day, hour, minute, seconds, and lenght in weird way for me (4 bytes marks for single digit in reverse order).
de 07 00 00 01 00 00 00 16 00 00 00 10 00 00 00
24 00 00 00 1d 00 00 00 15 00 00 00 X X X X
For example:
Year is marked as "000007de" aka 0x07de (=2014). Now; problem i am having is how to properly handle this in c/c++. (first 4 bytes)
How do i read those 4 bytes with "reverse" order to make proper hexadecimal for handling afterwards with like ints/longs?

If you read the value as int on the same architecture it has been generated with then you don't need to do anything, as this is the natural format for your system.
You only need to do something about this if you want to read it on a different architecture, with a different binary format.
So you can read it simply with
int32_t n;
fread(&n, sizeof int32_t, 1, FILE);
Of course the file has to be opened in binary mode and you need a 32 bit int.

If you read it in the reverse order, you can then change the endianness with something like:
uint32_t before = 0xde070000;
uint32_t after = ((before<<24) & 0xff000000) |
((before<<8) & 0xff0000) |
((before>>8) & 0xff00) |
((before>>24) & 0xff);
Edit: as pointed out in comments, this is only defined for unsigned 32-bits conversions.

Binary File interpretation

I am reading in a binary file (in c++). And the header is something like this (printed in hexadecimal)
43 27 41 1A 00 00 00 00 23 00 00 00 00 00 00 00 04 63 68 72 31 FFFFFFB4 01 00 00 04 63 68 72 32 FFFFFFEE FFFFFFB7
when printed out using:
std::cout << hex << (int)mem[c];
Is there an efficient way to store 23 which is the 9th byte(?) into an integer without using stringstream? Or is stringstream the best way?
Something like
int n= mem[8]
I want to store 23 in n not 35.

You did store 23 in n. You only see 35 because you are outputting it with a routine that converts it to decimal for display. If you could look at the binary data inside the computer, you would see that it is in fact a hex 23.
You will get the same result as if you did:
int n=0x23;
(What you might think you want is impossible. What number should be stored in n for 1E? The only corresponding number is 31, which is what you are getting.)

Do you mean you want to treat the value as binary-coded decimal? In that case, you could convert it using something like:
unsigned char bcd = mem[8];
unsigned char ones = bcd % 16;
unsigned char tens = bcd / 16;
if (ones > 9 || tens > 9) {
// handle error
}
int n = 10*tens + ones;

Accessing specific binary information based on binary format documentation

I have a binary file and documentation of the format the information is stored in. I'm trying to write a simple program using c++ that pulls a specific piece of information from the file but I'm missing something since the output isn't what I expect.
The documentation is as follows:
Half-word Field Name Type Units Range Precision
10 Block Divider INT*2 N/A -1 N/A
11-12 Latitude INT*4 Degrees -90 to +90 0.001
There are other items in the file obviously but for this case I'm just trying to get the Latitude value.
My code is:
#include <cstdlib>
#include <iostream>
#include <fstream>
using namespace std;
int main(int argc, char* argv[])
{
char* dataFileLocation = "testfile.bin";
ifstream dataFile(dataFileLocation, ios::in | ios::binary);
if(dataFile.is_open())
{
char* buffer = new char[32768];
dataFile.seekg(10, ios::beg);
dataFile.read(buffer, 4);
dataFile.close();
cout << "value is << (int)(buffer[0] & 255);
}
}
The result of which is "value is 226" which is not in the allowed range.
I'm quite new to this and here's what my intentions where when writing the above code:
Open file in binary mode
Seek to the 11th byte from the start of the file
Read in 4 bytes from that point
Close the file
Output those 4 bytes as an integer.
If someone could point out where I'm going wrong I'd sure appreciate it. I don't really understand the (buffer[0] & 255) part (took that from some example code) so layman's terms for that would be greatly appreciated.
Hex Dump of the first 100 bytes:
testfile.bin 98,402 bytes 11/16/2011 9:01:52
-0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -A -B -C -D -E -F
00000000- 00 5F 3B BF 00 00 C4 17 00 00 00 E2 2E E0 00 00 [._;.............]
00000001- 00 03 FF FF 00 00 94 70 FF FE 81 30 00 00 00 5F [.......p...0..._]
00000002- 00 02 00 00 00 00 00 00 3B BF 00 00 C4 17 3B BF [........;.....;.]
00000003- 00 00 C4 17 00 00 00 00 00 00 00 00 80 02 00 00 [................]
00000004- 00 05 00 0A 00 0F 00 14 00 19 00 1E 00 23 00 28 [.............#.(]
00000005- 00 2D 00 32 00 37 00 3C 00 41 00 46 00 00 00 00 [.-.2.7.<.A.F....]
00000006- 00 00 00 00 [.... ]

Since the documentation lists the field as an integer but shows the precision to be 0.001, I would assume that the actual value is the stored value multiplied by 0.001. The integer range would be -90000 to 90000.
The 4 bytes must be combined into a single integer. There are two ways to do this, big endian and little endian, and which you need depends on the machine that wrote the file. x86 PCs for example are little endian.
int little_endian = buffer[0] | buffer[1]<<8 | buffer[2]<<16 | buffer[3]<<24;
int big_endian = buffer[0]<<24 | buffer[1]<<16 | buffer[2]<<8 | buffer[3];
The &255 is used to remove the sign extension that occurs when you convert a signed char to a signed integer. Use unsigned char instead and you probably won't need it.
Edit: I think "half-word" refers to 2 bytes, so you'll need to skip 20 bytes instead of 10.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Bit reading puzzle (reading a binary file in C++) - c++

The input is formatted, which will result in you reading wrong results from the file. Reading from an unformatted input will provide the correct results.

Related

converting a string read from binary file to integer

Having trouble getting the output of Linux 'dd' command in C++ program

Weird hexadecimal format

Binary File interpretation

Accessing specific binary information based on binary format documentation

Categories

Resources