Reading hard disk sector raw data - Why hex? - c++

I'm trying to read a hard disk sector to get at the raw data. After searching a lot, I found that some people store that raw sector data in hex and some in char.
Which is better, and why? Which will give me better performance?
I'm writing this in C++, and the OS is Windows.
For clarification:
#include <iostream>
#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main() {
    DWORD nRead;
    char buf[512];

    // Open the first physical drive for raw, sector-level reading.
    // CreateFileA is the explicit ANSI version, so the narrow string
    // literal compiles regardless of the project's UNICODE setting.
    HANDLE hDisk = CreateFileA("\\\\.\\PhysicalDrive0",
                               GENERIC_READ,
                               FILE_SHARE_READ | FILE_SHARE_WRITE,
                               NULL, OPEN_EXISTING, 0, NULL);
    if (hDisk == INVALID_HANDLE_VALUE) {
        std::cerr << "Could not open drive (administrator rights are required)\n";
        return 1;
    }

    // Seek to a sector-aligned offset (0xA00 = 2560 = sector 5)
    // and read one 512-byte sector into the buffer.
    SetFilePointer(hDisk, 0xA00, 0, FILE_BEGIN);
    ReadFile(hDisk, buf, 512, &nRead, NULL);

    for (int currentpos = 0; currentpos < 512; currentpos++) {
        std::cout << buf[currentpos];
    }

    CloseHandle(hDisk);
    std::cin.get();
    return 0;
}
Consider the above code, written by someone else, not me.
Notice the datatype: char buf[512];. The data is stored as char and has not been converted into hex.

Raw data is just "raw data"... you store it as it is; you do not convert it. So there is no performance issue here. At most, the difference is in how you represent the raw data in human-readable form. In general:
representing it in char format makes it easier to see whether there is some text contained in it,
while hex is better for representing numeric data (in case it follows some kind of pattern).
In your specific case: char just means 1 byte, so you can be sure you are storing your data in a 512-byte buffer. Allocating that space in terms of an integer type just makes things unnecessarily complicated.
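To see the difference in representation, here is a minimal sketch (the buffer contents are invented) that dumps the same bytes in both char and hex form:

#include <cstdio>

int main() {
    // A few invented raw bytes: some printable text, some binary values.
    unsigned char buf[8] = { 'H', 'i', '!', 0x00, 0x7F, 0xA0, 0x0D, 0x0A };

    // Char view: printable bytes show as themselves, everything else as '.'.
    for (unsigned char b : buf)
        std::putchar((b >= 0x20 && b < 0x7F) ? b : '.');
    std::putchar('\n');

    // Hex view: every byte is shown unambiguously as two hex digits.
    for (unsigned char b : buf)
        std::printf("%02X ", (unsigned)b);
    std::putchar('\n');
}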

You have got yourself confused.
The data on a disk is stored as binary: just a very long stream of ones and zeros.
The reason it is shown in hex or char format is that those are easier for humans to work with.
decimal: 36
hex: 24
binary: 100100
char: $ (one way of representing this value, using the ASCII table)
The binary is the raw bit stream you would read from the disk or memory. Hex is a shorthand representation for it; they are completely interchangeable, since one hex digit simply represents four bits. The decimal form is just yet another way to write the same value.
The char, however, is a little tricky. Using the standard ASCII interpretation, the byte value 36 is the character '$'. But you could equally invent your own mapping, say the characters 0-9 for the values 0-9 and then a-z for the values 10-35, the way base-36 notation does; under that scheme, 'z' would represent 35.
As to why 'char' is used when dealing with bytes: the C++ 'char' type is just a single byte (which is normally 8 bits).
I will also point out the problem with negative numbers. When an integer is signed (can hold positive and negative values), the most significant bit carries a large negative weight, such that if all the bits are one, the value represents -1 (this is two's complement). For example, with four bits, so it is easy to see...
0010 = +2
1000 = -8
0110 = +6
1110 = -2
The key to this problem is that it is all just how you interpret/represent the binary values. The same sequence of bits can be represented more or less any way you want.
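A quick sketch of that point: the same 8-bit pattern printed under several interpretations.

#include <cstdio>

int main() {
    unsigned char byte = 0xF0;   // the bit pattern 11110000

    std::printf("binary:   ");
    for (int bit = 7; bit >= 0; --bit)
        std::putchar(((byte >> bit) & 1) ? '1' : '0');
    std::putchar('\n');

    std::printf("hex:      %02X\n", (unsigned)byte);        // F0
    std::printf("unsigned: %u\n", (unsigned)byte);          // 240
    std::printf("signed:   %d\n", (int)(signed char)byte);  // -16 on two's complement machines
}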

I am guessing you're talking about the final data being written to some file. The reason to use hex is that it's easier to read and harder to mess up. Generally, if someone is doing some sort of human analysis on the sector, they're going to open the raw data in a hex editor anyway, so if you output it as hex you skip the need for a separate hex viewer/editor.
On DOS/Windows, for instance, you have to make sure you open the file in binary mode if you're going to write raw bytes. You also might have to make sure the operating system doesn't mess with the character data anywhere in between.
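For example, a minimal sketch (the filename is invented) of what that binary flag looks like in C++; without it, a Windows text-mode stream would translate byte values that happen to look like line endings:

#include <fstream>

int main() {
    char buf[512] = {};   // pretend this holds raw sector data

    // std::ios::binary suppresses the platform's LF <-> CRLF translation,
    // so the 512 bytes land in the file exactly as they are in memory.
    std::ofstream out("sector.bin", std::ios::binary);
    out.write(buf, sizeof buf);
}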

Related

Read file as bytes and store into an array of deterministically 8-bit values

Coming back to play around with C++ a little after some years out of college, I looked up how to read a file as bytes in C++. What I came across is that there isn't any sort of magical "readAsBytes" function; you essentially read the file the same way you would a text file, but make sure to store the results through a char*. For instance:
someIFStream.read(someCharPointer, sizeOfSomeCharPointer);
That being said, even though a char in C++ is usually 8 bits, that isn't actually guaranteed. Mess around with enough different platforms and text encodings and you're going to run into issues if you want a true array of bytes.
You could just use a uint8_t* and copy everything over from the char*... but dang, that's wasteful. Why can't we just get everything into a uint8_t* the first time around, while we're still reading the file, in a way that doesn't have to worry about whether it's a 32-bit machine or a 64-bit machine, or UTF-8 or UTF-16, or what have you?
So the question is: is this possible, at least in more modern C++ versions? If so, how? The reason I don't want to go from a char* to a uint8_t* is to avoid wasting a bunch of CPU cycles on some 50,000-iteration for loop. Thanks!
EDIT
I'm defining a byte as 8 bits for the purposes of this question, unless somebody strongly suggests otherwise. My understanding is that bytes were originally 6 bits, then became 7, and then finally settled down on 8, but that 32-bit groupings and such are usually thought of as small collections of bytes. If I'm mistaken, or if I should think of this problem differently (either way), please bring it up.
A char is one byte, and a file is a sequence of bytes. It doesn't matter whether the machine is 32-bit or 64-bit or something else, and it doesn't matter whether text is stored in UTF-8 or UTF-16 or something else. A file contains bytes, and each byte fits in a char. This is required by the standard.
What can vary is how many bits are in a byte on a particular platform. If it's 8, then char is the same as uint8_t (aside from signedness, which doesn't affect how the data is stored) and you can just read bytes directly into a uint8_t. But if a byte is, say, 10 bits, you're going to have to cast all those chars in a loop, since reading from the file gives you a sequence of 10-bit bytes and you need to chop off two bits from each one.
If you want your program to be adaptable to different byte sizes, you could use #if CHAR_BIT == 8 to decide whether to read straight into a uint8_t array, or to read into a char array and then cast each byte to uint8_t afterward.
Since you're "coming back to C++" and concerned about UTF-8 vs. UTF-16 when reading raw char data from a file, I'm guessing you're accustomed to languages like Java and C# where the char type represents a Unicode character. That's not the case in C and C++. A char is a byte, and if you read, say, a multi-byte UTF-8 character from a file, you get each individual byte as a separate char, not the whole Unicode character as a single value.

Why and how should I write and read from binary files?

I'm coding a game project as a hobby and I'm currently in the part where I need to store some resource data (.BMPs for example) into a file format of my own so my game can parse all of it and load into the screen.
For reading BMPs, I read the header, and then the RGB data for each pixel, and I have an array[width][height] that stores these values.
I was told I should save this type of data in binary, but not the reason why. I've read about binary and what it is (the 0-1 representation of data), but why should I use it to save .BMP data, for example?
If I'm going to read it later in the game, doesn't that just add more complexity and maybe even slow down the loading process?
And lastly, if it is better to save in binary (I'm guessing it is, seeing as everyone seems to do so in the game resource files I researched), how do I read and write binary in C++?
I've seen lots of questions, but with many different approaches for many different types of variables, so I'm asking: which is the best/most C++-ish way of doing it?
You have it all backwards. A computer processor operates with data at the binary level. Everything in a computer is binary. To deal with data in human-readable form, we write functions that jump through hoops to make that binary data look like something that humans understand. So if you store your .BMP data in a file as text, you're actually making the computer do a whole lot more work to convert the .BMP data from its natural binary form into text, and then from its text form back into binary in order to display it.
The truth of the matter is that the more you can handle data in its raw binary form, the faster your code will run. Fewer conversions means faster code. But there's obviously a tradeoff: if you need to be able to look at data and understand it without pulling out a magic decoder ring, then you might want to store it in a file as text. In doing so, though, we have to accept that there is conversion processing to be done to make that human-readable text meaningful to the processor, which, as I said, operates on nothing but pure binary data.
And, just in case you already knew that or sort-of-knew-it, and your question was really "why should I open my .bmp file in binary mode and not in text mode": the reason is that opening a file in text mode asks the platform to perform CRLF-to-LF conversions ("\r\n"-to-"\n" conversions), as appropriate for the platform, so that at the internal string-processing level all you deal with is '\n' characters. If your file consists of binary data, you don't want that conversion going on, or it will corrupt the data as you read it. Most of the data will come through fine, and things may appear to work most of the time, but occasionally you'll run across a pair of bytes with the hexadecimal values 0x0D,0x0A (decimal 13,10) that gets converted to just 0x0A (10), and you'll be missing a byte in the data you read. Therefore, be sure to open binary files in binary mode!
OK, based on your most recent comment (below), here's this:
As you (now?) understand, data in a computer is stored in binary format. Yes, that means it's in 0's and 1's. However, when programming, you don't actually have to fiddle with the 0's and 1's yourself, unless you're doing bitwise logical operations for some reason. A variable of type int, let's say, is a collection of individual bits, each of which can be either 0 or 1. It's also a collection of bytes, and assuming 8 bits in a byte, an int generally has 2, 4, or 8 bytes, depending on your platform and compiler options. But you work with that int as an int, not as individual 0's and 1's. If you write that int out to a file in its purest form, the bytes (and thus the bits) get written out in unconverted raw form. But you could also convert it to ASCII text and write it out that way. If you're displaying an int on the screen, you don't want to see the individual 0's and 1's of course, so you print it in its ASCII form, generally decoded as a decimal number. You could just as easily print that same int in its hexadecimal form, and the result would look different even though it's the same number. For example, in decimal you might have the value 65. That same value in hexadecimal is 0x41 (or just 41, if we understand that it's in base 16). That same value is the letter 'A' if we display it in ASCII form (and consider only the low byte of the 2-, 4-, or 8-byte int, i.e. treat it as a char).
For the rest of this discussion, forget that we were talking about an int and consider that we're discussing a char, or 1 byte (8 bits). Let's say we still have that same value: 65, or 0x41, or 'A', however you want to look at it. If you want to send that value to a file, you can send it in its raw form, or you can convert it to text form. If you send it in its raw form, it will occupy one byte (8 bits) in the file. But if you want to write it to the file in text form, you convert it to ASCII, and depending on the format you want to write it in and the actual value (65 in this case), it will occupy 1, 2, or 3 bytes. Say you want to write it in decimal ASCII with no padding characters: the value 65 will then take 2 bytes, one for the '6' and one for the '5'. If you want to print it in hexadecimal form, it will still take 2 bytes, one for the '4' and one for the '1', unless you prepend it with "0x", in which case it will take 4 bytes: one for '0', one for 'x', one for '4', and another for '1'. Or suppose your char holds the value 255 (the maximum value of an unsigned char): written to the file in decimal ASCII form, it will take 3 bytes, but written in hexadecimal ASCII form it will still take 2 bytes (or 4, if we're prepending "0x"), because 255 in hexadecimal is 0xFF. Compare this to writing that 8-bit byte (char) in its raw binary form: a char takes 1 byte (by definition), so it will consume only 1 byte of the file regardless of its value.
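To tie that back to the original question, here is a minimal hedged sketch (filename and pixel values invented) of writing and reading data in its raw binary form with C++ streams:

#include <cstdint>
#include <fstream>
#include <vector>

int main() {
    std::vector<std::uint8_t> pixels = { 255, 0, 0,   0, 255, 0 };  // two made-up RGB pixels

    // Write the bytes exactly as they sit in memory: no text conversion.
    std::ofstream out("pixels.dat", std::ios::binary);
    out.write(reinterpret_cast<const char*>(pixels.data()),
              static_cast<std::streamsize>(pixels.size()));
    out.close();

    // Read them straight back; again, no parsing or decoding step.
    std::vector<std::uint8_t> loaded(pixels.size());
    std::ifstream in("pixels.dat", std::ios::binary);
    in.read(reinterpret_cast<char*>(loaded.data()),
            static_cast<std::streamsize>(loaded.size()));
}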

Why is floating point byte swapping different from integer byte swapping?

I have a binary file of doubles that I need to load using C++. However, my problem is that it was written in big-endian format but the fstream >> operator will then read the number wrong because my machine is little-endian. It seems like a simple problem to resolve for integers, but for doubles and floats the solutions I have found won't work. How can I (or should I) fix this?
I read this as a reference for integer byte swapping:
How do I convert between big-endian and little-endian values in C++?
EDIT: Though these answers are enlightening, I have found that my problem is with the file itself and not the format of the binary data. I believe my byte swapping does work, I was just getting confusing results. Thanks for your help!
The most portable way is to serialize in textual format so that you don't have byte order issues. This is how operator>> works so you shouldn't be having any endian issues with >>. The principal problem with binary formats (which would explain endian problems) is that floating point numbers consist of a number of mantissa bits, a number of exponent bits and a sign bit. The exponent may use an offset. This mean that a straight byte re-ordering may not be sufficient, depending on the source and target format.
If you are using IEEE-754 on both machines, then you may be OK with a straight byte reversal, as that standard specifies a bit-string interchange format that should be portable (byte-order issues aside).
If you have to convert between two machine architectures and you have to use a raw byte memory dump, then so long as the basic number format is the same (i.e. they have the same bit counts in each part of the number), you can read the data into an array of unsigned char, use some basic byte and bit swapping routines to correct the storage format and then copy the raw bytes into a variable of the appropriate type.
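A minimal sketch of that approach (assuming 8-byte IEEE-754 doubles on both machines, so a plain reversal suffices):

#include <algorithm>
#include <cstring>

// Flip a double between big- and little-endian byte order.
double swap_double_endianness(double value) {
    unsigned char bytes[sizeof(double)];
    std::memcpy(bytes, &value, sizeof bytes);    // copy the raw bytes out
    std::reverse(bytes, bytes + sizeof bytes);   // reverse their order
    std::memcpy(&value, bytes, sizeof bytes);    // copy them back in
    return value;
}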
The standard conversion operators do not work with binary data, so it's not clear how you got where you are.
However, since byte swapping operates on bytes, not numbers, you perform it on data destined to become floats just as you do on data destined to become integers.
And since text is so inefficient and floating-point data sets tend to be so large, it's entirely reasonable to want this.
int32_t raw_bytes;
// Raw read: just 32 bits of bytes, not a formatted integer
// (operator>> would try to parse text instead).
stream.read(reinterpret_cast<char*>(&raw_bytes), sizeof raw_bytes);
my_byte_swap(raw_bytes);                // swap 'em
float f;
std::memcpy(&f, &raw_bytes, sizeof f);  // reinterpret the bits as a float
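my_byte_swap above is left as a placeholder by the answer; one way it might look (a sketch, not the answer's own code):

#include <cstdint>

// Reverse the four bytes of a 32-bit value in place.
void my_byte_swap(std::int32_t& v) {
    std::uint32_t u = static_cast<std::uint32_t>(v);
    u = (u >> 24)
      | ((u >> 8) & 0x0000FF00u)
      | ((u << 8) & 0x00FF0000u)
      | (u << 24);
    v = static_cast<std::int32_t>(u);
}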

Integer Types in file formats

I am currently trying to learn some more in-depth stuff about file formats.
I have a spec for a 3D file format (U3D in this case) and I want to try to implement it. Nothing serious, just for the learning effect.
My problem starts very early, with the types that need to be defined. I have to define different integers (8-bit, 16-bit, 32-bit, unsigned and signed), and these then need to be converted to hex before being written to a file.
How do I define these types, since I can't just create an I16, for instance?
Another problem for me is how to convert such an I16 to a hex number with 8 digits
(i.e. 0001 0001).
Hex is just a representation of a number. Whether you interpret the number as binary, decimal, hex, octal etc is up to you. In C++ you have support for decimal, hex, and octal representations, but they are all stored in the same way.
Example:
int x = 0x1;
int y = 1;
assert(x == y);
Likely the file format wants you to store the values in ordinary binary format; I don't think it wants the hex numbers as readable text strings. If it does, though, you could use std::hex to do the conversion for you. (Example: file << hex << number;)
If the file format talks about writing more than a 1-byte type to the file, then be careful of the endianness of your architecture, which determines whether you store the most significant byte of a multi-byte type first or last.
It is very common in file format specifications to show you how the binary should look for a given part of the file. Don't confuse this with actually storing binary digits as strings. Likewise, they will sometimes specify in hex how it should look as a shortcut. Again, most of the time they don't actually mean text strings.
The smallest addressable unit in C++ is a char which is 1 byte. If you want to set bits within that byte you need to use bitwise operators like & and |. There are many tutorials on bitwise operators so I won't go into detail here.
If you include <stdint.h> (<cstdint> in C++) you will get types such as:
uint8_t
int16_t
uint32_t
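With those in hand, a hedged sketch (function name invented) of writing a 16-bit value in a fixed big-endian order, regardless of the machine's own endianness:

#include <cstdint>
#include <cstdio>

// Emit the most significant byte first, so the file layout is the same
// no matter what the host architecture's native byte order is.
void write_u16_be(std::FILE* fp, std::uint16_t value) {
    std::fputc((value >> 8) & 0xFF, fp);
    std::fputc(value & 0xFF, fp);
}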
First, let me make sure I understand:
the integers are stored AS TEXT in the file, in hexadecimal format, without the 0x prefix?
Then use this syntax:
fprintf(fp, "%08x", number);
This will write something like 0abc1234 into the file.
As for "define different integers (8-bit, 16-bit, 32-bit, unsigned and signed)": unless you roll your own, along with the concomitant math operations for them, you should stick to the types supplied by your system. See stdint.h for the available typedefs, such as int32_t.

How would one handle bits from a file?

ifstream inStream;
inStream.open(filename.c_str(), fstream::binary);
if (inStream.fail()) {
    cout << "Error in opening file: " << filename;
    exit(1);
}
Let's say we just want to deal with individual bits from the file. I know we can read the file char by char, but can we read it bit by bit just as easily?
Files are typically read in units greater than a bit (usually a byte or above). Even a file holding a single bit would take at least a whole byte (in fact it would occupy multiple bytes on disk, depending on the file system, but its reported length would be in bytes).
However, you could write a wrapper around the stream that provides the next bit on each call, internally reading a character, supplying bits whenever asked, and reading the next character from the file when a request can no longer be filled from the previous character. I assume that you know how to turn a single byte (or char) into a sequence of bits.
Since this is homework, you are probably expected to write this yourself instead of using an existing library.
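The rough shape of such a wrapper might be the following (a sketch only; filling in the details is the exercise):

#include <istream>

// Hands out one bit at a time (MSB first), refilling its one-byte
// buffer from the underlying stream whenever it runs dry.
class BitReader {
    std::istream& in;
    unsigned char byte = 0;
    int bitsLeft = 0;
public:
    explicit BitReader(std::istream& s) : in(s) {}

    // Returns 0 or 1, or -1 at end of file.
    int nextBit() {
        if (bitsLeft == 0) {
            int c = in.get();
            if (c == std::istream::traits_type::eof()) return -1;
            byte = static_cast<unsigned char>(c);
            bitsLeft = 8;
        }
        --bitsLeft;
        return (byte >> bitsLeft) & 1;
    }
};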
You'll have to read from the file byte by byte and then extract bits as needed from each byte you read. There is no way to do I/O at the bit level.
I guess your binary file is a Huffman-encoded, compressed file. You'll have to read this file byte by byte, then extract bits from those bytes using bitwise operators, like so:
unsigned char byte;             // use unsigned char so the shifts behave predictably
// ... read a byte from the file ...
unsigned char mask = 0x80;      // mask for bit extraction (the most significant bit)
int bit = (byte & mask) != 0;   // gives you the most significant bit
byte <<= 1;                     // left-shift the byte by 1 so you can read the next MSB
You can use the bits you read to descend the Huffman tree until you reach a leaf node, at which point you've decoded a symbol.
Depending on what you're doing with the bits, it may be easier to read by the 32-bit word rather than by the byte. In either case you're going to be doing mask-and-shift operations, the specifics of which are left as the proverbial exercise for the reader. :-) Don't be discouraged if it takes several tries; I have to do this sort of thing moderately often and I still get it wrong the first time more often than not.