I am currently trying to learn more about file formats in depth.
I have a spec for a 3D file format (U3D in this case) and I want to try to implement it. Nothing serious, just for the learning effect.
My problem starts very early, with the types that need to be defined. I have to define different integers (8-bit, 16-bit, 32-bit, unsigned and signed), and these then need to be converted to hex before being written to a file.
How do I define these types, since I cannot just create an I16, for example?
Another problem for me is how to convert that I16 to a hex number with 8 digits
(i.e. 0001 0001).
Hex is just a representation of a number. Whether you interpret the number as binary, decimal, hex, octal, etc. is up to you. C++ has support for decimal, hex, and octal representations, but they are all stored in the same way.
Example:
int x = 0x1;
int y = 1;
assert(x == y);
Likely the file format wants you to store the values in ordinary binary format; I don't think it wants the hex numbers as a readable text string. If it does, though, you could use std::hex to do the conversion for you. (Example: file << hex << number;)
If the file format involves writing a type larger than one byte to the file, then be careful about the endianness of your architecture, i.e. whether the most significant byte of the multi-byte type is stored first or last.
It is very common for file format specifications to show you how the binary should look for a given part of the file. Don't confuse this with actually storing the binary digits as strings. Likewise, they will sometimes use hex as a shorthand for how it should look. Again, most of the time they don't actually mean text strings.
The smallest addressable unit in C++ is a char which is 1 byte. If you want to set bits within that byte you need to use bitwise operators like & and |. There are many tutorials on bitwise operators so I won't go into detail here.
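For example, a minimal sketch of setting, clearing and testing individual bits in a byte (the particular masks here are just illustrative):

unsigned char flags = 0;
flags |= 0x01;                 // set bit 0
flags &= ~0x04;                // clear bit 2
bool bit3_set = flags & 0x08;  // test bit 3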
If you include <stdint.h> you will get types such as:
uint8_t
int16_t
uint32_t
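As a sketch of how these pieces might fit together, here is a hypothetical helper that writes a 32-bit value with an explicit byte order; the name and the little-endian choice are assumptions for illustration, so check the U3D spec for the byte order it actually requires:

#include <cstdint>
#include <fstream>

// Write a 32-bit unsigned value in little-endian byte order,
// one byte at a time, independent of the host's endianness.
void write_u32_le(std::ofstream& out, std::uint32_t value)
{
    unsigned char bytes[4] = {
        static_cast<unsigned char>(value & 0xFF),
        static_cast<unsigned char>((value >> 8) & 0xFF),
        static_cast<unsigned char>((value >> 16) & 0xFF),
        static_cast<unsigned char>((value >> 24) & 0xFF)
    };
    out.write(reinterpret_cast<const char*>(bytes), sizeof bytes);
}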
First, let me understand.
The integers are stored AS TEXT in the file, in hexadecimal format, without the 0x prefix?
Then, use this syntax:
fprintf(fp, "%08x", number);
For example, if number holds 0x0abc1234, this will write 0abc1234 into the file.
As for "define different integers (8Bit, 16bit, 32bit unsigned and signed)", unless you roll your own, and concomitant math operations for them, you should stick to the types supplied by your system. See stdint.h for the typedefs available, such as int32_t.
I found an article and saw this:
// Capture vendor string
char vendor[0x20];
memset(vendor, 0, sizeof(vendor));
*reinterpret_cast<int*>(vendor) = data_[0][1];
*reinterpret_cast<int*>(vendor + 4) = data_[0][3];
*reinterpret_cast<int*>(vendor + 8) = data_[0][2];
This line: char vendor[0x20];.
Why is there a hexadecimal value, and may I use an octal value?
Why is there a hexadecimal value?
Because the author chose to use hexadecimal. As you can see, 0x20 is quite "round" in hexadecimal, as it has only one non-zero digit.
May I use an octal value?
Yes. Wherever you can use an integer literal, you can use any of the available base representations. Binary, decimal, octal and hexadecimal are the options.
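For instance (binary literals require C++14 or later):

int a = 32;        // decimal
int b = 0x20;      // hexadecimal
int c = 040;       // octal
int d = 0b100000;  // binary (C++14)
// a, b, c and d all hold the same value, 32.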
P.S. The example is technically broken in standard C++, because it fails to align the buffer, so it is not a good example to use for learning the language. It appears, though, that it was written specifically for x86 processors, which do support unaligned access.
A correct way to write this would have been to use an array of integers, copy the values into it, and then reinterpret the result as characters when reading.
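A minimal sketch of that idea, in the same context as the original fragment (so data_ is assumed to hold the same register values as above):

#include <cstring>

int vendor_regs[3];
vendor_regs[0] = data_[0][1];
vendor_regs[1] = data_[0][3];
vendor_regs[2] = data_[0][2];

char vendor[sizeof vendor_regs + 1] = {};              // extra byte for a terminating '\0'
std::memcpy(vendor, vendor_regs, sizeof vendor_regs);  // view the ints as characters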
I have tried searching around but have not been able to find much about binary literals and endianness. Are binary literals little-endian, big-endian or something else (such as matching the target platform)?
As an example, what is the decimal value of 0b0111? Is it 7? Platform specific? Something else? Edit: I picked a bad value of 7 since it is represented within one byte. The question has been sufficiently answered despite this fact.
Some background: basically I'm trying to figure out what the values of the least significant bits are, and masking them with binary literals seemed like a good way to go... but only if there is some guarantee about endianness.
Short answer: there isn't one. Write the number the way you would write it on paper.
Long answer:
Endianness is never exposed directly in the code unless you really try to get at it (such as with pointer tricks). 0b0111 is 7; the same rules apply as for hex. Writing
int i = 0xAA77;
doesn't mean 0x77AA on some platforms, because that would be absurd. Where would the extra zeros that are missing go anyway with 32-bit ints? Would they get padded on the front and then the whole thing flipped to 0x77AA0000, or would they get added after? I have no idea what someone would expect if that were the case.
The point is that C++ doesn't make any assumptions about the endianness of the machine*; if you write code using the primitives and literals it provides, the behavior will be the same from machine to machine (unless you start circumventing the type system, which you may need to do).
To address your update: the number will be the way you write it out. The bits will not be reordered or any such thing, the most significant bit is on the left and the least significant bit is on the right.
There seems to be a misunderstanding here about what endianness is. Endianness refers to how bytes are ordered in memory and how they must be interpreted. If I gave you the number "4172" and said "if this is four thousand one hundred seventy-two, what is the endianness?", you can't really give an answer, because the question doesn't make sense. (Some argue that the largest digit on the left means big-endian, but without memory addresses the question of endianness is not answerable or relevant.) This is just a number; there are no bytes to interpret and no memory addresses. Assuming a 4-byte integer representation, the bytes that correspond to it are:
low address ----> high address
Big endian: 00 00 10 4c
Little endian: 4c 10 00 00
so, given either of those and told "this is the computer's internal representation of 4172", you could determine whether it's little- or big-endian.
So now consider your binary literal 0b0111. These 4 bits represent one nybble and, stored as a 4-byte integer, can be laid out as either
low ---> high
Big endian: 00 00 00 07
Little endian: 07 00 00 00
But you don't have to care, because this is also handled by the hardware; the language dictates that the compiler reads from left to right, most significant bit to least significant bit.
Endianness is not about individual bits. Given that a byte is 8 bits, if I hand you 0b00000111 and ask "is this little- or big-endian?", again you can't say, because you only have one byte (and no addresses). Endianness doesn't pertain to the order of bits in a byte; it refers to the ordering of entire bytes with respect to addresses (unless of course you have one-bit bytes).
You don't have to care about what your computer uses internally. 0b0111 just saves you from having to write stuff like
unsigned int mask = 7; // only keep the lowest 3 bits
by writing
unsigned int mask = 0b0111;
Without needing to comment explaining the significance of the number.
* In C++20 you can check the endianness using std::endian.
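A minimal C++20 sketch of that check:

#include <bit>
#include <iostream>

int main()
{
    if constexpr (std::endian::native == std::endian::little)
        std::cout << "little-endian\n";
    else if constexpr (std::endian::native == std::endian::big)
        std::cout << "big-endian\n";
    else
        std::cout << "mixed-endian\n";
}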
All integer literals, including binary ones, are interpreted in the same way we normally read numbers (the leftmost digit being the most significant).
The C++ standard guarantees the same interpretation of literals without having to be concerned with the specific environment you're on. Thus, you don't have to concern yourself with endianness in this context.
Your example of 0b0111 is always equal to seven.
The C++ standard doesn't use terms of endianness in regards to number literals. Rather, it simply describes that literals have a consistent interpretation, and that the interpretation is the one you would expect.
C++ Standard - Integer Literals - 2.14.2 - paragraph 1
An integer literal is a sequence of digits that has no period or
exponent part, with optional separating single quotes that are ignored
when determining its value. An integer literal may have a prefix that
specifies its base and a suffix that specifies its type. The lexically
first digit of the sequence of digits is the most significant. A
binary integer literal (base two) begins with 0b or 0B and consists of
a sequence of binary digits. An octal integer literal (base eight)
begins with the digit 0 and consists of a sequence of octal digits.
A decimal integer literal (base ten) begins with a digit other than 0
and consists of a sequence of decimal digits. A hexadecimal integer
literal (base sixteen) begins with 0x or 0X and consists of a sequence
of hexadecimal digits, which include the decimal digits and the
letters a through f and A through F with decimal values ten through
fifteen. [Example: The number twelve can be written 12, 014, 0XC, or
0b1100. The literals 1048576, 1'048'576, 0X100000, 0x10'0000, and
0'004'000'000 all have the same value. — end example ]
Wikipedia describes what endianness is, and uses our number system as an example to understand big-endian.
The terms endian and endianness refer to the convention used to
interpret the bytes making up a data word when those bytes are stored
in computer memory.
Big-endian systems store the most significant byte of a word in the
smallest address and the least significant byte is stored in the
largest address (also see Most significant bit). Little-endian
systems, in contrast, store the least significant byte in the smallest
address.
An example on endianness is to think of how a decimal number is
written and read in place-value notation. Assuming a writing system
where numbers are written left to right, the leftmost position is
analogous to the smallest address of memory used, and rightmost
position the largest. For example, the number one hundred twenty three
is written 1 2 3, with the hundreds place left-most. Anyone who reads
this number also knows that the leftmost digit has the biggest place
value. This is an example of a big-endian convention followed in daily
life.
In this context, we are considering a digit of an integer literal to be a "byte of a word", and the word to be the literal itself. Also, the left-most character in a literal is considered to have the smallest address.
With the literal 1234, the digits one, two, three and four are the "bytes of a word", and 1234 is the "word". With the binary literal 0b0111, the digits zero, one, one and one are the "bytes of a word", and the word is 0111.
This consideration allows us to understand endianness in the context of the C++ language, and shows that integer literals are similar to "big-endian".
You're missing the distinction between endianness as written in the source code and endianness as represented in the object code. The answer for each is unsurprising: source-code literals are big-endian because that's how humans read them; in object code they're written however the target reads them.
Since a byte is by definition the smallest unit of memory access, I don't believe it is even possible to ascribe an endianness to the internal representation of bits in a byte; the only way to discover endianness for larger numbers (whether intentionally or by surprise) is by accessing them from storage piecewise, and the byte is by definition the smallest accessible storage unit.
The C/C++ languages don't care about endianness of multi-byte integers. C/C++ compilers do. Compilers parse your source code and generate machine code for the specific target platform. The compiler, in general, stores integer literals the same way it stores an integer; such that the target CPU's instructions will directly support reading and writing them in memory.
The compiler takes care of the differences between target platforms so you don't have to.
The only time you need to worry about endianness is when you are sharing binary values with other systems that have a different byte ordering. Then you would read the binary data in, byte by byte, and arrange the bytes in memory in the correct order for the system your code is running on.
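For example, a sketch of reading a 32-bit big-endian value portably, byte by byte (the helper name is mine, not from any particular library):

#include <cstdint>
#include <istream>

std::uint32_t read_u32_be(std::istream& in)
{
    unsigned char b[4];
    in.read(reinterpret_cast<char*>(b), 4);
    return (std::uint32_t(b[0]) << 24) |
           (std::uint32_t(b[1]) << 16) |
           (std::uint32_t(b[2]) <<  8) |
            std::uint32_t(b[3]);
}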
Endianness is implementation-defined. The standard guarantees that every object has an object representation as an array of char and unsigned char, which you can work with by calling memcpy() or memcmp(). In C++17, it is legal to reinterpret_cast a pointer or reference to any object type (not a pointer to void, pointer to a function, or nullptr) to a pointer to char, unsigned char, or std::byte, which are valid aliases for any object type.
What people mean when they talk about “endianness” is the order of bytes in that object representation. For example, if you declare unsigned char int_bytes[sizeof(int)] = {1}; and int i; then memcpy( &i, int_bytes, sizeof(i)); do you get 0x01, 0x01000000, 0x0100, 0x0100000000000000, or something else? The answer is: yes. There are real-world implementations that produce each of these results, and they all conform to the standard. The reason for this is so the compiler can use the native format of the CPU.
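A runnable sketch of that experiment, printing the result in hex so the byte order becomes visible:

#include <cstdio>
#include <cstring>

int main()
{
    unsigned char int_bytes[sizeof(int)] = {1};  // first byte 1, the rest 0
    int i;
    std::memcpy(&i, int_bytes, sizeof i);
    std::printf("0x%08x\n", i);  // 0x00000001 on a little-endian machine with 32-bit int,
                                 // 0x01000000 on a big-endian one
}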
This comes up most often when a program needs to send or receive data over the Internet, where all the standards define that data should be transmitted in big-endian order, on a little-endian CPU like the x86. Some network libraries therefore specify whether particular arguments and fields of structures should be stored in host or network byte order.
The language lets you shoot yourself in the foot by twiddling the bits of an object representation arbitrarily, but it might get you a trap representation, which could cause undefined behavior if you try to use it later. (This could mean, for example, rewriting a virtual function table to inject arbitrary code.) The <type_traits> header has several templates to test whether it is safe to do things with an object representation. You can copy one object over another of the same type with memcpy( &dest, &src, sizeof(dest) ) if that type is_trivially_copyable. You can make a copy to correctly-aligned uninitialized memory if it is_trivially_move_constructible. You can test whether two objects of the same type are identical with memcmp( &a, &b, sizeof(a) ) and correctly hash an object by applying a hash function to the bytes in its object representation if the type has_unique_object_representations. An integral type has no trap representations, and so on. For the most part, though, if you’re doing operations on object representations where endianness matters, you’re telling the compiler to assume you know what you’re doing and your code will not be portable.
As others have mentioned, binary literals are written with the most significant digit first, like decimal, octal or hexadecimal literals. This is different from endianness and will not affect whether you need to call ntohs() on the port number from a TCP header read in from the Internet.
You might want to think of C or C++ or any other language as being intrinsically little-endian (think about how the bitwise operators work). If the underlying hardware is big-endian, the compiler ensures that the data is stored in big-endian form (ditto for other endiannesses), but your bitwise operations work as if the data were little-endian. The thing to remember is that, as far as the language is concerned, the data is little-endian. Endianness-related problems arise when you cast the data from one type to another. As long as you don't do that, you are good.
I was questioned about the statement that the "C/C++ language is intrinsically little endian", so I am providing an example. Many know how it works, but here I go.
#include <stdio.h>

typedef union
{
    struct {
        int a:1;
        int reserved:31;
    } bits;
    unsigned int value;
} u;

int main(void)
{
    u test;
    test.bits.a = 1;
    test.bits.reserved = 0;
    printf("After bits assignment, test.value = 0x%08X\n", test.value);

    test.value = 0x00000001;
    printf("After value assignment, test.value = 0x%08X\n", test.value);
    return 0;
}
Output on a little endian system:
After bits assignment, test.value = 0x00000001
After value assignment, test.value = 0x00000001
Output on a big endian system:
After bits assignment, test.value = 0x80000000
After value assignment, test.value = 0x00000001
So, if you do not know the processor's endianness, where does everything come out right? In the little-endian system! Thus, I say that the C/C++ language is intrinsically little-endian.
As I mentioned above,
1-) Is the size of the smallest unit of data written to a file on a file stream in binary mode always 8 bits? If it writes to the file whatever character is passed to the function put(), can we say that it is always 8 bits?
2-) If we add an integer to a variable of char type, does the variable's position in the character set change by the amount of the integer added, regardless of how the bits of the char variable are represented in memory on whichever platform/machine it is tried on? And what if we exceed the limit of values the variable can take, on any system that has a signed or unsigned char representation of the char type? Does it always wrap around from the end to the beginning when adding, and do the reverse when subtracting?
3-) What I really want to know is whether there is a portable way to store data in a file in binary mode, and how common file formats are read and written without problems.
Thanks.
1) The C++ standard is pretty clear that a "byte" (or char) is not necessarily 8 bits, for one thing. Although machines with 9- or 12-bit char types are not very common, if you want extreme portability you need to take this into account in some way, e.g. specify that "our implementation requires a char to be 8 bits", which can of course be checked during compilation or at runtime, e.g.:
#include <climits>

#if (CHAR_BIT != 8)
#error This implementation requires CHAR_BIT == 8.
#endif
or
if (CHAR_BIT != 8)
{
    cerr << "Sorry, can't run on this platform, CHAR_BIT is not 8\n";
    exit(2);
}
2) Adding an int value to a char value will convert it to an int; if you then convert it back to a char, it should be consistent, yes. Behaviour is technically "undefined" for overflows between positive and negative values, though, which can cause strange things (e.g. traps on overflow) on some machines.
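A small sketch of the unsigned case, where the wrap-around is well defined (assuming an 8-bit char; signed overflow is where the trouble starts):

unsigned char c = 250;
int n = c + 10;   // c is promoted to int before the addition: n == 260
c = c + 10;       // converting back to unsigned char wraps modulo 256: c == 4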
3) As long as it's clearly defined and documented, a binary format can be made to work well in a portable scenario. See "JPG", "PNG" and, to some degree, "BMP" as examples where binary data is quite portable. I'm not sure how well it works to display a JPG on a DEC-10 system with a 36-bit machine word, though.
1) No, the smallest unit of allocation is a disk page, as defined by the file system parameters. With most modern file systems this is 4k, though on some next-gen file systems the content of exceptionally small files can be stored in the inode, so the content itself takes no extra space on the disk. FAT and NTFS allocation unit sizes range from 4k to 64k depending on how the disk was formatted.
1a) "smallest read/write" unit is usually an 8-bit byte, though on some oddball systems use different byte sizes (CDC cyber comes to mind with a 12-bit byte). I can't think of any modern systems that use anything other than an 8-bit byte.
2) Adding an integer to a char will produce an integer-sized result. The compiler implicitly promotes the char to an integer before the arithmetic. This can then be converted back down (by truncation, usually) to a char.
3) Yes and yes. You have to thoroughly document the file format, including the endianness of words if you plan to run on different CPU architectures (i.e. Intel is little-endian, Motorola is big-endian, and some supercomputers are weirdly endian). These different architectures will read and write words and dwords differently, and you may have to account for that in your reader code.
3a) This is fairly common (though perhaps less so now, with XML and other self-describing semi-structured formats), and as long as the documentation is complete, there are few issues in reading or writing these files.
I'm coding a game project as a hobby, and I'm currently at the part where I need to store some resource data (.BMPs, for example) in a file format of my own, so my game can parse all of it and load it onto the screen.
For reading BMPs, I read the header and then the RGB data for each pixel, and I have an array[width][height] that stores these values.
I was told I should save this type of data in binary, but not the reason why. I've read about binary and what it is (the 0-1 representation of data), but why should I use it to save .BMP data, for example?
If I'm going to read it later in the game, doesn't it just add more complexity and maybe even slow down the loading process?
And lastly, if it is better to save in binary (I'm guessing it is, seeing how everyone seems to do so in the other game resource files I researched), how do I read and write binary in C++?
I've seen lots of questions, but with many different approaches for many different types of variables, so I'm asking: which is the best / most C++-ish way of doing it?
You have it all backwards. A computer processor operates with data at the binary level. Everything in a computer is binary. To deal with data in human-readable form, we write functions that jump through hoops to make that binary data look like something that humans understand. So if you store your .BMP data in a file as text, you're actually making the computer do a whole lot more work to convert the .BMP data from its natural binary form into text, and then from its text form back into binary in order to display it.
The truth of the matter is that the more you can handle data in its raw binary form, the faster your code will be able to run. Less conversions means faster code. But there's obviously a tradeoff: If you need to be able to look at data and understand it without pulling out a magic decoder ring, then you might want to store it in a file as text. But in doing so, we have to understand that there is conversion processing that must be done to make that human-readable text meaningful to the processor, which as I said, operates on nothing but pure binary data.
And, just in case you already knew that or sort-of knew it, and your question was "why should I open my .bmp file in binary mode and not in text mode": the reason is that opening a file in text mode asks the platform to perform CRLF-to-LF conversions ("\r\n"-to-"\n" conversions), as appropriate for the platform, so that at the internal string-processing level all you're dealing with is '\n' characters. If your file consists of binary data, you don't want that conversion going on, or it will corrupt the data from the file as you read it. Most of the data will be fine, and things may work most of the time, but occasionally you'll run across a pair of bytes of the hexadecimal form 0x0d,0x0a (decimal 13,10) that will get converted to just 0x0a (10), and you'll be missing a byte in the data you read. Therefore, be sure to open binary files in binary mode!
OK, based on your most recent comment (below), here's this:
As you (now?) understand, data in a computer is stored in binary format. Yes, that means it's in 0's and 1's. However, when programming, you don't actually have to fiddle with the 0's and 1's yourself, unless you're doing bitwise logical operations for some reason. A variable of type, let's say int for example, is a collection of individual bits, each of which can be either 0 or 1. It's also a collection of bytes, and assuming that there are 8 bits in a byte, then there are generally 2, 4, or 8 bytes in an int, depending on your platform and compiler options. But you work with that int as an int, not as individual 0's and 1's. If you write that int out to a file in its purest form, the bytes (and thus the bits) get written out in an unconverted raw form. But you could also convert them to ASCII text and write them out that way. If you're displaying an int on the screen, you don't want to see the individual 0's and 1's of course, so you print it in its ASCII form, generally decoded as a decimal number. You could just as easily print that same int in its hexadecimal form, and the result would look different even though it's the same number. For example, in decimal, you might have the decimal value 65. That same value in hexadecimal is 0x41 (or, just 41 if we understand that it's in base 16). That same value is the letter 'A' if we display it in ASCII form (and consider only the low byte of the 2,- 4,- or 8-byte int, i.e. treat it as a char).
For the rest of this discussion, forget that we were talking about an int and now consider that we're discussing a char, or 1 byte (8 bits). Let's say we still have that same value, 65, or 0x41, or 'A', however you want to look at it. If you want to send that value to a file, you can send it in its raw form, or you can convert it to text form. If you send it in its raw form, it will occupy 8 bits (one byte) in the file. But if you want to write it to the file in text form, you'd convert it to ASCII, which, depending on the format you choose and the actual value (65 in this case), will occupy 1, 2, or 3 bytes. Say you want to write it in decimal ASCII with no padding characters. The value 65 will then take 2 bytes: one for the '6' and one for the '5'. If you want to print it in hexadecimal form, it will still take 2 bytes: one for the '4' and one for the '1', unless you prepend it with "0x", in which case it will take 4 bytes: one for '0', one for 'x', one for '4', and another for '1'. Or suppose your char is the value 255 (the maximum value of a char): if we write it to the file in decimal ASCII form, it will take 3 bytes. But if we write that same value in hexadecimal ASCII form, it will still take 2 bytes (or 4, if we're prepending "0x"), because the value 255 in hexadecimal is 0xFF. Compare this to writing that 8-bit byte (char) in its raw binary form: a char takes 1 byte (by definition), so it will consume only 1 byte of the file in binary form regardless of its value.
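A small sketch contrasting the two (the file names are purely illustrative):

#include <cstdint>
#include <fstream>

int main()
{
    std::uint32_t value = 65;

    std::ofstream bin("value.bin", std::ios::binary);
    bin.write(reinterpret_cast<const char*>(&value), sizeof value);  // raw form: always 4 bytes

    std::ofstream txt("value.txt");
    txt << value;                                                    // text form: 2 bytes, "65"
}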
I have a binary file of doubles that I need to load using C++. However, my problem is that it was written in big-endian format but the fstream >> operator will then read the number wrong because my machine is little-endian. It seems like a simple problem to resolve for integers, but for doubles and floats the solutions I have found won't work. How can I (or should I) fix this?
I read this as a reference for integer byte swapping:
How do I convert between big-endian and little-endian values in C++?
EDIT: Though these answers are enlightening, I have found that my problem is with the file itself and not the format of the binary data. I believe my byte swapping does work, I was just getting confusing results. Thanks for your help!
The most portable way is to serialize in a textual format so that you don't have byte order issues. This is how operator>> works, so you shouldn't be having any endianness issues with >>. The principal problem with binary formats (which would explain endianness problems) is that floating point numbers consist of a number of mantissa bits, a number of exponent bits and a sign bit. The exponent may use an offset. This means that a straight byte re-ordering may not be sufficient, depending on the source and target formats.
If you are using IEEE-754 on both machines then you may be OK with a straight byte reversal, as this standard specifies a bit-string interchange format that should be portable (byte order issues aside).
If you have to convert between two machine architectures and you have to use a raw byte memory dump, then so long as the basic number format is the same (i.e. they have the same bit counts in each part of the number), you can read the data into an array of unsigned char, use some basic byte and bit swapping routines to correct the storage format and then copy the raw bytes into a variable of the appropriate type.
The standard conversion operators do not work with binary data, so it's not clear how you got where you are.
However, since byte swapping operates on bytes, not numbers, you perform it on data destined to become floats just as you do on data that will become integers.
And since text is so inefficient and floating-point data sets tend to be so large, it's entirely reasonable to want this.
int32_t raw_bytes;
stream.read( reinterpret_cast<char*>( &raw_bytes ), sizeof raw_bytes ); // not an int, just 32 bits of bytes
my_byte_swap( raw_bytes ); // swap 'em
float f;
std::memcpy( &f, &raw_bytes, sizeof f ); // copy them into a float (avoids aliasing problems)
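For the doubles in the question, a sketch under the assumption that both machines use IEEE-754 and the host is little-endian (otherwise the reversal should be skipped):

#include <algorithm>
#include <cstring>
#include <istream>

double read_double_be(std::istream& in)
{
    unsigned char bytes[sizeof(double)];
    in.read(reinterpret_cast<char*>(bytes), sizeof bytes);
    std::reverse(bytes, bytes + sizeof bytes);  // big-endian file -> little-endian host
    double d;
    std::memcpy(&d, bytes, sizeof d);           // avoids the aliasing issues of reinterpret_cast
    return d;
}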