Why and how should I write and read from binary files? - c++

I'm coding a game project as a hobby, and I'm currently at the part where I need to store some resource data (.BMPs, for example) in a file format of my own, so my game can parse all of it and load it onto the screen.
For reading BMPs, I read the header and then the RGB data for each pixel, and I have an array[width][height] that stores these values.
I was told I should save this kind of data in binary, but not the reason why. I've read about binary and what it is (the 0-1 representation of data), but why should I use it to save .BMP data, for example?
If I'm going to read it later in the game, doesn't it just add more complexity and maybe even slow down the loading process?
And lastly, if it is better to save in binary (I'm guessing it is, seeing how everyone seems to do so in the game resource files I've looked at), how do I read and write binary in C++?
I've seen lots of questions about this, but with many different approaches for many different types of variables, so I'm asking: which is the best / most C++-ish way of doing it?

You have it all backwards. A computer processor operates with data at the binary level. Everything in a computer is binary. To deal with data in human-readable form, we write functions that jump through hoops to make that binary data look like something that humans understand. So if you store your .BMP data in a file as text, you're actually making the computer do a whole lot more work to convert the .BMP data from its natural binary form into text, and then from its text form back into binary in order to display it.
The truth of the matter is that the more you can handle data in its raw binary form, the faster your code will run. Fewer conversions mean faster code. But there's obviously a tradeoff: if you need to be able to look at data and understand it without pulling out a magic decoder ring, then you might want to store it in a file as text. In doing so, we have to accept that conversion work must be done to make that human-readable text meaningful to the processor, which, as I said, operates on nothing but pure binary data.
And, just in case you already knew that, or sort-of knew it, and your question was really "why should I open my .bmp file in binary mode and not in text mode": the reason is that opening a file in text mode asks the platform to perform CRLF-to-LF conversions ("\r\n"-to-"\n" conversions) as appropriate for the platform, so that at the internal string-processing level all you deal with is '\n' characters. If your file consists of binary data, you don't want that conversion going on, or it will corrupt the data as you read it. Most of the data would come through fine, and things might even appear to work most of the time, but occasionally you'll run across a pair of bytes 0x0D,0x0A (decimal 13,10) that gets converted to just 0x0A (10), and you'll be missing a byte in the data you read. Therefore, be sure to open binary files in binary mode!
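To make that concrete, here is a minimal C++ sketch (the file name sprite.bmp is just an illustration):

#include <fstream>
#include <iterator>
#include <vector>

int main()
{
    // std::ios::binary suppresses any CRLF-to-LF translation, so every
    // byte of the .BMP arrives exactly as it is stored on disk.
    std::ifstream in("sprite.bmp", std::ios::binary);

    // Slurp the whole file into a byte buffer.
    std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());

    // Without std::ios::binary, a 0x0D,0x0A pair inside the pixel data
    // could silently collapse to a single 0x0A on Windows.
}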
OK, based on your most recent comment (below), here's this:
As you (now?) understand, data in a computer is stored in binary format. Yes, that means it's 0's and 1's. However, when programming, you don't actually have to fiddle with the 0's and 1's yourself, unless you're doing bitwise logical operations for some reason. A variable of type int, let's say, is a collection of individual bits, each of which can be either 0 or 1. It's also a collection of bytes, and assuming 8 bits in a byte, an int generally has 2, 4, or 8 bytes, depending on your platform and compiler options. But you work with that int as an int, not as individual 0's and 1's. If you write that int out to a file in its purest form, the bytes (and thus the bits) get written out in unconverted, raw form. But you could also convert them to ASCII text and write them out that way. If you're displaying an int on the screen, you don't want to see the individual 0's and 1's, of course, so you print it in its ASCII form, generally decoded as a decimal number. You could just as easily print that same int in hexadecimal form, and the result would look different even though it's the same number. For example, you might have the decimal value 65. That same value in hexadecimal is 0x41 (or just 41, if we understand it's in base 16). That same value is the letter 'A' if we display it in ASCII form (and consider only the low byte of the 2-, 4-, or 8-byte int, i.e. treat it as a char).
For the rest of this discussion, forget that we were talking about an int and consider a char, i.e. 1 byte (8 bits). Say we still have that same value: 65, or 0x41, or 'A', however you want to look at it. If you want to send that value to a file, you can send it in raw form or convert it to text form. If you send it raw, it will occupy 8 bits (one byte) in the file. If you write it to the file in text form, you convert it to ASCII, and depending on the format you write it in and the actual value (65 in this case), it will occupy 1, 2, or 3 bytes. Say you want to write it in decimal ASCII with no padding characters: the value 65 will then take 2 bytes, one for the '6' and one for the '5'. If you want to print it in hexadecimal form, it will still take 2 bytes, one for the '4' and one for the '1', unless you prepend "0x", in which case it takes 4 bytes: '0', 'x', '4', '1'. Or suppose your char has the value 255 (the maximum value of an unsigned char): in decimal ASCII form it takes 3 bytes, but in hexadecimal ASCII form it still takes 2 bytes (or 4, with the "0x" prefix), because 255 in hexadecimal is 0xFF. Compare this to writing that 8-bit byte (char) in its raw binary form: a char takes 1 byte (by definition), so it consumes only 1 byte of the file regardless of its value.
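Here is a minimal sketch of both forms in C++, using the value 65 from the discussion above (file names are illustrative):

#include <cstdint>
#include <fstream>

int main()
{
    std::int32_t value = 65;

    // Raw form: always sizeof(value) bytes (4 here), whatever the value is.
    std::ofstream raw("value.bin", std::ios::binary);
    raw.write(reinterpret_cast<const char*>(&value), sizeof value);

    // Text form: one byte per ASCII digit, so "65" occupies 2 bytes
    // while 1000000 would occupy 7.
    std::ofstream text("value.txt");
    text << value;
}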

Related

How do I represent an LZW output in bytes?

I found an implementation of the LZW algorithm and I was wondering how I can represent its output, which is a list of ints, as a byte array.
I tried using one byte per value, but with long inputs the dictionary grows past 256 entries, so the codes no longer fit.
Then I tried adding an extra byte to indicate how many bytes are used to store the values, but then I have to use 2 bytes for each value, which doesn't compress well enough.
How can I optimize this?
As bits, not bytes. You just need a simple routine that writes an arbitrary number of bits to a stream of bytes. It simply keeps a one-byte buffer into which you put bits until you have eight of them. Then write that byte, clear the buffer, and start over. The process is reversed on the other side.
When you get to the end, just write out the last byte, if the buffer is not empty, with the remaining bits set to zero.
You only need to figure out how many bits are required for each symbol at the current state of the compression. That same determination can be made on the other side when pulling bits from the stream.
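A minimal sketch of such a routine in C++ (the BitWriter name and interface are mine, not from any particular library):

#include <cstdint>
#include <ostream>

// Accumulates bits into a one-byte buffer and flushes whole bytes.
class BitWriter {
public:
    explicit BitWriter(std::ostream& out) : out_(out) {}

    // Write the low `count` bits of `value`, most significant bit first.
    void put_bits(std::uint32_t value, int count) {
        for (int i = count - 1; i >= 0; --i) {
            buffer_ = (buffer_ << 1) | ((value >> i) & 1u);
            if (++filled_ == 8) flush_byte();
        }
    }

    // Pad the final partial byte with zero bits, as described above.
    void finish() {
        while (filled_ != 0) put_bits(0, 1);
    }

private:
    void flush_byte() {
        out_.put(static_cast<char>(buffer_));
        buffer_ = 0;
        filled_ = 0;
    }

    std::ostream& out_;
    std::uint8_t buffer_ = 0;
    int filled_ = 0;
};

For 12-bit LZW codes you would call put_bits(code, 12) for each output code and finish() once at the end.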
In his 1984 article on LZW, T. A. Welch did not actually state how to "encode codes", but described mapping "strings of input characters into fixed-length codes", continuing that "use of 12-bit codes is common". (That allows a bijective mapping between three octets and two codes.)
The BSD compress(1) command didn't follow the article literally, but introduced a header, the interesting part being a specification of the maximum number of bits used to encode an LZW output code, allowing decompressors to size their decompression tables appropriately, or to fail early and in a controlled way. Apart from the very first, codes were encoded with just the number of integral bits necessary, starting at 9.
An alternative would be to use arithmetic coding, especially with a model other than "every code is equally probable".

How to detect whether a file is formatted or unformatted?

The way I am using is the following: I try to open the file in the default formatted form and test-read it. If that fails (an error, or reaching the end of the file), I treat it as unformatted. But this does not give me much confidence in the file types. After all, why would an unformatted file necessarily fail a formatted read, and why would a formatted file fail an unformatted read? I would expect that an unformatted file read as formatted most likely returns an error, though that is not guaranteed, and that a formatted file read as unformatted gives weird values but not an error (a test code actually returns end of file). Are there better ways to check the file type?
Short answer
A formatted file contains mostly ASCII. Processors and implementations allow you to have non-ASCII characters; writing them to a file is OK, but reading them back can be a problem if they are read as formatted. Assuming that your formatted files contain only ASCII characters and that your unformatted files are not limited to text, the following subroutine will do the job.
subroutine detect_format(fName)
   character(*), intent(in) :: fName
   integer :: fId, stat, irec
   character :: c
   logical :: formatted

   formatted = .true.   ! assume formatted until a non-ASCII byte shows up
   ! fixed-size records of 1 byte each; recl is in bytes on most compilers
   open(newunit=fId, file=fName, status='old', access='direct', &
        form='unformatted', recl=1)
   irec = 0
   do
      irec = irec + 1
      read(fId, rec=irec, iostat=stat) c
      if (stat /= 0) exit    ! I assume this fails only past the end of file
      if (iachar(c) > 127) then
         formatted = .false.
         exit
      end if
   end do
   if (formatted) then
      print *, trim(fName), ' is a formatted file'
   else
      print *, trim(fName), ' is an unformatted file'
   end if
   close(fId)
end subroutine detect_format
If your unformatted file contains only characters, this procedure will not help. In any case, there is no difference between a formatted and an unformatted character file, unless it is an unformatted file with variable record sizes; in that special case, you can catch it via the record-length markers that are saved with the data.
You can use some heuristics to simplify the check. For example, you can decide to consider a file ASCII if its first 100 bytes are ASCII, or if more than 80% of its bytes are ASCII. The subroutine can also be made simpler by using stream-based IO, as in the sketch below.
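For comparison, the same first-100-bytes heuristic is easy to express in C++ with stream reads (a sketch under the same ASCII assumption; the 100-byte window is the arbitrary cutoff suggested above):

#include <fstream>

// Heuristic: treat the file as text ("formatted") if the first
// 100 bytes are all plain ASCII (values 0..127).
bool looks_formatted(const char* fname)
{
    std::ifstream in(fname, std::ios::binary); // raw bytes, no translation
    char c;
    for (int i = 0; i < 100 && in.get(c); ++i)
        if (static_cast<unsigned char>(c) > 127)
            return false;
    return true;
}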
Long answer
The first thing to understand is the internal representation of data in computer memory (RAM, disk, etc.), the external representation, and the difference between the two.
The second thing is to understand the Fortran distinction between formatted and unformatted files.
Internal and external representation of data in computer memory.
By internal representation, I mean the form in which the CPU processes the data: the binary representation. In the internal representation, you must know the type of the data to give it a meaning. By external representation, I mean the glyphs that get printed on your screen or on paper by your printer. For example, if we are processing only numbers, the glyphs are the symbols (0, 1, 2, ..., 9) for Latin-based languages, or (I, II, III, IV, X, ...) for Roman numerals. I am going a little beyond what the Fortran standard defines, but this is for the purpose of the explanation. The Fortran standard uses only the symbols (0, 1, 2, ..., 9), but some implementations also account for the decimal separator, which can be either a comma or a dot; the human brain figures out which it is by looking at the external representation.
Between the internal and the external representation, there is an intermediate representation that helps humans and computers understand each other, and that form is what makes the difference between formatted and unformatted files in Fortran. The intermediate form is the computer's internal representation of the external representation (a computer does not store glyphs; it only draws them on request when you want to see them). Being a computer representation, the intermediate form is binary, but it has a one-to-one correspondence with the external representation (the glyphs).
The storage unit in computing is the byte. Some people like to go down to the level of the bit, but that is not necessary here. Data stored in computer memory is just a string of bytes. A byte itself is a string of 8 bits, meaning a byte can store 256 possible values. Further, bytes are usually grouped by 4 or 8 (in the past such a group was called a word).
Now, any byte or group of bytes makes sense only if you know the type of data it contains. You can process the same string of 4 bytes as a 4-byte integer, a 4-byte IEEE floating-point number, a string of 4 characters, etc. If you are processing 4-byte numbers (integer or IEEE floating point), the internal representation allows each byte to take any of the 256 possible values (a very few combinations are reserved for markers such as NaN and Inf, but they are still values). If you are processing English text (ASCII), each byte takes only the first 128 values (0-127).
When it comes to the external representation, everything must be turned into glyphs: numbers and characters alike. The intermediate representation has to map numbers to glyphs, so each number must be turned into a string of digits. Because the digits are themselves ASCII characters, everything gets limited to those first 128 byte values. That is the key to determining the content of your file.
Fortran formatted and unformatted files
When it comes to Fortran, formatted files are mostly used for human-readable content. The content of the file will be the intermediate representation, limited to ASCII for the English language.
Unformatted files contain the binary, internal representation of the data, just as it is processed by the CPU. It is like a dump of the RAM.
Now, to detect the content with a modern Fortran compiler, you just have to open the file, read it byte by byte, and check whether it contains only ASCII. If you hit a non-ASCII byte you have an unformatted file; otherwise you have a formatted file. Reading byte by byte can be done with stream-based IO in modern compilers, or with fixed-size records of 1 byte each. The latter is what the example above uses.
I have to add that life is not that simple: this procedure gives only a high probability, not the exact truth. Bytes being all in the ASCII range does not guarantee that they are actually characters.
If you have a character file, it does not matter whether it is formatted or fixed-size-record unformatted; it will contain only ASCII.
One approach is to name the files in a logical way.
Personally I use .dat, .txt or .csv for formatted data, and I use .bin for binary data.
Unless you have hundreds of files or more, perhaps you can just open them with an editor and see what they look like?

Writing numerical data to file as binary vs. written out?

I'm writing floating-point numbers to a file, but there are two different ways of writing these numbers, and I'm wondering which to use.
The two choices are:
write the raw representative bits to file
write the ascii representation of the number to file
Option 1 seems like it would be more practical to me, since I'm truncating each float to 4 bytes. And parsing each number can be skipped entirely when reading. But in practice, I've only ever seen option 2 used.
The data in question is 3D model information, where small file sizes and quick reading can be very advantageous, but again, no existing 3D model format does this that I know of, and I imagine there must be a good reason behind it.
My question is, what reasons are there for choosing to write the written out form of numbers, instead of the bit representation? And are there situations where using the binary form would be preferred?
First of all, floats are 4 bytes on any architecture you are likely to encounter, so nothing is "truncated" when you write the 4 bytes from memory to a file.
As for your main question, many common file formats are designed for interoperability and ease of reading/writing. That's why text, which is an almost universally portable representation (character-encoding issues notwithstanding), is used most often.
For example, it is very easy for a program to read the string "123" from a text file and know that it represents the number 123.
(But note that text itself is not a format. You might choose to represent all your data elements as ASCII/Unicode/whatever strings of characters, and put all these strings together to form a text file, but you still need to specify exactly what each element means and what data can be found where. For example, a very simplistic text-based 3D triangle-mesh file format might have the number of triangles in the mesh on the first line of the file, followed by three triplets of real numbers on each of the next N lines, specifying the 9 numbers required for the X, Y, Z coordinates of the three vertices of each triangle.)
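A hypothetical file in that simplistic format, containing two triangles, might look like this:

2
0.0 0.0 0.0   1.0 0.0 0.0   0.0 1.0 0.0
0.0 0.0 0.0   0.0 1.0 0.0   0.0 0.0 1.0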
On the other hand are the binary formats. These usually have in them the data elements in the same format as they are found in computer memory. This means an integer is represented with a fixed number of bytes (1, 2, 4 or 8, usually in "two's complement" format) or a real number is represented by 4 or 8 bytes in IEEE 754 format. (Note that I'm omitting a lot of details for the sake of staying on point.)
Main advantages of a binary format are:
They are usually smaller. A 32-bit integer written as an ASCII string can take up to 10 or 11 bytes (e.g. -1000000000), but in binary it always takes up 4 bytes. Smaller means faster to transfer (over a network, from disk to memory, etc.) and easier to store.
Each data element is faster to read. No complicated parsing is required. If the data element happens to be in the exact format/layout that your platform/language can work with, then you just need to transfer the few bytes from disk to memory and you are done.
Even large and complex data structures can be laid out on disk in exactly the same way as they would have been in memory, and then all you need to do to "read" that format would be to get that large blob of bytes (which probably contains many many data elements) from disk into memory, in one easy and fast operation, and you are done.
But that 3rd advantage requires that you match the layout of data on disk exactly (bit for bit) with the layout of your data structures in memory. This means that, almost always, that file format will only work with your code and your code only, and not even if you change some stuff around in your own code. This means that it is not at all portable or interoperable. But it is damned fast to work with!
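A minimal sketch of that "one blob" style in C++, assuming the writer and reader share the same architecture and float layout (the file name is illustrative):

#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<float> verts = {0.f, 1.f, 2.f}; // e.g. vertex coordinates

    // Write: element count, then the raw bytes of the whole array.
    std::FILE* out = std::fopen("mesh.blob", "wb");
    std::uint32_t n = static_cast<std::uint32_t>(verts.size());
    std::fwrite(&n, sizeof n, 1, out);
    std::fwrite(verts.data(), sizeof(float), n, out);
    std::fclose(out);

    // Read: one fread recovers the entire array, no parsing needed.
    std::FILE* in = std::fopen("mesh.blob", "rb");
    std::fread(&n, sizeof n, 1, in);
    std::vector<float> loaded(n);
    std::fread(loaded.data(), sizeof(float), n, in);
    std::fclose(in);
}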
There are disadvantages to binary formats too:
You cannot view or edit or make sense of them in a simple, generic software like a text editor anymore. You can open any XML, JSON or config file in any text editor and make some sense of it quite easily, but not a JPEG file.
You will usually need more specific code to read in or write out a binary format than a text format, not to mention a specification documenting what every bit of the file means. Text files are generally more self-explanatory and obvious.
In some (many) languages, scripting and "higher-level" languages in particular, you usually don't have access to the bytes that make up an integer or a float, either to read or to write them. This means that you'll lose most of the speed advantages that binary files give you in a lower-level language like C or C++.
Binary in-memory formats of primitive data types are almost always tied to the hardware (or more generally, the whole platform) that the memory is attached to. When you choose to write the same bits from memory to a file, the file format becomes hardware-dependent as well. One piece of hardware might not store floating-point numbers exactly the same way as another, which means binary files written on one cannot naively be read on the other (care must be taken to convert the data into the target format). One major difference between hardware architectures is known as "endianness", which affects how multibyte primitives (e.g. a 4-byte integer, or an 8-byte float) are stored in memory: from highest-order byte to lowest-order, or vice versa, called "big endian" and "little endian" respectively. Data written to a binary file on a big-endian architecture (e.g. PowerPC) and read verbatim on a little-endian architecture (e.g. x86) will have the bytes of each primitive swapped from high to low, which means all (well, almost all) the values will be wrong.
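One common defensive pattern is to fix the file format's byte order and reassemble values arithmetically, which works on any host; a sketch:

#include <cstdint>

// Reassemble a 4-byte little-endian integer from raw file bytes.
// Building the value arithmetically is correct regardless of host endianness.
std::uint32_t read_le32(const unsigned char* p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}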
Since you mention 3D model data, let me give you an example of what formats are used in a typical game engine. The game engine runtime will most likely need the most speed it can have in reading the models, and 3D models are large, so usually it has a very specific, and not-at-all-portable format for its model files. But that format would most likely not be supported by any modeling software. So you need to write a converter (also called an exporter or importer) that would take a common, generally-used format (e.g. OBJ, DAE, etc.) and convert that into the engine-specific, proprietary format. But as I mentioned, reading/transferring/working-with a text-based format is easier than a binary format, so you usually would choose a text-based common format to export your models into, then run the converter on them to the optimized, binary, engine-specific runtime format.
You might prefer a binary format if:
You want a more compact encoding (fewer bytes, since a text encoding will usually take more space).
You care about precision, because encoding as text can lose precision (though there may be ways to encode as text without losing precision*).
Performance, which is usually another advantage of binary encoding.
Since you mention the data in question is 3D model data, compactness of encoding (and maybe performance) and precision may well be relevant for you. On the other hand, a text encoding is human-readable.
That said, with binary encoding you typically have issues like endianness, and the float representation may differ between machines; here is a way to encode floats (or doubles) in binary form portably:
#include <stdint.h>

uint64_t pack754(long double f, unsigned bits, unsigned expbits)
{
    long double fnorm;
    int shift;
    long long sign, exp, significand;
    unsigned significandbits = bits - expbits - 1; // -1 for sign bit

    if (f == 0.0) return 0; // get this special case out of the way

    // check sign and begin normalization
    if (f < 0) { sign = 1; fnorm = -f; }
    else       { sign = 0; fnorm = f; }

    // get the normalized form of f and track the exponent
    shift = 0;
    while (fnorm >= 2.0) { fnorm /= 2.0; shift++; }
    while (fnorm < 1.0)  { fnorm *= 2.0; shift--; }
    fnorm = fnorm - 1.0;

    // calculate the binary form (non-float) of the significand data
    significand = fnorm * ((1LL << significandbits) + 0.5f);

    // get the biased exponent
    exp = shift + ((1 << (expbits - 1)) - 1); // shift + bias

    // return the final answer
    return (sign << (bits - 1)) | (exp << (bits - expbits - 1)) | significand;
}
*: In C, since C99 there seems to be a way to do this (the %a hexadecimal floating-point conversion), but I still think it will take more space than binary.
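For completeness, a hedged usage sketch: packing a double into the common 64-bit layout (11 exponent bits) and writing the result (the file name is illustrative; note the caveat in the comment):

#include <stdint.h>
#include <stdio.h>

/* uint64_t pack754(long double f, unsigned bits, unsigned expbits); -- above */

int main(void)
{
    uint64_t wire = pack754(3.14159, 64, 11); /* 64-bit layout, 11 exponent bits */
    FILE* out = fopen("value.bin", "wb");
    fwrite(&wire, sizeof wire, 1, out); /* byte order is still host-dependent;
                                           write the 8 bytes explicitly for
                                           full portability */
    fclose(out);
    return 0;
}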

How are doubles represented when written to text files? [closed]

When you write a number of doubles to a file, in which format are they stored? Is it in byte format or string format?
E.g. given 0.00083231. Is it stored with 10 bytes, where each byte represents one digit? Or is it stored as only 8 bytes, since the size of a double is 8 bytes?
Assume that the language used is C++.
If you choose to write text, e.g. with formatted output like file << x, you get text.
If you choose to write bytes, e.g. with unformatted output like file.write(reinterpret_cast<const char*>(&x), sizeof x), you get bytes.
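Both spelled out as a minimal sketch (file names are illustrative):

#include <fstream>

int main()
{
    double x = 0.00083231;

    std::ofstream text("x.txt");
    text << x;   // formatted: the ASCII characters "0.00083231"

    std::ofstream bin("x.bin", std::ios::binary);
    bin.write(reinterpret_cast<const char*>(&x), sizeof x); // 8 raw bytes
}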
It depends on how you print the value.
If you write the number as a binary value, it'll take sizeof(double) bytes (which is not always 8) in the file, and you can't read the value with a normal text viewer/editor; you must use a binary/hex editor to see it.
If you print the number using a text output function, the result depends on how you format it. With the std::printf family and the %f format, the value is printed with 6 digits after the decimal point by default, so your example takes only 8 bytes in textual form ("0.000832"); cout similarly defaults to 6 significant digits. If you use a different width/precision specifier (for example printf("%9.10f\n", 0.00083231)), the number of bytes printed will of course differ. Using another conversion also changes the output: %e prints in scientific format, which is 8.323100e-04 in your case and takes at least 12 bytes, and %a prints the value in hexadecimal form, which is usually even longer, except for values that are exactly representable in binary.
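The different conversions are easy to compare side by side (a sketch; the exact %a spelling varies by implementation):

#include <cstdio>

int main()
{
    double v = 0.00083231;
    std::printf("%f\n", v); // 0.000832     : 6 digits after the point (default)
    std::printf("%e\n", v); // 8.323100e-04 : scientific notation
    std::printf("%a\n", v); // hexadecimal significand and binary exponent
}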
Question:
When you write a number of doubles to a file, in which format are they stored? Is it in byte format or string format?
It depends on which functions you use to write the numbers.
E.g.:
If you use fprintf or printf, the number will be written out in textual form; in your example it will be written as 0.000832 with the format "%lf" and will take 8 bytes. You can change the format to change the number of bytes used to write out the number. The resulting output is human-readable. The same goes for cout << number; (though the default formatting differs slightly).
If you use fwrite, the number will be written in binary form. The number of bytes necessary to store the number will always be sizeof(double) regardless of the value of the number. The resulting output will not be human readable. Same thing if you use ostream::write.
It depends how you write them. You could use std::ostream and its (overloaded) operator <<; then they are stored in textual form. Or you could use binary IO, e.g. std::ostream::write or fwrite; then they are stored in native machine binary form.
You should probably read more about serialization, and consider using textual formats like JSON (e.g. with jsoncpp). You might also be interested in binary serialization, e.g. libs11n or XDR.
Notice that data is often more important than code, and that disk IO or network IO is a lot slower than the CPU (many thousands of times slower, at least), so spending CPU time to make the data easier to store is often worthwhile. Also, the same data could be written on one machine and read on a very different one.
Read also about persistence, databases, application checkpointing, and endianness.

Saving binary data into a file in C++

My algorithm produces a stream of 9-bit and 17-bit values, and I need a way to store this data in a file. I can't just store the 9-bit values as int and the 17-bit values as int32_t.
For example, if my algorithm produces 10 9-bit values and 5 17-bit values, the output file needs to be 22 bytes.
Another big problem to solve is that the output file can be very big, and its size is not known in advance.
The only idea I have right now is to use a bool *vector;
If you have to save a dynamic number of bits, then you should probably save two values: the first being either the number of bits (if the bits are consecutive from 0 to x) or a bitmask saying which bits are valid; the second being the 32-bit integer holding your bits.
Taking your example literally: if you want to store 175 bits consisting of an unknown mix of entities of two different lengths, then the file absolutely cannot be only 22 bytes. You need to know what lies ahead of you in the file; you need the lengths. Since there are only two possible sizes, a single bit is enough as a tag: 0 means 9-bit, 1 means 17-bit.
|0|9bit|0|9bit|1|17bit|0|9bit|1|17bit|1|17bit|...
So for your example, you would need 10*(1+9) + 5*(1+17) = 190 bits, i.e. 24 bytes. The outstanding 2 bits need to be padded with 0's so that you align at a byte boundary. The fact that you will go on reading the file as if there were another entity (since you said you don't know how long the file is) shouldn't be a problem, because such padding is always shorter than a 9-bit entity; upon reaching the end of the file, you simply throw away the last, incomplete reading.
This approach does require implementing bit-level manipulation on top of the byte-level stream, which means careful masking and logical operations. Base64 is exactly that, only simpler than your case, consisting of fixed 6-bit entities stored in a text file.
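The reading side of that scheme might look like this in C++ (an illustrative sketch; BitReader is my name, not a standard facility):

#include <cstdint>
#include <istream>

// Pulls single bits from a byte stream, most significant bit first,
// mirroring the writer described above.
class BitReader {
public:
    explicit BitReader(std::istream& in) : in_(in) {}

    // Read one tagged entity: tag 0 -> 9-bit value, tag 1 -> 17-bit value.
    // Returns false at end of stream; an incomplete trailing entity
    // (i.e. the zero padding) is simply discarded.
    bool get_entity(std::uint32_t& value) {
        int tag;
        if (!get_bit(tag)) return false;
        int width = tag ? 17 : 9;
        value = 0;
        for (int i = 0; i < width; ++i) {
            int bit;
            if (!get_bit(bit)) return false;
            value = (value << 1) | static_cast<std::uint32_t>(bit);
        }
        return true;
    }

private:
    bool get_bit(int& bit) {
        if (left_ == 0) {
            int byte = in_.get();
            if (byte == std::istream::traits_type::eof()) return false;
            buffer_ = static_cast<std::uint8_t>(byte);
            left_ = 8;
        }
        bit = (buffer_ >> --left_) & 1;
        return true;
    }

    std::istream& in_;
    std::uint8_t buffer_ = 0;
    int left_ = 0;
};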