How are doubles represented when written to text files? [closed] - c++

When you write a number of doubles to a file, in which format are they stored? Is it in byte format or string format?
E.g. given 0.00083231. Is it stored as 10 bytes, one for each character of the text? Or is it stored as only 8 bytes, since the size of a double is 8 bytes?
Assume that the language used is C++.

If you choose to write text, e.g. with formatted output like file << x, you get text.
If you choose to write bytes, e.g. with unformatted output like file.write(reinterpret_cast<const char*>(&x), sizeof x), you get bytes.
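As a minimal sketch of both choices (the file names are just placeholders):

#include <fstream>

int main() {
    double x = 0.00083231;

    // Formatted (text) output: writes a character representation,
    // e.g. "0.00083231" with the default settings.
    std::ofstream text_file("value.txt");
    text_file << x;

    // Unformatted (binary) output: writes the sizeof(double) raw bytes of x.
    std::ofstream bin_file("value.bin", std::ios::binary);
    bin_file.write(reinterpret_cast<const char*>(&x), sizeof x);
}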

It depends on how you print the value
If you print the number as a binary value, it'll take sizeof(double) bytes (which is not always 8) in the file and you can't read the value with a normal text viewer/editor. You must use a binary/hex editor to see it in binary format.
If you print the number using a text output function, the result depends on how you format it. If you use cout, or a function in the std::printf family with the %f format, the value is printed with 6 digits after the decimal point, so your example takes only 8 bytes in textual form (0.000832). If you use a different width/precision specifier (for example printf("%9.10f\n", 0.00083231)), the number of bytes printed will of course be different, and using another conversion also changes the printed form. For example, %e prints the string in scientific format, which is 8.323100e-04 in your case and takes at least 12 bytes in the output, and %a prints the value in hexadecimal floating-point form, which is usually even longer, except for values exactly representable in binary.
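A small sketch of the effect of the conversion specifier (outputs shown in the comments; the %a digits depend on the exact binary value):

#include <cstdio>

int main() {
    double x = 0.00083231;
    std::printf("%f\n", x);  // 0.000832       (6 digits after the point)
    std::printf("%e\n", x);  // 8.323100e-04   (scientific form)
    std::printf("%a\n", x);  // hexadecimal floating-point form, longer for most values
}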

Question:
When you write a number of doubles to a file, in which format are they stored? Is it in byte format or string format?
It depends on which functions you use to write the numbers.
E.g.:
If you use fprintf or printf, the number will be written out in textual form; in your example it will be written as 0.000832 with the format "%lf" and will take 8 bytes. You can change the format to change the number of bytes used to write out the number. The resulting output will be in human-readable form. The same applies if you use cout << number;.
If you use fwrite, the number will be written in binary form. The number of bytes needed to store it will always be sizeof(double), regardless of the value of the number. The resulting output will not be human-readable. The same applies if you use ostream::write.
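A minimal sketch of the binary route (error handling omitted; the file name is a placeholder):

#include <cstdio>

int main() {
    double out_val = 0.00083231, in_val = 0.0;

    // Write: always sizeof(double) bytes, whatever the value is.
    FILE* fp = std::fopen("value.bin", "wb");
    std::fwrite(&out_val, sizeof out_val, 1, fp);
    std::fclose(fp);

    // Read it back: the bytes are reinterpreted as a double directly,
    // no text parsing involved.
    fp = std::fopen("value.bin", "rb");
    std::fread(&in_val, sizeof in_val, 1, fp);
    std::fclose(fp);
}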

It depends on how you write them. You could use std::ostream and its (overloaded) operator <<; then they are stored in textual form. You could use binary IO, e.g. std::ostream::write or fwrite; then they are stored in the native machine binary form.
You probably should read more about serialization, and consider using textual formats like JSON (e.g. with jsoncpp). You might also be interested in binary serialization, e.g. libs11n or XDR.
Notice that data is often more important than code, and that disk IO or network IO is a lot slower than the CPU (many thousands of times slower, at least). So spending CPU time to make the data easier to store is often worthwhile. Also, the same data could be written on one machine and read on a very different one.
Read also about persistence, databases, application checkpointing, and endianness.

Related

How to detect whether a file is formatted or unformatted?

The way I am using is the following: I try to open a file in the default formatted form and test-read it. If that fails (an error or reaching the end of the file), I treat it as unformatted. But this does not give me confidence in the file type; after all, why would an unformatted file fail to be read as formatted, and why would a formatted file fail to be read as unformatted? I would expect that an unformatted file read as formatted most likely returns an error, but that is not guaranteed, while a formatted file read as unformatted gives garbage but not an error (a test code actually returns end of file). Are there better ways to check the file type?
Short answer
Formatted files contain mostly ASCII. Processors and implementations allow you to have non-ASCII characters; writing them to a file is OK, but reading them back as formatted can be a problem. Assuming that your formatted files contain only ASCII characters and that your unformatted files are not limited to text, the following subroutine will do the job.
!
subroutine detect_format(fName)
  character(*), intent(in) :: fName
  integer :: fId, stat
  character :: c
  logical :: formatted
  !
  stat = 0
  formatted = .true.   ! assume formatted until a non-ASCII byte is found
  open(newunit=fId, file=fName, status='old', form='unformatted', recl=1)
  ! I assume that the read fails only at the end of the file
  do while((stat == 0) .and. formatted)
    read(fId, iostat=stat) c
    ! c is undefined after a failed read, so only test it when stat == 0
    if (stat == 0) formatted = formatted .and. (iachar(c) <= 127)
  end do
  if (formatted) then
    print *, trim(fName), ' is a formatted file'
  else
    print *, trim(fName), ' is an unformatted file'
  end if
  close(fId)
  !
end subroutine detect_format
If your unformatted file contains only characters, this procedure will not help. Anyway, there is no difference between a formatted character file and an unformatted one, unless the latter is unformatted with a variable record size. In that special case, you can catch it through the record-size markers that are saved with each record.
You can use some heuristics to simplify it. For example, you can consider the file ASCII if its first 100 bytes are ASCII, or if more than 80% of its bytes are ASCII. The subroutine can also be made simpler by using stream-based IO.
Long answer
The first thing is to understand:
- the internal representation of data in computer memory (RAM, disk, etc.);
- the external representation;
- as well as the difference between them.
The second thing is to understand the Fortran distinction between formatted and unformatted files.
Internal and external representation of data in computer memory.
By internal representation, I mean the form in which the CPU processes the data, that is, the binary representation. In the internal representation, you must know the type of the data to give it a meaning. By external representation I mean the glyphs that get printed on your screen or on the paper from your printer. For example, if we are processing only numbers, the glyphs are the symbols (0, 1, 2, ..., 9) for Latin-based languages, or (I, II, III, IV, X, ...) in Roman numerals; other writing systems have their own glyphs. I am going a little beyond what the Fortran standard defines, but this is for the purpose of the transition. The Fortran standard uses only the symbols (0, 1, 2, ..., 9), but some implementations also account for the decimal separator, which can be either a comma or a dot. The human brain is able to figure out what it is by looking at the external representation.
Between the internal representation and the external representation, there is an intermediate representation that helps humans and computers understand each other. That form is what makes the difference between formatted and unformatted files in Fortran. The intermediate form is the computer's internal representation of the external representation (the computer does not store glyphs; it only draws them on request when you want to see them). Being a computer representation, the intermediate form is binary, but it has a one-to-one correspondence with the external representation (the glyphs).
The storage unit in computer science is the byte. Some people like to go down to the level of the bit, but it is not necessary. Data stored in computer memory is just strings of bytes. A byte itself is a string of 8 bits, meaning that a byte can store 256 possible values. Further, bytes are usually grouped by 4 or 8 (in the past such a group used to be called a word).
Now, any byte or group of bytes makes sense only if you know the type of data it contains. You can process the same string of 4 bytes as a 4-byte integer, a 4-byte IEEE floating-point number, a string of 4 characters, etc. If you are processing 4-byte numbers (integer or IEEE floating point), the internal representation allows each byte to take any of the 256 possible values (a very few bit patterns are reserved for markers such as NaN and Inf, but they are still values). If you are processing English text (ASCII), the internal representation allows each byte to take only the first 128 values (0-127).
When it comes to the external representation, everything must be turned into glyphs: numbers and characters alike. The intermediate representation has to map numbers to glyphs, so each number must be turned into a string of digits. Because the digits are also ASCII characters, everything gets limited to byte values 0-127. That is the key to determining the content of your file.
Fortran formatted and unformatted files
When it comes to Fortran, formatted files are mostly used for human-readable content. The content of the file will be the intermediate representation, limited to ASCII for the English language.
Unformatted files contain the binary or internal representation of the data as it is processed by the CPU. It is like a dump of the RAM.
Now, to detect the content with a modern Fortran compiler, you just have to open the file, read it byte by byte, and check whether it contains only ASCII. If you find non-ASCII bytes you have an unformatted file; otherwise you have a formatted file. Reading byte by byte can be done using stream-based IO in modern compilers, or fixed-size records of 1 byte each. The latter is what the example above uses.
I have to add that life is not that simple: this procedure gives only a high probability, not the exact truth. Everything being in the ASCII range does not guarantee that it is actually text.
If you have a character file, it does not matter if it is formatted or fixed size record unformatted, it will contain ASCII.
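For comparison, a minimal C++ sketch of the same heuristic using stream I/O (the function name is mine; the ASCII cutoff mirrors the Fortran example above):

#include <fstream>

// Heuristic: treat the file as "formatted" (text) if every byte is ASCII.
bool looks_formatted(const char* fname) {
    std::ifstream file(fname, std::ios::binary);
    char c;
    while (file.get(c)) {
        if (static_cast<unsigned char>(c) > 127) return false;  // non-ASCII byte
    }
    return true;  // every byte was ASCII (or the file was empty)
}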
One approach is to name the files in a logical way.
Personally I use .dat, .txt or .csv for formatted data, and I use .bin for binary data.
Unless you have hundreds of files or more, perhaps you can just open them with an editor and see what they look like?

Writing numerical data to file as binary vs. written out?

I'm writing floating point numbers to file, but there's two different ways of writing these numbers and I'm wondering which to use.
The two choices are:
write the raw representative bits to file
write the ascii representation of the number to file
Option 1 seems like it would be more practical to me, since I'm truncating each float to 4 bytes. And parsing each number can be skipped entirely when reading. But in practice, I've only ever seen option 2 used.
The data in question is 3D model information, where small file sizes and quick reading can be very advantageous, but again, no existing 3D model format does this that I know of, and I imagine there must be a good reason behind it.
My question is, what reasons are there for choosing to write the written out form of numbers, instead of the bit representation? And are there situations where using the binary form would be preferred?
First of all, floats are 4 bytes on any architecture you might encounter normally, so nothing is "truncated" when you write the 4 bytes from memory to a file.
As for your main question, many regular file formats are designed for "interoperability" and ease of reading/writing. That's why text, which is an almost universally portable representation (character encoding issues notwithstanding,) is used most often.
For example, it is very easy for a program to read the string "123" from a text file and know that it represents the number 123.
(But note that text itself is not a format. You might choose to represent all your data elements as ASCII/Unicode/whatever strings of characters, and put all these strings together to form a text file, but you still need to specify exactly what each element means and where each piece of data can be found. For example, a very simplistic text-based 3D triangle mesh file format might have the number of triangles in the mesh on the first line of the file, followed by three triplets of real numbers on each of the next N lines, specifying the 9 numbers required for the X,Y,Z coordinates of the three vertices of each triangle.)
On the other hand there are the binary formats. These usually store the data elements in the same format as they are found in computer memory. This means an integer is represented with a fixed number of bytes (1, 2, 4 or 8, usually in "two's complement" format) and a real number is represented by 4 or 8 bytes in IEEE 754 format. (Note that I'm omitting a lot of details for the sake of staying on point.)
Main advantages of a binary format are:
They are usually smaller in size. A 32-bit integer written as an ASCII string can take up to 11 bytes (e.g. -1000000000), but in binary it always takes up 4 bytes. And smaller means faster to transfer (over the network, from disk to memory, etc.) and easier to store.
Each data element is faster to read. No complicated parsing is required. If the data element happens to be in the exact format/layout that your platform/language can work with, then you just need to transfer the few bytes from disk to memory and you are done.
Even large and complex data structures can be laid out on disk in exactly the same way as they would have been in memory, and then all you need to do to "read" that format would be to get that large blob of bytes (which probably contains many many data elements) from disk into memory, in one easy and fast operation, and you are done.
But that 3rd advantage requires that you match the layout of the data on disk exactly (bit for bit) with the layout of your data structures in memory. This means that, almost always, the file format will work only with your code, and it can break even if you just rearrange some things in your own code. In other words, it is not at all portable or interoperable. But it is damned fast to work with!
There are disadvantages to binary formats too:
You cannot view or edit or make sense of them with a simple, generic tool like a text editor anymore. You can open any XML, JSON or config file in any text editor and make some sense of it quite easily, but not a JPEG file.
You will usually need more specific code to read in/write out a binary format than a text format, not to mention a specification that documents what every bit of the file means. Text files are generally more self-explanatory and obvious.
In some (many) languages (scripting and "higher-level" languages) you usually don't have access to the bytes that make up an integer or a float, not to read them nor to write them. This means that you'll lose most of the speed advantages that binary files give you when you are working in a lower-level language like C or C++.
Binary in-memory formats of primitive data types are almost always tied to the hardware (or more generally, the whole platform) that the memory is attached to. When you choose to write the same bits from memory to a file, the file format becomes hardware-dependent as well. One hardware might not store floating-point real numbers exactly the same way as another, which means binary files written on one cannot be read on the other naively (care must be taken to convert the data carefully into the target format.) One major difference between hardware architectures is known as "endianness", which affects how multi-byte primitives (e.g. a 4-byte integer, or an 8-byte float) are stored in memory (from highest-order byte to lowest-order, or vice versa, called "big endian" and "little endian" respectively.) Data written to a binary file on a big-endian architecture (e.g. PowerPC) and read verbatim on a little-endian architecture (e.g. x86) will have all the bytes in each primitive swapped from high value to low value, which means all (well, almost all) the values will be wrong.
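A minimal sketch of detecting the host byte order at run time (just an illustration of the point above):

#include <cstdint>
#include <cstdio>

int main() {
    // Store a known 32-bit pattern and inspect which byte sits first in memory.
    std::uint32_t value = 0x01020304;
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&value);

    if (bytes[0] == 0x04)
        std::puts("little endian: lowest-order byte stored first");
    else if (bytes[0] == 0x01)
        std::puts("big endian: highest-order byte stored first");
}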
Since you mention 3D model data, let me give you an example of what formats are used in a typical game engine. The game engine runtime will most likely need the most speed it can have in reading the models, and 3D models are large, so usually it has a very specific, and not-at-all-portable format for its model files. But that format would most likely not be supported by any modeling software. So you need to write a converter (also called an exporter or importer) that would take a common, generally-used format (e.g. OBJ, DAE, etc.) and convert that into the engine-specific, proprietary format. But as I mentioned, reading/transferring/working-with a text-based format is easier than a binary format, so you usually would choose a text-based common format to export your models into, then run the converter on them to the optimized, binary, engine-specific runtime format.
You might prefer binary format if:
You want more compact encoding (fewer bytes - because text encoding will probably take more space).
Precision - because if you encode as text you might lose precision - but maybe there are ways to encode as text without losing precision*.
Performance is probably also another advantage of binary encoding.
Since the data in question is 3D model simulation data, compactness of encoding (and maybe performance) and precision may be relevant for you. On the other hand, text encoding is human-readable.
That said, with binary encoding you typically have issues like endianness, and the float representation may differ between machines, but here is a way to encode floats (or doubles) in binary format in a portable way:
#include <stdint.h>

uint64_t pack754(long double f, unsigned bits, unsigned expbits)
{
    long double fnorm;
    int shift;
    long long sign, exp, significand;
    unsigned significandbits = bits - expbits - 1; // -1 for sign bit

    if (f == 0.0) return 0; // get this special case out of the way

    // check sign and begin normalization
    if (f < 0) { sign = 1; fnorm = -f; }
    else { sign = 0; fnorm = f; }

    // get the normalized form of f and track the exponent
    shift = 0;
    while (fnorm >= 2.0) { fnorm /= 2.0; shift++; }
    while (fnorm < 1.0) { fnorm *= 2.0; shift--; }
    fnorm = fnorm - 1.0;

    // calculate the binary form (non-float) of the significand data
    significand = fnorm * ((1LL << significandbits) + 0.5f);

    // get the biased exponent
    exp = shift + ((1 << (expbits - 1)) - 1); // shift + bias

    // return the final answer
    return (sign << (bits - 1)) | (exp << (bits - expbits - 1)) | significand;
}
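For the standard IEEE 754 sizes this would be called as, e.g.:

pack754(f, 32, 8);   // binary32 (float): 8 exponent bits
pack754(f, 64, 11);  // binary64 (double): 11 exponent bits

and a matching unpack routine reverses the steps when reading the value back.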
*: In C, since C99, the %a conversion prints floating-point values in a hexadecimal text form that round-trips exactly, but I think it will still take more space than the raw binary.

How do I write files that gzip well?

I'm working on a web project, and I need to create a format to transmit files very efficiently (lots of data). The data is entirely numerical, and split into a few sections. Of course, this will be transferred with gzip compression.
I can't seem to find any information on what makes a file compress better than another file.
How can I encode floats (32bit) and short integers (16bit) in a format that results in the smallest gzip size?
P.S. It will be a lot of data, so saving 5% means a lot here. There likely won't be any repeats in the floats, but the integers will likely repeat about 5-10 times in each file.
The only way to compress data is to remove redundancy. This is essentially what any compression tool does: it looks for redundant/repeated parts and replaces them with a link/reference to the same data observed earlier in your stream.
If you want to make your data format more efficient, you should remove everything that can possibly be removed. For example, it is more efficient to store numbers in binary rather than as text (JSON, XML, etc). If you have to use a text format, consider removing unnecessary spaces and linefeeds.
One good example of an efficient binary format is Google Protocol Buffers. It has lots of benefits, not least of which is storing numbers as a variable number of bytes (i.e. the number 1 consumes less space than the number 1000000); a sketch of that encoding follows below.
Whether text or binary, if you can sort your data before sending, you give the gzip compressor a better chance to find redundant parts, which will most likely increase the compression ratio.
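As an illustration of the variable-length idea mentioned above, here is a minimal sketch of the base-128 "varint" encoding that Protocol Buffers uses for unsigned integers (a hand-rolled sketch, not the library's actual API):

#include <cstdint>
#include <vector>

// Encode an unsigned integer as a base-128 varint: 7 payload bits per byte,
// with the high bit set on every byte except the last.
// Small values take fewer bytes: 1 encodes to 1 byte, 1000000 to 3 bytes.
std::vector<std::uint8_t> encode_varint(std::uint64_t value) {
    std::vector<std::uint8_t> out;
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
    return out;
}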
Since you said 32-bit floats and 16-bit integers, you are already coding them in binary.
Consider the range and useful accuracy of your numbers. If you can limit those, you can recode the numbers using fewer bits. Especially the floats, which may have more bits than you need.
If the right number of bits is not a multiple of eight, then treat your stream of bytes as a stream of bits and use only the bits needed. Be careful to deal with the end of your data properly, so that the padding bits added to reach the next byte boundary are not interpreted as another number.
If your numbers have some correlation to each other, then you should take advantage of that. For example, if the difference between successive numbers is usually small, which is the case for a representation of a waveform for example, then send the differences instead of the numbers. Differences can be coded using variable-length integers or Huffman coding or a combination, e.g. Huffman codes for ranges and extra bits within each range.
If there are other correlations that you can use, then design a predictor for the next value based on the previous values. Then send the difference between the actual and predicted value. In the previous example, the predictor is simply the last value. An example of a more complex predictor is a 2D predictor for when the numbers represent a 2D table and both adjacent rows and columns are correlated. The PNG image format has a few examples of 2D predictors.
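A minimal sketch of the simplest predictor described above, where the prediction is just the previous value and only the differences get stored (a hypothetical helper; 16-bit samples assumed):

#include <cstddef>
#include <cstdint>
#include <vector>

// Replace each sample with its difference from the previous one. For slowly
// varying data the differences cluster near zero, which gzip (or a
// varint/Huffman stage) compresses far better than the raw values.
std::vector<std::int16_t> delta_encode(const std::vector<std::int16_t>& samples) {
    std::vector<std::int16_t> out(samples.size());
    std::int16_t prev = 0;
    for (std::size_t i = 0; i < samples.size(); ++i) {
        // wraps modulo 2^16 on overflow; the decoder's addition undoes this
        out[i] = static_cast<std::int16_t>(samples[i] - prev);
        prev = samples[i];
    }
    return out;
}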
All of this will require experimentation with your data, ideally large amounts of your data, to see what helps and what doesn't or has only marginal benefit.
Use binary instead of text.
A float in its text representation with eight digits (a float carries roughly seven significant decimal digits, and needs up to nine for an exact round trip), plus the decimal separator, plus a field separator, consumes about 10 bytes. In binary representation, it takes only 4.
If you need to use text, use hex. It consumes fewer digits.
But although this makes a big difference for the uncompressed file, those differences might disappear after compression, since the compression algorithm should implicitly take care of that. But it is worth trying and measuring.

Why and how should I write and read from binary files?

I'm coding a game project as a hobby and I'm currently in the part where I need to store some resource data (.BMPs for example) into a file format of my own so my game can parse all of it and load into the screen.
For reading BMPs, I read the header, and then the RGB data for each pixel, and I have an array[width][height] that stores these values.
I was told I should save this type of data in binary, but not the reason. I've read about binary and what it is (the 0-1 representation of data), but why should I use it to save .BMP data, for example?
If I'm going to read it later in the game, doesn't it just add more complexity and maybe even slow down the loading process?
And lastly, if it is better to save in binary (I'm guessing it is, seeing as how everyone seems to do so from what I researched in other game resource files) how do I read and write binary in C++?
I've seen lots of questions but with many different ways for many different types of variables, so I'm asking which is the best/more C++ish way of doing it?
You have it all backwards. A computer processor operates with data at the binary level. Everything in a computer is binary. To deal with data in human-readable form, we write functions that jump through hoops to make that binary data look like something that humans understand. So if you store your .BMP data in a file as text, you're actually making the computer do a whole lot more work to convert the .BMP data from its natural binary form into text, and then from its text form back into binary in order to display it.
The truth of the matter is that the more you can handle data in its raw binary form, the faster your code will run. Fewer conversions mean faster code. But there's obviously a tradeoff: if you need to be able to look at data and understand it without pulling out a magic decoder ring, then you might want to store it in a file as text. But in doing so, we have to understand that there is conversion processing that must be done to make that human-readable text meaningful to the processor, which, as I said, operates on nothing but pure binary data.
And, just in case you already knew that or sort-of-knew-it, and your question was "why should I open my .bmp file in binary mode and not in text mode", the reason is that opening a file in text mode asks the platform to perform CRLF-to-LF conversions ("\r\n"-to-"\n" conversions) as appropriate for the platform, so that at the internal string-processing level all you're dealing with is '\n' characters. If your file consists of binary data, you don't want that conversion going on, or else it will corrupt the data as you read it. Most of the data will be fine, and things may work fine most of the time, but occasionally you'll run across a pair of bytes of the hexadecimal form 0x0d,0x0a (decimal 13,10) that will get converted to just 0x0a (10), and you'll be missing a byte in the data you read. Therefore, be sure to open binary files in binary mode, as in the sketch below.
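A minimal sketch of that (the file name is just a placeholder):

#include <fstream>
#include <iterator>
#include <vector>

int main() {
    // std::ios::binary suppresses any text-mode newline translation,
    // so the bytes come through exactly as stored on disk.
    std::ifstream file("sprite.bmp", std::ios::binary);

    // Slurp the raw bytes; header parsing would come after this.
    std::vector<char> bytes((std::istreambuf_iterator<char>(file)),
                            std::istreambuf_iterator<char>());
}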
OK, based on your most recent comment (below), here's this:
As you (now?) understand, data in a computer is stored in binary format. Yes, that means it's in 0's and 1's. However, when programming, you don't actually have to fiddle with the 0's and 1's yourself, unless you're doing bitwise logical operations for some reason. A variable of type, let's say, int, for example, is a collection of individual bits, each of which can be either 0 or 1. It's also a collection of bytes, and assuming that there are 8 bits in a byte, there are generally 2, 4, or 8 bytes in an int, depending on your platform and compiler options. But you work with that int as an int, not as individual 0's and 1's. If you write that int out to a file in its purest form, the bytes (and thus the bits) get written out in unconverted raw form. But you could also convert them to ASCII text and write them out that way. If you're displaying an int on the screen, you don't want to see the individual 0's and 1's of course, so you print it in its ASCII form, generally decoded as a decimal number. You could just as easily print that same int in its hexadecimal form, and the result would look different even though it's the same number. For example, in decimal, you might have the value 65. That same value in hexadecimal is 0x41 (or just 41, if we understand that it's in base 16). That same value is the letter 'A' if we display it in ASCII form (and consider only the low byte of the 2-, 4-, or 8-byte int, i.e. treat it as a char).
For the rest of this discussion, forget that we were talking about an int and consider that we're discussing a char, or 1 byte (8 bits). Let's say we still have that same value, 65, or 0x41, or 'A', however you want to look at it. If you want to send that value to a file, you can send it in its raw form or convert it to text form. If you send it in its raw form, it will occupy 8 bits (one byte) in the file. But if you want to write it to the file in text form, you'd convert it to ASCII, which, depending on the format you want to write it in and the actual value (65 in this case), will occupy either 1, 2, or 3 bytes. Say you want to write it in decimal ASCII with no padding characters. The value 65 will then take 2 bytes: one for the '6' and one for the '5'. If you want to print it in hexadecimal form, it will still take 2 bytes: one for the '4' and one for the '1', unless you prepend it with "0x", in which case it will take 4 bytes: one for '0', one for 'x', one for '4', and another for '1'. Or suppose your char is the value 255 (the maximum value of an unsigned char): if we write it to the file in decimal ASCII form, it will take 3 bytes, but if we write that same value in hexadecimal ASCII form, it will still take 2 bytes (or 4, if we're prepending "0x"), because the value 255 in hexadecimal is 0xFF. Compare this to writing that 8-bit byte (char) in its raw binary form: a char takes 1 byte (by definition), so it will consume only 1 byte of the file in binary form regardless of its value.
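A minimal sketch of that 65 / 0x41 / 'A' example, printing the same byte three ways and then writing it raw:

#include <cstdio>

int main() {
    unsigned char value = 65;

    std::printf("%d\n", value);                         // "65" : decimal text, 2 bytes
    std::printf("%x\n", static_cast<unsigned>(value));  // "41" : hex text, 2 bytes
    std::printf("%c\n", value);                         // "A"  : the glyph itself, 1 byte

    // Raw form: exactly 1 byte in the file, whatever the value is.
    FILE* fp = std::fopen("byte.bin", "wb");
    std::fwrite(&value, 1, 1, fp);
    std::fclose(fp);
}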

Integer Types in file formats

I am currently trying to learn some more in depth stuff of file formats.
I have a spec for a 3D file format (U3D in this case) and I want to try to implement that. Nothing serious, just for the learning effect.
My problem starts very early with the types, that need to be defined. I have to define different integers (8Bit, 16bit, 32bit unsigned and signed) and these then need to be converted to hex before writing that to a file.
How do I define these types, since I cannot just create an I16, for example?
Another problem for me is how to convert that I16 to a hex number with 8 digits
(i.e. 0001 0001).
Hex is just a representation of a number. Whether you interpret the number as binary, decimal, hex, octal, etc. is up to you. In C++ you have support for decimal, hex, and octal literals, but they are all stored in the same way.
Example:
#include <cassert>
int main() {
    int x = 0x1;     // written as a hex literal
    int y = 1;       // written as a decimal literal
    assert(x == y);  // same value, stored identically
}
Most likely the file format wants you to store the values in plain binary format; I don't think it wants the hex numbers as readable text strings. If it does, though, you can use std::hex to do the conversion for you. (Example: file << std::hex << number;)
If the file format talks about writing more than a 1-byte type to the file, then be careful about the endianness of your architecture, that is, whether the most significant byte of the multi-byte type is stored first or last.
It is very common in file format specifications to show you how the binary should look for a given part of the file. Don't confuse this though with actually storing binary digits as strings. Likewise they will sometimes give a shortcut for this by specifying in hex how it should look. Again most of the time they don't actually mean text strings.
The smallest addressable unit in C++ is a char, which is 1 byte. If you want to set bits within that byte you need to use bitwise operators like & and |. There are many tutorials on bitwise operators so I won't go into detail here, but the sketch after the type list below shows a typical use.
If you include <cstdint> (or <stdint.h> in C) you will get fixed-width types such as:
uint8_t
int16_t
uint32_t
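A minimal sketch tying the last two points together, using fixed-width types and the bitwise operators to write a 16-bit value with an explicit byte order (little endian here; the U3D spec may prescribe a different convention, so treat this as an illustration only):

#include <cstdint>
#include <cstdio>

int main() {
    std::uint16_t value = 0x1234;

    // Split the 16-bit value into bytes ourselves, so the byte order in the
    // file is fixed regardless of the host architecture's endianness.
    std::uint8_t low  = value & 0xFF;         // 0x34
    std::uint8_t high = (value >> 8) & 0xFF;  // 0x12

    FILE* fp = std::fopen("out.bin", "wb");
    std::fputc(low, fp);   // least significant byte first => little endian
    std::fputc(high, fp);
    std::fclose(fp);
}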
First, let me understand.
The integers are stored AS TEXT in a file, in hexadecimal format, without prefix 0x?
Then, use this syntax:
fprintf(fp, "%08x", number);
That will write, for example, 0abc1234 into the file (when number is 0x0abc1234).
As for "define different integers (8Bit, 16bit, 32bit unsigned and signed)": unless you roll your own, along with the concomitant math operations for them, you should stick to the types supplied by your system. See stdint.h for the available typedefs, such as int32_t.