I'm learning to write binary files in C++. I'm a bit confused with the result. Let's say I have this code:
#include <fstream>
#include <string>
using namespace std;

int main() {
    ofstream file;
    string text = "Some text over here";
    file.open("test.bin", ios::out | ios::binary);
    file.write(text.c_str(), text.length());
    file.close();
    return 0;
}
I'm expecting the output file test.bin to be "in binary", but when I look at it in notepad, I see normal text:
Some text over here
Is my expectation wrong? What makes things binary and what should I use to achieve it?
The most "important" definition of what the word "binary" means comes from just a situation where a number can take on one of two values. Whatever you call those doesn't strictly matter ("on"/"off", "1"/"0", "yes"/"no"). All that matters is that there are just two states.
Keep that core definition in mind. But you will find a large number of other idiomatic usages of the word "binary" in the computer world, depending on context.
As an example: some people will refer to a file representing an executable image (such as an .EXE file on Windows) as simply "a binary" (or "the binary", when you're building a particular codebase and it's clear which executable is meant).
You've tripped over another confusing distinction: sometimes people will talk about a file format as being either "textual" or "binary". Yet today's computers are based on systems that are always binary (technically they don't have to be). So if "textual" files aren't ultimately stored as binary bits somewhere, how else would they be stored? :-/
So really what it means for a file format to be labeled as "textual" is to say that it is "stricter about what binary patterns it uses, such that it will only use those patterns which make sense in certain textual encodings". That's why those files look readable when you load them up in text editors.
So a "textual file format" is a subset of all "file formats". And sometimes when people want to refer to something that is not in that subset of textual files, they will call it a "binary file format".
Plenty of room for confusion! But the upshot is that all the "text" vs. "binary" open mode does in C++ is tell the stream how to treat the bytes. Opening in binary mode asks for every byte to be sent to the file verbatim, instead of having the stream take care of cross-platform text-file differences in newline handling "under the hood" as a convenience.
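As a rough illustration of that difference (the file names here are just placeholders), consider writing the same text in both modes; on Windows the text-mode file picks up extra carriage returns, while on Unix-like systems the two files come out identical:
#include <fstream>

int main() {
    // Text mode (the default): the library may translate '\n' into the
    // platform's line-ending sequence ("\r\n" on Windows) as it writes.
    std::ofstream textFile("text_mode.txt");
    textFile << "line 1\nline 2\n";

    // Binary mode: every byte goes to the file exactly as given, so each
    // '\n' is written as a single 0x0A byte on any platform.
    std::ofstream binFile("binary_mode.txt", std::ios::out | std::ios::binary);
    binFile << "line 1\nline 2\n";

    return 0;
}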
Related
I am currently working on a program that needs to read and write simple binary files. The data that these files contain is large enough to make even compressed text files impractical. In these binary files, however, I need to store some metadata as text. I can easily create one or more std::strings in my IO code that can then be stored, but my simplest idea for storing them is something like this:
std::string toStore; // assume that this string is decently long and has only ascii characters in it
std::vector<unsigned char> output;
const unsigned char *strPointer = reinterpret_cast<const unsigned char*>(toStore.data());
output.insert(output.end(), strPointer, strPointer + toStore.length());
output.push_back('\0');
std::ofstream fout("output.bin", std::ios::binary); // file name only for illustration
fout.write(reinterpret_cast<const char*>(output.data()), output.size());
This method, however, has a major problem: because the character encoding on different C++ implementations is not guaranteed to be the same, the file format will be different when it is read or written with different C++ implementations.
I am interested in my program supporting Linux, Mac OS, and Windows with just a recompile (i.e. no code changes). These strings contain only ASCII characters. As such, my other idea is to define my own character encoding and use switch statements (or something similar) to convert between it and whatever the implementation's encoding is. This is practical because ASCII doesn't contain many characters. Such a new encoding, however, is not ideal because it will make my files more difficult to read for other programmers wishing to implement my format (which is plausible in the context of what this program is). I could attempt the aforementioned technique with ASCII as the character encoding, but this is needlessly inefficient because most compilers will be using ASCII anyway.
Is there some better way to handle this? For example, is there a standard library function to convert a string to its ASCII or Unicode representation?
Is there any problem with using the formatted IO operations in binary mode, especially if I'm only dealing with text files?
(1):
For binary files, reading and writing data with the extraction and insertion operators (<< and >>) and functions like getline is not efficient, since we do not need to format any data and data is likely not formatted in lines.
(2):
Normally, for binary file i/o you do not use the conventional text-oriented << and >> operators! It can be done, but that is an advanced topic.
The "advanced topic" nature is what made me question mixing these two. There is a mingw bug with the seek and tell functions which can be resolved by opening up in binary mode. Is there any issue with using << and >> in binary mode compared to text mode or must I always resort to unformatted IO if opening up in binary? As far as I can tell for text files, I just have to account for carriage-returns (\r) which aren't implictly removed/added for me, but is that all there is to account for?
Is there any problem with using the formatted IO operations in binary
mode, especially if I'm only dealing with text files?
I just have to account for carriage-returns (\r) which aren't
implicitly removed/added for me
If you want or need \r in your data, you are probably dealing with text / strings. For that you do not need to use binary files. Although you could open text files in binary mode to do a quick scan for newlines, for example (a line count), without having to do a less efficient line-by-line read.
Binary files are used to store binary values directly (mostly numbers or data structures), without the need to convert them to text and back to binary again.
Another advantage of binary files is that you don't have to do any parsing. You can access all your data directly, wherever it may be in the file (assuming the data is stored in a well structured manner).
For example: if you need to store records, each containing 5 32-bit numbers, you can write those directly to the binary file in their native binary format (no time wasted converting and parsing). To later read record number 1000, for example, you can seek directly to position 5 x 4 x (1000-1) and read your 20-byte record from there. With text files, on the other hand, you would need to scan every byte from the beginning of the file until you have counted 1000 lines (which would also be of different lengths).
You would use read() and write() (or fread() / fwrite()) directly (although << and >> could be used too for serialization of objects with variable lengths).
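A minimal sketch of that kind of fixed-size record access, using the layout from the example above (the file name is made up, and the file is assumed to already hold at least 1000 such records):
#include <cstdint>
#include <fstream>
#include <iostream>

// Hypothetical record: 5 32-bit numbers = 20 bytes, no padding.
struct Record {
    std::int32_t values[5];
};

int main() {
    std::ifstream file("records.bin", std::ios::binary);

    // Jump straight to record number 1000: byte offset 5 x 4 x (1000 - 1).
    file.seekg(sizeof(Record) * (1000 - 1), std::ios::beg);

    Record r{};
    file.read(reinterpret_cast<char*>(&r), sizeof(Record));
    std::cout << r.values[0] << '\n';
    return 0;
}
Note that this only works reliably when every record really is the same size and the file was written by a machine with the same integer layout.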
Binary files should also have a header with some basic information. See my answer here for more information on that.
I have some misunderstandings about binary files. I don't understand what a binary file is. I know text files are also binary files, but they need to be parsed in order to extract information. Unlike text files, binary files with the same contents look different: for example, when storing my name "Rishabh" in a binary file, it stores not only "Rishabh" but also some extra unreadable characters. What are they? Why doesn't it only store the characters, like a text file? And what are binary file formats, e.g. .3d, .zip, .mp3, etc.? From what I know of text files, the extension specifies what the format is or how to process the file, like .dae, .xml, .htm, etc. Those contain tags to store data, but binary files don't need any tags, because the data is stored like variables in the file, from which we just copy the contents into the program's variables (I mean it's like being stored in memory). So why are these binary file formats different? Why can't a single program just read all the contents of a file whose format is unknown to the world and to me? And what is binary file format cracking?
All files have some kind of pre-determined encoding, since computers can't store anything but bit patterns in bytes on disk. A text file contains only the encodings for printable characters plus space, and a few other encodings for end-of-line, tab, and maybe form feed and a few others related to character display on a device. Because the encoding in a text file is a well-known standard, and is quite common, there are functions in most, if not all, languages to deal specifically with that type of file. Most importantly, they know how to read a line at a time - they recognize the line-terminator character(s).
If, however, you type the characters of your name in some other program besides a text editor - say using the text tool in Gimp or Microsoft Paint - and then save it, the program has to save more information than just your name. Your name has a position on a canvas that must be saved. It also has a font and a size, and whether it is bold or italic or underlined, which need to be saved. The size of the canvas needs to be saved. The color being used, even if just black and white, needs to be saved. This encoding will be different from the encoding used to save the letters of your name. So if you edit the file with a text editor, you will see some gibberish, since the text editor is expecting character encodings and knows nothing about the encoding Gimp uses for fonts, font sizes, x,y positions, etc.
C++ compilers are not written with routines to understand any binary file encodings. The routines for reading/writing binary files in C++ will just read and write sequences of bytes. Since the fundamental type that holds a byte of data in C++ is a char (or unsigned char), you will see binary I/O prototypes like
ostream& write (const char* buffer, streamsize size);
istream& read (char* buffer, streamsize size);
But the char pointer in this case should be considered as a "byte *" since the read/write functions are just moving bytes of data from/to disk or memory without any regard for character encodings.
C++ read/write routines don't know, or care what the format or encoding is for the bytes they are moving. So it is left up to the programmer to write code to process or handle these bytes according to the pre-defined format for the file. However, the routines written to process a specific format of binary file can be compiled into a library that can then be shared or sold, and used by many C++ programmers. For example, LibXL can be used to read the binary format of Excel files from a C++ program.
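To illustrate that the stream just moves bytes and leaves the interpretation to the program, here is a small sketch (the file name is only for illustration, and reading the value back like this assumes the same machine and byte order):
#include <fstream>
#include <iostream>

int main() {
    int valueOut = 1234;

    // write() copies the bytes of the int to the file verbatim; the stream
    // has no idea (and does not care) that those bytes represent an int.
    std::ofstream out("raw.bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(&valueOut), sizeof valueOut);
    out.close();

    // read() copies the same bytes back; it is the program that decides
    // they should be interpreted as an int again.
    int valueIn = 0;
    std::ifstream in("raw.bin", std::ios::binary);
    in.read(reinterpret_cast<char*>(&valueIn), sizeof valueIn);
    std::cout << valueIn << '\n'; // prints 1234

    return 0;
}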
From the perspective of C/C++, the only difference between text and binary files is how line endings are handled.
If you open a file in binary mode, then read reads exactly the bytes in the file, and write writes exactly the bytes which are in memory.
If you open a file in text mode, then whatever character or character sequence is conventionally used to represent the end of a line in a file is transformed into some single character (which is written in the source code as '\n', although it is only one character) when the file is read, and the \n is transformed into the conventional end-of-line character or sequence when the file is written to. Also, it is not technically legal for the file to not end with an end-of-line sequence, and there may be a limit to the length of a line.
In Unix, the two modes are identical, because \n is a representation of the character code 10 (0A in hex), and that is precisely the conventional line-ending character. In Windows, by contrast, the conventional line-ending sequence is two bytes long -- {13,10} or {0D,0A}. \n is still 0A, so effectively the 0D preceding the 0A is deleted from the data read from the file, and an 0D is inserted before every 0A when data is written to the file.
Some (much) older operating systems had no conventional line-ending character. Instead, all lines were padded with space characters to the exact same length, making it possible to directly seek to a specific line number. C libraries working in text mode would typically read exactly the line length, and then delete the trailing spaces (if any) and finally add the code corresponding to \n (some such systems used EBCDIC instead of ASCII, so \n was a different integer value). Writing the data out, the \n would be deleted and replaced with exactly the correct number of spaces to bring the line to the standard length. Fortunately, those of us who don't work in a computing museum don't have to deal with that stuff any more, and Apple abandoned its use of 0D as the line-end character with the advent of OSX, so the text/binary difference is now limited to Windows.
Technically text files are binary too, as all files are really binary files. Text files tend to store only text characters, while binary files can store any conceivable value - numbers, images, text, etc. Numbers, for example, are not stored in decimal notation like "1234"; they are stored in binary, using 0s and 1s only. There are a few ways to do this (depending on your machine), so the same number could look like a different set of 0s and 1s, e.g. 0001110101011, etc. If you open binary files in Notepad, it tries to display everything as text, and what you see is mostly garbage, which is the other data being shown as if it were text.
Cracking a binary file format means knowing exactly what information is stored in each byte of the file... Sometimes text, numbers, arrays, classes, structures... Anything, really. Given experience, one could slowly work out what is what, but that's pretty advanced stuff!
Sometimes the information (the format) is freely available and easy to follow, and sometimes it is a nightmare to follow, like the format of an MS Word document. (The MS Word format is freely available, but reputed to be insanely complicated due to backwards compatibility... Nonetheless, having the format documentation allows you to 'crack' the binary file format and know exactly what all the binary represents.)
It's one of the fundamentals of a computer system.
There is probably a great explanation at this link:
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BitOp/asciiBin.html
Some text quoted:
Although ASCII files are binary files, some people treat them as
different kinds of files. I like to think of ASCII files as special
kinds of binary files. They're binary files where each byte is written
in ASCII code.
A full, general binary file has no such restrictions. Any of the 256
bit patterns can be used in any byte of a binary file.
We work with binary files all the time. Executables, object files,
image files, sound files, and many file formats are binary files. What
makes them binary is merely the fact that each byte of a binary file
can be one of 256 bit patterns. They're not restricted to the ASCII
codes.
As far as I know, there are two standard ways to read a text file in C++ (in this case, a file with 2 numbers on every line). The two standard methods are:
Assume that every line consists of 2 numbers and read token by token:
#include <fstream>
std::ifstream infile("thefile.txt");
int a, b;
while (infile >> a >> b)
{
// process pair (a,b)
}
Line-based parsing, using string streams:
#include <sstream>
#include <string>
#include <fstream>
std::ifstream infile("thefile.txt");
std::string line;
while (std::getline(infile, line))
{
std::istringstream iss(line);
int a, b;
if (!(iss >> a >> b)) { break; } // error
// process pair (a,b)
}
And I can also use the code below to check whether the file has ended or not:
while (!infile.eof())
My questions are:
Question 1: How do these functions understand that one line is the last line? I mean, how does eof() return false/true?
As far as I know, they are reading a part of memory. What is the difference between the part that belongs to the file and the parts that do not?
Question 2: Is there any way to cheat this function? I mean, is it possible to add something in the middle of the text file (for example with a hex editor tool) and make eof() wrongly return true in the middle of the text file?
Appreciate your time and consideration.
Question 1: How do these functions understand that one line is the last line? I mean, how does eof() return false/true?
It doesn't. The functions know when you've tried to read past the very last character in the file. They don't necessarily know whether a line is the last line. "Files" aren't the only things that you can read with streams. Keyboard input, a special-purpose device, internet sockets: all can be read with the right kind of I/O stream. When reading from standard input, the stream has no way of knowing whether the very next thing I type will be control-Z.
With regard to files on a computer disk, most modern operating systems store metadata regarding the file separately from the file itself. These metadata include the length of the file (and oftentimes when the file was last modified and when it was last read). On these systems, the stream buffer that underlies the I/O stream knows the current read location within the file and knows how long the file is. The stream buffer signals EOF when the read location reaches the length of the file.
That's not universal, however. There are some not-so-common operating systems that don't use this concept of metadata stored elsewhere. End of file on a disk file is just as surprising on these systems as is end of file from user input on a keyboard.
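One way to see that length metadata in action is to ask the stream for its position after seeking to the end; a minimal sketch (the file name is assumed):
#include <fstream>
#include <iostream>

int main() {
    // Because the operating system records the file's length as metadata,
    // the stream can report the size without reading any of the contents.
    std::ifstream file("thefile.txt", std::ios::binary);
    file.seekg(0, std::ios::end);
    std::cout << "file size in bytes: " << file.tellg() << '\n';
    return 0;
}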
As far as I know, they are reading a part of memory. What is the difference between the part that belongs to the file and the parts that do not?
Learn the difference between memory and disk files. There's a huge difference between the two. Unless you're working with an embedded computer, memory is much more limited than is disk space.
Question 2: Is there any way to cheat this function? I mean, is it possible to add something in the middle of the text file (for example with a hex editor tool) and make eof() wrongly return true in the middle of the text file?
That depends very much on how the operating system implements files. On most modern operating systems, the answer is not just "no" but "No!". The concept of using some special signature that indicates end of file in a disk file is one of many computer science concepts that for the most part have been dumped into the pile of "that wasn't very smart" ideas. You asked your question on the internet. That most likely means you are using a Windows machine, a Linux machine, or a Mac. All of them store the length of a file as metadata separate from the contents of a file.
However, there is a need for the ability to clear the end-of-file indicator. One program might be writing to a file while at the same time another is reading from it. The reader might hit EOF while the writer is still active. The reader needs to clear the EOF indicator to continue reading what the writer has written. The C++ I/O streams provide the ability to do just that: every I/O stream has a clear function. Whether it works, that's a different story. The clear will work temporarily, but the very next read might well set the EOF bit again. For example, when I type control-Z on my keyboard, that means I am done interacting with the program, period. My next action might well be to go out for lunch.
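A rough sketch of that reader side, assuming a hypothetical growing.log being appended to by another process (it ignores partial lines and never terminates; it is only meant to show clear() at work):
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main() {
    std::ifstream in("growing.log");
    std::string line;
    for (;;) {
        // Read everything that is currently available.
        while (std::getline(in, line)) {
            std::cout << line << '\n';
        }
        // getline() stopped because it hit EOF; drop the EOF/fail bits so
        // the next attempt can pick up whatever the writer has added since.
        in.clear();
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}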
I'm writing a set of unit tests that write calculated values out to files. Each test produces a square matrix that holds anywhere from 50,000 to 500,000 doubles, and I have a total of 128 combinations of test cases.
Is there any significant overhead involved in writing cout statements and then piping that output to files, or would I be better off writing directly to the file using an ofstream?
This is going to be dependent on your system and environment. There is likely to be very little difference, but there is only one way to be sure: try both approaches and measure them.
Since the dimensions involved are so large, I'm assuming these files are not meant to be read by a human being? Just make sure you write them out as binary and not as human-readable text, because that will make far more difference than the choice between using ofstream and piping cout.
Whether this means you have to use ofstream or not I don't know. I've never written binary to cout so I can't say whether that's possible...
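For what it's worth, here is a minimal sketch of what "write them out as binary" could look like with an ofstream (the matrix size and file name are placeholders):
#include <fstream>
#include <vector>

int main() {
    // Pretend this is one test case's square matrix of doubles.
    std::vector<double> matrix(500000, 3.141592653589793);

    // One bulk write of 8 bytes per double; no decimal formatting at all.
    std::ofstream out("matrix.bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(matrix.data()),
              matrix.size() * sizeof(double));
    return 0;
}
Formatting each double as text (out << value << '\n') would spend most of its time converting to decimal digits, which is why the binary route tends to dwarf any ofstream-vs-cout difference.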
As Charles Bailey said, it's implementation-dependent; what follows is mostly about the Linux implementation with the GNU toolchain, but I can hardly imagine it being very different on other OSes.
In libstdc++ 4.4.2:
An fstream contains an underlying stdio_filebuf, which is a basic_filebuf. This basic_filebuf contains its own buffer (by inheriting from basic_streambuf), and actually contains a __basic_file, itself containing an underlying plain C stdio abstraction (FILE*, or std::__c_file*), into which it flushes the buffer.
cout, which is an ostream, is initialized with a stdio_sync_filebuf, itself initialized with the C file abstraction stdout. stdio_sync_filebuf calls plain C stdio functions.
Considering only C++, it appears that an fstream may be more efficient thanks to its two layers of buffering.
Considering only C, if the process is forked with the stdout file descriptor redirected to a file, there should be no difference between writing to a newly opened file (what fstream does in the end) and writing to stdout, since the fd points to a file anyway (which is what cout does in the end).
If I were you, I would use an fstream, since that matches your intent.