Using formatted IO operations in binary mode? - c++

Is there any problem with using the formatted IO operations in binary mode, especially if I'm only dealing with text files?
(1):
For binary files, reading and writing data with the extraction and insertion operators (<< and >>) and functions like getline is not efficient, since we do not need to format any data and data is likely not formatted in lines.
(2):
Normally, for binary file i/o you do not use the conventional text-oriented << and >> operators! It can be done, but that is an advanced topic.
The "advanced topic" remark is what made me question mixing these two. There is a MinGW bug with the seek and tell functions which can be worked around by opening the file in binary mode. Is there any issue with using << and >> in binary mode compared to text mode, or must I always resort to unformatted IO when opening in binary? As far as I can tell, for text files I just have to account for carriage returns (\r), which aren't implicitly removed/added for me, but is that all there is to account for?

Is there any problem with using the formatted IO operations in binary
mode, especially if I'm only dealing with text files?
I just have to account for carriage-returns (\r) which aren't
implicitly removed/added for me
If you want or need \r in your data, you are probably dealing with text / strings. For that you do not need to use binary files. You could, however, open text files in binary mode to do a quick scan for newlines (a line count, for example) without having to call the less efficient getline().
Binary files are used to store binary values directly (mostly numbers or data structures), without the need to convert them to text and back to binary again.
Another advantage of binary files is that you don't have to do any parsing. You can access all your data directly, wherever it may be in the file (assuming the data is stored in a well structured manner).
For example: if you need to store records, each containing 5 32-bit numbers, you can write those directly to the binary file in their native binary format (no time wasted with converting and parsing). To later read record number 1000, for example, you can seek directly to position 5 x 4 x (1000-1) and read your 20-byte record from there. With text files, on the other hand, you would need to scan every byte from the beginning of the file until you have counted 1000 lines (which would also be of different lengths).
You would use read() and write() (or fread() / fwrite()) directly (although << and >> could be used too for serialization of objects with variable lengths).
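The record example above can be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the `Record` alias and the helper names `write_records` and `read_record` are made up here, indexes are 0-based (so record number 1000 is index 999), and the values are stored in the machine's native byte order.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical record layout: five 32-bit numbers, 20 bytes in total.
using Record = std::array<std::int32_t, 5>;

// Writes `count` records; record i (0-based) holds {i, i, i, i, i}.
void write_records(const std::string& path, std::int32_t count) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    for (std::int32_t i = 0; i < count; ++i) {
        Record r;
        r.fill(i);
        out.write(reinterpret_cast<const char*>(r.data()), sizeof(Record));
    }
}

// Reads record `index` (0-based) by seeking straight to its byte offset,
// without touching any of the records before it.
bool read_record(const std::string& path, std::size_t index, Record& r) {
    std::ifstream in(path, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(index * sizeof(Record)));
    in.read(reinterpret_cast<char*>(r.data()), sizeof(Record));
    return in.gcount() == static_cast<std::streamsize>(sizeof(Record));
}
```

Calling `read_record(path, 999, r)` seeks to byte 5 x 4 x 999 and reads 20 bytes from there, which is the calculation described above.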
Binary files should also have a header with some basic information. See my answer here for more information on that.

Related

What's the best way to store binary

I've recently implemented Huffman compression in C++. If I were to store the results as text, it would take up a lot more space, as each 1 and 0 is a character. Alternatively, I was thinking maybe I could break the binary into sections of 8 and put characters in the text file, but that would be kind of annoying (so hopefully that can be avoided). My question here is: what is the best way to store binary in a text file in terms of character efficiency?
[To recap the comments...]
My question here is: what is the best way to store binary in a text file in terms of character efficiency?
If you can store the data as-is, then do so (in other words, do not use any encoding; simply save the raw bytes).
If you need to store the data within a text file (for instance as a paragraph or as a quoted string), then you have many ways of doing so. For instance, base64 is a very common one, but there are many others.
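For the "store the raw bytes" option, the string of 1 and 0 characters first has to be packed eight to a byte. The following is only a sketch of one way to do that; `pack_bits` and `unpack_bits` are hypothetical helper names, and the choice of most-significant-bit-first ordering is an assumption, not something the question specifies.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Packs a string of '0'/'1' characters into bytes, most significant bit first.
// The last byte is padded with zero bits, so the caller has to store the bit
// count separately to know where the real data ends.
std::vector<std::uint8_t> pack_bits(const std::string& bits) {
    std::vector<std::uint8_t> bytes((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i] == '1') {
            bytes[i / 8] |= static_cast<std::uint8_t>(0x80u >> (i % 8));
        }
    }
    return bytes;
}

// Reverses pack_bits, given the number of valid bits.
std::string unpack_bits(const std::vector<std::uint8_t>& bytes, std::size_t bit_count) {
    std::string bits;
    bits.reserve(bit_count);
    for (std::size_t i = 0; i < bit_count; ++i) {
        const bool set = ((bytes[i / 8] >> (7 - i % 8)) & 1u) != 0;
        bits.push_back(set ? '1' : '0');
    }
    return bits;
}
```

The packed bytes can then be written with write() to a stream opened in binary mode, taking one eighth of the space of the character form.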

Using Getline on a Binary File

I have read that getline behaves as an unformatted input function. Which I believe should allow it to be used on a binary file. Let's say for example that I've done this:
ofstream output("foo.txt", ios_base::binary);
const auto foo = "lorem ipsum";
output.write(foo, strlen(foo) + 1);
output.close();
ifstream input("foo.txt", ios_base::binary);
string bar;
getline(input, bar, '\0');
Is that breaking any rules? It seems to work fine, I think I've just traditionally seen arrays handled by writing the size and then writing the array.
No, it's not breaking any rules that I can see.
Yes, it's more common to write an array with a prefixed size, but using a delimiter to mark the end can work perfectly well also. The big difference is that (like with a text file) you have to read through data to find the next item. With a prefixed size, you can look at the size, and skip directly to the next item if you don't need the current one. Of course, you also need to ensure that if you're using something to mark the end of a field, that it can never occur inside the field (or come up with some way of detecting when it's inside a field, so you can read the rest of the field when it does).
Depending on the situation, that can mean (for example) using Unicode text. This gives you a lot of options for values that can't occur inside the text (because they aren't legal Unicode). That, on the other hand, would also mean that your "binary" file is really a text file, and has to follow some basic text-file rules to make sense.
Which is preferable depends on how likely it is that you'll want to read random pieces of the file rather than reading through it from beginning to end, as well as the difficulty (if any) of finding a unique delimiter and, if you don't have one, the complexity of making the delimiter distinguishable from data inside a field. If the data is only meaningful if read in order, then having to read it in order doesn't really pose a problem. If you can read individual pieces meaningfully, then being able to do so is much more likely to be useful.
In the end, it comes down to a question of what you want out of your file being "binary". In the typical case, all "binary" really means is that end-of-line markers, which might otherwise be translated from a new-line character to (for example) a carriage-return/line-feed pair, won't be. Depending on the OS you're using, it might not even mean that much though. For example, on Linux, there's normally no difference between binary and text mode at all.
Well, there are no rules broken and you'll get away with that just fine, except that you may miss the precision of reading binary from a stream object.
With binary input, you usually want to know how many characters were read successfully, which you can obtain afterwards with gcount()... Using std::getline will not reflect the bytes read in gcount().
Of course, you can simply get that information from the size of the string you passed into std::getline. But the stream will no longer record the number of bytes you consumed in the last unformatted operation.

C++ ifstream, ofstream: What's the difference between raw read()/write() calls and opening file in binary mode?

This question concerns the behaviour of ifstream and ofstream when reading and writing data to files.
From reading around stackoverflow.com I have managed to find out that operator<< (stream insertion operator) converts objects such as doubles to text representation before output, and calls to read() and write() read and write raw data as it is stored in memory (binary format) respectively. EDIT: This much is obvious, nothing unexpected here.
I also found out that opening a file in binary mode prevents automatic translation of newline characters as required by different operating systems.
So my question is this: does this automatic translation, e.g. from \n to \r\n, occur when calling the functions read() and write()? Or is this behaviour specific to operator<< (and operator>>)?
Note there is a similar but slightly less specific question here. It does not give a definite answer. Difference in using read/write when stream is opened with/without ios::binary mode
The difference between binary and text mode is at a lower level.
If you open a file in text mode you will get translated data even when using read and write operations.
Please also note that you're allowed to seek to a position in a text file only if the position was obtained from a previous tell (or 0). To be able to do random positioning, the file must have been opened in binary mode.
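One way to check this on a given platform is to write a few bytes with write() and look at the resulting file size. The helper below is only a sketch with a made-up name; the mode to test is passed in by the caller.

```cpp
#include <fstream>
#include <ios>
#include <string>

// Writes `data` with write() using the given extra open mode and returns the
// resulting file size in bytes, or -1 on failure.
long long file_size_after_write(const std::string& path, const std::string& data,
                                std::ios::openmode extra_mode) {
    {
        std::ofstream out(path, std::ios::out | std::ios::trunc | extra_mode);
        out.write(data.data(), static_cast<std::streamsize>(data.size()));
        if (!out) return -1;
    }  // closing the stream flushes the data to disk
    // Measure the size in binary mode so that no translation is involved.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return -1;
    return static_cast<long long>(in.tellg());
}
```

Writing "a\nb\n" with `std::ios::binary` gives a 4-byte file everywhere. Passing an empty mode (text mode) gives 6 bytes on Windows, because each \n becomes \r\n even though write() was used, and 4 bytes on Linux.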

How do Binary Files works? (From c++'s point of view) [closed]

I have some misunderstandings about binary files. I don't understand what a binary file is. I know text files are also binary files, but they need to be parsed in order to extract information. Unlike text files, binary files with the same contents look different. For example, when I store my name "Rishabh" in a binary file, it stores not only Rishabh but also some extra unreadable characters. What are they? Why doesn't it store only the characters, like a text file? And what are binary file formats, e.g. .3d, .zip, .mp3? From what I know of text files, the extension specifies the format or how to process the file, like .dae, .xml, .htm. These contain tags to store data. But what about binary files? They don't need any tags, because the data is stored like a variable in the file, whose contents we copy into the program's variables (I mean, it's stored as it is in memory). So why are these binary file formats different? Why can't a single program read the contents of any file whose format is unknown to the world and to me? And what is binary file format cracking?
All files have some kind of pre-determined encoding, since computers can't store anything but bit patterns in bytes on disk. A text file contains only the encodings for printable characters plus space, and a few other encodings for end-of-line, tab, maybe form-feed, and a few others related to character display on a device. Because the encoding in a text file is a well-known standard, and is quite common, there are functions in most, if not all, languages to deal specifically with that type of file. Most importantly, they know how to read a line at a time: they recognize the line-terminator character(s).
If, however, you type the characters of your name in some other program besides a text editor (say you write it using the text tool in Gimp or Microsoft Paint) and then save it, the program has to save more information than just your name. Your name has a position on a canvas that must be saved. It also has a font and a size, and whether it is bold or italic or underlined, all of which need to be saved. The size of the canvas needs to be saved. The color being used, even if white and black, needs to be saved. This encoding will be different from the encoding used to save the letters of your name. So if you open the file with a text editor, you will see some gibberish, since the text editor is expecting character encoding and knows nothing about the encoding Gimp uses for fonts, font sizes, x,y positions, etc.
C++ compilers are not written with routines to understand any binary file encodings. The routines for reading/writing binary files in C++ will just read and write sequences of bytes. Although, since the fundamental type that holds a byte of data in C++ is a char (or unsigned char), you will see binary prototypes like
write ( char * buffer, streamsize size );
read ( char * buffer, streamsize size );
But the char pointer in this case should be considered as a "byte *" since the read/write functions are just moving bytes of data from/to disk or memory without any regard for character encodings.
C++ read/write routines don't know, or care what the format or encoding is for the bytes they are moving. So it is left up to the programmer to write code to process or handle these bytes according to the pre-defined format for the file. However, the routines written to process a specific format of binary file can be compiled into a library that can then be shared or sold, and used by many C++ programmers. For example, LibXL can be used to read the binary format of Excel files from a C++ program.
From the perspective of C/C++, the only difference between text and binary files is how line endings are handled.
If you open a file in binary mode, then read reads exactly the bytes in the file, and write writes exactly the bytes which are in memory.
If you open a file in text mode, then whatever character or character sequence is conventionally used to represent the end of a line in a file is transformed into some single character (which is written in the source code as '\n', although it is only one character) when the file is read, and the \n is transformed into the conventional end-of-line character or sequence when the file is written to. Also, it is not technically legal for the file to not end with an end-of-line sequence, and there may be a limit to the length of a line.
In Unix, the two modes are identical, because \n is a representation of the character code 10 (0A in hex), and that is precisely the conventional line-ending character. In Windows, by contrast, the conventional line-ending sequence is two bytes long -- {13,10} or {0D,0A}. \n is still 0A, so effectively the 0D preceding the 0A is deleted from the data read from the file, and an 0D is inserted before every 0A when data is written to the file.
Some (much) older operating systems had no conventional line-ending character. Instead, all lines were padded with space characters to the exact same length, making it possible to directly seek to a specific line number. C libraries working in text mode would typically read exactly the line length, and then delete the trailing spaces (if any) and finally add the code corresponding to \n (some such systems used EBCDIC instead of ASCII, so \n was a different integer value). Writing the data out, the \n would be deleted and replaced with exactly the correct number of spaces to bring the line to the standard length. Fortunately, those of us who don't work in a computing museum don't have to deal with that stuff any more, and Apple abandoned its use of 0D as the line-end character with the advent of OSX, so the text/binary difference is now limited to Windows.
Technically, text files are binary, as all files are really binary files. Text files tend to store only text characters, while binary files store any conceivable value: numbers, images, text, etc. Numbers, for example, are not stored in decimal notation like "1234"; they are stored in binary, using 0s and 1s only. There are a few ways to do this (depending on the byte order and data sizes your platform uses), so the same number could look like a different set of 0s and 1s, e.g. 0001110101011. If you open a binary file in Notepad, it tries to display everything as text, and what you see is garbage, which is the other data represented in binary.
Cracking a binary file format means working out exactly what information is stored in each byte of the file: sometimes text, numbers, arrays, classes, structures... anything, really. Given experience, one could slowly work out what is what, but that's pretty advanced stuff!
Sometimes the information (format) is freely available and easy to follow, or a nightmare to follow, like the format for an MS Word document. (The MS Word format is freely available, but reputed to be insanely complicated due to backwards compatibility. Nonetheless, having the format documentation allows you to 'crack' the binary file format and know exactly what all the binary represents.)
It's one of the fundamentals of a computer system.
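The difference between the text form and the binary form of a number can be seen with a short sketch like the one below. The helper names are made up, and the binary form simply copies the value's bytes as they sit in memory, so their order depends on the machine.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Text form: one character per decimal digit.
std::string as_text(std::uint32_t value) {
    return std::to_string(value);
}

// Binary form: a copy of the value's bytes as they sit in memory.
std::string as_raw_bytes(std::uint32_t value) {
    std::string bytes(sizeof value, '\0');
    std::memcpy(&bytes[0], &value, sizeof value);
    return bytes;
}
```

For 1234567890 the text form takes 10 characters while the binary form always takes 4 bytes. For the value 0x41424344 the raw bytes read as "DCBA" on a little-endian machine and "ABCD" on a big-endian one, which is one reason the same number can look different in two binary files.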
Probably a great explanation in this link
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BitOp/asciiBin.html
Some text quoted:
Although ASCII files are binary files, some people treat them as
different kinds of files. I like to think of ASCII files as special
kinds of binary files. They're binary files where each byte is written
in ASCII code.
A full, general binary file has no such restrictions. Any of the 256
bit patterns can be used in any byte of a binary file.
We work with binary files all the time. Executables, object files,
image files, sound files, and many file formats are binary files. What
makes them binary is merely the fact that each byte of a binary file
can be one of 256 bit patterns. They're not restricted to the ASCII
codes.

Formatted and unformatted input and output and streams

I have been reading a few articles on some sites about formatted and unformatted I/O, but now I am even more confused.
I know this is a very basic question, but I would appreciate it if anyone could give a link [to some site or a previously answered question on Stack Overflow] which explains the idea of streams in C and C++.
Also, I would like to know about formatted and unformatted I/O.
The standard doesn't define what these terms mean, it just says which of the functions defined in the standard are formatted IO and which are not. It places some requirements on the implementation of these functions.
Formatted IO is simply the IO done using the << and >> operators. They are meant to be used with text representation of the data, they involve some parsing, analyzing and conversion of the data being read or written. Formatted input skips whitespace:
Each formatted input function begins execution by constructing an object of class sentry with the noskipws (second) argument false.
Unformatted IO reads and writes the data just as a sequence of 'characters' (possibly applying the codecvt of the imbued locale). It's meant to read and write binary data, or to function as a lower-level layer used by the formatted IO implementation. Unformatted input doesn't skip whitespace:
Each unformatted input function begins execution by constructing an object of class sentry with the default argument noskipws (second) argument true.
And allows you to retrieve the number of characters read by the last input operation using gcount():
Returns: The number of characters extracted by the last unformatted input member function called for the object.
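The whitespace difference is easy to observe. In this small sketch (function names invented for the example), the same input is read once with operator>> and once with get():

```cpp
#include <sstream>
#include <string>

// Formatted extraction: operator>> skips leading whitespace first.
char first_char_formatted(const std::string& text) {
    std::istringstream in(text);
    char c = '\0';
    in >> c;
    return c;
}

// Unformatted extraction: get() returns the very next character.
char first_char_unformatted(const std::string& text) {
    std::istringstream in(text);
    char c = '\0';
    in.get(c);
    return c;
}
```

For the input "  x", the formatted version yields 'x' and the unformatted version yields the first space.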
Formatted IO means that your output is determined by a "format string", that means you provide a string with certain placeholders, and you additionally give arguments that should be used to fill these placeholders:
const char *daughter_name = "Lisa";
int daughter_age = 5;
printf("My daughter %s is %d years old\n", daughter_name, daughter_age);
The placeholders in the example are %s, indicating that this shall be substituted using a string, and %d, indicating that this is to be replaced by a signed integer number. There are a lot more options that give you control over how the final string will present itself. It's a convenience for you as the programmer, because it relieves you from the burden of converting the different data types into a string and it additionally relieves you from string appending operations via strcat or anything similar.
Unformatted IO on the other hand means you simply write character or byte sequences to a stream, not using any format string while you are doing so.
Which brings us to your question about streams. The general concept behind "streaming" is that you don't have to load a file, or whatever your input is, as a whole. For small data loading everything does work, but imagine you need to process terabytes of data: there is no way this will fit into a single byte array without your machine running out of memory. That's why streaming allows you to process data in smaller chunks, one after the other, so that at any given time you only have to deal with a fixed-size amount of data. You read the data into a helper variable over and over again and process it, until your underlying stream tells you that you are done and there is no more data left.
The same works on the output side, you write your output step for step, chunk for chunk, rather than writing the whole thing at once.
This concept brings other nice features, too. Because you can nest streams within streams within streams, you can build a whole chain of transformations, where each stream may modify the data until you finally receive the end result, not knowing about the single transformations, because you treat your stream as if there were just one.
This can be very useful, for example C or C++ streams buffer the data that they read natively from e.g. a file to avoid unnecessary calls and to read the data in optimized chunks, so that the overall performance will be much better than if you would read directly from the file system.
Unformatted Input/Output is the most basic form of input/output. Unformatted input/output transfers the internal binary representation of the data directly between memory and the file. Formatted output converts the internal binary representation of the data to ASCII characters which are written to the output file. Formatted input reads characters from the input file and converts them to internal form. Formatted