Difference between opening a file in binary vs text [duplicate] - c++

I've done some stuff like:
FILE* a = fopen("a.txt", "w");
const char* data = "abc123";
fwrite(data, 6, 1, a);
fclose(a);
and then in the generated text file, it says "abc123" just like expected. But then I do:
//this time it is "wb" not just "w"
FILE* a = fopen("a.txt", "wb");
const char* data = "abc123";
fwrite(data, 6, 1, a);
fclose(a);
and get the exact same result. If I read the file using binary or normal mode, it also gives me the same result. So my question is: what is the difference between calling fopen with and without binary mode?
Where I read about fopen modes: http://www.cplusplus.com/reference/cstdio/fopen/

The link you gave does actually describe the differences, but it's buried at the bottom of the page:
http://www.cplusplus.com/reference/cstdio/fopen/
Text files are files containing sequences of lines of text. Depending on the environment where the application runs, some special character conversion may occur in input/output operations in text mode to adapt them to a system-specific text file format. Although on some environments no conversions occur and both text files and binary files are treated the same way, using the appropriate mode improves portability.
The conversion could normalize \r\n to \n (or vice versa), or perhaps ignore characters beyond 0x7F (à la 'text mode' in FTP). Personally I'd open everything in binary mode and use a good Unicode or other text-encoding library for dealing with text.

The most important difference to be aware of is that with a stream opened in text mode you get newline translation on non-*nix systems (it's also used for network communications, but this isn't supported by the standard library). In *nix newline is just ASCII linefeed, \n, both for internal and external representation of text. In Windows the external representation often uses a carriage return + linefeed pair, "CRLF" (ASCII codes 13 and 10), which is converted to a single \n on input, and conversely on output.
From the C99 standard (the N869 draft document), §7.19.2/2,
A text stream is an ordered sequence of characters composed into lines, each line
consisting of zero or more characters plus a terminating new-line character. Whether the
last line requires a terminating new-line character is implementation-defined. Characters
may have to be added, altered, or deleted on input and output to conform to differing
conventions for representing text in the host environment. Thus, there need not be a one-
to-one correspondence between the characters in a stream and those in the external
representation. Data read in from a text stream will necessarily compare equal to the data
that were earlier written out to that stream only if: the data consist only of printing
characters and the control characters horizontal tab and new-line; no new-line character is
immediately preceded by space characters; and the last character is a new-line character.
Whether space characters that are written out immediately before a new-line character
appear when read in is implementation-defined.
And in §7.19.3/2
Binary files are not truncated, except as defined in 7.19.5.3. Whether a write on a text
stream causes the associated file to be truncated beyond that point is implementation-
defined.
About use of fseek, in §7.19.9.2/4:
For a text stream, either offset shall be zero, or offset shall be a value returned by
an earlier successful call to the ftell function on a stream associated with the same file and whence shall be SEEK_SET.
About use of ftell, in §7.19.9.4:
The ftell function obtains the current value of the file position indicator for the stream pointed to by stream. For a binary stream, the value is the number of characters from the beginning of the file. For a text stream, its file position indicator contains unspecified information, usable by the fseek function for returning the file position indicator for the stream to its position at the time of the ftell call; the difference between two such return values is not necessarily a meaningful measure of the number of characters written or read.
I think that’s the most important, but there are some more details.

Related

Is the "NULL" or "\0"-sign, of a Null-terminated string, stored in a file?

If I want to store a null-terminated string in a file, and the file will contain only that string, is the "\0" or "NULL" character stored in the file (before the "EOF" (end-of-file) marker)?
Furthermore: does the result depend on the operating system, and therefore on the compiler, that I compile the source code with?
You might be able to write null characters to a text file, but you almost certainly don't want to.
A string (defined as "a contiguous sequence of characters terminated by and including the first null character") is an in-memory data format.
A text stream consists of a sequence of lines:
A text stream is an ordered sequence of characters composed into
lines, each line consisting of zero or more characters plus a terminating new-line character. Whether the last line requires a
terminating new-line character is implementation-defined.
A string may or may not contain a single line of text. If it represents a line of text, it may or may not include the terminating new-line '\n' character (you'll need to keep track of that yourself).
If you have a sequence of strings in memory, the usual way to write them to a text file is to write the contents of each string, not including the terminating null character, to the file, adding a new-line character if necessary. Functions like fprintf and fputs assume their arguments are strings, so they take care of omitting the '\0'.
You can write a null character to a text stream, but it's implementation-defined what will actually be written to the file. You can write a null character, or any byte value, to a binary stream -- but then you can't safely use string functions (strlen() et al, or even fgets() and fputs()) on data written to or read from the stream. (And in practice, most systems allow null characters to be written to and read from text files -- though a number of standard library functions assume that text files contain only printable characters.)
'\0' is not a printing character so if you use an io stream in text mode, then whether it will be preserved when you write it to a file through such a stream is implementation-dependent.
7.21.2p2
A text stream is an ordered sequence of characters composed into
lines, each line consisting of zero or more characters plus a
terminating new-line character. Whether the last line requires a
terminating new-line character is implementation-defined. Characters
may have to be added, altered, or deleted on input and output to
conform to differing conventions for representing text in the host
environment. Thus, there need not be a one-to-one correspondence
between the characters in a stream and those in the external
representation. Data read in from a text stream will necessarily
compare equal to the data that were earlier written out to that stream
only if: the data consist only of printing characters and the control
characters horizontal tab and new-line; no new-line character is
immediately preceded by space characters; and the last character is a
new-line character. Whether space characters that are written out
immediately before a new-line character appear when read in is
implementation-defined.
If you write the '\0' to a file through a binary stream (one opened with e.g., fopen("file","wb")), e.g., with fputc('\0',f) or fwrite("",1,1,f), you should be able to get it back.
No, the functions that write a string to a file will not include the terminating null. You can write a null to a file using a function that takes a byte count, but that rarely makes sense because there's no corresponding function that reads a null-terminated string back.

What is the behavior of writing a non-printing character in C/C++?

Is the behavior of writing a non-printing character undefined or implementation-defined, if the character is written via printf/fprintf? I am confused because the words in the C standard N1570/5.2.2 only talks about the display semantics for printing characters and alphabetic escape sequences.
In addition, what if the character is written via std::ostream (C++ only)?
The output of ASCII non-printable (control) characters is implementation defined.
Specifically, interpretation is the responsibility of the output device.
Edit 1:
When the output goes to a file, the file can be opened in binary mode. In binary mode the output is not translated (e.g. line endings are written as-is).

How do I search for end of line ('\n') in a UTF-8 text?

I have a C++ library that provides an I/O device interface (including an implementation for files). It also provides a UTF-8 string class. Now, I just need to read a line from this IODevice. The reason I'm mentioning this library is I can't, for example, open the file with std::ifstream and read it using something like std::wbuffer_convert<std::codecvt_utf8<wchar_t>>. I don't mind using stdlib (in fact, I prefer it), but I do need to read the line from my IODevice and return it as my String.
Now, the specific question: if I read the file byte by byte, is it safe to assume that any byte with value '\n' is in fact a new line symbol, and not the trailing part of some different multi-byte symbol?
Is it safe to assume that any byte with value '\n' is in fact a new line symbol, and not the trailing part of some different multi-byte symbol?
Yes. In UTF-8, ASCII byte values never occur inside the encoding of a non-ASCII code point.
Just to add to what @Yu Hao said: UTF-8 is backward compatible with ASCII and cannot break it in any way.
Here is the reason why: UTF-8 dictates that every ASCII character retains its bit representation from ASCII, so its leading bit is always 0.
Any non-ASCII character is encoded as 2 to 4 bytes, each with its leading bit set to 1: the first byte starts with as many consecutive 1 bits as there are bytes in the sequence, followed by a 0, and each remaining byte starts with 10.
This encoding pattern ensures that ASCII characters cannot be confused with any part of a non-ASCII encoded sequence.

Clarification on fsetpos, C++

I am a little confused by the function fsetpos in the stdio.h library. I want to be able to write to different indexes in a file (i.e. I do not want to write to the file contiguously). I was considering using fsetpos; however, the documentation states:
The internal file position indicator associated with stream is set to the position
represented by pos, which is a pointer to an fpos_t object whose value shall have been
previously obtained by a call to fgetpos.
It does not make sense to me that I have to set the position based on a value from fgetpos. What's the point, since it will just set it to the position it is already at? Or am I not understanding it correctly?
From the C11 standard, fseek has a similar limitation:
For a text stream, either offset shall be zero, or offset shall be a value returned by an earlier successful call to the ftell function on a stream associated with the same file and whence shall be SEEK_SET
The reason is that text streams don't have a one-to-one mapping between the actual bytes of the source and the bytes you would get from fgetc; e.g. on Windows systems, the newline character in C tends to be translated into a sequence of two bytes: carriage return, then line feed.
Consequently, the notion of arbitrarily positioning a text stream based on a numerical index is fraught with complications and surprises.
In fact, the documentation of ftell warns
For a text stream, its file position indicator contains unspecified information, usable by the fseek function for returning the file position indicator for the stream to its position at the time of the ftell call; the difference between two such return values is not necessarily a meaningful measure of the number of characters written or read.
Binary streams don't have this limitation, although
A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END
The above assumes you are working with byte-oriented streams. Wide-oriented streams have additional restrictions. e.g. under Streams:
Binary wide-oriented streams have the file-positioning restrictions ascribed to both text and binary streams
and
For wide-oriented streams, after a successful call to a file-positioning function that leaves the file position indicator prior to the end-of-file, a wide character output function can overwrite a partial multibyte character; any file contents beyond the byte(s) written are henceforth indeterminate
fsetpos does more than just set the file position: again from the C11 standard:
The fsetpos function sets the mbstate_t object (if any) and file position indicator
which makes it more suitable for setting the position in wide-oriented streams.

Why CR LF is changed to LF in Windows?

On Windows, when you read the characters \r\n from a file (or stdin) in text mode, the \r gets deleted and you only read \n.
Is there a standard that requires this?
Can I be sure that this holds for any compiler on Windows? Will other platform-specific character combinations be replaced by \n on those platforms too?
I use this code to generate the input and use this code to read it. The results are here. You may notice a few missing \r's.
Yes, this comes from compatibility with C. In C text streams, lines are terminated by a newline character. This is the internal representation of the text stream as seen by the program. The I/O library converts between the internal representation and some external one.
The internal representation is platform-independent, whereas there are different platform-specific conventions for text. That's the point of having a text mode in the stream library; portable text manipulating programs can be written which do not have to contain a pile of #ifdef directives to work on different platforms, or build their own platform-independent text abstraction.
It so happens that the internal representation for C text streams matches the native Unix representation of text files, since the C language and its library originated on Unix. For portability of C programs to other platforms, the text stream abstraction was added which makes text files on non-Unix system look like Unix text files.
In the ISO/IEC 9899:1999 standard ("C99"), we have this:
7.19.2 Streams
[...]
A text stream is an ordered sequence of characters composed into lines, each line
consisting of zero or more characters plus a terminating new-line character. Whether the
last line requires a terminating new-line character is implementation-defined. Characters
may have to be added, altered, or deleted on input and output to conform to differing
conventions for representing text in the host environment. Thus, there need not be a one-to-one correspondence between the characters in a stream and those in the external
representation.
Bold emphasis mine. C++ streams are defined in terms of C streams. There is no explanation of text versus binary mode in the C++ standard, except for a table which maps various stream mode flag combinations to strings suitable as mode arguments to fopen.