How does feof() actually know when the end of file is reached? - c++

I'm a beginner in C++ and trying to better understand feof(). I've read that the feof() flag is set to true only after an attempt to read past the end of a file, so beginners who write something like while(!feof(file)) often read once more than they expected. What I'm trying to understand, though, is how it actually determines that an attempt has been made to read past the end of the file. Is the entire file already read in and the number of characters already known, or is there some other mechanism at work?
I realize this may be a duplicate question somewhere, but I've been unable to find it, probably because I don't know the best way to word what I'm asking. If there is an answer already out there a link would be much appreciated. Thanks.

Whatever else the C++ library does, eventually it has to read from the file. Somewhere in the operating system, there is a piece of code that eventually handles that read. It obtains from the filesystem the length of the file, stored the same way the filesystem stores everything else. Knowing the length of the file, the position of the read, and the number of bytes to be read, it can make the determination that the low-level read hits the end of the file.
When that determination is made, it is passed up the stack. Eventually, it gets to the standard library which records internally that the end of file has been reached. When a read request into the library tries to go past that recorded end, the EOF flag is set and feof will start returning true.

feof() is part of the standard C library's buffered I/O. Since it's buffered, fread() pre-reads some data (definitely not the whole file, though). If, while buffering, fread() detects EOF (the underlying OS routine reports that no more data is available; POSIX read(), for example, returns 0), it sets a flag on the FILE structure. feof() simply checks for that flag. So feof() returning true essentially means “a previous read attempt encountered end of file”.
How EOF is detected is OS/FS-specific and has nothing to do whatsoever with the C library/language. The OS has some interface to read data from files. The C library is just a bridge between the OS and the program, so you don't have to change your program if you move to another OS. The OS knows how the files are stored in its filesystem, so it knows how to detect EOF. My guess is that typically it is performed by comparing the current position to the length of the file, but it may not be that easy and may involve a lot of low-level details (for example, what if the file is on a network drive?).
An interesting question is what happens when the stream is at the end, but it was not yet detected by any reads. For example, if you open an empty file. Does the first call to feof() before any fread() return true or false? The answer is probably false. The docs aren't terribly clear on this subject:
This indicator is generally set by a previous operation on the stream
that attempted to read at or past the end-of-file.
It sounds as if a particular implementation may choose some other unusual ways to set this flag.
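The empty-file case can be checked directly. Here is a minimal sketch, assuming std::tmpfile() succeeds in giving us an empty temporary file; EofProbe and probe_empty_stream are hypothetical names made up for this illustration. Since opening a stream clears its end-of-file indicator, before_read should come back false, and after_read should be true once fgetc() has failed.

```cpp
#include <cstdio>

// Result of probing a freshly created, empty stream.
struct EofProbe {
    bool before_read;  // feof() before any read was attempted
    bool after_read;   // feof() after one fgetc() that returned EOF
};

// Hypothetical helper, not part of any library.
EofProbe probe_empty_stream() {
    EofProbe result{false, false};
    std::FILE* f = std::tmpfile();  // assumption: an empty temporary file can be created
    if (f == nullptr) {
        return result;  // could not create the file; nothing to report
    }
    result.before_read = std::feof(f) != 0;
    const int c = std::fgetc(f);  // the first read attempt runs into end of file
    result.after_read = (c == EOF) && (std::feof(f) != 0);
    std::fclose(f);
    return result;
}
```

So the flag records that a read already ran into the end; it does not predict that the next one will.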

Most file systems maintain meta information about the file (including its size), and an attempt to read past the end of the file results in the feof flag being set. Others, for instance old or lightweight file systems, set feof when they come to the last byte of the last block in the chain.

How does feof() actually know when the end of file is reached?
When code attempts to read past the last character.
Depending on the file type, the last character is not necessarily known until an attempt to read past it occurs and no character is available.
Sample code demonstrating feof() going from 0 to 1
#include <stdio.h>
void ftest(int n) {
    FILE *ostream = fopen("tmp.txt", "w");
    if (ostream) {
        while (n--) {
            fputc('x', ostream);
        }
        fclose(ostream);
    }
    FILE *istream = fopen("tmp.txt", "r");
    if (istream) {
        char buf[10];
        printf("feof() %d\n", feof(istream));
        printf("fread %zu\n", fread(buf, 1, 10, istream));
        printf("feof() %d\n", feof(istream));
        printf("fread %zu\n", fread(buf, 1, 10, istream));
        printf("feof() %d\n", feof(istream));
        puts("");
        fclose(istream);
    }
}
int main(void) {
    ftest(9);
    ftest(10);
    return 0;
}
Output
feof() 0
fread 9 // 10 character read attempted, 9 were read
feof() 1 // eof is set as previous read attempted to read past the 9th or last char
fread 0
feof() 1
feof() 0
fread 10 // 10 character read attempted, 10 were read
feof() 0 // eof is still clear as no attempt to read past the 10th, last char
fread 0
feof() 1

The feof() function does not set the end-of-file indicator itself; it only reports it. The indicator is set by a read function such as fgets when it tries to read and finds no more data, and there is no EOF character stored in the file. So when fgets reads the last item, the end of the file has not been hit yet. Since no EOF indicator is set and feof() returns zero, the flow enters the while loop again. This time fgets finds that there is nothing left to read, so it returns NULL and also sets the EOF indicator. So feof() detects the end-of-file indicator and returns a non-zero value, therefore breaking the while loop.
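The extra pass through the loop goes away if the read call itself is the loop condition. Here is a minimal sketch; count_lines is a hypothetical helper, and it assumes every line fits in the 256-byte buffer so that one successful fgets() call corresponds to one line.

```cpp
#include <cstdio>

// Hypothetical helper: counts lines by letting fgets() decide when to stop.
// Assumes each line is shorter than the buffer.
// Returns -1 if the loop ended because of a read error instead of end of file.
long count_lines(std::FILE* f) {
    char buf[256];
    long count = 0;
    while (std::fgets(buf, sizeof buf, f) != nullptr) {
        ++count;  // only reached when a line was actually read
    }
    if (std::ferror(f)) {
        return -1;  // stopped because of an I/O error
    }
    return count;  // stopped at end of file; feof(f) is now non-zero
}
```

feof() and ferror() are consulted only after fgets() has failed, to find out why it failed.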

Related

C++ istream::peek - shouldn't it be nonblocking?

It seems well accepted that the istream::peek operation is blocking.
The standard, though arguably a bit ambiguous, leans towards nonblocking behavior. peek calls sgetc in turn, whose behavior is:
"The character at the current position of the controlled input sequence, as a value of type int.
If there are no more characters to read from the controlled input sequence, the function returns the end-of-file value (EOF)."
It doesn't say "If there are no more characters.......wait until there are"
Am I missing something here? Or are the peek implementations we use just kinda wrong?
The controlled input sequence is the file (or whatever) from which you're reading. So if you're at end of file, it returns EOF. Otherwise it returns the next character from the file.
I see nothing here that's ambiguous at all--if it needs a character that hasn't been read from the file, then it needs to read it (and wait till it's read, and return it).
If you're reading from something like a socket, then it's going to wait until data arrives (or the network stack detects EOF, such as the peer disconnecting).
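The two outcomes (next character, or EOF) are easy to see on a string stream, where no waiting is ever involved. This is only a sketch of that part of the behaviour; PeekResult and peek_once are hypothetical names, and a std::istringstream cannot show the blocking case discussed above.

```cpp
#include <sstream>
#include <string>

// What a single peek() on a fresh stream reported.
struct PeekResult {
    bool got_char;   // peek() returned a real character
    char value;      // that character, or '\0' if there was none
    bool eof_after;  // eofbit state after the call
};

// Hypothetical helper: builds a stream over `data` and peeks once.
PeekResult peek_once(const std::string& data) {
    std::istringstream in(data);
    const std::char_traits<char>::int_type c = in.peek();
    PeekResult r{};
    r.got_char = (c != std::char_traits<char>::eof());
    r.value = r.got_char ? std::char_traits<char>::to_char_type(c) : '\0';
    r.eof_after = in.eof();
    return r;
}
```

With data "x" the call reports the character and leaves eofbit clear; with an empty string it reports no character and eofbit is set.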
The description from cppreference.com might be clearer than the one in your question:
"Ensures that at least one character is available in the input area by [...] reading more data in from the input sequence (if applicable)."
"if applicable" does apply in this case; and "reading data from the input sequence" entails waiting for more data if there is none and the stream is not in an EOF or other error state.
When I get confused about console input I remind myself that console input can be redirected to come from a file, so the behavior of the keyboard more or less mimics the behavior of a file. When you try to read a character from file, you can get one of two results: you get a character, or you get EOF because you've reached the end of the file -- there are no more characters to be read. Same thing for keyboard input: either you get a character, or you get EOF because you've reached the end of the file. With a file, there is no notion of waiting for more characters: either a file has unread characters or it doesn't. Same thing for the keyboard. So if you haven't reached EOF on the keyboard, reading a character returns the next character. You reach EOF on the keyboard by typing whatever key combination your system recognizes as EOF; on Unix systems that's ctrl-D, on Windows that's ctrl-Z. If you haven't reached EOF, there are more characters to be read.

Is the inconsistency of C++'s istream::eof() a bug in the spec or a bug in the implementation?

The following program demonstrates an inconsistency in the way that std::istream (specifically in my test code, std::istringstream) sets eof().
#include <sstream>
#include <cassert>
int main(int argc, const char * argv[])
{
    // EXHIBIT A:
    {
        // An empty stream doesn't recognize that it's empty...
        std::istringstream stream( "" );
        assert( !stream.eof() ); // (Not yet EOF. Maybe should be.)
        // ...until I read from it:
        const int c = stream.get();
        assert( c < 0 ); // (We received garbage.)
        assert( stream.eof() ); // (Now we're EOF.)
    }
    // THE MORAL: EOF only happens when actually attempting to read PAST the end of the stream.
    // EXHIBIT B:
    {
        // A stream that still has data beyond the current read position...
        std::istringstream stream( "c" );
        assert( !stream.eof() ); // (Clearly not yet EOF.)
        // ... clearly isn't eof(). But when I read the last character...
        const int c = stream.get();
        assert( c == 'c' ); // (We received something legit.)
        assert( !stream.eof() ); // (But we're already EOF?! THIS ASSERT FAILS.)
    }
    // THE MORAL: EOF happens when reading the character BEFORE the end of the stream.
    // Conclusion: MADNESS.
    return 0;
}
So, eof() "fires" when you read the character before the actual end-of-file. But if the stream is empty, it only fires when you actually attempt to read a character. Does eof() mean "you just tried to read off the end?" or "If you try to read again, you'll go off the end?" The answer is inconsistent.
Moreover, whether the assert fires or not depends on the compiler. Apple Clang 4.1, for example, fires the assertion (raises eof() when reading the preceding character). GCC 4.7.2, for example, does not.
This inconsistency makes it hard to write sensible loops that read through a stream but handle both empty and non-empty streams well.
OPTION 1:
while( stream && !stream.eof() )
{
    const int c = stream.get(); // BUG: Wrong if stream was empty before the loop.
    // ...
}
OPTION 2:
while( stream )
{
    const int c = stream.get();
    if( stream.eof() )
    {
        // BUG: Wrong when c in fact got the last character of the stream.
        break;
    }
    // ...
}
So, friends, how do I write a loop that parses through a stream, dealing with each character in turn, handles every character, but stops without fuss either when we hit the EOF, or in the case when the stream is empty to begin with, never starts?
And okay, the deeper question: I have the intuition that using peek() could maybe workaround this eof() inconsistency somehow, but...holy crap! Why the inconsistency?
The eof() flag is only useful to determine if you hit end of file after some operation. The primary use is to avoid an error message if reading reasonably failed because there wasn't anything more to read. Trying to control a loop or something using eof() is bound to fail. In all cases you need to check after you tried to read if the read was successful. Before the attempt the stream can't know what you are going to read.
The semantics of eof() are defined thoroughly as "this flag gets set when reading the stream caused the stream buffer to return a failure". It isn't quite as easy to find this statement, if I recall correctly, but this is what it comes down to. At some point the standard also says that the stream is allowed to read more than it has to in some situations, which may cause eof() to be set when you don't necessarily expect it. One such example is reading a character: the stream may end up detecting that there is nothing following that character and set eof().
If you want to handle an empty stream, it's trivial: look at something from the stream and proceed only if you know it's not empty:
if (stream.peek() != std::char_traits<char>::eof()) {
    do_what_needs_to_be_done_for_a_non_empty_stream();
}
else {
    do_something_else();
}
Never, ever check for eof alone.
The eof flag (which is the same as the eofbit bit flag in a value returned by rdstate()) is set when end-of-file is reached during an extract operation. If there were no extract operations, eofbit is never set, which is why your first check returns false.
However, eofbit is no indication as to whether the operation was successful. For that, check failbit|badbit in rdstate(). failbit means "there was a logical error", and badbit means "there was an I/O error". Conveniently, there's a fail() function that returns exactly rdstate() & (failbit|badbit). Even more conveniently, there's an operator bool() function that returns !fail(). So you can do things like while(stream.read(buffer, size)){ ....
If the operation has failed, you may check eofbit, badbit and failbit separately to figure out why it has failed.
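For the character-by-character loop asked about in the question, the same idea can be sketched as follows: let the read be the loop condition and rely on the stream's conversion to bool. read_all_chars is a hypothetical helper written for this illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: collects every character of the stream, in order.
std::string read_all_chars(std::istream& in) {
    std::string out;
    char c;
    while (in.get(c)) {  // converts to true only if a character was extracted
        out.push_back(c);
    }
    // The loop has ended because an extraction failed; eofbit and badbit
    // can now be inspected to find out why.
    return out;
}
```

Called on std::istringstream(""), the loop body never runs; called on std::istringstream("c"), it runs exactly once, regardless of when the implementation happens to set eofbit.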
What compiler / standard c++ library are you using? I tried it with gcc 4.6.3/4.7.2 and clang 3.1, and all of them worked just fine (i.e. the assertion does not fire).
I think you should report this as a bug in your tool-chain, since my reading of the standard accords with your intuition that eof() should not be set as long as get() is able to return a character.
It's not a bug, in the sense that it's the intended behavior. It is
not the intent that you test eof() until after input has
failed. Its main purpose is for use inside extraction functions, where
in early implementations, the fact that std::streambuf::sgetc()
returned EOF didn't mean that it would the next time it was called:
the intent was that anytime sgetc() returned EOF (now
std::char_traits<>::eof()), this would be memorized, and the stream
would make no further calls to the streambuf.
Practically speaking: we really need two eof(): one for internal use,
as above, and another which will reliably state that failure was due to
having reached end of file. As it is, given something like:
std::istringstream s( "1.23e+" );
s >> aDouble;
There's no way of detecting that the error is due to a format error,
rather than the stream not having any more data. In this case, the
internal eof should return true (because we have seen end of file, when
looking ahead, and we want to suppress all further calls to the
streambuf extractor functions), but the external one should be false,
because there was data present (even after skipping initial whitespace).
If you're not implementing an extractor function, of course, you should
never test std::basic_ios::eof() until you've actually had an input failure.
It was never the intent that this would provide any useful information
(which makes one wonder why they defined std::basic_ios::good(): the
fact that it returns false if eof() is set means that it can provide no
reliable information until fail() returns true, at which point, we
know that it will return false, so there's no point in calling it).
And I'm not sure what your problem is. Because the stream cannot know
in advance what your next input will be (e.g. whether it will skip
whitespace or not), it cannot know in advance whether your next input
will fail because of end of file or not. The idiom adopted is clear:
try the input, then test whether it succeeded or not. There is no
other way, because no other alternative can be implemented. Pascal did
it differently, but a file in Pascal was typed—you could only read
one type from it, so it could always read ahead one element under the
hood, and return end of file if this read ahead failed. Not having
a predictive end of file is the price we pay for being able to read more
than one type from a file.
The behavior is somewhat subtle. eofbit is set when an attempt is made to read past the end of the file, but that may or may not cause failure of the current extraction operation.
For example:
ifstream blah;
// assume the file got opened
int i, j;
blah >> i;
if (!blah.eof())
    blah >> j;
If the file contains 142<EOF>, then the sequence of digits is terminated by end of file, so eofbit is set AND the extraction succeeds. Extraction of j won't be attempted, because the end of file has already been encountered.
If the file contains 142 <EOF>, then the sequence of digits is terminated by whitespace (extraction of i succeeds). eofbit is not set yet, so blah >> j will be executed, and it will reach end of file without finding any digits, so it will fail.
Notice how the innocuous-looking whitespace at the end of file changed the behavior.
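The two cases can be reproduced with a string stream standing in for the file. This is a small sketch; ExtractState and extract_int are hypothetical names, and a std::istringstream is assumed to behave like the ifstream in the example as far as the state flags go.

```cpp
#include <sstream>
#include <string>

// State of the stream after one `>> int` extraction.
struct ExtractState {
    bool ok;    // the extraction itself succeeded
    bool eof;   // eofbit was set while extracting
    int value;  // the extracted value
};

// Hypothetical helper: extracts one int from `text` and reports the flags.
ExtractState extract_int(const std::string& text) {
    std::istringstream in(text);
    ExtractState s{false, false, 0};
    s.ok = static_cast<bool>(in >> s.value);
    s.eof = in.eof();
    return s;
}
```

extract_int("142") reports ok and eof together, while extract_int("142 ") reports ok with eof still clear, which is exactly the difference the trailing whitespace makes.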

Why is ::feof() different from ::_eof(::fileno())?

DISCLAIMER: Don't use ::feof() as your loop condition. For example, see the answer to: file reading: feof() for binary files
However, I have "real" code that demonstrates a problem that does not use ::feof() as my loop condition, but LOGICALLY, this is the easiest way to demonstrate the problem.
Consider the following: We iterate a character stream one-at-a-time:
FILE* my_file;
// ...open "my_file" for reading...
int c;
while(0 == ::feof(my_file))
{ // We are not at EOF
    c = ::getc(my_file);
    // ...process "c"
}
The above code works as expected: The file is processed one-char-at-a-time, and upon EOF, we drop out.
HOWEVER, the following has unexpected behavior:
FILE* my_file;
// ...open "my_file" for reading...
int c;
while(0 == ::_eof(::fileno(my_file)))
{ // We are not at EOF
    c = ::getc(my_file);
    // ...process "c"
}
I would have expected them to perform the same. ::fileno() properly returns the (integer) file descriptor every time. However, the test ::_eof(::fileno(my_file)) works exactly once, and then returns 1 (indicating an EOF) on the second attempt.
I do not understand this.
I suppose it is conceivable that ::feof() is "buffered" (so it works correctly) while ::_eof() is "un-buffered" and thinks the whole file is "read-in" already (because the whole file would have fit into the first block read in from disk). However, that can't possibly be true given the purpose of those functions. So, I'm really at a loss.
What's going on?
(Files are opened as "text", are ASCII text files with about a dozen lines, MSVS2008, Win7/64.)
I suppose it is conceivable that ::feof() is "buffered" (so it works
correctly) while ::_eof() is "un-buffered" and thinks the whole file
is "read-in" already (because the whole file would have fit into the
first block read in from disk). However, that can't possibly be true
given the purpose of those functions. So, I'm really at a loss.
I don't know why you would think it "can't possibly be true given the purpose of those functions." The 2 functions are meant to operate on files that are opened and operated on in different ways, so they are not compatible.
In fact, that is exactly what is happening. Try this:
FILE* my_file;
// ...open "my_file" for reading...
int c;
while(0 == ::_eof(::fileno(my_file)))
{ // We are not at EOF
    c = ::getc(my_file);
    long offset1 = ftell(my_file);
    long offset2 = _tell(fileno(my_file));
    if (offset1 != offset2)
    {
        //here you will see that the file pointers are different
        //which means that _eof and feof will fire true under different conditions
    }
    // ...process "c"
}
I will try to elaborate a bit based on your comment.
When you call fopen, you are getting back a pointer to a file stream. The underlying stream object keeps its own file pointer which is separate from the actual file pointer associated with the underlying file descriptor.
When you call _eof you are asking if you have reached the end of the actual file. When you call feof, you are asking if you have reached the end of the file stream. Since file streams are usually buffered, the end of the file is reached before the end of the stream.
I'm still trying to understand your answer below, and what the purpose
is for _eof() if it always returns 1 even when you didn't read
anything (after the first char).
To answer this question, the purpose of _eof is to determine if you have reached the end of the file when using _open and _read to work directly with file descriptors, not when you use fopen and fread or getc to work with file streams.
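The same split between stream position and descriptor position can be observed on POSIX systems. The sketch below is an analogue of the MSVC snippet above, not a translation of it: it assumes a POSIX platform (lseek() and fileno() in place of _tell()), and Offsets and offsets_after_one_getc are hypothetical names.

```cpp
#include <cstdio>
#include <unistd.h>  // POSIX: lseek(); fileno() comes from <stdio.h> on POSIX systems

// Positions reported by the two layers after reading one character.
struct Offsets {
    long stream_pos;  // position according to the buffered FILE stream
    long fd_pos;      // position of the underlying file descriptor
};

// Hypothetical helper: reads one character through the stream, then asks
// both the stream and the descriptor where they think they are.
Offsets offsets_after_one_getc(std::FILE* f) {
    Offsets o{-1, -1};
    if (std::fgetc(f) == EOF) {
        return o;  // nothing could be read
    }
    o.stream_pos = std::ftell(f);
    o.fd_pos = static_cast<long>(lseek(fileno(f), 0, SEEK_CUR));
    return o;
}
```

For a small file, the library typically fills its buffer with the whole file on the first getc, so the descriptor is already at the end while the stream has consumed a single character. That would explain why a descriptor-level test reports end of file right after the first read.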

file reading: feof() for binary files

I am reading a binary file, and when it reaches the end, it seems it is not terminated by the feof() function. Is it because there is no EOF character for binary files? If so, how can I solve it?
Currently my code is using a while loop:
while (!feof(f))
When it reaches the end of the file at position 5526900, it doesn't stop. It just keeps trying to read, and I am stuck in the loop.
Can anyone tell me why and how to solve it?
Thanks
You should not use feof() to loop on - instead, use the return value of fread() - loop until it returns zero. This is easy to see if you consider reading an empty file - feof() returns the EOF status AFTER a read operation, so it will always try to read bogus data if used as a loop control.
I don't know why so many people think feof() (and the eof() member of C++ streams) can predict if the next read operation will succeed, but believe me, they can't.
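A loop driven by the return value of fread() might look like the sketch below. count_bytes is a hypothetical helper; the 4096-byte buffer size is an arbitrary choice, and the comment marks where real processing of each chunk would go.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: reads the stream in chunks until fread() returns zero.
// Returns the number of bytes read; *had_error tells whether the loop stopped
// because of an I/O error rather than end of file.
std::size_t count_bytes(std::FILE* f, bool* had_error) {
    unsigned char buf[4096];
    std::size_t total = 0;
    std::size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0) {
        total += n;  // process buf[0] .. buf[n - 1] here
    }
    *had_error = std::ferror(f) != 0;
    return total;
}
```

Binary content makes no difference here: zero bytes and any other byte values are just data, and an empty file simply makes the first fread() return zero.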

Can we write an EOF character ourselves?

Most languages, like C++, put an EOF character into a file when writing to it, even if we forget to write statements like:
filestream.close();
However, is there any way we can put the EOF character where we need it ourselves, in C++, for instance?
Or is there any other method we may use apart from the functions provided in C++?
If you need more information, then kindly leave a comment.
EDIT:
What if we want to trick the OS by placing an EOF character in a file and writing some data after the EOF, so that an application like notepad.exe is not able to read past our EOF character?
I have read answers to questions related to this topic and have come to know that nowadays OSes generally don't look for an EOF character but rather check the length of the file. But there must be a procedure in the OS that checks the length of the file and then updates the file records.
I am sorry if I am wrong at any point in my reasoning, but please do help me, because it can lead to a lot of new ideas.
There is no EOF character. EOF by definition "is unequal to any valid character code". Often it is -1. It is not written into the file at any point.
There is a historical EOF character value (CTRL+Z) in DOS, but it is obsolete these days.
To answer the follow-up question of Apoorv: The OS never uses the file data to determine file length (files are not 'null terminated' in any way). So you cannot trick the OS. Perhaps old, stupid programs won't read after CTRL+Z character. I wouldn't assume that any Windows application (even Notepad) would do that. My guess is that it would be easier to trick them with a null (\0) character.
Well, EOF is just a value defined in the C stdio.h header file and returned by its reading functions. The end-of-file condition itself is reported to those functions by the OS, so it's system dependent. When the OS reaches the end of the file, it signals this to the function, which then returns EOF, most commonly (-1), but not always. So, to summarize, EOF is not a character, but a constant returned when the OS reports the end of the file.
EDIT: Well, you need to know more about filesystem, look at this.
Hi, to your second question:
Once again, you should look more closely into filesystems. FAT is a very nice example because you can find many articles about it, and its principles are very similar to NTFS. Anyway, once again, EOF is NOT a character. You cannot place it in a file directly. If you could, imagine the consequences: even a "dumb" image file could not be read by the system.
Why? Because the OS works like a very complex structure of layers. One of the layers is the filesystem driver. It transfers data to and from every filesystem known to the driver, and it provides a bridge between applications and the actual system of storing files on the HDD.
To be exact, the FAT filesystem uses the so-called FAT (file allocation table) - a table located close to the start of the HDD (or partition) address space that contains a map of all clusters (little storage cells). So when you want to save a file to the HDD, the OS (filesystem driver) looks into the FAT and searches for the value "0x0". This "0x0" value tells the OS that the cluster whose address corresponds to the location of that value in the FAT is free to write.
So it writes the first part of the file into it. Then it looks for another "0x0" value in the FAT, and if one is found, it writes the second part of the file into the cluster it points to. Then it changes the value of the FAT record for the first part of the file to the address of the next part (in our case, the second part) of the file.
When your file is all stored on the HDD, there comes the final part: it writes the desired EOF value, but into the FAT, not into the "data part" of the HDD. So when the file is read next time, it knows this is the end and doesn't look any further.
So now you see: if you wanted to manually write an EOF value into a place where it doesn't belong, you would have to write your own driver that can rewrite the FAT record, and this is practically impossible for beginners.
I came here while going through the Kernighan & Ritchie C exercises.
Ctrl+D signals end of input from the terminal, which makes getchar() return the EOF constant from stdio.h.
(Edit: this is on Mac OS X; thanks to @markmnl for pointing out that the Windows 10 equivalent is Ctrl+Z)
Actually in C++ there is no physical EOF character written to a file using either the fprintf() or ostream mechanisms. EOF is an I/O condition to indicate no more data to read.
Some early disk operating systems like CP/M actually did use a physical 0x1A (ASCII SUB character) to indicate EOF because the file system only maintained file size in blocks so you never knew exactly how long a file was in bytes. With the advent of storing actual length counts in the directory it is no longer typical to store an "EOF" character as part of the 'in-band' file data.
Under Windows, if you encounter an ASCII 26 (EOF) in stdin, it will stop reading the rest of the data. I believe writing this character will also terminate output sent to stdout, but I haven't confirmed this. You can switch the stream to binary mode as in this SO question:
#include <io.h>
#include <fcntl.h>
...
_setmode(0, _O_BINARY);
And not only will you stop 0x0A being converted to 0x0D 0x0A, but you'll also gain the ability to read/write 0x1A as well. Note you may have to switch both stdin (0) and stdout (1).
If by the EOF character you mean something like Control-Z, then modern operating systems don't need such a thing, and the C++ runtime will not write one for you. You can of course write one yourself:
filestream.put( 26 ); // write Ctrl-Z
but there is no good reason to do so. There is also no need to do:
filestream.close();
as the file stream will be closed for you automatically when its destructor is called, but it is (I think) good practice to do so.
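To see that a byte with value 26 is ordinary data as far as the file itself is concerned, here is a small sketch using C stdio from C++. It relies on std::tmpfile(), which opens the file in binary update mode, so no text-mode translation is involved; bytes_read_back_past_ctrl_z is a hypothetical name.

```cpp
#include <cstdio>

// Hypothetical helper: writes "ab", a byte with value 26 (Ctrl-Z), then "cd"
// to a temporary binary file and counts how many bytes can be read back.
// Returns -1 if the temporary file could not be created.
int bytes_read_back_past_ctrl_z() {
    std::FILE* f = std::tmpfile();  // binary update mode ("wb+")
    if (f == nullptr) {
        return -1;
    }
    std::fputs("ab", f);
    std::fputc(26, f);  // the "EOF character" of old DOS text files
    std::fputs("cd", f);
    std::rewind(f);
    int count = 0;
    while (std::fgetc(f) != EOF) {
        ++count;  // every stored byte is returned, including the 26
    }
    std::fclose(f);
    return count;
}
```

All five bytes come back; the read stops only when the file's recorded length is exhausted.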
There is no such thing as the "EOF" character. The fact of closing the stream in itself is the "EOF" condition.
When you press Ctrl+D in a unix shell, that simply closes the standard input stream, which in turn is recognized by the shell as "EOF" and it exits.
So, to "send" an "EOF", just close the stream to which the "EOF" needs to be sent.
Nobody has yet mentioned the [f]truncate system calls, which are how you make a file shorter without recreating it from scratch.
The truncate() and ftruncate() functions cause the regular file named by path or referenced by fd to be truncated to a size of precisely length bytes.
If the file previously was larger than this size, the extra data is lost. If the file previously was shorter, it is extended, and the extended part reads as null bytes ('\0').
Understand that this is a distinct operation from writing any sort of data to the file. The file is a linear array of bytes, laid out on disk somehow, with metadata that says how long it is; truncate changes the metadata.
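A minimal sketch of that metadata change, assuming a POSIX system where ftruncate() and fileno() are available; size_after_truncate is a hypothetical helper that works on a temporary file so nothing real is modified.

```cpp
#include <cstdio>
#include <unistd.h>  // POSIX: ftruncate(); fileno() comes from <stdio.h> on POSIX systems

// Hypothetical helper: writes ten bytes to a temporary file, sets its length
// to `new_length` with ftruncate(), and returns the resulting size (-1 on failure).
long size_after_truncate(long new_length) {
    std::FILE* f = std::tmpfile();
    if (f == nullptr) {
        return -1;
    }
    std::fputs("0123456789", f);
    std::fflush(f);  // make sure the data has reached the file before resizing
    long size = -1;
    if (ftruncate(fileno(f), new_length) == 0 &&
        std::fseek(f, 0, SEEK_END) == 0) {
        size = std::ftell(f);  // the end of the file is wherever the metadata says
    }
    std::fclose(f);
    return size;
}
```

size_after_truncate(4) shortens the ten-byte file to four bytes, and size_after_truncate(20) extends it to twenty; in neither case is any marker byte written.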
On modern filesystems EOF is not a character, so you don't have to write it when you finish writing to a file. You just have to close the file or let the OS do it for you when your process terminates.
Yes, you can manually add a Ctrl-D character (^D, ASCII 4) to a file, although it is stored as an ordinary byte.
1) In the Mac terminal, create a new file: touch filename.txt
2) Open the file in vi:
vi filename.txt
3) In Insert mode (hit i), type Control+V and then Control+D. Do not let go of the Control key on the Mac.
Alternatively, if I want other ^NewLetters, like ^N^M^O^P, etc., I could do Control+V and then Control+NewLetter. So for example, to do ^O, hold down Control, and then type V and O, then let go of Control.