Different EOF behavior with read versus ignore

I was recently tripped up by a subtle distinction between the behavior of std::istream::read and std::istream::ignore. Basically, read extracts N bytes from the input stream and stores them in a buffer, while ignore extracts N bytes from the input stream but simply discards them rather than storing them. So my understanding was that read and ignore are the same in every way, except that read saves the extracted bytes whereas ignore just discards them.
But there is another subtle difference between read and ignore which managed to trip me up. If you read to the end of a stream, the EOF condition is not triggered. You have to read past the end of a stream in order for the EOF condition to be triggered. But with ignore it is different: you only need to read to the end of a stream.
Consider:
#include <sstream>
#include <iostream>
using namespace std;
int main()
{
    {
        std::stringstream ss;
        ss << "abcd";
        char buf[1024];
        ss.read(buf, 4);
        std::cout << "EOF: " << std::boolalpha << ss.eof() << std::endl;
    }
    {
        std::stringstream ss;
        ss << "abcd";
        ss.ignore(4);
        std::cout << "EOF: " << std::boolalpha << ss.eof() << std::endl;
    }
}
On GCC 4.4.5, this prints out:
EOF: false
EOF: true
So, why is the behavior different here? Is there some compelling reason that EOF is triggered "early" by a call to ignore?

eof() should only return true if you have already attempted to read past the end. In neither case should it be true. This may be a bug in your implementation.
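As a minimal sketch of the distinction being drawn here, using read() only: the first read consumes everything in the stream, and only the second read, which attempts to go past the end, sets eofbit.
#include <iostream>
#include <sstream>

int main()
{
    std::stringstream ss;
    ss << "abcd";
    char buf[1024];
    ss.read(buf, 4);                                  // consumes all four bytes
    std::cout << std::boolalpha << ss.eof() << '\n';  // false: nothing read past the end yet
    ss.read(buf, 1);                                  // attempts to read past the end
    std::cout << std::boolalpha << ss.eof() << '\n';  // true
}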

I'm going to go out on a limb here and answer my own question: it really looks like this is a bug in GCC.
The standard reads in 27.6.1.3 paragraph 23:
[istream::ignore] behaves as an unformatted input function (as described in 27.6.1.3, paragraph 1). After constructing a sentry object, extracts characters and discards them. Characters are extracted until any of the following occurs:
- if n != numeric_limits<streamsize>::max() (18.2.1), n characters are extracted
- end-of-file occurs on the input sequence (in which case the function calls setstate(eofbit), which may throw ios_base::failure (27.4.4.3));
- c == delim for the next available input character c (in which case c is extracted). Note: The last condition will never occur if delim == traits::eof()
My (somewhat tentative) interpretation is that GCC is wrong here, because ignore "behaves as an unformatted input function" (like read()), which means that end-of-file should only occur on the input sequence if there is an attempt to extract additional bytes after the last byte in the stream has been extracted.
I'll submit a bug report if I find that enough people agree with this answer.

The consensus seemed to be that this was a legitimate bug in gcc. Since I saw no indication a bug report had been filed, I'm doing so now. The report can be viewed at:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51651

Related

Is size of char_traits<char16_t>::int_type not large enough?

Consider the following program:
#include <iostream>
#include <sstream>
#include <string>
int main(int, char **) {
    std::basic_stringstream<char16_t> stream;
    stream.put(u'\u0100');
    std::cout << " Bad: " << stream.bad() << std::endl;
    stream.put(u'\uFFFE');
    std::cout << " Bad: " << stream.bad() << std::endl;
    stream.put(u'\uFFFF');
    std::cout << " Bad: " << stream.bad() << std::endl;
    return 0;
}
The output is:
Bad: 0
Bad: 0
Bad: 1
It seems the reason the badbit gets set is that put sets the badbit if the character equals std::char_traits<char16_t>::eof(). I can now no longer put to the stream.
At http://en.cppreference.com/w/cpp/string/char_traits it states:
int_type: an integer type that can hold all values of char_type plus
EOF
But if char_type is the same as int_type (uint_least16_t) then how can this be true?
The standard is quite explicit: std::char_traits<char16_t>::int_type is a typedef for std::uint_least16_t; see [char.traits.specializations.char16_t], which also says:
The member eof() shall return an implementation-defined constant that cannot appear as a valid UTF-16 code unit.
I'm not sure precisely how that interacts with http://www.unicode.org/versions/corrigendum9.html but existing practice in the major C++ implementations is to use the all-ones bit pattern for char_traits<char16_t>::eof(), even when uint_least16_t has exactly 16 bits.
After a bit more thought, I think it's possible for implementations to meet the Character traits requirements by making std::char_traits<char16_t>::to_int_type(char_type) return U+FFFD when given U+FFFF. This satisfies the requirement for eof() to return:
a value e such that X::eq_int_type(e,X::to_int_type(c)) is false for all values c.
This would also ensure that it's possible to distinguish success and failure when checking the result of basic_streambuf<char16_t>::sputc(u'\uFFFF'), so that it only returns eof() on failure, and returns u'\ufffd' otherwise.
I'll try that. I've created https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80624 to track this in GCC.
I've also reported an issue against the standard, so we can fix the "cannot appear as a valid UTF-16 code unit" wording, and maybe fix it some other way too.
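As a rough illustration of the remapping idea described above (a hypothetical traits class sketch, not libstdc++'s actual code):
#include <cstdint>
#include <string>

// Hypothetical traits: remap U+FFFF in to_int_type() so that no valid code
// unit can ever compare equal to the all-ones eof() value.
struct remapping_char16_traits : std::char_traits<char16_t>
{
    static int_type to_int_type(char_type c) noexcept
    {
        return c == u'\uFFFF' ? int_type(0xFFFD) : int_type(c);
    }
};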
The interesting part of the behavior is that:
stream.put(u'\uFFFF');
sets the badbit, while:
stream << u'\uFFFF';
char16_t c = u'\uFFFF'; stream.write( &c, 1 );
does not set badbit.
This answer focuses only on the differences.
So let's check gcc's source code. In bits/ostream.tcc, lines 164-165, we can see that put() checks whether the value equals eof(), and sets the badbit:
if (traits_type::eq_int_type(__put, traits_type::eof())) // <== It checks the value!
    __err |= ios_base::badbit;
From line 196, we can see that write() does not have this logic; it only checks whether all the chars were written to the buffer.
This explains the behavior.
From std::basic_ostream::put's description:
Internally, the function accesses the output sequence by first
constructing a sentry object. Then (if good), it inserts c into its
associated stream buffer object as if calling its member function
sputc, and finally destroys the sentry object before returning.
It does not say anything about a check against eof(), so I would think this is either a bug in the documentation or a bug in the implementation.
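A minimal sketch of the observed difference, matching the implementation behavior described in this answer (write() and operator<< skip the eof() comparison that put() performs):
#include <iostream>
#include <sstream>

int main()
{
    std::basic_stringstream<char16_t> stream;
    char16_t c = u'\uFFFF';

    stream.write(&c, 1);                        // no eof() comparison: badbit stays clear
    std::cout << "Bad after write: " << stream.bad() << std::endl;

    stream << c;                                // also leaves badbit clear
    std::cout << "Bad after <<:    " << stream.bad() << std::endl;

    stream.put(c);                              // compares the result against eof(): badbit set
    std::cout << "Bad after put:   " << stream.bad() << std::endl;
}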
That really depends on what you mean by "large enough". char16_t is not "a type large enough to hold any Unicode character including those which I'm not allowed to use". You chose to try to cram \uFFFF, which is "reserved for internal use", into a char16_t, and thus you are the one at fault. The program is simply doing as you instructed.

Example of Why stream::good is Wrong?

I gave an answer here in which I wanted to check the validity of the stream each time through a loop.
My original code used good and looked similar to this:
ifstream foo("foo.txt");
while (foo.good()){
    string bar;
    getline(foo, bar);
    cout << bar << endl;
}
I was immediately pointed here and told to never test good. Clearly this is something I haven't understood but I want to be doing my file I/O correctly.
I tested my code out with several examples and couldn't make the good-testing code fail.
First (this printed correctly, ending with a new line):
bleck 1
blee 1 2
blah
ends in new line
Second (this printed correctly, ending with the last line):
bleck 1
blee 1 2
blah
this doesn't end in a new line
Third was an empty file (this printed correctly, a single newline.)
Fourth was a missing file (this correctly printed nothing.)
Can someone help me with an example that demonstrates why good-testing shouldn't be done?
They were wrong. The mantra is 'never test .eof()'.
Why is iostream::eof inside a loop condition considered wrong?
Even that mantra is overboard, because both are useful to diagnose the state of the stream after an extraction failed.
So the mantra should be more like
Don't use good() or eof() to detect eof before you try to read any further
Same for fail(), and bad()
Of course stream.good can be usefully employed before using a stream (e.g. in case the stream is a filestream which has not been successfully opened)
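For instance, a minimal sketch of that legitimate use, with the file from the question:
#include <fstream>
#include <iostream>

int main()
{
    std::ifstream foo("foo.txt");
    if (!foo.good())                    // before any read: detects a failed open
    {
        std::cerr << "could not open foo.txt\n";
        return 1;
    }
    // ... safe to start reading here
}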
However, both are very very very often abused to detect the end of input, and that's not how it works.
A canonical example of why you shouldn't use this method:
std::istringstream stream("a");
char ch;
if (stream >> ch) {
std::cout << "At eof? " << std::boolalpha << stream.eof() << "\n";
std::cout << "good? " << std::boolalpha << stream.good() << "\n";
}
Prints
At eof? false
good? true
See it Live On Coliru
This is already covered in other answers, but I'll go over it briefly for completeness. The only functional difference between
while(foo.good()) { // effectively same as while(foo) {
    getline(foo, bar);
    consume(bar); // consume() represents any operation that uses bar
}
and
while(getline(foo, bar)){
    consume(bar);
}
is that the former will do an extra iteration when there are no lines in the file, making that case indistinguishable from the case of a single empty line. I would argue that this is not typically the desired behaviour. But I suppose that's a matter of opinion.
As sehe says, the mantra is overboard; it's a simplification. The real point is that you must not consume() the result of a read before you test for failure or at least EOF (and any test before the read is irrelevant), which is exactly what people end up doing when they test good() in the loop condition.
However, the thing about getline() is that it tests for EOF internally, for you, and returns an empty string even if only EOF is read. Therefore, the former version is roughly similar to the following pseudo-C++:
while(foo.good()) {
    // inside getline:
    bar = "";                    // Reset bar to empty
    string sentry;
    if(read_until_newline(foo, sentry)) {
        // The stream's state is tested implicitly inside getline
        // after the value is read.
        bar = sentry;            // The read value is used only if it's valid.
    }                            // Otherwise, bar stays empty.
    // back outside getline:
    consume(bar);
}
I hope that illustrates what I'm trying to say. One could say that there is a "correct" version of the read loop inside getline(). This is why the rule is at least partially satisfied by the use of getline even if the outer loop doesn't conform.
But, for other methods of reading, breaking the rule hurts more. Consider:
while(foo.good()) {
    int bar;
    foo >> bar;
    consume(bar);
}
Not only do you always get the extra iteration, the bar in that iteration is uninitialized!
So, in short, while(foo.good()) is OK in your case because getline(), unlike certain other reading functions, leaves the output in a valid state after hitting EOF, and because you don't care about, or even expect, the extra iteration when the file is empty.
Both good() and eof() will give you an extra line in your code. If you have a blank file and run this:
std::ifstream foo1("foo1.txt");
std::string line;
int lineNum = 1;
std::cout << "foo1.txt Controlled With good():\n";
while (foo1.good())
{
    std::getline(foo1, line);
    std::cout << lineNum++ << line << std::endl;
}
foo1.close();
foo1.open("foo1.txt");
lineNum = 1;
std::cout << "\n\nfoo1.txt Controlled With getline():\n";
while (std::getline(foo1, line))
{
    std::cout << line << std::endl;
}
The output you will get is
foo1.txt Controlled With good():
1
foo1.txt Controlled With getline():
This shows that the good() version isn't working correctly, since a blank file should never produce a line. The only way to know is to use the read itself as the loop condition, since the stream will always be good before the first read.
Using foo.good() just tells you that the previous read operation worked fine and that the next one might work as well. .good() checks the state of the stream at a given point; it does not check whether the end of the file has been reached. Let's say something happened while the file was being read (network error, OS error, ...): good() will fail, but that does not mean the end of the file was reached. Conversely, .good() also fails when the end of file is reached, because the stream is not able to read any more.
On the other hand, .eof() checks if the end of file was truly reached.
So, .good() might fail while the end of file was not reached.
Hope this helps you understand why using .good() to check end of file is a bad habit.
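As a minimal sketch of that distinction, the usual pattern is to loop on the read itself and only inspect the state bits afterwards (using the file from the question):
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream foo("foo.txt");
    std::string bar;
    while (std::getline(foo, bar))
        std::cout << bar << '\n';

    if (foo.bad())
        std::cerr << "I/O error while reading\n";  // something went wrong mid-read
    else if (foo.eof())
        std::cout << "reached end of file\n";      // the normal way the loop ends
}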
Let me clearly say that sehe's answer is the correct one.
But the option proposed by Nathan Oliver, Neil Kirk, and user2079303, to use getline as the loop condition rather than good, needs to be addressed for the sake of posterity.
We will compare the loop in the question to the following loop:
string bar;
while (getline(foo, bar)){
    cout << bar << endl;
}
Because getline returns the istream passed as the first argument, because evaluating an istream in a boolean context yields !fail() (true as long as neither failbit nor badbit is set), and because a read that reaches EOF without extracting anything sets both failbit and eofbit, this makes getline a valid loop condition.
The behavior does change, however, when using getline as the condition: if the final read finds only EOF (an empty last "line"), the loop exits, preventing that line from being output. This doesn't occur in Examples 2 and 4. But Example 1:
bleck 1
blee 1 2
blah
ends in new line
Prints this with the good loop condition (followed by an extra blank line from the final iteration):
bleck 1
blee 1 2
blah
ends in new line
But chops the last line with the getline loop condition:
bleck 1
blee 1 2
blah
ends in new line
Example 3 is an empty file: it prints a single blank line with the good condition, and prints nothing with the getline condition.
Neither of these behaviors is wrong. But that last line can make a difference in code. Hopefully this answer will be helpful to you when deciding between the two for coding purposes.

Reading with setw: to eof or not to eof?

Consider the following simple example
#include <string>
#include <sstream>
#include <iomanip>
using namespace std;
int main() {
string str = "string";
istringstream is(str);
is >> setw(6) >> str;
return is.eof();
}
At first sight, since the explicit width is specified by the setw manipulator, I'd expect the >> operator to finish reading the string after successfully extracting the requested number of characters from the input stream. I don't see any immediate reason for it to try to extract the seventh character, which means that I don't expect the stream to enter the eof state.
When I run this example under MSVC++, it works as I expect it to: the stream remains in good state after reading. However, in GCC the behavior is different: the stream ends up in eof state.
The language standard gives the following list of completion conditions for this version of the >> operator:
n characters are stored;
end-of-file occurs on the input sequence;
isspace(c,is.getloc()) is true for the next available input character c.
Given the above, I don't see any reason for the >> operator to drive the stream into the eof state in the above code.
However, this is what the >> operator implementation in the GCC library looks like:
...
__int_type __c = __in.rdbuf()->sgetc();
while (__extracted < __n
       && !_Traits::eq_int_type(__c, __eof)
       && !__ct.is(__ctype_base::space,
                   _Traits::to_char_type(__c)))
  {
    if (__len == sizeof(__buf) / sizeof(_CharT))
      {
        __str.append(__buf, sizeof(__buf) / sizeof(_CharT));
        __len = 0;
      }
    __buf[__len++] = _Traits::to_char_type(__c);
    ++__extracted;
    __c = __in.rdbuf()->snextc();
  }
__str.append(__buf, __len);
if (_Traits::eq_int_type(__c, __eof))
  __err |= __ios_base::eofbit;
__in.width(0);
...
As you can see, at the end of each successful iteration it prepares the next __c character for the next iteration, even though that iteration might never occur. And after the loop it examines the last value of __c and sets the eofbit accordingly.
So, my question is: is triggering the eof stream state in the above situation, as GCC does, legal from the standard's point of view? I don't see it explicitly specified in the document. Are both MSVC's and GCC's behaviors compliant? Or is only one of them behaving correctly?
The definition for that particular operator>> is not relevant to the setting of the eofbit, as it only describes when the operation terminates, but not what triggers a particular bit.
The description for the eofbit in the standard (draft) says:
eofbit - indicates that an input operation reached the end of an input sequence;
I guess here it depends on how you want to interpret "reached". Note that the gcc implementation correctly does not set failbit, which is defined as
failbit - indicates that an input operation failed to read the expected characters, or
that an output operation failed to generate the desired characters.
So I think eofbit does not necessarily mean that the end of file impeded the extraction of any new characters, just that the end of file has been "reached".
I can't seem to find a more accurate description for "reached", so I guess that would be implementation defined. If this logic is correct, then both MSVC and gcc behaviors are correct.
EDIT: In particular, it seems that eofbit gets set when sgetc() would return eof. This is described both in the istreambuf_iterator section and in the basic_istream::sentry section. So now the question is: when is the current position of the stream allowed to advance?
FINAL EDIT: It turns out that probably g++ has the correct behavior.
Every character scan passes through <locale>, in order to allow different character sets, money formats, time descriptions and number formats to be parsed. While there does not seem to be a thorough description of how operator>> works for strings, there are very specific descriptions of how the do_get functions for numbers, time and money are supposed to operate. You can find them from page 687 of the draft onward.
All of these start off by reading a ctype (the "global" version of a character, as read through locales) from an istreambuf_iterator (for numbers, you can find the call definitions at page 1018 of the draft). Then the ctype is processed, and finally the iterator is advanced.
So, in general, this requires the internal iterator to always point to the next character after the last one read; if that was not the case you could in theory extract more than you wanted:
string str = "strin1";
istringstream is(str);
is >> setw(6) >> str;
int x;
is >> x;
If the current character of is after the extraction into str were not the eof (that is, if the position had not advanced past the last character read), then the standard would require that x gets the value 1, since for numeric extraction the standard explicitly requires that the iterator is advanced after the first read.
Since this does not make much sense, and given that all the complex extractions described in the standard behave in the same way, it makes sense that the same would happen for strings. Thus, as the position of is after reading 6 characters falls on the eof, the eofbit needs to be set.
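A minimal sketch of that check: with the position already advanced past the sixth character, the numeric extraction fails instead of reading the 1, on both implementations discussed above.
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::string str = "strin1";
    std::istringstream is(str);
    is >> std::setw(6) >> str;      // stores "strin1" and advances past it
    int x = 0;
    is >> x;                        // nothing left to extract: fails, x stays 0
    std::cout << std::boolalpha << is.fail() << ' ' << x << std::endl;  // true 0
}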

Why does in_avail() output zero even if the stream has some char?

#include <iostream>
int main( )
{
    using namespace std;
    cout << cin.rdbuf()->in_avail() << endl;
    cin.putback(1);
    cin.putback(1);
    cout << cin.rdbuf()->in_avail() << endl;
    return 0;
} //compile by g++-4.8.1
I thought this would output 0 and 2, but when I run the code, it outputs 0 and 0. Why?
Also, if I change cin.putback(1); to int a; cin >> a; with input 12 12, it still outputs 0 and 0.
Apparently it's a bug/feature of some compiler implementations
Insert the line
cin.sync_with_stdio(false);
somewhere near the beginning of code, and that should fix it
EDIT: Also remember that in_avail may report one more character than you expect, because the newline that ended the input can still be sitting in the buffer.
EDIT2: Also as I just checked, putback does not work unless you have attempted to read something from the stream first, hence the "back" in "putback". If you want to insert characters into the cin, this thread will provide the answer:
Injecting string to 'cin'
What must have happened is that your putback didn't find any room in the streambuf get area associated with std::cin (otherwise a read position would have been available and egptr() - gptr() would have been non-zero) and must have gone to an underlying layer thanks to pbackfail.
in_avail() will call showmanyc(), and zero (the default implementation of this virtual function) is a safe thing to return, as it means that a read might block and might fail but isn't guaranteed to do either. Obviously an implementation could provide a more helpful showmanyc() in this case, but the simple implementation is cheap and conformant.
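By contrast, a minimal sketch with a stringstream, whose get area lives entirely in memory, shows in_avail() reporting the characters that are immediately available (on common implementations):
#include <iostream>
#include <sstream>

int main()
{
    std::stringstream ss("hello");
    std::cout << ss.rdbuf()->in_avail() << std::endl;  // 5: egptr() - gptr()
    ss.get();                                          // consume one character
    std::cout << ss.rdbuf()->in_avail() << std::endl;  // 4
}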

Reading a fixed number of chars with << on an istream

I was trying out a few file reading strategies in C++ and I came across this.
ifstream ifsw1("c:\\trys\\str3.txt");
char ifsw1w[3];
do {
    ifsw1 >> ifsw1w;
    if (ifsw1.eof())
        break;
    cout << ifsw1w << flush << endl;
} while (1);
ifsw1.close();
The contents of the file were:
firstfirst firstsecond
secondfirst secondsecond
When I look at the output, it is printed as:
firstfirst
firstsecond
secondfirst
I expected the output to be something like:
fir
stf
irs
tfi
.....
Moreover, I see that "secondsecond" has not been printed. I guess that the last read hit eof and the cout was not executed. But the first behavior I don't understand.
The extraction operator has no concept of the size of the ifsw1w variable, and (by default) is going to extract characters until it hits whitespace, null, or eof. The extra characters are likely being stored in the memory locations after your ifsw1w variable, which would cause bad bugs if you had additional variables defined.
To get the desired behavior, you should be able to use
ifsw1.width(3);
to limit the number of characters to extract.
It's virtually impossible to use std::istream& operator>>(std::istream&, char *) safely -- it's like gets in this regard -- there's no way for you to specify the buffer size. The stream just writes to your buffer, going off the end. (Your example above invokes undefined behavior). Either use the overloads accepting a std::string, or use std::getline(std::istream&, std::string).
Checking eof() is incorrect. You want fail() instead. You really don't care if the stream is at the end of the file, you care only if you have failed to extract information.
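A minimal sketch of the safer pattern this answer suggests: read into a std::string and test the extraction itself (which reflects fail()) rather than eof().
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream ifsw1("c:\\trys\\str3.txt");
    std::string word;
    while (ifsw1 >> word)           // the loop ends when extraction fails
        std::cout << word << std::endl;
}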
For something like this you're probably better off just reading the whole file into a string and using string operations from that point. You can do that using a stringstream:
#include <string>   //For string
#include <sstream>  //For stringstream
#include <fstream>  //For ifstream
#include <iostream> //As before
std::ifstream myFile(...);
std::stringstream ss;
ss << myFile.rdbuf(); //Read the file into the stringstream.
std::string fileContents = ss.str(); //Now you have a string, no loops!
You're trashing the memory... it's reading past the 3 chars you defined (it's reading until a space or a newline is met...).
Read char by char to achieve the output you had mentioned.
Edit : Irritate is right, this works too (with some fixes and not getting the exact result, but that's the spirit):
char ifsw1w[4];
do {
    ifsw1.width(4);
    ifsw1 >> ifsw1w;
    if (ifsw1.eof()) break;
    cout << ifsw1w << flush << endl;
} while (1);
ifsw1.close();
The code has undefined behavior. When you do something like this:
char ifsw1w[3];
ifsw1 >> ifsw1w;
The operator>> receives a pointer to the buffer, but has no idea of the buffer's actual size. As such, it has no way to know that it should stop reading after two characters (and note that it should be 2, not 3 -- it needs space for a '\0' to terminate the string).
Bottom line: in your exploration of ways to read data, this code is probably best ignored. About all you can learn from code like this is a few things you should avoid. It's generally easier, however, to just follow a few rules of thumb than try to study all the problems that can arise.
Use std::string to read strings.
Only use fixed-size buffers for fixed-size data.
When you do use fixed buffers, pass their size to limit how much is read.
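As a short sketch of the third rule applied to the original loop (setw tells the extraction the buffer size, so it stops in time and null-terminates):
#include <fstream>
#include <iomanip>
#include <iostream>

int main()
{
    std::ifstream ifsw1("c:\\trys\\str3.txt");
    char buf[4];                                   // room for 3 characters plus '\0'
    while (ifsw1 >> std::setw(sizeof buf) >> buf)  // extracts at most 3 characters per read
        std::cout << buf << std::endl;
}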
When you want to read all the data in a file, std::copy can avoid a lot of errors:
std::vector<std::string> strings;
std::copy(std::istream_iterator<std::string>(myFile),
          std::istream_iterator<std::string>(),
          std::back_inserter(strings));
To read the whitespace, you could use "noskipws"; it will not skip whitespace.
ifsw1 >> noskipws >> ifsw1w;
But if you want to read a fixed number of characters, I suggest using the get method (note that get(buf, n) stores at most n-1 characters plus a null terminator):
ifsw1.get(ifsw1w,3);