When does `ifstream::readsome` set `eofbit`?

This code loops forever:
#include <iostream>
#include <fstream>
#include <sstream>

int main(int argc, char *argv[])
{
    std::ifstream f(argv[1]);
    std::ostringstream ostr;
    while (f && !f.eof())
    {
        char b[5000];
        std::size_t read = f.readsome(b, sizeof b);
        std::cerr << "Read: " << read << " bytes" << std::endl;
        ostr.write(b, read);
    }
}
That's because readsome never sets eofbit.
cplusplus.com says:
Errors are signaled by modifying the internal state flags:
eofbit The get pointer is at the end of the stream buffer's internal input
array when the function is called, meaning that there are no positions to be
read in the internal buffer (which may or not be the end of the input
sequence). This happens when rdbuf()->in_avail() would return -1 before the
first character is extracted.
failbit The stream was at the end of the source of characters before the
function was called.
badbit An error other than the above happened.
The standard says almost the same:
[C++11: 27.7.2.3]: streamsize readsome(char_type* s, streamsize n);
32. Effects: Behaves as an unformatted input function (as described in
27.7.2.3, paragraph 1). After constructing a sentry object, if !good() calls
setstate(failbit) which may throw an exception, and return. Otherwise extracts
characters and stores them into successive locations of an array whose first
element is designated by s. If rdbuf()->in_avail() == -1, calls
setstate(eofbit) (which may throw ios_base::failure (27.5.5.4)), and extracts
no characters;
If rdbuf()->in_avail() == 0, extracts no characters
If rdbuf()->in_avail() > 0, extracts min(rdbuf()->in_avail(),n)).
33. Returns: The number of characters extracted.
That the in_avail() == 0 condition is a no-op implies that ifstream::readsome itself is a no-op if the stream buffer is empty, but the in_avail() == -1 condition implies that it will set eofbit when some other operation has led to in_avail() == -1.
This seems like an inconsistency, even despite the "some" nature of readsome.
So what are the semantics of readsome and eof? Have I interpreted them correctly? Are they an example of poor design in the streams library?
(Stolen from the [IMO] invalid libstdc++ bug 52169.)

I think this is a customization point, not really used by the default stream implementations.
in_avail() returns the number of chars it can see in the internal buffer, if any. Otherwise it calls showmanyc() to try to detect if chars are known to be available elsewhere, so a buffer fill request is guaranteed to succeed.
In turn, showmanyc() will return the number of chars it knows about, if any, or -1 if it knows that a read will fail, or 0 if it doesn't have a clue.
The default implementation (basic_streambuf) always returns 0, so that is what you get unless you have a stream with some other streambuf overriding showmanyc.
Your loop is essentially read-as-many-chars-as-you-know-is-safe, and it gets stuck when that is zero (meaning "not sure").
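A small sketch of that behaviour, assuming a std::stringstream (whose streambuf keeps the default showmanyc() returning 0): once the buffered characters are drained, in_avail() reports 0, readsome() extracts nothing, and eofbit is never set.
#include <iostream>
#include <sstream>

int main()
{
    std::stringstream ss("abcdef");
    char b[16];

    std::cout << ss.rdbuf()->in_avail() << '\n';   // 6: characters sitting in the buffer
    std::cout << ss.readsome(b, sizeof b) << '\n'; // 6: reads what is buffered
    std::cout << ss.rdbuf()->in_avail() << '\n';   // 0: default showmanyc() says "not sure"
    std::cout << ss.readsome(b, sizeof b) << '\n'; // 0: extracts nothing...
    std::cout << ss.eof() << '\n';                 // 0: ...and eofbit is still clear
}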

I don't think that readsome() is meant for what you're trying to do (read from a file on disk)... from cplusplus.com:
The function is intended to be used to read binary data from certain
types of asynchronous sources that may wait for more characters, since
it stops reading when the local buffer exhausts, avoiding potential
unexpected delays.
So it sounds like readsome() is intended for streams from a network socket or something like that, and you probably want to just use read().
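For reference, a minimal sketch of that read()-based loop (argv[1] is assumed to name a readable file, as in the question); read() sets eofbit once the file is exhausted, so the loop terminates:
#include <fstream>
#include <sstream>

int main(int argc, char *argv[])
{
    std::ifstream f(argv[1], std::ios::binary);
    std::ostringstream ostr;
    char b[5000];
    while (f)
    {
        f.read(b, sizeof b);        // sets eofbit (and failbit) on the last, short read
        ostr.write(b, f.gcount());  // write only what was actually extracted
    }
}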

If no character is available (i.e. gptr() == egptr() for the std::streambuf) the virtual member function showmanyc() is called. I could have an implementation of showmanyc() which returns an error code. Why that may be useful is a different question. However, this could set eof(). Of course, in_avail() is meant not to fail and not to block, and to just return the characters known to be available. That is, the loop you have above is essentially guaranteed to be an infinite loop unless you have a rather odd stream buffer.
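To illustrate that last point, here is a hypothetical (non-standard) stream buffer whose showmanyc() reports -1 once its fixed buffer is exhausted; with such a buffer, readsome() really does set eofbit:
#include <iostream>
#include <streambuf>
#include <istream>

struct FixedBuf : std::streambuf
{
    FixedBuf(char *data, std::streamsize n) { setg(data, data, data + n); }

protected:
    // Called by in_avail() only when the get area is empty; -1 means
    // "a read is guaranteed to fail", which is what readsome() checks for.
    std::streamsize showmanyc() override { return -1; }
};

int main()
{
    char data[] = "hello";
    FixedBuf buf(data, 5);
    std::istream in(&buf);

    char b[16];
    std::cout << in.readsome(b, sizeof b) << ' ' << in.eof() << '\n'; // 5 0
    std::cout << in.readsome(b, sizeof b) << ' ' << in.eof() << '\n'; // 0 1: in_avail() == -1, so eofbit is set
}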

Others have answered why readsome won't set eofbit by design. I will suggest a way to read some bytes until EOF without setting failbit, in an intuitive way, much as you were trying to use readsome. This is the result of research in another question.
#include <iostream>
#include <fstream>
#include <sstream>

using namespace std;

streamsize Read(istream &stream, char *buffer, streamsize count)
{
    // This consistently fails on gcc (linux) 4.8.1 with failbit set on read
    // failure. This apparently never fails on VS2010 and VS2013 (Windows 7).
    streamsize reads = stream.rdbuf()->sgetn(buffer, count);

    // This rarely sets failbit on VS2010 and VS2013 (Windows 7) on read
    // failure of the previous sgetn().
    stream.rdstate();

    // On gcc (linux) 4.8.1 and VS2010/VS2013 (Windows 7) this consistently
    // sets eofbit when the stream is at EOF as a consequence of sgetn(). It
    // should also throw if exceptions are set, or return on the contrary,
    // and previous rdstate() restored a failbit on Windows. On Windows most
    // of the time it sets eofbit even on a real read failure.
    stream.peek();

    return reads;
}

int main(int argc, char *argv[])
{
    ifstream instream("filepath", ios_base::in | ios_base::binary);
    while (!instream.eof())
    {
        char buffer[0x4000];
        size_t read = Read(instream, buffer, sizeof(buffer));
        // Do something with buffer
    }
}

Related

Why does writing empty std::istringstream.rdbuf() set failbit?

I have learned that I can copy a C++ std::istream to a C++ std::ostream by outputting the istream's rdbuf(). I used it several times and it worked fine.
Today I got in trouble, because this operation sets failbit if the std::istream is empty (at least for std::istringstream). I have written the following code to demonstrate my problem:
#include <stdio.h>
#include <sstream>

int main(int argc, char *argv[])
{
    std::ostringstream ss;
    ss << std::istringstream(" ").rdbuf();  // this does not set failbit
    printf("fail=%d\n", ss.fail());
    ss << std::istringstream("").rdbuf();   // why does this set failbit ???
    printf("fail=%d\n", ss.fail());
}
I tried Windows/VS2017 and Linux/gcc-9.20 and both behave the same.
I am using std::istream& in a method signature with a default value of std::istringstream(""). The calling code shall be able to pass an optional istream, which is appended to some other data.
Can anybody explain why failbit is set?
Is there a better way to implement this optional std::istream& parameter?
I know, I could write two methods, one with an additional std::istream& parameter, but I want to avoid duplicate code.
Thanks in advance,
Mario
Update 22-Apr-20
I now use the following code:
#include <stdio.h>
#include <sstream>

int main(int argc, char *argv[])
{
    std::ostringstream out;
    std::istringstream in("");
    while (in)
    {
        char Buffer[4096];
        in.read(Buffer, sizeof(Buffer));
        out.write(Buffer, in.gcount());
    }
    printf("fail=%d\n", out.fail());
}
I also added a warning about setting failbit when copying empty files to https://stackoverflow.com/a/10195497/6832488
The documentation of ostream::operator<< describes the following behavior for the overload that takes a streambuf pointer:
Behaves as an UnformattedOutputFunction. After constructing and checking the sentry object, checks if sb is a null pointer. If it is, executes setstate(badbit) and exits. Otherwise, extracts characters from the input sequence controlled by sb and inserts them into *this until one of the following conditions are met:
* end-of-file occurs on the input sequence;
* inserting in the output sequence fails (in which case the character to be inserted is not extracted);
* an exception occurs (in which case the exception is caught).
If no characters were inserted, executes setstate(failbit). If an exception was thrown while extracting, sets failbit and, if failbit is set in exceptions(), rethrows the exception.
As you can tell, that explicitly says that trying to insert an empty buffer will set the failbit. If you want to essentially make it "optional", just check that the stream is good before inserting the buffer and do ss.clear() afterwards to clear the failbit.
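A minimal sketch of that workaround under the default exception mask (so the failed insertion does not throw): insert unconditionally and then discard the failbit, accepting that the "failure" here only means the source was empty.
#include <stdio.h>
#include <sstream>

int main()
{
    std::ostringstream out;
    std::istringstream in("");   // possibly-empty optional input

    out << in.rdbuf();           // sets failbit on out because nothing was inserted
    out.clear();                 // drop it; note this would also clear a genuine badbit

    printf("fail=%d\n", out.fail());  // prints fail=0
}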

Why does std::ios_base::ignore() set the EOF bit?

When I read all data from a stream, but make no attempt to read past its end, the stream's EOF is not set. That's how C++ streams work, right? It's the reason this works:
#include <sstream>
#include <cassert>

char buf[255];

int main()
{
    std::stringstream ss("abcdef");
    ss.read(buf, 6);
    assert(!ss.eof());
    assert(ss.tellg() == 6);
}
However, if instead of read()ing data I ignore() it, EOF is set:
#include <sstream>
#include <cassert>

int main()
{
    std::stringstream ss("abcdef");
    ss.ignore(6);
    assert(!ss.eof());       // <-- FAILS
    assert(ss.tellg() == 6); // <-- FAILS
}
This is on GCC 4.8 and GCC trunk (Coliru).
It also has the unfortunate side-effect of making tellg() return -1 (because that's what tellg() does), which is annoying for what I'm doing.
Is this standard-mandated? If so, which passage and why? Why would ignore() attempt to read more than I told it to?
I can't find any reason for this behaviour on cppreference's ignore() page. I can probably .seekg(6, std::ios::cur) instead, right? But I'd still like to know what's going on.
I think this is a libstdc++ bug (42875, h/t NathanOliver). The requirements on ignore() in [istream.unformatted] are:
Characters are extracted until any of the following occurs:
— n != numeric_limits<streamsize>::max() (18.3.2) and n characters have been extracted so far
— end-of-file occurs on the input sequence (in which case the function calls setstate(eofbit),
which may throw ios_base::failure (27.5.5.4));
— traits::eq_int_type(traits::to_int_type(c), delim) for the next available input character
c (in which case c is extracted).
Remarks: The last condition will never occur if traits::eq_int_type(delim, traits::eof()).
So we have two applicable conditions (the delimiter condition never occurs here): either we read n characters, or at some point we hit end-of-file, in which case we set the eofbit. But we are able to read n characters from the stream in this case (there are in fact 6 characters in your stream), so we should not hit end-of-file on the input sequence.
In libc++, eof() is not set and tellg() does return 6.
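For completeness, a sketch of the seekg() workaround mentioned in the question, which skips the characters without touching eofbit:
#include <sstream>
#include <cassert>

int main()
{
    std::stringstream ss("abcdef");
    ss.seekg(6, std::ios::cur);   // advance past the six characters
    assert(!ss.eof());
    assert(ss.tellg() == 6);
}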

Taking input from cin and storing it in a char variable

I'm taking input using cin and storing it into a char variable. My question is if there is any input that could cause cin.fail() to return true.
I know that trying to store input such as "foo" into an int variable will fail, but is there any case in which this is possible with a char variable?
The overloads of operator>> which take a char follow the normal behavior of a formatted input function, that is, they call rdbuf()->sbumpc() or rdbuf()->sgetc() to perform the extraction. Naturally, if eof is encountered, then eofbit is set. If one of the functions throws an exception, then badbit is set. If either of these is set, then failbit is set. There's no evidence to indicate that the operation would fail otherwise. (This is covered under section [istream] in the C++11 draft standard.) For other types, like int, do_get() is used to convert the character (similar to scanf). Of course, the conversion can fail, but no conversion is needed if the input is already a char.
Now the comments are misleading. CTRL+C would kill the application in Linux. CTRL+Z would send a character that signals EOF on some operating systems.
You can even use an emoji and it would work:
#include <iostream>

int main()
{
    char c;
    if (std::cin >> c)
        std::cout << "Huzzah!";
}
With input 😁 outputs "Huzzah!" as expected.
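For completeness, a minimal illustration (using a string stream in place of std::cin) of the one way extraction into a char does fail here: running out of input, which sets eofbit and then failbit.
#include <iostream>
#include <sstream>

int main()
{
    std::istringstream in("x");
    char c;

    in >> c;                                            // succeeds, c == 'x'
    std::cout << in.fail() << ' ' << in.eof() << '\n';  // 0 0
    in >> c;                                            // no input left
    std::cout << in.fail() << ' ' << in.eof() << '\n';  // 1 1
}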
I guess not, because a char variable will just take the first character from the given input, no matter how long the input is or of what type it is (int, long, double...).
No, failbit is only set if there's a logical error reading the input stream, AKA, someone rips out the USB flashdrive containing the file from which you're reading. ;)

Reading with setw: to eof or not to eof?

Consider the following simple example
#include <string>
#include <sstream>
#include <iomanip>

using namespace std;

int main() {
    string str = "string";
    istringstream is(str);
    is >> setw(6) >> str;
    return is.eof();
}
At the first sight, since the explicit width is specified by the setw manipulator, I'd expect the >> operator to finish reading the string after successfully extracting the requested number of characters from the input stream. I don't see any immediate reason for it to try to extract the seventh character, which means that I don't expect the stream to enter eof state.
When I run this example under MSVC++, it works as I expect it to: the stream remains in good state after reading. However, in GCC the behavior is different: the stream ends up in eof state.
The language standard gives the following list of completion conditions for this version of the >> operator:
n characters are stored;
end-of-file occurs on the input sequence;
isspace(c,is.getloc()) is true for the next available input character c.
Given the above, I don't see any reason for the >> operator to drive the stream into the eof state in the above code.
However, this is what the >> operator implementation in the GCC library looks like:
...
__int_type __c = __in.rdbuf()->sgetc();
while (__extracted < __n
       && !_Traits::eq_int_type(__c, __eof)
       && !__ct.is(__ctype_base::space,
                   _Traits::to_char_type(__c)))
{
    if (__len == sizeof(__buf) / sizeof(_CharT))
    {
        __str.append(__buf, sizeof(__buf) / sizeof(_CharT));
        __len = 0;
    }
    __buf[__len++] = _Traits::to_char_type(__c);
    ++__extracted;
    __c = __in.rdbuf()->snextc();
}
__str.append(__buf, __len);
if (_Traits::eq_int_type(__c, __eof))
    __err |= __ios_base::eofbit;
__in.width(0);
...
As you can see, at the end of each successful iteration, it attempts to prepare the next __c character for the next iteration, even though the next iteration might never occur. And after the cycle it analyzes the last value of that __c character and sets the eofbit accordingly.
So, my question is: triggering the eof stream state in the above situation, as GCC does - is it legal from the standard point of view? I don't see it explicitly specified in the document. Is both MSVC's and GCC's behavior compliant? Or is only one of them behaving correctly?
The definition for that particular operator>> is not relevant to the setting of the eofbit, as it only describes when the operation terminates, but not what triggers a particular bit.
The description for the eofbit in the standard (draft) says:
eofbit - indicates that an input operation reached the end of an input sequence;
I guess here it depends on how you want to interpret "reached". Note that the gcc implementation correctly does not set failbit, which is defined as
failbit - indicates that an input operation failed to read the expected characters, or
that an output operation failed to generate the desired characters.
So I think eofbit does not necessarily mean that the end of file impeded the extractions of any new characters, just that the end of file has been "reached".
I can't seem to find a more accurate description for "reached", so I guess that would be implementation defined. If this logic is correct, then both MSVC and gcc behaviors are correct.
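A quick way to observe the divergence (flag values as reported in the question and above; libstdc++ reportedly gives eof=1 fail=0, MSVC's library eof=0 fail=0):
#include <string>
#include <sstream>
#include <iomanip>
#include <iostream>

int main()
{
    std::string str = "string";
    std::istringstream is(str);
    is >> std::setw(6) >> str;
    std::cout << "eof=" << is.eof() << " fail=" << is.fail() << '\n';
}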
EDIT: In particular, it seems that eofbit gets set when sgetc() would return eof. This is described both in the istreambuf_iterator section and in the basic_istream::sentry section. So now the question is: when is the current position of the stream allowed to advance?
FINAL EDIT: It turns out that probably g++ has the correct behavior.
Every character scan passes through <locale>, in order to allow different character sets, money formats, time descriptions and number formats to be parsed. While there does not seem to be a thorough description of how the operator>> works for strings, there are very specific descriptions of how the do_get functions for numbers, time and money are supposed to operate. You can find them from page 687 of the draft onward.
All of these start off by reading a ctype (the "global" version of a character, as read through locales) from a istreambuf_iterator (for numbers, you can find the call definitions at page 1018 of the draft). Then the ctype is processed, and finally the iterator is advanced.
So, in general, this requires the internal iterator to always point to the next character after the last one read; if that was not the case you could in theory extract more than you wanted:
string str = "strin1";
istringstream is(str);
is >> setw(6) >> str;
int x;
is >> x;
If the current character of is after the extraction into str were not eof, then the standard would require that x get the value 1, since for numeric extraction the standard explicitly requires that the iterator be advanced after the first read.
Since that would not make much sense, and given that all the complex extractions described in the standard behave in the same way, it makes sense that the same happens for strings. Thus, as the position of is after reading 6 characters falls on eof, the eofbit needs to be set.

Different EOF behavior with read versus ignore

I was recently just tripped up by a subtle distinction between the behavior of std::istream::read versus std::istream::ignore. Basically, read extracts N bytes from the input stream, and stores them in a buffer. The ignore function extracts N bytes from the input stream, but simply discards them rather than storing them in a buffer. So, my understanding was that read and ignore are basically the same in every way, except for the fact that read saves the extracted bytes whereas ignore just discards them.
But there is another subtle difference between read and ignore which managed to trip me up. If you read to the end of a stream, the EOF condition is not triggered. You have to read past the end of a stream in order for the EOF condition to be triggered. But with ignore it is different: you only need to read to the end of a stream.
Consider:
#include <sstream>
#include <iostream>

using namespace std;

int main()
{
    {
        std::stringstream ss;
        ss << "abcd";
        char buf[1024];
        ss.read(buf, 4);
        std::cout << "EOF: " << std::boolalpha << ss.eof() << std::endl;
    }
    {
        std::stringstream ss;
        ss << "abcd";
        ss.ignore(4);
        std::cout << "EOF: " << std::boolalpha << ss.eof() << std::endl;
    }
}
On GCC 4.4.5, this prints out:
EOF: false
EOF: true
So, why is the behavior different here? This subtle difference managed to confuse me enough to wonder why there is a difference. Is there some compelling reason that EOF is triggered "early" with a call to ignore?
eof() should only return true if you have already attempted to read past the end. In neither case should it be true. This may be a bug in your implementation.
I'm going to go out on a limb here and answer my own question: it really looks like this is a bug in GCC.
The standard reads in 27.6.1.3 paragraph 23:
[istream::ignore] behaves as an unformatted input function (as described in 27.6.1.3, paragraph 1). After constructing a sentry object, extracts characters and discards them. Characters are extracted until any of the following occurs:
— if n != numeric_limits<streamsize>::max() (18.2.1), n characters are extracted
— end-of-file occurs on the input sequence (in which case the function calls setstate(eofbit), which may throw ios_base::failure (27.4.4.3));
— c == delim for the next available input character c (in which case c is extracted). Note: The last condition will never occur if delim == traits::eof()
My (somewhat tentative) interpretation is that GCC is wrong here: ignore should behave as an unformatted input function (like read()), which means that end-of-file should only occur on the input sequence if there is an attempt to extract additional bytes after the last byte in the stream has been extracted.
I'll submit a bug report if I find that enough people agree with this answer.
The consensus seemed to be that this was a legitimate bug in gcc. Since I saw no indication a bug report had been filed, I'm doing so now. The report can be viewed at:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51651