Reading with fin >> d and reading with fin.read do different things. Since fin >> d works, your file apparently contains the text representation of a double. Using fin.read assumes the file was written in binary, which it apparently was not. You should also use sizeof(double) instead of the hard-coded constant 8.
The problem is that you misunderstood the semantics of the std::ifstream::read function. According to the C++ reference:
Note: This doc is for std::istream but applies to ifstream
std::istream::operator>>()
This operator (>>) applied to an input stream is known as extraction operator. It is overloaded as a member function for:
arithmetic types
Extracts and parses characters sequentially from the stream to interpret them as the representation of a value of the proper type, which is stored as the value of val.
Internally, the function accesses the input sequence by first constructing a sentry object (with noskipws set to false). Then (if good), it calls num_get::get (using the stream's selected locale) to perform both the extraction and the parsing operations, adjusting the stream's internal state flags accordingly. Finally, it destroys the sentry object before returning.
stream buffers and manipulators.
while for std::istream::read:
This function simply copies a block of data, without checking its contents nor appending a null character at the end.
So, when you do:
double d;
...
fin >> d;
You're storing a double into a double var. But ...
if you do:
double d;
...
fin.read((char*)&d, ...);
you are telling C++: OK, here (&d) I have an address, and I want you to treat it as if it were a char* (the cast). The function does exactly what you asked. But as the documentation says, it just copies a raw block of bytes into &d; for a text file those bytes are the characters of the number, not the binary value you are expecting.
That's why operator>> works while read doesn't.
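To make the difference concrete, here is a minimal sketch that runs both kinds of read against in-memory data. The helper names (read_as_text, read_as_binary, binary_image) are made up for this illustration, and a std::istringstream stands in for the file.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Parse a double stored as text, e.g. "3.5" (what `fin >> d` does).
double read_as_text(const std::string& bytes) {
    std::istringstream in(bytes);
    double d = 0.0;
    in >> d;
    return d;
}

// Copy sizeof(double) raw bytes into a double (what `fin.read` does).
// Only meaningful if the bytes came from write() on the same platform.
double read_as_binary(const std::string& bytes) {
    std::istringstream in(bytes);
    double d = 0.0;
    in.read(reinterpret_cast<char*>(&d), sizeof d);
    return d;
}

// Hypothetical helper: the raw object representation of a double,
// i.e. what a binary writer would have put in the file.
std::string binary_image(double d) {
    return std::string(reinterpret_cast<const char*>(&d), sizeof d);
}
```

Feeding the text "3.5" to read_as_binary does not give 3.5, because the bytes of the characters '3', '.', '5' are not the bytes of the floating-point value.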
I am reading numbers from an istream by using the >> operator overload. This works fine, but now I need to know how many characters have been consumed by that operation. I'm currently using something like
int startPos = in.tellg();
double number;
in >> number;
int readChars = in.tellg() - startPos;
This does work in some cases but it is quite fragile. When using std::cin as in, it doesn't work at all (I assume that this is because std::cin doesn't have a position in the stream, as it's potentially an endless one).
My question is (I think) rather simple: How can I get the amount of characters that have been read when using the >> operator?
During my search I encountered gcount() but this only works for unformatted input.
The documentation of the >> operator doesn't seem to give a hint on this either: http://www.cplusplus.com/reference/istream/istream/operator%3E%3E/
If the stream is formatted can't you just check the length of it?
Anyway, std::istream::operator>> for C++ 98:
The function is considered to perform formatted input: Internally, the function accesses the input sequence by first constructing a sentry object (with noskipws set to false). Then (if good), it extracts characters from its associated stream buffer object as if calling its member functions sbumpc or sgetc, and finally destroys the sentry object before returning.
For C++ 11:
The function is considered to perform unformatted input: Internally, the function accesses the input sequence by first constructing a sentry object (with noskipws set to true). Then (if good), it extracts characters from its associated stream buffer object as if calling its member functions sbumpc or sgetc, and finally destroys the sentry object before returning.
The number of characters successfully read and stored by this function can be accessed by calling member gcount.
So it seems that you can only count characters from unformatted input.
But:
The unformatted input operations that modify the value returned by this function (gcount()) are: get, getline, ignore, peek, read, readsome, putback and unget.
Notice though, that peek, putback and unget do not actually extract any characters, and thus gcount will always return zero after calling any of them.
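So maybe you can read the text with an unformatted function such as the member std::istream::getline or std::istream::get, use gcount() to count the characters, and then parse the number from what was read. (The free function istream& getline (istream& is, string& str); does not update gcount(), but the length of the resulting string gives you the same information.)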
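Another option, sketched here under assumptions rather than taken from the answers above, is to count at the stream buffer level: wrap the original buffer in a small filtering std::streambuf that forwards every character and counts the ones that are consumed. CountingBuf and extract_counted are hypothetical names for this illustration. The count includes leading whitespace skipped by >>.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

// Unbuffered filter: every character consumed through it is counted.
class CountingBuf : public std::streambuf {
public:
    explicit CountingBuf(std::streambuf* src) : src_(src) {}
    std::size_t count() const { return count_; }

protected:
    // Peek at the next character without consuming it.
    int_type underflow() override { return src_->sgetc(); }

    // Consume one character from the wrapped buffer and count it.
    int_type uflow() override {
        const int_type c = src_->sbumpc();
        if (!traits_type::eq_int_type(c, traits_type::eof())) ++count_;
        return c;
    }

private:
    std::streambuf* src_;
    std::size_t count_ = 0;
};

// Extracts a double from `in` and returns how many characters were consumed.
std::size_t extract_counted(std::istream& in, double& value) {
    CountingBuf counter(in.rdbuf());
    std::istream counted(&counter);
    counted >> value;
    in.setstate(counted.rdstate());  // propagate eof/fail to the caller's stream
    return counter.count();
}
```

Because the filter never seeks, this also works for streams without a position, such as std::cin.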
Given this:
auto f = std::ifstream{file};
if (f) {
std::stringstream stream;
stream << f.rdbuf();
return stream.str();
}
return std::string{};
I don't see why it works.
I don't know what type f is, because it says auto, but apparently you can check that for non-zero-ness.
But when the file is large, like 2 gig, the delay in running happens in
this line:
stream << f.rdbuf();
The documentation says rdbuf() gets you the pointer to the ifstream's internal buffer. So in order for it to read the entire file, the buffer would have to size the file, and load it all in one shot. But by the time the stream << happens, rdbuf() has to already be set, or it won't be able to return a pointer.
I would expect the constructor to do that in this case, but it's obviously lazy loaded, because reading in the entire file on construction would be bad for every other use case, and the delay is in the stream << command.
Any thoughts? All other stack overflow references to reading in a file to a string always loop in some way.
If there's some buffer involved, which obviously there is, how big can it get? What if it is 1 byte, it would surely be slow.
Adorable c++ is very opaque, bad for programmers who have to know what's going on under the covers.
It's a function of how operator<< is defined on ostreams when the argument is a streambuf. As long as the streambuf isn't a null pointer, it extracts characters from the input sequence controlled by the streambuf and inserts them into *this until one of the following conditions is met (see operator<< overload note #9):
end-of-file occurs on the input sequence;
inserting in the output sequence fails (in which case the character to be inserted is not extracted);
an exception occurs (in which case the exception is caught).
Basically, the ostream (which stringstream inherits from) knows how to exercise a streambuf to pull all the data from the file it's associated with. It's an idiomatic, but as you note, not intuitive, way to slurp a whole file. The streambuf isn't actually buffering all the data here (as you note, reading the whole file into the buffer would be bad in the general case), it's just that it has the necessary connections to adjust the buffered window as an ostream asks for more (and more, and more) data.
if (f) works because ifstream has an overload for operator bool that is implicitly invoked when the "truthiness" of the ifstream is tested that tells you if the file is in a failure state.
To answer your first question first:
f is of the type that's assigned to it, an std::ifstream, but that's a rather silly way to write it. One would usually write std::ifstream f {...}. A stream has an overloaded operator bool () which gives you !fail().
As for the second question: What .rdbuf() returns is a streambuf object. This object doesn't contain the whole file contents when it is returned. Instead, it provides an interface to access data, and this interface is used by the stringstream stream.
auto f = std::ifstream{file};
Type of f is std::ifstream.
stream << f.rdbuf();
std::ifstream maintains a buffer which you can get with f.rdbuf(), and it does not load the entire file in one shot. The loading happens when the above statement runs: stringstream extracts data from that buffer, and ifstream refills the buffer whenever it runs out of data.
You can request a buffer and buffer size by calling pubsetbuf on the stream buffer (f.rdbuf()->pubsetbuf(...)); setbuf itself is a protected member. Whether the request is honored is implementation-defined.
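As a small, self-contained sketch of the idiom discussed in these answers, the function below (slurp is a name chosen for this example) copies everything a stream buffer can deliver into a string. It takes any std::istream, so it can be tried with a std::istringstream as well as a std::ifstream.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Read everything that is left in `in` into a string.
// operator<<(std::streambuf*) pulls characters from the source buffer
// until it reports end-of-file; the source refills itself as needed.
std::string slurp(std::istream& in) {
    std::ostringstream out;
    out << in.rdbuf();
    return out.str();
}
```

With a file, the call would look like std::ifstream f(path); auto text = slurp(f); where path is whatever file name you have at hand.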
Let's say we have a stream containing simply:
hello
Note that there's no extra \n at the end like there often is in a text file. Now, the following simple code shows that the eof bit is set on the stream after extracting a single std::string.
int main(int argc, const char* argv[])
{
std::stringstream ss("hello");
std::string result;
ss >> result;
std::cout << ss.eof() << std::endl; // Outputs 1
return 0;
}
However, I can't see why this would happen according to the standard (I'm reading C++11 - ISO/IEC 14882:2011(E)). operator>>(basic_istream<...>&, basic_string<...>&) is defined as behaving like a formatted input function. This means it constructs a sentry object which proceeds to eat away whitespace characters. In this example, there are none, so the sentry construction completes with no problems. When converted to a bool, the sentry object gives true, so the extractor continues to get on with the actual extraction of the string.
The extraction is then defined as:
Characters are extracted and appended until any of the following occurs:
n characters are stored;
end-of-file occurs on the input sequence;
isspace(c,is.getloc()) is true for the next available input character c.
After the last character (if any) is extracted, is.width(0) is called and the sentry object k is destroyed.
If the function extracts no characters, it calls is.setstate(ios::failbit), which may throw ios_base::failure (27.5.5.4).
Nothing here actually causes the eof bit to be set. Yes, extraction stops if it hits the end-of-file, but it doesn't set the bit. In fact, the eof bit should only be set if we do another ss >> result;, because when the sentry attempts to gobble up whitespace, the following situation will occur:
If is.rdbuf()->sbumpc() or is.rdbuf()->sgetc() returns traits::eof(), the function calls setstate(failbit | eofbit)
However, this is definitely not happening yet because the failbit isn't being set.
The consequence of the eof bit being set is that the only reason the evil-idiom while (!stream.eof()) doesn't work when reading files is because of the extra \n at the end and not because the eof bit isn't yet set. My compiler is happily setting the eof bit when the extraction stops at the end of file.
So should this be happening? Or did the standard mean to say that setstate(eofbit) should occur?
To make it easier, the relevant sections of the standard are:
21.4.8.9 Inserters and extractors [string.io]
27.7.2.2 Formatted input functions [istream.formatted]
27.7.2.1.3 Class basic_istream::sentry [istream::sentry]
std::stringstream is a basic_istream and the operator>> of std::string "extracts" characters from it (as you found out).
27.7.2.1 Class template basic_istream
2 If rdbuf()->sbumpc() or rdbuf()->sgetc() returns traits::eof(), then the input function, except as explicitly noted otherwise, completes its actions and does setstate(eofbit), which may throw ios_base::failure (27.5.5.4), before returning.
Also, "extracting" means calling these two functions.
3 Two groups of member function signatures share common properties: the formatted input functions (or
extractors) and the unformatted input functions. Both groups of input functions are described as if they
obtain (or extract) input characters by calling rdbuf()->sbumpc() or rdbuf()->sgetc(). They may use
other public members of istream.
So eof must be set.
Intuitively speaking, the EOF bit is set because during the read operation to extract the string, the stream did indeed hit the end of the file. Specifically, it continuously read characters out of the input stream, stopping because it hit the end of the stream before encountering a whitespace character. Accordingly, the stream set the EOF bit to mark that the end of stream was reached. Note that this is not the same as reporting failure - the operation was completed successfully - but the point of the EOF bit is not to report failure. It's to mark that the end of the stream was encountered.
I don't have a specific part of the spec to back this up, though I'll try to look for one when I get the chance.
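The behaviour described in these answers can be observed directly. The sketch below uses a hypothetical helper, extract_word, that performs one >> extraction into a std::string and records the stream flags afterwards.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

struct ExtractResult {
    bool ok = false;    // did the extraction succeed?
    bool eof = false;   // eofbit after the extraction
    bool fail = false;  // failbit or badbit after the extraction
    std::string token;
};

// Perform one formatted extraction and report the resulting stream state.
ExtractResult extract_word(std::istream& in) {
    ExtractResult r;
    r.ok = static_cast<bool>(in >> r.token);
    r.eof = in.eof();
    r.fail = in.fail();
    return r;
}
```

On the input "hello" the first extraction succeeds and sets eofbit without failbit. On "hello\n" the first extraction stops at the newline and leaves eofbit clear; only the second extraction hits the end, and that one fails with both eofbit and failbit set.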
I found myself writing a lot of parsing code lately (mostly custom formats, but it isn't really relevant).
To enhance reusability, I chose to base my parsing functions on i/o streams so that I can use them with things like boost::lexical_cast<>.
I however realized I have never read anywhere anything about how to do that properly.
To illustrate my question, let's consider I have three classes Foo, Bar and FooBar:
A Foo is represented by data in the following format: string(<number>, <number>).
A Bar is represented by data in the following format: string[<number>].
A FooBar is kind-of a variant type that can hold either a Foo or a Bar.
Now let's say I wrote an operator>>() for my Foo type:
istream& operator>>(istream& is, Foo& foo)
{
char c1, c2, c3;
is >> foo.m_string >> c1 >> foo.m_x >> c2 >> std::ws >> foo.m_y >> c3;
if ((c1 != '(') || (c2 != ',') || (c3 != ')'))
{
is.setstate(std::ios_base::failbit);
}
return is;
}
The parsing goes fine for valid data. But if the data is invalid:
foo might be partially modified;
Some data in the input stream was read and is thus no longer available to further calls to is.
Also, I wrote another operator>>() for my FooBar type:
istream& operator>>(istream& is, FooBar& foobar)
{
Foo foo;
if (is >> foo)
{
foobar = foo;
}
else
{
is.clear();
Bar bar;
if (is >> bar)
{
foobar = bar;
}
}
return is;
}
But obviously it doesn't work because if is >> foo fails, some data has already been read and is no longer available for the call to is >> bar.
So here are my questions:
Where is my mistake here ?
Should one write his calls to operator>> to leave the initial data still available after a failure ? If so, how can I do that efficiently ?
If not, is there a way to "store" (and restore) the complete status of an input stream: state and data ?
What differences are they between failbit and badbit ? When should we use one or the other ?
Is there any online reference (or a book) that explains deeply how to deal with iostreams ? not just the basic stuff: the complete error handling.
Thank you very much.
Personally, I think these are reasonable questions and I remember very well that I struggled with them myself. So here we go:
Where is my mistake here ?
I wouldn't call it a mistake but you probably want to make sure you don't have to back off from what you have read. That is, I would implement three versions of the input functions. Depending on how complex the decoding of a specific type is I might not even share the code because it might be just a small piece anyway. If it is more than a line or two I probably would share the code. That is, in your example I would have an extractor for FooBar which essentially reads the Foo or the Bar members and initializes objects correspondingly. Alternatively, I would read the leading part and then call a shared implementation extracting the common data.
Let's do this exercise because there are a few things which may be a complication. From your description of the format it isn't clear to me if the "string" and what follows the string are delimited e.g. by a whitespace (space, tab, etc.). If not, you can't just read a std::string: the default behavior for them is to read until the next whitespace. There are ways to tweak the stream into considering characters as whitespace (using std::ctype<char>) but I'll just assume that there is space. In this case, the extractor for Foo could look like this (note, all code is entirely untested):
std::istream& read_data(std::istream& is, Foo& foo, std::string& s) {
Foo tmp(s);
if (is >> get_char<'('> >> tmp.m_x >> get_char<','> >> tmp.m_y >> get_char<')'>)
std::swap(tmp, foo);
return is;
}
std::istream& operator>>(std::istream& is, Foo& foo)
{
std::string s;
return read_data(is >> s, foo, s);
}
The idea is that read_data() reads the part of a Foo which is different from Bar when reading a FooBar. A similar approach would be used for Bar but I omit this. The more interesting bit is the use of this funny get_char() function template. This is something called a manipulator and is just a function taking a stream reference as argument and returning a stream reference. Since we have different characters we want to read and compare against, I made it a template but you can have one function per character as well. I'm just too lazy to type it out:
template <char Expect>
std::istream& get_char(std::istream& in) {
char c;
if (in >> c && c != Expect) { // compare against the template argument
in.setstate(std::ios_base::failbit);
}
return in;
}
What looks a bit weird about my code is that there are few checks if things worked. That is because the stream would just set std::ios_base::failbit when reading a member failed and I don't really have to bother myself. The only case where there is actually special logic added is in get_char() to deal with expecting a specific character. Similarly there is no skipping of whitespace characters (i.e. use of std::ws) going on: all the input functions are formatted input functions and these skip whitespace by default (you can turn this off by using e.g. in >> std::noskipws) but then lots of things won't work.
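For readers who want to try the manipulator idea in isolation, here is a minimal sketch. Point and expect_char are hypothetical names invented for this example (they stand in for Foo and get_char), and the format "(x, y)" is an assumption made to keep it short.

```cpp
#include <cassert>
#include <istream>
#include <sstream>

// Manipulator: extract one non-whitespace character and fail the stream
// if it is not the expected one.
template <char Expect>
std::istream& expect_char(std::istream& in) {
    char c;
    if (in >> c && c != Expect) {
        in.setstate(std::ios_base::failbit);
    }
    return in;
}

// Hypothetical type for the illustration, written as "(x, y)".
struct Point {
    int x = 0;
    int y = 0;
};

std::istream& operator>>(std::istream& is, Point& p) {
    Point tmp;  // parse into a temporary so p is untouched on failure
    if (is >> expect_char<'('> >> tmp.x >> expect_char<','> >> tmp.y >> expect_char<')'>) {
        p = tmp;
    }
    return is;
}
```

Once any step sets failbit, the remaining extractions in the chain do nothing, which is why no per-step checks are needed.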
With a similar implementation for reading a Bar, reading a FooBar would look something like this:
std::istream& operator>> (std::istream& in, FooBar& foobar) {
std::string s;
if (in >> s) {
switch ((in >> std::ws).peek()) {
case '(': { Foo foo; read_data(in, foo, s); foobar = foo; break; }
case '[': { Bar bar; read_data(in, bar, s); foobar = bar; break; }
default: in.setstate(std::ios_base::failbit);
}
}
return in;
}
This code uses an unformatted input function, peek(), which just looks at the next character. It either returns the next character or it returns std::char_traits<char>::eof() if it fails. So, if there is either an opening parenthesis or an opening bracket we have read_data() take over. Otherwise we always fail. That solves the immediate problem. On to distributing information...
Should one write his calls to operator>> to leave the initial data still available after a failure ?
The general answer is: no. If you failed to read, something went wrong and you give up. This might mean that you need to work harder to avoid failing, though. If you really need to back off from the position you were at to parse your data, you might want to read data first into a std::string using std::getline() and then analyze this string. Use of std::getline() assumes that there is a distinct character to stop at. The default is newline (hence the name) but you can use other characters as well:
std::getline(in, str, '!');
This would stop at the next exclamation mark and store all characters up to it in str. It would also extract the termination character but it wouldn't store it. This makes it interesting sometimes when you read the last line of a file which may not have a newline: std::getline() succeeds if it can read at least one character. If you need to know if the last character in a file is a newline, you can test if the stream reached the end of the file:
if (std::getline(in, str) && in.eof()) { std::cout << "file not ending in newline\n"; }
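Both points about std::getline() can be checked with a short sketch. The function names read_until and ends_without_newline are invented for this example.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Read up to (and consume, but not store) the given delimiter.
std::string read_until(std::istream& in, char delim) {
    std::string text;
    std::getline(in, text, delim);
    return text;
}

// Read all lines; report whether the last one was terminated by
// end-of-file rather than by a newline.
bool ends_without_newline(std::istream& in) {
    std::string line;
    bool missing = false;
    while (std::getline(in, line)) {
        missing = in.eof();  // eofbit is set only if no '\n' was found
    }
    return missing;
}
```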
If so, how can I do that efficiently ?
Streams are by their very nature single pass: you receive each character just once and if you skip over one you consume it. Thus, you typically want to structure your data in a way such that you don't have to backtrack. That said, this isn't always possible and most streams actually have a buffer under the hood to which characters can be returned. Since streams can be implemented by a user there is no guarantee that characters can be returned. Even for the standard streams there isn't really a guarantee.
If you want to return a character, you have to put back exactly the character you extracted:
char c;
if (in >> c && c != 'a')
in.putback(c);
if (in >> c && c != 'b')
in.unget();
The latter function has slightly better performance because it doesn't have to check that the character is indeed the one which was extracted. It also has less chances to fail. Theoretically, you can put back as many characters as you want but most streams won't support more than a few in all cases: if there is a buffer, the standard library takes care of "ungetting" all characters until the start of the buffer is reached. If another character is returned, it calls the virtual function std::streambuf::pbackfail() which may or may not make more buffer space available. In the stream buffers I have implemented it will typically just fail, i.e. I typically don't override this function.
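A typical use of unget() is a one-character lookahead. The helper below, accept, is a hypothetical name for this sketch; it relies on the stream being able to return one character, which holds for a std::istringstream but, as noted above, is not guaranteed for every stream.

```cpp
#include <cassert>
#include <istream>
#include <sstream>

// Consume the next character only if it equals `expected`.
// On a mismatch the character is handed back with unget().
// At end of input, get() fails and the stream is left in a failed state.
bool accept(std::istream& in, char expected) {
    char c;
    if (!in.get(c)) {
        return false;
    }
    if (c == expected) {
        return true;
    }
    in.unget();
    return false;
}
```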
If not, is there a way to "store" (and restore) the complete status of an input stream: state and data ?
If you mean to entirely restore the state you were at, including the characters, the answer is: sure there is. ...but no easy way. For example, you could implement a filtering stream buffer and put back characters as described above to restore the sequence to be read (or support seeking or explicitly setting a mark in the stream). For some streams you can use seeking but not all streams support this. For example, std::cin typically doesn't support seeking.
Restoring the characters is only half the story, though. The other stuff you want to restore are the state flags and any formatting data. In fact, if the stream went into a failed or even bad state you need to clear the state flags before the stream will do most operations (although I think the formatting stuff can be reset anyway):
std::istream fmt(0); // doesn't have a default constructor: create an invalid stream
fmt.copyfmt(in); // save the current format settings
// use in
in.copyfmt(fmt); // restore the original format settings
The function copyfmt() copies all fields associated with the stream which are related to formatting. These are:
the locale
the fmtflags
the information storage iword() and pword()
the stream's events
the exceptions
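not the stream's state, though: rdstate() and rdbuf() are left untouched by copyfmt(), so save the state flags separately if you need them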
If you don't know about most of them don't worry: most stuff you probably won't care about. Well, until you need it but by then you have hopefully acquired some documentation and read about it (or ask and got a good response).
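Here is a minimal sketch of the save/restore pattern with copyfmt(). The function name read_hex_then_restore is made up, and a std::ios object without a stream buffer is used as the holder, which is an alternative to the std::istream holder shown above.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <sstream>

// Temporarily switch the stream to hexadecimal input, then put the
// original formatting settings back.
int read_hex_then_restore(std::istream& in) {
    std::ios saved(nullptr);  // holder without a stream buffer
    saved.copyfmt(in);        // save the current format settings

    int value = 0;
    in >> std::hex >> value;  // std::hex stays set on the stream...

    in.copyfmt(saved);        // ...until the saved settings are restored
    return value;
}
```

After the call, a following in >> n reads decimal numbers again.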
What differences are they between failbit and badbit ? When should we use one or the other ?
Finally a short and simple one:
failbit is set when formatting errors are detected, e.g. a number is expected but the character 'T' is found.
badbit is set when something goes wrong in the stream's infrastructure. For example, when the stream buffer isn't set (as in the stream fmt above) the stream has std::ios_base::badbit set. The other reason is if an exception is thrown (and caught by way of the exceptions() mask; by default all exceptions are caught).
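The two situations can be reproduced in a few lines. The function names below are invented for this sketch; each returns the state flags of a stream after the situation described.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <sstream>
#include <string>

// State flags after trying to parse `text` as an int.
std::ios_base::iostate state_after_int_parse(const std::string& text) {
    std::istringstream in(text);
    int n = 0;
    in >> n;
    return in.rdstate();
}

// State flags of a stream constructed without a stream buffer.
std::ios_base::iostate state_without_buffer() {
    std::istream in(nullptr);
    return in.rdstate();
}
```

Parsing "T" as an int is a formatting error and yields failbit only; a stream without a buffer starts out with badbit.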
Is there any online reference (or a book) that explains deeply how to deal with iostreams ? not just the basic stuff: the complete error handling.
Ah, yes, glad you asked. You probably want to get Nicolai Josuttis's "The C++ Standard Library". I know that this book describes all the details because I contributed to writing it. If you really want to know everything about IOStreams and locales you want Angelika Langer & Klaus Kreft's "IOStreams and Locales". In case you wonder where I got the information from originally: this was Steve Teale's "IOStreams". I don't know if this book is still in print, and it lacks a lot of the stuff which was introduced during standardization. Since I implemented my own version of IOStreams (and locales) I know about the extensions as well, though.
So here are my questions:
Q: Where is my mistake here ?
I would not call your technique a mistake. It is absolutely fine.
When you read data from a stream you normally already know the objects coming off that stream. If the objects have multiple interpretations, then that either needs to be encoded into the stream or you need to be able to roll back the stream.
Q: Should one write his calls to operator>> to leave the initial data still available after a failure?
Failure state should be there only if something really bad went wrong.
In your case if you are expecting a foobar (that has two representations) you have a choice:
Mark the type of object that is coming in the stream with some prefix data.
In the foobar parsing section use tellg() and seekg() (the stream counterparts of ftell() and fseek()) to restore the stream position.
Try:
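std::streampos point = is.tellg(); // remember where parsing started
if (is >> foo)
{
foobar = foo;
}
else
{
is.clear(); // clear failbit first, otherwise seekg() does nothing
is.seekg(point); // rewind to the remembered position
Bar bar;
if (is >> bar)
{
foobar = bar;
}
}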
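A self-contained version of this rollback idea might look like the sketch below. try_extract is a hypothetical helper, and it only works on streams that support seeking (for example string streams and files, but typically not std::cin).

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Try to extract a T. On failure, clear the error flags and rewind the
// stream to where the attempt started, so another parser can try.
template <typename T>
bool try_extract(std::istream& in, T& out) {
    const std::istream::pos_type start = in.tellg();
    T tmp{};
    if (in >> tmp) {
        out = tmp;
        return true;
    }
    in.clear();       // must come first: seekg() is a no-op on a failed stream
    in.seekg(start);  // fails (sets failbit) if the stream is not seekable
    return false;
}
```

With the input "-x 12", the attempt to read an int consumes the '-' before failing; after the rollback the same text can still be read as the word "-x".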
Q: If so, how can I do that efficiently ?
I prefer method 1 where you know the type on the stream.
Method two can be used when this is unknowable.
Q: If not, is there a way to "store" (and restore) the complete status of an input stream: state and data ?
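Yes, but it requires two calls:
std::ios_base::iostate state = stream.rdstate(); // save the state flags
std::istream holder(nullptr); // std::istream has no default constructor
holder.copyfmt(stream); // save the formatting settings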
Q: What differences are they between failbit and badbit ?
From the documentation to the call fail():
failbit: is generally set by an input operation when the error was related to the internal logic of the operation itself, so other operations on the stream may be possible.
badbit: is generally set when the error involves the loss of integrity of the stream, which is likely to persist even if a different operation is performed on the stream. badbit can be checked independently by calling member function bad.
Q: When should we use one or the other ?
You should be setting failbit.
This means that your operation failed. If you know how it failed then you can reset and try again.
badbit is for when you accidentally mash internal members of the stream or do something so bad that the stream object itself is completely broken.
When you serialize your FooBar you should have a flag indicating which one it is, which will be the "header" for your write/read.
When you read it back, you read the flag then read in the appropriate datatype.
And yes, it is safest to read first into a temporary object then move the data. You can sometimes optimise this with a swap() function.
I have been reading a few articles on some sites about formatted and unformatted I/O, but I am more confused now than before.
I know this is a very basic question, but could anyone give a link (to some site or a previously answered question on Stack Overflow) which explains the idea of streams in C and C++?
Also, I would like to know about formatted and unformatted I/O.
The standard doesn't define what these terms mean, it just says which of the functions defined in the standard are formatted IO and which are not. It places some requirements on the implementation of these functions.
Formatted IO is simply the IO done using the << and >> operators. They are meant to be used with text representation of the data, they involve some parsing, analyzing and conversion of the data being read or written. Formatted input skips whitespace:
Each formatted input function begins execution by constructing an object of class sentry with the noskipws (second) argument false.
Unformatted IO reads and writes the data just as a sequence of 'characters' (with possibly applying the codecvt of the imbued locale). It's meant to read and write binary data, or function as a lower-level used by the formatted IO implementation. Unformatted input doesn't skip whitespace:
Each unformatted input function begins execution by constructing an object of class sentry with the default argument noskipws (second) argument true.
And allows you to retrieve the number of characters read by the last input operation using gcount():
Returns: The number of characters extracted by the last unformatted input member function called for the object.
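The two differences mentioned here (whitespace skipping and gcount()) can be seen in a short sketch. All names below are invented for this illustration.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <sstream>
#include <string>

// Formatted input: operator>> skips leading whitespace.
char first_formatted(const std::string& text) {
    std::istringstream in(text);
    char c = 0;
    in >> c;
    return c;
}

// Unformatted input: get() returns the very next character.
char first_unformatted(const std::string& text) {
    std::istringstream in(text);
    char c = 0;
    in.get(c);
    return c;
}

struct Counts {
    std::streamsize after_read;
    std::streamsize after_ignore;
    std::streamsize after_getline;
};

// gcount() reports what the most recent unformatted operation extracted.
Counts gcount_demo(std::istream& in) {
    Counts c{};
    char buf[16];
    in.read(buf, 5);
    c.after_read = in.gcount();
    in.ignore(1);
    c.after_ignore = in.gcount();
    in.getline(buf, sizeof buf);
    c.after_getline = in.gcount();
    return c;
}
```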
Formatted IO means that your output is determined by a "format string", that means you provide a string with certain placeholders, and you additionally give arguments that should be used to fill these placeholders:
const char *daughter_name = "Lisa";
int daughter_age = 5;
printf("My daughter %s is %d years old\n", daughter_name, daughter_age);
The placeholders in the example are %s, indicating that this shall be substituted using a string, and %d, indicating that this is to be replaced by a signed integer number. There are a lot more options that give you control over how the final string will present itself. It's a convenience for you as the programmer, because it relieves you from the burden of converting the different data types into a string and it additionally relieves you from string appending operations via strcat or anything similar.
Unformatted IO on the other hand means you simply write character or byte sequences to a stream, not using any format string while you are doing so.
Which brings us to your question about streams. The general concept behind "streaming" is that you don't have to load a file or whatever input as a whole all the time. For small data this does work though, but imagine you need to process terabytes of data - no way this will fit into a single byte array without your machine running out of memory. That's why streaming allows you to process data in smaller chunks, one at a time, one after the other, so that at any given time you just have to deal with a fixed-size amount of data. You read the data into a helper variable over and over again and process it, until your underlying stream tells you that you are done and there is no more data left.
The same works on the output side, you write your output step for step, chunk for chunk, rather than writing the whole thing at once.
This concept brings other nice features, too. Because you can nest streams within streams within streams, you can build a whole chain of transformations, where each stream may modify the data until you finally receive the end result, not knowing about the single transformations, because you treat your stream as if there were just one.
This can be very useful, for example C or C++ streams buffer the data that they read natively from e.g. a file to avoid unnecessary calls and to read the data in optimized chunks, so that the overall performance will be much better than if you would read directly from the file system.
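As a rough sketch of chunk-wise processing in C++, the function below reads a stream through a fixed-size buffer and only keeps running totals, so memory use does not grow with the input. Stats and scan_in_chunks are hypothetical names, and the default chunk size of 4096 bytes is an arbitrary choice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <sstream>
#include <vector>

struct Stats {
    std::size_t bytes = 0;
    std::size_t newlines = 0;
};

// Process the stream one fixed-size chunk at a time.
Stats scan_in_chunks(std::istream& in, std::size_t chunk_size = 4096) {
    Stats stats;
    std::vector<char> buf(chunk_size);
    for (;;) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize got = in.gcount();  // may be a partial chunk
        if (got <= 0) {
            break;  // nothing left to read
        }
        stats.bytes += static_cast<std::size_t>(got);
        stats.newlines += static_cast<std::size_t>(
            std::count(buf.begin(), buf.begin() + got, '\n'));
    }
    return stats;
}
```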
Unformatted Input/Output is the most basic form of input/output. Unformatted input/output transfers the internal binary representation of the data directly between memory and the file. Formatted output converts the internal binary representation of the data to ASCII characters which are written to the output file. Formatted input reads characters from the input file and converts them to internal form. Formatted