strange eof flags in stream


I have encountered a strange problem while parsing a text file with a C++ file stream. Here is the code:
while (true)
{
std::getline(inFile, line);
if (!inFile.good())
{
std::cout << "Fail, bad and eof flags:" << inFile.fail() << inFile.bad() << inFile.eof() << std::endl;
break;
}
parseLine(line);
}
When the read terminates, the output is:
Fail, bad and eof flags:001
But the reader has not actually reached the end of the file. I opened the file and found that the next character is 26 (ASCII code). So my questions are: 1) why is the eof flag set when reading this character, and how can I avoid this kind of false termination? and 2) how do I recover from this state? Thanks!
PS: thanks for the replies. What if I read the file in binary mode? Is there a better solution? I am on Windows, but the file seems to be a Unix file.

why the eof flag is set when reading this character
Because it's the EOF marker character.
From Wikipedia:
In Microsoft's DOS and Windows (and in CP/M and many DEC operating
systems), reading from the terminal will never produce an EOF.
Instead, programs recognize that the source is a terminal (or other
"character device") and interpret a given reserved character or
sequence as an end-of-file indicator; most commonly this is an ASCII
Control-Z, code 26.
how to avoid this kind of false termination
It's not a "false" termination.
how to recover from this state?
You don't need to.
If you were trying to read a "binary file" where arbitrary characters would be expected, you would open your file stream in binary mode.
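As a rough sketch of the binary-mode suggestion (countLinesBinary is a hypothetical helper name, not something from the question): opening the stream with std::ios::binary asks the runtime not to apply text-mode translation, so on Windows a 0x1A byte should come through as ordinary data. On platforms without text-mode translation, both modes behave the same.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: counts the lines of a file opened in binary mode,
// so that a 0x1A (Ctrl-Z) byte is read as data, not as end of input.
std::size_t countLinesBinary(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::size_t count = 0;
    std::string line;
    // Test the stream after each read instead of calling good() beforehand.
    while (std::getline(in, line)) {
        // Binary mode does no "\r\n" translation, so strip a trailing '\r'.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        ++count;
    }
    return count;
}
```

The price of binary mode is that line-ending translation is switched off as well, which is why the sketch strips a trailing '\r' itself.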

The ASCII character 26 is the SUB control character, which in caret notation is ^Z. This might be recognizable to you as the Windows end of file character. So assuming ASCII and Windows, there you go.

Here you go:
Getline and 16h (26d) character
Looks like you have to write your own getline function. Seems there is no way around it :p That I know of, and it seems no one else knows. If anyone knows a better way, chime in.

Related

Is there a character that can be used as a delimiter for EOF?

I wrote a function that spellchecks a line read from a file which takes in a file stream and a delimiter as parameters. My problem is that the function requires a delimiter, but when reading in the last line, I haven't got one. I would use the last character of the file, but I need that last character for spellcheck purposes.
Is there any way to use the EOF macro as a delimiter?

Typically, you would let the stream tell you when it has received an EOF signal in whatever platform-dependent way is appropriate (be that the end of a file, or Ctrl+D on a Linux terminal emulator).
So, stop reading when you hit your custom delimiter, or when an attempt to read from the stream sets the stream's EOF bit. You ought to be checking the stream's state anyway — what if there's an error? You'll be looping forever at the moment.
That's how std::getline and co do it, anyway.

How Can I Detect That a Binary File Has Been Completely Consumed?

If I do this:
ofstream output("foo.txt");
output << 13;
output.close();
ifstream input("foo.txt");
int dummy;
input >> dummy;
cout << input.good() << endl;
I'll get the result: "0"
However if I do this:
ofstream output("foo.txt", ios_base::binary);
auto dummy = 13;
output.write(reinterpret_cast<const char*>(&dummy), sizeof(dummy));
output.close();
ifstream input("foo.txt", ios_base::binary);
input.read(reinterpret_cast<char*>(&dummy), sizeof(dummy));
cout << input.good() << endl;
I'll get the result: "1"
This is frustrating to me. Do I have to resort to inspecting the ifstream's buffer to determine whether it has been entirely consumed?

Regarding
How Can I Detect That a Binary File Has Been Completely Consumed?
A slightly inefficient but easy to understand way is to measure the size of the file:
ifstream input("foo.txt", ios_base::binary);
input.seekg(0, ios_base::end); // go to end of the file
auto filesize = input.tellg(); // current position is the size of the file
input.seekg(0, ios_base::beg); // go back to the beginning of the file
Then check current position whenever you want:
if (input.tellg() == filesize)
cout << "The file was consumed";
else
cout << "Some stuff left in the file";
This way has some disadvantages:
Not efficient - goes back and forth in the file
Doesn't work with special files (e.g. pipes)
Doesn't work if the file is changed (e.g. you open your file in read-write mode)
Only works for binary files (seems your case, so OK), not text files
So better just use the regular way people do it, that is, try to read and bail if it fails:
if (input.read(reinterpret_cast<char*>(&dummy), sizeof(dummy)))
cout << "I have read the stuff, will work on it now";
else
cout << "No stuff in file";
Or (in a loop)
while (input.read(reinterpret_cast<char*>(&dummy), sizeof(dummy)))
{
cout << "Working on your stuff now...";
}

You are doing totally different things.
The operator>> is greedy and will read as much as possible into dummy. It so happens that while doing so, it runs into the end of file. That sets the input.eof(), and the stream is no longer good(). As it did find some digits before the end, the operation is still successful.
In the second read, you ask for a specific number of bytes (4 most likely) and the read is successful. So the stream is still good().
The stream interface doesn't predict the outcome of any future I/O, because in the general case it cannot know. If you use cin instead of input there might now be more to read, if the user continued typing.
Specifically, the eof() state doesn't appear until someone tries to read past end-of-file.
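To see the difference without touching the file system, here is a small sketch that uses std::istringstream in place of the files (StreamState and both function names are invented for this illustration):

```cpp
#include <cstddef>
#include <ios>
#include <sstream>
#include <string>

// Snapshot of the state flags after one read.
struct StreamState {
    bool good;
    bool eof;
    bool fail;
};

// Formatted extraction: operator>> keeps reading digits until it sees
// something that is not a digit, which may be the end of the input.
StreamState stateAfterFormattedRead(const std::string& text) {
    std::istringstream in(text);
    int value = 0;
    in >> value;
    return {in.good(), in.eof(), in.fail()};
}

// Unformatted extraction: read() asks for exactly n bytes and never
// looks past them.
StreamState stateAfterExactRead(const std::string& bytes, std::size_t n) {
    std::istringstream in(bytes);
    std::string buffer(n, '\0');
    in.read(&buffer[0], static_cast<std::streamsize>(n));
    return {in.good(), in.eof(), in.fail()};
}
```

With the input "13", the formatted read succeeds but leaves eof set; with "13\n" the stream stays good. An exact read of 4 bytes from "abcd" stays good, and only asking for a fifth byte sets eof and fail.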

For text streams, as you have written only the integer value, with neither a space nor an end of line after it, at read time the library must try to read one character past the 1 and the 3, and it hits the end of file. So the good bit is false and eof is true.
For binary streams, you have written 4 bytes (sizeof(int)), assuming ints are 32 bits wide, and you read 4 bytes. Fine. No problem has occurred yet, so the good bit is true and eof is false. Only the next read will hit the end of file.
But beware. In the text example, if you open the text file in an editor and simply save it without changing anything, chances are that the editor automatically adds an end of line. In that case, the read will stop at the end of line and, as in the binary case, the good bit will be true and eof false. The same happens if you write with output << 13 << std::endl;
All that means that you must never assume that the element you just read is not the last one in the file just because good is true and eof is false: the end of file may be hit only on the next read, even if nothing is returned then.
TL/DR: the only foolproof way to know that there is nothing left in a file is when you are no longer able to read something from it.

You do not need to resort to inspecting the buffer. You can determine whether the whole file has been consumed: cout << (input.peek() != char_traits<char>::eof()) << endl; prints 0 once nothing is left. This uses peek, which:
Reads the next character from the input stream without extracting it
good in the case of the example is:
Returning false after the last extraction operation, which occurs because the int extraction operator has to read until it finds a character that is not a digit. In this case that is the end of the file, and hitting it, even while looking for a delimiter, sets the stream's eofbit, so good returns false
Returning true after calling read, because read extracts exactly sizeof(int) bytes, so the end of the file is never touched, leaving the stream's eofbit unset and good returning true
peek can be used after either of these and will correctly return char_traits<char>::eof() in both cases. Effectively this is inspecting the buffer for you, but with one vital distinction for binary files: if you were to inspect a binary file yourself you'd find that it may contain bytes that look like EOF once stored in a char. (EOF is typically -1, which truncated to a char is 0xFF; the binary representation of the int -1 contains four such bytes.) If you are inspecting the buffer's next char you won't know whether that's actually the end of the file or not.
peek doesn't just return a char though, it returns an int_type. If peek returns 0x000000FF then you're looking at a 0xFF byte, not the end of the file. If peek returns char_traits<char>::eof() (typically 0xFFFFFFFF) then you're looking at the end of the file.

std::getline deal with \n, \r and \r\n [duplicate]

Specifically I'm interested in istream& getline ( istream& is, string& str );. Is there an option to the ifstream constructor to tell it to convert all newline encodings to '\n' under the hood? I want to be able to call getline and have it gracefully handle all line endings.
Update: To clarify, I want to be able to write code that compiles almost anywhere, and will take input from almost anywhere. Including the rare files that have '\r' without '\n'. Minimizing inconvenience for any users of the software.
It's easy to workaround the issue, but I'm still curious as to the right way, in the standard, to flexibly handle all text file formats.
getline reads in a full line, up to a '\n', into a string. The '\n' is consumed from the stream, but getline doesn't include it in the string. That's fine so far, but there might be a '\r' just before the '\n' that gets included into the string.
There are three types of line endings seen in text files:
'\n' is the conventional ending on Unix machines, '\r' was (I think) used on old Mac operating systems, and Windows uses a pair, '\r' followed by '\n'.
The problem is that getline leaves the '\r' on the end of the string.
ifstream f("a_text_file_of_unknown_origin");
string line;
getline(f, line);
if(!f.fail()) { // a line was read (it may be empty)
// BUT, there might be an '\r' at the end now.
}
Edit Thanks to Neil for pointing out that f.good() isn't what I wanted. !f.fail() is what I want.
I can remove it manually myself (see edit of this question), which is easy for the Windows text files. But I'm worried that somebody will feed in a file containing only '\r'. In that case, I presume getline will consume the whole file, thinking that it is a single line!
.. and that's not even considering Unicode :-)
.. maybe Boost has a nice way to consume one line at a time from any text-file type?
Edit I'm using this to handle the Windows files, but I still feel I shouldn't have to! And this won't work for the '\r'-only files.
if(!line.empty() && *line.rbegin() == '\r') {
line.erase( line.length()-1, 1);
}

As Neil pointed out, "the C++ runtime should deal correctly with whatever the line ending convention is for your particular platform."
However, people do move text files between different platforms, so that is not good enough. Here is a function that handles all three line endings ("\r", "\n" and "\r\n"):
std::istream& safeGetline(std::istream& is, std::string& t)
{
t.clear();
// The characters in the stream are read one-by-one using a std::streambuf.
// That is faster than reading them one-by-one using the std::istream.
// Code that uses streambuf this way must be guarded by a sentry object.
// The sentry object performs various tasks,
// such as thread synchronization and updating the stream state.
std::istream::sentry se(is, true);
std::streambuf* sb = is.rdbuf();
for(;;) {
int c = sb->sbumpc();
switch (c) {
case '\n':
return is;
case '\r':
if(sb->sgetc() == '\n')
sb->sbumpc();
return is;
case std::streambuf::traits_type::eof():
// Also handle the case when the last line has no line ending
if(t.empty())
is.setstate(std::ios::eofbit);
return is;
default:
t += (char)c;
}
}
}
And here is a test program:
int main()
{
std::string path = ... // insert path to test file here
std::ifstream ifs(path.c_str());
if(!ifs) {
std::cout << "Failed to open the file." << std::endl;
return EXIT_FAILURE;
}
int n = 0;
std::string t;
while(!safeGetline(ifs, t).eof())
++n;
std::cout << "The file contains " << n << " lines." << std::endl;
return EXIT_SUCCESS;
}

Are you reading the file in BINARY or in TEXT mode? In TEXT mode the pair carriage return/line feed, CRLF, is interpreted as the TEXT end of line, or end-of-line character, but in BINARY you fetch only ONE byte at a time, which means that either character MUST be ignored and left in the buffer to be fetched as another byte! Carriage return means, on the typewriter, that the typewriter carriage, where the printing arm sits, has reached the right edge of the paper and is returned to the left edge. This is a very mechanical model, that of the mechanical typewriter. Then the line feed means that the paper roll is rotated a little bit up so the paper is in position to begin another line of typing. As far as I remember one of the low digits in ASCII means move to the right one character without typing, the dead char, and of course \b means backspace: move the carriage one character back. That way you can add special effects, like underlining (type underscore), strikethrough (type minus), approximate different accents, cancel out (type X), without needing an extended keyboard, just by adjusting the position of the carriage along the line before entering the line feed. So you can use byte sized ASCII voltages to automatically control a typewriter without a computer in between. When the automatic typewriter is introduced, AUTOMATIC means that once you reach the farthest edge of the paper, the carriage is returned to the left AND the line feed applied, that is, the carriage is assumed to be returned automatically as the roll moves up! So you do not need both control characters, only one, the \n, new line, or line feed.
This has nothing to do with programming but ASCII is older and HEY! looks like some people were not thinking when they began doing text things! The UNIX platform assumes an electrical automatic typewriter; the Windows model is more complete and allows for control of mechanical machines, though some control characters become less and less useful in computers, like the bell character, 0x07 if I remember well... Some forgotten texts must have been originally captured with control characters for electrically controlled typewriters and it perpetuated the model...
Actually the correct variation would be to split on the \r, the carriage return, and treat a following \n, the line feed, as optional, hence:
char c;
ifstream is;
is.open("",ios::binary);
...
is.getline(buffer, bufsize, '\r');
//ignore following \n or restore the buffer data
if ((c=is.get())!='\n') is.rdbuf()->sputbackc(c);
...
would be the most correct way to handle all types of files. Note however that \n in TEXT mode on Windows is actually the byte pair 0x0d 0x0a, but 0x0d IS just \r: \n includes \r in TEXT mode but not in BINARY, so \n and \r\n are equivalent... or should be. This is a very basic industry confusion actually, typical industry inertia, as the convention is to speak of CRLF, in ALL platforms, then fall into different binary interpretations. Strictly speaking, files including ONLY 0x0d (carriage return) as being \n (CRLF or line feed), are malformed in TEXT mode (typewriter machine: just return the carriage and strikethrough everything...), and are a non-line oriented binary format (either \r or \r\n meaning line oriented) so you are not supposed to read as text! The code ought to fail maybe with some user message. This does not depend on the OS only, but also on the C library implementation, adding to the confusion and possible variations... (particularly for transparent UNICODE translation layers adding another point of articulation for confusing variations).
The problem with the previous code snippet (mechanical typewriter) is that it is very inefficient if there are no \n characters after \r (automatic typewriter text). Then it also assumes BINARY mode where the C library is forced to ignore text interpretations (locale) and give away the sheer bytes. There should be no difference in the actual text characters between both modes, only in the control characters, so generally speaking reading BINARY is better than TEXT mode. This solution is efficient for BINARY mode typical Windows OS text files independently of C library variations, and inefficient for other platform text formats (including web translations into text). If you care about efficiency, the way to go is to use a function pointer, make a test for \r vs \r\n line controls however way you like, then select the best getline user-code into the pointer and invoke it from it.
Incidentally I remember I found some \r\r\n text files too... which translates into double line text just as is still required by some printed text consumers.

The C++ runtime should deal correctly with whatever the endline convention is for your particular platform. Specifically, this code should work on all platforms:
#include <string>
#include <iostream>
using namespace std;
int main() {
string line;
while( getline( cin, line ) ) {
cout << line << endl;
}
}
Of course, if you are dealing with files from another platform, all bets are off.
As the two most common platforms (Linux and Windows) both terminate lines with a newline character, with Windows preceding it with a carriage return, you can examine the last character of the line string in the above code to see if it is \r and if so remove it before doing your application-specific processing.
For example, you could provide yourself with a getline style function that looks something like this (not tested, use of indexes, substr etc for pedagogical purposes only):
istream & safegetline( istream & is, string & line ) {
string myline;
if ( getline( is, myline ) ) {
// drop the '\r' that a Windows "\r\n" ending leaves behind
if ( myline.size() && myline[myline.size()-1] == '\r' ) {
line = myline.substr( 0, myline.size() - 1 );
}
else {
line = myline;
}
}
return is;
}

One solution would be to first search and replace all line endings to '\n' - just like e.g. Git does by default.
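A minimal sketch of that normalization step, assuming the whole text is already in memory and was read in binary mode so that no translation has happened yet (normalizeNewlines is a hypothetical name):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: rewrites "\r\n" and lone "\r" endings as "\n".
std::string normalizeNewlines(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\r') {
            // Treat "\r\n" as a single line ending: skip the '\n'.
            if (i + 1 < text.size() && text[i + 1] == '\n') ++i;
            out += '\n';
        } else {
            out += text[i];
        }
    }
    return out;
}
```

After this pass, plain std::getline splits the text the same way regardless of where the file came from.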

Other than writing your own custom handler or using an external library, you are out of luck. The easiest thing to do is to check whether the last character of the line, line[line.length() - 1], is '\r' (after making sure the line is not empty). On Linux this check is superfluous for native files, since their lines end with a bare '\n', so it costs a little time if it runs in a loop. On Windows it is also superfluous for native files read in text mode. However, what about classic Mac files, whose lines end in '\r' alone? std::getline would not work for those files on Linux or Windows: it splits on '\n', which both '\n' and '\r\n' endings contain, but a '\r'-only file contains no '\n' at all, so the whole file comes back as a single line. Of course, there are also the numerous EBCDIC systems, something that most libraries won't dare tackle.
Checking for '\r' is probably the best solution to your problem. Reading in binary mode would allow you to check for all three common line endings ('\r', '\r\n' and '\n'). If you only care about Linux and Windows as old-style Mac line endings shouldn't be around for much longer, check for '\n' only and remove the trailing '\r' character.

Unfortunately the accepted solution does not behave exactly like std::getline(). To obtain that behavior (in my tests), the following change is necessary:
std::istream& safeGetline(std::istream& is, std::string& t)
{
t.clear();
// The characters in the stream are read one-by-one using a std::streambuf.
// That is faster than reading them one-by-one using the std::istream.
// Code that uses streambuf this way must be guarded by a sentry object.
// The sentry object performs various tasks,
// such as thread synchronization and updating the stream state.
std::istream::sentry se(is, true);
std::streambuf* sb = is.rdbuf();
for(;;) {
int c = sb->sbumpc();
switch (c) {
case '\n':
return is;
case '\r':
if(sb->sgetc() == '\n')
sb->sbumpc();
return is;
case std::streambuf::traits_type::eof():
is.setstate(std::ios::eofbit); //
if(t.empty()) // <== change here
is.setstate(std::ios::failbit); //
return is;
default:
t += (char)c;
}
}
}
According to https://en.cppreference.com/w/cpp/string/basic_string/getline:
Extracts characters from input and appends them to str until one of the following occurs (checked in the order listed)
end-of-file condition on input, in which case, getline sets eofbit.
the next available input character is delim, as tested by Traits::eq(c, delim), in which case the delimiter character is extracted from input, but is not appended to str.
str.max_size() characters have been stored, in which case getline sets failbit and returns.
If no characters were extracted for whatever reason (not even the discarded delimiter), getline sets failbit and returns.
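The quoted rules can be checked with a small sketch on std::istringstream (LineResult and getlineOnce are names invented for this illustration):

```cpp
#include <sstream>
#include <string>

// Result of a single std::getline call on a fresh stream.
struct LineResult {
    std::string line;
    bool eof;
    bool fail;
};

// Hypothetical helper: runs std::getline once on `input` and reports
// which state bits were set afterwards.
LineResult getlineOnce(const std::string& input) {
    std::istringstream in(input);
    LineResult result;
    std::getline(in, result.line);
    result.eof = in.eof();
    result.fail = in.fail();
    return result;
}
```

For "abc" (no trailing newline) the call succeeds with eof set; for "abc\n" neither bit is set; for an empty input both eof and fail are set, which is the case the change above reproduces.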

If it is known how many items/numbers each line has, one could read one line with e.g. 4 numbers as
string num;
is >> num >> num >> num >> num;
This also works with other line endings.

Having problems with 0x0A character in C++ even in binary mode. (interprets it as new file)

Hi, this might seem a bit noobish, but here we go. I'm developing a program that downloads leaderboards of a certain game from the internet and transforms them into a proper format to work with (elaborate rankings, etc).
The files contain the names, ordered by rank, but between each name there are 7 random control codes (obviously unprintable). The txt file looks like this:
..C...hName1..)...&Name2......)Name3..é...þName4..Ü...†Name5..‘...QName6..~...bName7..H...NName8..|....Name9..v...HName10.
I checked with a hex editor and saw that the first control code after each name is always a null character (0x00). So, what I do is read everything, and then cout every character. When a 0x00 character is found, skip 7 characters and keep couting. Therefore you end up with the list, right?
At first I had the problem that among those random control codes you would sometimes find a "soft EOF" (0x1A), and the program would stop reading there. So I finally figured out that I should open the file in binary mode. It worked, and then everything would be couted... or that's what I thought.
But I came across another file which still didn't work, and finally found out that there was an EOF character! (0x0A) Which doesn't make sense since I'm opening it in binary mode. But still, after reading that character, C++ interprets that as a new file, and hence skips 7 characters, so the name after that character will always appear cut.
Here's my current code:
#include <cstdlib>
#include <iostream>
#include <fstream>
using namespace std;
int main () {
string scores;
system("wget http://certainwebsite/001.txt"); //download file
ifstream highin ("001.txt", ios::binary);
ofstream highout ("board.txt", ios::binary);
if (highin.is_open())
{
while ( highin.good() )
{
getline (highin, scores);
for (int i=0;i<scores.length(); i++)
{
if (scores[i]==0x00){
i=i+7; //skip 7 characters if 'null' is found
cout << endl;
highout << endl;
}
cout << scores[i];
highout << scores[i]; //cout names and save them in output file
}
}
highin.close();
}
else cout << "Unable to open file";
system("pause>nul");
}
Not sure how to ignore that character if being already in binary mode doesn't work. Sorry for the long question but I wanted to be detailed and specific. In this case, the EOF character is located before the Name3, and hence this is how the output looks like:
http://i.imgur.com/yu1NjoZ.png

By default getline() reads until the end of line and discards the newline character. However, the delimiter character could be customized (by supplying the third parameter). If you wish to read until the null character (not until the end of line), you could try using getline (highin, scores, '\0'); (and adjusting the logic of skipping the characters).

I'm glad you figured it out and it doesn't surprise me that getline() was the culprit. I had a similar issue dealing with the newline character when I was trying to read in a CSV file. There are several different getline() functions in C++ depending on how you call the function and each seems to handle the newline character differently.
As a side note, in your for loop, I'd recommend against performing a method call in your test. That adds unnecessary overhead to the loop. It'd be better to call the method once and put that value into a variable, then enter the loop and test i against the length variable. Unless you expect the length to change, calling the length() method each iteration is a waste of system resources.

Thank you all guys, it worked, it was the getline() which was giving me problems indeed. Due to the 'while' loop, each time it found a new line character, it restarted the process, hence skipping those 7 characters.

std::getline and eol vs eof

I've got a program that is tailing a growing file.
I'm trying to avoid grabbing a partial line from the file (e.g. reading before the line is completely written by the other process.) I know it's happening in my code, so I'm trying to catch it specifically.
Is there a sane way to do this?
Here's what I'm trying:
if (getline (stream, logbuffer))
{
if (stream.eof())
{
cout << "Partial line found!" << endl;
return false;
}
return true;
}
return false;
However, I can't easily reproduce the problem so I'm not sure I'm detecting it with this code. std::getline strips off newlines, so I can't check the buffer for a trailing newline. My log message (above) is NEVER tripping.
Is there some other way of trying to check what I want to detect? Is there a way to know if the last line I read hit EOF without finding an EOL character?
Thanks.

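This will only be true when the line read has no trailing newline:
if (getline (stream, logbuffer))
{
if (stream.eof())
{
/// reached only if getline hit end of file before finding a newline
If getline() extracted at least one character and then ran into the end of the file, it sets eofbit but not failbit, so the call succeeds and eof() is true. In every other successful case eof() is false. The eof() and related state tests only report the results of a previous read operation such as getline(); they do not predict what the next read will do. Once eofbit is set you also have to call stream.clear() before the stream will read any data the other process appends later.
As far as I know, there is no way to prevent a partial line from being read in the first place. However, if the other process writes a line at a time, the problems you say you are experiencing should be very rare (non-existent in my experience), depending to some extent on the OS you are using. I suspect the problem lies elsewhere, probably in your code. Tailing a file is a very common thing to do, and one does not normally need to resort to special code to do it.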
However, should you find you do need to read partial lines, the basic algorithm is as follows:
forever do
wait for file change
read all possible input using read or readsome (not getline)
chop input into lines and possible partial line
process as required
end
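The "chop input into lines and possible partial line" step might be sketched like this. LineAssembler is a hypothetical class, and waiting for file changes and reading the raw chunks are left out because they depend on the platform:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: collects raw chunks and hands back only complete
// lines, holding on to any trailing partial line until its '\n' arrives.
class LineAssembler {
public:
    std::vector<std::string> feed(const std::string& chunk) {
        pending_ += chunk;
        std::vector<std::string> lines;
        std::size_t start = 0;
        for (;;) {
            std::size_t nl = pending_.find('\n', start);
            if (nl == std::string::npos) break;
            lines.push_back(pending_.substr(start, nl - start));
            start = nl + 1;
        }
        // Keep whatever follows the last newline for the next call.
        pending_.erase(0, start);
        return lines;
    }

    // The partial line seen so far (empty if the input ended on '\n').
    const std::string& partial() const { return pending_; }

private:
    std::string pending_;
};
```

Feeding "ab\ncd" returns only "ab" and keeps "cd" pending; a later chunk "e\n" then completes the line as "cde".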

An istream object such as std::cin has a get function that stops reading when it gets to a newline without extracting it from the stream. You could then peek() or get() it to see if indeed it is a newline. The catch is that you have to know the maximum length of a line coming from the other application. Example (untested) code follows below:
char buf[81]; // assumes an 80-char line length + null char
memset(buf, 0, 81);
if (cin.get(buf, 81))
{
if (cin.peek() == EOF) // You ran out of data before hitting end of line
{
cout << "Partial line found!\n";
}
}

I have to take issue with one statement you made here:
However, I can't easily reproduce the problem so I'm not sure I'm detecting it with this code.
From what you said, it seems it would be extremely easy to replicate your problem, if it is what you think it is. You can easily create a text file in some text editor; just make sure that the last line ends at EOF instead of with a newline. Then point your program at that file and see what results.

Even if the other program isn't done writing the file, in the file that's where the line ends, so there's no way to tell the difference other than waiting to see if the other program writes something new.
edit: If you just want to tell if the line ends in a newline or not, you could write your own getline function that reads until it hits a newline but doesn't strip it.
