In the following block of code I have created a numbers.txt document which has the number 1 written in it. Shouldn't this program spit the word OK back an infinite number of times, since it keeps seeking past the EOF marker?
while (!sample.eof())
{
    char ch;
    sample.get(ch);              // read one character
    sample.seekp(-1L, ios::cur); // back up one byte, undoing the get
    sample >> initialnumber;     // read an integer
    sample.seekp(2L, ios::cur);  // skip two bytes forward
    cout << "OK";
}
There is no such thing as an "EOF marker".¹ EOF is simply a stream condition caused by reading at or past the end of the file. Whether you seek 1 byte or 100000 bytes past the end makes no difference: if your file position pointer is past the end, you are at or beyond the End Of File.
Your code reads a character and then backs up (essentially negating the character read). It then reads an integer and skips two characters past that. This has the effect of always moving forward in the file (even if the integer read fails). Thus, you will eventually hit EOF: there is no infinite loop here.
¹ In DOS days, files could contain the 0x1A byte ("ASCII EOF") which would cause certain text readers to stop at that byte. The file contents could physically extend beyond this byte, but text utilities could refuse to read past it. However, the standard C++ libraries treat 0x1A like any other character, and will happily read past it.
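For reference, here is a minimal sketch of the idiomatic way to read every number from the file, using the extraction itself as the loop condition instead of testing eof() up front (the stream and variable names mirror the question):

#include <fstream>
#include <iostream>
using namespace std;

int main() {
    fstream sample("numbers.txt");
    int initialnumber;
    // The loop body runs only when an integer was actually extracted,
    // so hitting EOF (or a parse failure) ends the loop cleanly.
    while (sample >> initialnumber) {
        cout << "OK";
    }
}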
(this is possibly a duplicate of Why does std::basic_istream::ignore() extract more characters than specified?, however my specific case doesn't deal with the delim)
From cppreference, the description of istream::ignore is the following:
Extracts and discards characters from the input stream until and including delim.
ignore behaves as an UnformattedInputFunction. After constructing and checking the sentry object, it extracts characters from the stream and discards them until any one of the following conditions occurs:
count characters were extracted. This test is disabled in the special case when count equals std::numeric_limits<std::streamsize>::max()
end of file condition occurs in the input sequence, in which case the function calls setstate(eofbit)
the next available character c in the input sequence is delim, as determined by Traits::eq_int_type(Traits::to_int_type(c), delim). The delimiter character is extracted and discarded. This test is disabled if delim is Traits::eof()
However, let's say I've got the following program:
#include <iostream>

int main(void) {
    int x;
    char p;
    if (std::cin >> x) {
        std::cout << x;
    } else {
        std::cin.clear();
        std::cin.ignore(2);
        std::cout << "________________";
        std::cin >> p;
        std::cout << p;
    }
}
Now, let's say I input something like p when my program starts. I expect cin to 'fail', then clear to be called, and ignore to discard 2 characters from the buffer, so the 'p' and '\n' that are left in the buffer should be discarded. However, the program still expects input after ignore gets called, so in reality it only gets to the final std::cin >> p after I've given it more than 2 characters to discard.
My issue:
Inputting something like 'b' and hitting Enter immediately after the first input (so after the 2 characters, 'p' and '\n', get discarded) keeps 'b' in the buffer and immediately passes it to cin, without first printing the message. How can I make it so that the message gets printed immediately after the two characters are discarded, and only then is the next input read?
After a lot of back and forth in the comments (and reproducing the problem myself), it's clear the problem is that:
1. You enter p<Enter>, which isn't parsable
2. You try to discard exactly two characters with ignore
3. You output the underscores
4. You prompt for the next input
but in fact things seem to stop at step 2 until you give it more input, and the underscores only appear later. Well, bad news: you're right, the code is blocking at step 2, in ignore. ignore is blocking while it waits for a third character to be entered (really, checking whether EOF follows those two characters), and by the spec, this is apparently the correct thing to do, I think?
The problem here is the same basic issue as in the question you linked, just a different manifestation. When ignore terminates because it has read the number of characters requested, it always attempts to read one more character, because it needs to know whether condition 2 might also be true (if it happened to read the last character, it can take the appropriate action: putting cin in the EOF state, or otherwise leaving the next character in the buffer for the next read):
Effects: Behaves as an unformatted input function (as described above). After constructing a sentry object, extracts characters and discards them. Characters are extracted until any of the following occurs:
n != numeric_limits<streamsize>::max() (18.3.2) and n characters have been extracted so far
end-of-file occurs on the input sequence (in which case the function calls setstate(eofbit), which may throw ios_base::failure (27.5.5.4));
traits::eq_int_type(traits::to_int_type(c), delim) for the next available input character c (in which case c is extracted).
Since you didn't provide an end character for ignore, it's looking for EOF, and if it doesn't find it after two characters, it must read one more to see if it shows up after the ignored characters (if it does, it'll leave cin in EOF state, if not, the character it peeked at will be the next one you read).
Simplest solution here is to not try to specifically discard exactly two characters. You want to get rid of everything through the newline, so do that with:
std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
instead of std::cin.ignore(2); that will read any and all characters until the newline (or EOF), consume the newline, and it won't ever overread (in the sense that it continues until the delimiter or EOF is found; there is no condition under which it finishes reading a count of characters and needs to peek further).
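For concreteness, a sketch of the original program with that one-line fix applied (note the <limits> include it needs):

#include <iostream>
#include <limits>

int main(void) {
    int x;
    char p;
    if (std::cin >> x) {
        std::cout << x;
    } else {
        std::cin.clear();
        // Discard everything up to and including the newline; this stops at
        // the delimiter, so it never needs to peek past what it consumes.
        std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
        std::cout << "________________";
        std::cin >> p;
        std::cout << p;
    }
}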
If for some reason you want to specifically ignore exactly two characters (how do you know they entered p<Enter> and not pabc<Enter>?), just call .get() on it a couple of times, or .read() two bytes into a buffer, or the like, so you read the raw characters without the possibility of peeking beyond them.
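For example (a two-line sketch; the buffer name is just illustrative):

char two_byte_buffer[2];
std::cin.read(two_byte_buffer, 2); // consumes exactly 2 characters, never peeks further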
For the record, this seems to deviate a little from the cppreference spec (which may be wrong); condition 2 in the spec doesn't specify that it needs to verify whether it is at EOF after reading count characters, and cppreference claims condition 3 (which would need to peek) is explicitly not checked if the "delimiter" is the default Traits::eof(). But the spec quote found in your other answer doesn't include that line about condition 3 not applying for Traits::eof(), and condition 2 might allow for checking whether you're at EOF, which would end up with the observed behavior.
Your problem is related to your terminal. When you press ENTER, you are most likely getting two characters -- '\r' and '\n'. Consequently, there is still one character left in the input stream to read from. Change that line to:
std::cin.ignore(10, '\n'); // 10 is not magical. You may use any number > 2
to see the behavior you are expecting.
Passing the exact number of characters in the buffer will do the trick:
std::cin.ignore(std::cin.rdbuf()->in_avail());
I want to use OpenMP to read a big file which contains lots of lines from disk. One way to do it seems to be to use the seekg() function. But the headache is that seekg() only supports moving the file position to a particular byte offset.
This works fine if the size of each line is exactly the same, but I have no idea how to do it if the size of each line is totally different.
So could you give me a hint?
One possibility:
Divide the file into equal-sized chunks based on bytes, one for each parallel task, without regard to line endings.
Have each task seek to the beginning of its chunk, then read and ignore characters until it finds a line ending, so that it can start processing the file at the beginning of a line. (As a special case, the task that starts at offset 0 should not do this, because it's already at the beginning of a line.)
When a task reaches the end of its chunk (i.e. the byte offset where the next chunk begins), continue reading past that point to the end of the current line. (As a special case, the end of the last chunk is also the end of the file, so there's nothing to read past that point.)
Basically, you initially choose boundaries based on byte offsets, but then move them forward to coincide with line endings. Each task skips some characters at the beginning of its chunk, and those characters are instead handled by another task reading past the end of the preceding chunk.
(I believe this is how Hadoop splits text-based input files by default, BTW.)
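A rough C++/OpenMP sketch of that scheme (the file name, the chunk arithmetic, and process_line are all illustrative assumptions, and error handling is omitted):

#include <fstream>
#include <string>
#include <omp.h>

void process_line(const std::string& line) { /* per-line work goes here */ }

int main() {
    const char* path = "big.txt";
    std::ifstream probe(path, std::ios::binary | std::ios::ate);
    const long long file_size = probe.tellg(); // total size in bytes
    probe.close();

    #pragma omp parallel
    {
        const long long nthreads = omp_get_num_threads();
        const long long tid = omp_get_thread_num();
        const long long begin = file_size * tid / nthreads;
        const long long end = file_size * (tid + 1) / nthreads;

        std::ifstream in(path, std::ios::binary); // each task opens its own stream
        std::string line;
        if (begin == 0) {
            in.seekg(0); // chunk 0 already starts at the beginning of a line
        } else {
            // Back up one byte and discard through the next newline: this
            // finishes the line owned by the previous chunk, and consumes
            // just the '\n' when begin already falls on a line start.
            in.seekg(begin - 1);
            std::getline(in, line);
        }
        // Process every line that starts before this chunk's end; the last
        // one may extend past `end`, which the scheme above accounts for.
        long long pos = in.tellg();
        while (pos >= 0 && pos < end && std::getline(in, line)) {
            process_line(line);
            pos = in.tellg();
        }
    }
}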
It seems well accepted that the istream::peek operation is blocking.
The standard, though arguably a bit ambiguous, leans towards nonblocking behavior. peek calls sgetc in turn, whose behavior is:
"The character at the current position of the controlled input sequence, as a value of type int.
If there are no more characters to read from the controlled input sequence, the function returns the end-of-file value (EOF)."
It doesn't say "If there are no more characters.......wait until there are"
Am I missing something here? Or are the peek implementations we use just kinda wrong?
The controlled input sequence is the file (or whatever) from which you're reading. So if you're at end of file, it returns EOF. Otherwise it returns the next character from the file.
I see nothing here that's ambiguous at all--if it needs a character that hasn't been read from the file, then it needs to read it (and wait till it's read, and return it).
If you're reading from something like a socket, then it's going to wait until data arrives (or the network stack detects EOF, such as the peer disconnecting).
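For what it's worth, a tiny sketch that makes the blocking visible when run interactively (signal EOF with Ctrl-D on Unix or Ctrl-Z on Windows):

#include <iostream>

int main() {
    std::cout << "calling peek()...\n";
    int c = std::cin.peek(); // blocks here until a character arrives or EOF is signalled
    if (c == std::istream::traits_type::eof())
        std::cout << "peek() returned EOF\n";
    else
        std::cout << "next character is '" << static_cast<char>(c) << "'\n";
}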
The description from cppreference.com might be clearer than the one in your question:
Ensures that at least one character is available in the input area by [...] reading more data in from the input sequence (if applicable)."
"if applicable" does apply in this case; and "reading data from the input sequence" entails waiting for more data if there is none and the stream is not in an EOF or other error state.
When I get confused about console input I remind myself that console input can be redirected to come from a file, so the behavior of the keyboard more or less mimics the behavior of a file. When you try to read a character from a file, you can get one of two results: you get a character, or you get EOF because you've reached the end of the file -- there are no more characters to be read. Same thing for keyboard input: either you get a character, or you get EOF because you've reached the end of the file. With a file, there is no notion of waiting for more characters: either a file has unread characters or it doesn't. Same thing for the keyboard.

So if you haven't reached EOF on the keyboard, reading a character returns the next character. You reach EOF on the keyboard by typing whatever character your system recognizes as EOF; on Unix systems that's ctrl-D, on Windows it's ctrl-Z. If you haven't reached EOF, there are more characters to be read.
I'm a beginner in C++ and trying to better understand feof(). I've read that the feof() flag is set to true only after an attempt is made to read past the end of a file, so many beginners will read one more time than they expect if they do something like while(!feof(file)). What I'm trying to understand, though, is how it actually determines that an attempt has been made to read past the end of the file. Is the entire file already read in and the number of characters already known, or is there some other mechanism at work?
I realize this may be a duplicate question somewhere, but I've been unable to find it, probably because I don't know the best way to word what I'm asking. If there is an answer already out there a link would be much appreciated. Thanks.
Whatever else the C++ library does, eventually it has to read from the file. Somewhere in the operating system, there is a piece of code that eventually handles that read. It obtains from the filesystem the length of the file, stored the same way the filesystem stores everything else. Knowing the length of the file, the position of the read, and the number of bytes to be read, it can make the determination that the low-level read hits the end of the file.
When that determination is made, it is passed up the stack. Eventually, it gets to the standard library which records internally that the end of file has been reached. When a read request into the library tries to go past that recorded end, the EOF flag is set and feof will start returning true.
feof() is a part of the standard C library buffered I/O. Since it's buffered, fread() pre-reads some data (definitely not the whole file, though). If, while filling its buffer, fread() detects EOF (the underlying OS routine reports that there are no more bytes to read; a POSIX read(), for example, returns 0), it sets a flag on the FILE structure. feof() simply checks for that flag. So feof() returning true essentially means "a previous read attempt encountered end of file".
How EOF is detected is OS/FS-specific and has nothing to do whatsoever with the C library/language. The OS has some interface to read data from files. The C library is just a bridge between the OS and the program, so you don't have to change your program if you move to another OS. The OS knows how the files are stored in its filesystem, so it knows how to detect EOF. My guess is that typically it is performed by comparing the current position to the length of the file, but it may be not that easy and may involve a lot of low-level details (for example, what if the file is on a network drive?).
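As a POSIX-specific sketch of that low-level signal (other platforms report EOF differently, and the file name is illustrative), the raw read() call reports end of file by returning a count of 0:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("tmp.txt", O_RDONLY);
    char buf[64];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        printf("read %zd bytes\n", n);
    if (n == 0)
        puts("read() returned 0: the OS reports end of file");
    close(fd);
    return 0;
}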
An interesting question is what happens when the stream is at the end, but it was not yet detected by any reads. For example, if you open an empty file. Does the first call to feof() before any fread() return true or false? The answer is probably false. The docs aren't terribly clear on this subject:
This indicator is generally set by a previous operation on the stream
that attempted to read at or past the end-of-file.
It sounds as if a particular implementation may choose some other unusual ways to set this flag.
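A quick sketch to check what your implementation does with an empty file (assuming empty.txt exists and is empty):

#include <stdio.h>

int main(void) {
    FILE *f = fopen("empty.txt", "r");
    printf("%d\n", feof(f)); /* almost certainly 0: nothing has tried to read yet */
    fgetc(f);                /* returns EOF and sets the end-of-file indicator */
    printf("%d\n", feof(f)); /* now non-zero */
    fclose(f);
    return 0;
}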
Most file systems maintain meta information about the file (including its size), and an attempt to read past the end results in the feof flag being set. Others, for instance old or lightweight file systems, set feof when they come to the last byte of the last block in the chain.
How does feof() actually know when the end of file is reached?
When code attempts to read past the last character.
Depending on the file type, the last character is not necessarily known until an attempt to read past it occurs and no character is available.
Sample code demonstrating feof() going from 0 to 1
#include <stdio.h>

void ftest(int n) {
    FILE *ostream = fopen("tmp.txt", "w");
    if (ostream) {
        while (n--) {
            fputc('x', ostream);
        }
        fclose(ostream);
    }
    FILE *istream = fopen("tmp.txt", "r");
    if (istream) {
        char buf[10];
        printf("feof() %d\n", feof(istream));
        printf("fread %zu\n", fread(buf, 1, 10, istream));
        printf("feof() %d\n", feof(istream));
        printf("fread %zu\n", fread(buf, 1, 10, istream));
        printf("feof() %d\n", feof(istream));
        puts("");
        fclose(istream);
    }
}

int main(void) {
    ftest(9);
    ftest(10);
    return 0;
}
Output
feof() 0
fread 9   // 10-character read attempted, 9 were read
feof() 1  // eof is set, as the previous read attempted to read past the 9th (last) char
fread 0
feof() 1

feof() 0
fread 10  // 10-character read attempted, 10 were read
feof() 0  // eof is still clear, as no attempt was made to read past the 10th (last) char
fread 0
feof() 1
The feof() function reports the stream's end-of-file indicator, which is set only when a read actually runs into the end of the file. So when the last item is read, the indicator is not yet set, because the read stopped at the last character without looking past it. Since the indicator is not set and feof() returns zero, the flow enters the while loop again. This time fgets finds there are no characters left, returns NULL, and sets the indicator. Now feof() detects the end-of-file indicator and returns a non-zero value, breaking the while loop.
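A sketch of the loop shape being described, whose extra iteration is exactly why while(!feof(...)) loops trip up beginners (the NULL check is what keeps the extra pass from processing garbage):

#include <stdio.h>

int main(void) {
    FILE *f = fopen("tmp.txt", "r");
    char line[256];
    while (!feof(f)) {                   /* still 0 right after the last line is read */
        if (fgets(line, sizeof line, f)) /* the extra call returns NULL and sets eof */
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}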
I am trying to write a simple UTF-8 decoder for my assignment. I'm fairly new to C++, so bear with me here...
I have to determine whether the encoding is valid or not, and output the value of the UTF-8 character in hexadecimal in either case. Say that I have read the first byte and used it to determine the number of bytes in this UTF-8 character. The problem is that after I read the first byte, I'm having trouble setting the ifstream position back one byte so I can read the whole UTF-8 character. I've tried seekg() and putback(), but I always get a bus error or some weird output that's not my test data. Please help, thanks.
Even though I can use peek() for the first byte, I still have to read the following bytes to determine whether the encoding is valid or not. The problem of setting back the stream position is still there.
I would suggest you use peek() to read the first byte instead. seekg() should work to rewind, but a bus error is usually caused by code breaking alignment requirements, which suggests you're doing something else wrong in your code.
Why do you have to seek back? Can't you simply read the rest of the UTF-8 sequence, after knowing how many more octets you're expecting?
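A sketch of that forward-only approach (bit masks follow the standard UTF-8 lead-byte patterns; this does not reject overlong encodings or other finer validity rules):

#include <cstdio>
#include <fstream>

// Decode one UTF-8 sequence from `in` without ever seeking back.
// Returns false on EOF or on an invalid/truncated sequence.
bool decode_one(std::ifstream& in, unsigned int& cp) {
    int b = in.get();
    if (b == EOF) return false;
    int extra;                                    // continuation bytes expected
    if      ((b & 0x80) == 0x00) { cp = b;        extra = 0; }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
    else return false;                            // invalid lead byte
    while (extra--) {
        b = in.get();
        if (b == EOF || (b & 0xC0) != 0x80) return false; // truncated or invalid
        cp = (cp << 6) | (b & 0x3F);
    }
    return true;
}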
As Ates Goral said, I would read the next bytes directly and accumulate them into what I've got; it is cleaner IMHO.
Anyway, you could move the stream pointer using seekg():
#include <fstream>
#include <iostream>
using namespace std;

int main() {
    char byte = 0;
    unsigned int character = 0; // reset to 0 before every use
    ifstream file("test.txt", ios::binary);
    file.get(byte);
    // ... determine the sequence length from the first byte ...
    file.seekg(-1, ios::cur); // cur == current position
    streamsize numberOfBytesAndNullTerminator = 3; // e.g. 2 bytes + '\0', see below
    file.get(reinterpret_cast<char*>(&character),
             numberOfBytesAndNullTerminator);
    cout << hex << character;
}
Beware that get() in the second case writes '\0' at the end of character's buffer, so you have to give it the required number of bytes including the null terminator: if you want to read two bytes, pass numberOfBytesAndNullTerminator = 3.
I don't know why you need to put the character back but istream::unget() or istream::putback() should do what you want. Look them up in your compiler's documentation.
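For instance, a minimal sketch with a string stream standing in for the file:

#include <iostream>
#include <sstream>

int main() {
    std::istringstream in("abc");
    int c = in.get();     // reads 'a'
    in.unget();           // steps the position back one byte
    int again = in.get(); // reads 'a' again
    std::cout << static_cast<char>(c) << static_cast<char>(again) << '\n'; // prints "aa"
}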
Please look up:
ifstream::seekg()
ifstream::tellg()