C++ EOF running one too many times?

This is my first time using EOF and/or files, and I am having an issue where my code hangs, which I believe is because my loop runs one too many times.
I am inputting from a file and dynamically creating objects as I go, and the program hangs once the whole file has been read.
while( !studentFile.eof() )
{
    cout << "38\n";
    Student * temp = new Student();
    (*temp).input( studentFile );
    (*sdb).insert( (*temp) );
}
This chunk of code is the code in question. The cout << "38\n"; prints the line number, and it is how I concluded that the loop runs one too many times.
The file only contains 4 students' worth of data, yet 38 appears 5 times. Once the loop reads the last bit of data, it doesn't seem to register that the file has ended; it loops again, but there is no data left to input, so my code hangs.
How do I fix this? Is my logic correct?
Thank you.

Others have already pointed out the details of the problem you've noticed.
You should realize, however, that there are more problems you haven't noticed yet. One is a fairly obvious memory leak. Every iteration of the loop executes: Student * temp = new Student();, but you never execute a matching delete.
C++ makes memory management much simpler than some other languages (e.g., Java), which require you to new every object you use. In C++, you can just define an object and use it:
Student temp;
temp.input(studentFile);
This simplifies the code and eliminates the memory leak -- your Student object will be automatically destroyed at the end of each iteration, and a (conceptually) new/different one created at the beginning of the next iteration.
Although it's not really a bug per se, even that can still be simplified quite a bit. Since whatever sdb points at apparently has an insert member function, you can use it like a standard container (which it may actually be, though it doesn't really matter either way). To neaten up the code, start by writing an extraction operator for a Student:
std::istream &operator>>(std::istream &is, Student &s) {
    s.input(is);
    return is;
}
Then you can just copy data from the stream to the collection:
std::copy(std::istream_iterator<Student>(studentFile),
          std::istream_iterator<Student>(),
          std::inserter(*sdb, sdb->end()));
Note that this automates correct handling of EOF, so you don't have to deal with problems like you started with at all (even if you wanted to cause them, it wouldn't be easy).

This is because the EOF flag is only set after you try to read and get no data. So it would go
Test for EOF -> No EOF
Try to read one line -> Good, read first line
Test for EOF -> No EOF
Try to read one line -> Good, read second line
Test for EOF -> No EOF
Try to read one line -> Good, read third line
Test for EOF -> No EOF
Try to read one line -> Good, read fourth line
Test for EOF -> No EOF
Try to read one line -> EOF
But by the Try to read one line -> EOF, you're already in the body of the while on the fifth iteration, which is why you're seeing the loop run 5 times. So you need to read before you check for EOF.
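A minimal sketch of that read-then-test fix, using hypothetical stand-ins for the question's Student and database types (the real input() signature isn't shown, so here it returns the stream so its success can be tested directly):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the question's Student: input() performs the
// read and returns the stream, so the loop can test whether it succeeded.
struct Student {
    std::string name;
    int id = 0;
    std::istream& input(std::istream& in) { return in >> name >> id; }
};

// Stand-in for the question's database: a plain vector.
std::vector<Student> readAll(std::istream& studentFile) {
    std::vector<Student> sdb;
    Student temp;
    while (temp.input(studentFile))  // attempt the read FIRST...
        sdb.push_back(temp);         // ...and insert only if it succeeded
    return sdb;
}
```

With four records in the file, the loop body runs exactly four times: the fifth read fails, the stream converts to false, and the loop exits before inserting a bogus fifth student.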

You need to check the stream status bits immediately after performing an operation on a stream. You don't show the code, but it would appear that (*temp).input(studentFile) is doing the reading from the stream. Call eof() (or other status check) after doing the read but before processing with the data you (attempted to) read.

Related

Why think in terms of buffers and not lines? And can fgets() read multiple lines at one time?

Recently I have been reading Unix Network Programming Vol.1. In section 3.9, the last two paragraphs above Figure 3.18 said, here I quote:
...But our advice is to think in terms of buffers and not lines. Write your code to read buffers of data, and if a line is expected, check the buffer to see if it contains that line.
And in the next paragraph, the authors gave a more specific example, here I quote:
...as we'll see in Section 6.3. System functions like select still won't know about readline's internal buffer, so a carelessly written program could easily find itself waiting in select for data already received and stored in readline's buffers.
In section 6.5, the actual problem is "mixing of stdio and select()", which would make the program, here I quote the book, "error-prone". But how?
I know that the authors gave the answer later in the same section. According to my understanding of the book, it is because the data is hidden from select(), so select() cannot know whether data that has already been read into a buffer has been consumed or not.
The answer is literally there, but my first problem is that I have a really hard time getting it: I cannot imagine what damage it would do to a program. Maybe I need a demo program that suffers from the problem to help me understand it.
Still in section 6.5, the authors tried to explain the problem further by giving, here I quote:
... Consider the case when several lines of input are available from the standard input.
select will cause the code at line 20 to read the input using fgets and that, in turn, will read the available lines into a buffer used by stdio. But, fgets only returns a single line and leaves any remaining data sitting in the stdio buffer ...
The "line 20" mentioned above is:
if (Fgets(sendline, MAXLINE, fp) == NULL)
where sendline is an array of char and fp is a pointer to FILE. I looked up into the detailed implementation of Fgets, and it just wrapped fgets() up with some extra error-dealing logic and nothing more.
And here comes my second question, how does fgets manage to, here I quote again, read the available lines? I mean, I looked up the man-page of fgets, it says fgets normally stops on the first newline character. Doesn't this mean that only one line would be read by fgets? More specifically, if I type one line in the terminal and press the enter key, then fgets reads this exact line. I do this again, then the next new line is read by fgets, and the point is one line at a time.
Thanks for your patience in reading all the descriptions, and looking forward to your answers.
One of the main reasons to think about buffers rather than lines (when it comes to network programming) is because TCP is a streaming protocol, where data is just a stream of bytes beginning with a connection and ending with a disconnection.
There are no message boundaries, and there are no "lines", except what the application-level protocol on top of TCP has decided.
That makes it impossible to read a "line" from a TCP connection, there are no such primitive functions for it. You must read using buffers. And because of the streaming and the lack of any kind of boundaries, a single call to receive data may give your application less than you ask for, and it may be a partial application-level message. Or you might get more than a single message, including a partial message at the end.
Another note of importance is that sockets are blocking by default, so a socket that doesn't have any data ready to be received will cause any read call to block and wait until there is data. The select call only tells you whether a read call would avoid blocking right now. If you do the read call multiple times in a loop, it can (and ultimately will) block once the data to receive is exhausted.
All this makes it really hard to use high-level functions like fgets (after a fdopen call, of course) to read data from TCP sockets, as it can block at any time if you use a blocking socket. Or it can return with a failure if you use non-blocking sockets and the underlying read reports that it would block (yes, that is reported as an error).
If you use your own buffering, you can use select in the same loop as read or recv, to make sure that the call won't block. Or if you use non-blocking sockets, you can gather data (and append it to your buffer) with single read calls, and add detection for when you have a full message (either by knowing its length or by detecting the message terminator or separator, like a newline).
As for fgets reading "multiple lines", it can cause the underlying reads to fill the buffers with multiple lines, but the fgets function itself will only fill your supplied buffer with a single line.
fgets will never give you multiple lines.
select is a system call. It will tell you if the kernel has data that your process hasn't received yet.
fgets is a C library call. To reduce the number of Linux kernel calls (which are typically slower) the C library uses buffering. It will try to read a big chunk of data from the Linux kernel (typically something like 4096 bytes) and then return just the part you asked for. Next time you call it, it will see if it already read the part you asked for, and then it won't need to read it from the kernel. For example, if it's able to read 5 lines at once from the kernel, it will return the first line, and the other 4 will be stored in the C library and returned in the next 4 calls.
When fgets reads 5 lines, returns 1, and stores 4, the Linux kernel will see that all the data has been read. It doesn't know your program is using the C library to read the data. Therefore, select will say there is no data to read, and your program will get stuck waiting for the next line, even though there already is one.
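That stuck state can be demonstrated with a small program, substituting a pipe for the TCP socket (stdio buffers both the same way). This is a POSIX-only sketch; the byte counts and buffer size are arbitrary:

```cpp
#include <cstdio>
#include <sys/select.h>
#include <unistd.h>

// Returns what select() reports AFTER fgets has returned one line:
// 0 ("nothing to read") even though a full second line is already
// sitting in stdio's internal buffer.
int selectAfterOneFgets() {
    int fds[2];
    pipe(fds);
    write(fds[1], "line1\nline2\n", 12);  // two lines "arrive" at once
    FILE* fp = fdopen(fds[0], "r");
    char buf[64];
    fgets(buf, sizeof buf, fp);  // stdio read()s all 12 bytes, hands back "line1\n"
    fd_set rset;
    FD_ZERO(&rset);
    FD_SET(fds[0], &rset);
    timeval tv = {0, 0};         // zero timeout: just poll
    return select(fds[0] + 1, &rset, nullptr, nullptr, &tv);
}
```

A program that now called select() with no timeout would block forever waiting for "line2\n", which it has in fact already received.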
So how do you resolve this? You basically have two options: don't do buffering at all, or do your own buffering so you get to control how it works.
Option 1 means you read 1 byte at a time until you get a \n and then you stop reading. The kernel knows exactly how much data you have read, and it will be able to accurately tell you whether there's more data. However, making a kernel call for every single byte is relatively slow (measure it) and also, the computer on the other end of the connection could cause your program to freeze simply by not sending a \n at all.
I want to point out that option 1 is completely viable if you are just making a prototype. It does work, it's just not very good. If you try to fix the problems with option 1, you will find the only way to fix them is to do option 2.
Option 2 means doing your own buffering. You keep an array of say 4096 bytes per connection. Whenever select says there is data, you try to fill up the array as much as possible, and you check whether there is a \n in the array. If so, you process that line, remove the line from the array*, and repeat. This means you minimize kernel calls, and you also won't freeze if the other computer doesn't send a \n since the unfinished line will just stay in the array. If all 4096 bytes are used, and there is still no \n, you can either choose to process it as a big line (if this makes sense, e.g. in a chat program) or you can disconnect the connection, since the other computer is breaking the rules. Of course you can choose to use a bigger number than 4096.
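Option 2's bookkeeping, minus the socket plumbing, might look like this sketch (a std::string stands in for the fixed 4096-byte array to keep it short):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Append freshly received bytes to this connection's buffer, then peel
// off every complete line. A trailing partial line stays in the buffer
// until more data arrives.
std::vector<std::string> extractLines(std::string& buffer,
                                      const char* data, std::size_t len) {
    buffer.append(data, len);
    std::vector<std::string> lines;
    std::size_t pos;
    while ((pos = buffer.find('\n')) != std::string::npos) {
        lines.push_back(buffer.substr(0, pos));  // line without the '\n'
        buffer.erase(0, pos + 1);                // "remove the line from the array"
    }
    return lines;
}
```

Each time select reports data, you read into a temporary, call something like this, and process whatever complete lines come back.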
* Extra for experts: "removing the line from the array" can be fast if you implement a "circular buffer" data structure.

How does the compiler optimize getline() so effectively?

I know a lot of a compiler's optimizations can be rather esoteric, but my example is so straightforward I'd like to see if I can understand, if anyone has an idea what it could be doing.
I have a 500 mb text file. I declare and initialize an fstream:
std::fstream file(path, std::ios::in);
I need to read the file sequentially. It's tab-delimited, but the field lengths are not known and vary line to line. The actual parsing I need to do on each line added very little time to the total (which really surprised me, since I was doing string::find on each line from getline; I thought that'd be slow).
In general I want to search each line for a string, and abort the loop when I find it. I also have it incrementing and spitting out the line numbers for my own curiosity, I confirmed this adds little time (5 seconds or so) and lets me see how it blows past short lines and slows down on long lines.
I have the text to be found as the unique string tagging the eof, so it needs to search every line. I'm doing this on my phone so I apologize for formatting issues but it's pretty straightforward. I have a function taking my fstream as a reference and the text to be found as a string and returning a std::size_t.
long long int lineNum = 0;
while (std::getline(file, line))
{
    pos = line.find(text);
    lineNum += 1;
    std::cout << std::to_string(lineNum) << std::endl;
    if (pos != std::string::npos)
        return file.tellg();
}
return std::string::npos;
Edit: lingxi pointed out the to_string isn't necessary here, thanks. As mentioned, entirely omitting the line number calculation and output saves a few seconds, which in my preoptimized example is a small percent of the total.
This successfully runs through every line, and returns the end position in 408 seconds. I had minimal improvement trying to put the file in a stringstream, or omitting everything in the whole loop (just getline until the end, no checks, searches, or displaying). Also pre-reserving a huge space for the string didn't help.
Seems like the getline is entirely the driver. However...if I compile with the /O2 flag (MSVC++) I get a comically faster 26 seconds. In addition, there is no apparent slow down on long lines vs short. Clearly the compiler is doing something very different. No complaints from me, but any thoughts as to how it's achieved? As an exercise I'd like to try and get my code to execute faster before compiler optimization.
I bet it has something to do with the way getline manipulates the string. Would it be faster (alas, I can't test for a while) to just reserve the whole file size for the string, and read character by character, incrementing my line number when I pass a \n? Also, would the compiler employ things like mmap?
UPDATE: I'll post code when I get home this evening. It looks like just turning off runtime checks dropped the execution from 400 seconds to 50! I tried performing the same functionality using raw c style arrays. I'm not super experienced, but it was easy enough to dump the data into a character array, and loop through it looking for newlines or the first letter of my target string.
Even in full debug mode it gets to the end and correctly finds the string in 54 seconds. 26 seconds with checks off, and 20 seconds optimized. So from my informal, ad-hoc experiments it looks like the string and stream functions are victimized by the runtime checks? Again, I'll double check when I get home.
The reason for this dramatic speedup is that the iostream class hierarchy is based on templates (std::ostream is actually a typedef for a specialization of the std::basic_ostream template), and a lot of its code is in headers. C++ iostreams take several function calls to process every byte in the stream. However, most of those functions are fairly trivial. By turning on optimization, most of these calls are inlined, exposing to the compiler the fact that std::getline essentially copies characters from one buffer to another until it finds a newline - normally this is "hidden" under several layers of function calls. This can be further optimized, reducing the overhead per byte by orders of magnitude.
Buffering behavior actually doesn't change between the optimized and non-optimized version, otherwise the speedup would be even higher.
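For comparison, the "reserve the whole file and scan it yourself" idea from the question can be sketched like this; whether it actually beats optimized getline would have to be measured:

```cpp
#include <cstring>
#include <string>

// Count lines by scanning one big in-memory buffer with memchr, which is
// roughly what an inlined, optimized getline boils down to: no
// per-character function calls, just bulk searches for '\n'.
long long countLines(const std::string& data) {
    long long lines = 0;
    const char* p = data.data();
    const char* end = p + data.size();
    while (p < end) {
        const char* nl = static_cast<const char*>(std::memchr(p, '\n', end - p));
        if (!nl) { ++lines; break; }  // final line has no trailing '\n'
        ++lines;
        p = nl + 1;
    }
    return lines;
}
```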

Is there any way to read characters that satisfy certain conditions only from stdin in C++?

I am trying to read some characters that satisfy a certain condition from stdin with the iostream library, while leaving those that don't satisfy the condition in stdin so that the skipped characters can be read later. Is this possible?
For example, I want characters in a-c only and the input stream is abdddcxa.
First read in all the characters in a-c (abca); after this input finishes, start reading the remaining characters (dddx). (These two reads can't happen simultaneously; they might be in two different functions.)
Wouldn't it be simpler to read everything, then split the input into the two parts you need and finally send each part to the function that needs to process it?
Keeping the data in the stdin buffer is akin to using globals, it makes your program harder to understand and leaves the risk of other code (or the user) changing what is in the buffer while you process it.
On the other hand, dividing your program into "the part that reads the data", "the part that parses the data and divides the workload" and the "part that does the work" makes for a better structured program which is easy to understand and test.
You can probably use regex to do the actual split.
What you're asking for is the putback method (for more details see: http://www.cplusplus.com/reference/istream/istream/putback/). You would have to read everything, filter the part that you don't want to keep out, and put it back into the stream. So for instance:
cin >> myString;
// Fill putbackBuf[] with characters, in reverse order, to be put back
pPutbackBuf = &putbackBuf[0];
do {
    cin.putback(*(pPutbackBuf++));
} while (*pPutbackBuf);
Another solution (which is not exactly what you're asking for) would be to split the input into two strings and then feed the "non-inputted" string into a stringstream and pass that to whatever function needs to do something with the rest of the characters.
What you want to do is not possible in general; ungetc and putback exist, but they're not guaranteed to work for more than one character. They don't actually change stdin; they just push back on an input buffer.
What you could do instead is to explicitly keep a buffer of your own, by reading the input into a string and processing that string. Streams don't let you safely rewind in many cases, though.
No, random access is not possible for streams (except for fstream and stringstream). You will have to read in the whole line/input and process the resulting string (which you could, however, do using iostreams/std::stringstream if you think it is the best tool for that -- I don't think that, but iostreams gurus may differ).
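The read-everything-then-split approach suggested above can be sketched as follows (the a-c condition is the question's example):

```cpp
#include <string>
#include <utility>

// Split the whole input into the characters in 'a'..'c' and everything
// else, preserving order within each group; the caller then hands each
// string to the function that needs it.
std::pair<std::string, std::string> splitByRange(const std::string& input) {
    std::string wanted, rest;
    for (char c : input) {
        if (c >= 'a' && c <= 'c')
            wanted += c;
        else
            rest += c;
    }
    return {wanted, rest};
}
```

For the question's input abdddcxa, this yields abca as the first string and dddx as the second, matching the two reads described.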

std::ios::failure (ios::badbit) problem with fstream.write()

I've been trying to debug my code for the past few hours and I couldn't figure out the problem. I eventually set my filestream to throw exceptions on failbit, and I found that my filestream was setting the failbit for some reason. I have absolutely no idea why the failbit is being set, because all I'm doing is writing 2048-byte chunks of data to the stream until suddenly it fails (at the same spot each time).
I would like to show you my code to see if anyone can see a problem and possibly see what might cause a std::ios::failure to be thrown:
bool abstractBlock::encryptBlockRC4(char* key)
{   //This encryption method can be chunked :)
    getStream().seekg(0, std::ios::end);
    int sLen = int(getStream().tellg()) - this->headerSize;
    seekg(0); //Seek to beginning of data
    seekp(0);
    char* encryptionChunkBuffer = new char[2048]; //2KB chunk buffer
    for (int chunkIterator = 0; chunkIterator < sLen; chunkIterator += 2048)
    {
        if (chunkIterator + 2048 <= sLen)
        {
            getStream().read(encryptionChunkBuffer, 2048);
            char* encryptedData = EnDeCrypt(encryptionChunkBuffer, 2048, key);
            getStream().write(encryptedData, 2048);
            free(encryptedData);
        } else {
            int restLen = sLen - chunkIterator;
            getStream().read(encryptionChunkBuffer, restLen);
            char* encryptedData = EnDeCrypt(encryptionChunkBuffer, restLen, key);
            getStream().write(encryptedData, restLen);
            delete encryptedData;
        }
    }
    delete [] encryptionChunkBuffer;
    dataFlags |= DATA_ENCRYPTED_RC4; // Set the "encrypted (rc4)" bit
    seekp(0); //Seek to beginning of data
    seekg(0); //Seek to beginning of data
    return true;
}
The above code is essentially encrypting a file using 2048 chunks. It basically reads 2048 bytes, encrypts it and then writes it back to the stream (overwrites the "unencrypted" data that was previously there).
getStream() is simply returning the fstream handle to the file thats being operated on.
The error always occurs when chunkIterator==86116352 on the line getStream().write(encryptedData,2048);
I know my code may be hard to decode, but maybe you can tell me some possible things that might trigger a failbit? Currently I think the problem lies in the fact that I am reading from and writing to the same stream, but as I mentioned, any ideas about what can cause a failbit would help me investigate the problem further.
You must seekp between changing from reads to writes (and vice versa, using seekg). It appears your implementation allows you to avoid this some of the time. (You're probably running into implicit flushes or other buffer manipulation which hide the problem sometimes.)
C++03, §27.5.1p1, Stream buffer requirements
The controlled sequences can impose limitations on how the program can read characters from a
sequence, write characters to a sequence, put characters back into an input sequence, or alter the stream
position.
This just generally states these aspects are controlled by the specific stream buffer.
C++03, §27.8.1.1p2, Class template basic_filebuf
The restrictions on reading and writing a sequence controlled by an object of class
basic_filebuf<charT,traits> are the same as for reading and writing with the Standard C library
FILEs.
fstream, ifstream, and ofstream use filebufs.
C99, §7.19.5.3p6, The fopen function
When a file is opened with update mode ('+' as the second or third character in the above list of mode argument values), both input and output may be performed on the associated stream. However, output shall not be directly followed by input without an intervening call to the fflush function or to a file positioning function (fseek, fsetpos, or rewind), and input shall not be directly followed by output without an intervening call to a file positioning function, unless the input operation encounters end-of-file.
You may need to look up these calls to translate to iostreams terminology, but it is fairly straight-forward.
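In iostreams terms, the rule looks like the following sketch: a positioning call between every switch of direction. The file name is arbitrary and error checking is omitted:

```cpp
#include <fstream>
#include <string>

// Read 4 bytes, seekp before switching to writing, write 4 bytes, then
// seekg before switching back to reading. Omitting either seek puts the
// stream in the undefined territory the C standard describes.
std::string readWriteDemo(const char* path) {
    { std::ofstream create(path, std::ios::binary); create << "abcdefgh"; }
    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
    char buf[4];
    f.read(buf, 4);          // input: reads "abcd"
    f.seekp(f.tellg());      // positioning call required before output
    f.write("XXXX", 4);      // output: overwrites "efgh"
    f.seekg(0);              // positioning call required before input again
    std::string all(8, '\0');
    f.read(&all[0], 8);      // reads "abcdXXXX"
    return all;
}
```

The question's loop alternates read and write with no intervening seek, which is exactly the pattern these rules forbid.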
You sometimes free the result of EnDeCrypt and sometimes delete it (but with single-object delete and not the array form); this most likely doesn't contribute to the problem you see, but it's either an error on your part or, less likely, on the part of the designer of EnDeCrypt.
You're using:
char* encryptionChunkBuffer = new char[2048]; //2KB chunk buffer
//...
getStream().read(encryptionChunkBuffer,2048);
//...
delete[] encryptionChunkBuffer;
But it would be better and easier to use:
vector<char> encryptionChunkBuffer (2048); //2KB chunk buffer
//...
getStream().read(&encryptionChunkBuffer[0], encryptionChunkBuffer.size());
//...
// no delete
If you don't want to type encryptionChunkBuffer.size() twice, then use a local constant for it.

Overloading operator>> to a char buffer in C++ - can I tell the stream length?

I'm on a custom C++ crash course. I've known the basics for many years, but I'm currently trying to refresh my memory and learn more. To that end, as my second task (after writing a stack class based on linked lists), I'm writing my own string class.
It's gone pretty smoothly until now; I want to overload operator>> so that I can do stuff like cin >> my_string;.
The problem is that I don't know how to read the istream properly (or perhaps the problem is that I don't know streams...). I tried a while (!stream.eof()) loop that .read()s 128 bytes at a time, but as one might expect, it stops only on EOF. I want it to read to a newline, like you get with cin >> to a std::string.
My string class has an alloc(size_t new_size) function that (re)allocates memory, and an append(const char *) function that does that part, but I obviously need to know the amount of memory to allocate before I can write to the buffer.
Any advice on how to implement this? I tried getting the istream length with seekg() and tellg(), to no avail (it returns -1), and as I said looping until EOF (doesn't stop reading at a newline) reading one chunk at a time.
To read characters from the stream until the end of line use a loop.
char c;
while(istr.get(c) && c != '\n')
{
    // Append 'c' to the end of your string.
}
// If you want to put the '\n' back onto the stream
// use istr.putback(c) here (note that unget() takes no argument).
// But I think it's safe to say that dropping the '\n' is fine.
If you run out of room reallocate your buffer with a bigger size.
Copy the data across and continue. No need to be fancy for a learning project.
you can use cin.getline(buffer, buffer_size); (the istream member function, not the free std::getline)
then you will need to check the bad, eof and fail flags:
std::cin.bad(), std::cin.eof(), std::cin.fail()
Unless bad or eof was set, a set fail flag usually indicates the buffer was too small for the line, so you should reallocate your buffer and continue reading into the new buffer after calling std::cin.clear()
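That reallocate-and-continue loop might look like the following sketch, reading from any istream (the starting size of 4 is deliberately tiny to exercise the growth path):

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Read one line with istream::getline into a fixed buffer. When only
// failbit is set, the line was longer than the buffer: clear() the
// stream, grow the buffer, and keep reading where getline left off.
std::string readLine(std::istream& in) {
    std::vector<char> buf(4);
    std::string line;
    for (;;) {
        in.getline(buf.data(), buf.size());
        line += buf.data();              // getline null-terminates what it stored
        if (in.bad() || in.eof() || !in.fail())
            break;                       // real error, end of input, or got the '\n'
        in.clear();                      // failbit alone: buffer was too small
        buf.resize(buf.size() * 2);
    }
    return line;
}
```

The same pattern works for a custom string class: replace the std::string accumulator with calls to its alloc()/append() members.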
A side note: in the standard library, operator>> of an istream is overloaded to provide this kind of functionality, or (as for char*) is a global function. Maybe it would be wiser to provide a custom global overload instead of overloading the operator as a member of your class.
Check Jerry Coffin's answer to this question.
The first method he used is very simple (just a helper class) and allow you to write your input in a std::vector<std::string> where each element of the vector represents a line of the original input.
That really makes things easy when it comes to processing afterwards!