How do file-reading functions recognize the end of a text file in C++?


As far as I know, there are two standard ways to read a text file in C++ (in this case, two numbers on every line):
Assume that every line consists of 2 numbers and read token by token:
#include <fstream>
std::ifstream infile("thefile.txt");
int a, b;
while (infile >> a >> b)
{
// process pair (a,b)
}
Line-based parsing, using string streams:
#include <sstream>
#include <string>
#include <fstream>
std::ifstream infile("thefile.txt");
std::string line;
while (std::getline(infile, line))
{
std::istringstream iss(line);
int a, b;
if (!(iss >> a >> b)) { break; } // error
// process pair (a,b)
}
I can also use the code below to check whether the file has ended:
while (!infile.eof())
My questions are:
Question 1: How do these functions know that a line is the last
line? I mean, how does eof() decide whether to return true or false?
As far as I know, they are reading a part of memory. What is the
difference between the part that belongs to the file and the parts
that do not?
Question 2: Is there any way to cheat this function? I mean, is it
possible to add something in the middle of the text file (for example
with a hex editor) and make eof() wrongly return true in
the middle of the text file?
I appreciate your time and consideration.

Question 1: How do these functions know that a line is the last line? I mean, how does eof() decide whether to return true or false?
It doesn't. The functions know when you've tried to read past the very last character in the file. They don't necessarily know whether a line is the last line. "Files" aren't the only things that you can read with streams. Keyboard input, a special purpose device, internet sockets: All can be read with the right kind of I/O stream. When reading from standard input, the stream has no way of knowing whether the very next thing I type is control-Z.
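To make the "tried to read past the last character" point concrete, here is a minimal sketch. It uses a std::istringstream in place of a file (an assumption made so the example needs no file on disk), and the function names are hypothetical. It shows that eof() is still false right after the last character has been extracted, and only becomes true after one more read is attempted.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Extracts exactly text.size() characters, then reports eof().
// No read past the end has been attempted yet.
bool eof_after_last_char(const std::string& text) {
    std::istringstream in(text);
    char c;
    for (std::size_t i = 0; i < text.size(); ++i) {
        in.get(c);
    }
    return in.eof();
}

// Same as above, but attempts one extra read. That read fails,
// and the failure is what sets the EOF flag.
bool eof_after_extra_read(const std::string& text) {
    std::istringstream in(text);
    char c;
    for (std::size_t i = 0; i < text.size(); ++i) {
        in.get(c);
    }
    in.get(c);  // nothing left: this attempt fails
    return in.eof();
}

// Counts "a b" pairs by looping on the read itself, as in the question's
// first method, so the loop ends exactly when a read fails.
int count_pairs(const std::string& text) {
    std::istringstream in(text);
    int a = 0, b = 0, n = 0;
    while (in >> a >> b) {
        ++n;
    }
    return n;
}
```

This is also why looping on the read itself (while (infile >> a >> b)) tends to be safer than while (!infile.eof()): the second form checks the flag before the read that would set it.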
With regard to files on a computer disk, most modern operating systems store metadata regarding the file separate from the file. These metadata include the length of the file (and oftentimes when the file was last modified and when it was last read). On these systems, the stream buffer that underlies the I/O stream knows the current read location within the file and knows how long the file is. The stream buffer signals EOF when the read location reaches the length of the file.
That's not universal, however. There are some not-so-common operating systems that don't use this concept of metadata stored elsewhere. End of file on a disk file is just as surprising on these systems as is end of file from user input on a keyboard.
As far as I know, they are reading a part of memory. What is the difference between the part that belongs to the file and the parts that do not?
Learn the difference between memory and disk files. There's a huge difference between the two. Unless you're working with an embedded computer, memory is much more limited than is disk space.
Question 2: Is there any way to cheat this function? I mean, is it possible to add something in the middle of the text file (for example with a hex editor) and make eof() wrongly return true in the middle of the text file?
That depends very much on how the operating system implements files. On most modern operating systems, the answer is not just "no" but "No!". The concept of using some special signature that indicates end of file in a disk file is one of many computer science concepts that for the most part have been dumped into the pile of "that wasn't very smart" ideas. You asked your question on the internet. That most likely means you are using a Windows machine, a Linux machine, or a Mac. All of them store the length of a file as metadata separate from the contents of a file.
However, there is a need for the ability to clear the end of file indicator. One program might be writing to a file while at the same time another is reading from it. The reader might hit EOF while the writer is still active. The reader needs to clear the EOF indicator to continue reading what the writer has written. The C++ I/O streams provide the ability to do just that. Every I/O stream has a clear function. Whether it works is a different story. The clear will work temporarily, but the very next read might well set the EOF bit again. For example, when I type control-Z on my keyboard, that means I am done interacting with the program, period. My next action might well be to go out for lunch.
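The reader/writer scenario above can be sketched as follows. This is a hedged illustration and not a guarantee: the function name and the file path passed to it are hypothetical, the "writer" is simulated inside the same function, and whether the read after clear() actually sees the appended data can differ between standard library implementations and platforms.

```cpp
#include <fstream>
#include <string>

// Reads `path` to EOF, lets a simulated writer append a line,
// then clears the stream state and tries to read the new line.
std::string read_after_append(const std::string& path) {
    {
        std::ofstream writer(path, std::ios::trunc);
        writer << "first\n";
    }  // closed (and therefore flushed) here

    std::ifstream reader(path);
    std::string line;
    while (std::getline(reader, line)) {
        // consume everything currently in the file
    }
    // reader.eof() is now true; further reads fail until clear() is called.

    {
        std::ofstream writer(path, std::ios::app);
        writer << "second\n";
    }

    reader.clear();              // reset eofbit and failbit
    std::getline(reader, line);  // may now pick up the appended data
    return line;
}
```

If the second read still fails on a given platform, the returned string is empty, which matches the "whether it works is a different story" caveat.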

Related

Reading and writing the same file simultaneously with c++

I'm trying to read and write a file as I loop through its lines. At each line, I will do an evaluation to determine if I want to write it into the file or skip it and move on to the next line. This is basically a skeleton of what I have so far.
void readFile(char* fileName)
{
char line[1024];
fstream file("test.file", ios::in | ios::out);
if(file.is_open())
{
while(file.getline(line,MAX_BUFFER))
{
//evaluation
file.seekg(file.tellp());
file << line;
file.seekp(file.tellg());
}
}
}
As I'm reading in the lines, I seem to be having issues with the starting index of the string copied into the line variable. For example, I may be expecting the string in the line variable to be "000/123/FH/" but it actually goes in as "123/FH/". I suspect that I have an issue with file.seekg(file.tellp()) and file.seekp(file.tellg()) but I am not sure what it is.

It is not clear from your code [1] and problem description what is in the file and why you expect "000/123/FH/", but I can state that the getline function is a buffered input, and you don't have code to access the buffer. In general, it is not recommended to use buffered and unbuffered i/o together because it requires deep knowledge of the buffer mechanism and then relies on that mechanism not to change as libraries are upgraded.
You appear to want to do byte or character[2] level manipulation. For small files, you should read the entire file into memory, manipulate it, and then overwrite the original, requiring an open, read, close, open, write, close sequence. For large files you will need to use fread and/or some of the other lower level C library functions.
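For the small-file case described above (read everything, manipulate it in memory, then overwrite the original), a minimal sketch might look like this. The function names, the predicate parameter, and the sample contents are all hypothetical; the open, read, close, open, write, close sequence is the part taken from the answer.

```cpp
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// Reads every line of `path` into memory, keeps the lines for which
// `keep` returns true, and overwrites the file with the kept lines.
// Returns the number of lines kept, or -1 if the file could not be opened.
int filter_file_in_place(const std::string& path,
                         const std::function<bool(const std::string&)>& keep) {
    std::vector<std::string> kept;
    {
        std::ifstream in(path);
        if (!in) return -1;
        std::string line;
        while (std::getline(in, line)) {
            if (keep(line)) kept.push_back(line);
        }
    }  // input stream closed before the file is reopened for writing

    std::ofstream out(path, std::ios::trunc);
    if (!out) return -1;
    for (const std::string& l : kept) {
        out << l << '\n';
    }
    return static_cast<int>(kept.size());
}

// Hypothetical demo: creates a small file, drops lines without a '/',
// and returns the remaining lines joined with '|'.
std::string demo_filter(const std::string& path) {
    {
        std::ofstream seed(path, std::ios::trunc);
        seed << "000/123/FH/\nskip me\n456/789/GH/\n";
    }
    filter_file_in_place(path, [](const std::string& l) {
        return l.find('/') != std::string::npos;
    });
    std::ifstream in(path);
    std::string all, line;
    while (std::getline(in, line)) {
        all += line + "|";
    }
    return all;
}
```

Because reading and writing never happen on the same open stream, the shared file position that causes the problem in the question never comes into play.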
The best way to do this, since you are using C++, is to create your own class that handles reading up to and including a line separator [3] into one of the off-the-shelf circular buffers (that use malloc or a plug-in allocator as in the case of STL-like containers) or a circular buffer you develop as a template over a statically allocated array of bytes (if you want high speed and low resource utilization). The size will need to be at least as large as the longest line in the latter case. [4]
Either way, you would want to add to the class to open the file in binary mode and expose the desired methods to do the line level manipulations to an arbitrary line. Some say (and I personally agree) that an advantage of class encapsulation in C++ is that classes are easier to test carefully. Such a line manipulation class would encapsulate the random access C functions and unbuffered i/o and leave open the opportunity to maximize speed, while allowing for plug-and-play usage in systems and applications.
Notes
[1] The seeking of the current position is just testing the functions and does not yet, in the current state of the code, re-position the current file pointer.
[2] Note that there is a difference between character and byte level manipulations in today's computing environment where utf-8 or some other unicode standard is now more common than ASCII in many domains, especially that of the web.
[3] Note that line separators are dependent on the operating system, its version, and sometimes settings.
[4] The advantage of circular buffers in terms of speed is that you can read more than one line using fread at a time and use fast iteration to find the next end of line.

Taking inspiration from Douglas Daseeco's response, I resolved my issue by simply reading the existing file, writing its lines into a new file, then renaming the new file to overwrite the original file. Below is a skeleton of my solution.
char line[1024];
ifstream inFile("test.file");
ofstream outFile("testOut.file");
if(inFile.is_open() && outFile.is_open())
{
while(inFile.getline(line,1024))
{
// do some evaluation
if(keep)
{
outFile << line;
outFile << "\n";
}
}
inFile.close();
outFile.close();
rename("testOut.file","test.file");
}

You are reading from and writing to the same file, so you might end up with duplicate lines in the file.
You could find this very useful. Imagine your first pass through the while loop: starting from the beginning of the file, you call file.getline(line, MAX_BUFFER). The get pointer (for reading) now moves forward from your starting point by the number of characters extracted, that is, the line plus its newline (at most MAX_BUFFER - 1 characters are stored).
After you've determined that you want to write back to the file, seekp() lets you specify the location to write to relative to a reference point. Syntax: file.seekp(num_bytes, ref); where ref is ios::beg (beginning), ios::end, or ios::cur (current position in the file).
In your code, after reading, find a way to use the number of characters just read to refer to a location relative to one of those reference points.
while(file.getline(line, MAX_BUFFER))
{
...
if(/* for some reason you want to write back */)
{
// set put-pointer to location for writing
// (ref is one of ios::beg, ios::cur, ios::end)
file.seekp(num_bytes, ref);
file << line;
}
// set get-pointer to desired location for the next read
file.seekg(num_bytes, ref);
}

C++ how to check if the std::cin buffer is empty

The title is misleading because I'm more interested in finding an alternate solution. My gut feeling is that checking whether the buffer is empty is not the most ideal solution (at least in my case).
I'm new to C++ and have been following Bjarne Stroustrup's Programming Principles and Practices using C++. I'm currently on Chapter 7, where we are "refining" the calculator from Chapter 6. (I'll put the links for the source code at the end of the question.)
Basically, the calculator can take multiple inputs from the user, delimited by semi-colons.
> 5+2; 10*2; 5-1;
= 7
> = 20
> = 4
>
But I'd like to get rid of the prompt character ('>') for the last two answers, and display it again only when the user input is asked for. My first instinct was to find a way to check if the buffer is empty, and if so, cout the character and if not, proceed with couting the answer. But after a bit of googling I realized the task is not as easy as I initially thought... And also that maybe that wasn't a good idea to begin with.
I guess essentially my question is how to get rid of the '>' characters for the last two answers when there are multiple inputs. But if checking the cin buffer is possible and is not a bad idea after all, I'd love to know how to do it.
Source code: https://gist.github.com/Spicy-Pumpkin/4187856492ccca1a24eaa741d7417675
Header file: http://www.stroustrup.com/Programming/PPP2code/std_lib_facilities.h
^ You need this header file. I assume it is written by the author himself.
Edit: I did look around the web for some solutions, but to be honest none of them made any sense to me. It's been like 4 days since I picked up C++ and I have a very thin background in programming, so sometimes even googling is a little tough..

As you've discovered, this is a deceptively complicated task. This is because there are multiple issues here at play, both the C++ library, and the actual underlying file.
C++ library
std::cin, and C++ input streams, use an intermediate buffer, a std::streambuf. Input from the underlying file, or an interactive terminal, is not read character by character, but rather in moderately sized chunks, where possible. Let's say:
int n;
std::cin >> n;
Let's say that when this is done and over with, n contains the number 42. Well, what actually happened is that std::cin, more than likely, did not read just two characters, '4' and '2', but whatever additional characters, beyond that, were available on the std::cin stream. The remaining characters were stored in the std::streambuf, and the next input operation will read them, before actually reading the underlying file.
And it is equally likely that the above >> did not actually read anything from the file, but rather fetched the '4' and the '2' characters from the std::streambuf, that were left there after the previous input operation.
It is possible to examine the underlying std::streambuf, and determine whether there's anything unread there. But this doesn't really help you.
If you are about to execute the above >> operator, look at the underlying std::streambuf, and discover that it contains a single character '4', that also doesn't tell you much. You need to know what the next character is in std::cin. It could be a space or a newline, in which case all you'll get from the >> operator is 4. Or, the next character could be '2', in which case >> will swallow at least '42', and possibly more digits.
You can certainly implement all this logic yourself, look at the underlying std::streambuf, and determine whether it will satisfy your upcoming input operation. Congratulations: you've just reinvented the >> operator. You might as well just parse the input, a character at a time, yourself.
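For reference, examining the stream buffer is done through rdbuf()->in_avail(). The sketch below uses a std::istringstream as a stand-in for std::cin (an assumption: what in_avail() reports for std::cin itself is implementation-specific), and the function names are hypothetical. It only shows that characters left over after a >> remain visible in the buffer.

```cpp
#include <ios>
#include <istream>
#include <sstream>
#include <string>

// Reads one int from `in` and returns how many characters the stream
// buffer reports as immediately available afterwards.
std::streamsize buffered_after_int(std::istream& in, int& value) {
    in >> value;
    return in.rdbuf()->in_avail();
}

// Hypothetical demo on an in-memory stream.
long demo_buffered(const std::string& text) {
    std::istringstream in(text);
    int n = 0;
    return static_cast<long>(buffered_after_int(in, n));
}
```

With the input "42 rest", the >> stops at the space without consuming it, so the five characters " rest" are still reported as available. As the answer says, this count alone does not tell you whether the next >> can be satisfied.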
The underlying file
You determined that std::cin does not have sufficient input to satisfy your next input operation. Now, you need to know whether or not input is available on std::cin.
This now becomes an operating system-specific subject matter. This is no longer covered by the standard C++ library.
Conclusion
This is doable, but in all practical situations, the best solution here is to use an operating system-specific approach, instead of C++ input streams, and read and buffer your input yourself. On Linux, for example, the classical approach is to set fd 0 to non-blocking mode, so that read() does not block, and to determine whether or not there's available input, just try to read() it. If you did read something, put it into a buffer that you can look at later. Once you've consumed all previously-read buffered input, and you truly need to wait for more input to be read, poll() the file descriptor, until it's there.
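A minimal POSIX-only sketch of the "is input available right now?" check follows. It uses poll() with a zero timeout, which is a slightly different route than setting the descriptor to non-blocking mode, and it demonstrates the check on a pipe rather than on fd 0 so it does not depend on a terminal. The function names are hypothetical.

```cpp
#include <poll.h>
#include <unistd.h>

// Returns true if a read() on `fd` would not block right now.
bool fd_has_input(int fd) {
    pollfd p{};
    p.fd = fd;
    p.events = POLLIN;
    // Timeout 0: report the current state instead of waiting.
    return poll(&p, 1, 0) > 0 && (p.revents & POLLIN) != 0;
}

// Hypothetical demo using a pipe in place of the terminal.
// Returns 10 * (input available before writing) + (input available after).
int demo_poll() {
    int fds[2];
    if (pipe(fds) != 0) return -1;
    const bool before = fd_has_input(fds[0]);
    const char msg[] = "5+2;";
    const ssize_t written = write(fds[1], msg, sizeof msg - 1);
    const bool after = written > 0 && fd_has_input(fds[0]);
    close(fds[0]);
    close(fds[1]);
    return (before ? 10 : 0) + (after ? 1 : 0);
}
```

In the calculator, a check like fd_has_input(0) only describes the file descriptor; input that std::cin has already pulled into its own buffer is not visible to it, which is why the answer recommends doing all the reading and buffering yourself if you go this route.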

Using Getline on a Binary File

I have read that getline behaves as an unformatted input function. Which I believe should allow it to be used on a binary file. Let's say for example that I've done this:
ofstream output("foo.txt", ios_base::binary);
const auto foo = "lorem ipsum";
output.write(foo, strlen(foo) + 1);
output.close();
ifstream input("foo.txt", ios_base::binary);
string bar;
getline(input, bar, '\0');
Is that breaking any rules? It seems to work fine, I think I've just traditionally seen arrays handled by writing the size and then writing the array.

No, it's not breaking any rules that I can see.
Yes, it's more common to write an array with a prefixed size, but using a delimiter to mark the end can work perfectly well also. The big difference is that (like with a text file) you have to read through data to find the next item. With a prefixed size, you can look at the size, and skip directly to the next item if you don't need the current one. Of course, you also need to ensure that if you're using something to mark the end of a field, that it can never occur inside the field (or come up with some way of detecting when it's inside a field, so you can read the rest of the field when it does).
Depending on the situation, that can mean (for example) using Unicode text. This gives you a lot of options for values that can't occur inside the text (because they aren't legal Unicode). That, on the other hand, would also mean that your "binary" file is really a text file, and has to follow some basic text-file rules to make sense.
Which is preferable depends on how likely it is that you'll want to read random pieces of the file rather than reading through it from beginning to end, as well as the difficulty (if any) of finding a unique delimiter and if you don't have one, the complexity of making the delimiter recognizable from data inside a field. If the data is only meaningful if written in order, then having to read it in order doesn't really pose a problem. If you can read individual pieces meaningfully, then being able to do so is much more likely to be useful.
In the end, it comes down to a question of what you want out of your file being "binary". In the typical case, all "binary" really means is that end-of-line markers that might otherwise be translated from a new-line character to (for example) a carriage-return/line-feed pair won't be. Depending on the OS you're using, it might not even mean that much though--for example, on Linux, there's normally no difference between binary and text mode at all.
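The size-prefixed layout this answer compares against the delimiter approach can be sketched as below. Everything here is an illustrative assumption: the function names, the 32-bit length field, and the use of host byte order (a real file format would normally fix the byte order). A std::stringstream stands in for the binary file.

```cpp
#include <cstdint>
#include <ios>
#include <sstream>
#include <string>

// Writes a 32-bit length (host byte order, an assumption) and then the bytes.
void write_prefixed(std::ostream& out, const std::string& s) {
    const std::uint32_t n = static_cast<std::uint32_t>(s.size());
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    out.write(s.data(), static_cast<std::streamsize>(s.size()));
}

// Reads one length-prefixed record into `s`.
bool read_prefixed(std::istream& in, std::string& s) {
    std::uint32_t n = 0;
    if (!in.read(reinterpret_cast<char*>(&n), sizeof n)) return false;
    s.resize(n);
    return n == 0 || static_cast<bool>(in.read(&s[0], n));
}

// Skips one record without reading its payload.
bool skip_prefixed(std::istream& in) {
    std::uint32_t n = 0;
    if (!in.read(reinterpret_cast<char*>(&n), sizeof n)) return false;
    return static_cast<bool>(in.seekg(n, std::ios::cur));
}

// Hypothetical demo: the first record contains a '\0', which would
// confuse a '\0'-delimited format; it is skipped and the second is read.
std::string demo_prefixed() {
    std::stringstream buf(std::ios::in | std::ios::out | std::ios::binary);
    write_prefixed(buf, std::string("lorem\0ipsum", 11));
    write_prefixed(buf, "dolor");
    std::string s;
    skip_prefixed(buf);
    read_prefixed(buf, s);
    return s;
}
```

The skip function is the practical difference described above: with a prefix you can jump over a record, while with a delimiter you have to scan through it.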

Well, there are no rules broken and you'll get away with that just fine, except that you may miss the precision of reading binary from a stream object.
With binary input, you usually want to know how many characters were read successfully, which you can obtain afterwards with gcount()... Using std::getline will not reflect the bytes read in gcount().
Of course, you can simply get such info from the size of the string you passed into std::getline. But the stream will no longer encapsulate the number of bytes you consumed in the last unformatted operation.
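The gcount() difference can be seen in a small sketch. The function names and the sample record are hypothetical, and an in-memory stream stands in for the binary file. The member function istream::getline updates gcount() (and counts the extracted delimiter), while the free function std::getline leaves it untouched.

```cpp
#include <sstream>
#include <string>

// The 12 bytes written in the question: "lorem ipsum" plus the trailing '\0'.
std::string sample_record() {
    return std::string("lorem ipsum\0", 12);
}

// Member getline: an unformatted input function that updates gcount().
long gcount_after_member_getline(const std::string& data) {
    std::istringstream in(data);
    char buf[64];
    in.getline(buf, sizeof buf, '\0');
    return static_cast<long>(in.gcount());
}

// Free std::getline: reads the same bytes but does not affect gcount().
long gcount_after_free_getline(const std::string& data) {
    std::istringstream in(data);
    std::string out;
    std::getline(in, out, '\0');
    return static_cast<long>(in.gcount());
}

// Length of the string produced by std::getline (delimiter not stored).
long length_from_free_getline(const std::string& data) {
    std::istringstream in(data);
    std::string out;
    std::getline(in, out, '\0');
    return static_cast<long>(out.size());
}
```

So with std::getline the byte count has to be derived from the string size (plus one for the consumed delimiter, when one was found) rather than read from the stream.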

Read/Write at the same time

What I am doing is opening my file using fstream at the start of the main and closing it at the end. In between I am writing "Hello World" and after that reading what I wrote but the result is always weird characters and not the "Hello World". I did do a cast to char but that didn't help. Is there any way I can do this?

You need to interpose an fseek call when you switch from reading to writing, or vice versa. (Of course, you also need to fopen for "r+" or the like, so that both reading and writing are allowed, but I imagine you are already aware of that -- the need for seeking in order to switch between reading and writing is a lesser known fact).
As this page puts it,
For the modes where both read and
writing (or appending) are allowed
(those which include a "+" sign), the
stream should be flushed (fflush) or
repositioned (fseek, fsetpos, rewind)
between either a reading operation
followed by a writing operation or a
writing operation followed by a
reading operation.

I'd be amused if this works, because I always had to open a file twice to do that: once for reading and once for writing. Even then, I had to write the whole file out and close it (which flushed the OS buffers) before I could be sure I could read the whole file and not get an early EOF.
Nowadays, since I use Unix-style operating systems, I would just use the pipe() function. Not sure if that works in Windows (because so much doesn't, like select() on files).

Make sure you are seeking to the beginning of the file before reading, like so:
fileFStream.seekg(0, ios_base::beg);
If that doesn't work, post your code.
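Putting the answers above together for std::fstream, a minimal sketch of writing and then reading through the same stream could look like this. The function name and the path passed to it are hypothetical; the key step is the seekg between the write and the read.

```cpp
#include <fstream>
#include <string>

// Writes "Hello World" to `path` and reads it back through the same stream.
std::string write_then_read(const std::string& path) {
    // in | out allows both directions; trunc creates or empties the file.
    std::fstream file(path, std::ios::in | std::ios::out | std::ios::trunc);
    if (!file) return "";

    file << "Hello World";

    // Reposition before switching from writing to reading.
    file.seekg(0, std::ios::beg);

    std::string text;
    std::getline(file, text);
    return text;
}
```

Without the seek, the read would start at the position just past what was written, which is one plausible source of the odd characters described in the question.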

Can we write an EOF character ourselves?

Most languages, like C++, put an EOF character when writing into a file, even if we forget to write statements like:
filestream.close
However, is there any way we can place the EOF character ourselves, according to our requirements, in C++ for instance?
Or is there any other method we may use apart from the functions provided in C++?
If you need more information, kindly leave a comment.
EDIT:
What if we want to trick the OS by placing an EOF character in a file and writing some data after it, so that an application like notepad.exe is not able to read past our EOF character?
I have read answers to questions related to this topic and have learned that nowadays an OS generally doesn't look for an EOF character but rather checks the length of the file. Still, there must be a procedure in the OS that checks the length of the file and then updates the file records.
I am sorry if I am wrong at any point in my reasoning, but please do help me, because it could lead to a lot of new ideas.

There is no EOF character. EOF by definition "is unequal to any valid character code". Often it is -1. It is not written into the file at any point.
There is a historical EOF character value (CTRL+Z) in DOS, but it is obsolete these days.
To answer the follow-up question of Apoorv: The OS never uses the file data to determine file length (files are not 'null terminated' in any way). So you cannot trick the OS. Perhaps old, stupid programs won't read after CTRL+Z character. I wouldn't assume that any Windows application (even Notepad) would do that. My guess is that it would be easier to trick them with a null (\0) character.
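A quick way to check the "you cannot trick the OS" claim is to put a Ctrl-Z byte (0x1A) in the middle of a file and read the file back. The sketch below does this in binary mode; the function name and the path are hypothetical. Text-mode behavior on Windows may differ, as another answer on this page describes.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Writes "abc", a Ctrl-Z byte (0x1A), then "def", and returns how many
// bytes can be read back from the file in binary mode.
std::size_t bytes_readable_past_ctrl_z(const std::string& path) {
    {
        std::ofstream out(path, std::ios::binary | std::ios::trunc);
        out << "abc";
        out.put('\x1A');
        out << "def";
    }  // closed here, so the data is on disk before reading

    std::ifstream in(path, std::ios::binary);
    const std::string data((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    return data.size();
}
```

All seven bytes come back: the 0x1A is ordinary data, and the end of the file is still determined by the stored length.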

Well, EOF is just a value returned by the functions defined in the C stdio.h header file. It's actually returned to all the reading functions by the OS, so it's system dependent. When the OS reaches the end of the file, it signals this to the function, which then most commonly places (-1) in its return value, but not always. So, to summarize, EOF is not a character, but a constant returned by the OS.
EDIT: Well, you need to know more about filesystem, look at this.
Hi, to your second question:
once again, you should look better into filesystems. FAT is a very nice example because you can find many articles about it, and its principles are very similar to NTFS. Anyway, once again, EOF is NOT a character. You cannot place it in file directly. If you could do so, imagine the consequences, even "dumb" image file could not be read by the system.
Why? Because OS works like very complex structure of layers. One of the layers is the filesystem driver. It makes sure that it transfers data from every filesystem known to the driver. It provides a bridge between applications and the actual system of storing files into HDD.
To be exact, FAT filesystem uses the so-called FAT table - it is a table located close to the start of the HDD (or partition) address space, and it contains map of all clusters (little storage cells). OK, so now, when you want to save some file to the HDD, OS (filesystem driver) looks into FAT table, and searches for the value "0x0". This "0x0" value says to the OS that cluster which address is described by the location of that value in FAT table is free to write.
So it writes the first part of the file into it. Then, it looks for another "0x0" value in FAT, and if found, it writes the second part of the file into the cluster which it points to. Then, it changes the value of the first FAT table record where the file is located to the physical address of the next (in our case, second) part of the file.
When your file is all stored on the HDD, there comes the final part: it writes the desired EOF value, but into the FAT table, not into the "data part" of the HDD. So when the file is read next time, it knows this is the end and doesn't look any further.
So, now you see, if you would want to manually write EOF value into the place it doesn't belong to, you have to write your own driver which would be able to rewrite the FAT record, but this is practically impossible to do for beginners.

I came here while going through the Kernighan & Ritchie C exercises.
Ctrl+D sends the character that matches the EOF constant from stdio.h.
(Edit: this is on Mac OS X; thanks to @markmnl for pointing out that the Windows 10 equivalent is Ctrl+Z)

Actually in C++ there is no physical EOF character written to a file using either the fprintf() or ostream mechanisms. EOF is an I/O condition to indicate no more data to read.
Some early disk operating systems like CP/M actually did use a physical 0x1A (ASCII SUB character) to indicate EOF because the file system only maintained file size in blocks so you never knew exactly how long a file was in bytes. With the advent of storing actual length counts in the directory it is no longer typical to store an "EOF" character as part of the 'in-band' file data.

Under Windows, if you encounter an ASCII 26 (EOF) in stdin, it will stop reading the rest of the data. I believe writing this character will also terminate output sent to stdout, but I haven't confirmed this. You can switch the stream to binary mode as in this SO question:
#include <io.h>
#include <fcntl.h>
...
_setmode(0, _O_BINARY);
And not only will you stop 0x0A being converted to 0x0D 0x0A, but you'll also gain the ability to read/write 0x1A as well. Note you may have to switch both stdin (0) and stdout (1).

If by the EOF character you mean something like Control-Z, then modern operating systems don't need such a thing, and the C++ runtime will not write one for you. You can of course write one yourself:
filestream.put( 26 ); // write Ctrl-Z
but there is no good reason to do so. There is also no need to do:
filestream.close();
as the file stream will be closed for you automatically when its destructor is called, but it is (I think) good practice to do so.

There is no such thing as the "EOF" character. The fact of closing the stream in itself is the "EOF" condition.
When you press Ctrl+D in a unix shell, that simply closes the standard input stream, which in turn is recognized by the shell as "EOF" and it exits.
So, to "send" an "EOF", just close the stream to which the "EOF" needs to be sent.

Nobody has yet mentioned the [f]truncate system calls, which are how you make a file shorter without recreating it from scratch.
The truncate() and ftruncate() functions cause the regular file named by path or referenced by fd to be truncated to a size of precisely length bytes.
If the file previously was larger than this size, the extra data is lost. If the file previously was shorter, it is extended, and the extended part reads as null bytes ('\0').
Understand that this is a distinct operation from writing any sort of data to the file. The file is a linear array of bytes, laid out on disk somehow, with metadata that says how long it is; truncate changes the metadata.
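A POSIX-only sketch of truncate() in action follows. The function name, the path, and the ten-byte sample content are hypothetical. It writes a file, changes its length through the metadata, and reads back whatever the file now contains.

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <unistd.h>  // truncate (POSIX)

// Writes "0123456789" to `path`, sets the file length to `new_size` bytes
// with truncate(), and returns the resulting file contents.
std::string contents_after_truncate(const std::string& path, long new_size) {
    {
        std::ofstream out(path, std::ios::binary | std::ios::trunc);
        out << "0123456789";
    }  // closed before the length is changed

    if (truncate(path.c_str(), static_cast<off_t>(new_size)) != 0) {
        return "<truncate failed>";
    }

    std::ifstream in(path, std::ios::binary);
    return std::string((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());
}
```

Shrinking to 4 bytes leaves "0123"; growing to 12 bytes keeps the original ten and appends two null bytes. No byte was written to mark the new end in either case.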

On modern filesystems EOF is not a character, so you don't have to write it when you finish writing to a file. You just have to close the file or let the OS do it for you when your process terminates.

Yes, you can manually add EOF to a file.
1) In the Mac terminal, create a new file: touch filename.txt
2) Open the file in vi:
vi filename.txt
3) In Insert mode (hit i), type Control+V and then Control+D. Do not let go of the Control key on the Mac.
Alternatively, if I want other ^NewLetters, like ^N^M^O^P, etc., I could do Control+V and then Control+NewLetter. So for example, to do ^O, hold down Control, and then type V and O, then let go of Control.
