I'm writing a set of unit tests that write calculated values out to files. Each test produces a square matrix that holds anywhere from 50,000 to 500,000 doubles, and I have a total of 128 combinations of test cases.
Is there any significant overhead involved in writing cout statements and then piping that output to files, or would I be better off writing directly to the file using an ofstream?
This is going to be dependent on your system and environment. There is likely to be very little difference, but there is only one way to be sure: try both approaches and measure them.
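For instance, a minimal sketch of such a measurement (the file name, matrix size, and fill value are placeholders) might time the ofstream path with <chrono>, then be rerun with the write aimed at cout and the program's output piped to a file:

#include <chrono>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    // Placeholder data: one 500,000-element matrix of doubles.
    std::vector<double> m(500000, 3.14);

    auto t0 = std::chrono::steady_clock::now();
    std::ofstream out("matrix.bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(m.data()),
              m.size() * sizeof(double));
    out.close();
    auto t1 = std::chrono::steady_clock::now();

    // Report on stderr so it doesn't pollute stdout when you rerun
    // this with the write aimed at std::cout and pipe stdout to a file.
    std::cerr << std::chrono::duration<double>(t1 - t0).count() << " s\n";
}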
Since the dimensions involved are so large, I'm assuming that these files are not meant to be read by a human being? Just make sure you write them out as binary and not as human-readable text, because that will make far more difference than the choice between using an ofstream and piping cout.
Whether this means you have to use ofstream or not I don't know. I've never written binary to cout, so I can't say whether that's possible...
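To illustrate how much the binary/text choice matters, here is a sketch (the function and file names are invented) contrasting the two ways of writing the same doubles:

#include <fstream>
#include <vector>

void writeText(const std::vector<double>& m)
{
    std::ofstream out("matrix.txt");      // text: each value formatted as characters
    for (double d : m)
        out << d << '\n';
}

void writeBinary(const std::vector<double>& m)
{
    std::ofstream out("matrix.bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(m.data()),  // raw bytes, no conversion
              m.size() * sizeof(double));
}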
As Charles Bailey said, it's implementation dependent; what follows applies mostly to Linux with the GNU toolchain, but I can hardly imagine it being very different on other operating systems.
In libstdc++ 4.4.2:
An fstream contains an underlying stdio_filebuf, which is a basic_filebuf. This basic_filebuf contains its own buffer by inheriting from basic_streambuf, and actually contains a __basic_file, itself containing an underlying plain C stdio abstraction (FILE* or std::__c_file*), into which it flushes the buffer.
cout, which is an ostream, is initialized with a stdio_sync_filebuf, itself initialized with the C file abstraction stdout. stdio_sync_filebuf calls plain C stdio functions.
Considering only C++, it appears that an fstream may be more efficient thanks to its two layers of buffering.
Considering only C, if the process is forked with the stdout file descriptor redirected to a file, there should be no difference between writing to a newly opened file (which is what fstream does in the end) and writing to stdout (which is what cout does in the end), since the fd points to a file either way.
If I were you, I would use an fstream, since that's your intent.
Related
I'm trying to read and write a file as I loop through its lines. At each line, I do an evaluation to determine whether I want to write it into the file or skip it and move on to the next line. This is basically a skeleton of what I have so far.
const int MAX_BUFFER = 1024;

void readFile(char* fileName)
{
    char line[MAX_BUFFER];
    fstream file(fileName, ios::in | ios::out);
    if (file.is_open())
    {
        while (file.getline(line, MAX_BUFFER))
        {
            // evaluation
            file.seekg(file.tellp());
            file << line;
            file.seekp(file.tellg());
        }
    }
}
As I'm reading in the lines, I seem to be having issues with the starting index of the string copied into the line variable. For example, I may be expecting the string in the line variable to be "000/123/FH/" but it actually goes in as "123/FH/". I suspect that I have an issue with file.seekg(file.tellp()) and file.seekp(file.tellg()) but I am not sure what it is.
It is not clear from your code [1] and problem description what is in the file and why you expect "000/123/FH/", but I can state that getline performs buffered input, and you don't have code to access the buffer. In general, mixing buffered and unbuffered I/O is not recommended, because it requires deep knowledge of the buffering mechanism and then relies on that mechanism not changing as libraries are upgraded.
You appear to want to do byte- or character-level [2] manipulation. For small files, you should read the entire file into memory, manipulate it, and then overwrite the original, requiring an open, read, close, open, write, close sequence. For large files you will need fread and/or some of the other lower-level C library functions.
The best way to do this, since you are using C++, is to create your own class that handles reading up to and including a line separator [3] into either one of the off-the-shelf circular buffers (which use malloc or a plug-in allocator, as in the case of STL-like containers) or a circular buffer you develop as a template over a statically allocated array of bytes (if you want high speed and low resource utilization). In the latter case, the buffer size will need to be at least as large as the longest line. [4]
Either way, you would want to extend the class to open the file in binary mode and expose the desired methods for performing line-level manipulations on an arbitrary line. Some say (and I personally agree) that one benefit of Bjarne Stroustrup's class encapsulation in C++ is that such classes are easier to test carefully. A line manipulation class of this kind would encapsulate the random access C functions and unbuffered I/O, leaving open the opportunity to maximize speed while allowing plug-and-play usage in systems and applications.
Notes
[1] Seeking to the current position just exercises the functions; in the current state of the code it does not actually re-position the file pointer.
[2] Note that there is a difference between character- and byte-level manipulation in today's computing environment, where UTF-8 or some other Unicode encoding is now more common than ASCII in many domains, especially the web.
[3] Note that line separators are dependent on the operating system, its version, and sometimes settings.
[4] The speed advantage of circular buffers is that you can read more than one line at a time with fread and then use fast iteration to find the next end of line (see the sketch below).
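As a rough illustration of note [4], here is a sketch (the buffer size and function name are arbitrary); a real circular buffer would also carry the partial line at the end of each block over into the next read:

#include <cstddef>
#include <cstdio>
#include <cstring>

void scanLines(const char* path)
{
    std::FILE* f = std::fopen(path, "rb");   // binary mode: no newline translation
    if (!f) return;

    char buf[64 * 1024];
    std::size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
    {
        const char* p = buf;
        const char* end = buf + n;
        // Fast iteration over the newlines in the block just read.
        while (const char* nl = static_cast<const char*>(
                   std::memchr(p, '\n', end - p)))
        {
            // [p, nl) is one complete line.
            p = nl + 1;
        }
        // Bytes in [p, end) belong to a line continuing in the next read.
    }
    std::fclose(f);
}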
Taking inspiration from Douglas Daseeco's response, I resolved my issue by simply reading the existing file, writing its lines into a new file, then renaming the new file to overwrite the original file. Below is a skeleton of my solution.
char line[1024];
bool keep = false;   // set by the evaluation below
ifstream inFile("test.file");
ofstream outFile("testOut.file");
if (inFile.is_open() && outFile.is_open())
{
    while (inFile.getline(line, 1024))
    {
        // do some evaluation, setting keep
        if (keep)
        {
            outFile << line << "\n";
        }
    }
    inFile.close();
    outFile.close();
    rename("testOut.file", "test.file");
}
If you are reading and writing to the same file, you might end up having duplicate lines in the file.
You might find this very useful. Imagine reaching the while loop for the first time: starting from the beginning of the file, you do file.getline(line, MAX_BUFFER). Now the get pointer (for reading) has advanced past the characters just read (at most MAX_BUFFER of them) from the beginning of the file (your starting point).
After you've determined that you want to write back to the file, seekp() lets you specify the location to write to, relative to a reference point. The syntax is file.seekp(num_bytes, ref); where ref is ios::beg (the beginning), ios::end, or ios::cur (the current position in the file).
As in your code, after reading, find a way to express the location you want relative to one of those reference points.
while (file.getline(line, MAX_BUFFER))
{
    ...
    if (/* for some reason you want to write back */)
    {
        // set the put-pointer to the location for writing
        file.seekp(num_bytes, ios::beg);   // or ios::cur / ios::end
        file << line;
    }
    // set the get-pointer to the desired location for the next read
    file.seekg(num_bytes, ios::beg);       // or ios::cur / ios::end
}
I'm learning to write binary files in C++. I'm a bit confused with the result. Let's say I have this code:
#include <fstream>
#include <string>
using namespace std;

int main()
{
    ofstream file;
    string text = "Some text over here";
    file.open("test.bin", ios::out | ios::binary);
    file.write(text.c_str(), text.length());
    file.close();
    return 0;
}
I'm expecting the output file test.bin to be "in binary", but when I look at it in notepad, I see normal text:
Some text over here
Is my expectation wrong? What makes things binary and what should I use to achieve it?
The most "important" definition of what the word "binary" means comes from just a situation where a number can take on one of two values. Whatever you call those doesn't strictly matter ("on"/"off", "1"/"0", "yes"/"no"). All that matters is that there are just two states.
Keep that core definition in mind. But you will find a large number of other idiomatic usages of the word "binary" in the computer world, depending on context.
As an example: Some people will refer to a file representing an executable image (such as an .EXE file on Windows) as simply "a binary" or ("the binary", when compiling a certain codebase and you know what executable you'd be talking about.)
You've tripped across another confusing distinction of how sometimes people will talk about a file format as being either "textual" or "binary". Yet today's computers are based on systems that are always binary (technically they don't have to be). So if "textual" files aren't stored ultimately as binary bits somewhere, how else would they be stored? :-/
So really what it means for a file format to be labeled as "textual" is to say that it is "stricter about what binary patterns it uses, such that it will only use those patterns which make sense in certain textual encodings". That's why those files look readable when you load them up in text editors.
So a "textual file format" is a subset of all "file formats". And sometimes when people want to refer to something that is not in that subset of textual files, they will call it a "binary file format".
Plenty of room for confusion! But the upshot is that all you do when you open a file in "textual" vs. "binary" mode in C++ is tell the stream whether you intend to restrict yourself to the bit patterns that look good in a text editor. Opening in binary asks for all bytes to be sent to the file verbatim, instead of having the stream take care of cross-platform differences in newline handling "under the hood" as a convenience.
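A small sketch of that newline convenience (the observable effect is platform-dependent: on Windows the text-mode file gains carriage returns, while on Unix-like systems the two files come out identical):

#include <fstream>

int main()
{
    std::ofstream text("text.txt");                  // text mode (the default)
    text << "line\n";    // '\n' may be translated to "\r\n" on Windows

    std::ofstream bin("bin.txt", std::ios::binary);  // binary mode
    bin << "line\n";     // the byte 0x0A is written verbatim
}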
As far as I know, there are two standard ways to read a text file in C++ (in this case, two numbers on every line):
The two standard methods are:
Assume that every line consists of 2 numbers and read token by token:
#include <fstream>
std::ifstream infile("thefile.txt");
int a, b;
while (infile >> a >> b)
{
// process pair (a,b)
}
Line-based parsing, using string streams:
#include <sstream>
#include <string>
#include <fstream>
std::ifstream infile("thefile.txt");
std::string line;
while (std::getline(infile, line))
{
std::istringstream iss(line);
int a, b;
if (!(iss >> a >> b)) { break; } // error
// process pair (a,b)
}
And I can also use the code below to see whether the file has ended or not:
while (!infile.eof())
My question is:

Question 1: How do these functions understand that a given line is the last line? I mean, how does eof() return true or false?

As far as I know, they are reading a part of memory. What is the difference between the part that belongs to the file and the parts that do not?

Question 2: Is there any way to cheat this function? I mean, is it possible to add something in the middle of the text file (for example with a hex editor tool) and make eof() wrongly return true in the middle of the text file?
Appreciate your time and consideration.
Question 1: How do these functions understand that a given line is the last line? I mean, how does eof() return true or false?
It doesn't. The functions know when you've tried to read past the very last character in the file. They don't necessarily know whether a line is the last line. "Files" aren't the only things you can read with streams. Keyboard input, a special-purpose device, internet sockets: all can be read with the right kind of I/O stream. When reading from standard input, the stream has no way of knowing whether the very next thing I type is control-Z.
With regard to files on a computer disk, most modern operating systems store metadata about a file separately from the file itself. These metadata include the length of the file (and often when the file was last modified and when it was last read). On these systems, the stream buffer that underlies the I/O stream knows the current read location within the file and knows how long the file is. The stream buffer signals EOF when the read location reaches the length of the file.
That's not universal, however. There are some not-so-common operating systems that don't use this concept of metadata stored elsewhere. End of file on a disk file is just as surprising on these systems as is end of file from user input on a keyboard.
As far as I know, they are reading a part of memory. What is the difference between the part that belongs to the file and the parts that do not?
Learn the difference between memory and disk files. There's a huge difference between the two. Unless you're working with an embedded computer, memory is much more limited than is disk space.
Question 2: Is there any way to cheat this function? I mean, is it possible to add something in the middle of the text file (for example with a hex editor tool) and make eof() wrongly return true in the middle of the text file?
That depends very much on how the operating system implements files. On most modern operating systems, the answer is not just "no" but "No!". The concept of using some special signature that indicates end of file in a disk file is one of many computer science concepts that for the most part have been dumped into the pile of "that wasn't very smart" ideas. You asked your question on the internet. That most likely means you are using a Windows machine, a Linux machine, or a Mac. All of them store the length of a file as metadata separate from the contents of a file.
However, there is a need for the ability to clear the end of file indicator. One program might be writing to a file while another is simultaneously reading from it. The reader might hit EOF while the writer is still active. The reader needs to clear the EOF indicator to continue reading what the writer has written. The C++ I/O streams provide the ability to do just that: every I/O stream has a clear function. Whether it works is a different story. The clear will work temporarily, but the very next read might well set the EOF bit again. For example, when I type control-Z on my keyboard, that means I am done interacting with the program, period. My next action might well be to go out for lunch.
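As a sketch of that reader/writer scenario (the file name and polling interval are arbitrary, and whether the stream buffer picks up newly appended data after clear() is platform-dependent):

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main()
{
    std::ifstream in("log.txt");
    std::string line;
    for (;;)
    {
        while (std::getline(in, line))
            std::cout << line << '\n';   // consume whatever is available

        if (in.eof())
        {
            in.clear();                  // drop eofbit (and failbit) to resume reading
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
        else
        {
            break;                       // a real error, not just end of file
        }
    }
}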
I've been trying to read up on iostreams and understand them better. Occasionally I find it stressed that inserters (<<) and extractors (>>) are meant to be used for textual serialization. It comes up in a few places, but this article is a good example:
http://spec.winprog.org/streams/
Outside of the <iostream> universe, there are cases where the << and >> are used in a stream-like way yet do not obey any textual convention. For instance, they write binary encoded data when used by Qt's QDataStream:
http://doc.qt.nokia.com/latest/qdatastream.html#details
At the language level, the << and >> operators belong to your project to overload (hence what QDataStream does is clearly acceptable). My question would be whether it is considered a bad practice for those using <iostream> to use the << and >> operators to implement binary encodings and decodings. Is there (for instance) any expectation that if written to a file on disk that the file should be viewable and editable with a text editor?
Should one always be using other method names and base them on read() and write()? Or should textual encodings be considered merely a default behavior that classes integrating with the standard library iostream can elect to ignore?
UPDATE: A key terminology issue here seems to be the distinction between I/O that is "formatted" vs. "unformatted" (as opposed to the terms "textual" vs. "binary"). I found this question:
writing binary data (std::string) to an std::ofstream?
It has a comment from #TomalakGeret'kal saying "I'd not want to use << for binary data anyway, as my brain reads it as "formatted output" which is not what you're doing. Again, it's perfectly valid, but I just would not confuse my brain like that."
The accepted answer to the question says it's fine as long as you use ios::binary. That seems to bolster the "there's nothing wrong with it" side of the debate...but I still don't see any authoritative source on the issue.
Actually the operators << and >> are bit shift operators; using them for I/O is strictly speaking already a misuse. However that misuse is about as old as operator overloading itself, and I/O today is the most common usage of them, therefore they are widely regarded as I/O insertion/extraction operators. I'm pretty sure if there weren't the precedent of iostreams, nobody would use those operators for I/O (especially with C++11 which has variadic templates, solving the main problem which using those operators solved for iostreams, in a much cleaner way). On the other hand, from the language point of view, overloaded operator<< and operator>> can mean whatever you want them to mean.
So the question boils down to what would be an acceptable use of those operators. For this, I think one has to distinguish two cases: First, new overloads working on iostream classes, and second, new overloads working on other classes, possibly designed to work like iostreams.
Let's consider first new operators on iostream classes. Let me start with the observation that the iostream classes are all about formatting (and the reverse process, which could be called "deformatting"; "lexing" IMHO wouldn't be quite the right term here because the extractors don't determine the type, but only try to interpret the data according to the type given). The classes responsible for the actual I/O of raw data are the streambufs. However note that a proper binary file is not a file where you just dump internal raw data. Just like a text file (actually even more so), a binary file should have a well-specified encoding of the data it contains. Especially if the files are expected to be read on different systems. Therefore the concept of formatted output makes perfect sense also for binary files; just the formatting is different (e.g. writing a pre-determined number of bytes with the most significant one first for an integer value).
The iostreams themselves are classes which are intended to work on text files, that is, on files whose content is interpreted as textual representation of data. A lot of built-in behaviour is optimized for that, and may cause problems if used on binary files. An obvious example is that by default spaces are skipped before any input is attempted. For a binary file, this would be clearly the wrong behaviour. Also the use of locales doesn't make sense for binary files (although one might argue that there could be a "binary locale", but I don't think locales as defined for iostreams provide a suitable interface for that). Therefore I'd say writing binary operator<< or operator>> for iostream classes would be wrong.
The other case is where you define a separate class for binary input/output (possibly reusing the streambuf layer for doing the actual I/O). Since we are now speaking about different classes, the argumentation above doesn't apply any more. So the question now is: Should operator<< and operator>> on I/O be regarded as "text insertion/extraction operators" or more generally as "formatted data insertion/extraction operators"? The standard classes only use them for text, but then, there are no standard classes for binary I/O insertion/extraction at all, so the standard usage cannot distinguish between the two.
I personally would say that binary insertion/extraction is close enough to textual insertion/extraction that this usage is justified. Note that you also could make meaningful binary I/O manipulators, e.g. bigendian, littleendian and intwidth(n) to determine the format in which integers are to be output.
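Purely as an illustration (BinaryWriter and its operator are invented names, not a standard facility), such a binary insertion operator might look like this:

#include <cstdint>
#include <ostream>

// A thin wrapper whose operator<< emits a fixed-width,
// big-endian encoding instead of text.
class BinaryWriter
{
    std::ostream& os_;
public:
    explicit BinaryWriter(std::ostream& os) : os_(os) {}

    BinaryWriter& operator<<(std::uint32_t v)
    {
        char b[4] = { char(v >> 24), char(v >> 16), char(v >> 8), char(v) };
        os_.write(b, 4);   // most significant byte first
        return *this;
    }
};

// Usage sketch: BinaryWriter(file) << length << checksum;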
Beyond that there's also the use of those operators for things which are not really I/O (and where you wouldn't even think of using the streambuf layer), like reading from or inserting into a container. In my opinion, that already constitutes misuse of the operators, because there the data isn't translated into or out of a different format. It is just stored in a container.
The abstraction of the iostreams in the standard is that of a textually formatted stream of data; there is no support for any non-text format. That is the abstraction of iostreams. There's nothing wrong with defining a different stream class whose abstraction is a binary format, but doing so in an iostream will likely break existing code, and not work.
The overloaded operators >> and << perform formatted I/O. The remaining I/O functions (put, get, read, write, etc.) perform unformatted I/O. Unformatted I/O means that the library accepts just a buffer, a sequence of unsigned characters, as its input. That buffer might contain a textual message or binary content; it's the application's responsibility to interpret it. Formatted I/O, however, takes the locale into consideration. In the case of text files, depending on the environment where the application runs, some special character conversion may occur in input/output operations to adapt them to a system-specific text file format. In many environments, such as most UNIX-based systems, it makes no difference whether a file is opened as a text file or a binary file. Note that you can overload operator >> and << for your own types. That means you are capable of applying formatted I/O, without locale information, to your own types, though that's a bit tricky.
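To make the locale point concrete, here is a sketch (the locale name is platform-dependent, and std::locale will throw if it is not installed):

#include <fstream>
#include <locale>

int main()
{
    double d = 1234.5;

    std::ofstream out("demo.txt");
    out.imbue(std::locale("de_DE.UTF-8"));  // formatted output now uses German conventions
    out << d;                               // formatted: may come out as "1234,5"

    out.write(reinterpret_cast<const char*>(&d), sizeof d);  // unformatted: the raw 8 bytes
}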
What I am doing is opening my file using fstream at the start of main and closing it at the end. In between, I am writing "Hello World" and then reading back what I wrote, but the result is always weird characters and not "Hello World". I did try a cast to char, but that didn't help. Is there any way I can do this?
You need to interpose an fseek call when you switch from reading to writing, or vice versa. (Of course, you also need to fopen with "r+" or the like so that both reading and writing are allowed, but I imagine you are already aware of that; the need to seek in order to switch between reading and writing is a lesser-known fact.)
As this page puts it,
For the modes where both reading and writing (or appending) are allowed (those which include a "+" sign), the stream should be flushed (fflush) or repositioned (fseek, fsetpos, rewind) between a reading operation followed by a writing operation, or a writing operation followed by a reading operation.
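Translated to C++ streams, a minimal sketch of the same rule: write, reposition (which also flushes the pending output), then read:

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::fstream f("test.txt", std::ios::in | std::ios::out | std::ios::trunc);
    f << "Hello World";

    f.seekg(0, std::ios::beg);   // reposition between writing and reading

    std::string s;
    std::getline(f, s);
    std::cout << s << '\n';      // prints "Hello World"
}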
I'd be amused if this works, because I always had to open a file twice to do that: once for reading and once for writing. Even then, I had to write the whole file out and close it (which flushed the OS buffers) before I could be sure I could read the whole file and not get an early EOF.
Nowadays, since I use Unix-style operating systems, I would just use the pipe() function. Not sure if that works in Windows (because so much doesn't, like select() on files).
Make sure you are seeking to the beginning of the file before reading, like so:
fileFStream.seekg(0, ios_base::beg);
If that doesn't work, post your code.