Optimal way of reading a complete file to a string using fstream?


Many other posts, like "Read whole ASCII file into C++ std::string", explain what some of the options are but do not describe the pros and cons of the various methods in any depth. I want to know why one method is preferable to another.
All of these use std::fstream to read the file into a std::string. I am unsure what the costs and benefits of each method are. Let's assume this is for the common case where the files are known to be of some smallish size that memory can easily accommodate; clearly, reading a multi-terabyte file into memory is a bad idea no matter how you do it.
The most common way, after a few Google searches, to read a whole file into a std::string involves using std::getline and appending a newline character after each line. This seems needless to me, but is there some performance or compatibility reason that makes it ideal?
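std::string Results;
std::string Line;
std::ifstream ResultReader("file.txt");
while (std::getline(ResultReader, Line)) // loop ends when no further line can be read
{
    Results += Line; // getline overwrites its target, so accumulate separately
    Results.push_back('\n');
}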
Another way I pieced together is to change the getline delimiter so it is something not in the file. The EOF char seems unlikely to be in the middle of the file, so it seems a likely candidate. This includes a cast, so there is at least one reason not to do it, but it does read the file at once with no string concatenation. Presumably there is still some cost for the delimiter checks. Are there any other good reasons not to do this?
std::string Results;
std::ifstream ResultReader("file.txt");
std::getline(ResultReader, Results, (char)std::char_traits<char>::eof());
The cast means that systems that define std::char_traits<char>::eof() as something other than -1 might have problems. Is this a practical reason not to choose this over other methods that use std::getline and string::push_back('\n')?
How do these compare to other ways of reading the file at once, like the one in this question: Read whole ASCII file into C++ std::string
std::ifstream ResultReader("file.txt");
std::string Results((std::istreambuf_iterator<char>(ResultReader)),
std::istreambuf_iterator<char>());
It would seem this would be best. It offloads almost all the work onto the standard library, which ought to be heavily optimized for the given platform. I see no reason for checks other than stream validity and the end of the file. Is this ideal, or are there problems with this that I have not seen?
Does the standard or do details of some implementation provide reasons to prefer some method over another? Have I missed some method that might prove ideal in a wide variety of circumstances?
What is the simplest, most idiomatic, best performing and standard compliant way of reading a whole file into a std::string?
EDIT - 2
This question has prompted me to write a small suite of benchmarks. They are MIT licensed and available on GitHub at: https://github.com/Sqeaky/CppFileToStringExperiments
Fastest - TellSeekRead and CTellSeekRead - These have the system provide an easy way to get the size and read the file in one go.
Faster - Getline Appending and Eof - The checking of chars does not seem to impose any cost.
Fast - RdbufMove and Rdbuf - The std::move seems to make no difference in release.
Slow - Iterator, BackInsertIterator and AssignIterator - Something is wrong with iterators and input streams. They work great in memory, but not here. That said, some of these are faster than others.
I have added every method suggested so far, including those in links. I would appreciate it if someone could run this on Windows and with other compilers. I currently do not have access to a machine with NTFS, and it has been noted that this and compiler details could be important.
As for measuring simplicity and idiomatic-ness, how do we measure these objectively? Simplicity seems doable, perhaps using something like LOC and cyclomatic complexity, but how idiomatic something is seems purely subjective.

What is a simplest, most idiomatic, best performing and standard
compliant way of reading a whole file into an std::string?
those are pretty much contradicting requests; improving one will most likely lessen another. the simplest code won't be the fastest, or the most idiomatic.
after exploring this area for a while I've come to some conclusions:
1) the biggest performance penalty is the IO action itself - the fewer IO actions taken, the faster the code
2) memory allocations are also quite expensive, but not as expensive as the IO
3) reading as binary is faster than reading as text
4) using the OS API will probably be faster than C++ streams
5) std::ios_base::sync_with_stdio doesn't really affect the performance, it's an urban legend.
using std::getline is probably not the best choice if performance is needed, for this reason: it will make N IO actions and N allocations for N lines.
A compromise which is fast, standard and elegant is to get the file size, allocate all the memory at once, then read the file in one go:
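std::ifstream fileReader(<your path here>, std::ios::binary | std::ios::ate);
if (fileReader) {
    auto fileSize = fileReader.tellg(); // opened at the end, so this is the file size
    fileReader.seekg(0, std::ios::beg); // offset 0 relative to the beginning
    std::string content(static_cast<std::size_t>(fileSize), '\0');
    fileReader.read(&content[0], fileSize);
}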
move the content around to prevent un-needed copies.
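As a minimal sketch, the same get-size, allocate-once, read-once idea can be wrapped in a function with some error checking. The name read_file_at_once and the exception-based error handling are assumptions made for illustration, not part of the answer above:

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (name chosen for this sketch): reads an entire file
// into a std::string with a single read() call.
std::string read_file_at_once(const std::string& path)
{
    // Binary mode, positioned at the end so tellg() reports the size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in)
        throw std::runtime_error("cannot open " + path);

    const std::streamoff size = in.tellg();
    if (size < 0)
        throw std::runtime_error("cannot determine size of " + path);

    // One allocation for the whole file.
    std::string content(static_cast<std::size_t>(size), '\0');
    in.seekg(0, std::ios::beg);
    if (size > 0)
        in.read(&content[0], static_cast<std::streamsize>(size));
    if (!in)
        throw std::runtime_error("short read on " + path);

    return content; // returned by value: moved or elided, not copied
}
```

Calling it as `std::string text = read_file_at_once("file.txt");` keeps the caller free of stream bookkeeping. The size reported by tellg() is only reliable in binary mode, which is why the sketch does not offer a text-mode option.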

This website has a good comparison on several different methods for doing that. The one I currently use is:
std::string read_sequence() {
std::ifstream f("sequence.fasta");
std::ostringstream ss;
ss << f.rdbuf();
return ss.str();
}
If the lines of your text file are separated by newlines, this will keep them. If you want to remove them (which is my case most of the time), you can just add a call to something such as
auto s = ss.str();
s.erase(std::remove_if(s.begin(), s.end(),
[](char c) { return c == '\n'; }), s.end());

There are two big difficulties with your question. First, the Standard doesn't mandate any particular implementation (yes, nearly everybody started with the same implementation; but they've been modifying it over time, and the optimal I/O code for NTFS, say, will be different than the optimal I/O code for ext4), so it is possible (although somewhat unlikely) for a particular approach to be fastest on one platform, but not another. Second, there's a little difficulty in defining "optimal"; I assume you mean "fastest," but that's not necessarily the case.
There are approaches that are idiomatic, and perfectly fine C++, but unlikely to give wonderful performance. If your goal is to end up with a single std::string, using std::getline(std::istream&, std::string&) is very likely to be slower than necessary. The std::getline() call has to look for the '\n', and you'll occasionally reallocate and copy the destination std::string. Even so, it's ridiculously simple, and easy to understand. That could be optimal from a maintenance perspective, assuming you don't need the absolute fastest performance possible. This will also be a good approach if you don't need the whole file in one giant std::string at one time. You'll be very frugal with memory.
An approach that is likely more efficient is to manipulate the read buffer:
std::string read_the_whole_file(std::istream& istr)
{
    std::ostringstream sstr;
    sstr << istr.rdbuf(); // copy everything from the input stream's buffer
    return sstr.str();
}
Personally, I'm just as likely to use std::fopen() and std::fread() (and std::unique_ptr<FILE>) because, on Windows at least, you'll get a better error message when std::fopen() fails than when constructing a file stream object fails. I consider the better error message an important factor when deciding which approach is optimal.
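A sketch of the std::fopen()/std::fread() route follows. Note that a bare std::unique_ptr<FILE> would call delete on the handle, so the sketch supplies a deleter that calls std::fclose. The names FileCloser, FilePtr and read_file_stdio are made up for illustration, and the use of errno after a failed std::fopen() assumes a platform that sets it (POSIX requires this; the C standard does not):

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <memory>
#include <stdexcept>
#include <string>

// Deleter so that std::unique_ptr closes the FILE* with std::fclose.
struct FileCloser {
    void operator()(std::FILE* f) const { std::fclose(f); }
};
using FilePtr = std::unique_ptr<std::FILE, FileCloser>;

// Hypothetical helper: reads a whole file through the C stdio API.
std::string read_file_stdio(const char* path)
{
    FilePtr file(std::fopen(path, "rb"));
    if (!file) {
        // Assumption: errno was set by fopen and describes the failure.
        throw std::runtime_error(std::string(path) + ": " + std::strerror(errno));
    }

    std::string content;
    char chunk[4096];
    std::size_t n;
    // fread returns the number of bytes read; 0 means end of file or error.
    while ((n = std::fread(chunk, 1, sizeof chunk, file.get())) > 0)
        content.append(chunk, n);

    if (std::ferror(file.get()))
        throw std::runtime_error(std::string(path) + ": read error");
    return content;
}
```

The exception text carries the system's reason for the failure (for example a missing file or a permissions problem), which is the kind of diagnostic this answer values.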

Related

A better replacement for istrstream?

istrstream was perfect for my needs - basically, take a fixed char buffer, and give me a simple way to extract lines getline() and test for eof()
I'm switching our projects to C++17 compliance - which has deprecated istrstream - apparently because there are too many C++ programmers who cannot fathom fixed buffer memory management (are you serious?!)
At any rate, the istringstream provides the same use semantics, but it imposes the need to now copy the entire fixed character buffer at construction time.
This is an anti-pattern.
What I am looking for is either a way to use a string_view in place of a string for the istringstream, or alternately a better replacement for stringstream which itself handles externally managed fixed buffer (it need only point into it, it never need worry about managing that resource, just as strstream did).
Currently, in VS 2017, this is illegal, and if I understand things correctly, is illegal everywhere in the current state-of-art of C++ (I'm sure you'll correct me if I'm wrong!)
std::string_view raw_view(reinterpret_cast<const char *>(raw_buffer.get()), raw_buffer.size());
std::istringstream raw_stream(raw_view);
So - ideas?
Note: Peter Sommerlad has a proposal for this exact idea here for the C++ standards body:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0448r1.pdf

Continue using istrstream for the time being. It likely won't be removed until either P0448 (using std::span<char> as the source/destination of a stream buffer) or P0408 (the ability to move data into/out of stringstreams) is adopted by the standard. Either of those would serve your needs well.
That being said, if all you're trying to do is get substrings between \ns, it would be far more efficient (even with the above proposals) to just use a regex search. Or just a regular search, since you're just looking for \n. That would give you a pair of iterators that represents a line. Using iostreams for line-by-line processing of an already-loaded character buffer is overkill and will never be as efficient as the alternative.
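A plain search for '\n' over an already-loaded buffer might look like the following sketch. It assumes C++17 (for std::string_view); split_lines is a hypothetical name, and it returns views rather than iterator pairs purely for convenience:

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: splits a buffer into lines without copying characters.
// Every returned view points into `buffer`, which must outlive the views.
std::vector<std::string_view> split_lines(std::string_view buffer)
{
    std::vector<std::string_view> lines;
    std::size_t start = 0;
    while (start < buffer.size()) {
        const std::size_t end = buffer.find('\n', start);
        if (end == std::string_view::npos) {
            lines.push_back(buffer.substr(start)); // last line, no trailing '\n'
            break;
        }
        lines.push_back(buffer.substr(start, end - start));
        start = end + 1; // continue after the newline
    }
    return lines;
}
```

For a fixed char buffer, `split_lines(std::string_view(ptr, size))` gives one view per line with no allocation beyond the vector itself; a caller that only needs one line at a time could run the same find loop inline and skip the vector as well.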

Fast file parsing C++

I am currently doing a project where I have to read a couple of large files. I would like to ask about some of the best practices for optimizing file parsing in C++.
After reading some benchmarks (example) regarding fread, ifstream, etc. I have decided to use ifstream for this purpose (if you believe there is a better way, please point out any improvements). The way I used it so far was like this:
std::ifstream input_file ("some_file.txt");
input_file.seekg (0, input_file.end);
int length = input_file.tellg(); // Get the size of the buffer
input_file.seekg (0, input_file.beg);
std::vector<char> buffer (length);
input_file.read(&buffer[0], length);
Then I would use stringstream to parse the file like this:
std::stringstream parser;
parser.rdbuf()->pubsetbuf(&buffer[0], length);
and continue parsing with stringstream parser.
The questions I have are as follows:
Does the above code copy the buffer to stringstream or are they sharing the same buffer? (I am not quite sure what pubsetbuf does or how efficient it is.)
Is there a better way to do this instead of using stringstream?
When we know the length of some irrelevant information, i.e. "irrelevant information, important information", and we wish to get the important information, we could do something like this:
std::string container;
parser.seekg(irrelevant_size, parser.cur); // irrelevant_size is the size
// of irrelevant data
std::getline(parser, container);
How efficient is this compared to doing
parser.get(temp_char_array, irrelevant_size + 1);
to collect all the irrelevant data?

pubsetbuf won't make a copy. See the following link for more details:
http://www.cplusplus.com/reference/streambuf/streambuf/pubsetbuf/
And seeking forward in a file is (much) faster than reading everything in between. Strictly speaking it's not required to be faster, but on all usual OSes it's pretty much constant time (not really, but not proportional to the seek length in any way). The difference may be small if only a few bytes are skipped, but it becomes more important with bigger distances.
Depending on how important a little more speed is, your OS has some faster (but OS-dependent) functions.
And whether there is a better way to parse depends on your data.
You should ask this in separate questions.
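For reference, the seek-then-getline variant from the question can be sketched as a small function. read_after_prefix is a hypothetical name, and the sketch works on any seekable std::istream; how it compares with get() into a scratch array on a given platform is something to measure rather than assume:

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: skips `irrelevant_size` characters by seeking,
// then returns the rest of the current line.
std::string read_after_prefix(std::istream& parser, std::streamoff irrelevant_size)
{
    // Move the read position forward; the skipped characters are never copied.
    parser.seekg(irrelevant_size, std::ios::cur);
    std::string container;
    std::getline(parser, container);
    return container;
}
```

With the question's example line, skipping the length of "irrelevant information, " leaves "important information" in the returned string.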

What is the most performant way to write a string to a file?

I have code which runs lots of loops updating a single string.
Finally I want that string to be stored in a file.
Currently I am printing that string to the console.
I can use an ofstream and write to a file instead of the console. The options I see are:
1) Instead of updating a string, write directly to the file stream.
2) Use a string stream instead, and finally copy that string stream to a file stream and write it to a file.
3) After the update of the string is complete, write it to a file stream all at once.
The std::string::max_size in my compiler is : 4294967257
And the maximum size of the string that I could generate is approximately half of the max_size of the compiler.
Note: I am using Solaris Unix.
What is the most performant way to write this string to a file?

There's only one way to know the answer. You have to profile it for your case. You can easily do this by measuring the timings how long it takes to generate the file.
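A minimal timing helper for that kind of measurement could look like the sketch below. time_ms is a hypothetical name; a single run is noisy, so in practice each candidate would be run several times and the results compared:

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in milliseconds.
template <typename Work>
double time_ms(Work&& work)
{
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Each of the approaches under consideration can be wrapped in a lambda and passed to time_ms, using the same data and the same target file system for every run.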

Consider all the scenarios and Benchmark the timings.
NOTE : The fastest would be the one closest to the memory.

Try to reuse the same std::string object as much as possible, using reserve and clear. The string will cache its memory allocation. Strings inside {} will make a new allocation each time you enter the block.
Be careful of hidden temporary string objects, for example a + b when a is std::string will create a temporary std::string object with new allocation. Prefer += to concatenate strings.
Use C code to perform conversions. Create a local char buffer and use sprintf etc. They are faster than stringstreams but it's easier to make a mistake so be careful.
Use "\n" instead of std::endl when writing to a file, as the latter causes a flush.
Avoid stringstreams like the plague. They are slow, at least in the Visual Studio implementation. I've tried them for hardcore text processing and I know.
Of course this assumes performance IS an issue for you. If you are doing light work, stringstreams could be the easiest solution. I prefer them when I am not doing hardcore work as there is much less chance of a bug.
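Several of these tips can be combined in one sketch: a single reserved std::string, conversions through a local char buffer, '\n' instead of std::endl, and one write at the end. build_report and write_all are hypothetical names, and the reserve estimate of 8 bytes per entry is an arbitrary assumption:

```cpp
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical example: builds "0\n1\n...\n" in one std::string.
std::string build_report(int count)
{
    std::string out;
    if (count > 0)
        out.reserve(static_cast<std::size_t>(count) * 8); // rough guess, limits reallocations
    char number[32]; // local buffer for the integer conversion
    for (int i = 0; i < count; ++i) {
        const int len = std::snprintf(number, sizeof number, "%d", i);
        out.append(number, static_cast<std::size_t>(len)); // append in place, no temporaries
        out += '\n'; // plain newline, no flush
    }
    return out;
}

// Writes the finished string with a single write() call.
bool write_all(const std::string& path, const std::string& data)
{
    std::ofstream file(path, std::ios::binary);
    file.write(data.data(), static_cast<std::streamsize>(data.size()));
    file.flush(); // surface write errors before reporting success
    return static_cast<bool>(file);
}
```

Whether this beats streaming straight into the ofstream depends on the platform and the data, so it is still worth timing both.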

What's the difference between istringstream, ostringstream and stringstream? / Why not use stringstream in every case?

When would I use std::istringstream, std::ostringstream and std::stringstream and why shouldn't I just use std::stringstream in every scenario (are there any runtime performance issues?).
Lastly, is there anything bad about this (instead of using a stream at all):
std::string stHehe("Hello ");
stHehe += "stackoverflow.com";
stHehe += "!";

Personally, I find it very rare that I want to perform streaming into and out of the same string stream.
Usually I want to either initialize a stream from a string and then parse it; or stream things to a string stream and then extract the result and store it.
If you're streaming to and from the same stream, you have to be very careful with the stream state and stream positions.
Using 'just' istringstream or ostringstream better expresses your intent and gives you some checking against silly mistakes such as accidental use of << vs >>.
There might be some performance improvement but I wouldn't be looking at that first.
There's nothing wrong with what you've written. If you find it doesn't perform well enough, then you could profile other approaches, otherwise stick with what's clearest. Personally, I'd just go for:
std::string stHehe( "Hello stackoverflow.com!" );
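To illustrate the two usual patterns mentioned in this answer (initialize from a string and parse it, or stream values in and extract the result), here is a small sketch. sum_ints and describe are hypothetical names chosen for the example:

```cpp
#include <sstream>
#include <string>

// Input only: parse whitespace-separated integers and return their sum.
int sum_ints(const std::string& text)
{
    std::istringstream in(text);
    int value = 0;
    int total = 0;
    while (in >> value) // stops at end of input or at the first non-integer
        total += value;
    return total;
}

// Output only: format mixed types into one string.
std::string describe(const std::string& name, int count)
{
    std::ostringstream out;
    out << name << " has " << count << " items";
    return out.str();
}
```

With these types, writing `in << value` or `out >> count` by mistake fails to compile, which is the checking the answer refers to.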

A stringstream is somewhat larger, and might have slightly lower performance -- multiple inheritance can require an adjustment to the vtable pointer. The main difference is (at least in theory) better expressing your intent, and preventing you from accidentally using >> where you intended << (or vice versa). OTOH, the difference is sufficiently small that especially for quick bits of demonstration code and such, I'm lazy and just use stringstream. I can't quite remember the last time I accidentally used << when I intended >>, so to me that bit of safety seems mostly theoretical (especially since if you do make such a mistake, it'll almost always be really obvious almost immediately).
Nothing at all wrong with just using a string, as long as it accomplishes what you want. If you're just putting strings together, it's easy and works fine. If you want to format other kinds of data though, a stringstream will support that, and a string mostly won't.

In most cases, you won't find yourself needing both input and output on the same stringstream, so using std::ostringstream and std::istringstream explicitly makes your intention clear. It also prevents you from accidentally typing the wrong operator (<< vs >>).
When you need to do both operations on the same stream you would obviously use the general purpose version.
Performance issues would be the least of your concerns here, clarity is the main advantage.
Finally, there's nothing wrong with using string append, as you have, to construct pure strings. You just can't use that to combine numbers like you can in languages such as Perl.

istringstream is for input, ostringstream for output. stringstream is input and output.
You can use stringstream pretty much everywhere.
However, if you give your object to another user, and they use operator >> whereas you were expecting a write-only object, you will not be happy ;-)
PS:
nothing bad about it, just performance issues.

std::ostringstream::str() creates a copy of the stream's content, which doubles memory usage in some situations. You can use std::stringstream and its rdbuf() function instead to avoid this.
More details here: how to write ostringstream directly to cout
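A minimal sketch of that rdbuf() route, assuming a std::stringstream as the source; flush_to is a hypothetical name:

```cpp
#include <ostream>
#include <sstream>

// Hypothetical helper: streams the buffered contents of `source` into `dest`
// without first building a std::string via str().
void flush_to(std::stringstream& source, std::ostream& dest)
{
    // operator<< on a streambuf pointer copies characters buffer to buffer.
    // Note: if `source` holds no characters, this sets failbit on `dest`.
    dest << source.rdbuf();
}
```

For example, after `std::stringstream ss; ss << "value=" << 42;`, calling `flush_to(ss, std::cout)` writes value=42 to standard output.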

To answer your third question: No, that's perfectly reasonable. The advantage of using streams is that you can enter any sort of value that's got an operator<< defined, while you can only add strings (either C++ or C) to a std::string.

Presumably when only insertion or only extraction is appropriate for your operation you could use one of the 'i' or 'o' prefixed versions to exclude the unwanted operation.
If that is not important then you can use the i/o version.
The string concatenation you're showing is perfectly valid. Although concatenation using stringstream is possible that is not the most useful feature of stringstreams, which is to be able to insert and extract POD and abstract data types.

Why open a file for read/write access if you only need to read from it, for example?
What if multiple processes needed to read from the same file?

Is there a 'catch' with FastFormat?

I just read about the FastFormat C++ i/o formatting library, and it seems too good to be true: Faster even than printf, typesafe, and with what I consider a pleasing interface:
// prints: "This formats the remaining arguments based on their order - in this case we put 1 before zero, followed by 1 again"
fastformat::fmt(std::cout, "This formats the remaining arguments based on their order - in this case we put {1} before {0}, followed by {1} again", "zero", 1);
// prints: "This writes each argument in the order, so first zero followed by 1"
fastformat::write(std::cout, "This writes each argument in the order, so first ", "zero", " followed by ", 1);
This looks almost too good to be true. Is there a catch? Have you had good, bad or indifferent experiences with it?

Is there a 'catch' with FastFormat?
Last time I checked, there was one annoying catch:
You can only use either the narrow string version or the wide string version of this library. (The functions for wchar_t and char are the same -- which type is used is a compile time switch.)
With iostreams, stdio or Boost.Format you can use both.

Found one "catch", though for most people it will never manifest. From the project page:
Atomic operation. It doesn't write out statement elements one at a time, like the IOStreams, so has no atomicity issues
The only way I can see this happening is if it buffers the whole write() call's output itself, then writes it out to the ostream in one step. This means it needs to allocate memory, and if an object passed into the write() call produces a lot of output (several megabytes or more), it can consume up to twice that much memory in internal buffers (assuming it uses the grow-a-buffer-by-doubling-its-size-each-time trick).
If you're just using it for logging, and not, say, dumping huge amounts of XML, you'll never see this problem.
The only other "catch" I'm seeing is:
Highly portable. It will work with all good modern C++ compilers; it even works with Visual C++ 6!
So it won't work with an old C++ compiler, like cfront, whereas iostreams is backward compatible to the late 80's. Again, I'd be surprised if anyone ever had a problem with this.

Although FastFormat is a good library there are a number of issues with it:
Limited formatting support, in particular the following features are not supported:
Leading zeros (or any other non-space padding)
Octal/hexadecimal encoding
Runtime width/alignment specification
The library is quite big for a relatively small task of formatting and has even bigger dependency (STLSoft).

It looks pretty interesting indeed! Good tip regardless, and +1 for that!
I've been playing with it for a bit. The main drawback I see is that FastFormat supports fewer formatting options for the output. This is, I think, a direct consequence of the way the higher type safety is achieved, and a good tradeoff depending on your circumstances.

If you look in detail at his performance benchmark page, you'll notice that good old C printf-family functions are still winning on Linux. In fact, the only test case where they perform poorly is the test case that should be static string concatenations, where I would expect printf to be wasteful. Moreover, GCC provides static type-checking on printf-style function calls, so the benefit of type-safety is reduced. So: if you are running on Linux and if you need the absolute best performance, FastFormat is probably not the optimal solution.
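The GCC checking mentioned here also extends to user-written printf-style wrappers through the format function attribute (a GCC/Clang extension). The sketch below is only an illustration: format_string and the PRINTF_LIKE macro are hypothetical names, and the 256-byte buffer is an arbitrary limit:

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>

// On GCC and Clang, ask the compiler to check the arguments against the
// format string; other compilers get an empty macro and no checking.
#if defined(__GNUC__)
#define PRINTF_LIKE(fmt_index, first_arg) __attribute__((format(printf, fmt_index, first_arg)))
#else
#define PRINTF_LIKE(fmt_index, first_arg)
#endif

std::string format_string(const char* fmt, ...) PRINTF_LIKE(1, 2);

// Hypothetical printf-style helper returning a std::string.
std::string format_string(const char* fmt, ...)
{
    char buffer[256]; // fixed size for the sketch; longer output is truncated
    va_list args;
    va_start(args, fmt);
    const int written = std::vsnprintf(buffer, sizeof buffer, fmt, args);
    va_end(args);
    if (written < 0)
        return std::string();
    const std::size_t len = static_cast<std::size_t>(written) < sizeof buffer
                                ? static_cast<std::size_t>(written)
                                : sizeof buffer - 1;
    return std::string(buffer, len);
}
```

With -Wformat enabled (it is part of -Wall), a call such as `format_string("%d", "text")` produces a compile-time warning on GCC, while `format_string("%d-%s", 7, "x")` compiles cleanly.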

The library depends on a couple of environment variables, as mentioned in the docs.
That might be no biggie to some people, but I'd prefer my code to be as self-contained as possible. If I check it out from source control, it should work and compile. It won't, if it requires you to set environment variables.
