Optimize file reading C++

Optimize file reading C++ - c++

string str1, str2;
vector<string> vec;
ifstream infile;
infile.open("myfile.txt");
while (! infile.eof() )
{
getline(infile,str1);
istringstream is;
is >> str1;
while (is >> str2)
{
vec.push_back(str2);
}
}
What the code does is read the string from the file and stores it into a vector.
Performance needs to be given priority. How can i optimize this code, to make the reading performance faster?

As others have already pointed out (see for example herohuyongtao's answer), the loop condition and how you put str1 into the istringstream must be fixed.
However, there is an important issue here that everybody has missed so far: you don't need the istringstream at all!
vec.reserve(the_number_of_words_you_exptect_at_least);
while (infile >> str1) {
vec.push_back(str1);
}
It gets rid of the inner loop which you didn't need in the first place, and doesn't create an istringstream in each iteration.
If you need to parse further each line and you do need an istringstream, create it outside of the loop and set its string buffer via istringstream::str(const string& s).
I can easily imagine your loops being very slow: Heap allocation on Windows is outrageously slow (compared to Linux); I got bitten once.
Andrei Alexandrescu presents (in some sense) a similar example in his talk Writing Quick Code in C++, Quickly. The surprising thing is that doing unnecessary heap allocations in a tight loop like the one above can be slower than the actual file IO. I was surprised to see that.
You didn't tag your question as C++11 but here is what I would do in C++11.
while (infile >> str1) {
vec.emplace_back(std::move(str1));
}
This move constructs the string at the back of the vector, without copying in. We can do it because we don't need the contents of str1 after we have put it into the vector. In other words, there is no need to copy it into a brand new string at the back of the vector, it is sufficient to just move its contents there. The first loop with the vec.push_back(str1); may potentially copy the contents of str1 which is really unnecessary.
The string implementation in gcc 4.7.2 is currently copy on write, so the two loops have identical performance; it doesn't matter which one you use. For now.
Unfortunately, copy on write strings are now forbidden by the standard. I don't know when the gcc developers are going to change the implementation. If the implementation changes, it may make a difference in performance whether you move (emplace_back(std::move(s))) or you copy (push_back(s)).
If C++98 compatibility is important for you, then go with push_back(). Even if the worst thing happens in the future and your string is copied (it isn't copied now), that copy can be turned into a memmove() / memcpy() which is blazing fast, most likely faster than reading the contents of the file from the hard disk so file IO will most likely remain the bottleneck.

Before any optimization, you need to change
while (! infile.eof() ) // problem 1
{
getline(infile,str1);
istringstream is;
is >> str1; // problem 2
while (is >> str2){
vec.push_back(str2);
}
}
to
while ( getline(infile,str1) ) // 1. don't use eof() in a while-condition
{
istringstream is(str1); // 2. put str1 to istringstream
while (is >> str2){
vec.push_back(str2);
}
}
to make it work as you expected.
P.S. For the optimization part, you don't need to think too much on it unless it becomes a bottleneck. Premature optimization is the root of all evil. However, if you do want to speed it up, check out #Ali's answer for further info.

Loop condition is wrong. Not a performance issue. Assuming this IO loop is indeed your application's bottleneck. But even if not, it can be a good educational exercise or just a weekend fun.
You have quite a few temporaries and cases of dynamic memory allocation in the loop.
Calling std::vector::reserve() in front of the loop will improve it a bit. Reallocating it manually to emulate x1.2 grow factor opposed to 2x after some size will help as well. std::list may though be more appropriate if the file size is unpredictable.
Using std::istringstream as a tokenizer is very inoptimal. Switching to iterator-based "view" tokenizer (Boost has one) should improve the speed a lot.
If you need it to be very fast and have enough RAM you can memory map the file before reading it. Boost::iostreams can let you get there quick. Generally, though, without Boost you can be twice as fast (Boost is not bad but it has to be generic and work on a dozen of compilers this is why).
If you are a blessed person using Unix/Linux as your development environment run your program under valgrind --tool=cachegrind and you will see all problematic places and how bad they are relative to one other. Also, valgrind --tool=massif will let you identify nubmerous small heap allocated objects which is generally not tolerable in the high performance code.

The fastest, though not fully portable, approach is to load the file into a memory mapped region (see wiki mmap.
Given that you know the size of the file, you now can define forward iterators (possibly pointer to const char) on that memory region which you can use to find the tokens which separate your file into "strings".
Essentially, you repeatedly get a pair of pointers pointing to the first character respectively the end of each "string". From that pair of iterators create your std::string.
This approach has subtle issues though:
You need to take care of the character encoding of the file, possibly convert from this character encoding to your desired encoding which is used by your std::string (presumable UTF-8).
The "token" to separate strings (usually \n, may be platform dependent, or may depend on which program created the file.

Related

Reading large strings in C++ -- is there a safe fast way?

http://insanecoding.blogspot.co.uk/2011/11/how-to-read-in-file-in-c.html reviews a number of ways of reading an entire file into a string in C++. The key code for the fastest option looks like this:
std::string contents;
in.seekg(0, std::ios::end);
contents.resize(in.tellg());
in.seekg(0, std::ios::beg);
in.read(&contents[0], contents.size());
Unfortunately, this is not safe as it relies on the string being implemented in a particular way. If, for example, the implementation was sharing strings then modifying the data at &contents[0] could affect strings other than the one being read. (More generally, there's no guarantee that this won't trash arbitrary memory -- it's unlikely to happen in practice, but it's not good practice to rely on that.)
C++ and the STL are designed to provide features that are efficient as C, so one would expect there to be a version of the above that was just as fast but guaranteed to be safe.
In the case of vector<T>, there are functions which can be used to access the raw data, which can be used to read a vector efficiently:
T* vector::data();
const T* vector::data() const;
The first of these can be used to read a vector<T> efficiently. Unfortunately, the string equivalent only provides the const variant:
const char* string::data() const noexcept;
So this cannot be used to read a string efficiently. (Presumably the non-const variant is omitted to support the shared string implementation.)
I have also checked the string constructors, but the ones that accept a char* copy the data -- there's no option to move it.
Is there a safe and fast way of reading the whole contents of a file into a string?
It may be worth noting that I want to read a string rather than a vector<char> so that I can access the resulting data using a istringstream. There's no equivalent of that for vector<char>.

If you really want to avoid copies, you can slurp the file into a std::vector<char>, and then roll your own std::basic_stringbuf to pull data from the vector.
You can then declare a std::istringstream and use std::basic_ios::rdbuf to replace the input buffer with your own one.
The caveat is that if you choose to call istringstream::str it will invoke std::basic_stringbuf::str and will require a copy. But then, it sounds like you won't be needing that function, and can actually stub it out.
Whether you get better performance this way would require actual measurement. But at least you avoid having to have two large contiguous memory blocks during the copy. Additionally, you could use something like std::deque as your underlying structure if you want to cope with truly huge files that cannot be allocated in contiguous memory.
It's also worth mentioning that if you're really just streaming that data you are essentially double-buffering by reading it into a string first. Unless you also require the contents in memory for some other purpose, the buffering inside std::ifstream is likely to be sufficient. If you do slurp the file, you may get a boost by turning buffering off.

I think using &string[0] is just fine, and it should work with the widely used standard library implementations (even if it is technically UB).
But since you mention that you want to put the data into an istringstream, here's an alternative:
Read the data into a char array (new char[in.tellg()])
Construct a stringstream (without the leading 'i')
Insert the data with stringstream::write
The istringstream would have to copy the data anyway, because a std::stringstream doesn't store a std::string internally as far as I'm aware, so you can leave the std::string away and put the data into it directly.
EDIT: Actually, instead of the manual allocation (or make_unique), this way you could also use the vector<char> you mentioned.

Is it recommended to std::move a string into containers that is going to be overwritten?

I have the following code
std::vector<std::string> lines;
std::string currentLine;
while(std::getline(std::cin, currentLine)) {
// // option 1
// lines.push_back(std::move(currentLine));
// // option 2
// lines.push_back(currentLine);
}
I see different costs for the two
The first approach will clear currentLine, making the getline need to allocate a new buffer for the string. But it will use the buffer for the vector instead.
The second approach will make getline be able to reuse the buffer, and require a new buffer allocation for the in-vector string.
In such situations, is there a "better" way? Can the compiler optimize the one or other approach more efficiently? Or are there clever string implementations that make one option way more performant than the other?

Given the prevalence of the short string optimization, my immediate guess is that in many cases none of this will make any difference at all -- with SSO, a move ends up copying the contained data anyway (even if the source is an rvalue so it's eligible as the source for a move).
Between the two you've given, I think I'd tend to favor the non-moving version, but I doubt it's going to make a big difference either way. Given that (most of the time) you're going to be re-using the source immediately after the move, I doubt that moving is really going to do a lot of good (even at best). Assuming SSO isn't involved, your choice is being creating a new string in the vector to hold a copy of the string you read, or move from the string you read and (in essence) create a new string to hold the next line in the next iteration. Either way, the expensive part (allocating a buffer to hold the string, copy data into that buffer) is going to be pretty much the same either way.
As far as: "is there a better way" goes, I can think of at least a couple possibilities. The most obvious would be to memory map the file, then walk through that buffer, find the ends of lines, and use emplace_back to create strings in the vector directly from the data in the buffer, with no intermediate strings at all.
That does have the minor disadvantage of memory mapping not being standardized -- if you can't live with that level of non-portability, you can read the whole file into a buffer instead of memory mapping.
The next possibility after that would be to create a class with an interface like a const string's, that just maintains a pointer to the data in the big buffer instead of making a copy of it (e.g., CLang uses something like this). This will typically reduce total allocation, heap fragmentation, etc., but if you (for example) need to modify the strings afterward, it's unlikely to be of much (if any) use.

Is using strings this way inefficient?

I am a newbee to c++ and am running into problems with my teacher using strings in my code. Though it is clear to me that I have to stop doing that in her class, I am curious as to why it is wrong. In this program the five strings I assigned were going to be reused no less than 4 to 5 times, therefore I put the text into strings. I was told to stop doing it as it is inefficient. Why? In c++ are textual strings supposed to be typed out as opposed to being stored into strings, and if so why? Below is some of the program, please tell me why it is bad.
string Bry = "berries";
string Veg = "vegetables";
string Flr = "flowers";
string AllStr;
float Tmp1, Precip;
int Tmp, FlrW, VegW, BryW, x, Selct;
bool Cont = true;
AllStr = Flr + ", " + Bry + ", " + "and " + Veg;

Answering whether using strings is inefficient is really something that very much depends on how you're using them.
First off, I would argue that you should be using C++ strings as a default - only going to raw C strings if you actually measure and find C++ strings to be too slow. The advantages (primarily for security) are just too great - it's all too easy to screw up buffer management with raw C strings. So I would disagree with your teacher that this is overly inefficient.
That said, it's important to understand the performance implications of using C++ strings. Since they are always dynamically allocated, you may end up spending a lot of time copying and reallocating buffers. This is usually not a problem; usually there are other things which take up much more time. However, if you're doing this right in the middle of a loop that's critical to your program's performance, you may need to find another method.
In short, premature optimization is usually a bad idea. Write code that is obviously correct, even if it takes ever-so-slightly longer to run. But be aware of the costs and trade-offs you're making at the same time; that way, if it turns out that C++ strings are actually slowing your program down a lot, you'll know what to change to fix that.

Yes, it's fairly inefficient, for following reasons:
When you construct a std::string object, it has to allocate a storage space for the string content (which may or may not be a separate dynamic memory allocation, depending on whether small-string optimization is in effect) and copy the literal string that is parameter of the constructor. For example, when you say: string Bry = "berries" it allocates a separate memory block (potentially from the dynamic memory), then copies "berries" to that block.
So you potentially have an extra dynamic memory allocation (costing time),
have to perform the copy (costing more time),
and end-up with 2 copies of the same string (costing space).
Using std::string::operator+ produces a new string that is the result of concatenation. So when you write several + operators in a row, you have several temporary concatenation results and a lot of unnecessary copying.
For your example, I recommend:
Using string literals unless you actually need the functionality only available in std::string.
Using std::stringstream to concatenate several strings together.
Normally, code readability is preferred over micro-optimizations of this sort, but luckily you can have both performance and readability in this case.

Your teacher is both right and wrong. S/he's right that building up strings from substrings at runtime is less CPU-efficient than simply providing the fully pre-built string in the code to start with -- but s/he's wrong in thinking that efficiency is necessarily an important factor to worry about in this case.
In a lot of cases, efficiency simply doesn't matter. At all. For example, if your code above is only going to be executed rarely (e.g. no more than once per second), then it's going to be literally impossible to measure any difference between the "most efficient version" and your not-so-efficient version. Given that, it's quite justifiable to decide that other factors (such as code readability and maintainability) are more important than maximizing efficiency.
Of course, if your program is going to be reconstructing these strings thousands or millions of times per second, then making sure your code is maximally efficient, even at the expense of readability/maintainability, is a good tradeoff to make. But I doubt that is the case here.

Your approach is almost perfect - try and declare everything only once. But if it is not used more than once - dont wast you fingers typing it :-) ie a 10 line program
The only change I would suggest is to make the strings const to help the compiler optimize you program.
If you instructor still disagrees - get a new instructor.

it is inefficient. doing that last line right would be 4-5 times faster.
at the very least you should use +=
+= means that you would avoid creating new strings with the + operator.
The instructor knows that when you do a string = string + string C++ creates a new string that is immediately destroyed.

Efficiency is probably not is good argument to not use string in school assignments but yes, if I am a teacher and the topic is not about some very high level applications, I don't want my students using string.
The real reason is string hides the low level memory management. A student coming out of college should have the basic memory management skill. Though nowadays in working environment, programmers don't deal with the memory management in most of the time but there are always situations where you need to understand what's happening under the hood to be able to reason the problem you are encountering.

With the context given, it looks like you should just be able to declare AllString as a const or string literal without all the substrings and addition. Assuming there's more to it, declaring them as literal string objects allocates memory at runtime. (And, not that there is any practical impact here, but you should be aware that stl container objects sometimes allocate a default minimum of space that is larger than the number of things initially in it, as part of its optimizations in anticipation of later modifying operations. I'm not sure if std::string does so on an declare/assign or not.) If you are only ever going to use them as literals, declaring them as a const char* or in a #define is easier on both memory and runtime performance, and you can still use them as r-values in string operations. If you are using them other ways in code you are not showing us, then its up to whether they need to ever be changed or manipulated as to whether they need to be strings or not.
If you are trying to learn coding, inefficiencies that don't matter in practice are still things you should be aware of and avoid if unnecessary. In production code, there are sometimes reasons to do something like this, but if it is not for any good reason, it's just sloppy. She's right to point it out, but what she should be doing is using that as a starting point for a conversation about the various tradeoffs involved - memory, speed, readability, maintainability, etc. If she's a teacher, she should be looking for "teaching moments" like this rather than just an opportunity to scold.

you can use string.append() ;
its better than + or +=

NULL terminated string and its length

I have a legacy code that receives some proprietary, parses it and creates a bunch of static char arrays (embedded in class representing the message), to represent NULL strings. Afterwards pointers to the string are passed all around and finally serialized to some buffer.
Profiling shows that str*() methods take a lot of time.
Therefore I would like to use memcpy() whether it's possible. To achive it I need a way to associate length with pointer to NULL terminating string. I though about:
Using std::string looks less efficient, since it requires memory allocation and thread synchronization.
I can use std::pair<pointer to string, length>. But in this case I need to maintain length "manually".
What do you think?

use std::string

Profiling shows that str*() methods
take a lot of time
Sure they do ... operating on any array takes a lot of time.
Therefore I would like to use memcpy()
whether it's possible. To achive it I
need a way to associate length with
pointer to NULL terminating string. I
though about:
memcpy is not really any slower than strcpy. In fact if you perform a strlen to identify how much you are going to memcpy then strcpy is almost certainly faster.
Using std::string looks less
efficient, since it requires memory
allocation and thread synchronization
It may look less efficient but there are a lot of better minds than yours or mine that have worked on it
I can use std::pair. But in this case I need to
maintain length "manually".
thats one way to save yourself time on the length calculation. Obviously you need to maintain the length manually. This is how windows BSTRs work, effectively (though the length is stored immediately prior, in memory, to the actual string data). std::string. for example, already does this ...
What do you think?
I think your question is asked terribly. There is no real question asked which makes answering next to impossible. I advise you actually ask specific questions in the future.

Use std::string. It's an advice already given, but let me explain why:
One, it uses a custom memory allocation scheme. Your char* strings are probably malloc'ed. That means they are worst-case aligned, which really isn't needed for a char[]. std::string doesn't suffer from needless alignment. Furthermore, common implementatios of std::string use the "Small String Optimization" which eliminates a heap allocation altogether, and improves locality of reference. The string size will be on the same cache line as the char[] itself.
Two, it keeps the string length, which is indeed a speed optimization. Most str* functions are slower because they don't have this information up front.
A second option would be a rope class, e.g. from SGI. This be more efficient by eliminating some string copies.

Your post doesn't explain where the str*() function calls are coming from; passing around char * certainly doesn't invoke them. Identify the sites that actually do the string manipulation and then try to find out if they're doing so inefficiently. One common pitfall is that strcat first needs to scan the destination string for the terminating 0 character. If you call strcat several times in a row, you can end up with a O(N^2) algorithm, so be careful about this.
Replacing strcpy by memcpy doesn't make any significant difference; strcpy doesn't do an extra pass to find the length of the string, it's simply (conceptually!) a character-by-character copy that stops when it encounters the terminating 0. This is not much more expensive than memcpy, and always cheaper than strlen followed by memcpy.
The way to gain performance on string operations is to avoid copies where possible; don't worry about making the copying faster, instead try to copy less! And this holds for all string (and array) implementations, whether it be char *, std::string, std::vector<char>, or some custom string / array class.

What do I think? I think that you should do what everyone else obsessed with pre-optimization does. You should find the most obscure, unmaintainable, yet intuitively (to you anyway) high-performance way you can and do it that way. Sounds like you're onto something with your pair<char*,len> with malloc/memcpy idea there.
Whatever you do, do NOT use pre-existing, optimized wheels that make maintenence easier. Being maintainable is simply the least important thing imaginable when you're obsessed with intuitively measured performance gains. Further, as you well know, you're quite a bit smarter than those who wrote your compiler and its standard library implementation. So much so that you'd be seriously silly to trust their judgment on anything; you should really consider rewriting the entire thing yourself because it would perform better.
And ... the very LAST thing you'll want to do is use a profiler to test your intuition. That would be too scientific and methodical, and we all know that science is a bunch of bunk that's never gotten us anything; we also know that personal intuition and revelation is never, ever wrong. Why waste the time measuring with an objective tool when you've already intuitively grasped the situation's seemingliness?
Keep in mind that I'm being 100% honest in my opinion here. I don't have a sarcastic bone in my body.

Should I preallocate std::stringstream?

I use std::stringstream extensively to construct strings and error messages in my application. The stringstreams are usually very short life automatic variables.
Will such usage cause heap reallocation for every variable? Should I switch from temporary to class-member stringstream variable?
In latter case, how can I reserve stringstream buffer? (Should I initialize it with a large enough string or is there a more elegant method?)

Have you profiled your execution, and found them to be a source of slow down?
Consider their usage. Are they mostly for error messages outside the normal flow of your code?
As far as reserving space...
Some implementations probably reserve a small buffer before any allocation takes place for the stringstream. Many implementations of std::string do this.
Another option might be (untested!)
std::string str;
str.reserve(50);
std::stringstream sstr(str);
You might find some more ideas in this gamedev thread.
edit:
Mucking around with the stringstream's rdbuf might also be a solution. This approach is probably Very Easy To Get Wrong though, so please be sure it's absolutely necessary. Definitely not elegant or concise.

Although "mucking around with the stringstream's rdbuf...is probably Very Easy To Get Wrong", I went ahead and hacked together a proof-of-concept anyway for fun, as it has always bugged me that there is no easy way to reserve storage for stringstream. Again, as #luke said, you are probably better off optimizing what your profiler tells you needs optimizing, so this is just to address "What if I want to do it anyway?".
Instead of mucking around with stringstream's rdbuf, I made my own, which does pretty much the same thing. It implements only the minimum, and uses a string as a buffer. Don't ask me why I called it a VECTOR_output_stream. This is just a quickly-hacked-together thing.
constexpr auto preallocated_size = 256;
auto stream = vector_output_stream(preallocated_size);
stream << "My parrot ate " << 3 << " cookies.";
cout << stream.str() << endl;

The Bad
This is an old question, but even as of C++1z/C++2a in Visual Studio 2019, stringstream has no ideal way of reserving a buffer.
The other answers to this question do not work at all and for the following reasons:
calling reserve on an empty string yields an empty string, so stringstream constructor doesn't need to allocate to copy the contents of that string.
seekp on a stringstream still seems to be undefined behavior and/or does nothing.
The Good
This code segment works as expected, with ss being preallocated with the requested size.
std::string dummy(reserve, '\0');
std::stringstream ss(dummy);
dummy.clear();
dummy.shrink_to_fit();
The code can also be written as a one-liner std::stringstream ss(std::string(reserve, '\0'));.
The Ugly
What really happens in this code segment is the following:
dummy is preallocated with the reserve, and the buffer is subsequently filled with null bytes (required for the constructor).
stringstream is constructed with dummy. This copies the entire string's contents into an internal buffer, which is preallocated.
dummy is then cleared and then erased, freeing up its allocation.
This means that in order to preallocate a stringstream, two allocations, one fill, and one copy takes place. The worst part is that during the expression, twice as much memory is needed for the desired allocation. Yikes!
For most use cases, this might not matter at all and it's OK to take the extra fill and copy hit to have fewer reallocations.

I'm not sure, but I suspect that stringbuf of stringstream is tightly related with resulted string. So I suspect that you can use ss.seekp(reserved-1); ss.put('\0'); to reserve reserved amount of bytes inside of underlying string of ss. Actually I'd like to see something like ss.seekp(reserved); ss.trunc();, but there is no trunc() method for streams.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Optimize file reading C++ - c++

Related

Reading large strings in C++ -- is there a safe fast way?

Is it recommended to std::move a string into containers that is going to be overwritten?

Is using strings this way inefficient?

NULL terminated string and its length

Should I preallocate std::stringstream?

Categories

Resources