Reading large strings in C++ -- is there a safe fast way? - c++

http://insanecoding.blogspot.co.uk/2011/11/how-to-read-in-file-in-c.html reviews a number of ways of reading an entire file into a string in C++. The key code for the fastest option looks like this:
std::string contents;
in.seekg(0, std::ios::end);
contents.resize(in.tellg());
in.seekg(0, std::ios::beg);
in.read(&contents[0], contents.size());
Unfortunately, this is not safe as it relies on the string being implemented in a particular way. If, for example, the implementation was sharing strings then modifying the data at &contents[0] could affect strings other than the one being read. (More generally, there's no guarantee that this won't trash arbitrary memory -- it's unlikely to happen in practice, but it's not good practice to rely on that.)
C++ and the STL are designed to provide features that are as efficient as C, so one would expect there to be a version of the above that is just as fast but guaranteed to be safe.
In the case of vector<T>, there are functions which can be used to access the raw data, which can be used to read a vector efficiently:
T* vector::data();
const T* vector::data() const;
The first of these can be used to read a vector<T> efficiently. Unfortunately, the string equivalent only provides the const variant:
const char* string::data() const noexcept;
So this cannot be used to read a string efficiently. (Presumably the non-const variant is omitted to support the shared string implementation.)
I have also checked the string constructors, but the ones that accept a char* copy the data -- there's no option to move it.
Is there a safe and fast way of reading the whole contents of a file into a string?
It may be worth noting that I want to read a string rather than a vector<char> so that I can access the resulting data using a istringstream. There's no equivalent of that for vector<char>.

If you really want to avoid copies, you can slurp the file into a std::vector<char>, and then roll your own std::basic_stringbuf to pull data from the vector.
You can then declare a std::istringstream and use std::basic_ios::rdbuf to replace the input buffer with your own one.
The caveat is that if you choose to call istringstream::str it will invoke std::basic_stringbuf::str and will require a copy. But then, it sounds like you won't be needing that function, and can actually stub it out.
Whether you get better performance this way would require actual measurement. But at least you avoid having to have two large contiguous memory blocks during the copy. Additionally, you could use something like std::deque as your underlying structure if you want to cope with truly huge files that cannot be allocated in contiguous memory.
It's also worth mentioning that if you're really just streaming that data you are essentially double-buffering by reading it into a string first. Unless you also require the contents in memory for some other purpose, the buffering inside std::ifstream is likely to be sufficient. If you do slurp the file, you may get a boost by turning buffering off.
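A minimal sketch of the vector-backed approach described above, under some assumptions: the names slurp and vector_streambuf are made up for illustration, and instead of subclassing std::basic_stringbuf it derives from plain std::streambuf and hands the buffer to a std::istream, which is a simpler variation of the same idea. The buffer only sets up a get area over the vector's storage, so nothing is copied; the vector must outlive the stream and must not be resized while the stream is in use.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <istream>
#include <streambuf>
#include <string>
#include <vector>

// Hypothetical helper: read a whole file into a vector<char>.
// Returns an empty vector if the file cannot be opened.
inline std::vector<char> slurp(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return {};
    const std::streamoff size = in.tellg();
    if (size <= 0) return {};
    std::vector<char> data(static_cast<std::size_t>(size));
    in.seekg(0, std::ios::beg);
    in.read(data.data(), static_cast<std::streamsize>(data.size()));
    // Keep only the bytes that were actually read.
    data.resize(static_cast<std::size_t>(in.gcount()));
    return data;
}

// Hypothetical read-only stream buffer that views the vector's storage.
// No data is copied; the default underflow() reports EOF at the end.
class vector_streambuf : public std::streambuf {
public:
    explicit vector_streambuf(std::vector<char>& data) {
        char* begin = data.data();
        setg(begin, begin, begin + data.size());
    }
};
```

Usage would look something like: std::vector<char> data = slurp("input.txt"); vector_streambuf buf(data); std::istream in(&buf); after which in can be read with operator>> or std::getline like any other input stream.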

I think using &string[0] is just fine, and it works with the widely used standard library implementations. (Before C++11 this was technically not guaranteed; C++11 and later require std::string to store its characters contiguously.)
But since you mention that you want to put the data into an istringstream, here's an alternative:
Read the data into a char array (new char[in.tellg()])
Construct a stringstream (without the leading 'i')
Insert the data with stringstream::write
The istringstream would have to copy the data anyway, because as far as I'm aware a std::stringstream doesn't store a std::string internally, so you can leave out the std::string and put the data into the stream directly.
EDIT: Actually, instead of the manual allocation (or make_unique), this way you could also use the vector<char> you mentioned.
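The steps above might be sketched as follows, using the vector<char> variant from the edit. The function name make_stream is an assumption for illustration. Note that the data is still copied once, into the stream's internal buffer.

```cpp
#include <ios>
#include <sstream>
#include <vector>

// Hypothetical helper: copy a raw buffer into a stringstream with write(),
// without going through an intermediate std::string.
inline std::stringstream make_stream(const std::vector<char>& data) {
    std::stringstream ss;
    ss.write(data.data(), static_cast<std::streamsize>(data.size()));
    return ss;  // streams are movable since C++11
}
```

The returned stream can then be read with operator>> or std::getline in the usual way.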

Related

What are the factors to use std::string vs std::stringstream? [duplicate]

I know one of the advantages of std::stringstream is that it is a std::ostream, so it can accept any type that defines operator<< for std::ostream, as well as primitive types.
I am not going to use operator<<; instead I am just going to concatenate many strings. Does the implementation of std::stringstream make it faster than std::string for concatenating many strings?
There's no reason to expect std::string's appending functions to be slower than stringstream's insertion functions. std::string will generally be nothing more than a possible memory allocation/copy plus copying of the data into the memory. stringstream has to deal with things like locales, etc, even for basic write calls.
Also, std::string provides ways to minimize or eliminate anything but the first memory allocation. If you reserve sufficient space, every insertion is little more than a memcpy. That's not really possible with stringstream.
Even if it were faster than std::string's appending functions, you still have to copy the string out of the stringstream to do something with it. So that's another allocation + copy, which you won't need with std::string. Though at least C++20 looks set to remove that particular need.
You should use std::stringstream if you need formatting, not just for sticking some strings together.
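As a rough illustration of the reserve point, here is a sketch of concatenating many strings with a single up-front allocation. The function name join is made up for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: concatenate all parts into one string.
inline std::string join(const std::vector<std::string>& parts) {
    // First pass: compute the final length so we can reserve once.
    std::size_t total = 0;
    for (const auto& p : parts) total += p.size();

    std::string out;
    out.reserve(total);
    // Second pass: each append copies into already reserved storage.
    for (const auto& p : parts) out += p;
    return out;
}
```

Whether the extra pass to compute the size pays off depends on the number and size of the parts, so it is worth measuring.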

Is it recommended to std::move a string into containers that is going to be overwritten?

I have the following code
std::vector<std::string> lines;
std::string currentLine;
while(std::getline(std::cin, currentLine)) {
// // option 1
// lines.push_back(std::move(currentLine));
// // option 2
// lines.push_back(currentLine);
}
I see different costs for the two
The first approach will clear currentLine, making the getline need to allocate a new buffer for the string. But it will use the buffer for the vector instead.
The second approach will make getline be able to reuse the buffer, and require a new buffer allocation for the in-vector string.
In such situations, is there a "better" way? Can the compiler optimize the one or other approach more efficiently? Or are there clever string implementations that make one option way more performant than the other?
Given the prevalence of the short string optimization, my immediate guess is that in many cases none of this will make any difference at all -- with SSO, a move ends up copying the contained data anyway (even if the source is an rvalue so it's eligible as the source for a move).
Between the two you've given, I think I'd tend to favor the non-moving version, but I doubt it's going to make a big difference either way. Given that (most of the time) you're going to be re-using the source immediately after the move, I doubt that moving is really going to do a lot of good (even at best). Assuming SSO isn't involved, your choice is between creating a new string in the vector to hold a copy of the string you read, or moving from the string you read and (in essence) creating a new string to hold the next line in the next iteration. Either way, the expensive part (allocating a buffer to hold the string and copying data into that buffer) is going to be pretty much the same.
As far as: "is there a better way" goes, I can think of at least a couple possibilities. The most obvious would be to memory map the file, then walk through that buffer, find the ends of lines, and use emplace_back to create strings in the vector directly from the data in the buffer, with no intermediate strings at all.
That does have the minor disadvantage of memory mapping not being standardized -- if you can't live with that level of non-portability, you can read the whole file into a buffer instead of memory mapping.
The next possibility after that would be to create a class with an interface like a const string's, that just maintains a pointer to the data in the big buffer instead of making a copy of it (e.g., Clang uses something like this). This will typically reduce total allocation, heap fragmentation, etc., but if you (for example) need to modify the strings afterward, it's unlikely to be of much (if any) use.
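A sketch of that last idea, assuming C++17 is available so that std::string_view can play the role of the "pointer into the big buffer" class. The function name split_lines is hypothetical. Each element refers into the original buffer, so the buffer must outlive the returned views.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: split a buffer into lines without copying any
// characters. A trailing newline does not produce an extra empty line.
inline std::vector<std::string_view> split_lines(std::string_view buffer) {
    std::vector<std::string_view> lines;
    std::size_t start = 0;
    while (start < buffer.size()) {
        std::size_t end = buffer.find('\n', start);
        if (end == std::string_view::npos) end = buffer.size();
        lines.push_back(buffer.substr(start, end - start));
        start = end + 1;  // skip past the newline
    }
    return lines;
}
```

If owning strings are needed instead, the same loop could call emplace_back on a std::vector<std::string> with the pointer and length of each line.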

Is there a way to pass ownership of an existing char* in heap to a std::string? [duplicate]

I have a situation where I need to process large (many GB's) amounts of data as such:
build a large string by appending many smaller (C char*) strings
trim the string
convert the string into a C++ const std::string for processing (read only)
repeat
The data in each iteration are independent.
My question is, I'd like to minimise (if possible eliminate) heap allocated memory usage, as it at the moment is my largest performance problem.
Is there a way to convert a C string (char*) into a stl C++ string (std::string) without requiring std::string to internally alloc/copy the data?
Alternatively, could I use stringstreams or something similar to re-use a large buffer?
Edit: Thanks for the answers, for clarity, I think a revised question would be:
How can I build (via multiple appends) an STL C++ string efficiently? And if I perform this action in a loop, where each iteration is totally independent, how can I re-use the allocated space?
You can't actually form a std::string without copying the data. A stringstream would probably reuse the memory from pass to pass (though I think the standard is silent on whether it actually has to), but it still wouldn't avoid the copying.
A common approach to this sort of problem is to write the code which processes the data in step 3 to use a begin/end iterator pair; then it can easily process either a std::string, a vector of chars, a pair of raw pointers, etc. Unlike passing it a container type like std::string, it would no longer know or care how the memory was allocated, since it would still belong to the caller. Carrying this idea to its logical conclusion is boost::range, which adds all the overloaded constructors to still let the caller just pass a string/vector/list/any sort of container with .begin() and .end(), or separate iterators.
Having written your processing code to work on an arbitrary iterator range, you could then even write a custom iterator (not as hard as it sounds, basically just an object with some standard typedefs, and operator ++/*/=/==/!= overloaded to get a forward-only iterator) that takes care of advancing to the next fragment each time it hits the end of the one it's working on, skipping over whitespace (I assume that's what you meant by trim). That way you never have to assemble the whole string contiguously at all. Whether or not this would be a win depends on how many fragments you have and how large they are. This is essentially what the SGI rope mentioned by Martin York is: a string where append forms a linked list of fragments instead of a contiguous buffer, which is thus suitable for much longer values.
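To make the iterator-pair idea concrete, here is a small sketch. The function count_non_space is only a stand-in for the real processing step; the point is that it neither knows nor cares who owns the characters.

```cpp
#include <cctype>
#include <cstddef>

// Hypothetical processing step written against an iterator range.
// Works with std::string iterators, vector<char> iterators, raw
// pointers, or a custom fragment-walking iterator.
template <typename It>
std::size_t count_non_space(It first, It last) {
    std::size_t n = 0;
    for (; first != last; ++first) {
        // Cast to unsigned char: std::isspace requires a non-negative value.
        if (!std::isspace(static_cast<unsigned char>(*first))) ++n;
    }
    return n;
}
```

With a raw buffer this could be called as count_non_space(ptr, ptr + len), and with a std::string as count_non_space(s.begin(), s.end()); neither call copies the data.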
UPDATE (since I still see occasional upvotes on this answer):
C++17 introduces another choice: std::string_view, which has replaced std::string in many function signatures, is a non-owning reference to character data. It is implicitly convertible from std::string, but can also be explicitly constructed from contiguous data owned somewhere else, avoiding the unnecessary copying std::string imposes.
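For example, the trim step from the question could be done on a view with no allocation at all. This is only a sketch; the function name trim and the chosen whitespace set are assumptions.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: return a view of s with leading and trailing
// whitespace removed. No characters are copied; the result points into
// the same storage as s.
inline std::string_view trim(std::string_view s) {
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string_view::npos) return {};  // all whitespace
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}
```

A view over an existing char* buffer can be built with std::string_view(ptr, len); the buffer has to stay alive for as long as the view is used.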
Is it at all possible to use a C++ string in step 1? If you use string::reserve(size_t), you can allocate a large enough buffer to prevent multiple heap allocations while appending the smaller strings, and then you can just use that same C++ string throughout all of the remaining steps.
See this link for more information on the reserve function.
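A sketch of what that might look like when the same string is reused across iterations. The function name process_batches and the size-summing "processing" are placeholders. It assumes that clear() keeps the existing capacity, which the common implementations do, although the standard does not strictly promise it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical driver: build one string per batch from C-string fragments,
// reusing a single buffer. Returns the total number of characters seen,
// as a stand-in for real processing.
inline std::size_t process_batches(
        const std::vector<std::vector<const char*>>& batches,
        std::size_t expected_size) {
    std::string buffer;
    buffer.reserve(expected_size);  // one allocation up front
    std::size_t total = 0;
    for (const auto& fragments : batches) {
        buffer.clear();  // drops the contents; capacity is normally kept
        for (const char* f : fragments) buffer.append(f);
        const std::string& view = buffer;  // read-only access for processing
        total += view.size();
    }
    return total;
}
```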
To help with really big strings SGI has the class Rope in its STL.
Non-standard, but may be useful.
http://www.sgi.com/tech/stl/Rope.html
Apparently rope is in the next version of the standard :-)
Note the developer joke. A rope is a big string. (Ha Ha) :-)
This is a lateral thinking answer, not directly addressing the question but "thinking" around it. Might be useful, might not...
Readonly processing of std::string doesn't really require a very complex subset of std::string's features. Is there a possibility that you could do search/replace on the code that performs all the processing on std::strings so it takes some other type instead? Start with a blank class:
class lightweight_string { };
Then replace all std::string references with lightweight_string. Perform a compilation to find out exactly what operations are needed on lightweight_string for it to act as a drop-in replacement. Then you can make your implementation work however you want.
Is each iteration independent enough that you can use the same std::string for each iteration? One would hope that your std::string implementation is smart enough to re-use memory if you assign a const char * to it when it was previously used for something else.
Assigning a char * into a std::string must always at least copy the data. Memory management is one of the main reasons to use std::string, so you won't be able to override it.
In this case, might it be better to process the char* directly, instead of assigning it to a std::string?
