std::stringstream vs std::string for concatenating many strings - c++

I know one of the advantages of std::stringstream is that it is a std::istream so it may accept input from any type that defines operator<< to std::istream, and also from primitives types.
I am not going to use operator<<; instead I am just going to concatenate many strings. Does the implementation of std::stringstream make it faster than std::string for concatenating many strings?

There's no reason to expect std::string's appending functions to be slower than stringstream's insertion functions. std::string will generally be nothing more than a possible memory allocation/copy plus copying of the data into the memory. stringstream has to deal with things like locales, etc, even for basic write calls.
Also, std::string provides ways to minimize or eliminate anything but the first memory allocation. If you reserve sufficient space, every insertion is little more than a memcpy. That's not really possible with stringstream.
Even if it were faster than std::string's appending functions, you still have to copy the string out of the stringstream to do something with it. So that's another allocation + copy, which you won't need with std::string. Though at least C++20 looks set to remove that particular need.
You should use std::stringstream if you need formatting, not just for sticking some strings together.

Related

What are the factors to use std::string vs std::stringstream? [duplicate]

I know one of the advantages of std::stringstream is that it is a std::istream so it may accept input from any type that defines operator<< to std::istream, and also from primitives types.
I am not going to use operator<<; instead I am just going to concatenate many strings. Does the implementation of std::stringstream make it faster than std::string for concatenating many strings?
There's no reason to expect std::string's appending functions to be slower than stringstream's insertion functions. std::string will generally be nothing more than a possible memory allocation/copy plus copying of the data into the memory. stringstream has to deal with things like locales, etc, even for basic write calls.
Also, std::string provides ways to minimize or eliminate anything but the first memory allocation. If you reserve sufficient space, every insertion is little more than a memcpy. That's not really possible with stringstream.
Even if it were faster than std::string's appending functions, you still have to copy the string out of the stringstream to do something with it. So that's another allocation + copy, which you won't need with std::string. Though at least C++20 looks set to remove that particular need.
You should use std::stringstream if you need formatting, not just for sticking some strings together.

Reading large strings in C++ -- is there a safe fast way?

http://insanecoding.blogspot.co.uk/2011/11/how-to-read-in-file-in-c.html reviews a number of ways of reading an entire file into a string in C++. The key code for the fastest option looks like this:
std::string contents;
in.seekg(0, std::ios::end);
contents.resize(in.tellg());
in.seekg(0, std::ios::beg);
in.read(&contents[0], contents.size());
Unfortunately, this is not safe as it relies on the string being implemented in a particular way. If, for example, the implementation was sharing strings then modifying the data at &contents[0] could affect strings other than the one being read. (More generally, there's no guarantee that this won't trash arbitrary memory -- it's unlikely to happen in practice, but it's not good practice to rely on that.)
C++ and the STL are designed to provide features that are efficient as C, so one would expect there to be a version of the above that was just as fast but guaranteed to be safe.
In the case of vector<T>, there are functions which can be used to access the raw data, which can be used to read a vector efficiently:
T* vector::data();
const T* vector::data() const;
The first of these can be used to read a vector<T> efficiently. Unfortunately, the string equivalent only provides the const variant:
const char* string::data() const noexcept;
So this cannot be used to read a string efficiently. (Presumably the non-const variant is omitted to support the shared string implementation.)
I have also checked the string constructors, but the ones that accept a char* copy the data -- there's no option to move it.
Is there a safe and fast way of reading the whole contents of a file into a string?
It may be worth noting that I want to read a string rather than a vector<char> so that I can access the resulting data using a istringstream. There's no equivalent of that for vector<char>.
If you really want to avoid copies, you can slurp the file into a std::vector<char>, and then roll your own std::basic_stringbuf to pull data from the vector.
You can then declare a std::istringstream and use std::basic_ios::rdbuf to replace the input buffer with your own one.
The caveat is that if you choose to call istringstream::str it will invoke std::basic_stringbuf::str and will require a copy. But then, it sounds like you won't be needing that function, and can actually stub it out.
Whether you get better performance this way would require actual measurement. But at least you avoid having to have two large contiguous memory blocks during the copy. Additionally, you could use something like std::deque as your underlying structure if you want to cope with truly huge files that cannot be allocated in contiguous memory.
It's also worth mentioning that if you're really just streaming that data you are essentially double-buffering by reading it into a string first. Unless you also require the contents in memory for some other purpose, the buffering inside std::ifstream is likely to be sufficient. If you do slurp the file, you may get a boost by turning buffering off.
I think using &string[0] is just fine, and it should work with the widely used standard library implementations (even if it is technically UB).
But since you mention that you want to put the data into an istringstream, here's an alternative:
Read the data into a char array (new char[in.tellg()])
Construct a stringstream (without the leading 'i')
Insert the data with stringstream::write
The istringstream would have to copy the data anyway, because a std::stringstream doesn't store a std::string internally as far as I'm aware, so you can leave the std::string away and put the data into it directly.
EDIT: Actually, instead of the manual allocation (or make_unique), this way you could also use the vector<char> you mentioned.

Is there a way to pass ownership of an existing char* in heap to a std::string? [duplicate]

I have a situation where I need to process large (many GB's) amounts of data as such:
build a large string by appending many smaller (C char*) strings
trim the string
convert the string into a C++ const std::string for processing (read only)
repeat
The data in each iteration are independent.
My question is, I'd like to minimise (if possible eliminate) heap allocated memory usage, as it at the moment is my largest performance problem.
Is there a way to convert a C string (char*) into a stl C++ string (std::string) without requiring std::string to internally alloc/copy the data?
Alternatively, could I use stringstreams or something similar to re-use a large buffer?
Edit: Thanks for the answers, for clarity, I think a revised question would be:
How can I build (via multiple appends) a stl C++ string efficiently. And if performing this action in a loop, where each loop is totally independant, how can I re-use thisallocated space.
You can't actually form a std::string without copying the data. A stringstream would probably reuse the memory from pass to pass (though I think the standard is silent on whether it actually has to), but it still wouldn't avoid the copying.
A common approach to this sort of problem is to write the code which processes the data in step 3 to use a begin/end iterator pair; then it can easily process either a std::string, a vector of chars, a pair of raw pointers, etc. Unlike passing it a container type like std::string, it would no longer know or care how the memory was allocated, since it would still belong to the caller. Carrying this idea to its logical conclusion is boost::range, which adds all the overloaded constructors to still let the caller just pass a string/vector/list/any sort of container with .begin() and .end(), or separate iterators.
Having written your processing code to work on an arbitrary iterator range, you could then even write a custom iterator (not as hard as it sounds, basically just an object with some standard typedefs, and operator ++/*/=/==/!= overloaded to get a forward-only iterator) that takes care of advancing to the next fragment each time it hit the end of the one it's working on, skipping over whitespace (I assume that's what you meant by trim). That you never had to assemble the whole string contiguously at all. Whether or not this would be a win depends on how many fragments/how large of fragments you have. This is essentially what the SGI rope mentioned by Martin York is: a string where append forms a linked list of fragments instead of a contiguous buffer, which is thus suitable for much longer values.
UPDATE (since I still see occasional upvotes on this answer):
C++17 introduces another choice: std::string_view, which replaced std::string in many function signatures, is a non-owning reference to a character data. It is implicitly convertible from std::string, but can also be explicitly constructed from contiguous data owned somewhere else, avoiding the unnecessary copying std::string imposes.
Is it at all possible to use a C++ string in step 1? If you use string::reserve(size_t), you can allocate a large enough buffer to prevent multiple heap allocations while appending the smaller strings, and then you can just use that same C++ string throughout all of the remaining steps.
See this link for more information on the reserve function.
To help with really big strings SGI has the class Rope in its STL.
Non standard but may be usefull.
http://www.sgi.com/tech/stl/Rope.html
Apparently rope is in the next version of the standard :-)
Note the developer joke. A rope is a big string. (Ha Ha) :-)
This is a lateral thinking answer, not directly addressing the question but "thinking" around it. Might be useful, might not...
Readonly processing of std::string doesn't really require a very complex subset of std::string's features. Is there a possibility that you could do search/replace on the code that performs all the processing on std::strings so it takes some other type instead? Start with a blank class:
class lightweight_string { };
Then replace all std::string references with lightweight_string. Perform a compilation to find out exactly what operations are needed on lightweight_string for it to act as a drop-in replacement. Then you can make your implementation work however you want.
Is each iteration independent enough that you can use the same std::string for each iteration? One would hope that your std::string implementation is smart enough to re-use memory if you assign a const char * to it when it was previously used for something else.
Assigning a char * into a std::string must always at least copy the data. Memory management is one of the main reasons to use std::string, so you won't be a able to override it.
In this case, might it be better to process the char* directly, instead of assigning it to a std::string.

Why do you prefer char* instead of string, in C++?

I'm a C programmer trying to write c++ code. I heard string in C++ was better than char* in terms of security, performance, etc, however sometimes it seems that char* is a better choice. Someone suggested that programmers should not use char* in C++ because we could do all things that char* could do with string, and it's more secure and faster.
Did you ever used char* in C++? What are the specific conditions?
It's safer to use std::string because you don't need to worry about allocating / deallocating memory for the string. The C++ std::string class is likely to use a char* array internally. However, the class will manage the allocation, reallocation, and deallocation of the internal array for you. This removes all the usual risks that come with using raw pointers, such as memory leaks, buffer overflows, etc.
Additionally, it's also incredibly convenient. You can copy strings, append to a string, etc., without having to manually provide buffer space or use functions like strcpy/strcat. With std::string it's as simple as using the = or + operators.
Basically, it's:
std::string s1 = "Hello ";
std::string s2 = s1 + "World";
versus...
const char* s1 = "Hello";
char s2[1024]; // How much should I really even allocate here?
strcpy(s2, s1);
strcat(s2, " World ");
Edit:
In response to your edit regarding the use of char* in C++: Many C++ programmers will claim you should never use char* unless you're working with some API/legacy function that requires it, in which case you can use the std::string::c_str() function to convert an std::string to const char*.
However, I would say there are some legitimate uses of C-arrays in C++. For example, if performance is absolutely critical, a small C-array on the stack may be a better solution than std::string. You may also be writing a program where you need absolute control over memory allocation/deallocation, in which case you would use char*. Also, as was pointed out in the comments section, std::string isn't guaranteed to provide you with a contiguous, writable buffer *, so you can't directly write from a file into an std::string if you need your program to be completely portable. However, in the event you need to do this, std::vector would still probably be preferable to using a raw C-array.
* Although in C++11 this has changed so that std::string does provide you with a contiguous buffer
Ok, the question changed a lot since I first answered.
Native char arrays are a nightmare of memory management and buffer overruns compared to std::string. I always prefer to use std::string.
That said, char array may be a better choice in some circumstances due to performance constraints (although std::string may actually be faster in some cases -- measure first!) or prohibition of dynamic memory usage in an embedded environment, etc.
In general, std::string is a cleaner, safer way to go because it removes the burden of memory management from the programmer. The main reason it can be faster than char *'s, is that std::string stores the length of the string. So, you don't have to do the work of iterating through the entire character array looking for the terminating NULL character each time you want to do a copy, append, etc.
That being said, you will still find a lot of c++ programs that use a mix of std::string and char *, or have even written their own string classes from scratch. In older compilers, std::string was a memory hog and not necessarily as fast as it could be. This has gotten better over time, but some high-performance applications (e.g., games and servers) can still benefit from hand-tuned string manipulations and memory-management.
I would recommend starting out with std::string, or possibly creating a wrapper for it with more utility functions (e.g., starts_with(), split(), format(), etc.). If you find when benchmarking your code that string manipulation is a bottleneck, or uses too much memory, you can then decide if you want to accept the extra risks and testing that a custom string library demands.
TIP: One way of getting around the memory issues and still use std::string is to use an embedded database such as SQLite. This is particularly useful when generating and manipulating extremely large lists of strings, and performance is better than what you might expect.
C char * strings cannot contain '\0' characters. C++ string can handle null characters without a problem. If users enter strings containing \0 and you use C strings, your code may fail. There are also security issues associated with this.
Implementations of std::string hide the memory usage from you. If you're writing performance-critical code, or you actually have to worry about memory fragmentation, then using char* can save you a lot of headaches.
For anything else though, the fact that std::string hides all of this from you makes it so much more usable.
String may actually be better in terms of performance. And innumerable other reasons - security, memory management, convenient string functions, make std::string an infinitely better choice.
Edit: To see why string might be more efficient, read Herb Sutter's books - he discusses a way to internally implement string to use Lazy Initialization combined with Referencing.
Use std::string for its incredible convenience - automatic memory handling and methods / operators. With some string manipulations, most implementations will have optimizations in place (such as delayed evaluation of several subsequent manipulations - saves memory copying).
If you need to rely on the specific char layout in memory for other optimizations, try std::vector<char> instead. If you have a non-empty vector vec, you can get a char* pointer using &vec[0] (the vector has to be nonempty).
Short answer, I don't. The exception is when I'm using third party libraries that require them. In those cases I try to stick to std::string::c_str().
In all my professional career I've had an opportunity to use std::string at only two projects. All others had their own string classes :)
Having said that, for new code I generally use std::string when I can, except for module boundaries (functions exported by dlls/shared libraries) where I tend to expose C interface and stay away from C++ types and issues with binary incompatibilities between compilers and std library implementations.
Compare and contrast the following C and C++ examples:
strlen(infinitelengthstring)
versus
string.length()
std::string is almost always preferred. Even for speed, it uses small array on the stack before dynamically allocating more for larger strings.
However, char* pointers are still needed in many situations for writing strings/data into a raw buffer (e.g. network I/O), which can't be done with std::string.
The only time I've recently used a C-style char string in a C++ program was on a project that needed to make use of two C libraries that (of course) used C strings exclusively. Converting back and forth between the two string types made the code really convoluted.
I also had to do some manipulation on the strings that's actually kind of awkward to do with std::string, but I wouldn't have considered that a good reason to use C strings in the absence of the above constraint.

initializing std::string from char* without copy

I have a situation where I need to process large (many GB's) amounts of data as such:
build a large string by appending many smaller (C char*) strings
trim the string
convert the string into a C++ const std::string for processing (read only)
repeat
The data in each iteration are independent.
My question is, I'd like to minimise (if possible eliminate) heap allocated memory usage, as it at the moment is my largest performance problem.
Is there a way to convert a C string (char*) into a stl C++ string (std::string) without requiring std::string to internally alloc/copy the data?
Alternatively, could I use stringstreams or something similar to re-use a large buffer?
Edit: Thanks for the answers, for clarity, I think a revised question would be:
How can I build (via multiple appends) a stl C++ string efficiently. And if performing this action in a loop, where each loop is totally independant, how can I re-use thisallocated space.
You can't actually form a std::string without copying the data. A stringstream would probably reuse the memory from pass to pass (though I think the standard is silent on whether it actually has to), but it still wouldn't avoid the copying.
A common approach to this sort of problem is to write the code which processes the data in step 3 to use a begin/end iterator pair; then it can easily process either a std::string, a vector of chars, a pair of raw pointers, etc. Unlike passing it a container type like std::string, it would no longer know or care how the memory was allocated, since it would still belong to the caller. Carrying this idea to its logical conclusion is boost::range, which adds all the overloaded constructors to still let the caller just pass a string/vector/list/any sort of container with .begin() and .end(), or separate iterators.
Having written your processing code to work on an arbitrary iterator range, you could then even write a custom iterator (not as hard as it sounds, basically just an object with some standard typedefs, and operator ++/*/=/==/!= overloaded to get a forward-only iterator) that takes care of advancing to the next fragment each time it hit the end of the one it's working on, skipping over whitespace (I assume that's what you meant by trim). That you never had to assemble the whole string contiguously at all. Whether or not this would be a win depends on how many fragments/how large of fragments you have. This is essentially what the SGI rope mentioned by Martin York is: a string where append forms a linked list of fragments instead of a contiguous buffer, which is thus suitable for much longer values.
UPDATE (since I still see occasional upvotes on this answer):
C++17 introduces another choice: std::string_view, which replaced std::string in many function signatures, is a non-owning reference to a character data. It is implicitly convertible from std::string, but can also be explicitly constructed from contiguous data owned somewhere else, avoiding the unnecessary copying std::string imposes.
Is it at all possible to use a C++ string in step 1? If you use string::reserve(size_t), you can allocate a large enough buffer to prevent multiple heap allocations while appending the smaller strings, and then you can just use that same C++ string throughout all of the remaining steps.
See this link for more information on the reserve function.
To help with really big strings SGI has the class Rope in its STL.
Non standard but may be usefull.
http://www.sgi.com/tech/stl/Rope.html
Apparently rope is in the next version of the standard :-)
Note the developer joke. A rope is a big string. (Ha Ha) :-)
This is a lateral thinking answer, not directly addressing the question but "thinking" around it. Might be useful, might not...
Readonly processing of std::string doesn't really require a very complex subset of std::string's features. Is there a possibility that you could do search/replace on the code that performs all the processing on std::strings so it takes some other type instead? Start with a blank class:
class lightweight_string { };
Then replace all std::string references with lightweight_string. Perform a compilation to find out exactly what operations are needed on lightweight_string for it to act as a drop-in replacement. Then you can make your implementation work however you want.
Is each iteration independent enough that you can use the same std::string for each iteration? One would hope that your std::string implementation is smart enough to re-use memory if you assign a const char * to it when it was previously used for something else.
Assigning a char * into a std::string must always at least copy the data. Memory management is one of the main reasons to use std::string, so you won't be a able to override it.
In this case, might it be better to process the char* directly, instead of assigning it to a std::string.