x = x + x
x += x
x.append(x)
Are the three statements above equivalent in terms of time and space complexity?
I'm thinking that using "+" creates a new string, while append just adds to the existing string, so append should be faster and use less space.
This is a question that is more theory/knowledge based...
You're calling three (actually four) different functions: operator=(s), operator+(s), operator+=(s) and append(s), and the time and space requirements (and concerns) will depend on the exact code of these functions. There's no generic answer to your question. Even for a given class, like std::string, it would be implementation dependent.
Please note that in the first case (x=x+x), you're doing two operations (that is, calling two functions). And also note that operator=(s), operator+=(s) and append(s) act directly on a reference to the main object (I'm talking about the this, and not about the additional parameter). The reference manual for operator+=(s) and append(s) states "Complexity: There are no standard complexity guarantees, typical implementations behave similar to std::vector::insert."
The idea of append() being faster because it "just adds to the existing string" could be right if the "existing string" had reserved space (which is not the case).
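To illustrate, here is a minimal sketch (the capacity and loop count are made up for demonstration) contrasting the temporary created by operator+ with in-place growth, and showing how an explicit reserve() makes subsequent appends allocation-free:

#include <string>

int main() {
    std::string x = "example";

    // x = x + x: operator+ builds a temporary string (an allocation and
    // two copies), then operator= replaces x's contents with it.
    x = x + x;

    // x += x: grows x in place; a reallocation happens only when the
    // new size exceeds the current capacity.
    x += x;

    // Reserving up front makes the appends below allocation-free.
    std::string y;
    y.reserve(1024);           // capacity chosen only for illustration
    for (int i = 0; i < 8; ++i)
        y.append("fragment");  // fits in reserved space, no reallocation
}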
Before I begin, I need to state that my application uses lots of strings, which are on average quite small, and which do not change once created.
In Visual Studio 2010, I noticed that the capacity of std::string is at least 30. Even if I write std::string str = "test";, the capacity of str is 30. The function str.shrink_to_fit() does nothing about this, although a function with the same name exists for std::vector and works as expected, namely decreasing the capacity so that capacity == size.
Why does std::string::shrink_to_fit() not work as expected?
How can I ensure that the string allocates the least amount of memory?
Your std::string implementation most likely uses some form of the short string optimization resulting in a fixed size for smaller strings and no effect for shrink_to_fit. Note that shrink_to_fit is non-binding for the implementation, so this is actually conforming.
You could use a vector<char> to get more precise memory management, but you would lose some of the additional functionality of std::string. You could also write your own string wrapper which uses a vector internally.
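To see the short string optimization in action, a small sketch (the exact capacities are implementation-specific; 15 is common today, 30 in the VS2010 case above):

#include <iostream>
#include <string>

int main() {
    std::string s = "test";
    // With SSO the characters live inside the string object itself, so
    // capacity() reports the size of that internal buffer.
    std::cout << "small: size=" << s.size()
              << " capacity=" << s.capacity() << '\n';

    std::string big(100, 'x');  // too long for the internal buffer
    std::cout << "large: size=" << big.size()
              << " capacity=" << big.capacity() << '\n';
}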
One reason that std::string::shrink_to_fit() does nothing is that it is not required to by the standard:
Remarks: shrink_to_fit is a non-binding request to reduce capacity() to size(). [ Note: The request is non-binding to allow latitude for implementation-specific optimizations. —end note ]
If you want to make sure the string shrinks then you can use the swap() trick, like:
std::string(string_to_shrink).swap(string_to_shrink)
Another reason this may not work is that the implementer of std::string is allowed to implement short string optimization so you could always have a minimum size of 30 on your implementation.
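Expanded into a self-contained sketch (the capacities shown are typical, not guaranteed, and the result can never drop below the SSO buffer size):

#include <iostream>
#include <string>

int main() {
    std::string s(1000, 'x');
    s.erase(10);                        // size 10, capacity still ~1000

    std::string(s).swap(s);             // copy-construct a right-sized
                                        // temporary, then swap it in
    std::cout << s.capacity() << '\n';  // close to size(), but at least
                                        // the SSO buffer size
}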
What you observe is a result of SSO (short string optimization), as pointed out by others.
What you could do about it depends on the usage pattern:
If your strings are parts of one big string, which is typical for parsing, you can use classes like std::experimental::string_view, GSL string_span, Google's StringPiece, LLVM's StringRef etc., which do not store data themselves but only refer to a piece of some other string, while providing an interface similar to std::string.
If there are multiple copies of the same strings (especially long ones), it may make sense to use CoW (copy-on-write) strings, where copies share the same buffer using a reference-counting mechanism until modified (but be aware of the downsides).
If the strings are very short (just a few chars) it may make sense to write your own specialized class, something in line with Handling short codes by Andrzej (a sketch follows below).
Whatever case you chose, it is important to establish good benchmarking procedure to clearly see what effect (if any) you get.
Update: after rereading the introduction to the question, I think the third approach is the best for you.
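A minimal sketch of that third approach, under the assumption that the strings are short codes with a known upper bound on length (the class name and limit are hypothetical):

#include <cstddef>
#include <cstring>
#include <stdexcept>

// Hypothetical fixed-capacity string for very short codes: the
// characters live in the object itself, so there is no heap allocation
// at all, at the cost of a hard length limit.
template <std::size_t N>
class short_code {
public:
    explicit short_code(const char* s) : len_(std::strlen(s)) {
        if (len_ > N) throw std::length_error("code too long");
        std::memcpy(buf_, s, len_);
    }
    const char* data() const { return buf_; }
    std::size_t size() const { return len_; }
private:
    char buf_[N];
    std::size_t len_;
};

// Usage: short_code<8> c("EURUSD");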
If you are using a lot of small strings in your application then you might want to take a look at fbstring (https://github.com/facebook/folly/blob/master/folly/docs/FBString.md).
I was wondering if there was a limit on the number of parameters you can pass to a function.
I'm just wondering because I have to maintain functions with 5+ arguments here at my job.
And is there a critical threshold in the number of arguments, performance-wise, or is it linear?
Neither the C nor C++ standard places an absolute requirement on the number of arguments/parameters you must be able to pass when calling a function, but the C standard suggests that an implementation should support at least 127 parameters/arguments (§5.2.4.1/1), and the C++ standard suggests that it should support at least 256 parameters/arguments (§B/2).
The precise wording from the C standard is:
The implementation shall be able to translate and execute at least one program that
contains at least one instance of every one of the following limits.
So, one such function must be successfully translated, but there's no guarantee that compilation will succeed if your code attempts to do the same (though it probably will, in a modern implementation).
The C++ standard doesn't even go that far, only going so far as to say that:
The bracketed number following each quantity is recommended as the minimum for that quantity. However, these quantities are only guidelines and do not determine compliance.
As far as what's advisable: it depends. A few functions (especially those using variadic parameters/variadic templates) accept an arbitrary number of arguments of (more or less) arbitrary types. In this case, passing a relatively large number of parameters can make sense because each is more or less independent from the others (e.g., printing a list of items).
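For instance, a sketch of the variadic-template case, where each argument is independent of the others (the function name is illustrative):

#include <iostream>

// Accepts any number of streamable arguments and prints them, using a
// C++17 fold expression over the parameter pack.
template <typename... Args>
void print_all(const Args&... args) {
    ((std::cout << args << ' '), ...);
    std::cout << '\n';
}

// Usage: print_all(1, "two", 3.0, 'x');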
When the parameters are more...interdependent, so you're not just passing a list or something on that order, I agree that the number should be considerably more limited. In C, I've seen a few go as high as 10 or so without being terribly unwieldy, but that's definitely starting to push the limit even at best. In C++, it's generally so much easier (and more common) to aggregate related items into a struct or class that I can't quite imagine using that many parameters unless it was in a C-compatibility layer or something on that order, where a more...structured approach might force even more work on the user.
In the end, it comes down to this: you're going to either have to pass a smaller number of items that are individually larger, or else break the function call up into multiple calls, passing a smaller number of parameters to each.
The latter tends to lead toward a stateful interface that basically forces a number of calls in a more or less fixed order. You've reduced the complexity of a single call, but may easily have done little or nothing to reduce the overall complexity of the code.
In the other direction, a large number of parameters may well mean that you've really defined the function to carry out a large number of related tasks instead of one clearly defined task. In this case, finding more specific tasks for individual functions to carry out, and passing a smaller set of parameters needed by each may well reduce the overall complexity of the code.
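As a sketch of the aggregation idea (the names below are invented for illustration):

// Before: eight interdependent parameters passed individually.
// void draw(int x, int y, int w, int h, int r, int g, int b, bool fill);

// After: related items grouped into structs, so each argument conveys
// one coherent piece of data.
struct Rect  { int x, y, w, h; };
struct Color { int r, g, b; };

void draw(const Rect& rect, const Color& color, bool fill);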
It seems like you're veering into subjective territory, considering that C varargs are (usually) passed mechanically the same way as other arguments.
The first few arguments are placed in CPU registers, under most ABIs. How many depends on the number of architectural registers; it may vary from two to ten. In C++, empty classes (such as overload dispatch tags) are usually omitted entirely. Loading data into registers is usually "cheap as free."
After registers, arguments are copied onto the stack. You could say this takes linear time, but such operations are not all created equal. If you are going to be calling a series of functions on the same arguments, you might consider packaging them together as a struct and passing that by reference.
To literally answer your question, the maximum number of arguments is an implementation-defined quantity, meaning that the ISO standard requires your compiler manual to document it. The C++ standard also recommends (Annex B) that no implementation balk at less than 256 arguments, which should be Enough For Anyone™. C requires (§5.2.4.1) support for at least 127 arguments, although that requirement is normatively qualified such as to weaken it to only a recommendation.
It is not really dirty; sometimes you can't avoid using 4+ arguments while maintaining stability and efficiency. If possible it should be minimized for the sake of clarity (perhaps by use of structs), especially if you think that some function is becoming a god construct (a function that runs most of the program; such functions should be avoided for the sake of stability). If this is the case, functions that take large numbers of arguments are pretty good indicators of such constructs.
Consider that we have declared a string like this: string x; and a vector of chars like this: vector<char> x_vec;
I was thinking if there is any advantage of doing
cout << x;
Over
for (int i = 0; i < x.length(); i++)
    cout << x[i];
Or
for (int i = 0; i < x_vec.size(); i++)
    cout << x_vec[i];
in terms of performance? My point is that very often we get to the point where we must choose between strings and vectors of chars. Is the first example actually treated or approached by the program differently from the others?
My point is because very often we get to the point where we must choose between strings and vectors of chars.
Very often? I don't think so.
If something is fundamentally a string, just use std::string.
If and when you can prove that performance is suboptimal (usually by profiling your program on real data), then consider alternatives. std::vector<char> is one such alternative, but there are others. Which, if any, would be preferable depends on the actual use case.
In all likelihood it'll be a while before you encounter a compelling real-world case for replacing std::string with std::vector<char>.
There is a distinct advantage of using
out << str;
over the loops writing characters individually: the formatted output operators, including the one for char, create a std::ostream::sentry object for each output operation. Also, since the stream doesn't know that you just wrote a character to it, it needs to recheck its internal state. If you want to profile writing sequences of characters compared to the above formatted output, you should use something like
out.write(str.c_str(), str.size());
or
std::copy(str.begin(), str.end(), std::ostreambuf_iterator<char>(out));
I would expect the formatted output and the version using write() to be about the same in performance, and the version using std::copy() to probably be slower, although there is no good reason it has to be slower other than standard C++ libraries not bothering to create a fast implementation: I know that it can be done efficiently, mainly because I did it for my experimental standard C++ library implementation.
There is a loop in all three cases: in the first case it's inside the implementation of operator<<, which calls the OS, which does the looping, while in the other two cases it is in your code.
The last two cases are identical in terms of performance, if not in terms of generated code: both strings and vectors use contiguous storage, so their operator []s are extremely fast.
The first case, where the loop belongs in the implementation of the operator, may be optimized better when the implementation calls through to the underlying operating system. The most important point, however, is readability: a single line with a simple statement always reads better than even a simple loop.
In general, the biggest difference between strings and vectors of chars is the set of primitives supported by the two containers: strings are geared toward conveying string-like semantics (making substrings, simple searches), while vectors are better to convey array-like semantics (sequential collections of items with fast access to an arbitrary index). In terms of performance, the two structures are very similar.
I have a situation where I need to process large (many GB's) amounts of data as such:
build a large string by appending many smaller (C char*) strings
trim the string
convert the string into a C++ const std::string for processing (read only)
repeat
The data in each iteration are independent.
My question is, I'd like to minimise (if possible eliminate) heap-allocated memory usage, as it is currently my largest performance problem.
Is there a way to convert a C string (char*) into a stl C++ string (std::string) without requiring std::string to internally alloc/copy the data?
Alternatively, could I use stringstreams or something similar to re-use a large buffer?
Edit: Thanks for the answers; for clarity, I think a revised question would be:
How can I build (via multiple appends) an STL C++ string efficiently? And if performing this action in a loop, where each iteration is totally independent, how can I re-use the allocated space?
You can't actually form a std::string without copying the data. A stringstream would probably reuse the memory from pass to pass (though I think the standard is silent on whether it actually has to), but it still wouldn't avoid the copying.
A common approach to this sort of problem is to write the code which processes the data in step 3 to use a begin/end iterator pair; then it can easily process either a std::string, a vector of chars, a pair of raw pointers, etc. Unlike passing it a container type like std::string, it would no longer know or care how the memory was allocated, since it would still belong to the caller. Carrying this idea to its logical conclusion is boost::range, which adds all the overloaded constructors to still let the caller just pass a string/vector/list/any sort of container with .begin() and .end(), or separate iterators.
Having written your processing code to work on an arbitrary iterator range, you could then even write a custom iterator (not as hard as it sounds, basically just an object with some standard typedefs, and operator ++/*/=/==/!= overloaded to get a forward-only iterator) that takes care of advancing to the next fragment each time it hits the end of the one it's working on, skipping over whitespace (I assume that's what you meant by trim). That way you never have to assemble the whole string contiguously at all. Whether or not this would be a win depends on how many fragments you have and how large they are. This is essentially what the SGI rope mentioned by Martin York is: a string where append forms a linked list of fragments instead of a contiguous buffer, which is thus suitable for much longer values.
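A minimal sketch of the iterator-range idea (the function is invented for illustration):

#include <cstddef>
#include <string>
#include <vector>

// Processing code written against an iterator pair doesn't care whether
// the caller owns a std::string, a vector<char>, or raw pointers.
template <typename It>
std::size_t count_spaces(It first, It last) {
    std::size_t n = 0;
    for (; first != last; ++first)
        if (*first == ' ') ++n;
    return n;
}

// All of these work with the same function:
//   std::string s = "a b c";    count_spaces(s.begin(), s.end());
//   std::vector<char> v(s.begin(), s.end());
//                               count_spaces(v.begin(), v.end());
//   const char* p = "a b c";    count_spaces(p, p + 5);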
UPDATE (since I still see occasional upvotes on this answer):
C++17 introduces another choice: std::string_view, which has replaced std::string in many function signatures, is a non-owning reference to character data. It is implicitly convertible from std::string, but can also be explicitly constructed from contiguous data owned somewhere else, avoiding the unnecessary copying std::string imposes.
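For example, a sketch of read-only processing over a caller-owned char* buffer with no allocation or copy (the function names are illustrative):

#include <cstddef>
#include <string_view>

// Read-only processing can take a string_view from any source.
void process(std::string_view sv) { /* ... inspect sv ... */ (void)sv; }

void example(const char* buf, std::size_t len) {
    // Constructing a string_view stores just the pointer and length;
    // nothing is allocated or copied.
    process(std::string_view(buf, len));
}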
Is it at all possible to use a C++ string in step 1? If you use string::reserve(size_t), you can allocate a large enough buffer to prevent multiple heap allocations while appending the smaller strings, and then you can just use that same C++ string throughout all of the remaining steps.
See this link for more information on the reserve function.
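A sketch of steps 1-4 with one reusable buffer (the reserve size and pass count are placeholders): reserve once, append the fragments, process, then clear(), which on typical implementations keeps the capacity for the next iteration.

#include <string>

void run() {
    std::string buffer;
    buffer.reserve(64 * 1024 * 1024);  // hypothetical per-pass upper bound

    for (int pass = 0; pass < 10; ++pass) {
        buffer.clear();                // size back to 0, capacity retained
        // for each fragment: buffer.append(fragment);
        // trim, then process the (now effectively const) buffer contents
    }
}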
To help with really big strings, SGI has the class rope in its STL.
Non-standard, but may be useful.
http://www.sgi.com/tech/stl/Rope.html
Apparently rope is in the next version of the standard :-)
Note the developer joke. A rope is a big string. (Ha Ha) :-)
This is a lateral thinking answer, not directly addressing the question but "thinking" around it. Might be useful, might not...
Read-only processing of std::string doesn't really require a very complex subset of std::string's features. Is there a possibility that you could do a search/replace on the code that performs all the processing on std::strings so it takes some other type instead? Start with a blank class:
class lightweight_string { };
Then replace all std::string references with lightweight_string. Perform a compilation to find out exactly what operations are needed on lightweight_string for it to act as a drop-in replacement. Then you can make your implementation work however you want.
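As a hedged illustration, after a few compile-and-fix cycles the class might end up looking something like this (the members shown are guesses at what read-only processing typically demands; the memory stays owned by the caller):

#include <cstddef>

class lightweight_string {
public:
    lightweight_string(const char* data, std::size_t size)
        : data_(data), size_(size) {}
    std::size_t size() const { return size_; }
    char operator[](std::size_t i) const { return data_[i]; }
    const char* begin() const { return data_; }
    const char* end()   const { return data_ + size_; }
private:
    const char* data_;   // not owned: points into the caller's buffer
    std::size_t size_;
};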
Is each iteration independent enough that you can use the same std::string for each iteration? One would hope that your std::string implementation is smart enough to re-use memory if you assign a const char * to it when it was previously used for something else.
Assigning a char* into a std::string must always at least copy the data. Memory management is one of the main reasons to use std::string, so you won't be able to override it.
In this case, might it be better to process the char* directly, instead of assigning it to a std::string?