Suppose we have declared a string like this: string x; and a vector of chars like this: vector<char> x_vec;
I was wondering whether there is any advantage in doing
cout<<x;
Over
for(int i=0;i<x.length();i++)
cout<<x[i];
Or
for(int i=0;i<x_vec.size();i++)
cout<<x_vec[i];
in terms of performance? I ask because very often we get to the point where we must choose between strings and vectors of chars. Is the first example actually handled differently by the program than the other two?
I ask because very often we get to the point where we must choose between strings and vectors of chars.
Very often? I don't think so.
If something is fundamentally a string, just use std::string.
If and when you can prove that performance is suboptimal (usually by profiling your program on real data), then consider alternatives. std::vector<char> is one such alternative, but there are others. Which, if any, would be preferable depends on the actual use case.
In all likelihood it'll be a while before you encounter a compelling real-world case for replacing std::string with std::vector<char>.
There is a distinct advantage of using
out << str;
over the loops writing characters individually: the formatted output operators, including those for char, create a std::ostream::sentry object for each output operation. Also, since the stream doesn't know that you just wrote a character to it, it needs to recheck its internal state every time. If you want to profile writing sequences of characters compared to the above formatted output, you should use something like
out.write(str.c_str(), str.size());
or
std::copy(str.begin(), str.end(), std::ostreambuf_iterator<char>(out));
I would expect the formatted output and the version using write() to perform about the same, and the version using std::copy() to be slower. There is no good reason it has to be slower, other than standard C++ libraries not bothering to create a fast implementation: I know it can be done efficiently, mainly because I did it for my experimental standard C++ library implementation.
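A minimal sketch of the variants discussed above, assuming any std::ostream as the target. The function names are made up for illustration. It only shows that all four approaches emit the same bytes; which one is fastest has to be measured on your own standard library and data.

```cpp
#include <algorithm>
#include <iterator>
#include <ostream>
#include <sstream>
#include <string>

// Formatted output of the whole string: one sentry for the entire write.
void write_formatted(std::ostream& out, const std::string& str) {
    out << str;
}

// Formatted output per character: a sentry is constructed for every char.
void write_char_by_char(std::ostream& out, const std::string& str) {
    for (std::string::size_type i = 0; i < str.size(); ++i) {
        out << str[i];
    }
}

// Unformatted output of the whole buffer in a single call.
void write_unformatted(std::ostream& out, const std::string& str) {
    out.write(str.data(), static_cast<std::streamsize>(str.size()));
}

// Copy the characters straight into the underlying stream buffer.
void write_via_copy(std::ostream& out, const std::string& str) {
    std::copy(str.begin(), str.end(), std::ostreambuf_iterator<char>(out));
}
```

To compare them fairly, time each function against the same stream type (for example a file stream) with a string large enough to dominate the measurement noise.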
There is a loop in all three cases - in the first case, it's inside the implementation of operator << which calls the OS which does looping, while in the other two cases it is in your code.
The last two cases are identical in terms of performance, if not in terms of generated code: both strings and vectors use contiguous storage, so their operator []s are extremely fast.
The first case, where the loop belongs in the implementation of the operator, may be optimized better when the implementation calls through to the underlying operating system. The most important point, however, is readability: a single line with a simple statement always reads better than even a simple loop.
In general, the biggest difference between strings and vectors of chars is the set of primitives supported by the two containers: strings are geared toward conveying string-like semantics (making substrings, simple searches), while vectors are better to convey array-like semantics (sequential collections of items with fast access to an arbitrary index). In terms of performance, the two structures are very similar.
Before I begin, I need to state that my application uses lots of strings, which are on average quite small, and which do not change once created.
In Visual Studio 2010, I noticed that the capacity of std::string is at least 30. Even if I write std::string str = "test";, the capacity of str is 30. The function str.shrink_to_fit() does nothing about this although a function with the same name exists for std::vector and works as expected, namely decreasing the capacity so that capacity == size.
Why does std::string::shrink_to_fit() not work as expected?
How can I ensure that the string allocates the least amount of memory?
Your std::string implementation most likely uses some form of the short string optimization resulting in a fixed size for smaller strings and no effect for shrink_to_fit. Note that shrink_to_fit is non-binding for the implementation, so this is actually conforming.
You could use a vector<char> to get more precise memory management, but you would lose some of the additional functionality of std::string. You could also write your own string wrapper which uses a vector internally.
One reason that std::string::shrink_to_fit() does nothing is that it is not required to by the standard
Remarks: shrink_to_fit is a non-binding request to reduce capacity() to size(). [ Note: The request is non-binding to allow latitude for implementation-specific optimizations. —end note ]
If you want to make sure the string shrinks then you can use the swap() trick like
std::string(string_to_shrink).swap(string_to_shrink)
Another reason this may not work is that the implementer of std::string is allowed to implement short string optimization so you could always have a minimum size of 30 on your implementation.
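A small sketch of the swap trick, wrapped in a hypothetical helper named shrink_by_swap. How much the capacity actually drops is up to the implementation: with the short string optimization it will never go below the size of the inline buffer.

```cpp
#include <string>

// Copy-and-swap "shrink": the temporary copy gets whatever capacity the
// implementation chooses for a string of this size, then the two strings
// exchange buffers. The old, larger buffer is released with the temporary.
void shrink_by_swap(std::string& s) {
    std::string(s).swap(s);
}
```

Printing s.capacity() before and after the call on your own compiler is the quickest way to see whether it helps in your case.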
What you observe is a result of SSO (short string optimization), as pointed out by others.
What you could do about it depends on the usage pattern:
If your strings are parts of one big string, which is typical for parsing, you can use classes like std::experimental::string_view, GSL string_span, Google's StringPiece, LLVM's StringRef etc. which do not store data themselves but only refer to a piece of some other string, while providing an interface similar to std::string.
If there are multiple copies of the same strings (especially long ones), it may make sense to use CoW (copy-on-write) strings, where copies share the same buffer using reference counter mechanism until modified. (But be aware of downsides)
If the strings are very short (just a few chars) it may make sense to write your own specialized class, something in line with Handling short codes by Andrzej
Whichever approach you choose, it is important to establish a good benchmarking procedure so you can clearly see what effect (if any) you get.
Upd: after rereading the introduction to the question, I think the third approach is the best for you.
If you are using a lot of small strings in your application then you might want to take a look at fbstring (https://github.com/facebook/folly/blob/master/folly/docs/FBString.md).
I am a newbie to C++ and am running into problems with my teacher over using strings in my code. Though it is clear to me that I have to stop doing that in her class, I am curious as to why it is wrong. In this program the five strings I assigned were going to be reused no less than 4 to 5 times, therefore I put the text into strings. I was told to stop doing it as it is inefficient. Why? In C++, are textual strings supposed to be typed out as opposed to being stored in strings, and if so why? Below is some of the program, please tell me why it is bad.
string Bry = "berries";
string Veg = "vegetables";
string Flr = "flowers";
string AllStr;
float Tmp1, Precip;
int Tmp, FlrW, VegW, BryW, x, Selct;
bool Cont = true;
AllStr = Flr + ", " + Bry + ", " + "and " + Veg;
Answering whether using strings is inefficient is really something that very much depends on how you're using them.
First off, I would argue that you should be using C++ strings as a default - only going to raw C strings if you actually measure and find C++ strings to be too slow. The advantages (primarily for security) are just too great - it's all too easy to screw up buffer management with raw C strings. So I would disagree with your teacher that this is overly inefficient.
That said, it's important to understand the performance implications of using C++ strings. Since they usually manage a dynamically allocated buffer (very short strings may be stored inline by the small-string optimization), you may end up spending a lot of time copying and reallocating buffers. This is usually not a problem; usually there are other things which take up much more time. However, if you're doing this right in the middle of a loop that's critical to your program's performance, you may need to find another method.
In short, premature optimization is usually a bad idea. Write code that is obviously correct, even if it takes ever-so-slightly longer to run. But be aware of the costs and trade-offs you're making at the same time; that way, if it turns out that C++ strings are actually slowing your program down a lot, you'll know what to change to fix that.
Yes, it's fairly inefficient, for the following reasons:
When you construct a std::string object, it has to allocate a storage space for the string content (which may or may not be a separate dynamic memory allocation, depending on whether small-string optimization is in effect) and copy the literal string that is parameter of the constructor. For example, when you say: string Bry = "berries" it allocates a separate memory block (potentially from the dynamic memory), then copies "berries" to that block.
So you potentially have an extra dynamic memory allocation (costing time),
have to perform the copy (costing more time),
and end up with two copies of the same string (costing space).
Using std::string::operator+ produces a new string that is the result of concatenation. So when you write several + operators in a row, you have several temporary concatenation results and a lot of unnecessary copying.
For your example, I recommend:
Using string literals unless you actually need the functionality only available in std::string.
Using std::stringstream to concatenate several strings together.
Normally, code readability is preferred over micro-optimizations of this sort, but luckily you can have both performance and readability in this case.
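As a rough sketch of the second recommendation, the concatenation from the question could be written with a std::ostringstream. The function name and parameters are made up for illustration; the inputs stay plain string literals and only the final result becomes a std::string.

```cpp
#include <sstream>
#include <string>

// Builds "flowers, berries, and vegetables"-style text from C strings
// without creating intermediate std::string temporaries for each '+'.
std::string join_with_stream(const char* first,
                             const char* second,
                             const char* third) {
    std::ostringstream out;
    out << first << ", " << second << ", and " << third;
    return out.str();
}
```

Whether the stream is actually faster than a handful of + operations depends on the library; for a one-off string like this the difference is unlikely to be measurable.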
Your teacher is both right and wrong. S/he's right that building up strings from substrings at runtime is less CPU-efficient than simply providing the fully pre-built string in the code to start with -- but s/he's wrong in thinking that efficiency is necessarily an important factor to worry about in this case.
In a lot of cases, efficiency simply doesn't matter. At all. For example, if your code above is only going to be executed rarely (e.g. no more than once per second), then it's going to be literally impossible to measure any difference between the "most efficient version" and your not-so-efficient version. Given that, it's quite justifiable to decide that other factors (such as code readability and maintainability) are more important than maximizing efficiency.
Of course, if your program is going to be reconstructing these strings thousands or millions of times per second, then making sure your code is maximally efficient, even at the expense of readability/maintainability, is a good tradeoff to make. But I doubt that is the case here.
Your approach is almost perfect: try to declare everything only once. But if something is not used more than once, don't waste your fingers typing it :-) (e.g. in a 10-line program).
The only change I would suggest is to make the strings const to help the compiler optimize your program.
If your instructor still disagrees, get a new instructor.
It is inefficient. Doing that last line right would be 4-5 times faster.
At the very least you should use +=.
Using += means that you avoid creating new strings with the + operator.
The instructor knows that when you do string = string + string, C++ creates a new temporary string that is immediately destroyed.
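A minimal sketch of the += approach, assuming the three words arrive as std::string. The helper name join_in_place is hypothetical. Reserving the final length up front is an extra step not mentioned above; it lets the appends run without reallocating.

```cpp
#include <string>

// Appends every piece to one result string instead of chaining operator+,
// so no temporary std::string is created per concatenation.
std::string join_in_place(const std::string& first,
                          const std::string& second,
                          const std::string& third) {
    const std::string separator = ", ";
    const std::string last_separator = ", and ";

    std::string result;
    result.reserve(first.size() + second.size() + third.size() +
                   separator.size() + last_separator.size());
    result += first;
    result += separator;
    result += second;
    result += last_separator;
    result += third;
    return result;
}
```

The "4-5 times faster" figure above is the answerer's estimate; measure before relying on it.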
Efficiency is probably not a good argument against using string in school assignments, but yes, if I were a teacher and the topic were not some very high-level application, I wouldn't want my students using string.
The real reason is that string hides the low-level memory management. A student coming out of college should have basic memory management skills. Nowadays programmers in a working environment don't deal with memory management most of the time, but there are always situations where you need to understand what's happening under the hood to be able to reason about the problem you are encountering.
With the context given, it looks like you should just be able to declare AllStr as a const or string literal without all the substrings and addition. Assuming there's more to it, declaring them as std::string objects initialized from literals allocates memory at runtime. (And, not that there is any practical impact here, but you should be aware that STL container objects sometimes allocate a default minimum of space that is larger than the number of things initially in them, as part of their optimizations in anticipation of later modifying operations. I'm not sure whether std::string does so on declaration/assignment.) If you are only ever going to use them as literals, declaring them as const char* or in a #define is easier on both memory and runtime performance, and you can still use them as r-values in string operations. If you are using them in other ways in code you are not showing us, then whether they need to be strings depends on whether they ever need to be changed or manipulated.
If you are trying to learn coding, inefficiencies that don't matter in practice are still things you should be aware of and avoid if unnecessary. In production code, there are sometimes reasons to do something like this, but if it is not for any good reason, it's just sloppy. She's right to point it out, but what she should be doing is using that as a starting point for a conversation about the various tradeoffs involved - memory, speed, readability, maintainability, etc. If she's a teacher, she should be looking for "teaching moments" like this rather than just an opportunity to scold.
You can use string.append().
It avoids the temporaries that + creates (it does the same work as +=).
I have a situation where I need to process large (many GB's) amounts of data as such:
build a large string by appending many smaller (C char*) strings
trim the string
convert the string into a C++ const std::string for processing (read only)
repeat
The data in each iteration are independent.
My question is, I'd like to minimise (if possible eliminate) heap allocated memory usage, as it at the moment is my largest performance problem.
Is there a way to convert a C string (char*) into a stl C++ string (std::string) without requiring std::string to internally alloc/copy the data?
Alternatively, could I use stringstreams or something similar to re-use a large buffer?
Edit: Thanks for the answers, for clarity, I think a revised question would be:
How can I build (via multiple appends) an STL C++ string efficiently? And if performing this action in a loop, where each iteration is totally independent, how can I re-use this allocated space?
You can't actually form a std::string without copying the data. A stringstream would probably reuse the memory from pass to pass (though I think the standard is silent on whether it actually has to), but it still wouldn't avoid the copying.
A common approach to this sort of problem is to write the code which processes the data in step 3 to use a begin/end iterator pair; then it can easily process either a std::string, a vector of chars, a pair of raw pointers, etc. Unlike passing it a container type like std::string, it would no longer know or care how the memory was allocated, since it would still belong to the caller. Carrying this idea to its logical conclusion is boost::range, which adds all the overloaded constructors to still let the caller just pass a string/vector/list/any sort of container with .begin() and .end(), or separate iterators.
Having written your processing code to work on an arbitrary iterator range, you could then even write a custom iterator (not as hard as it sounds, basically just an object with some standard typedefs, and operator ++/*/=/==/!= overloaded to get a forward-only iterator) that takes care of advancing to the next fragment each time it hits the end of the one it's working on, skipping over whitespace (I assume that's what you meant by trim). That way you never have to assemble the whole string contiguously at all. Whether or not this would be a win depends on how many fragments you have and how large they are. This is essentially what the SGI rope mentioned by Martin York is: a string where append forms a linked list of fragments instead of a contiguous buffer, which is thus suitable for much longer values.
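A minimal sketch of the begin/end idea, using a made-up processing step (counting non-space characters) as a stand-in for the real work. The names count_non_space and total_non_space are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical processing step: counts the non-space characters in
// [first, last). It neither knows nor cares who owns the memory.
template <typename InputIt>
std::size_t count_non_space(InputIt first, InputIt last) {
    std::size_t count = 0;
    for (; first != last; ++first) {
        if (*first != ' ') {
            ++count;
        }
    }
    return count;
}

// The same routine accepts a std::string, a std::vector<char>,
// or a pair of raw pointers into a C buffer.
std::size_t total_non_space(const std::string& s,
                            const std::vector<char>& v,
                            const char* p, std::size_t length) {
    return count_non_space(s.begin(), s.end()) +
           count_non_space(v.begin(), v.end()) +
           count_non_space(p, p + length);
}
```

Because the raw-pointer call works directly on the C buffer, no std::string has to be constructed for the read-only step.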
UPDATE (since I still see occasional upvotes on this answer):
C++17 introduces another choice: std::string_view, a non-owning reference to character data, which has replaced std::string in many function signatures. A std::string converts to it implicitly, but it can also be constructed explicitly from contiguous data owned somewhere else, avoiding the unnecessary copying std::string imposes.
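A short sketch of how the trim and read-only steps might look with std::string_view, assuming a C++17 compiler. Both function names are invented for illustration, and "trim" is assumed to mean stripping leading and trailing spaces.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Trim without copying: the returned view still points into the
// caller's buffer, which must outlive the view.
std::string_view trim_spaces(std::string_view text) {
    const std::size_t first = text.find_first_not_of(' ');
    if (first == std::string_view::npos) {
        return std::string_view();
    }
    const std::size_t last = text.find_last_not_of(' ');
    return text.substr(first, last - first + 1);
}

// Hypothetical read-only processing step taking a non-owning view.
std::size_t count_commas(std::string_view text) {
    std::size_t count = 0;
    for (char c : text) {
        if (c == ',') {
            ++count;
        }
    }
    return count;
}
```

A char buffer filled by C code can be passed as std::string_view(buffer, length), and an existing std::string can be passed as-is.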
Is it at all possible to use a C++ string in step 1? If you use string::reserve(size_t), you can allocate a large enough buffer to prevent multiple heap allocations while appending the smaller strings, and then you can just use that same C++ string throughout all of the remaining steps.
See this link for more information on the reserve function.
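A rough sketch of reusing one std::string across iterations, with a hypothetical helper build_into. It relies on clear() keeping the allocated capacity, which the common implementations do, although the standard does not spell this out as a guarantee.

```cpp
#include <string>
#include <vector>

// Appends all fragments into `buffer`, reusing whatever storage it
// already owns from a previous iteration.
void build_into(std::string& buffer,
                const std::vector<const char*>& fragments) {
    buffer.clear();  // drops the contents; capacity is kept in practice
    for (const char* fragment : fragments) {
        buffer.append(fragment);
    }
}
```

The caller would declare the string once outside the loop, call reserve() with a generous estimate, and then pass the same object to build_into on every pass.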
To help with really big strings SGI has the class Rope in its STL.
Non-standard, but may be useful.
http://www.sgi.com/tech/stl/Rope.html
Apparently rope is in the next version of the standard :-)
Note the developer joke. A rope is a big string. (Ha Ha) :-)
This is a lateral thinking answer, not directly addressing the question but "thinking" around it. Might be useful, might not...
Readonly processing of std::string doesn't really require a very complex subset of std::string's features. Is there a possibility that you could do search/replace on the code that performs all the processing on std::strings so it takes some other type instead? Start with a blank class:
class lightweight_string { };
Then replace all std::string references with lightweight_string. Perform a compilation to find out exactly what operations are needed on lightweight_string for it to act as a drop-in replacement. Then you can make your implementation work however you want.
Is each iteration independent enough that you can use the same std::string for each iteration? One would hope that your std::string implementation is smart enough to re-use memory if you assign a const char * to it when it was previously used for something else.
Assigning a char * into a std::string must always at least copy the data. Memory management is one of the main reasons to use std::string, so you won't be able to override it.
In this case, might it be better to process the char* directly, instead of assigning it to a std::string?
I can see that almost all modern APIs are developed in the C language. There are reasons for that: processing speed, low level language, cross platform and so on.
Nowadays, I program in C++ because of its Object Orientation, the use of string, the STL but, mainly because it is a better C.
However, when my C++ programs need to interact with C APIs, I really get upset when I need to convert char[] types to C++ strings, then operate on these strings using their powerful methods, and finally convert these strings back to char[] (because the API needs to receive char[]).
If I repeat these operations for millions of records the processing times are higher because of the conversion task.
For that simple reason, I feel that char[] is an obstacle to treating C++ as a better C.
I would like to know if you feel the same, if not (I hope so!) I really would like to know which is the best way for C++ to coexist with char[] types without doing those awful conversions.
Thanks for your attention.
The C++ string class has a lot of problems, and yes, what you're describing is one of them.
More specifically, there is no way to do string processing without creating a copy of the string, which may be expensive.
And because virtually all string processing algorithms are implemented as class members, they can only be used on the string class.
A solution you might want to experiment with is the combination of Boost.Range and Boost.StringAlgo.
Range allows you to create sequences out of a pair of iterators. They don't take ownership of the data, so they don't copy the string. They just point to the beginning and end of your char* string.
And Boost.StringAlgo implements all the common string operations as non-member functions, that can be applied to any sequence of characters. Such as, for example, a Boost range.
The combination of these two libraries pretty much solves the problem. They let you avoid having to copy your strings to process them.
Another solution might be to store your string data as std::string all the time. When you need to pass a char* to some API function, simply pass it the address of the first character (&str[0]).
The problem with this second approach is that, before C++11, std::string did not guarantee that its buffer was contiguous or null-terminated, so you either had to rely on implementation details or manually add a null byte as part of the string. Since C++11, &str[0] points to contiguous, null-terminated storage.
If you use std::vector<char> instead of std::string, the underlying storage will be a C array that can be accessed with &someVec[0]. However, you do lose a lot of std::string conveniences such as operator+.
That said, I'd suggest just avoiding C APIs that mutate strings as much as possible. If you need to pass an immutable string to a C function, you can use c_str(), which is fast and non-copying on most std::string implementations.
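A small sketch of both directions, using std::strlen and std::snprintf as stand-ins for a real C API. The function names and the 32-byte buffer size are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Read-only C API: c_str() hands over the string's own buffer.
std::size_t length_via_c_api(const std::string& s) {
    return std::strlen(s.c_str());
}

// Mutating C API: let it write into a std::vector<char>, then build
// the std::string once from the result.
std::string format_id(int id) {
    std::vector<char> buffer(32);
    const int written =
        std::snprintf(&buffer[0], buffer.size(), "id-%d", id);
    if (written < 0) {
        return std::string();  // formatting failed
    }
    // snprintf reports the untruncated length; clamp to what fits.
    const std::size_t length =
        std::min(static_cast<std::size_t>(written), buffer.size() - 1);
    return std::string(&buffer[0], length);
}
```

The only copy happens when the final std::string is constructed from the vector's contents.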
I'm not sure what you mean by "conversion", but won't the following suffice for moving between char*, char[], and std::string?
char charString[] = {'a', 'b', 'c', '\0'};
std::string standardString(&charString[0]);
const char* stringPointer(standardString.c_str());
I don't think it's as bad as you make it out to be.
There is a cost converting a char[] to a std::string, but if you're going to be modifying the string, you have to pay that cost anyway whether converting to a std::string or copying to another char[] buffer.
The conversion going the other way (via string.c_str()) is usually trivial. It's usually returning a pointer to an internal buffer (just don't give that buffer to code that will modify it).
I'm not sure why you would be constrained to using C strings and still have an environment that runs C++ code, but if you really don't want the overhead of conversion, then don't convert. Just write routines that operate on the C strings.
Another reason for converting to C++ style strings is bounds safety.
"... because it is a better C."
Baloney. C++ is a vastly inferior dialect of C. The problems it solves are trivial, the problems it brings, much worse than those it solves.