Difference between std::string(size, '\0') and s.resize(size)? - c++

Unlike std::vector, std::string does not provide a unary constructor that takes a size:
std::string s(size); // ERROR
Is there any difference between:
std::string s(size, '\0');
and
std::string s;
s.resize(size);
in terms of their performance on common implementations?
Will resize initialize the string to all zero characters or will it leave them an unspecified value?
If all zero, is there any way to construct a string of a given size, but leave the characters with an unspecified value?

There is a difference: with std::string s(size, '\0');, all of the memory needed for the string can be allocated at once. With the second example, if size is greater than the number of characters that fit in the small string optimization buffer, an extra allocation may have to be performed after the default constructor has already run. Whether that happens is implementation-defined, but in a standard-compliant C++17 implementation the second form will not need fewer allocations than the first. The first example is also more concise, and may be more performant, so it is probably preferable. When calling s.resize(size);, all new characters are value-initialized, i.e. set to char(), which is '\0'. Before C++23 there is no standard way to create a string of a given size with unspecified contents; C++23 adds std::string::resize_and_overwrite for that use case.
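As a minimal sketch of the two forms compared above: the helper names make_with_ctor and make_with_resize are hypothetical and only exist for illustration. Both should produce a string of the requested size whose characters are all '\0'.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: the fill constructor sizes and fills in one step.
std::string make_with_ctor(std::size_t n)
{
    return std::string(n, '\0');
}

// Hypothetical helper: default-construct first, then resize.
// resize(n) appends value-initialized characters, i.e. char() == '\0'.
std::string make_with_resize(std::size_t n)
{
    std::string s;
    s.resize(n);
    assert(s.size() == n);
    return s;
}
```

The two results compare equal; any difference between the forms is in how the implementation allocates, not in the resulting contents.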

The actual answer would be implementation-based, but I'm fairly sure that std::string s(size, '\0'); is faster.
std::string s;
s.resize(size);
According to the documentation for std::string.
1) Default constructor. Constructs empty string (zero size and unspecified capacity).
The default constructor will create a string with an "unspecified capacity". My sense here is that the implementation is free to determine a default capacity, probably in the realm of 10-15 characters (total speculation).
Then in the next line, you will reallocate the memory (resize) with the new size if the size is greater than the current capacity. This is probably not what you want!
If you really want to find out definitively, you can run a profiler on the two methods.

There is already a good answer from DeepCoder.
For the record, however, I'd like to point out that for strings (as for vectors) there are two distinct notions:
the size(): it's the number of actual (i.e. meaningful) characters in the string. You can change it using resize() (to which you can provide a second parameter to say what char you want to use as filler if it should be something other than '\0')
the capacity(): it's the number of characters allocated to the string. It's at least the size but can be more. You can increase it with reserve()
If you're worried about allocation performance, I believe it's better to play with the capacity. The size should really be kept for real chars in the string not for padding chars.
By the way, more generally, s.resize(n) is the same as s.resize(n, char()). So if you'd like to fill it in the same way at construction, you could consider string s(n, char()). But as long as you don't use basic_string<T> with T being something other than a character type, your '\0' does the trick.
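A short sketch of the size/capacity distinction described above. The struct SizeAndCapacity and the helpers after_reserve and padded are hypothetical names introduced here for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical pair used to report both notions at once.
struct SizeAndCapacity
{
    std::size_t size;
    std::size_t capacity;
};

// reserve() only affects the capacity; the string stays empty.
SizeAndCapacity after_reserve(std::size_t n)
{
    std::string s;
    s.reserve(n);
    return {s.size(), s.capacity()};
}

// resize() changes the size; the optional second argument picks the filler.
std::string padded(std::size_t n, char filler)
{
    std::string s;
    s.resize(n, filler);
    assert(s.size() == n);
    return s;
}
```

After reserve(n) the size is still 0 and the capacity is at least n, while padded(3, 'x') yields "xxx" with size 3.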

Resize does not leave elements uninitialized. According to the documentation: http://en.cppreference.com/w/cpp/string/basic_string/resize
s.resize(size) will value-initialize each appended character. That will cause each element of the resized string to be initialized to '\0'.
You would have to measure the performance difference of your specific C++ implementation to really decide if there's a worthwhile difference or not.
After looking at the machine code generated by Visual C++ for an optimized build, I can tell you the amount of code for either version is similar. What seems counterintuitive is that the resize() version measured faster for me. Still, you should check your own compiler and standard library.

Related

Are two heap allocations more expensive than a call to std::string fill ctor?

I want to have a string with a capacity of 131 chars (or bytes). I know two simple ways of achieving that. So which of these two code blocks is faster and more efficient?
std::string tempMsg( 131, '\0' ); // constructs the string with a 131 byte buffer from the start
tempMsg.clear( ); // clears those '\0' chars to free space for the actual data
tempMsg += "/* some string literals that are appended */";
or this one:
std::string tempMsg; // default constructs the string with a 16 byte buffer
tempMsg.reserve( 131 ); // reallocates the string to increase the buffer size to 131 bytes??
tempMsg += "/* some string literals that are appended */";
I guess the first approach only uses 1 allocation and then sets all those 131 bytes to 0 ('\0') and then clears the string (std::string::clear is generally constant according to: https://www.cplusplus.com/reference/string/string/clear/).
The second approach uses 2 allocations but on the other hand, it doesn't have to set anything to '\0'. But I've also heard about compilers allocating 16 bytes on the stack for a string object for optimization purposes. So the 2nd method might use only 1 heap allocation as well.
So is the first method faster than the other one? Or are there any other better methods?
The most accurate answer is that it depends. The most probable answer is the second being faster or as fast. Calling the fill ctor requires not only a heap allocation but a fill (typically translates to a memset in my experience).
clear usually won't do anything with a POD char besides setting a first pointer or size integer to zero because char is a trivially-destructible type. There's no loop involved with clear usually unless you create std::basic_string with a non-trivial UDT. It's constant-time otherwise and dirt-cheap in practically every standard library implementation.
Edit: An Important Note:
I never encountered a standard lib implementation that does this or it has slipped my memory (very possible as I think I'm turning senile), but there is something very important that Viktor Sehl pointed out to me that I was very ignorant about in the comments:
Please note that std::string::clear() on some implementations free the allocated memory (if there are any), unlike a std::vector.
That would actually make your first version involve two heap allocations. But the second should still only be one (opposite of what you thought).
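To check what clear() does on a particular standard library, a small probe like the following could be used. ClearReport and fill_then_clear are hypothetical names; the only thing the standard guarantees here is that the size is 0 afterwards, so the capacity after clear() is something to observe, not to rely on.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical report type for this probe.
struct ClearReport
{
    std::size_t size_after;
    std::size_t capacity_before;
    std::size_t capacity_after;
};

// Fill-construct a string, clear it, and record what happened to the buffer.
ClearReport fill_then_clear(std::size_t n)
{
    std::string s(n, '\0');
    const std::size_t before = s.capacity();
    s.clear();
    assert(s.empty());
    return {s.size(), before, s.capacity()};
}
```

If capacity_after is smaller than capacity_before on your implementation, the first version from the question pays for a second allocation when the real data is appended.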
Resumed:
But I've also heard about compilers allocating 16 bytes on the stack for a string object for optimization purposes. So the 2nd method might use only 1 heap allocation as well.
Small Buffer Optimizations
The first allocation is a small-buffer stack optimization for implementations that use it (technically not always stack, but it'll avoid additional heap allocations). It's not separately heap-allocated and you can't avoid it with a fill ctor (the fill ctor will still allocate the small buffer). What you can avoid is filling the entire array with '\0' before you fill it with what you actually want, and that's why the second version is likely faster (marginally or not depending on how many times you invoke it from a loop). That's needless overhead unless the optimizer eliminates it for you, and it's unlikely in my experience that optimizers will do that in loopy cases that can't be optimized with something like SSA.
I just pitched in here because your second version is also clearer in intent than filling a string with something as an attempted optimization (in this case a very possibly misguided one if you ask me) only to throw it out and replace it with what you actually want. The second is at least clearer in intent and almost certainly as fast or faster in most implementations.
On Profiling
I would always suggest measuring though if in doubt, and especially before you start attempting funny things like in your first example. I can't recommend the profiler enough if you're working in performance-critical fields. The profiler will not only answer this question for you but it'll also teach you to refrain from writing such counter-intuitive code like in the first example except in places where it makes a real positive difference (in this case I think the difference is actually negative or neutral). From my perspective, the use of both profiler and debugging should be something ideally taught in CS 101. The profiler helps mitigate the dangerous tendency for people to optimize the wrong things very counter-productively. They tend to be very easy to use; you just run them, make your code perform the expensive operation you want to optimize, and you get back a breakdown of where the time went.
If the small buffer optimization confuses you a bit, a simple illustration is like this:
struct SomeString
{
    // Pre-allocates (always) some memory in advance to avoid additional
    // heap allocs.
    char small_buffer[some_small_fixed_size] = {};
    // Will point to small buffer until string gets large.
    char* ptr = small_buffer;
};
The allocation of the small buffer is unavoidable, but it doesn't require separate calls to malloc/new/new[]. And it's not allocated separately on the heap from the string object itself (if it is allocated on heap). So both of the examples that you showed involve, at most, a single heap allocation (unless your standard library implementation is FUBAR -- edit: or one that Viktor is using). What the first example has conceptually on top of that is a fill/loop (could be implemented as a very efficient intrinsic in assembly but loopy/linear time stuff nevertheless) unless the optimizer eliminates it.
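One way to see the small buffer in action is to check whether a string's characters live inside the string object itself. The function stored_inline below is a hypothetical helper, and comparing addresses through std::uintptr_t is an assumption that holds on mainstream platforms, not something the standard promises.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical probe: true if s.data() points into the bytes of the
// std::string object itself, i.e. the small buffer is in use.
bool stored_inline(const std::string& s)
{
    const std::uintptr_t obj_begin = reinterpret_cast<std::uintptr_t>(&s);
    const std::uintptr_t obj_end = obj_begin + sizeof(std::string);
    const std::uintptr_t data = reinterpret_cast<std::uintptr_t>(s.data());
    return data >= obj_begin && data < obj_end;
}
```

A 1000-character string cannot fit inside the object, so stored_inline returns false for it; for a short string such as "hi" the answer depends on whether the implementation uses a small buffer.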
String Optimization
So is the first method faster than the other one? Or are there any other better methods?
You can write your own string type which uses an SBO with, say, 256 bytes for the small buffer which is typically going to be much larger than any std::string optimization. Then you can avoid heap allocations entirely for your 131-length case.
#include <cstddef>

template <class Char, std::size_t SboSize=256>
class TempString
{
private:
    // Stores the small buffer.
    Char sbo[SboSize] = {};
    // Points to the small buffer until num > SboSize.
    Char* ptr = sbo;
    // Stores the length of the string.
    std::size_t num = 0;
    // Stores the capacity of the string.
    std::size_t cap = SboSize;
public:
    // Destroys the string.
    ~TempString()
    {
        if (ptr != sbo)
            delete[] ptr;
    }
    // Remaining implementation left to reader. Note that implementing
    // swap requires swapping the contents of the SBO if the strings
    // point to them rather than swapping pointers (swapping is a
    // little bit tricky with SBOs involved, so be wary of that).
};
That would be ill-suited for persistent storage though because it would blow up memory use (ex: requiring 256+ bytes just to store a string with one character in it) if you stored a bunch of strings persistently in a container. It's well-suited for temporary strings that you transfer into and out of function calls. I'm primarily a gamedev so rolling our own alternatives to the standard C++ library is quite normal here given our requirements for real-time feedback with high graphical fidelity. I wouldn't recommend it for the faint-hearted though, and definitely not without a profiler. This is a very practical and viable option in my field although it might be ridiculous in yours. The standard lib is excellent but it's tailored for the needs of the entire world. You can usually beat it if you can tailor your code very specifically to your needs and produce more narrowly-applicable code.
Actually, even std::string with SBOs is rather ill-suited for persistent storage anyway and not just TempString above because if you store like std::unordered_map<std::string, T> and std::string uses a 16-byte SBO inflating sizeof(std::string) to 32 bytes or more, then your keys will require 32 bytes even if they just store one character fitting only two strings or less in a single cache line on traversal of the hash table. That's a downside to using SBOs. They can blow up your memory use for persistent storage that's part of your application state. But they're excellent for temporaries whose memory is just pushed and popped to/from stack in a LIFO alloc/dealloc pattern which only requires incrementing and decrementing a stack pointer.
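For readers who want to see a little more of what "remaining implementation left to reader" might involve, here is one possible sketch of the append path for a char-only small-buffer string. TempBuffer and all of its members are hypothetical; copying is simply disabled to sidestep the pointer-into-self problem mentioned above, and the doubling growth policy is an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

template <std::size_t SboSize = 256>
class TempBuffer
{
    static_assert(SboSize > 0, "the small buffer must hold at least one char");
public:
    TempBuffer() = default;
    // Copying would leave ptr aimed at another object's buffer, so forbid it.
    TempBuffer(const TempBuffer&) = delete;
    TempBuffer& operator=(const TempBuffer&) = delete;
    ~TempBuffer() { if (ptr != sbo) delete[] ptr; }

    // Appends len chars, moving to the heap only when the small buffer is full.
    void append(const char* text, std::size_t len)
    {
        if (len == 0) return;
        assert(text != nullptr);
        if (num + len > cap) grow(num + len);
        std::memcpy(ptr + num, text, len);
        num += len;
    }
    std::size_t size() const { return num; }
    bool on_heap() const { return ptr != sbo; }
    std::string str() const { return std::string(ptr, num); }
private:
    // Doubles the capacity until it covers `needed`, then copies the contents.
    void grow(std::size_t needed)
    {
        std::size_t new_cap = cap * 2;
        while (new_cap < needed) new_cap *= 2;
        char* bigger = new char[new_cap];
        std::memcpy(bigger, ptr, num);
        if (ptr != sbo) delete[] ptr;
        ptr = bigger;
        cap = new_cap;
    }
    char sbo[SboSize] = {};
    char* ptr = sbo;
    std::size_t num = 0;
    std::size_t cap = SboSize;
};
```

With TempBuffer<8>, appending "abc" stays in the small buffer, and appending nine more characters forces the single move to the heap. With the default of 256, the 131-character case from the question never touches the heap.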
If you want to optimize the storage of many strings though from a memory standpoint, then it depends a lot on your access patterns and needs. However, a fairly simple solution is like so if you want to just build a dictionary and don't need to erase specific strings dynamically:
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Just using a struct for simplicity of illustration:
struct MyStrings
{
    // Stores all the characters for all the null-terminated strings.
    std::vector<char> buffer;
    // Stores the starting index into the buffer for the nth string.
    std::vector<std::size_t> string_start;
    // Inserts a null-terminated string to the buffer.
    void insert(const std::string_view str)
    {
        string_start.push_back(buffer.size());
        buffer.insert(buffer.end(), str.begin(), str.end());
        buffer.push_back('\0');
    }
    // Returns the nth null-terminated string.
    std::string_view operator[](std::int32_t n) const
    {
        return {buffer.data() + string_start[n]};
    }
};
Another common solution that can be very useful if you store a lot of duplicate strings in an associative container or need fast searches for strings that can be looked up in advance is to use string interning. The above solution can also be combined to implement an efficient way to store all the interned strings. Then you can store lightweight indices or pointers to your interned strings and compare them immediately for equality, e.g., without involving any loops, and store many duplicate references to strings that only cost the size of an integer or pointer.
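A minimal sketch of string interning as described above. The class name Interner and its members are hypothetical, and for brevity it keeps each distinct string twice (once as a map key, once in the vector); a more careful version could store the characters in a single buffer like MyStrings.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

class Interner
{
public:
    using Id = std::uint32_t;

    // Returns the existing id for s, or assigns the next free one.
    Id intern(const std::string& s)
    {
        const auto it = ids.find(s);
        if (it != ids.end())
            return it->second;
        const Id id = static_cast<Id>(names.size());
        names.push_back(s);
        ids.emplace(s, id);
        return id;
    }

    // Maps an id back to its string; throws std::out_of_range for unknown ids.
    const std::string& name(Id id) const
    {
        return names.at(id);
    }
private:
    std::unordered_map<std::string, Id> ids;
    std::vector<std::string> names;
};
```

Two interned strings are equal exactly when their ids are equal, so after the initial lookup every comparison is a single integer compare.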

Construct std::string from up to X characters, stopping at null char

I am reading strings from a structure in a file where each string has a fixed length, with '\0' padding. They are not zero-terminated if the stored string needs the whole length.
I'm currently constructing std::strings out of those like this:
// char MyString[1000];
std::string stdmystring(MyString, ARRAYSIZE(MyString));
However, this copies the padding, too. I could trim the string now, but is there an elegant and quick way to prevent the copying in the first place?
Speed is more important than space, because this runs in a loop.
Simple solutions are:
1. Just calculate the correct length first:
   - either use strnlen as Dieter suggested
   - or std::find(MyString, MyString+ARRAYSIZE(MyString), '\0'), which IME isn't any slower
   - note that if your string fits in cache, that will likely dominate the extra loop cost
2. Reserve the max string size (you did say space was less important), and write a loop appending characters until you exhaust the width or hit a nul (like copy_until)
3. Actually create a max-size string initialized with nuls, strncpy into it, and optionally erase unused nuls if you want the size to be correct
The second option uses only a single loop, while the third notionally uses two (one in the string ctor, and then one in the copy). However, the push_back of each character seems more expensive than the simple character assignment, so I wouldn't be surprised if #3 were faster in reality. Profile and see!
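The first option (find the length, then construct) can be sketched like this. The function name from_padded is hypothetical; it uses std::find rather than strnlen because strnlen is a POSIX function, not part of standard C++.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Builds a std::string from a fixed-width field that is padded with '\0'
// and is NOT terminated when the text fills the whole field.
std::string from_padded(const char* field, std::size_t width)
{
    assert(field != nullptr);
    // Points at the first '\0', or at field + width if there is none.
    const char* end = std::find(field, field + width, '\0');
    return std::string(field, end);
}
```

Only the characters before the first '\0' are copied, so the padding never reaches the std::string and no trimming is needed afterwards.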
Well, if size is not a problem, one potential way to do it is to create an empty std::string, then use reserve() to pre-allocate the space potentially needed, and then add each char until you come across '\0'.
std::string stdmystring;
stdmystring.reserve(MyString_MAX_SIZE);
for (size_t i = 0; i < MyString_MAX_SIZE && MyString[i] != '\0'; ++i)
    stdmystring += MyString[i];
reserve() guarantees you a single memory allocation, since you know the max size and the string will never get larger than that.
The calls to the += operator function will probably be inlined, but each one still has to check that the string has the needed capacity, which is wasteful in your case. In fact, this could be the same or worse than simply using strnlen to find the exact length of the string first, so you have to test it.
I think the most straightforward way is to overallocate your internal MyString array by one byte, always null terminate that final byte, and use the C-string constructor of std::string. (Keep in mind that most likely your process will be I/O bound on the file so whatever algorithm the C-string constructor uses should be fine).

Does the compiler copy a std::string into stack while passing it to a function in C++?

I have a simple question. I have a long std::string that I want to pass to a function.
I want to know whether this string will be copied onto the stack and the copy passed, or whether something like a pointer will be passed so that no additional space is required.
(C++)
I have another little question: how much memory does an element of a string take? Just like char?
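Yes, if you pass it by value it will be deep copied, so passing it by const reference is recommended.
void fun(const std::string & arg);
Typically std::string has at least 2 fields, a pointer to dynamically allocated memory and the length, so it takes at least 16 bytes plus the actual length on 64-bit machines. (Common implementations also store a capacity or a small-string buffer, which makes sizeof(std::string) 24 or 32 bytes.)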
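The difference between the two parameter styles can be made visible by comparing object addresses. Both function names below are hypothetical and exist only for this sketch.

```cpp
#include <string>

// By value: arg is a separate std::string object, copy-constructed from
// the caller's string when an lvalue is passed.
bool same_object_by_value(std::string arg, const std::string* original)
{
    return &arg == original;
}

// By const reference: arg is just another name for the caller's string.
bool same_object_by_ref(const std::string& arg, const std::string* original)
{
    return &arg == original;
}
```

Calling same_object_by_value(s, &s) returns false because a new string object was created for the call, while same_object_by_ref(s, &s) returns true because nothing was copied.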
Spoiler alert: my answer won't be that relevant, it's just an optimization technique.
If you don't want to duplicate the string, write your own customized string class, which has two pointers or one pointer with a size. In the past it has saved me a lot of duplicates. This only works as a read-only view combined with copy-on-write, i.e. it duplicates only when you encounter a write.
When passing an argument by value in C++ it is conceptually copied. Whether this copy really happens is another question, though, and depends on how the argument is passed and, to some extent, on the compiler: the compiler is explicitly allowed to elide certain copies, in particular copies of temporary objects. For example, when you return an object from a function and it is clear that the object will be returned, the copy is likely to be elided. Similarly, when passing the result of a function directly on to another function, it is likely not to be copied.
Beyond this, C++ 2011 added another dimension of possibilities by supporting move constructors. These cover to some extent similar ground but also allow you to have better control: you can explicitly indicate that it would be acceptable for an object to be moved rather than being copied. Still, in no event will an object be passed by reference unless the parameter is declared as a reference.
With respect to the used bytes per element, the std::string uses just sizeof(cT) bytes (where cT is the character template argument of the std::basic_string). However, the string will overallocate the space in many cases and certainly when characters are added to the string. You can determine the overallocation by comparing size() and capacity() and control it to some extent with reserve(), although this function isn't required to get rid of any overallocation; the capacity() just has to be at least as much as was last reserve()d. If the string is small (e.g. at most 15 characters) modern implementations won't make any allocation. This is called the small string optimization.
With respect to the actual representation of the string: unless it is small it will use one word for the address of the storage, one word each for the size and the capacity, and for strings with stateful allocators the size of the allocator (typically another word). Given alignment requirements this effectively means that in most cases the string will take four words in addition to the elements. Typically the small string optimization uses these words to store characters if the string fits there unless, of course, it needs to store a stateful allocator.
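A small sketch of the three ways a by-value parameter can be fed, as discussed above. All function names are hypothetical. The comments describe what the language permits or requires; whether a particular move is cheaper than a copy depends on the implementation and the string length.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Takes its argument by value: the caller decides how the parameter is built.
std::size_t consume(std::string s)
{
    return s.size();
}

// Passing an lvalue copy-constructs the parameter; src stays intact.
std::size_t call_with_copy(const std::string& src)
{
    return consume(src);
}

// std::move says "the callee may steal this"; the parameter is move-constructed.
std::size_t call_with_move(std::string src)
{
    return consume(std::move(src));
}

// A temporary can be built directly as the parameter, with no copy at all
// (permitted before C++17, guaranteed since).
std::size_t call_with_temporary(std::size_t n)
{
    return consume(std::string(n, 'x'));
}
```

All three calls hand the same characters to consume; they differ only in how much work is done to construct the parameter.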
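The numbers mentioned above are easy to inspect on a given implementation. Footprint and string_footprint are hypothetical names for this sketch, and the values they report are implementation-specific apart from the one-byte element size of std::string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical report of how much space a std::string takes.
struct Footprint
{
    std::size_t object_bytes;      // sizeof the string object itself
    std::size_t empty_capacity;    // capacity of a default-constructed string
    std::size_t bytes_per_element; // sizeof(char) for std::string
};

Footprint string_footprint()
{
    const std::string s;
    return {sizeof(std::string), s.capacity(), sizeof(std::string::value_type)};
}
```

On an implementation with the small string optimization, empty_capacity is the number of characters that fit without any allocation, and object_bytes is typically three or four machine words.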

Default capacity of std::string?

When I create a std::string using the default constructor, is ANY memory allocated on the heap? I'm hoping the answer does not depend on the implementation and is standardized. Consider the following:
std::string myString;
Unfortunately, the answer is no according to N3290.
Table 63 Page 643 says:
data()      a non-null pointer that is copyable and can have 0 added to it
size()      0
capacity()  an unspecified value
The table is identical for C++03.
No, but I don't know of any implementation that does allocate memory on the heap by default. Quite a few do, however, include what's called the short string optimization (SSO), where they allocate some space as part of the string object itself, so as long as you don't need more than that length (seems to be between 10 and 20 characters as a rule) it can avoid doing a separate heap allocation at all.
That's not standardized either though.
It is implementation dependent. Some string implementations use a small amount of automatically allocated storage for small strings, and then dynamically allocate more for larger strings.
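Since the standard leaves this open, one way to find out what a given implementation does is to count allocator calls. The sketch below uses a hypothetical CountingAllocator with std::basic_string; the counter, the alias CountedString and the helper functions are all names invented for this illustration.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Global counter bumped on every allocate() call (illustration only).
static std::size_t g_allocations = 0;

// Minimal allocator that forwards to std::allocator and counts requests.
template <class T>
struct CountingAllocator
{
    using value_type = T;
    CountingAllocator() noexcept = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}
    T* allocate(std::size_t n)
    {
        ++g_allocations;
        return std::allocator<T>().allocate(n);
    }
    void deallocate(T* p, std::size_t n) noexcept
    {
        std::allocator<T>().deallocate(p, n);
    }
};

template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return false; }

using CountedString =
    std::basic_string<char, std::char_traits<char>, CountingAllocator<char>>;

// Number of allocator calls made by default construction.
std::size_t allocations_for_default()
{
    const std::size_t before = g_allocations;
    CountedString s;
    return g_allocations - before;
}

// Number of allocator calls made by the fill constructor.
std::size_t allocations_for_fill(std::size_t n)
{
    const std::size_t before = g_allocations;
    CountedString s(n, '\0');
    return g_allocations - before;
}
```

If allocations_for_default() returns 0, the default constructor did not touch the heap on that implementation; allocations_for_fill(1000) has to report at least one allocation because 1000 characters cannot fit in any small buffer.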
It depends on the compiler. Take a look here, there is a good explanation:
http://www.learncpp.com/cpp-tutorial/17-3-stdstring-length-and-capacity/
Generally, yes, they allocate memory on the heap. I'll give an example: c_str() requires a trailing NUL character '\0'. Most implementations allocate this NUL ahead of time, as part of the string. So you'll get at least one byte allocated, often more.
If you really need specific behavior I'd advise writing your own class. Buffer/string classes are not that hard to write.

Uses for the capacity value of a string

In the C++ Standard Library, std::string has a public member function capacity() which returns the size of the internal allocated storage, a value greater than or equal to the number of characters in the string (according to here). What can this value be used for? Does it have something to do with custom allocators?
You are more likely to use the reserve() member function, which sets the capacity to at least the supplied value.
The capacity() member function itself might be used to avoid allocating memory. For instance, you could recycle used strings through a pool, and put each one in a different size bucket based on its capacity. A client of the pool could then ask for a string that already has some minimum capacity.
The string::capacity() function returns the total number of characters the std::string may contain before it has to reallocate memory, which is quite an expensive operation.
A std::vector works in the same way, so I suggest you look up std::vector for a detailed explanation of the difference between allocated and initialized memory.
Update
Hmm, I can see I misread the question. Actually, I think I've never used capacity() myself on either std::string or std::vector; there seldom seems to be any reason to, as you have to call reserve anyway.
It gives you the number of characters the string could contain without having to re-allocate. I suppose this might be important in a situation where allocation was expensive, and you wanted to avoid it, but I must say this is one string member function I've never used in real code.
It could be used for some performance tuning if you are about to add a lot of characters to the string. Before starting the string manipulation, you can check for the capacity and if it is too small, reserve the desired length in a single step (instead of letting it reallocate successively bigger chunks of memory several times, which would be a performance hog).
Strings have a capacity and a size. The capacity indicates how many characters the string can hold before it has to allocate more memory. The size indicates how many characters it currently holds. reserve() can be used to set the minimum capacity of the string (it will allocate memory for at least that number of characters but could allocate more).
This is primarily of importance if you're increasing the size of the string. When you concatenate onto the string with += or append(), the characters from the given string will be added to the end of the current one. If increasing the string to that size does not exceed the capacity, then it's just going to use the capacity that it has. However, if the new size would exceed the current capacity, then the string will have to reallocate memory internally and copy its internals into the new memory. If you're going to be doing that a lot, it can get expensive (though it is done in amortized constant time), so in such a case, you could use reserve() to preallocate enough memory to reduce how often reallocations have to take place.
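The effect of reserve() on repeated appends can be observed by counting how often capacity() changes. The function capacity_changes is a hypothetical helper; the exact counts depend on the implementation's growth policy.

```cpp
#include <cstddef>
#include <string>

// Appends n characters one at a time and counts how often the capacity
// changes, with or without an up-front reserve(n).
std::size_t capacity_changes(std::size_t n, bool reserve_first)
{
    std::string s;
    if (reserve_first)
        s.reserve(n);
    std::size_t changes = 0;
    std::size_t last = s.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        s.push_back('x');
        if (s.capacity() != last) {
            ++changes;
            last = s.capacity();
        }
    }
    return changes;
}
```

Without the reserve, appending 10000 characters forces the string to grow at least once and usually several times; with it, the single allocation happens up front and the loop normally sees no capacity change at all.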
vector functions in basically the same way with the same functions.
Personally, while I've dealt with capacity() and reserve() with vector from time to time, I've never seen much need to do so with string - probably because I don't generally do enough string concatenations in my code for it to be worth it. In most cases, a particular string might get a few concatenations but not enough to worry about its capacity. Worrying about capacity is generally something you do when trying to optimize your code.
There's hardly any relevant use. It is similar to std::vector::capacity. However, one of the most common uses of strings is assignment. When assigning to a std::string, its .capacity may change. This means that an implementation has the right to ignore the old capacity and allocate precisely enough memory.
It genuinely isn't very useful, and is probably there only for symmetry with vector (under the assumption that both will operate internally in the same way).
The capacity of a vector is guaranteed to affect the behaviour of a resize. Resizing a vector to a value less than or equal to the capacity will not induce a reallocation, and will not therefore invalidate iterators or pointers referring to elements in the vector. This means you can pre-allocate some storage by calling reserve on a vector, then (with care) add elements to it by resizing or pushing back (etc.), safe in the knowledge that the underlying buffer won't move.
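A small sketch of the vector guarantee just described; buffer_stays_put is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// After reserve(n), adding up to n elements must not reallocate, so the
// address of the underlying buffer stays the same.
bool buffer_stays_put(std::size_t n)
{
    std::vector<int> v;
    v.reserve(n);
    const int* before = v.data();
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(static_cast<int>(i));
    return v.data() == before;
}
```

Pointers and iterators obtained after the reserve remain valid for as long as the size does not exceed the reserved capacity.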
There is no such guarantee for string, though. It seems that the capacity is for informational purposes only -- though even that's a stretch, as it doesn't look like there's any useful information to be taken from it anyway. (Worse yet, before C++11 contiguity of string chars wasn't guaranteed either, so the only way you could get at the string as a linear buffer was c_str() -- which might induce a reallocation. Since C++11 the characters are required to be stored contiguously.)
At a guess, string was presumably originally intended to be implemented as some kind of a special case of vector, but over time the two grew apart...