Why did moving a std::string land on a different address? [duplicate] - c++

I want to provide a zero-copy, move-based API. I want to move a string from thread A into thread B. Conceptually, a move should be able to pass the data from instance A into a new instance B with minimal or no copying (only the addresses). So all data, such as the data pointers, would simply be copied into the new instance (constructed via move). So does std::move on std::string guarantee that .c_str() returns the same result on the instance before the move and on the instance created via the move constructor?

No. There's no requirement for std::string to use dynamic allocation or to do anything specific with such an allocation if it has one. In fact, modern implementations usually put short strings into the string object itself and don't allocate anything; then moving is the same as copying.
It's important to keep in mind that std::string is not a container, even though it looks very similar to one. Containers make stronger guarantees with respect to their elements than std::string does.

No, it's not guaranteed.
Guaranteeing it would basically prohibit (for one example) the short string optimization, in which the entire body of a short string is stored in the string object itself, rather than being allocated separately on the heap.
At least for now, I think SSO is regarded as important enough that the committee would be extremely reluctant to prohibit it (but that could change--when the original C++98 standard was written, they went to considerable trouble to allow copy-on-write strings, but they are now prohibited).
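To observe this on a given implementation, a minimal sketch along the following lines could be used. The names MoveResult and move_and_compare are hypothetical, and whatever it reports is a property of the library in use, not something the standard promises.

```cpp
#include <string>
#include <utility>

// Hypothetical helper type: records whether the character buffer kept its
// address across a move construction, plus the moved-to string itself.
struct MoveResult {
    bool same_buffer;      // did c_str() keep its address?
    std::string moved_to;  // the string that received the contents
};

MoveResult move_and_compare(std::string source) {
    const char* before = source.c_str();
    std::string target(std::move(source));
    // Compare before anything else happens to target.
    const bool same = (target.c_str() == before);
    return {same, std::move(target)};
}
```

On implementations that use SSO, move_and_compare("hi").same_buffer is typically false, while for a string of a few hundred characters it is typically true. Only the contents of moved_to are guaranteed.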

No,
but if that is needed, an option is to put the string in std::unique_ptr. Personally I would typically not rely on the c_str() value for more than the local scope.
Example, on request:
#include <iostream>
#include <string>
#include <memory>

int main() {
    std::string ss("hello");
    auto u_str = std::make_unique<std::string>(ss);
    std::cout << u_str->c_str() << std::endl;
    std::cout << *u_str << std::endl;
    return 0;
}
If you don't have make_unique (new in C++14):
auto u_str = std::unique_ptr<std::string>(new std::string(ss));
Or just copy the whole implementation from the proposal by S.T.L.:
Ideone example on how to do that

It is documented here, so you can assume that the c_str() result is stable under some conditions.
However, you cannot assume that c_str() will remain the same after a move.
In practice it will stay the same for long strings, but not for short strings.

Related

Advantages of string_view literals vs string literals in simple scenarios

So I am reading code written for newer versions of CPP and frequently see string_view literals used almost exclusively, even in simple use cases.
For example:
std::cout<<"Hello world"sv<<std::endl;
Is there any particular reason for that? It is obviously not about storage duration, because a string_view only wraps the string literal. Do string_views have lower overhead or something?
Thank you for your time.
The example shown does not introduce any significant gain.
The only case where there might be a difference is when the literal contains an embedded null character:
std::cout << "zerro \0insde"sv << std::endl;
std::cout << "zerro \0insde" << std::endl;
https://godbolt.org/z/54v1a518f
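The difference comes from how the length is determined. As a small sketch (the function names c_string_length and view_length are made up for illustration), the const char* path stops at the first null character, while the sv literal carries the full length of the literal:

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>

using namespace std::string_view_literals;

// Length as seen through a const char*: scanning stops at the first
// null character, so everything after "\0" is ignored.
std::size_t c_string_length() { return std::strlen("zero \0inside"); }

// Length carried by the string_view literal: the whole literal,
// including the embedded null character.
std::size_t view_length() { return "zero \0inside"sv.size(); }
```

Here c_string_length() yields 5 and view_length() yields 12, which is why the two output statements print different text.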
Creating a std::string can be costly as it often involves allocating memory dynamically. When the cost of creating a std::string is a concern, using const char* and length parameters as an alternative may reduce the expense, but it can also make the code less readable and harder to use.
std::string_view, introduced in C++17, is a non-owning, read-only reference to a sequence of characters. Its purpose is to let functions take a read-only reference to anything resembling a std::string without specifying the exact type. The downside of taking a const std::string& in such cases is that a temporary std::string must be constructed whenever the argument is not already a std::string (for example, a string literal).
std::string_view is a lightweight object that holds a pointer to the original characters and their length. Because it doesn't own the memory it points to, it doesn't need to manage that memory, which can make it more efficient than std::string. Copying a std::string_view copies only the pointer and the length, which is cheaper than copying a std::string, so it is usually passed by value.
Additionally, because it doesn't own the memory it points to, the programmer must ensure that the original string remains valid for as long as the std::string_view is in use; this is a correctness burden rather than a runtime cost.
P.S. Use it with caution, as the view does not own the data.
Additionally, a really great article to read: https://quuxplusone.github.io/blog/2021/11/09/pass-string-view-by-value/
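As a rough illustration of the function-parameter use case, here is a sketch with a hypothetical helper named count_spaces. It accepts a literal, a std::string, or part of a string without constructing a new std::string for any of them.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: counts the spaces in any character sequence.
// The view is taken by value; it is just a pointer and a length.
std::size_t count_spaces(std::string_view text) {
    std::size_t n = 0;
    for (char c : text) {
        if (c == ' ') ++n;
    }
    return n;
}
```

A call such as count_spaces("a b c") builds the view directly over the literal, and count_spaces(std::string_view(s).substr(2)) looks at the tail of an existing std::string s without copying its characters. In both cases the viewed characters must outlive the call.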

Why have move semantics?

Let me preface by saying that I have read some of the many questions already asked regarding move semantics. This question is not about how to use move semantics, it is asking what the purpose of it is - if I am not mistaken, I do not see why move semantics is needed.
Background
I was implementing a heavy class, which, for the purposes of this question, looked something like this:
#include <array>

class B { /* ... */ };

class A
{
private:
    std::array<B, 1000> b;
public:
    // ...
};
When it came time to make a move assignment operator, I realized that I could significantly optimize the process by changing the b member to std::array<B, 1000> *b; - then movement could just be a deletion and pointer swap.
This led me to the following thought: now, shouldn't all non-primitive type members be pointers to speed up movement (corrected below [1] [2]) (there is a case to be made for cases where memory should not be dynamically allocated, but in these cases optimizing movement is not an issue since there is no way to do so)?
Here is where I had the following realization - why create a class A which really just houses a pointer b so swapping later is easier when I can simply make a pointer to the entire A class itself. Clearly, if a client expects movement to be significantly faster than copying, the client should be OK with dynamic memory allocation. But in this case, why does the client not just dynamically allocate the whole A class?
The Question
Can't the client already take advantage of pointers to do everything move semantics gives us? If so, then what is the purpose of move semantics?
Move semantics:
std::string f()
{
    std::string s("some long string");
    return s;
}
int main()
{
    // super-fast pointer swap!
    std::string a = f();
    return 0;
}
Pointers:
std::string *f()
{
    std::string *s = new std::string("some long string");
    return s;
}
int main()
{
    // still super-fast pointer swap!
    std::string *a = f();
    delete a;
    return 0;
}
And here's the strong assignment that everyone says is so great:
template<typename T>
T& strong_assign(T *&t1, T *&t2)
{
    delete t1;
    // super-fast pointer swap!
    t1 = t2;
    t2 = nullptr;
    return *t1;
}
#define rvalue_strong_assign(a, b) (auto ___##b = b, strong_assign(a, &___##b))
Fine - the latter in both examples may be considered "bad style" - whatever that means - but is it really worth all the trouble with the double ampersands? If an exception might be thrown before delete a is called, that's still not a real problem - just make a guard or use unique_ptr.
Edit [1] I just realized this wouldn't be necessary with classes such as std::vector which use dynamic memory allocation themselves and have efficient move methods. This just invalidates a thought I had - the question below still stands.
Edit [2] As mentioned in the discussion in the comments and answers below this whole point is pretty much moot. One should use value semantics as much as possible to avoid allocation overhead since the client can always move the whole thing to the heap if needed.
I thoroughly enjoyed all the answers and comments! And I agree with all of them. I just wanted to stick in one more motivation that no one has yet mentioned. This comes from N1377:
Move semantics is mostly about performance optimization: the ability
to move an expensive object from one address in memory to another,
while pilfering resources of the source in order to construct the
target with minimum expense.
Move semantics already exists in the current language and library to a
certain extent:
copy constructor elision in some contexts
auto_ptr "copy"
list::splice
swap on containers
All of these operations involve transferring resources from one object
(location) to another (at least conceptually). What is lacking is
uniform syntax and semantics to enable generic code to move arbitrary
objects (just as generic code today can copy arbitrary objects). There
are several places in the standard library that would greatly benefit
from the ability to move objects instead of copy them (to be discussed
in depth below).
I.e. in generic code such as vector::erase, one needs a single unified syntax to move values to plug the hole left by the erased value. One can't use swap because that would be too expensive when the value_type is int. And one can't use copy assignment as that would be too expensive when value_type is A (the OP's A). Well, one could use copy assignment, after all we did in C++98/03, but it is ridiculously expensive.
shouldn't all non-primitive type members be pointers to speed up movement
This would be horribly expensive when the member type is complex<double>. Might as well color it Java.
Your example gives it away: your code is not exception-safe, and it makes use of the free-store (twice), which can be nontrivial. To use pointers, in many/most situations you have to allocate stuff on the free store, which is much slower than automatic storage, and does not allow for RAII.
They also let you more efficiently represent non-copyable resources, like sockets.
Move semantics aren't strictly necessary, as you can see from the fact that C++ existed for a long while without them. They are simply a better way to represent certain concepts, and an optimization.
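To make the point about non-copyable resources concrete, here is a minimal sketch of a move-only handle. The class name Handle is hypothetical, and the int merely stands in for something like a socket descriptor; a real class would acquire and close an actual resource.

```cpp
#include <utility>

// Hypothetical move-only handle. The int stands in for an OS resource
// such as a socket descriptor; -1 means "owns nothing".
class Handle {
public:
    explicit Handle(int id) : id_(id) {}

    // Copying is disabled: two objects must never own the same resource.
    Handle(const Handle&) = delete;
    Handle& operator=(const Handle&) = delete;

    // Moving transfers ownership and leaves the source empty.
    Handle(Handle&& other) noexcept : id_(std::exchange(other.id_, -1)) {}
    Handle& operator=(Handle&& other) noexcept {
        if (this != &other) {
            release();
            id_ = std::exchange(other.id_, -1);
        }
        return *this;
    }

    ~Handle() { release(); }

    int id() const { return id_; }
    bool owns() const { return id_ != -1; }

private:
    void release() { id_ = -1; }  // a real class would close the resource here
    int id_;
};
```

After Handle b(std::move(a)), b owns the resource and a owns nothing, so exactly one destructor releases it and the question of who is responsible never reaches the client.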
Can't the client already take advantage of pointers to do everything move semantics gives us? If so, then what is the purpose of move semantics?
Your second example gives one very good reason why move semantics is a good thing:
std::string *f()
{
    std::string *s = new std::string("some long string");
    return s;
}
int main()
{
    // still super-fast pointer swap!
    std::string *a = f();
    delete a;
    return 0;
}
Here, the client has to examine the implementation to figure out who is responsible for deleting the pointer. With move semantics, this ownership issue won't even come up.
If an exception might be thrown before delete a is called, that's still not a real problem just make a guard or use unique_ptr.
Again, the ugly ownership issue shows up if you don't use move semantics. By the way, how would you implement unique_ptr without move semantics?
I know about auto_ptr and there are good reasons why it is now deprecated.
is it really worth all the trouble with the double ampersands?
True, it takes some time to get used to it. After you are familiar and comfortable with it, you will be wondering how you could live without move semantics.
Your string example is great. The short string optimization means that short std::strings do not exist in the free store: instead they exist in automatic storage.
The new/delete version means that you force every std::string into the free store. The move version only puts large strings into the free store, and small strings stay (and are possibly copied) in automatic storage.
On top of that, your pointer version lacks exception safety, as it has non-RAII resource handles. Even if you do not use exceptions, naked pointer resource owners basically force single-exit-point control flow to manage cleanup. In addition, naked pointer ownership leads to resource leaks and dangling pointers.
So the naked pointer version is worse in piles of ways.
move semantics means you can treat complex objects as normal values. You move when you do not want duplicate state, and copy otherwise. Nearly normal types that cannot be copied can expose move only (unique_ptr), others can optimize for it (shared_ptr). Data stored in containers, like std::vector, can now include abnormal types because it is move aware. The std::vector of std::vector goes from ridiculously inefficient and hard to use to easy and fast at the stroke of a standard version.
Pointers place the resource management overhead into the clients, while good C++11 classes handle that problem for you. move semantics makes this both easier to maintain, and far less error prone.

Would this optimization in the implementation of std::string be allowed?

I was just thinking about the implementation of std::string::substr. It returns a new std::string object, which seems a bit wasteful to me. Why not return an object that refers to the contents of the original string and can be implicitly assigned to a std::string? A kind of lazy evaluation of the actual copying. Such a class could look something like this:
template <class Ch, class Tr, class A>
class string_ref {
public:
    // not important yet, but *looks* like basic_string's for the most part
private:
    const basic_string<Ch, Tr, A> &s_;
    const size_type pos_;
    const size_type len_;
};
The public interface of this class would mimic all of the read-only operations of a real std::string, so the usage would be seamless. std::string could then have a new constructor which takes a string_ref so the user would never be the wiser. The moment you try to "store" the result, you end up creating a copy, so no real issues with the reference pointing to data and then having it modified behind its back.
The idea being that code like this:
std::string s1 = "hello world";
std::string s2 = "world";
if (s1.substr(6) == s2) {
    std::cout << "match!" << std::endl;
}
would have no more than 2 std::string objects constructed in total. This seems like a useful optimization for code that performs a lot of string manipulations. Of course, this doesn't just apply to std::string, but to any type which can return a subset of its contents.
As far as I know, no implementations do this.
I suppose the core of the question is:
Given a class that can be implicitly converted to a std::string as needed, would it be conforming to the standard for a library writer to change a member function's prototype to return such a type? Or more generally, do the library writers have the leeway to return "proxy objects" instead of regular objects in these types of cases as an optimization?
My gut is that this is not allowed and that the prototypes must match exactly. Given that you cannot overload on return type alone, that would leave no room for library writers to take advantage of these types of situations. Like I said, I think the answer is no, but I figured I'd ask :-).
This idea is copy-on-write, but instead of COW'ing the entire buffer, you keep track of which subset of the buffer is the "real" string. (COW, in its normal form, was (is?) used in some library implementations.)
So you don't need a proxy object or change of interface at all because these details can be made completely internal. Conceptually, you need to keep track of four things: a source buffer, a reference count for the buffer, and the start and end of the string within this buffer.
Anytime an operation modifies the buffer at all, it creates its own copy (from the start and end delimiters), decreases the old buffer's reference count by one, and sets the new buffer's reference count to one. The rest of the reference counting rules are the same: copy and increase count by one, destruct a string and decrease count by one, reach zero and delete, etc.
substr just makes a new string instance, except with the start and end delimiters explicitly specified.
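A rough sketch of that bookkeeping follows, under some loud assumptions: the class name SharedStr is hypothetical, std::shared_ptr supplies the buffer and its reference count, the window is stored as a start and a length rather than start and end, and the string is immutable, so the copy-on-modify path is left out entirely.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical immutable string that shares one buffer between a
// string and its substrings.
class SharedStr {
public:
    explicit SharedStr(std::string text)
        : buf_(std::make_shared<const std::string>(std::move(text))),
          start_(0), len_(buf_->size()) {}

    // Shares the buffer and narrows the window; no characters are copied.
    SharedStr substr(std::size_t pos) const {
        if (pos > len_) pos = len_;
        return SharedStr(buf_, start_ + pos, len_ - pos);
    }

    // Materializes the visible window as an ordinary std::string (copies).
    std::string str() const { return buf_->substr(start_, len_); }

    // Number of SharedStr objects currently sharing the buffer.
    long owners() const { return buf_.use_count(); }

private:
    SharedStr(std::shared_ptr<const std::string> buf,
              std::size_t start, std::size_t len)
        : buf_(std::move(buf)), start_(start), len_(len) {}

    std::shared_ptr<const std::string> buf_;  // source buffer + reference count
    std::size_t start_;                       // start of the window in the buffer
    std::size_t len_;                         // length of the window
};
```

With SharedStr a("hello world"); SharedStr b = a.substr(6);, both objects point at the same buffer and b.str() produces "world". A mutable version would have to copy the window into a fresh buffer before any write.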
This is a quite well-known optimization that is relatively widely used, called copy-on-write or COW. The basic thing is not even to do with substrings, but with something as simple as
s1 = s2;
Now, the problem with this optimization is that for C++ libraries that are supposed to be used on targets supporting multiple threads, the reference count for the string has to be accessed using atomic operations (or worse, protected with a mutex in case the target platform doesn't supply atomic operations). This is expensive enough that in most cases the simple non-COW string implementation is faster.
See GOTW #43-45:
http://www.gotw.ca/gotw/043.htm
http://www.gotw.ca/gotw/044.htm
http://www.gotw.ca/gotw/045.htm
To make matters worse, libraries that have used COW, such as the GNU C++ library, cannot simply revert to the simple implementation since that would break the ABI. (Although, C++0x to the rescue, as that will require an ABI bump anyway! :) )
Since substr returns std::string, there is no way to return a proxy object, and they can't just change the return type or overload on it (for the reasons you mentioned).
They could do this by making string itself capable of being a sub of another string. This would mean a memory penalty for all usages (to hold an extra string and two size_types). Also, every operation would need to check to see if it has the characters or is a proxy. Perhaps this could be done with an implementation pointer -- the problem is, now we're making a general purpose class slower for a possible edge case.
If you need this, the best way is to create another class, substring, that constructs from a string, pos, and length, and converts to string. You can't use it as s1.substr(6), but you can do
substring sub(s1, 6);
You would also need to create common operations that take a substring and string to avoid the conversion (since that's the whole point).
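Since C++17, std::string_view can play the role of such a substring class, comparison operators included. A small sketch, assuming C++17 is available (the function name tail_equals is made up for illustration):

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Compares the tail of `text` starting at `pos` with `expected`
// without constructing a temporary std::string for the tail.
bool tail_equals(const std::string& text, std::size_t pos,
                 std::string_view expected) {
    if (pos > text.size()) return false;  // string_view::substr would throw
    return std::string_view(text).substr(pos) == expected;
}
```

For the example in the question, tail_equals(s1, 6, s2) does the comparison with only the two original std::string objects in existence.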
Regarding your specific example, this worked for me:
if (&s1[6] == s2) {
    std::cout << "match!" << std::endl;
}
That may not answer your question for a general-purpose solution. For that, you'd need sub-string CoW, as @GMan suggests.
What you are talking about is (or was) one of the core features of Java's java.lang.String class (http://fishbowl.pastiche.org/2005/04/27/the_string_memory_gotcha/). In many ways, the designs of Java's String class and C++'s basic_string template are similar, so I would imagine that writing an implementation of the basic_string template utilizing this "substring optimization" is possible.
One thing that you will need to consider is how to write the implementation of the c_str() const member. Depending on the location of a string as a substring of another, it may have to create a new copy. It definitely would have to create a new copy of the internal array if the string for which the c_str was requested is not a trailing substring. I think that this necessitates using the mutable keyword on most, if not all, of the data members of the basic_string implementation, greatly complicating the implementation of other const methods because the compiler is no longer able to assist the programmer with const correctness.
EDIT: Actually, to accommodate c_str() const and data() const, you could use a single mutable field of type const charT*. Initially set to NULL, it could be per-instance, initialized to a pointer to a new charT array whenever c_str() const or data() const are called, and deleted in the basic_string destructor if non-NULL.
If and only if you really need more performance than std::string provides then go ahead and write something that works the way you need it to. I have worked with variants of strings before.
My own preference is to use non-mutable strings rather than copy-on-write, and to use boost::shared_ptr or equivalent but only when the string is actually beyond 16 in length, so the string class also has a private buffer for short strings.
This does mean that the string class might carry a bit of weight.
I also have in my collection list a "slice" class that can look at a "subset" of a class that lives elsewhere as long as the lifetime of the original object is intact. So in your case I could slice the string to see a substring. Of course it would not be null-terminated, nor is there any way of making it such without copying it. And it is not a string class.

Lifecycle of objects passed by reference to STL containers

I'm an experienced coder, but am still relatively new to the STL, and have just come across this problem:
As far as I'm aware, STL containers aren't meant to copy the objects which they contain, or otherwise affect their lifecycles, yet experimentally I'm seeing different results.
In particular, string classes, which are meant to zero out the first character of their underlying storage upon destruction, are still accessible if they are stored in a container before they go out of scope. For instance, consider the following example:
#include <cstdio>
#include <iostream>
#include <queue>
#include <sstream>
#include <string>

using namespace std;

queue<string> strQueue;

const char *genStr(int i)
{
    ostringstream os;
    os << "The number i is " << i;
    strQueue.push(os.str());
    return strQueue.back().data();
}

void useStr()
{
    while (!strQueue.empty())
    {
        cout << strQueue.front() << endl;
        strQueue.pop();
    }
}

int main(int argc, char **argv)
{
    for (int i = 0; i < 40; i++)
    {
        printf("Retval is: %s\n", genStr(i));
    }
    useStr();
    return 0;
}
As the strings go out of scope when genStr() exits, I would expect the printf to just output "Retval is: ", or at the very least for the call to useStr() to give undefined results, as the memory was stomped on by the repeated allocations from the extra calls, yet both return the appropriate stored strings, without fail.
I'd like to know why this happens, but in lieu of that, I'd be happy just to know whether I can rely on this effect happening with any old object.
Thanks
As far as I'm aware, STL containers aren't meant to copy the objects which they contain
Okay, let's stop right there. STL containers do copy their contents, frequently. They copy them when they're inserted, they copy them when the container is resized either automatically or explicitly, and they copy them when the container itself is copied. Lots and lots of copying.
I'm not sure where you got the idea that STL containers don't copy their contents. The only thing that I can think of that's even close is that if you insert a pointer into an STL container, it will copy the pointer itself but not the pointed-to data.
Also, there are no references involved in your code whatsoever, so I'm puzzled as to what the title of this question refers to.
STL containers aren't meant to copy the objects which they contain
The STL is all about making copies. It will make them when you insert objects, and will sometimes make them if the underlying storage gets resized. You may get broken code if the object you are copying becomes invalidated when your function goes out of scope (for example if you add a pointer to a local variable, rather than copying the local variable).
In your case, you aren't copying a reference to a string, you're copying a string. This copied string then exists in the scope of strQueue, so the behavior you are seeing is completely valid and reliable.
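A small sketch of that behavior (the function name stored_after_local_changes is hypothetical): the queue keeps its own copy, so changing or destroying the original afterwards has no effect on the stored element.

```cpp
#include <queue>
#include <string>

// Pushes a copy of a local string into a queue, then changes the local.
// Returns the element as it is stored in the queue.
std::string stored_after_local_changes() {
    std::queue<std::string> q;
    std::string local = "The number i is 1";
    q.push(local);      // the queue stores its own copy
    local = "changed";  // does not affect the copy inside the queue
    return q.front();
}
```

The returned string is still "The number i is 1", which mirrors what happens in genStr(): the temporary from os.str() goes away, but the copy held by strQueue lives on until it is popped.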
Here is another misunderstanding to clear up:
In particular, string classes, which are meant to zero out the first character of their underlying storage upon destruction
C++ doesn't tend to ever do that sort of thing. It would be a hidden cost, and C++ hates hidden costs :) The string destructor won't touch the memory because once the destructor has exited, the object no longer exists. Accessing it is undefined behavior, so the C++ implementation will do whatever is fastest and least wasteful in well defined code.
All the "STL" (I hate that term) collections store copies of the objects passed to them, so the lifetime of the object in the collection is completely independent of the original object. Under normal circumstances, the collection's copy of an object will remain valid until you erase it from the collection or destroy the collection.
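What goes into the container is a copy of the object and not the actual object. Similarly, when you copy an element back out (for example, string s = strQueue.front();), you get another copy. You can access the stored objects for as long as they remain in the container and the container itself is alive.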