Why Doesn't string::data() Provide a Mutable char*? - c++

In c++11 array, string, and vector all got the data method which:
Returns pointer to the underlying array serving as element storage. The pointer is such that range [data(); data() + size()) is always a valid range, even if the container is empty.
[Source]
This method is provided in a mutable and const version for all applicable containers, for example:
T* vector<T>::data();
const T* vector<T>::data() const;
All applicable containers, that is, except string which only provides the const version:
const char* string::data() const;
What happened here? Why did string get shortchanged, when char* string::data() would be so helpful?

The short answer is that c++17 does provide the char* string::data() method, which is vital for the free data function also introduced in c++17. To gain mutable access to the underlying C-String I can now do this:
auto foo = "lorem ipsum"s;
for(auto i = data(foo); *i != '\0'; ++i) ++(*i);
For historical purposes it's worth chronicling string's development which c++17 is building upon: In c++11 access to string's underlying buffer is made possible by a new requirement that its elements are stored contiguously such that for any given string s:
&*(s.begin() + n) == &*s.begin() + n for any n in [0, s.size()), or, equivalently, a pointer to s[0] can be passed to functions that expect a pointer to the first element of a CharT[] array.
Mutable access to this newly required underlying C-String was obtainable by various methods, for example: &s.front(), &s[0], or &*s.begin(). But back to the original question, which would avoid the burden of using one of these options: Why hasn't access to string's underlying buffer been provided in the form of char* string::data()?
To answer that it is important to note that T* array<T>::data() and T* vector<T>::data() were an addition required by c++11. No additional requirements were incurred by c++11 against other containers such as deque. And there certainly wasn't an additional requirement for string; in fact the requirement that string was contiguous was new to c++11. Before this, const char* string::data() had existed. Though it explicitly was not guaranteed to be pointing to any underlying buffer, it was the only way to obtain a const char* from a string:
The returned array is not required to be null-terminated.
This means that string was not "shortchanged" in c++11's transition to data accessors; it simply was not included, so only the const data accessor that string previously possessed persisted. There are naturally occurring examples in C++11's implementation which necessitate writing directly to the underlying buffer of a string.

I think this restriction comes from the (pre-2011) days where std::basic_string didn't have to store its internal buffer as a contiguous byte array.
While all the others (std::vector and such) had to store their elements as a contiguous sequence per the 2003 standard; so data could easily return mutable T*, because there was no problem with iterations, etc.
If std::basic_string were to return a mutable char*, that would imply that you can treat that char* as a valid C-string and perform C-string operations like strcpy on it, which would easily turn into undefined behavior were the string not allocated contiguously.
The C++11 standard added the rule that basic_string has to be implemented as a contiguous byte array. Needless to say, you can work around this by using the old trick of &str[0].
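As a minimal sketch of the &str[0] trick, assuming C++11 or later: the helper names upper_in_place and make_upper are hypothetical, and upper_in_place merely stands in for any C-style routine that takes a char* and a length.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical C-style routine: upper-cases `len` characters in place.
void upper_in_place(char* buf, std::size_t len) {
    assert(buf != nullptr);
    for (std::size_t i = 0; i < len; ++i) {
        buf[i] = static_cast<char>(std::toupper(static_cast<unsigned char>(buf[i])));
    }
}

// Hands the string's own buffer to the C-style routine.
void make_upper(std::string& s) {
    if (s.empty()) return;            // nothing to write; leave the terminator alone
    upper_in_place(&s[0], s.size());  // &s[0] is a mutable char* into contiguous storage
}
```

Only the first size() characters are written, so the terminating null is never touched.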

Related

Is it safe to store the pointer to the data of a std::string?

My question revolves around the mechanics of copy-construction and reallocation.
I have a class, that collects strings. After adding a string to the collection, the string is copied and stored in a vector. But as I also need access to the collection of all the string as const char * const*, I also store the pointers to the data of each string via .c_str().
class MyStrings {
private:
std::vector<std::string> names;
std::vector<const char*> cStringPointers;
public:
const char *const *Data() const
{
return this->cStringPointers.data();
}
void Add(const std::string &name)
{
// copy [name] and store the copy in [this->names].
this->names.push_back(name);
// Store the pointer to the data of the copy.
this->cStringPointers.push_back(this->names.back().c_str());
}
};
I am aware that storing pointers to elements of a vector is bad, because when the vector gets resized, i.e. has to reallocate its memory, those pointers would not be valid anymore.
But I am storing just the pointers to the data. So here is what I think:
If names gets resized, it will move-construct all the strings it contains and so those strings will not allocate new memory, but instead just use the already allocated memory and so my pointers in cStringPointers would still be valid.
My question is now simply: Have I missed something that would make this code unsafe or cause undefined behaviour?
(Assuming that I don't use any exotic architecture and/or compiler.)
My question is now simply: Have I missed something that would make
this code unsafe or cause undefined behaviour?
Yes: you have missed Small String Optimization. It is permitted by the standard and widely implemented, and will lead to dangling pointers as the strings actually move their data to their new location.
This is not safe. Even the cStringPointers is not safe.
Note that the standard library of most compilers implements something called Small String Optimization (SSO). Basically, with SSO, if a string is small (in gcc, up to 15 characters), memory for that string is not allocated on the heap but is kept directly inside the std::basic_string object. To achieve that, std::basic_string is larger than the size required for the pointers (begin, end, capacity).
This means that if the vector is reallocated, small strings will change their position.
Longer strings will remain valid since they are allocated on the heap and that memory will not be copied.
My question is now simply: Have I missed something that would make this code unsafe or cause undefined behaviour?
Yes. This particular assumption is up to the implementation and thus UB, even if all common implementations of std::string moved the data of the string and kept pointers valid. Only when such a detail is actually guaranteed by the standard can you rely on it. (Such guarantees are commonly found in sections entitled "Iterator validity" and such.) The documentation of std::string's move constructor (No. 2) explicitly states:
Unlike other container move assignments, references, pointers, and iterators to str may be invalidated.
Here, this assumption actually happens to be wrong for most implementations as those use a small string optimization. That will store strings up to a certain size ("small strings") in the string object itself rather than allocating memory dynamically. Thus, when the string is moved, it can only avoid copying long strings which are dynamically allocated while small strings are actually copied. Thus c_str() will yield a different pointer after a move of small strings.
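To see whether a given string is stored inline, one can check whether its character buffer lies inside the string object itself. This is only a diagnostic sketch: stored_inline is a hypothetical helper, and whether a short string is stored inline at all is up to the implementation.

```cpp
#include <cassert>
#include <functional>
#include <string>

// True if the character buffer lives inside the std::string object itself,
// which is what a small string optimization would produce.
bool stored_inline(const std::string& s) {
    const char* obj_begin = reinterpret_cast<const char*>(&s);
    const char* obj_end = obj_begin + sizeof(std::string);
    const char* buf = s.data();
    std::less<const char*> before;  // gives a total order even for unrelated pointers
    return !before(buf, obj_begin) && before(buf, obj_end);
}
```

A string far longer than sizeof(std::string) cannot be stored inline, so stored_inline returns false for it. For a short string such as "hi" the result depends on the library, and a true result means its c_str() pointer moves whenever the string object does.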
Just adding a relevant quote from the C++ Standard [string.require.4]:
References, pointers, and iterators referring to the elements of a basic_string sequence may be invalidated by the following uses of that basic_string object:
- Passing as an argument to any standard library function taking a reference to non-const basic_string as an argument.
Move construction of strings during vector reallocation is exactly such a case, since the move constructor takes a reference to a non-const string as an argument.
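One way to sidestep the invalidation is to never keep a c_str() pointer across a reallocation of the string vector. The following is a sketch of that idea, not the asker's final code: it rebuilds the pointer array after every Add, and the Size member is an addition made here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class MyStrings {
public:
    void Add(const std::string& name) {
        names_.push_back(name);  // may reallocate and move every stored string
        // Rebuild all pointers so none of them refers to a moved-from string.
        pointers_.clear();
        pointers_.reserve(names_.size());
        for (const std::string& s : names_) {
            pointers_.push_back(s.c_str());
        }
    }

    const char* const* Data() const { return pointers_.data(); }
    std::size_t Size() const { return pointers_.size(); }

private:
    std::vector<std::string> names_;
    std::vector<const char*> pointers_;
};
```

Rebuilding on every Add costs time linear in the number of stored strings. If that matters, the rebuild could be deferred until Data() is called, at the price of making Data() non-const or the pointer vector mutable.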

Directly write into char* buffer of std::string

So I have an std::string and have a function which takes char* and writes into it. Since std::string::c_str() and std::string::data() return const char*, I can't use them. So I was allocating a temporary buffer, calling a function with it and copying it into std::string.
Now I plan to work with big amount of information and copying this buffer will have a noticeable impact and I want to avoid it.
Some people suggested to use &str.front() or &str[0] but does it invoke the undefined behavior?
C++98/03
Impossible. String can be copy on write so it needs to handle all reads and writes.
C++11/14
In [string.require]:
The char-like objects in a basic_string object shall be stored contiguously. That is, for any basic_string
object s, the identity &*(s.begin() + n) == &*s.begin() + n shall hold for all values of n such that 0 <= n < s.size().
So &str.front() and &str[0] should work.
C++17
str.data(), &str.front() and &str[0] work.
Here it says:
charT* data() noexcept;
Returns: A pointer p such that p + i == &operator[](i) for each i in [0, size()].
Complexity: Constant time.
Requires: The program shall not alter the value stored at p + size().
The non-const .data() just works.
The recent draft has the following wording for .front():
const charT& front() const;
charT& front();
Requires: !empty().
Effects: Equivalent to operator[](0).
And the following for operator[]:
const_reference operator[](size_type pos) const;
reference operator[](size_type pos);
Requires: pos <= size().
Returns: *(begin() + pos) if pos < size(). Otherwise, returns a reference to an object of type charT with value charT(), where modifying the object leads to undefined behavior.
Throws: Nothing.
Complexity: Constant time.
So it uses iterator arithmetic, and we need to inspect the information about iterators. Here it says:
3 A basic_string is a contiguous container ([container.requirements.general]).
So we need to go here:
A contiguous container is a container that supports random access iterators ([random.access.iterators]) and whose member types iterator and const_iterator are contiguous iterators ([iterator.requirements.general]).
Then here:
Iterators that further satisfy the requirement that, for integral values n and dereferenceable iterator values a and (a + n), *(a + n) is equivalent to *(addressof(*a) + n), are called contiguous iterators.
Apparently, contiguous iterators are a C++17 feature which was added in these papers.
The requirement can be rewritten as:
assert(*(a + n) == *(&*a + n));
So, in the second part we dereference the iterator, take the address of the value it points to, do pointer arithmetic on it and dereference the result, and that is the same as incrementing the iterator and then dereferencing it. This means that a contiguous iterator points to memory where each value is stored right after the other, hence contiguous. Since functions that take char* expect contiguous memory, you can pass the result of &str.front() or &str[0] to these functions.
You can simply use &s[0] for a non-empty string. This gives you a pointer to the start of the buffer.
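The contiguity identity quoted from [string.require] can be checked directly at run time. This is a small sketch assuming C++11 or later, and is_contiguous is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Verifies &*(s.begin() + n) == &*s.begin() + n for every n in [0, s.size()).
bool is_contiguous(const std::string& s) {
    typedef std::string::difference_type diff_t;
    for (std::size_t n = 0; n < s.size(); ++n) {
        const diff_t d = static_cast<diff_t>(n);
        if (&*(s.begin() + d) != &*s.begin() + d) return false;
    }
    return true;
}
```

On a conforming C++11 implementation this returns true for every string. It illustrates the guarantee rather than testing anything that can fail.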
When you use it to put a string of n characters there the string's length (not just the capacity) needs to be at least n beforehand, because there's no way to adjust it up without clobbering the data.
I.e., usage can go like this:
auto foo( int const n )
-> string
{
if( n <= 0 ) { return ""; }
string result( n, '#' ); // # is an arbitrary fill character.
int const n_stored = some_api_function( &result[0], n );
assert( n_stored <= n );
result.resize( n_stored );
return result;
}
This approach has worked formally since C++11. Before that, in C++98 and C++03, the buffer was not formally guaranteed to be contiguous. In practice, however, the approach has worked since C++98, the first standard: the reason that the contiguous buffer requirement could be adopted in C++11 (it was added in the Lillehammer meeting, I think that was 2005) was that there were no extant standard library implementations with a non-contiguous string buffer.
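A concrete instance of the size-first, write, then shrink pattern can be sketched with std::snprintf standing in for the C API. The function name format_int and the 32-byte size are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats an int by letting snprintf write straight into the string's buffer.
std::string format_int(int value) {
    std::string result(32, '\0');  // length, not just capacity, must cover the write
    const int written = std::snprintf(&result[0], result.size(), "%d", value);
    assert(written >= 0 && static_cast<std::size_t>(written) < result.size());
    result.resize(static_cast<std::size_t>(written));  // drop the unused tail
    return result;
}
```

snprintf also writes a null terminator, but because the count it returns is smaller than result.size(), that write lands inside the characters the string already owns.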
Regarding
” C++17 added non-const data() to std::string but it still says that you can't modify the buffer.
I'm not aware of any such wording, and since that would defeat the purpose of non-const data() I doubt that this statement is correct.
Regarding
” Now I plan to work with big amount of information and copying this buffer will have a noticeable impact and I want to avoid it.
If copying the buffer has a noticeable impact, then you'd want to avoid inadvertently copying the std::string.
One way is to wrap it in a class that's not copyable.
I don't know what you intend to do with that string, but if
all you need is a buffer of chars which frees its own memory automatically,
then I usually use vector<char> or vector<int> or whatever type
of buffer you need.
With v being the vector, it's guaranteed that &v[0] points to
a sequential memory which you can use as a buffer.
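A sketch of the vector-as-buffer approach follows. fill_buffer is a hypothetical stand-in for whatever C-style function writes into a char*, and read_via_vector is likewise an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical C-style producer: writes at most `cap` bytes, returns the count written.
std::size_t fill_buffer(char* out, std::size_t cap) {
    static const char payload[] = "payload";
    const std::size_t len = std::strlen(payload);
    const std::size_t n = len < cap ? len : cap;
    std::memcpy(out, payload, n);
    return n;
}

// Uses a vector<char> as the scratch buffer, then builds a string from the result.
std::string read_via_vector(std::size_t cap) {
    std::vector<char> buf(cap);
    if (buf.empty()) return std::string();  // &buf[0] would be invalid on an empty vector
    const std::size_t n = fill_buffer(&buf[0], buf.size());
    return std::string(&buf[0], n);
}
```

The final conversion copies the bytes once more, so this only pays off when the vector itself can serve as the long-lived buffer and the std::string at the end is optional.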
Note: if you consider string::front() to be the same as &string[0] then the following is a redundant answer:
According to cplusplus: In C++98, you shouldn't write to .data() or .c_str(), they are to be treated as read-only/const:
A program shall not alter any of the characters in this sequence.
But in C++11 this warning was removed, but the return values are still const, so officially it isn't allowed in C++11 either. So to avoid undefined behavior, you can use string::front(), which:
If the string object is const-qualified, the function returns a const char&. Otherwise, it returns a char&.
So if your string isn't const, then you are officially allowed to manipulate the contents returned by string::front(), which is a reference to the first element of the buffer. But the link doesn't mention which C++ standard this applies to. I assume C++11 and later.
Also, it returns the first element, not a pointer, so you'll need to take its address. It's not clear whether you are officially allowed to use that as a const char* for the whole buffer, but in combination with other answers, I'm sure it's safe. At least it doesn't produce any compiler warnings.

Is wstring null terminated?

What is the internal structure of std::wstring? Does it include the length? Is it null terminated? Both?
Does it include the length
Yes. It's required by the C++11 standard.
§ 21.4.4
size_type size() const noexcept;
1. Returns: A count of the number of char-like objects currently in the string.
2. Complexity: constant time.
Note however, that this is unaware of Unicode.
Is it null terminated
Yes. It's also required by the C++11 standard that std::basic_string::c_str returns a valid pointer for the range of [0,size()] in which my_string[my_string.size()] will be valid, hence a null character.
§ 21.4.7.1
const charT* c_str() const noexcept;
const charT* data() const noexcept;
1. Returns: A pointer p such that p + i == &operator[](i) for
each i in [0,size()].
2. Complexity: constant time.
3. Requires: The program shall not alter any of the values
stored in the character array.
We don't know. It's completely up to the implementation. (At least up until C++03 - apparently C++11 requires the internal buffer to be 0-terminated.) You can have a look at the source code of the C++ standard library implementation if the one you are using is opensource.
Apart from that, I'd find it logical if it was NUL-terminated and it stored an explicit length as well. This is good because then it takes constant time to return the length and a valid C string:
size_t length()
{
return m_length;
}
const wchar_t *c_str()
{
return m_cstr;
}
If it didn't store an explicit length, then size() would have to count the characters up to the NUL in O(n), which is wasteful if you can avoid it.
If, however, the internal buffer wasn't NUL-terminated, but it only stored the length, then it would be tedious to create a proper NUL-terminated C string: the string would have to either reallocate its storage and append the 0 (and reallocation is an expensive operation), or it would have to copy the entire buffer over, which is again an O(n) operation.
(Warning: shameless self-promotion - in a C language project I am currently working on, I've taken exactly this approach to implement flexible string objects.)
basic_string (from which wstring is typedef) has no need for terminators.
Yes, it manages its own lengths.
If you need a null-terminated (aka C string) version of string/wstring, call c_str(). But it can contain a null character inside it, in which case pretty much every C function to handle C strings will fail to see the entire string.
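Both points, the guaranteed terminator and the embedded-null pitfall, can be shown in a short sketch assuming C++11 or later. The helper names has_terminator and c_length are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// True if the element one past the last character is the null wide character.
bool has_terminator(const std::wstring& ws) {
    return ws.c_str()[ws.size()] == L'\0';
}

// Length as a C function sees it: counts up to the first null character.
std::size_t c_length(const std::wstring& ws) {
    return std::wcslen(ws.c_str());
}
```

For std::wstring(L"a\0b", 3), size() is 3 while c_length reports 1, because wcslen stops at the embedded null.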

Is string::c_str() no longer null terminated in C++11?

In C++11 basic_string::c_str is defined to be exactly the same as basic_string::data, which is in turn defined to be exactly the same as *(begin() + n) and *(&*begin() + n) (when 0 <= n < size()).
I cannot find anything that requires the string to always have a null character at its end.
Does this mean that c_str() is no longer guaranteed to produce a null-terminated string?
Strings are now required to use null-terminated buffers internally. Look at the definition of operator[] (21.4.5):
Requires: pos <= size().
Returns: *(begin() + pos) if pos <
size(), otherwise a reference to an object of type T with value
charT(); the referenced value shall not be modified.
Looking back at c_str (21.4.7.1/1), we see that it is defined in terms of operator[]:
Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
And both c_str and data are required to be O(1), so the implementation is effectively forced to use null-terminated buffers.
Additionally, as David Rodríguez - dribeas points out in the comments, the return value requirement also means that you can use &operator[](0) as a synonym for c_str(), so the terminating null character must lie in the same buffer (since *(p + size()) must be equal to charT()); this also means that even if the terminator is initialised lazily, it's not possible to observe the buffer in the intermediate state.
Well, in fact it is true that the new standard stipulates that .data() and .c_str() are now synonyms. However, it doesn't say that .c_str() is no longer zero-terminated :)
It just means that you can now rely on .data() being zero-terminated as well.
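The consequences of the C++11 wording can be stated as a few checks. This is a sketch under the assumption of a conforming C++11 library, and accessors_agree is a hypothetical name:

```cpp
#include <cassert>
#include <string>

// c_str(), data() and &s[0] name the same buffer, and it ends in a null character.
bool accessors_agree(const std::string& s) {
    return s.c_str() == s.data()
        && s.data() == &s[0]
        && s.data()[s.size()] == '\0';
}
```

This holds for the empty string too, since the range in the Returns clause is [0,size()] with the upper bound included.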
Paper N2668 defines c_str() and data() members of std::basic_string as
follows:
const charT* c_str() const;
const charT* data() const;
Returns: A pointer to the initial element of an array of length
size() + 1 whose first size() elements equal the corresponding
elements of the string controlled by *this and whose last element is a
null character specified by charT().
Requires: The program shall not alter any of the values stored in
the character array.
Note that this does NOT mean that any valid std::string can be treated as a C-string because std::string can contain embedded nulls, which will prematurely end the C-string when used directly as a const char*.
Addendum:
I don't have access to the actual published final spec of C++11 but it appears that indeed the wording was dropped somewhere in the revision history of the spec: e.g. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2011/n3242.pdf
§ 21.4.7 basic_string string operations [string.ops]
§ 21.4.7.1 basic_string accessors [string.accessors]
const charT* c_str() const noexcept;
const charT* data() const noexcept;
Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
Complexity: constant time.
Requires: The program shall not alter any of the values stored in the character array.
The "history" was that a long time ago when everyone worked in single threads, or at least the threads were workers with their own data, they designed a string class for C++ which made string handling easier than it had been before, and they overloaded operator+ to concatenate strings.
The issue was that users would do something like:
s = s1 + s2 + s3 + s4;
and each concatenation would create a temporary which had to implement a string.
Therefore someone had the brainwave of "lazy evaluation" such that internally you could store some kind of "rope" with all the strings until someone wanted to read it as a C-string at which point you would change the internal representation to a contiguous buffer.
This solved the problem above but caused a load of other headaches, in particular in the multi-threaded world where one expected a .c_str() operation to be read-only / doesn't change anything and therefore no need to lock anything. Premature internal-locking in the class implementation just in case someone was doing it multi-threaded (when there wasn't even a threading standard) was also not a good idea. In fact it was more costly to do anything of this than simply copy the buffer each time. Same reason "copy on write" implementation was abandoned for string implementations.
Thus making .c_str() a truly immutable operation turned out to be the most sensible thing to do, however could one "rely" on it in a standard that now is thread-aware? Therefore the new standard decided to clearly state that you can, and thus the internal representation needs to hold the null terminator.
Well spotted. This is certainly a defect in the recently adopted standard; I'm sure that there was no intent to break all of the code currently using c_str. I would suggest a defect report, or at least asking the question in comp.std.c++ (which will usually end up before the committee if it concerns a defect).

string c_str() vs. data()

I have read several places that the difference between c_str() and data() (in STL and other implementations) is that c_str() is always null terminated while data() is not.
As far as I have seen in actual implementations, they either do the same or data() calls c_str().
What am I missing here?
Which one is more correct to use in which scenarios?
The documentation is correct. Use c_str() if you want a null terminated string.
If the implementers happened to implement data() in terms of c_str() you don't have to worry; still use data() if you don't need the string to be null terminated, since in some implementations it may turn out to perform better than c_str().
strings don't necessarily have to be composed of character data, they could be composed with elements of any type. In those cases data() is more meaningful. c_str() in my opinion is only really useful when the elements of your string are character based.
Extra: In C++11 onwards, both functions are required to be the same. i.e. data is now required to be null-terminated. According to cppreference: "The returned array is null-terminated, that is, data() and c_str() perform the same function."
In C++11/C++0x, data() and c_str() are no longer different, and thus data() is required to have a null terminator at the end as well.
21.4.7.1 basic_string accessors [string.accessors]
const charT* c_str() const noexcept;
const charT* data() const noexcept;
1 Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
21.4.5 basic_string element access [string.access]
const_reference operator[](size_type pos) const noexcept;
1 Requires: pos <= size().
2 Returns: *(begin() + pos) if pos < size(), otherwise a reference to an object of type T
with value charT(); the referenced value shall not be modified.
Even though you have seen that they do the same, or that .data() calls .c_str(), it is not correct to assume that this will be the case for other compilers. It is also possible that your compiler will change this in a future release.
2 reasons to use std::string:
std::string can be used for both text and arbitrary binary data.
//Example 1
//Plain text:
std::string s1;
s1 = "abc";
//Example 2
//Arbitrary binary data:
std::string s2;
s2.append("a\0b\0b\0", 6);
You should use the .c_str() method when you are using your string as example 1.
You should use the .data() method when you are using your string as example 2. Not because it is dangerous to use .c_str() in these cases, but because it is more explicit that you are working with binary data for others reviewing your code.
Possible pitfall with using .data()
The following code is wrong and could cause a segfault in your program:
std::string s;
s = "abc";
char sz[512];
strcpy(sz, s.data());//This could crash depending on the implementation of .data()
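A copy that does not rely on data() being null terminated can be sketched as below: it copies exactly size() bytes and writes the terminator itself. copy_to_buffer is a hypothetical helper, not a standard function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Copies s into dest as a C string; returns false if dest is too small.
bool copy_to_buffer(const std::string& s, char* dest, std::size_t dest_size) {
    if (dest == nullptr || s.size() >= dest_size) return false;  // need room for '\0'
    std::memcpy(dest, s.data(), s.size());  // uses the length, not a terminator
    dest[s.size()] = '\0';
    return true;
}
```

Because it is bounded by dest_size, this also avoids overflowing the destination when the string is longer than the array.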
Why is it common for implementers to make .data() and .c_str() do the same thing?
Because it is more efficient to do so. The only way to make .data() return something that is not null terminated, would be to have .c_str() or .data() copy their internal buffer, or to just use 2 buffers. Having a single null terminated buffer always means that you can always use just one internal buffer when implementing std::string.
It has been answered already, some notes on the purpose: Freedom of implementation.
std::string operations - e.g. iteration, concatenation and element mutation - don't need the zero terminator. Unless you pass the string to a function expecting a zero terminated string, it can be omitted.
This would allow an implementation to have substrings share the actual string data: string::substr could internally hold a reference to shared string data, and the start/end range, avoiding the copy (and additional allocation) of the actual string data. The implementation would defer the copy until you call c_str or modify any of the strings. No copy would ever be made if the sub-strings involved are just read.
(copy-on-write implementations aren't much fun in multithreaded environments, plus the typical memory/allocation savings aren't worth the more complex code today, so it's rarely done).
Similarly, string::data allows a different internal representation, e.g. a rope (linked list of string segments). This can improve insert / replace operations significantly. Again, the list of segments would have to be collapsed to a single segment when you call c_str or data.
Quote from ANSI ISO IEC 14882 2003 (C++03 Standard):
21.3.6 basic_string string operations [lib.string.ops]
const charT* c_str() const;
Returns: A pointer to the initial element of an array of length size() + 1 whose first size() elements
equal the corresponding elements of the string controlled by *this and whose last element is a
null character specified by charT().
Requires: The program shall not alter any of the values stored in the array. Nor shall the program treat the
returned value as a valid pointer value after any subsequent call to a non-const member function of the
class basic_string that designates the same object as this.
const charT* data() const;
Returns: If size() is nonzero, the member returns a pointer to the initial element of an array whose first
size() elements equal the corresponding elements of the string controlled by *this. If size() is
zero, the member returns a non-null pointer that is copyable and can have zero added to it.
Requires: The program shall not alter any of the values stored in the character array. Nor shall the program
treat the returned value as a valid pointer value after any subsequent call to a non- const member
function of basic_string that designates the same object as this.
All the previous comments are consistent, but I'd also like to add that starting in c++17, str.data() returns a char* instead of const char* when called on a non-const string.
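For code that has to build both before and after C++17, a small wrapper can pick whichever spelling is available. This is only a sketch: mutable_data is a hypothetical name, and it assumes the compiler reports its language level through __cplusplus (some compilers need a flag for that, in which case the fallback branch is used).

```cpp
#include <cassert>
#include <string>

// Returns a writable pointer to the string's buffer on C++11 and later.
char* mutable_data(std::string& s) {
#if __cplusplus >= 201703L
    return s.data();  // non-const overload, available since C++17
#else
    return &s[0];     // equivalent pointer under the C++11 contiguity rule
#endif
}
```

Either branch yields the same address, so callers may write to the first size() characters but must not change the terminating null.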