Consider code like this:
std::string str = "abcdef";
const size_t num = 50;
const size_t baselen = str.length();
while (str.length() < num)
    str.append(str, 0, baselen);
Is it safe to call std::basic_string<T>::append() on itself like this? Can't the source memory be invalidated when the string is enlarged, before the copy takes place?
I could not find anything in the standard specific to that method. It says the above is equivalent to str.append(str.data(), baselen), which I think might not be entirely safe unless there is another detection of such cases inside append(const char*, size_t).
I checked a few implementations and they seemed safe one way or another, but my question is whether this behavior is guaranteed. For example, "Appending std::vector to itself, undefined behavior?" says it is not guaranteed for std::vector.
According to §21.4.6.2/§21.4.6.3:
The function [basic_string& append(const charT* s, size_type n);] replaces the string controlled by *this with a string of length size() + n whose first size() elements are a copy of the original string controlled by *this and whose remaining elements are a copy of the initial n elements of s.
Note: This applies to every append call, as every append can be implemented in terms of append(const charT*, size_type), as defined by the standard (§21.4.6.2/§21.4.6.3).
So basically, append makes a copy of str (let's call the copy strtemp), appends n characters of str2 to strtemp, and then replaces str with strtemp.
For the case that str2 is str, nothing changes, as the string is enlarged when the temporary copy is assigned, not before.
Even though it is not explicitly stated in the standard, it is guaranteed (if the implementation is exactly as stated in the standard) by the definition of std::basic_string<T>::append.
Thus, this is not undefined behavior.
This is complicated.
One thing can be said for certain. If you use iterators:
std::string str = "abcdef";
str.append(str.begin(), str.end());
then you are guaranteed to be safe. Yes, really. Why? Because the specification states that the behavior of the iterator functions is equivalent to calling append(basic_string(first, last)). That obviously creates a temporary copy of the string. So if you need to insert a string into itself, you're guaranteed to be able to do it with the iterator form.
Granted, implementations don't have to actually copy it. But they do need to respect the standard specified behavior. An implementation could choose to make a copy only if the iterator range is inside of itself, but the implementation would still have to check.
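A minimal sketch of the iterator form, next to the explicit-temporary spelling it is specified to be equivalent to. The helper names append_self_iter and append_self_copy are made up for illustration:

```cpp
#include <string>

// Hypothetical helper: appends s to itself through the iterator overload.
inline void append_self_iter(std::string& s) {
    s.append(s.begin(), s.end());
}

// Hypothetical helper: the same effect spelled out with an explicit
// temporary, so the source range cannot alias the destination buffer.
inline void append_self_copy(std::string& s) {
    const std::string tmp(s.begin(), s.end());
    s.append(tmp);
}
```

Both should leave "abcdef" as "abcdefabcdef". The second spelling costs an extra copy but does not depend on how the library handles aliasing.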
All of the other forms of append are defined to be equivalent to calling append(const charT *s, size_t len). That is, your call to append above is equivalent to you doing append(str.data(), str.size()). So what does the standard say about what happens if s is inside of *this?
Nothing at all.
The only requirement of s is:
s points to an array of at least n elements of charT.
Since it does not expressly forbid s pointing into *this, then it must be allowed. It would also be exceedingly strange if the iterator version allows self-assignment, but the pointer&size version did not.
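For reference, here is the loop from the question rewritten with the pointer-and-length overload, plus a conservative variant that never passes a pointer into the destination. This is only a sketch: the function names are hypothetical, and a passing run shows what one implementation does, not what the standard guarantees.

```cpp
#include <cstddef>
#include <string>

// The loop from the question, using the pointer-and-length overload
// with a source pointer that points into the string being appended to.
inline std::string repeat_aliasing(std::string str, std::size_t num) {
    const std::size_t baselen = str.length();
    if (baselen == 0) return str;  // nothing to repeat; avoids an endless loop
    while (str.length() < num)
        str.append(str.data(), baselen);
    return str;
}

// Conservative variant: the source is a separate object, so the question
// of aliasing never arises.
inline std::string repeat_from_copy(const std::string& base, std::size_t num) {
    std::string out(base);
    if (base.empty()) return out;
    while (out.length() < num)
        out.append(base);
    return out;
}
```

If you would rather not rely on the reading of the standard given above, the second form gives the same result at the cost of keeping the base string around.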
So I have a std::string and a function which takes a char* and writes into it. Since std::string::c_str() and std::string::data() return const char*, I can't use them. So I was allocating a temporary buffer, calling the function with it and copying the result into the std::string.
Now I plan to work with a big amount of information, and copying this buffer will have a noticeable impact that I want to avoid.
Some people suggested using &str.front() or &str[0], but does that invoke undefined behavior?
C++98/03
Impossible. The string may be implemented as copy-on-write, so it needs to intercept all reads and writes.
C++11/14
In [string.require]:
The char-like objects in a basic_string object shall be stored contiguously. That is, for any basic_string object s, the identity &*(s.begin() + n) == &*s.begin() + n shall hold for all values of n such that 0 <= n < s.size().
So &str.front() and &str[0] should work.
C++17
str.data(), &str.front() and &str[0] work.
Here it says:
charT* data() noexcept;
Returns: A pointer p such that p + i == &operator[](i) for each i in [0, size()].
Complexity: Constant time.
Requires: The program shall not alter the value stored at p + size().
The non-const .data() just works.
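A small sketch of how that can look in practice, assuming a C++17 compiler. fill_buffer is a hypothetical stand-in for whatever char*-taking function you actually need to call:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a C-style API that writes into a caller-supplied
// buffer and returns the number of characters written.
inline std::size_t fill_buffer(char* dst, std::size_t cap) {
    const char msg[] = "hello";
    const std::size_t len = sizeof msg - 1;
    const std::size_t n = len < cap ? len : cap;
    std::memcpy(dst, msg, n);
    return n;
}

inline std::string read_via_data() {
    std::string s(16, '\0');   // size, not just capacity, must cover the writes
    const std::size_t n = fill_buffer(s.data(), s.size());  // non-const data(): C++17
    s.resize(n);               // trim to what was actually written
    return s;
}
```

Note that the writes stay within [0, size()), so the element at p + size() is never altered.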
The recent draft has the following wording for .front():
const charT& front() const;
charT& front();
Requires: !empty().
Effects: Equivalent to operator[](0).
And the following for operator[]:
const_reference operator[](size_type pos) const;
reference operator[](size_type pos);
Requires: pos <= size().
Returns: *(begin() + pos) if pos < size(). Otherwise, returns a reference to an object of type charT with value charT(), where modifying the object leads to undefined behavior.
Throws: Nothing.
Complexity: Constant time.
So it uses iterator arithmetic, which means we need to look at what the standard says about iterators. Here it says:
3 A basic_string is a contiguous container ([container.requirements.general]).
So we need to go here:
A contiguous container is a container that supports random access iterators ([random.access.iterators]) and whose member types iterator and const_iterator are contiguous iterators ([iterator.requirements.general]).
Then here:
Iterators that further satisfy the requirement that, for integral values n and dereferenceable iterator values a and (a + n), *(a + n) is equivalent to *(addressof(*a) + n), are called contiguous iterators.
Apparently, contiguous iterators are a C++17 feature which was added in these papers.
The requirement can be rewritten as:
assert(*(a + n) == *(&*a + n));
So, on the right-hand side we dereference the iterator, take the address of the value it refers to, do pointer arithmetic on that address and dereference the result, and that is the same as advancing the iterator and then dereferencing it. This means that a contiguous iterator refers to memory where each value is stored right after the previous one, hence contiguous. Since functions that take char* expect contiguous memory, you can pass the result of &str.front() or &str[0] to these functions.
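If you want to see the identity hold on your own implementation, a check along these lines can be used. It is only an illustration (is_contiguous is a hypothetical name) and compares addresses rather than values:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: checks the contiguity identity for every valid index.
inline bool is_contiguous(const std::string& s) {
    if (s.empty()) return true;           // nothing to dereference
    const char* const first = &*s.begin();
    for (std::size_t n = 0; n < s.size(); ++n) {
        if (&*(s.begin() + n) != first + n) return false;  // iterator vs pointer arithmetic
        if (&s[n] != first + n) return false;              // operator[] agrees as well
    }
    return true;
}
```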
You can simply use &s[0] for a non-empty string. This gives you a pointer to the start of the buffer.
When you use it to store a string of n characters there, the string's length (not just its capacity) needs to be at least n beforehand, because there's no way to adjust the length up afterwards without clobbering the data.
I.e., usage can go like this:
auto foo( int const n )
    -> string
{
    if( n <= 0 ) { return ""; }
    string result( n, '#' );    // # is an arbitrary fill character.
    int const n_stored = some_api_function( &result[0], n );
    assert( n_stored <= n );
    result.resize( n_stored );
    return result;
}
This approach has been formally valid since C++11. Before that, in C++98 and C++03, the buffer was not formally guaranteed to be contiguous. In practice, however, the approach has worked since C++98, the first standard: the reason the contiguous buffer requirement could be adopted in C++11 (it was added at the Lillehammer meeting, I think that was 2005) was that there were no extant standard library implementations with a non-contiguous string buffer.
Regarding
"C++17 added non-const data() to std::string but it still says that you can't modify the buffer."
I'm not aware of any such wording, and since that would defeat the purpose of non-const data() I doubt that this statement is correct.
Regarding
"Now I plan to work with big amount of information and copying this buffer will have a noticeable impact and I want to avoid it."
If copying the buffer has a noticeable impact, then you'd want to avoid inadvertently copying the std::string.
One way is to wrap it in a class that's not copyable.
I don't know what you intend to do with that string, but if all you need is a buffer of chars which frees its own memory automatically, then I usually use vector<char> or vector<int> or whatever type of buffer you need.
With v being the vector, it's guaranteed that &v[0] points to contiguous memory which you can use as a buffer (as long as v is not empty).
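A rough sketch of the vector approach. write_greeting is a hypothetical placeholder for the real API, and the buffer size is whatever your API needs:

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for an API that writes into a char buffer and
// returns the number of characters written.
inline std::size_t write_greeting(char* dst, std::size_t cap) {
    const char msg[] = "hi there";
    const std::size_t len = sizeof msg - 1;
    const std::size_t n = len < cap ? len : cap;
    std::memcpy(dst, msg, n);
    return n;
}

inline std::vector<char> read_into_vector(std::size_t cap) {
    std::vector<char> buf(cap);
    if (buf.empty()) return buf;          // &buf[0] is only valid when non-empty
    const std::size_t n = write_greeting(&buf[0], buf.size());
    buf.resize(n);                        // keep only what was written
    return buf;
}
```

This avoids the copy only if you can keep working with the vector itself. Converting it to a std::string afterwards copies the data again.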
Note: if you consider string::front() to be the same as &string[0] then the following is a redundant answer:
According to cplusplus: In C++98, you shouldn't write to .data() or .c_str(), they are to be treated as read-only/const:
A program shall not alter any of the characters in this sequence.
In C++11 this warning was removed, but the return values are still const, so officially it isn't allowed in C++11 either. So to avoid undefined behavior, you can use string::front(), which:
If the string object is const-qualified, the function returns a const char&. Otherwise, it returns a char&.
So if your string isn't const, then you are officially allowed to manipulate the contents returned by string::front(), which is a reference to the first element of the buffer. But the link doesn't mention which C++ standard this applies to. I assume C++11 and later.
Also, it returns the first element, not a pointer, so you'll need to take its address. It's not clear whether you are officially allowed to use that as a const char* for the whole buffer, but in combination with other answers, I'm sure it's safe. At least it doesn't produce any compiler warnings.
In C++11 basic_string::c_str is defined to be exactly the same as basic_string::data, which is in turn defined to be exactly the same as *(begin() + n) and *(&*begin() + n) (when 0 <= n < size()).
I cannot find anything that requires the string to always have a null character at its end.
Does this mean that c_str() is no longer guaranteed to produce a null-terminated string?
Strings are now required to use null-terminated buffers internally. Look at the definition of operator[] (21.4.5):
Requires: pos <= size().
Returns: *(begin() + pos) if pos < size(), otherwise a reference to an object of type T with value charT(); the referenced value shall not be modified.
Looking back at c_str (21.4.7.1/1), we see that it is defined in terms of operator[]:
Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
And both c_str and data are required to be O(1), so the implementation is effectively forced to use null-terminated buffers.
Additionally, as David Rodríguez - dribeas points out in the comments, the return value requirement also means that you can use &operator[](0) as a synonym for c_str(), so the terminating null character must lie in the same buffer (since *(p + size()) must be equal to charT()); this also means that even if the terminator is initialised lazily, it's not possible to observe the buffer in the intermediate state.
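The "Returns" clause can be turned into a small check, shown here as a sketch under C++11 or later (accessors_agree is a made-up name). It walks every i in [0, size()] and compares addresses:

```cpp
#include <string>

// Hypothetical check of the C++11 "Returns" clause for c_str()/data().
inline bool accessors_agree(const std::string& s) {
    const char* const p = s.c_str();
    if (p != s.data()) return false;
    for (std::string::size_type i = 0; i <= s.size(); ++i)
        if (p + i != &s[i]) return false;   // includes i == size()
    return p[s.size()] == '\0';             // the terminator lives in the same buffer
}
```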
Well, in fact it is true that the new standard stipulates that .data() and .c_str() are now synonyms. However, it doesn't say that .c_str() is no longer zero-terminated :)
It just means that you can now rely on .data() being zero-terminated as well.
Paper N2668 defines c_str() and data() members of std::basic_string as follows:
const charT* c_str() const;
const charT* data() const;
Returns: A pointer to the initial element of an array of length size() + 1 whose first size() elements equal the corresponding elements of the string controlled by *this and whose last element is a null character specified by charT().
Requires: The program shall not alter any of the values stored in the character array.
Note that this does NOT mean that any valid std::string can be treated as a C-string because std::string can contain embedded nulls, which will prematurely end the C-string when used directly as a const char*.
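To illustrate the embedded-null caveat, here is a short sketch. The helper names are hypothetical; the point is only that size() and strlen() can disagree:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Length as seen by C string functions: stops at the first null character.
inline std::size_t c_length(const std::string& s) {
    return std::strlen(s.c_str());
}

// A string with embedded nulls; the (pointer, length) constructor keeps them.
inline std::string make_binary() {
    return std::string("a\0b\0c", 5);
}
```

For the string built by make_binary(), size() is 5 while c_length() stops after the first character.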
Addendum:
I don't have access to the actual published final spec of C++11 but it appears that indeed the wording was dropped somewhere in the revision history of the spec: e.g. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2011/n3242.pdf
§ 21.4.7 basic_string string operations [string.ops]
§ 21.4.7.1 basic_string accessors [string.accessors]
const charT* c_str() const noexcept;
const charT* data() const noexcept;
Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
Complexity: constant time.
Requires: The program shall not alter any of the values stored in the character array.
The "history" was that a long time ago when everyone worked in single threads, or at least the threads were workers with their own data, they designed a string class for C++ which made string handling easier than it had been before, and they overloaded operator+ to concatenate strings.
The issue was that users would do something like:
s = s1 + s2 + s3 + s4;
and each concatenation would create a temporary, which itself had to be a complete string.
Therefore someone had the brainwave of "lazy evaluation" such that internally you could store some kind of "rope" with all the strings until someone wanted to read it as a C-string at which point you would change the internal representation to a contiguous buffer.
This solved the problem above but caused a load of other headaches, in particular in the multi-threaded world, where one expects a .c_str() operation to be read-only (it doesn't change anything) and therefore not to need any locking. Premature internal locking in the class implementation, just in case someone was using it from multiple threads (when there wasn't even a threading standard), was also not a good idea. In fact, it was more costly to do any of this than simply to copy the buffer each time. That is the same reason "copy on write" was abandoned for string implementations.
Thus making .c_str() a truly immutable operation turned out to be the most sensible thing to do. But could one "rely" on it in a standard that is now thread-aware? The new standard therefore decided to state clearly that you can, and thus the internal representation needs to hold the null terminator.
Well spotted. This is certainly a defect in the recently adopted standard; I'm sure that there was no intent to break all of the code currently using c_str. I would suggest a defect report, or at least asking the question in comp.std.c++ (which will usually end up before the committee if it concerns a defect).
I've been using some semi-iterators to tokenize a std::string, and I've run into a curious problem with operator[]. When constructing a new string from a position using char*, I've used something like the following:
t.begin = i;
t.end = i + 1;
t.contents = std::string(&arg.second[t.begin], &arg.second[t.end]);
where arg.second is a std::string. But if i is the position of the last character, then arg.second[t.end] will trigger a debug assertion, even though taking a pointer to one-past-the-end is well-defined behaviour and even common for primitive arrays, and since the constructor is being called using iterators I know that the end iterator will never be dereferenced. Doesn't it seem logical that &arg.second[arg.second.size()] should be a valid expression, producing the equivalent of arg.second.end() as a char*?
You're not taking a pointer to one past the end; you're ACCESSING one past the end and then taking the address of that. That is entirely different: the former is well defined and well formed, the latter is neither. I suggest using the iterator constructor, which is basically what you ARE using, but with iterators instead of char*. See Alexandre's comment.
operator[](size_type pos) const doesn't return one-past-the-end if pos == size(); it returns charT(), which is a temporary. In the non-const version of operator[], the behavior is undefined.
21.3.4/1
const_reference operator[](size_type pos) const;
reference operator[](size_type pos);
1 Returns: If pos < size(), returns data()[pos]. Otherwise, if pos == size(), the const version returns charT(). Otherwise, the behavior is undefined.
What is well-defined is creating an iterator one past the end. (Pointers might be iterators, too.) However, dereferencing such an iterator will yield Undefined Behavior.
Now, what you're doing is array subscripting, and that is very different from forming iterators, because it returns a reference to the referred-to object (much akin to dereferencing an iterator). You are certainly not allowed to access an array one-past-the-end.
std::string is not an array. It is an object, whose interface loosely resembles an array (namely, provides operator[]). But that's when the similarity ends.
Even if we for a second assume that std::string is just a wrapper built on top of an ordinary array, then in order to obtain the one-past-the-end pointer for the stored sequence, you have to do something like &arg.second[0] + t.end, i.e. instead of going through the std::string interface, first move into the domain of ordinary pointers and use ordinary low-level pointer arithmetic.
However, even that assumption is not correct, and doing something like &arg.second[0] + t.end is a recipe for disaster. In C++98/C++03, std::string is not guaranteed to store its controlled sequence as an array. It is not guaranteed to be stored contiguously, meaning that regardless of where your pointers point, you cannot assume that you'll be able to iterate from one to the other by using pointer arithmetic.
If you want to use an std::string in some legacy pointer-based interface the only choice you have is to go through the std::string::c_str() method, which will generate a non-permanent array-based copy of the controlled sequence.
P.S. Note, BTW, that in the original C and C++ specifications it is illegal to use the &a[N] method to obtain the one-past-the-end pointer even for an ordinary built-in array. You always have to make sure that you are not using the [] operator with past-the-end index. The legal way to obtain the pointer has always been something like a + N or &a[0] + N, but not &a[N]. Recent changes legalized the &a[N] approach as well, but nevertheless originally it was not legal.
A string is not a primitive array, so I'd say the implementation is free to add some debug diagnostics if you are doing something dangerous like accessing elements outside its range. I would guess that a release build will probably work.
But...
For what you are trying to do, why not just use the basic_string( const basic_string& str, size_type index, size_type length ); constructor to create the sub strings?
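Both suggestions from the answers can be sketched as follows. The helper names are hypothetical; begin and end play the role of t.begin and t.end from the question:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: copies the half-open range [begin, end) of src.
// Iterator arithmetic may legitimately reach src.end(); no element at
// index src.size() is ever accessed.
inline std::string slice_iter(const std::string& src, std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= src.size());
    typedef std::string::difference_type diff;
    return std::string(src.begin() + static_cast<diff>(begin),
                       src.begin() + static_cast<diff>(end));
}

// The same result through the (string, position, length) constructor.
inline std::string slice_ctor(const std::string& src, std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= src.size());
    return std::string(src, begin, end - begin);
}
```

Either form handles the case where end equals the size of the string, which is exactly the case that tripped the assertion in the question.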
I have read several places that the difference between c_str() and data() (in STL and other implementations) is that c_str() is always null terminated while data() is not.
As far as I have seen in actual implementations, they either do the same or data() calls c_str().
What am I missing here?
Which one is more correct to use in which scenarios?
The documentation is correct. Use c_str() if you want a null terminated string.
If the implementers happened to implement data() in terms of c_str(), you don't have to worry; still use data() if you don't need the string to be null-terminated, since in some implementations it may turn out to perform better than c_str().
Strings don't necessarily have to be composed of character data; they could be composed of elements of any type. In those cases data() is more meaningful. c_str(), in my opinion, is only really useful when the elements of your string are character based.
Extra: In C++11 onwards, both functions are required to be the same. i.e. data is now required to be null-terminated. According to cppreference: "The returned array is null-terminated, that is, data() and c_str() perform the same function."
In C++11/C++0x, data() and c_str() are no longer different. And thus data() is required to have a null terminator at the end as well.
21.4.7.1 basic_string accessors [string.accessors]
const charT* c_str() const noexcept;
const charT* data() const noexcept;
1 Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
21.4.5 basic_string element access [string.access]
const_reference operator[](size_type pos) const noexcept;
1 Requires: pos <= size().
2 Returns: *(begin() + pos) if pos < size(), otherwise a reference to an object of type T with value charT(); the referenced value shall not be modified.
Even though you have seen that they do the same, or that .data() calls .c_str(), it is not correct to assume that this will be the case for other compilers. It is also possible that your compiler will change with a future release.
Two ways to use std::string:
std::string can be used for both text and arbitrary binary data.
//Example 1
//Plain text:
std::string s1;
s1 = "abc";
//Example 2
//Arbitrary binary data:
std::string s2;
s2.append("a\0b\0b\0", 6);
You should use the .c_str() method when you are using your string as example 1.
You should use the .data() method when you are using your string as in example 2. Not because it is dangerous to use .c_str() in these cases, but because it makes it more explicit to others reviewing your code that you are working with binary data.
Possible pitfall with using .data()
The following code is wrong and could cause a segfault in your program:
std::string s;
s = "abc";
char sz[512];
strcpy(sz, s.data());//This could crash depending on the implementation of .data()
Why is it common for implementers to make .data() and .c_str() do the same thing?
Because it is more efficient to do so. The only way to make .data() return something that is not null terminated, would be to have .c_str() or .data() copy their internal buffer, or to just use 2 buffers. Having a single null terminated buffer always means that you can always use just one internal buffer when implementing std::string.
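One way to avoid the pitfall regardless of what .data() returns is to copy by length and write the terminator yourself. A minimal sketch, where copy_to_buffer is a hypothetical helper:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: copies at most cap - 1 characters of s into dst and
// always writes a terminator, so it relies neither on data() being
// null-terminated nor on the string fitting in the buffer.
// Returns the number of characters copied (excluding the terminator).
inline std::size_t copy_to_buffer(const std::string& s, char* dst, std::size_t cap) {
    if (cap == 0) return 0;
    const std::size_t n = s.size() < cap - 1 ? s.size() : cap - 1;
    std::memcpy(dst, s.data(), n);
    dst[n] = '\0';
    return n;
}
```

Unlike the strcpy call above, this also cannot overrun sz when the string is longer than the buffer.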
This has been answered already, but here are some notes on the purpose: freedom of implementation.
std::string operations - e.g. iteration, concatenation and element mutation - don't need the zero terminator. Unless you pass the string to a function expecting a zero terminated string, it can be omitted.
This would allow an implementation to have substrings share the actual string data: string::substr could internally hold a reference to shared string data, and the start/end range, avoiding the copy (and additional allocation) of the actual string data. The implementation would defer the copy until you call c_str or modify any of the strings. No copy would ever be made if the sub-strings involved are just read.
(Copy-on-write implementations aren't much fun in multithreaded environments, and the typical memory/allocation savings aren't worth the more complex code today, so it's rarely done.)
Similarly, string::data allows a different internal representation, e.g. a rope (linked list of string segments). This can improve insert/replace operations significantly. Again, the list of segments would have to be collapsed to a single segment when you call c_str or data.
Quote from ANSI ISO IEC 14882 2003 (C++03 Standard):
21.3.6 basic_string string operations [lib.string.ops]
const charT* c_str() const;
Returns: A pointer to the initial element of an array of length size() + 1 whose first size() elements equal the corresponding elements of the string controlled by *this and whose last element is a null character specified by charT().
Requires: The program shall not alter any of the values stored in the array. Nor shall the program treat the returned value as a valid pointer value after any subsequent call to a non-const member function of the class basic_string that designates the same object as this.
const charT* data() const;
Returns: If size() is nonzero, the member returns a pointer to the initial element of an array whose first size() elements equal the corresponding elements of the string controlled by *this. If size() is zero, the member returns a non-null pointer that is copyable and can have zero added to it.
Requires: The program shall not alter any of the values stored in the character array. Nor shall the program treat the returned value as a valid pointer value after any subsequent call to a non-const member function of basic_string that designates the same object as this.
All the previous comments are consistent, but I'd also like to add that starting in C++17, str.data() returns a char* instead of a const char* when called on a non-const string.