Directly write into char* buffer of std::string - c++

So I have an std::string and have a function which takes char* and writes into it. Since std::string::c_str() and std::string::data() return const char*, I can't use them. So I was allocating a temporary buffer, calling a function with it and copying it into std::string.
Now I plan to work with a large amount of data, where copying this buffer will have a noticeable impact, so I want to avoid it.
Some people suggested using &str.front() or &str[0], but does that invoke undefined behavior?

C++98/03
Impossible. String can be copy on write so it needs to handle all reads and writes.
C++11/14
In [string.require]:
The char-like objects in a basic_string object shall be stored contiguously. That is, for any basic_string
object s, the identity &*(s.begin() + n) == &*s.begin() + n shall hold for all values of n such that 0 <= n < s.size().
So &str.front() and &str[0] should work.
C++17
str.data(), &str.front() and &str[0] work.
Here it says:
charT* data() noexcept;
Returns: A pointer p such that p + i == &operator[](i) for each i in [0, size()].
Complexity: Constant time.
Requires: The program shall not alter the value stored at p + size().
The non-const .data() just works.
The recent draft has the following wording for .front():
const charT& front() const;
charT& front();
Requires: !empty().
Effects: Equivalent to operator[](0).
And the following for operator[]:
const_reference operator[](size_type pos) const;
reference operator[](size_type pos);
Requires: pos <= size().
Returns: *(begin() + pos) if pos < size(). Otherwise, returns a reference to an object of type charT with value charT(), where modifying the object leads to undefined behavior.
Throws: Nothing.
Complexity: Constant time.
So it uses iterator arithmetic, which means we need to look at what the standard says about iterators. Here it says:
3 A basic_string is a contiguous container ([container.requirements.general]).
So we need to go here:
A contiguous container is a container that supports random access iterators ([random.access.iterators]) and whose member types iterator and const_iterator are contiguous iterators ([iterator.requirements.general]).
Then here:
Iterators that further satisfy the requirement that, for integral values n and dereferenceable iterator values a and (a + n), *(a + n) is equivalent to *(addressof(*a) + n), are called contiguous iterators.
Apparently, contiguous iterators are a C++17 feature which was added in these papers.
The requirement can be rewritten as:
assert(*(a + n) == *(&*a + n));
So the second expression dereferences the iterator, takes the address of the value it refers to, does pointer arithmetic on that address, and dereferences the result; this is the same as advancing the iterator and then dereferencing it. This means a contiguous iterator refers to memory where each value is stored right after the previous one, hence contiguous. Since functions that take char* expect contiguous memory, you can pass the result of &str.front() or &str[0] to these functions.
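As a minimal sketch of that conclusion (C++11 or later), the following lets a C-style function write straight into a string's storage through &s[0]. The names fill_buffer and read_into_string are hypothetical stand-ins, not part of any real API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a C API that fills a caller-provided buffer.
// It writes at most `capacity` bytes and returns the number written.
std::size_t fill_buffer(char* out, std::size_t capacity) {
    static const char payload[] = "hello";
    const std::size_t len = sizeof payload - 1;          // exclude the literal's NUL
    const std::size_t n = len < capacity ? len : capacity;
    std::memcpy(out, payload, n);
    return n;
}

// Lets the C-style function write directly into the string's contiguous storage.
std::string read_into_string(std::size_t capacity) {
    if (capacity == 0) return std::string();   // nothing to write into
    std::string s(capacity, '\0');             // size (not just capacity) must cover the writes
    const std::size_t written = fill_buffer(&s[0], s.size());
    assert(written <= s.size());
    s.resize(written);                         // trim to what was actually produced
    return s;
}
```

For example, read_into_string(16) yields "hello" without an intermediate temporary buffer.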

You can simply use &s[0] for a non-empty string. This gives you a pointer to the start of the buffer.
When you use it to put a string of n characters there, the string's length (not just the capacity) needs to be at least n beforehand, because there's no way to adjust it up without clobbering the data.
I.e., usage can go like this:
auto foo( int const n )
    -> string
{
    if( n <= 0 ) { return ""; }
    string result( n, '#' );    // # is an arbitrary fill character.
    // some_api_function stands for the C-style function that writes into a char*.
    int const n_stored = some_api_function( &result[0], n );
    assert( n_stored <= n );
    result.resize( n_stored );
    return result;
}
This approach has worked formally since C++11. Before that, in C++98 and C++03, the buffer was not formally guaranteed to be contiguous. In practice, however, the approach has worked since C++98, the first standard: the reason the contiguous buffer requirement could be adopted in C++11 (it was added at the Lillehammer meeting, I think that was 2005) was that there were no extant standard library implementations with a non-contiguous string buffer.
Regarding
” C++17 added non-const data() to std::string but it still says that you can't modify the buffer.
I'm not aware of any such wording, and since that would defeat the purpose of non-const data() I doubt that this statement is correct.
Regarding
” Now I plan to work with big amount of information and copying this buffer will have a noticeable impact and I want to avoid it.
If copying the buffer has a noticeable impact, then you'd want to avoid inadvertently copying the std::string.
One way is to wrap it in a class that's not copyable.
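A minimal sketch of such a wrapper, assuming C++11 or later. The class name UniqueText and its members are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical wrapper: owns a std::string, can be moved but not copied.
class UniqueText {
public:
    explicit UniqueText(std::size_t size) : text_(size, '\0') {}

    UniqueText(const UniqueText&) = delete;             // no accidental deep copies
    UniqueText& operator=(const UniqueText&) = delete;
    UniqueText(UniqueText&&) = default;                 // moves stay allowed
    UniqueText& operator=(UniqueText&&) = default;

    // Pointer for C-style writers, valid for size() characters; null when empty.
    char* buffer() { return text_.empty() ? nullptr : &text_[0]; }
    std::size_t size() const { return text_.size(); }
    const std::string& str() const { return text_; }

private:
    std::string text_;
};

static_assert(!std::is_copy_constructible<UniqueText>::value,
              "UniqueText must not be copyable");
```

Any attempt to pass a UniqueText by value or assign one to another then fails at compile time, so an expensive copy cannot slip in unnoticed.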

I don't know what you intend to do with that string, but if
all you need is a buffer of chars which frees its own memory automatically,
then I usually use vector<char> or vector<int> or whatever type
of buffer you need.
With v being the vector, it's guaranteed that &v[0] points to
a sequential memory which you can use as a buffer.
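A rough sketch of the vector-as-buffer idea follows. The function produce is a hypothetical stand-in for whatever C-style API fills the buffer, and via_vector is likewise an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical C-style producer: copies up to `capacity` bytes of a fixed message.
std::size_t produce(char* out, std::size_t capacity) {
    static const char msg[] = "buffer";
    const std::size_t len = sizeof msg - 1;              // exclude the literal's NUL
    const std::size_t n = len < capacity ? len : capacity;
    std::memcpy(out, msg, n);
    return n;
}

// Uses a std::vector<char> as a self-freeing buffer for the C-style call.
std::string via_vector(std::size_t capacity) {
    std::vector<char> buf(capacity);                     // elements are stored contiguously
    if (buf.empty()) return std::string();               // &buf[0] needs at least one element
    const std::size_t n = produce(&buf[0], buf.size());
    return std::string(&buf[0], n);                      // copy only if a string is really needed
}
```

If the data never has to become a std::string, the vector can be used as is and the final copy disappears.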

Note: if you consider string::front() to be the same as &string[0] then the following is a redundant answer:
According to cplusplus: In C++98, you shouldn't write to .data() or .c_str(), they are to be treated as read-only/const:
A program shall not alter any of the characters in this sequence.
But in C++11 this warning was removed, but the return values are still const, so officially it isn't allowed in C++11 either. So to avoid undefined behavior, you can use string::front(), which:
If the string object is const-qualified, the function returns a const char&. Otherwise, it returns a char&.
So if your string isn't const, then you are officially allowed to manipulate the contents returned by string::front(), which is a reference to the first element of the buffer. But the link doesn't mention which C++ standard this applies to. I assume C++11 and later.
Also, it returns the first element, not a pointer, so you'll need to take its address. It's not clear whether you are officially allowed to use that as a const char* for the whole buffer, but in combination with other answers, I'm sure it's safe. At least it doesn't produce any compiler warnings.

Related

Can `std::basic_string::operator[]` return a "distant" protected page nul terminator?

So, operator[] does not directly say that s[s.size()] must be the character after s[s.size()-1] in memory. It seems worded to avoid making that claim.
But s.data() states that s.data()+k == &s[k], and s.data() must return a pointer.
Ignoring the seeming standard defect of using & on CharT above and not std::addressof, is the implementation free to return a different CharT (say, one on a protected page, or in ROM) for s[s.size()] prior to the first call to s.data()? (Clearly it could arrange the buffer to end on a read-only page with a zero on it; I'm talking about a different situation)
To be explicit:
As far as I can tell, if s.data() is never called (and the compiler can prove it), then s[s.size()] need not be contiguous with the rest of the buffer.
Can std::addressof(s[s.size()]) change after a call to s.data() and the implementation be standards-compliant (so long as s.data()+k == &s[k] has .data() evaluated before [], but the compiler is free to enforce that). Or are there immutability requirements I cannot see?
Since C++11, std::string is required to be stored in contiguous memory. This is the quote from the C++11 standard (section 24.4.1.4):
The char-like objects in a basic_string object shall be stored contiguously. That is, for any basic_string object s, the identity &*(s.begin() + n) == &*s.begin() + n shall hold for all values of n such that 0 <= n < s.size().
This quote about the return value of operator[] states that it returns the same as &*(s.begin()+n) (section 21.4.5.1):
*(begin() + pos) if pos < size(). Otherwise, returns a reference to an object of type charT with value charT(), where modifying the object leads to undefined behavior
Then we have this quote on the return value of data() in (section 24.4.7.1):
A pointer p such that p + i == &operator[](i) for each i in [0,size()].
So data() returns the same pointer you would get from &operator[]. And all the values you reach through &operator[] must be stored contiguously. So you can conclude that both return a pointer into the same contiguous memory, and s[s.size()] will not refer to a distant page.
Note that this only applies from C++11 onward. Such guarantees were not made by the standard before C++11.
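The quoted identity can be checked at run time with a small sketch like the one below (C++11 or later). The function name data_matches_subscript is made up for illustration, and passing this check on one implementation does not by itself settle what the standard permits.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Returns true if data() + k addresses the same object as s[k]
// for every k in [0, size()], including the terminator at k == size().
bool data_matches_subscript(const std::string& s) {
    const char* p = s.data();
    for (std::size_t k = 0; k <= s.size(); ++k) {
        if (p + k != std::addressof(s[k])) return false;
    }
    return p[s.size()] == '\0';   // the element at size() reads as charT()
}
```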

Why Doesn't string::data() Provide a Mutable char*?

In c++11 array, string, and vector all got the data method which:
Returns pointer to the underlying array serving as element storage. The pointer is such that range [data(); data() + size()) is always a valid range, even if the container is empty.
[Source]
This method is provided in a mutable and const version for all applicable containers, for example:
T* vector<T>::data();
const T* vector<T>::data() const;
All applicable containers, that is, except string which only provides the const version:
const char* string::data() const;
What happened here? Why did string get shortchanged, when char* string::data() would be so helpful?
The short answer is that c++17 does provide the char* string::data() method. Which is vital for the similarly c++17 data function, thus to gain mutable access to the underlying C-String I can now do this:
auto foo = "lorem ipsum"s;
for(auto i = data(foo); *i != '\0'; ++i) ++(*i);
For historical purposes it's worth chronicling string's development which c++17 is building upon: In c++11 access to string's underlying buffer is made possible by a new requirement that its elements are stored contiguously such that for any given string s:
&*(s.begin() + n) == &*s.begin() + n for any n in [0, s.size()), or, equivalently, a pointer to s[0] can be passed to functions that expect a pointer to the first element of a CharT[] array.
Mutable access to this newly required underlying C-String was obtainable by various methods, for example: &s.front(), &s[0], or &*s.begin(). But back to the original question, which would avoid the burden of using one of these options: Why hasn't access to string's underlying buffer been provided in the form of char* string::data()?
To answer that it is important to note that T* array<T>::data() and T* vector<T>::data() were an addition required by c++11. No additional requirements were incurred by c++11 against other containers such as deque. And there certainly wasn't an additional requirement for string, in fact the requirement that string was contiguous was new to c++11. Before this const char* string::data() had existed. Though it explicitly was not guaranteed to be pointing to any underlying buffer, it was the only way to obtain a const char* from a string:
The returned array is not required to be null-terminated.
This means that string was not "shortchanged" in c++11's transition to data accessors; it simply was not included, so only the const data accessor that string previously possessed persisted. There are naturally occurring examples in C++11's implementation which necessitate writing directly to the underlying buffer of a string.
I think this restriction comes from the (pre-2011) days where std::basic_string didn't have to store its internal buffer as a contiguous byte array.
All the others (std::vector and such), by contrast, had to store their elements as a contiguous sequence per the 2003 standard, so data could easily return a mutable T*, because there was no problem with iterations, etc.
If std::basic_string were to return a mutable char*, that would imply that you can treat that char* as a valid C-string and perform C-string operations like strcpy, that would easily turn to undefined behavior were the string not allocated contiguously.
The C++11 standard added the rule that basic_string has to be implemented as a contiguous byte array. Needless to say, you can work-around this by using the old trick of &str[0].
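With C++17's non-const data(), writing through the string's own buffer needs no &str[0] workaround. The following is a minimal sketch that assumes a C++17 compiler; the function name format_value and the format string are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats an integer directly into a std::string's buffer.
// Relies on the non-const std::string::data() added in C++17.
std::string format_value(int value) {
    const int len = std::snprintf(nullptr, 0, "value=%d", value);  // measure first
    assert(len >= 0);
    // One extra character so that snprintf's NUL lands inside [0, size()).
    std::string out(static_cast<std::size_t>(len) + 1, '\0');
    std::snprintf(out.data(), out.size(), "value=%d", value);
    out.resize(static_cast<std::size_t>(len));                     // drop the NUL snprintf wrote
    return out;
}
```

Sizing the string one character larger and trimming afterwards keeps every write within the string's own elements, so the terminator at data() + size() is never touched. On a pre-C++17 compiler, &out[0] can take the place of out.data().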

Are end+1 iterators for std::string allowed?

Is it valid to create an iterator to end(str)+1 for std::string?
And if it isn't, why isn't it?
This question is restricted to C++11 and later, because while pre-C++11 the data was already stored in a contiguous block in any but rare POC toy-implementations, the data didn't have to be stored that way.
And I think that might make all the difference.
The significant difference between std::string and any other standard container I speculate on is that it always contains one element more than its size, the zero-terminator, to fulfill the requirements of .c_str().
21.4.7.1 basic_string accessors [string.accessors]
const charT* c_str() const noexcept;
const charT* data() const noexcept;
1 Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
2 Complexity: Constant time.
3 Requires: The program shall not alter any of the values stored in the character array.
Still, even though it should imho guarantee that said expression is valid, for consistency and interoperability with zero-terminated strings if nothing else, the only paragraph I found casts doubt on that:
21.4.1 basic_string general requirements [string.require]
4 The char-like objects in a basic_string object shall be stored contiguously. That is, for any basic_string object s, the identity &*(s.begin() + n) == &*s.begin() + n shall hold for all values of n such that 0 <= n < s.size().
(All quotes are from C++14 final draft (n3936).)
Related: Legal to overwrite std::string's null terminator?
TL;DR: s.end() + 1 is undefined behavior.
std::string is a strange beast, mainly for historical reasons:
It attempts to bring C compatibility, where it is known that an additional \0 character exists beyond the length reported by strlen.
It was designed with an index-based interface.
As an afterthought, when merged in the Standard library with the rest of the STL code, an iterator-based interface was added.
This led std::string, in C++03, to number 103 member functions, and since then a few were added.
Therefore, discrepancies between the different methods should be expected.
Already in the index-based interface discrepancies appear:
§21.4.5 [string.access]
const_reference operator[](size_type pos) const;
reference operator[](size_type pos);
1/ Requires: pos <= size()
const_reference at(size_type pos) const;
reference at(size_type pos);
5/ Throws: out_of_range if pos >= size()
Yes, you read this right, s[s.size()] returns a reference to a NUL character while s.at(s.size()) throws an out_of_range exception. If anyone tells you to replace all uses of operator[] by at because they are safer, beware the string trap...
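The discrepancy is easy to observe. Here is a small sketch (C++11 or later) with two hypothetical helper functions, one per access method:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// operator[] accepts pos == size() and yields the terminating charT().
bool subscript_at_size_is_nul(const std::string& s) {
    return s[s.size()] == '\0';
}

// at() rejects pos == size() by throwing std::out_of_range.
bool at_size_throws(const std::string& s) {
    try {
        (void)s.at(s.size());
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}
```

Both helpers return true for any string, including an empty one, which is exactly the trap: the same index is valid for one accessor and an error for the other.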
So, what about iterators?
§21.4.3 [string.iterators]
iterator end() noexcept;
const_iterator end() const noexcept;
const_iterator cend() const noexcept;
2/ Returns: An iterator which is the past-the-end value.
Wonderfully bland.
So we have to refer to other paragraphs. A pointer is offered by
§21.4 [basic.string]
3/ The iterators supported by basic_string are random access iterators (24.2.7).
while §17.6 [requirements] seems devoid of anything related. Thus, string iterators are just plain old iterators (you can probably sense where this is going... but since we came this far let's go all the way).
This leads us to:
24.2.1 [iterator.requirements.general]
5/ Just as a regular pointer to an array guarantees that there is a pointer value pointing past the last element of the array, so for any iterator type there is an iterator value that points past the last element of a corresponding sequence. These values are called past-the-end values. Values of an iterator i for which the expression *i is defined are called dereferenceable. The library never assumes that past-the-end values are dereferenceable. [...]
So, *s.end() is ill-formed.
24.2.3 [input.iterators]
2/ Table 107 -- Input iterator requirements (in addition to Iterator)
Lists as a pre-condition for ++r and r++ that r be dereferenceable.
Neither the Forward iterators, Bidirectional iterators nor Random access iterators lift this restriction (and all indicate they inherit the restrictions of their predecessor).
Also, for completeness, in 24.2.7 [random.access.iterators], Table 111 -- Random access iterator requirements (in addition to bidirectional iterator) lists the following operational semantics:
r += n is equivalent to [inc|dec]rementing r n times
a + n and n + a are equivalent to copying a and then applying += n to the copy
and similarly for -= n and - n.
Thus s.end() + 1 is undefined behavior.
Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
std::string::operator[](size_type i) is specified to return "a reference to an object of type charT with value charT()" when i == size(), so we know that that pointer points to an object.
5.7 states that "For the purposes of [operators + and -], a pointer to a nonarray object behaves the same as a pointer to the first element of an array of length one with the type of the object as its element type."
So we have a non-array object and the spec guarantees that a pointer one past it will be representable. So we know std::addressof(*end(str)) + 1 has to be representable.
However that's not a guarantee on std::string::iterator, and there is no such guarantee anywhere in the spec, which makes it undefined behavior.
(Note that this is not the same as 'ill-formed'. *end(str) + 1 is in fact well-formed.)
Iterators can and do implement checking logic that produces various errors when you do things like increment the end() iterator. This is in fact what Visual Studio's debug iterators do with end(str) + 1.
#define _ITERATOR_DEBUG_LEVEL 2
#include <string>
#include <iterator>
int main() {
std::string s = "ssssssss";
auto x = std::end(s) + 1; // produces debug dialog, aborts program if skipped
}
And if it isn't, why isn't it?
for consistency and interoperability with zero-terminated strings if nothing else
C++ specifies some specific things for compatibility with C, but such backwards compatibility is limited to supporting things that can actually be written in C. C++ doesn't necessarily try to take C's semantics and make new constructs behave in some analogous way. Should std::vector decay to an iterator just to be consistent with C's array decay behavior?
I'd say end(str) + 1 is left as undefined behavior because there's no value in trying to constrain std::string iterators this way. There's no legacy C code that does this that C++ needs to be compatible with and new code should be prevented from doing it.
New code should be prevented from relying on it... why? [...] What does not allowing it buy you in theory, and how does that look in practice?
Not allowing it means implementations don't have to support the added complexity, complexity which provides zero demonstrated value.
In fact it seems to me that supporting end(str) + 1 has negative value since code that tries to use it will essentially be creating the same problem as C code which can't figure out when to account for the null terminator or not. C has enough off by one buffer size errors for both languages.
A std::basic_string<???> is a container over its elements. Its elements do not include the trailing null that is implicitly added (it can include embedded nulls).
This makes lots of sense -- "for each character in this string" probably shouldn't return the trailing '\0', as that is really an implementation detail for compatibility with C style APIs.
The iterator rules for containers were based off of containers that don't shove an extra element at the end. Modifying them for std::basic_string<???> without motivation is questionable; one should only break a working pattern if there is a payoff.
There is every reason to think that pointers to .data() and .data() + .size() + 1 are allowed (I could imagine a twisted interpretation of the standard that would make it not allowed). So if you really need read-only iterators into the contents of a std::string, you can use pointer-to-const-elements (which are, after all, a kind of iterator).
If you want editable ones, then no, there is no way to get a valid iterator to one-past-the-end. Neither can you get a non-const reference to the trailing null legally. In fact, such access is clearly a bad idea; if you change the value of that element, you break the std::basic_string's invariant null-termination.
For there to be an iterator to one-past-the-end, the const and non-const iterators to the container would have to have a different valid range, or a non-const iterator to the last element that can be dereferenced but not written to must exist.
I shudder at making such standard wording watertight.
std::basic_string is already a mess. Making it even stranger would lead to standard bugs and would have a non-trivial cost. The benefit is really low; in the few cases where you want access to said trailing null in an iterator range, you can use .data() and use the resulting pointers as iterators.
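As a sketch of that last suggestion, const pointers obtained from the string can serve as read-only iterators over the characters plus the trailing null. The function name with_terminator is hypothetical, and the code relies on the reading argued above, namely that forming the pointer one past the terminator is allowed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Copies the characters of s plus its terminating NUL,
// using const pointers as a read-only iterator range.
std::vector<char> with_terminator(const std::string& s) {
    const char* first = s.c_str();
    const char* last = first + s.size() + 1;   // one past the NUL; never dereferenced
    return std::vector<char>(first, last);
}
```

For "abc" the result holds four elements, the last being '\0', and no std::string iterator is ever advanced past end().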
I can't find a definitive answer, but indirect evidence points at end()+1 being undefined.
[string.insert]/15
constexpr iterator insert(const_iterator p, charT c);
Preconditions: p is a valid iterator on *this.
It would be unreasonable to expect this to work with end()+1 as the iterator, and it indeed causes a crash on both libstdc++ and libc++.
This means end()+1 is not a valid iterator, meaning end() is not incrementable.

Is string::c_str() no longer null terminated in C++11?

In C++11 basic_string::c_str is defined to be exactly the same as basic_string::data, which is in turn defined to be exactly the same as *(begin() + n) and *(&*begin() + n) (when 0 <= n < size()).
I cannot find anything that requires the string to always have a null character at its end.
Does this mean that c_str() is no longer guaranteed to produce a null-terminated string?
Strings are now required to use null-terminated buffers internally. Look at the definition of operator[] (21.4.5):
Requires: pos <= size().
Returns: *(begin() + pos) if pos < size(), otherwise a reference to an object of type T with value charT(); the referenced value shall not be modified.
Looking back at c_str (21.4.7.1/1), we see that it is defined in terms of operator[]:
Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
And both c_str and data are required to be O(1), so the implementation is effectively forced to use null-terminated buffers.
Additionally, as David Rodríguez - dribeas points out in the comments, the return value requirement also means that you can use &operator[](0) as a synonym for c_str(), so the terminating null character must lie in the same buffer (since *(p + size()) must be equal to charT()); this also means that even if the terminator is initialised lazily, it's not possible to observe the buffer in the intermediate state.
Well, in fact it is true that the new standard stipulates that .data() and .c_str() are now synonyms. However, it doesn't say that .c_str() is no longer zero-terminated :)
It just means that you can now rely on .data() being zero-terminated as well.
Paper N2668 defines c_str() and data() members of std::basic_string as follows:
const charT* c_str() const;
const charT* data() const;
Returns: A pointer to the initial element of an array of length size() + 1 whose first size() elements equal the corresponding elements of the string controlled by *this and whose last element is a null character specified by charT().
Requires: The program shall not alter any of the values stored in the character array.
Note that this does NOT mean that any valid std::string can be treated as a C-string because std::string can contain embedded nulls, which will prematurely end the C-string when used directly as a const char*.
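A short sketch of the embedded-null caveat: the C view of the string stops at the first null even though the std::string itself is longer. The helper name c_length is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Length as seen by C string functions, which stop at the first NUL.
std::size_t c_length(const std::string& s) {
    return std::strlen(s.c_str());
}
```

For a string built as std::string("ab\0cd", 5), size() reports 5 while c_length reports 2, so any C API receiving c_str() would silently see only "ab".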
Addendum:
I don't have access to the actual published final spec of C++11 but it appears that indeed the wording was dropped somewhere in the revision history of the spec: e.g. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2011/n3242.pdf
§ 21.4.7 basic_string string operations [string.ops]
§ 21.4.7.1 basic_string accessors [string.accessors]
const charT* c_str() const noexcept;
const charT* data() const noexcept;
Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
Complexity: constant time.
Requires: The program shall not alter any of the values stored in the character array.
The "history" was that a long time ago when everyone worked in single threads, or at least the threads were workers with their own data, they designed a string class for C++ which made string handling easier than it had been before, and they overloaded operator+ to concatenate strings.
The issue was that users would do something like:
s = s1 + s2 + s3 + s4;
and each concatenation would create a temporary which had to implement a string.
Therefore someone had the brainwave of "lazy evaluation" such that internally you could store some kind of "rope" with all the strings until someone wanted to read it as a C-string at which point you would change the internal representation to a contiguous buffer.
This solved the problem above but caused a load of other headaches, in particular in the multi-threaded world where one expected a .c_str() operation to be read-only / doesn't change anything and therefore no need to lock anything. Premature internal-locking in the class implementation just in case someone was doing it multi-threaded (when there wasn't even a threading standard) was also not a good idea. In fact it was more costly to do anything of this than simply copy the buffer each time. Same reason "copy on write" implementation was abandoned for string implementations.
Thus making .c_str() a truly immutable operation turned out to be the most sensible thing to do, however could one "rely" on it in a standard that now is thread-aware? Therefore the new standard decided to clearly state that you can, and thus the internal representation needs to hold the null terminator.
Well spotted. This is certainly a defect in the recently adopted standard; I'm sure that there was no intent to break all of the code currently using c_str. I would suggest a defect report, or at least asking the question in comp.std.c++ (which will usually end up before the committee if it concerns a defect).

Curious behaviour of std::string::operator[] in MSVC

I've been using some semi-iterators to tokenize a std::string, and I've run into a curious problem with operator[]. When constructing a new string from a position using char*, I've used something like the following:
t.begin = i;
t.end = i + 1;
t.contents = std::string(&arg.second[t.begin], &arg.second[t.end]);
where arg.second is a std::string. But if i is the position of the last character, then arg.second[t.end] will throw a debugging assertion, even though taking a pointer to one-past-the-end is well-defined behaviour and even common for primitive arrays, and since the constructor is being called using iterators I know that the end iterator will never be dereferenced. Doesn't it seem logical that arg.second[arg.second.size()] should be a valid expression, producing the equivalent of arg.second.end() as a char*?
You're not taking a pointer to one past the end, you're ACCESSING one past the end and then getting the address of that. That is entirely different: while the former is well defined and well formed, the latter is neither. I suggest using the iterator constructor, which is basically what you ARE using, but with iterators instead of char*. See Alexandre's comment.
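A minimal sketch of the iterator-based version, assuming begin and end are character positions as in the question. The function name token is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Builds the token [begin, end) from src without ever evaluating src[src.size()].
std::string token(const std::string& src, std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= src.size());
    typedef std::string::difference_type diff;
    return std::string(src.begin() + static_cast<diff>(begin),
                       src.begin() + static_cast<diff>(end));
}
```

When end equals src.size(), the second iterator is simply src.end(), which is valid to form and is never dereferenced by the constructor.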
operator[](size_type pos) const doesn't return one-past-the-end if pos == size(); it returns charT(), which is a temporary. In the non-const version of operator[], the behavior is undefined.
21.3.4/1
const_reference operator[](size_type pos) const;
reference operator[](size_type pos);
1 Returns: If pos < size(), returns data()[pos]. Otherwise, if pos == size(), the const version returns charT(). Otherwise, the behavior is undefined.
What is well-defined is creating an iterator one past the end. (Pointers might be iterators, too.) However, dereferencing such an iterator will yield Undefined Behavior.
Now, what you're doing is array subscripting, and that is very different from forming iterators, because it returns a reference to the referred-to object (much akin to dereferencing an iterator). You are certainly not allowed to access an array one-past-the-end.
std::string is not an array. It is an object, whose interface loosely resembles an array (namely, provides operator[]). But that's when the similarity ends.
Even if we for a second assume that std::string is just a wrapper built on top of an ordinary array, then in order to obtain the one-past-the-end pointer for the stored sequence, you have to do something like &arg.second[0] + t.end, i.e. instead of going through the std::string interface, first move into the domain of ordinary pointers and use ordinary low-level pointer arithmetic.
However, even that assumption is not correct and doing something like &arg.second[0] + t.end is a recipe for disaster. std::string is not guaranteed to store its controlled sequence as an array. It is not guaranteed to be stored contiguously, meaning that regardless of where your pointers point, you cannot assume that you'll be able to iterate from one to another by using pointer arithmetic.
If you want to use an std::string in some legacy pointer-based interface the only choice you have is to go through the std::string::c_str() method, which will generate a non-permanent array-based copy of the controlled sequence.
P.S. Note, BTW, that in the original C and C++ specifications it is illegal to use the &a[N] method to obtain the one-past-the-end pointer even for an ordinary built-in array. You always have to make sure that you are not using the [] operator with past-the-end index. The legal way to obtain the pointer has always been something like a + N or &a[0] + N, but not &a[N]. Recent changes legalized the &a[N] approach as well, but nevertheless originally it was not legal.
A string is not a primitive array, so I'd say the implementation is free to add some debug diagnostics if you are doing something dangerous like accessing elements outside its range. I would guess that a release build will probably work.
But...
For what you are trying to do, why not just use the basic_string( const basic_string& str, size_type index, size_type length ); constructor to create the sub strings?
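A small sketch of that constructor in use. The wrapper name slice is hypothetical; the constructor itself clamps the length to the end of the source string and throws std::out_of_range only when index is greater than src.size().

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Extracts up to `length` characters starting at `index` via the substring constructor.
std::string slice(const std::string& src, std::size_t index, std::size_t length) {
    return std::string(src, index, length);
}
```

In the tokenizer from the question this would read roughly as slice(arg.second, t.begin, t.end - t.begin), which involves neither pointers nor operator[].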