I've been using some semi-iterators to tokenize a std::string, and I've run into a curious problem with operator[]. When constructing a new string from a position using char*, I've used something like the following:
t.begin = i;
t.end = i + 1;
t.contents = std::string(&arg.second[t.begin], &arg.second[t.end]);
where arg.second is a std::string. But, if i is the position of the last character, then arg.second[t.end] will throw a debugging assertion- even though taking a pointer of one-past-the-end is well defined behaviour and even common for primitive arrays, and since the constructor is being called using iterators I know that the end iterator will never be de-referenced. Doesn't it seem logical that arg.second[arg.second.size()] should be a valid expression, producing the equivalent of arg.second.end() as a char*?
You're not taking a pointer to one past the end, you're ACCESSING one past the end and then getting the address of that. Entirely different and while the the former is well defined and well formed, the latter is not either. I suggest using the iterator constructor, which is basically what you ARE using but do so with iterators instead of char*. See Alexandre's comment.
operator[](size_type pos) const doesn't return one-past-the-end is pos == size(); it returns charT(), which is a temporary. In the non-const version of operator[], the behavior is undefined.
21.3.4/1
const_reference operator[](size_type pos) const;
reference operator[](size_type pos);
1 Returns: If pos < size(), returns data()[pos]. Otherwise, if pos == size(), the const
version returns charT(). Otherwise, the behavior is undefined.
What is well-defined is creating an iterator one past the end. (Pointers might be iterators, too.) However, dereferencing such an iterator will yield Undefined Behavior.
Now, what you're doing is array subscription, and that is very different from forming iterators, because it returns a reference to the referred-to object (much akin to dereferencing an iterator). You are certainly not allowed to access an array one-past-the-end.
std::string is not an array. It is an object, whose interface loosely resembles an array (namely, provides operator[]). But that's when the similarity ends.
Even if we for a second assume that std::string is just a wrapper built on top of an ordinary array, then in order to obtain the one-past-the-end pointer for the stored sequence, you have to do something like &arg.second[0] + t.end, i.e. instead of going through the std::string interface first move into into the domain of ordinary pointers and use ordinary low-level pointer arithmetic.
However, even that assumption is not correct and doing something like &arg.second[0] + t.end is a recipe for disaster. std::string is not guaranteed to store its controlled sequence as an array. It is not guaranteed to be stored continuously, meaning that regardless of where your pointers point, you cannot assume that you'll be able to iterate from one to another by using pointer arithmetic.
If you want to use an std::string in some legacy pointer-based interface the only choice you have is to go through the std::string::c_str() method, which will generate a non-permanent array-based copy of the controlled sequence.
P.S. Note, BTW, that in the original C and C++ specifications it is illegal to use the &a[N] method to obtain the one-past-the-end pointer even for an ordinary built-in array. You always have to make sure that you are not using the [] operator with past-the-end index. The legal way to obtain the pointer has always been something like a + N or &a[0] + N, but not &a[N]. Recent changes legalized the &a[N] approach as well, but nevertheless originally it was not legal.
A string is not a primitive array, so I'd say the implementation is free to add some debug diagnostics if you are doing something dangerous like accessing elements outside its range. I would guess that a release build will probably work.
But...
For what you are trying to do, why not just use the basic_string( const basic_string& str, size_type index, size_type length ); constructor to create the sub strings?
Related
So I have an std::string and have a function which takes char* and writes into it. Since std::string::c_str() and std::string::data() return const char*, I can't use them. So I was allocating a temporary buffer, calling a function with it and copying it into std::string.
Now I plan to work with big amount of information and copying this buffer will have a noticeable impact and I want to avoid it.
Some people suggested to use &str.front() or &str[0] but does it invoke the undefined behavior?
C++98/03
Impossible. String can be copy on write so it needs to handle all reads and writes.
C++11/14
In [string.require]:
The char-like objects in a basic_string object shall be stored contiguously. That is, for any basic_string
object s, the identity &*(s.begin() + n) == &*s.begin() + n shall hold for all values of n such that 0 <= n < s.size().
So &str.front() and &str[0] should work.
C++17
str.data(), &str.front() and &str[0] work.
Here it says:
charT* data() noexcept;
Returns: A pointer p such that p + i == &operator[](i) for each i in [0, size()].
Complexity: Constant time.
Requires: The program shall not alter the value stored at p + size().
The non-const .data() just works.
The recent draft has the following wording for .front():
const charT& front() const;
charT& front();
Requires: !empty().
Effects: Equivalent to operator[](0).
And the following for operator[]:
const_reference operator[](size_type pos) const;
reference operator[](size_type pos);
Requires: pos <= size().
Returns: *(begin() + pos) if pos < size(). Otherwise, returns a reference to an object of type charT with value charT(), where modifying the object leads to undefined behavior.
Throws: Nothing.
Complexity: Constant time.
So it uses iterator arithmetic. so we need to inspect the information about iterators. Here it says:
3 A basic_string is a contiguous container ([container.requirements.general]).
So we need to go here:
A contiguous container is a container that supports random access iterators ([random.access.iterators]) and whose member types iterator and const_iterator are contiguous iterators ([iterator.requirements.general]).
Then here:
Iterators that further satisfy the requirement that, for integral values n and dereferenceable iterator values a and (a + n), *(a + n) is equivalent to *(addressof(*a) + n), are called contiguous iterators.
Apparently, contiguous iterators are a C++17 feature which was added in these papers.
The requirement can be rewritten as:
assert(*(a + n) == *(&*a + n));
So, in the second part we dereference iterator, then take address of the value it points to, then do a pointer arithmetic on it, dereference it and it's the same as incrementing an iterator and then dereferencing it. This means that contiguous iterator points to the memory where each value stored right after the other, hence contiguous. Since functions that take char* expect contiguous memory, you can pass the result of &str.front() or &str[0] to these functions.
You can simply use &s[0] for a non-empty string. This gives you a pointer to the start of the buffer
When you use it to put a string of n characters there the string's length (not just the capacity) needs to be at least n beforehand, because there's no way to adjust it up without clobbering the data.
I.e., usage can go like this:
auto foo( int const n )
-> string
{
if( n <= 0 ) { return ""; }
string result( n, '#' ); // # is an arbitrary fill character.
int const n_stored = some_api_function( &result[0], n );
assert( n_stored <= n );
result.resize( n_stored );
return result;
}
This approach has worked formally since C++11. Before that, in C++98 and C++03, the buffer was not formally guaranteed to be contiguous. However, for the in-practice the approach has worked since C++98, the first standard – the reason that the contiguous buffer requirement could be adopted in C++11 (it was added in the Lillehammer meeting, I think that was 2005) was that there were no extant standard library implementations with a non-contiguous string buffer.
Regarding
” C++17 added added non-const data() to std::string but it still says that you can't modify the buffer.
I'm not aware of any such wording, and since that would defeat the purpose of non-const data() I doubt that this statement is correct.
Regarding
” Now I plan to work with big amount of information and copying this buffer will have a noticeable impact and I want to avoid it.
If copying the buffer has a noticeable impact, then you'd want to avoid inadvertently copying the std::string.
One way is to wrap it in a class that's not copyable.
I don't know what you intend to do with that string, but if
all you need is a buffer of chars which frees its own memory automatically,
then I usually use vector<char> or vector<int> or whatever type
of buffer you need.
With v being the vector, it's guaranteed that &v[0] points to
a sequential memory which you can use as a buffer.
Note: if you consider string::front() to be the same as &string[0] then the following is a redundant answer:
According to cplusplus: In C++98, you shouldn't write to .data() or .c_str(), they are to be treated as read-only/const:
A program shall not alter any of the characters in this sequence.
But in C++11 this warning was removed, but the return values are still const, so officially it isn't allowed in C++11 either. So to avoid undefined behavior, you can use string::front(), which:
If the string object is const-qualified, the function returns a const char&. Otherwise, it returns a char&.
So if your string isn't const, then you are officially allowed to manipulate the contents returned by string::front(), which is a reference to the first element of the buffer. But the link doesn't mention which C++ standard this applies to. I assume C++11 and later.
Also, it returns the first element, not a pointer, so you'll need to take its address. It's not clear whether you are officially allowed to use that as a const char* for the whole buffer, but in combination with other answers, I'm sure it's safe. Atleast it doesn't produce any compiler warnings.
Is it valid to create an iterator to end(str)+1 for std::string?
And if it isn't, why isn't it?
This question is restricted to C++11 and later, because while pre-C++11 the data was already stored in a continuous block in any but rare POC toy-implementations, the data didn't have to be stored that way.
And I think that might make all the difference.
The significant difference between std::string and any other standard container I speculate on is that it always contains one element more than its size, the zero-terminator, to fulfill the requirements of .c_str().
21.4.7.1 basic_string accessors [string.accessors]
const charT* c_str() const noexcept;
const charT* data() const noexcept;
1 Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
2 Complexity: Constant time.
3 Requires: The program shall not alter any of the values stored in the character array.
Still, even though it should imho guarantee that said expression is valid, for consistency and interoperability with zero-terminated strings if nothing else, the only paragraph I found casts doubt on that:
21.4.1 basic_string general requirements [string.require]
4 The char-like objects in a basic_string object shall be stored contiguously. That is, for any basic_string object s, the identity &*(s.begin() + n) == &*s.begin() + n shall hold for all values of n such that 0 <= n < s.size().
(All quotes are from C++14 final draft (n3936).)
Related: Legal to overwrite std::string's null terminator?
TL;DR: s.end() + 1 is undefined behavior.
std::string is a strange beast, mainly for historical reasons:
It attempts to bring C compatibility, where it is known that an additional \0 character exists beyond the length reported by strlen.
It was designed with an index-based interface.
As an after thought, when merged in the Standard library with the rest of the STL code, an iterator-based interface was added.
This led std::string, in C++03, to number 103 member functions, and since then a few were added.
Therefore, discrepancies between the different methods should be expected.
Already in the index-based interface discrepancies appear:
§21.4.5 [string.access]
const_reference operator[](size_type pos) const;
reference operator[](size_type pos);
1/ Requires: pos <= size()
const_reference at(size_type pos) const;
reference at(size_type pos);
5/ Throws: out_of_range if pos >= size()
Yes, you read this right, s[s.size()] returns a reference to a NUL character while s.at(s.size()) throws an out_of_range exception. If anyone tells you to replace all uses of operator[] by at because they are safer, beware the string trap...
So, what about iterators?
§21.4.3 [string.iterators]
iterator end() noexcept;
const_iterator end() const noexcept;
const_iterator cend() const noexcept;
2/ Returns: An iterator which is the past-the-end value.
Wonderfully bland.
So we have to refer to other paragraphs. A pointer is offered by
§21.4 [basic.string]
3/ The iterators supported by basic_string are random access iterators (24.2.7).
while §17.6 [requirements] seems devoid of anything related. Thus, strings iterators are just plain old iterators (you can probably sense where this is going... but since we came this far let's go all the way).
This leads us to:
24.2.1 [iterator.requirements.general]
5/ Just as a regular pointer to an array guarantees that there is a pointer value pointing past the last element of the array, so for any iterator type there is an iterator value that points past the last element of a corresponding sequence. These values are called past-the-end values. Values of an iterator i for which the expression *i is defined are called dereferenceable. The library never assumes that past-the-end values are dereferenceable. [...]
So, *s.end() is ill-formed.
24.2.3 [input.iterators]
2/ Table 107 -- Input iterator requirements (in addition to Iterator)
List for pre-condition to ++r and r++ that r be dereferencable.
Neither the Forward iterators, Bidirectional iterators nor Random iterator lift this restriction (and all indicate they inherit the restrictions of their predecessor).
Also, for completeness, in 24.2.7 [random.access.iterators], Table 111 -- Random access iterator requirements (in addition to bidirectional iterator) lists the following operational semantics:
r += n is equivalent to [inc|dec]rememting r n times
a + n and n + a are equivalent to copying a and then applying += n to the copy
and similarly for -= n and - n.
Thus s.end() + 1 is undefined behavior.
Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
std::string::operator[](size_type i) is specified to return "a reference to an object of type charT with value charT() when i == size(), so we know that that pointer points to an object.
5.7 states that "For the purposes of [operators + and -], a pointer to a nonarray object behaves the same as a pointer to the first element of an array of length one with the type of the object as its element type."
So we have a non-array object and the spec guarantees that a pointer one past it will be representable. So we know std::addressof(*end(str)) + 1 has to be representable.
However that's not a guarantee on std::string::iterator, and there is no such guarantee anywhere in the spec, which makes it undefined behavior.
(Note that this is not the same as 'ill-formed'. *end(str) + 1 is in fact well-formed.)
Iterators can and do implement checking logic that produce various errors when you do things like increment the end() iterator. This is in fact what Visual Studios debug iterators do with end(str) + 1.
#define _ITERATOR_DEBUG_LEVEL 2
#include <string>
#include <iterator>
int main() {
std::string s = "ssssssss";
auto x = std::end(s) + 1; // produces debug dialog, aborts program if skipped
}
And if it isn't, why isn't it?
for consistency and interoperability with zero-terminated strings if nothing else
C++ specifies some specific things for compatibility with C, but such backwards compatibility is limited to supporting things that can actually be written in C. C++ doesn't necessarily try to take C's semantics and make new constructs behave in some analogous way. Should std::vector decay to an iterator just to be consistent with C's array decay behavior?
I'd say end(std) + 1 is left as undefined behavior because there's no value in trying to constrain std::string iterators this way. There's no legacy C code that does this that C++ needs to be compatible with and new code should be prevented from doing it.
New code should be prevented from relying on it... why? [...] What does not allowing it buy you in theory, and how does that look in practice?
Not allowing it means implementations don't have to support the added complexity, complexity which provides zero demonstrated value.
In fact it seems to me that supporting end(str) + 1 has negative value since code that tries to use it will essentially be creating the same problem as C code which can't figure out when to account for the null terminator or not. C has enough off by one buffer size errors for both languages.
A std::basic_string<???> is a container over its elements. Its elements do not include the trailing null that is implicitly added (it can include embedded nulls).
This makes lots of sense -- "for each character in this string" probably shouldn't return the trailing '\0', as that is really an implementation detail for compatibility with C style APIs.
The iterator rules for containers were based off of containers that don't shove an extra element at the end. Modifying them for std::basic_string<???> without motivation is questionable; one should only break a working pattern if there is a payoff.
There is every reason to think that pointers to .data() and .data() + .size() + 1 are allowed (I could imagine a twisted interpretation of the standard that would make it not allowed). So if you really need read-only iterators into the contents of a std::string, you can use pointer-to-const-elements (which are, after all, a kind of iterator).
If you want editable ones, then no, there is no way to get a valid iterator to one-past-the-end. Neither can you get a non-const reference to the trailing null legally. In fact, such access is clearly a bad idea; if you change the value of that element, you break the std::basic_string's invariant null-termination.
For there to be an iterator to one-past-the-end, the const and non-const iterators to the container would have to have a different valid range, or a non-const iterator to the last element that can be dereferenced but not written to must exist.
I shudder at making such standard wording watertight.
std::basic_string is already a mess. Making it even stranger would lead to standard bugs and would have a non-trivial cost. The benefit is really low; in the few cases where you want access to said trailing null in an iterator range, you can use .data() and use the resulting pointers as iterators.
I can't find a definitive answer, but indirect evidence points at end()+1 being undefined.
[string.insert]/15
constexpr iterator insert(const_iterator p, charT c);
Preconditions: p is a valid iterator on *this.
It would be unreasonable to expect this to work with end()+1 as the iterator, and it indeed causes a crash on both libstdc++ and libc++.
This means end()+1 is not a valid iterator, meaning end() is not incrementable.
If I have the end iterator to a container, but I want to get a raw pointer to that is there a way to accomplish this?
Say I have a container: foo. I cannot for example do this: &*foo.end() because it yields the runtime error:
Vector iterator not dereferencable
I can do this but I was hoping for a cleaner way to get there: &*foo.begin() + foo.size().
EDIT:
This is not a question about how to convert an iterator to a pointer in general (obviously that's in the question), but how to specifically convert the end iterator to a pointer. The answers in the "duplicate" question actually suggest dereferencing the iterator. The end iterator cannot be dereferenced without seg-faulting.
The correct way to access the end of storage is:
v.data() + v.size()
This is because *v.begin() is invalid when v is empty.
The member function data is provided for all contiguous containers (vector, string and array).
From C++17 you will also be able to use the non-member functions:
data(v) + size(v)
This works on raw arrays as well.
In general? No.
And the fact that you're asking indicates that something is wrong with your overall design.
For vectors, arrays, strings? Sure… but why?
Just get a pointer to a valid element, and advance it:
std::vector<T> foo;
const T* ptr = foo.data() + foo.size();
As long as you don't dereference such a pointer (which is almost equivalent to dereferencing the iterator, as you did in your attempt) it is valid to obtain and hold such a pointer, because it points to the special one-past-the-end location.
Note that &foo[0] + foo.size() has undefined behaviour if the vector is empty, because &foo[0] is &*(foo.data() + 0) is &*foo.data(), and (just like in your attempt) *foo.data() is disallowed if there's nothing there. So we avoid all dereferencing and simply advance foo.data() itself.
Anyway, this only works for the case of vectors1, arrays and strings, though. Other containers do not guarantee (or can be reasonably expected to provide) storage contiguity; their end pointers could be almost anything, e.g. a "sentinel" null pointer, which is unlikely to be of any use to you.
That is why the iterator abstraction is there in the first place. Stick to it if you can, instead of delving into raw pointer usage.
1. Excepting std::vector<bool>.
I've been trying to understand c++ by going over some projects and i came across this:
vector<Circle>::iterator circleIt = mCircles.end();
..
mCurrentDragCircle = &(*circleIt);
Why would you dereference and then reference it again?
The *-operator is overloaded for the iterator class. It doesn't do a simple dereference. Instead, it returns a reference to the variable it's currently pointing at. Using the reference operator on this returns a pointer towards the variable.
Using mCurrentDragCircle = circleIt; would assign the iterator to your field.
circleIt is an iterator, and iterators overload operator* so that they look like pointers. Dereferencing an iterator gives a reference to the value; & turns that into a pointer to the value. So this is converting an iterator to a pointer.
Btw., dereferencing past-the-end, i.e. *v.end(), has undefined behavior. The proper way to do this is
v.data() + v.size()
in C++11, or
&v[0] + v.size()
in C++98. Both assume the vector is non-empty.
Because of how iterators works.
*circleIt
will return instance of Circle, e.g. you can do
Circle & c = *circleIt;
Than you take address of that circle and store it into a pointer
Circle * mCurrentDragCircle = & (*circleIt);
Hope this helps.
&* is a particularly pernicious idiom for extracting an underlying pointer. It's frequently used on objects that have an overloaded dereference operator *.
One such class of objects are iterators. The author of your class is attempting to get the address of the underlying datum for some reason. (Two pitfalls here, (i) *end() gives undefined behaviour and (ii) it appears that the author is storing the pointer value; blissfully unaware that modifications on that container could invalidate that value).
You can also see it used with smart pointers to circumvent reference counting.
My advice: avoid if at all possible.
According to C++ standard (3.7.3.2/4) using (not only dereferencing, but also copying, casting, whatever else) an invalid pointer is undefined behavior (in case of doubt also see this question). Now the typical code to traverse an STL containter looks like this:
std::vector<int> toTraverse;
//populate the vector
for( std::vector<int>::iterator it = toTraverse.begin(); it != toTraverse.end(); ++it ) {
//process( *it );
}
std::vector::end() is an iterator onto the hypothetic element beyond the last element of the containter. There's no element there, therefore using a pointer through that iterator is undefined behavior.
Now how does the != end() work then? I mean in order to do the comparison an iterator needs to be constructed wrapping an invalid address and then that invalid address will have to be used in a comparison which again is undefined behavior. Is such comparison legal and why?
The only requirement for end() is that ++(--end()) == end(). The end() could simply be a special state the iterator is in. There is no reason the end() iterator has to correspond to a pointer of any kind.
Besides, even if it were a pointer, comparing two pointers doesn't require any sort of dereference anyway. Consider the following:
char[5] a = {'a', 'b', 'c', 'd', 'e'};
char* end = a+5;
for (char* it = a; it != a+5; ++it);
That code will work just fine, and it mirrors your vector code.
You're right that an invalid pointer can't be used, but you're wrong that a pointer to an element one past the last element in an array is an invalid pointer - it's valid.
The C standard, section 6.5.6.8 says that it's well defined and valid:
...if the expression P points to the
last element of an array object, the
expression (P)+1 points one past the
last element of the array object...
but cannot be dereferenced:
...if the result points one past the
last element of the array object, it
shall not be used as the operand of a
unary * operator that is evaluated...
One past the end is not an invalid value (neither with regular arrays or iterators). You can't dereference it but it can be used for comparisons.
std::vector<X>::iterator it;
This is a singular iterator. You can only assign a valid iterator to it.
std::vector<X>::iterator it = vec.end();
This is a perfectly valid iterator. You can't dereference it but you can use it for comparisons and decrement it (assuming the container has a sufficient size).
Huh? There's no rule that says that iterators need to be implemented using nothing but a pointer.
It could have a boolean flag in there, which gets set when the increment operation sees that it passes the end of the valid data, for instance.
The implementation of a standard library's container's end() iterator is, well, implementation-defined, so the implementation can play tricks it knows the platform to support.
If you implemented your own iterators, you can do whatever you want - so long as it is standard-conform. For example, your iterator, if storing a pointer, could store a NULL pointer to indicate an end iterator. Or it could contain a boolean flag or whatnot.
I answer here since other answers are now out-of-date; nevertheless, they were not quite right to the question.
First, C++14 has changed the rules mentioned in the question. Indirection through an invalid pointer value or passing an invalid pointer value to a deallocation function are still undefined, but other operations are now implemenatation-defined, see Documentation of "invalid pointer value" conversion in C++ implementations.
Second, words matter. You can't bypass the definitions while applying the rules. The key point here is the definition of "invalid". For iterators, this is defined in [iterator.requirements]. Though pointers are iterators, meanings of "invalid" to them are subtly different. Rules for pointers render "invalid" as "don't indirect through invalid value", which is a special case of "not dereferenceable" to iterators; however, "not deferenceable" is not implying "invalid" for iterators. "Invalid" is explicitly defined as "may be singular", while "singular" value is defined as "not associated with any sequence" (in the same paragraph of definition of "dereferenceable"). That paragraph even explicitly defined "past-the-end values".
From the text of the standard in [iterator.requirements], it is clear that:
Past-the-end values are not assumed to be dereferenceable (at least by the standard library), as the standard states.
Dereferenceable values are not singular, since they are associated with sequence.
Past-the-end values are not singular, since they are associated with sequence.
An iterator is not invalid if it is definitely not singular (by negation on definition of "invalid iterator"). In other words, if an iterator is associated to a sequence, it is not invalid.
Value of end() is a past-the-end value, which is associated with a sequence before it is invalidated. So it is actually valid by definition. Even with misconception on "invalid" literally, the rules of pointers are not applicable here.
The rules allowing == comparison on such values are in input iterator requirements, which is inherited by some other category of iterators (forward, bidirectional, etc). More specifically, valid iterators are required to be comparable in the domain of the iterator in such way (==). Further, forward iterator requirements specifies the domain is over the underlying sequence. And container requirements specifies the iterator and const_iterator member types in any iterator category meets forward iterator requirements. Thus, == on end() and iterator over same container is required to be well-defined. As a standard container, vector<int> also obey the requirements. That's the whole story.
Third, even when end() is a pointer value (this is likely to happen with optimized implementation of iterator of vector instance), the rules in the question are still not applicable. The reason is mentioned above (and in some other answers): "invalid" is concerned with *(indirect through), not comparison. One-past-end value is explicitly allowed to be compared in specified ways by the standard. Also note ISO C++ is not ISO C, they also subtly mismatches (e.g. for < on pointer values not in the same array, unspecified vs. undefined), though they have similar rules here.
Simple. Iterators aren't (necessarily) pointers.
They have some similarities (i.e. you can dereference them), but that's about it.
Besides what was already said (iterators need not be pointers), I'd like to point out the rule you cite
According to C++ standard (3.7.3.2/4)
using (not only dereferencing, but
also copying, casting, whatever else)
an invalid pointer is undefined
behavior
wouldn't apply to end() iterator anyway. Basically, when you have an array, all the pointers to its elements, plus one pointer past-the-end, plus one pointer before the start of the array, are valid. That means:
int arr[5];
int *p=0;
p==arr+4; // OK
p==arr+5; // past-the-end, but OK
p==arr-1; // also OK
p==arr+123456; // not OK, according to your rule