What is iterator invalidation? - c++

I see it referenced a lot, but I haven't found a clear answer on what exactly it is. My experience is with higher-level languages, so I'm unfamiliar with the idea of invalidity in a collections framework.
What is iterator invalidation?
Why does it come up? Why is it difficult to deal with?

Iterators are glorified pointers. Iterator invalidation is a lot like pointer invalidation; it means it suddenly points to junk data.
Because it's very natural but wrong to do things like this:
for (auto it = map.begin(); it != map.end(); ++it) {
    map.erase(it->first);
    // whoops, now the map has been restructured and the iterator
    // still thinks it is healthy
}
Because that error right there? No compiler error, no warning, you lose. You just have to be trained well enough to watch for these bugs and prevent them. Very insidious bugs if you don't know what you're doing. One of the design philosophies of C++ is speed over safety: in the view of the C++ language designers, the runtime check that would turn iterator invalidation into an exception instead of undefined behavior is too expensive.
You should be on high alert whenever you are iterating over a data structure and modifying the structure itself, rather than merely the objects held in it. At that point you should probably run to the documentation and check whether the operation is legal.
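For reference, a minimal sketch of the safe idiom (mine, not part of the original answer, assuming C++11 and an illustrative "remove entries whose value is 0" condition): since C++11, std::map::erase returns the iterator following the erased element, so the loop never touches an invalidated iterator.
#include <map>

void erase_zero_values(std::map<int, int>& m) {
    for (auto it = m.begin(); it != m.end(); /* no ++ here */) {
        if (it->second == 0)
            it = m.erase(it);   // erase returns the next valid iterator (C++11)
        else
            ++it;               // only advance when nothing was erased
    }
}
In an associative container, erase only invalidates iterators to the erased element, so the iterator returned by erase is safe to keep using.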

Iterator invalidation is what happens when an iterator (an object supporting the operators ++ and *) no longer correctly represents the state of the object it is iterating over. For example:
int *my_array = new int[15];
int *my_iterator = &my_array[2];
delete[] my_array;
std::for_each(my_iterator, my_iterator + 5, ...); // invalid
That results in undefined behavior because the memory the iterator points to has been freed and may have been reclaimed by the OS.
This is only one scenario, however; many other things cause an iterator to be 'invalidated', and you must be careful to check the documentation of the objects you are using.
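To illustrate why the documentation matters, here is a hedged sketch (mine, not part of the original answer): the standard guarantees that inserting into a std::list leaves existing iterators valid, while inserting into a std::vector may invalidate every iterator if the vector has to reallocate.
#include <list>
#include <vector>

int main() {
    std::list<int> l = {1, 2, 3};
    auto lit = l.begin();
    l.push_back(4);          // list insertion never invalidates iterators
    int ok = *lit;           // fine, lit still points at 1

    std::vector<int> v = {1, 2, 3};
    auto vit = v.begin();
    v.push_back(4);          // may reallocate: if it does, vit is invalidated
    // int bad = *vit;       // potentially undefined behavior, do not do this
    (void)ok;
}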

The problem occurs when a container that is being processed using an iterator has its shape changed during the process. (We will assume a single-threaded application; concurrent access to a mutable container is a whole 'nother can of worms which we won't get into on this page.) By "having its shape changed", we mean one of the following types of mutation:
An insertion into the container (at any location)
Deletion of an element from the container
Any operation that changes a key (in an AssociativeContainer)
Any operation which changes the order of the elements in a sorted container
Any more complicated operation consisting of one or more of the above (such as splitting a container into two)
(From: http://c2.com/cgi/wiki?IteratorInvalidationProblem)
The concept is actually fairly simple, but the side effects can be quite annoying. I would add that this problem affects not only C/C++ but slews of other low- and mid-level languages as well (in some cases, even if they don't allow direct heap allocation).

Related

Preventing getting the address of items in a container in C++

I'm writing a C++ array-like container that can dynamically grow and shrink. I'd like to prevent users of this container from taking the address of its items, because the items might move when the container needs to reallocate itself. The only correct way of using this container will be to keep track of the address of the container and the index of each item. (Yes, I'll be specifying this in the documentation, but it would be better if I could make the compiler trigger an error when a user of the container tries to get the address of an item.)
Can this be done somehow? I searched and found some questions about making the "address-of operator" private, but that doesn't seem to be guaranteed to work, nor is it recommended practice. So I wonder if there is any alternative technique for preventing access to pointers to items...
In C++ all memory allocated to the program is more or less fair game, so there is no real way to prevent users of your array type from obtaining or calculating addresses within your array.
The STL even makes a lot of effort to guarantee that iterators do not suddenly become invalid because of this.
In the real world you write in the API description that it is very unwise to work with addresses within your container because they could become invalid at any time, and then it is the responsibility of the users to follow that rule, IMHO.
What you want to do is not easily possible (if at all).
If your container invalidates pointers to elements in certain situations, that does not make your container "special". Let's consider what other containers do to mitigate this problem.
std::vector:
// THIS CODE IS BROKEN !! DO NOT WRITE CODE LIKE THIS !!
std::vector<int> x(24,0);
int* ptr = &x[0];
auto it = x.begin();
x.push_back(42);   // may reallocate: after this, assume ptr and it are invalid
x.insert(it, 100); // *
*ptr = 5;
*it = 7;
Starting from the line marked with * everything here uses an invalid pointer or iterator. From cppreference, push_back:
If the new size() is greater than capacity() then all iterators and references (including the past-the-end iterator) are invalidated. Otherwise only the past-the-end iterator is invalidated.
Actually x.insert(it, 100); might be using a valid iterator, but the code does not check whether the push_back had to increase capacity, so one has to assume that it and ptr are invalid after the call to push_back.
insert:
Causes reallocation if the new size() is greater than the old capacity(). If the new size() is greater than capacity(), all iterators and references are invalidated. Otherwise, only the iterators and references before the insertion point remain valid. The past-the-end iterator is also invalidated.
Users of standard containers must be aware of iterator invalidation rules (see Iterator invalidation rules) or they will write horribly broken code like the one above.
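For contrast, here is a hedged sketch (mine, not from the original answer) of how the same operations can be written so that no invalidated pointer or iterator is ever used: either reserve capacity up front, or re-acquire the pointer/iterator after every mutation.
#include <vector>

void fixed_example() {
    std::vector<int> x(24, 0);
    x.reserve(x.size() + 1);   // guarantee the next push_back will not reallocate
    int* ptr = &x[0];
    x.push_back(42);           // capacity suffices: ptr (and non-end iterators) stay valid
    *ptr = 5;

    x.insert(x.begin(), 100);  // insertion at the front may invalidate everything: re-acquire
    auto it = x.begin();
    *it = 7;
}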
In general, you cannot protect yourself from all mistakes a user can possibly make. Document pre- and post-conditions and if a user ignores them they just get what they deserve.
Note that you could try to overload the & operator, but there is no way you can prevent someone from getting the address via std::addressof, which is made exactly for that: getting the address of an object even when the object tries to prevent it by overloading &.
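A minimal sketch of that point (my illustration, with a hypothetical Item type): deleting or overloading operator& only stops the obvious spelling; std::addressof still works.
#include <memory>   // std::addressof

struct Item {
    int value = 0;
    Item* operator&() = delete;   // forbids the plain &item spelling
};

void demo() {
    Item item;
    // Item* p1 = &item;                  // error: operator& is deleted
    Item* p2 = std::addressof(item);      // still compiles, by design
    p2->value = 42;
}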

STL iterator revalidation for end (past-the-end) iterator?

See related questions on past-the-end iterator invalidation: this, this.
This is more a question of design, namely, is there (in STL or elsewhere) such concept as past-the-end iterator "revalidation"?
What I mean by this, and use case: suppose an algorithm needs to "tail" a container (such as a queue). It traverses the container until end() is reached, then pauses; independently from this, another part of the program enqueues more items in the queue. How is it possible for the algorithm to (EDIT) efficiently tell, "have more items been enqueued" while holding the previously past-the-end iterator (call it tailIt)? (this would imply it is able to check if tailIt == container.end() still, and if that is false, conclude tailIt is now valid and points to the first element that was inserted).
Please don't dismiss the question as "no, there isn't" - I'm looking to form judgment around how to design some logic in an idiomatic way, and have many options (in fact the iterators in question are to a hand-built data structure for which I can provide this property - end() revalidation - but I would like to judge if it is a good idea).
EDIT: made it clear we have the iterator tailIt and a reference to container. A trivial workaround for what I'm trying to do is to also remember count := how many items you have processed, then check whether container.size() == count still holds, and if not, seek to container[count] and continue processing from there. This comes with many disadvantages (extra state, assumption that the container doesn't pop from the front (!), random access for efficient seeking).
Not in general. Here are some issues with your idea:
Some past-the-end iterators don't "point" to the data block at all; in fact this will be true of any iterator except a vector iterator. So, overall, an extant end-iterator just is never going to become a valid iterator to data;
Iterators often become invalidated when the container changes — while this isn't always true, it also precludes a general solution that relies on dereferencing some iterator from before the mutation;
Iterator validity is non-observable — you already need to know, before you dereference an iterator, whether or not it is valid. This is information that comes from elsewhere, usually your brain… by that I mean the developer must read the code and make a determination based on its structure and flow.
Put all these together and it is clear that the end iterator simply cannot be used this way as the iterator interface is currently designed. Iterators refer to data in a range, not to a container; it stands to reason, then, that they hold no information about a container, and if the container causes the range to change there's no entity that the iterator knows about that it can ask to find this out.
Is the described logic possible to create? Certainly! But with a different iterator interface (and support from the container). You could wrap the container in your own class type to do this. However, I advise against making things that look like standard iterators but behave differently; this will be very confusing.
Instead, encapsulate the container and provide your own wrapper function that can directly perform whatever post-enqueuement action you feel you need. You shouldn't need to watch the state of the end iterator to achieve your goal.
In the case for a std::queue, no there isn't (heh). Not because the iterators for a queue get invalidated once something is pushed, but because a queue doesn't have any iterators at all.
As for other iterator types, most (if not all) of them don't keep a reference to the container holder (the managing object containing all the info about the underlying data), which is a trade-off of flexibility for efficiency. (I quickly checked the implementation of gcc's std::vector::iterator.) It is possible to write an iterator type that keeps a reference to the holder during its lifetime; that way the iterators never have to be invalidated! (Unless the holder is std::move'd.)
Now to throw in my professional opinion, I wouldn't mind seeing a safe_iterator/flex_iterator for cases where the iterator normally would be invalidated during iterations.
Possible user interface:
for (auto v : make_flex_iterator(my_vector)) {
    if (some_outside_condition()) {
        // Normally any iterators into the vector would be invalidated at this point
        // (only if it resized, but you should always assume a resize)
        my_vector.push_back("hello world!");
    }
}
Literally revalidating iterators might be too complex to build for its use case (I wouldn't know where to begin), but designing an iterator which simply never invalidates is quite trivial, with only as much overhead as a for (size_t i = 0; i < c.size(); i++) loop. With that said, I cannot assure you how well the compiler will optimize (loop unrolling, etc.) with these iterators, though I assume it will still do quite a good job.
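As a rough sketch of what such a never-invalidating iterator could look like (following the hypothetical make_flex_iterator interface above; flex_iterator and flex_range are made-up names, not an existing library): keep a pointer to the container plus an index, and re-derive the element on every dereference.
#include <cstddef>
#include <vector>

// Hypothetical index-based iterator: it survives reallocation because it
// goes back through the container on every dereference.
template <typename Container>
class flex_iterator {
public:
    flex_iterator(Container& c, std::size_t i) : c_(&c), i_(i) {}
    typename Container::reference operator*() const { return (*c_)[i_]; }
    flex_iterator& operator++() { ++i_; return *this; }
    bool operator!=(const flex_iterator& other) const { return i_ != other.i_; }
private:
    Container* c_;
    std::size_t i_;
};

template <typename Container>
struct flex_range {
    Container& c;
    flex_iterator<Container> begin() { return {c, 0}; }
    flex_iterator<Container> end()   { return {c, c.size()}; }
};

template <typename Container>
flex_range<Container> make_flex_iterator(Container& c) { return {c}; }
The trade-off: every dereference goes back through the container, and the end index is fixed when the loop starts, so elements appended during the loop are not visited; whether that is acceptable depends on the use case.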

How come some libstdc++ iterators have operator++ but no operator+?

I just noticed that for the iterator class std::__detail::_Node_iterator (in GCC's libstdc++, source here), we have an operator++() but no operator+(), so you can use (my_set.cbegin()++)++ but you can't use my_set.cbegin() + 2.
Why is that? Is it just lack of syntactic sugar or is there a deeper reason?
The standard doesn't provide an operator+ when that would be O(n), and would surprise many users.
It does provide a function std::advance you can use, if you are prepared to pay the cost.
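For instance (my example, not from the answer), with a node-based container such as std::set you spell the linear jump explicitly:
#include <iterator>
#include <set>

void demo() {
    std::set<int> s = {1, 2, 3, 4, 5};

    // s.begin() + 2 does not compile: set iterators are bidirectional, not random-access.
    auto it = s.begin();
    std::advance(it, 2);                  // O(n): steps the iterator forward twice
    auto it2 = std::next(s.begin(), 2);   // same thing, returns a new iterator (C++11)
    (void)it; (void)it2;
}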
Why is that? Is it just lack of syntactic sugar or is there a deeper reason?
Understanding that this might be slightly speculative on my part (I didn't design/implement iterators), my thought is that an iterator is meant to traverse a collection with some degree of safety. Writing something like iterator++ will safely take you from an existing element in the collection to the next, or leave you at the past-the-end position once you move beyond the last element.
It's also shorter than writing iterator = iterator + 1 or iterator += 1 and this might be a key reason for limiting it to ++ only. Adding all of them would seem to be redundant and unnecessary.
Quoting from the C++ Reference (my emphasis)
An iterator is any object that, pointing to some element in a range of elements (such as an array or a container), has the ability to iterate through the elements of that range using a set of operators (with at least the increment (++) and dereference (*) operators).
Based on this, it seems to be an intentional architectural decision to keep the client requirements for implementing iterators to a bare minimum. Note that, based on the wording above (i.e. "at least"), there seems to be no technical reason why someone couldn't add support for other operators, including comparison operators other than !=. Whether they should is probably another discussion.
In addition, iterators generally traverse all objects within some range sequentially, so allowing iterator + X would seem to go against its intended purpose in this sense.
For example, if you do iterator + 2, how would you know that you're not asking for more than what the collection really has left from the iterator's current position? You could be trying to go past the end of the collection and waiting for a segmentation fault, or iterators would need to start throwing exceptions. The iterator's ability to safely traverse the collection without going out of bounds is a benefit you'd be losing here, IMHO.
An iterator is only meant to shield the clients from the implementation details of the collection. That is, a client doesn't need to know if the collection is implemented as an array, some kind of linked list, or some kind of tree. It also doesn't need to keep count of how many items the collection has, which again, makes it easier to implement and work with. (It has a single responsibility.)
Based on this, and keeping the previous 'safety' and minimalistic requirement details in mind, the decision to avoid implementing every possible operator an object could support makes sense to me.

Is it possible to store an iterator?

For example, say I have a const_iterator:
QHash<const QString, QPair<const Node, double> >::const_iterator citer = adjNodeHash.begin();
Can I then store citer in a data structure (containing many iterators) and re-use it later, with it still referring to the same place I left off the next time I use it? (assuming I update it accordingly/use a reference to it when I am incrementing it)
I ask this because I have used this approach yet am getting some undefined behaviour and am wondering if this is the culprit.
Any help would be much appreciated.
Iterator invalidation rules for std containers are described in the standard. QHash will also have some iterator invalidation rules in its documentation (hopefully!).
A stored iterator remains valid until invalidated. Most hash maps invalidate their iterators when they "rehash", which happens when they grow past a certain bound.
In practice, it is probably a bad idea to store an iterator into a hash map over a period in which elements are added or removed from it. Maintaining that iterator as valid will take constant maintenance and error checking, adding overhead to every use of that hash map; any errors that develop may not show up immediately, and when they do, the failure won't occur near the spot where the mistake was made.
On top of that, if you ever swap out what hash container you are using, the details of the iterator invalidation rules are going to be different. This makes refactoring in the future more painful.
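If the container may change between uses, a more robust pattern (my sketch, shown with std::unordered_map rather than QHash) is to store the key and look the element up again when you need it, instead of holding an iterator across mutations:
#include <string>
#include <unordered_map>

void demo() {
    std::unordered_map<std::string, double> weights;
    weights["edge-a"] = 1.5;

    // Store the key, not an iterator: keys stay meaningful across rehashing.
    std::string saved_key = "edge-a";

    weights["edge-b"] = 2.5;             // may rehash and invalidate iterators

    auto it = weights.find(saved_key);   // re-acquire a fresh, valid iterator
    if (it != weights.end())
        it->second += 1.0;
}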

What does enabling STL iterator debugging really do?

I've enabled iterator debugging in an application by defining
_HAS_ITERATOR_DEBUGGING = 1
I was expecting this to really just check vector bounds, but I have a feeling it's doing a lot more than that. What checks, etc are actually being performed?
Dinkumware STL, by the way.
There are a number of operations on iterators which lead to undefined behavior; the goal of this switch is to activate runtime checks that catch them before they occur (using asserts).
The issue
The obvious operation is to use an invalid iterator, but this invalidity may arise for various reasons:
Uninitialized iterator
Iterator to an element that has been erased
Iterator to an element whose physical location has changed (reallocation for a vector)
Iterator outside of [begin, end)
The standard specifies in excruciating detail, for each container, which operation invalidates which iterators.
There is also a somewhat less obvious case that people tend to forget: mixing iterators to different containers:
std::vector<Animal> cats, dogs;
for_each(cats.begin(), dogs.end(), /**/); // obvious bug
This pertains to a more general issue: the validity of ranges passed to the algorithms.
[cats.begin(), dogs.end()) is invalid (unless one is an alias for the other)
[cats.end(), cats.begin()) is invalid (unless cats is empty ??)
The solution
The solution consists in adding information to the iterators so that their validity, and the validity of the ranges they define, can be asserted during execution, thus preventing undefined behavior from occurring.
The _HAS_ITERATOR_DEBUGGING symbol serves as a trigger to this capability, because it unfortunately slows down the program. It's quite simple in theory: each iterator is made an Observer of the container it's issued from and is thus notified of the modification.
In Dinkumware this is achieved by two additions:
Each iterator carries a pointer to its related container
Each container holds a linked list of the iterators it created
And this neatly solves our problems:
An uninitialized iterator does not have a parent container, most operations (apart from assignment and destruction) will trigger an assertion
An iterator to an erased or moved element has been notified (thanks to the list) and knows of its invalidity
On incrementing and decrementing an iterator, it can check that it stays within the bounds
Checking that 2 iterators belong to the same container is as simple as comparing their parent pointers
Checking the validity of a range is as simple as checking that we reach the end of the range before we reach the end of the container (linear operation for those containers which are not randomly accessible, thus most of them)
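As a toy sketch of that Observer mechanism (mine, assuming C++17; checked_vec is a made-up name, not Dinkumware's actual implementation): the container keeps a list of its live iterators and marks them invalid on every mutation. Copying iterators is disabled to keep the bookkeeping short; a real implementation would register copies too.
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

class checked_vec {
public:
    class iterator {
    public:
        iterator(checked_vec* parent, std::size_t i) : parent_(parent), i_(i) {
            parent_->iters_.push_back(this);            // register with the container
        }
        iterator(const iterator&) = delete;
        iterator& operator=(const iterator&) = delete;
        ~iterator() {                                   // unregister on destruction
            auto& v = parent_->iters_;
            v.erase(std::remove(v.begin(), v.end(), this), v.end());
        }
        int& operator*() const {
            assert(valid_ && "iterator was invalidated by a mutation");
            return parent_->data_[i_];
        }
    private:
        friend class checked_vec;
        checked_vec* parent_;
        std::size_t i_;
        bool valid_ = true;
    };

    iterator begin() { return iterator(this, 0); }

    void push_back(int v) {
        data_.push_back(v);
        for (iterator* it : iters_) it->valid_ = false; // notify every observer
    }

private:
    std::vector<int> data_;
    std::vector<iterator*> iters_;                      // the "list of iterators"
};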
The cost
The cost is heavy, but does correctness have a price? We can break down the cost:
extra memory allocation (the extra list of iterators maintained): O(NbIterators)
notification process on mutating operations: O(NbIterators) (Note that push_back or insert do not necessarily invalidate all iterators, but erase does)
range validity check: O( min(last-first, container.end()-first) )
Most of the library algorithms have of course been implemented for maximum efficiency; typically the check is done once and for all at the beginning of the algorithm, then an unchecked version is run. Still, things might slow down severely, especially with hand-written loops:
for (iterator_t it = vec.begin();
     it != vec.end(); // Oops
     ++it)
    // body
We know the Oops line is bad taste, but here it's even worse: on each run of the loop we create a new end iterator and then destroy it, which means allocating and deallocating a node in vec's list of iterators... Do I have to underline the cost of allocating/deallocating memory in a tight loop?
Of course, a for_each would not encounter such an issue, which is yet another compelling case toward the use of STL algorithms instead of hand-coded versions.
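Two easy ways to avoid that per-iteration cost (my sketch, not from the original answer) are to hoist the end iterator out of the loop, or to hand the whole traversal to an algorithm:
#include <algorithm>
#include <vector>

void demo(std::vector<int>& vec) {
    // Option 1: create the end iterator once, as long as the loop does not mutate vec.
    for (auto it = vec.begin(), end = vec.end(); it != end; ++it)
        *it += 1;

    // Option 2: let the algorithm do the checked setup once and run an unchecked loop.
    std::for_each(vec.begin(), vec.end(), [](int& x) { x += 1; });
}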
As far as I understand:
_HAS_ITERATOR_DEBUGGING will display a dialog box at run time to report any incorrect iterator use, including:
1) Iterators used in a container after an element is erased
2) Iterators used in vectors after a .push_back() or .insert() call
According to http://msdn.microsoft.com/en-us/library/aa985982%28v=VS.80%29.aspx
The C++ standard describes which member functions cause iterators to a container to become invalid. Two examples are:
Erasing an element from a container causes iterators to the element to become invalid.
Increasing the size of a vector (push or insert) causes iterators into the vector container to become invalid.
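As a hedged illustration (my sketch, assuming MSVC with iterator debugging enabled), code like the following would normally trip one of those runtime assertions instead of silently invoking undefined behavior:
#include <vector>

int main() {
    std::vector<int> v = {1, 2, 3};
    auto it = v.begin();
    v.push_back(4);   // reallocation invalidates it (capacity was likely exactly 3)
    return *it;       // with iterator debugging on, this typically asserts instead of being silent UB
}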