What does enabling STL iterator debugging really do? - c++

I've enabled iterator debugging in an application by defining
_HAS_ITERATOR_DEBUGGING = 1
I was expecting this to really just check vector bounds, but I have a feeling it's doing a lot more than that. What checks, etc are actually being performed?
Dinkumware STL, by the way.

There are a number of operations on iterators that lead to undefined behavior; the goal of this trigger is to activate runtime checks (using asserts) that catch them before that happens.
The issue
The obvious operation is to use an invalid iterator, but this invalidity may arise from various reasons:
Uninitialized iterator
Iterator to an element that has been erased
Iterator to an element whose physical location has changed (reallocation for a vector)
Iterator outside of [begin, end)
The standard specifies in excruciating detail, for each container, which operation invalidates which iterators.
There is also a somewhat less obvious cause that people tend to forget: mixing iterators to different containers:
std::vector<Animal> cats, dogs;
for_each(cats.begin(), dogs.end(), /**/); // obvious bug
This pertains to a more general issue: the validity of ranges passed to the algorithms.
[cats.begin(), dogs.end()) is invalid (unless one is an alias for the other)
[cats.end(), cats.begin()) is invalid (unless cats is empty ??)
The solution
The solution consists of adding information to the iterators so that their validity, and the validity of the ranges they define, can be asserted during execution, thus preventing undefined behavior from occurring.
The _HAS_ITERATOR_DEBUGGING symbol serves as a trigger to this capability, because it unfortunately slows down the program. It's quite simple in theory: each iterator is made an Observer of the container it's issued from and is thus notified of the modification.
In Dinkumware this is achieved by two additions:
Each iterator carries a pointer to its related container
Each container holds a linked list of the iterators it created
And this neatly solves our problems:
An uninitialized iterator does not have a parent container, most operations (apart from assignment and destruction) will trigger an assertion
An iterator to an erased or moved element has been notified (thanks to the list) and knows of its invalidity
On incrementing and decrementing an iterator, it can check that it stays within the bounds
Checking that 2 iterators belong to the same container is as simple as comparing their parent pointers
Checking the validity of a range is as simple as checking that we reach the end of the range before we reach the end of the container (linear operation for those containers which are not randomly accessible, thus most of them)
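To make the two additions concrete, here is a minimal sketch of the idea. It is not Dinkumware's actual code: DebugVec, Iter, valid and orphan_all are hypothetical names, the container only holds int, and it throws instead of asserting so the effect is easy to observe.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Toy container: each iterator knows its parent, the parent lists its iterators.
class DebugVec {
public:
    class Iter {
    public:
        Iter() = default;  // no parent yet: every checked operation below throws
        Iter(const Iter& o) : parent_(o.parent_), index_(o.index_) { attach(); }
        Iter& operator=(const Iter& o) {
            if (this != &o) { detach(); parent_ = o.parent_; index_ = o.index_; attach(); }
            return *this;
        }
        ~Iter() { detach(); }

        bool valid() const { return parent_ != nullptr; }

        int& operator*() const {
            require_parent();
            if (index_ >= parent_->data_.size()) throw std::out_of_range("dereferenced end()");
            return parent_->data_[index_];
        }
        Iter& operator++() {
            require_parent();
            if (index_ >= parent_->data_.size()) throw std::out_of_range("incremented past end()");
            ++index_;
            return *this;
        }
        bool operator==(const Iter& o) const {
            require_parent();
            if (parent_ != o.parent_) throw std::logic_error("iterators from different containers");
            return index_ == o.index_;
        }
        bool operator!=(const Iter& o) const { return !(*this == o); }

    private:
        friend class DebugVec;
        Iter(DebugVec* p, std::size_t i) : parent_(p), index_(i) { attach(); }
        void require_parent() const {
            if (!parent_) throw std::logic_error("uninitialized or invalidated iterator");
        }
        void attach() {  // push this iterator onto the parent's list
            if (parent_) { next_ = parent_->head_; parent_->head_ = this; }
        }
        void detach() {  // unlink this iterator from the parent's list (linear walk)
            if (!parent_) return;
            Iter** link = &parent_->head_;
            while (*link && *link != this) link = &(*link)->next_;
            if (*link) *link = next_;
            parent_ = nullptr;
            next_ = nullptr;
        }
        DebugVec* parent_ = nullptr;
        std::size_t index_ = 0;
        Iter* next_ = nullptr;
    };

    DebugVec() = default;
    DebugVec(const DebugVec&) = delete;
    DebugVec& operator=(const DebugVec&) = delete;
    ~DebugVec() { orphan_all(); }

    Iter begin() { return Iter(this, 0); }
    Iter end() { return Iter(this, data_.size()); }

    void push_back(int v) {
        // Simplification: only a reallocation orphans iterators in this toy.
        if (data_.size() == data_.capacity()) orphan_all();
        data_.push_back(v);
    }

private:
    void orphan_all() {  // notify every live iterator that it is now invalid
        while (head_) {
            Iter* it = head_;
            head_ = it->next_;
            it->parent_ = nullptr;
            it->next_ = nullptr;
        }
    }
    std::vector<int> data_;
    Iter* head_ = nullptr;
};
```

Note that every call to end() in this sketch registers and then unregisters a temporary iterator, which is exactly the kind of bookkeeping that makes the debug mode slow.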
The cost
The cost is heavy, but does correctness have a price? We can break down the cost:
extra memory allocation (the extra list of iterators maintained): O(NbIterators)
notification process on mutating operations: O(NbIterators) (Note that push_back or insert do not necessarily invalidate all iterators, but erase does)
range validity check: O( min(last-first, container.end()-first) )
Most of the library algorithms have of course been implemented for maximum efficiency: typically the check is done once and for all at the beginning of the algorithm, then an unchecked version is run. Yet the program might still slow down severely, especially with hand-written loops:
for (iterator_t it = vec.begin();
it != vec.end(); // Oops
++it)
// body
We know the Oops line is bad taste, but here it's even worse: on each pass through the loop, we create a new iterator and then destroy it, which means allocating and deallocating a node for vec's list of iterators... Do I have to underline the cost of allocating/deallocating memory in a tight loop?
Of course, a for_each would not encounter such an issue, which is yet another compelling case toward the use of STL algorithms instead of hand-coded versions.
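For illustration, here are two ways to avoid re-evaluating end() on every pass: hoisting it into the loop header, or handing the range to an algorithm. The function names sum_hoisted and sum_algorithm are made up for this sketch.

```cpp
#include <numeric>
#include <vector>

// Hand-written loop: end() is evaluated once, so no end iterator is created per pass.
inline long sum_hoisted(const std::vector<int>& vec) {
    long total = 0;
    for (auto it = vec.begin(), last = vec.end(); it != last; ++it) {
        total += *it;
    }
    return total;
}

// Algorithm version: begin() and end() are each evaluated once, at the call site.
inline long sum_algorithm(const std::vector<int>& vec) {
    return std::accumulate(vec.begin(), vec.end(), 0L);
}
```

Hoisting is only correct when the loop body does not change the container, since the saved end iterator would otherwise be stale.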

As far as I understand:
_HAS_ITERATOR_DEBUGGING will display a dialog box at run time to assert any incorrect iterator use including:
1) Iterators used in a container after an element is erased
2) Iterators used in vectors after a .push_back() or .insert() function is called

According to http://msdn.microsoft.com/en-us/library/aa985982%28v=VS.80%29.aspx
The C++ standard describes which member functions cause iterators to a container to become invalid. Two examples are:
Erasing an element from a container causes iterators to the element to become invalid.
Increasing the size of a vector (push or insert) causes iterators into the vector container become invalid.
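As a small illustration of the first rule, the usual way to erase while iterating is to continue from the iterator that erase returns instead of reusing the erased one. drop_even is a hypothetical helper written for this sketch.

```cpp
#include <vector>

// Removes even values; never touches an iterator after it has been erased.
inline std::vector<int> drop_even(std::vector<int> v) {
    for (auto it = v.begin(); it != v.end(); /* advanced in the body */) {
        if (*it % 2 == 0) {
            it = v.erase(it);  // erase returns the iterator following the removed element
        } else {
            ++it;
        }
    }
    return v;
}
```

Writing `v.erase(it); ++it;` instead is the kind of mistake the debug checks are meant to catch.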

Related

Preventing getting the address of items in a container in C++

I'm writing a C++ array-like container that can dynamically grow and shrink. I'd like to prevent users of this container from taking the address of its items, because they might be reallocated when the container needs to reallocate itself. The only correct way of using this container will be by keeping track of the address of the container and the index of each item. (Yes, I'll be specifying it in the documentation, but it would be better if I could make the compiler trigger an error if a user of the container tries to get the address of an item.)
Can this be done somehow? I searched and found some questions about making the address-of operator private, but it doesn't seem to be guaranteed to work, nor is it a recommended practice. So I wonder if there is any alternative technique for preventing access to pointers to items...
In C++ all memory allocated to the program is more or less fair game. So there is no real way to prevent users of your array type from obtaining or calculating addresses within your array.
The STL even makes a lot of effort to guarantee that iterators do not suddenly become invalid because of this.
In the real world you write in the API description that it is very unwise to work with addresses within your container because they could become invalid at any time, and then it is the responsibility of the users to follow that rule, IMHO.
What you want to do is not possible easily (if at all).
If your container invalidates pointers to elements in certain situations, that does not make your container "special". Let's consider what other containers do to mitigate this problem.
std::vector:
// THIS CODE IS BROKEN !! DO NOT WRITE CODE LIKE THIS !!
std::vector<int> x(24,0);
int* ptr = &x[0];
auto it = x.begin();
x.push_back(42);
x.insert(it, 100); // *
*ptr = 5;
*it = 7;
Starting from the line marked with * everything here uses an invalid pointer or iterator. From cppreference, push_back:
If the new size() is greater than capacity() then all iterators and references (including the past-the-end iterator) are invalidated. Otherwise only the past-the-end iterator is invalidated.
Actually x.insert(it, 100); might be using a valid iterator, but the code does not check whether the push_back had to increase capacity, so one has to assume that it and ptr are invalid after the call to push_back.
insert:
Causes reallocation if the new size() is greater than the old capacity(). If the new size() is greater than capacity(), all iterators and references are invalidated. Otherwise, only the iterators and references before the insertion point remain valid. The past-the-end iterator is also invalidated.
Users of standard containers must be aware of iterator invalidation rules (see Iterator invalidation rules) or they will write horribly broken code like the one above.
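For comparison, one possible way to write the same sequence without touching a stale pointer or iterator is to remember positions as indices and to use the iterator that insert returns. This is only a sketch; fixed_sequence and first are names invented here.

```cpp
#include <cstddef>
#include <vector>

inline std::vector<int> fixed_sequence() {
    std::vector<int> x(24, 0);
    std::size_t first = 0;  // remember a position, not an address

    x.push_back(42);        // may reallocate; indices are unaffected
    auto it = x.insert(x.begin() + first, 100);  // fresh iterator, used immediately
    *it = 7;                // insert returned an iterator to the new element
    ++first;                // the original first element shifted right by one
    x[first] = 5;
    return x;
}
```

The index still has to be adjusted by hand after the insert, which is the price of not holding on to iterators.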
In general, you cannot protect yourself from all mistakes a user can possibly make. Document pre- and post-conditions and if a user ignores them they just get what they deserve.
Note that you could try to overload the & operator, but there is no way you can prevent someone from getting the address via std::addressof, which is made exactly for that: getting the address of an object even when the object tries to prevent it by overloading &.
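A short sketch of why the overload does not help. Guarded is a hypothetical element type whose operator& returns a null pointer; std::addressof ignores the overload and yields the real address.

```cpp
#include <memory>

// Hypothetical element type that tries to hide its address.
struct Guarded {
    int value = 0;
    Guarded* operator&() { return nullptr; }  // overloaded address-of
};

inline bool addressof_sees_through() {
    Guarded g;
    Guarded* via_operator = &g;                // calls the overload: nullptr
    Guarded* via_library = std::addressof(g);  // real address regardless
    via_library->value = 1;                    // writes through to g
    return via_operator == nullptr && via_library != nullptr && g.value == 1;
}
```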

STL iterator revalidation for end (past-the-end) iterator?

See related questions on past-the-end iterator invalidation:
this, this.
This is more a question of design, namely, is there (in STL or elsewhere) such concept as past-the-end iterator "revalidation"?
What I mean by this, and use case: suppose an algorithm needs to "tail" a container (such as a queue). It traverses the container until end() is reached, then pauses; independently from this, another part of the program enqueues more items in the queue. How is it possible for the algorithm to (EDIT) efficiently tell, "have more items been enqueued" while holding the previously past-the-end iterator (call it tailIt)? (this would imply it is able to check if tailIt == container.end() still, and if that is false, conclude tailIt is now valid and points to the first element that was inserted).
Please don't dismiss the question as "no, there isn't" - I'm looking to form judgment around how to design some logic in an idiomatic way, and have many options (in fact the iterators in question are to a hand-built data structure for which I can provide this property - end() revalidation - but I would like to judge if it is a good idea).
EDIT: made it clear we have the iterator tailIt and a reference to container. A trivial workaround for what I'm trying to do is, also remember count := how many items you processed, and then check is container.size() == count still, and if not, seek to container[count] and continue processing from there. This comes with many disadvantages (extra state, assumption container doesn't pop from the front (!), random-access for efficient seeking).
Not in general. Here are some issues with your idea:
Some past-the-end iterators don't "point" to the data block at all; in fact this will be true of any iterator except a vector iterator. So, overall, an extant end-iterator just is never going to become a valid iterator to data;
Iterators often become invalidated when the container changes — while this isn't always true, it also precludes a general solution that relies on dereferencing some iterator from before the mutation;
Iterator validity is non-observable — you already need to know, before you dereference an iterator, whether or not it is valid. This is information that comes from elsewhere, usually your brain… by that I mean the developer must read the code and make a determination based on its structure and flow.
Put all these together and it is clear that the end iterator simply cannot be used this way as the iterator interface is currently designed. Iterators refer to data in a range, not to a container; it stands to reason, then, that they hold no information about a container, and if the container causes the range to change there's no entity that the iterator knows about that it can ask to find this out.
Is the described logic possible to create? Certainly! But with a different iterator interface (and support from the container). You could wrap the container in your own class type to do this. However, I advise against making things that look like standard iterators but behave differently; this will be very confusing.
Instead, encapsulate the container and provide your own wrapper function that can directly perform whatever post-enqueuement action you feel you need. You shouldn't need to watch the state of the end iterator to achieve your goal.
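As a rough sketch of such a wrapper (TailedQueue, enqueue and take_new are names invented here, and it assumes nothing is ever popped from the front): the consumer asks the wrapper for whatever arrived since its last call, and the cursor lives inside the wrapper rather than in a saved end iterator.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

class TailedQueue {
public:
    void enqueue(std::string item) { items_.push_back(std::move(item)); }

    // Returns copies of the items added since the previous call and advances the cursor.
    std::vector<std::string> take_new() {
        std::vector<std::string> fresh;
        for (; consumed_ < items_.size(); ++consumed_) {
            fresh.push_back(items_[consumed_]);
        }
        return fresh;
    }

private:
    std::deque<std::string> items_;
    std::size_t consumed_ = 0;  // how many items have been handed out so far
};
```

This is not thread-safe as written; if the producer runs on another thread, both member functions would need to be guarded by a lock.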
In the case for a std::queue, no there isn't (heh). Not because the iterators for a queue get invalidated once something is pushed, but because a queue doesn't have any iterators at all.
As for other iterator types, most (if not all) of them don't hold a reference to the container holder (the managing object containing all the info about the underlying data), which is a trade-off of flexibility for efficiency. (I quickly checked the implementation of gcc's std::vector::iterator.) It is possible to write an iterator type that keeps a reference to the holder during its lifetime; that way the iterators never have to be invalidated (unless the holder is std::move'd)!
Now to throw in my professional opinion, I wouldn't mind seeing a safe_iterator/flex_iterator for cases where the iterator normally would be invalidated during iterations.
Possible user interface:
for (auto v : make_flex_iterator(my_vector)) {
if (some_outside_condition()) {
// Normally the vector's iterators would be invalidated at this point
// (only if resized, but you should always assume a resize)
my_vector.push_back("hello world!");
}
}
Literally revalidating iterators might be too complex to build for its use case (I wouldn't know where to begin), but designing an iterator which simply never invalidates is quite trivial, with only as much overhead as a for (size_t i = 0; i < c.size(); i++); loop. With that said, I cannot assure you how well the compiler will optimize (unrolling loops, for example) with these iterators. I do assume it will still do quite a good job.
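A minimal sketch of what such an index-based iterator could look like for std::vector. IndexCursor and append_tenfold_of_small are hypothetical names, and this is not the make_flex_iterator interface shown above: the cursor stores a pointer to the vector plus an index and re-reads size() on every end check, so growth during traversal is harmless.

```cpp
#include <cstddef>
#include <vector>

template <typename T>
class IndexCursor {
public:
    explicit IndexCursor(std::vector<T>& v) : vec_(&v) {}
    bool at_end() const { return index_ >= vec_->size(); }  // size re-read on every check
    T& operator*() const { return (*vec_)[index_]; }
    IndexCursor& operator++() { ++index_; return *this; }

private:
    std::vector<T>* vec_;
    std::size_t index_ = 0;
};

// Appends value * 10 for every element equal to 1 or 2, while traversing.
inline std::vector<int> append_tenfold_of_small(std::vector<int> v) {
    for (IndexCursor<int> c(v); !c.at_end(); ++c) {
        if (*c > 0 && *c < 3) {
            const int scaled = *c * 10;  // copy first: push_back may move the elements
            v.push_back(scaled);
        }
    }
    return v;
}
```

The cursor itself survives a reallocation, but a reference obtained from operator* does not, hence the copy before push_back. Erasing or inserting before the cursor would still shift which element it refers to.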

Do the iterator invalidation rules mean thread safety?

This Stack Overflow answer lists the iterator invalidation rules for the standard containers in C++11.
Particularly, there are for insertion:
[multi]{set,map}: all iterators and references unaffected [23.2.4/9]
unordered_[multi]{set,map}: all iterators invalidated when rehashing occurs, but references unaffected [23.2.5/8]. Rehashing does not occur if the insertion does not cause the container's size to exceed z * B where z is the maximum load factor and B the current number of buckets. [23.2.5/14]
erasure:
[multi]{set,map} and unordered_[multi]{set,map}: only iterators and references to the erased elements are invalidated
Do these rules mean I can safely do insertion and erasure in one thread, and safely in another thread access, look for (using find) elements as long as these elements are not the ones being inserted and erased in the first thread, and make sure that rehashing is not happening?
If not, what do these rules exactly mean?
The fact that iterators to elements of the container are not invalidated in no way implies thread safety on the container itself. For example, the size member variable would need to be modified atomically which is a totally separate issue from iterators being invalidated (or not) on insertion/deletion.
tl;dr; No.
These rules simply tell you when an iterator to an element is invalidated by an operation. For example, when a vector resizes, the underlying array is reallocated elsewhere so if you had an iterator (or pointer) to an element, it would no longer be valid after the resize (because it would be pointing to deleted elements of the old array).
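The vector case can be observed directly. In this sketch (growth_moves_storage is a made-up name) the address of the storage is recorded as an integer, the vector is filled to capacity, and one more push_back forces a reallocation; on typical implementations the storage then lives at a different address.

```cpp
#include <cstdint>
#include <vector>

inline bool growth_moves_storage() {
    std::vector<int> v(4, 0);
    while (v.size() < v.capacity()) v.push_back(0);  // fill the current allocation
    const std::uintptr_t before = reinterpret_cast<std::uintptr_t>(v.data());
    v.push_back(1);                                  // exceeds capacity: reallocates
    const std::uintptr_t after = reinterpret_cast<std::uintptr_t>(v.data());
    return before != after;  // old pointers and iterators would now dangle
}
```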
There are two kinds of operations on C++ std containers. Reader and Writer operations (these are not the terms the standard uses, but this reads easier). In addition, there are operations on elements in the container.
A const method is a Reader method, as are "lookup" functions that are only non-const because they return a non-const reference to an element (or similar). There is a complete list in the standard, but common sense should work. vector::operator[], begin(), map::find() are all "Readers". map::operator[] is not a "Reader".
You can have any number of threads engaging in Reader operations at the same time no problem.
If you have any thread engaged in a Writer operation, no other access can occur on the container.
You cannot safely have one Reader and one Writer at the same time. If you have any Writers, you must have excluded all other access.
You can safely have 1337 readers at once.
Operations on elements are somewhat similar, except that Writing to an element need only exclude other access to that element. And you are responsible for making your const methods play nice with each other. (The std guarantees that the const methods of the container will only call const methods of the elements.)
Note that changing the sort order of the elements in a std:: associative container is UB, even though it does not modify the container itself.
An iterator that is not invalidated, where you just operate on the element, will (I believe) count as operations on the element. Only synchronization on that element is required.
Note that std::vector<bool> does not follow the above rules.
If you violate the above rules, the C++ std library does not stop you. It just states there is a race condition -- aka, undefined behavior. Anything can happen. In C++ std library, if you don't need something, you don't pay for it: and a single-threaded use of those containers would not need the weight of synchronization. So you don't get it by default.
A std::shared_timed_mutex (or std::shared_mutex, standard since C++17) can be useful to guarantee the above holds: unique_lock for Writing, and shared_lock for Reading. Write access to elements has to be done under a shared_lock on the container, and somehow guarded against overlapping access to the same element without deadlock.
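A minimal sketch of the Reader/Writer split, assuming a std::map behind a std::shared_timed_mutex. GuardedMap, put and get are names invented here, and get copies the value out instead of handing back a reference, so no element access outlives the lock.

```cpp
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

class GuardedMap {
public:
    void put(const std::string& key, int value) {
        std::unique_lock<std::shared_timed_mutex> lock(mutex_);  // Writer: exclusive
        map_[key] = value;
    }

    bool get(const std::string& key, int& out) const {
        std::shared_lock<std::shared_timed_mutex> lock(mutex_);  // Reader: shared
        auto it = map_.find(key);
        if (it == map_.end()) return false;
        out = it->second;  // copied while the lock is still held
        return true;
    }

private:
    mutable std::shared_timed_mutex mutex_;
    std::map<std::string, int> map_;
};
```

Any number of threads can be inside get at once, while put waits until it has the container to itself.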
Iterator invalidation rules are relatively orthogonal to the thread-safety rules.
Using find implies traversal, at least over a subset of the elements. insert and erase on [multi]{set,map} results in rebalancing the underlying tree, which impacts the links between the nodes. If a rebalance happens at the same time as a find, bad things will happen.
Similarly bad things will happen if you attempt a find during unordered_[multi]{set,map} insert or erase. insert can cause rehashing. And both insert and erase need to link/unlink elements from a list. If a find is searching over a list during a link/unlink, you lose.
[] on [unordered_]map is shorthand for "find and insert if not found". at is shorthand for find. So no, these are not safe to use either.
If you have an existing iterator into a [multi]{set,map}, you can continue to dereference (but not increment/decrement) that iterator while another element is inserted or erased. For unordered_[multi]{set,map}, this is true only if you can guarantee that rehashing won't happen under the insert (it never happens under the erase).
There are other answers here that go into the thread safety issue. So if these rules don't mean thread safety, where does that leave us?
If not, what do these rules exactly mean?
They tell you when you can't use an iterator anymore.
Let's take a (deceptively innocent) example:
auto v = std::vector<int>{....};
//v.reserve(...);
for (auto it = std::begin(v); it != std::end(v); ++it) {
if (*it == ...)
v.insert(it, ...);
}
Here you traverse a vector and for each element that tests a condition, you insert something into that position.
Now is this code valid? The iterator invalidation rules tell you that if the vector's capacity is big enough, the insertion invalidates only the iterators at or after the insert position. So if you can prove that the reserve (commented line) is big enough, no reallocation happens and it keeps pointing into the same storage, although strictly speaking it sits at the insert position and so counts as invalidated too. If the capacity is not big enough, the insert invalidates all the iterators of the vector, which means that it is invalidated, which means that you cannot use it anymore. Either way, the safe move is to reacquire it:
auto idx = std::distance(std::begin(v), it);
v.insert(it, ...);
it = std::begin(v) + idx;
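Another way to reacquire the iterator is to use the one that insert returns. The sketch below fills in the elided parts with made-up parameters (match and extra are assumptions, as is the name insert_before_matches), and it steps past the matched element so the same element is not matched twice.

```cpp
#include <vector>

// Inserts `extra` in front of every element equal to `match`.
inline std::vector<int> insert_before_matches(std::vector<int> v, int match, int extra) {
    for (auto it = v.begin(); it != v.end(); ++it) {
        if (*it == match) {
            it = v.insert(it, extra);  // valid iterator to the inserted element
            ++it;                      // back onto the matched element
        }
    }
    return v;
}
```

Because v.end() is re-evaluated on every pass, the loop condition also stays correct after a reallocation.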

Is it possible to store an iterator?

For example, say I have a const_iterator:
QHash<const QString, QPair<const Node, double> >::const_iterator citer = adjNodeHash.begin();
Can I then store citer in a data structure (containing many iterators) and re-use it later, with it still referring to the same place I left off the next time I use it? (assuming I update it accordingly/use a reference to it when I am incrementing it)
I ask this because I have used this approach yet am getting some undefined behaviour and am wondering if this is the culprit.
Any help would be much appreciated.
Iterator invalidation rules for std containers are described in the standard. QHash will also have some iterator invalidation rules in its documentation (hopefully!).
A stored iterator remains valid until invalidated. Most hash maps invalidate their iterators when they "rehash", which happens when they grow past a certain bound.
In practice, it is probably a bad idea to store an iterator into a hash map over a period in which elements are added or removed from it. Maintaining that iterator as valid will take constant maintenance and error checking, adding overhead to every use of that hash map, and any errors developing may not immediately show up, and the error that happens won't occur near the spot where the mistake is made.
On top of that, if you ever swap out what hash container you are using, the details of the iterator invalidation rules are going to be different. This makes refactoring in the future more painful.
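One alternative is to store the key and look it up again when you come back. The sketch below uses std::unordered_map as a stand-in for QHash, and resume_at is a hypothetical helper; the lookup costs a hash computation, but it cannot be invalidated by a rehash.

```cpp
#include <string>
#include <unordered_map>

// Returns a pointer to the value for a remembered key, or nullptr if the key is gone.
inline double* resume_at(std::unordered_map<std::string, double>& table,
                         const std::string& saved_key) {
    auto it = table.find(saved_key);  // fresh iterator, valid right now
    return it == table.end() ? nullptr : &it->second;
}
```

The returned pointer is meant to be used right away; it is subject to the container's own invalidation rules if kept across later modifications.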

Why does push_back or push_front invalidate a deque's iterators?

As the title asks.
My understanding of a deque was that it allocated "blocks". I don't see how allocating more space invalidates iterators, and if anything, one would think that a deque's iterators would have more guarantees than a vector's, not less.
The C++ standard doesn't specify how deque is implemented. It isn't required to allocate new space by allocating a new chunk and chaining it on to the previous ones; all that's required is that insertion at each end be amortized constant time.
So, while it's easy to see how to implement deque such that it gives the guarantee you want[*], that's not the only way to do it.
[*] Iterators have a reference to an element, plus a reference to the block it's in so that they can continue forward/back off the ends of the block when they reach them. Plus I suppose a reference to the deque itself, so that operator+ can be constant-time as expected for random-access iterators -- following a chain of links from block to block isn't good enough.
What's more interesting is that push_back and push_front will not invalidate any references to a deque's elements. Only iterators are to be assumed invalid.
The standard, to my knowledge, doesn't state why. However if an iterator were implemented that was aware of its immediate neighbors - as a list is - that iterator would become invalid if it pointed to an element that was both at the edge of the deque and the edge of a block.
My guess. push_back/push_front can allocate a new memory block. A deque iterator must know when increment/decrement operator should jump into the next block. The implementation may store that information in iterator itself. Incrementing/decrementing an old iterator after push_back/push_front may not work as intended.
This code may or may not fail with a run-time error. On my Visual Studio it failed in debug mode but ran to completion in release mode. On Linux it caused a segmentation fault.
#include <iostream>
#include <deque>
int main() {
std::deque<int> x(1), y(1);
std::deque<int>::iterator iterx = x.begin();
std::deque<int>::iterator itery = y.begin();
for (int i=1; i<1000000; ++i) {
x.push_back(i);
y.push_back(i);
++iterx;
++itery;
if(*iterx != *itery) {
std::cout << "increment failed at " << i << '\n';
break;
}
}
}
The key thing is not to make any assumptions; just treat the iterator as if it will be invalidated.
Even if it works fine now, a later version of the compiler or the compiler for a different platform might come along and break your code. Alternatively, a colleague might come along and decide to turn your deque into a vector or linked list.
An iterator is not just a reference to the data. It must know how to increment, etc.
In order to support random access, implementations will have a dynamic array of pointers to the chunks. The deque iterator will point into this dynamic array. When the deque grows, a new chunk might need to be allocated. The dynamic array will grow, invalidating its iterators and, consequently, the deque's iterators.
So it is not that chunks are reallocated, but the array of pointers to these chunks can be. Indeed, as Johannes Schaub noted, references are not invalidated.
Also note that the deque's iterator guarantees are not less than the vector's, which are also invalidated when the container grows.
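The difference between references and iterators can be checked with a small sketch (reference_survives_growth is a made-up name): a reference taken before many push_back and push_front calls still names the same element afterwards, while the iterator is obtained fresh after the growth instead of being kept across it.

```cpp
#include <cstddef>
#include <deque>

inline bool reference_survives_growth() {
    std::deque<int> d(1, 42);
    int& original = d.front();  // reference taken before any growth
    const std::size_t pushes = 100000;
    for (std::size_t i = 0; i < pushes; ++i) {
        d.push_back(1);
        d.push_front(2);
    }
    // The original element now sits at index `pushes`; reacquire an iterator to it.
    auto it = d.begin() + static_cast<std::ptrdiff_t>(pushes);
    return &original == &*it && original == 42;
}
```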
Even when you are allocating in chunks, an insert will cause that particular chunk to be reallocated if there isn't enough space (as is the case with vectors).
Because the standard says it can. It does not mandate that deque be implemented as a list of chunks. It mandates a particular interface with particular pre and post conditions and particular algorithmic complexity minimums.
Implementors are free to implement the thing in whatever way they choose, so long as it meets all of those requirements. A sensible implementation might use lists of chunks, or it might use some other technique with different trade-offs.
It's probably impossible to say that one technique is strictly better than another for all users in all situations. Which is why the standard gives implementors some freedom to choose.