Why does std::for_each with deletion of elements not break iteration? - c++

As far as I know, erasing elements while iterating over a collection should break the iteration or cause elements to be skipped. Why doesn't calling std::for_each with a function object that erases cause this to happen? (It appears to work.)
Code snippet:
#include <iostream>
#include <map>
#include <algorithm>
using namespace std;
int main() {
    map<int,long long> m;
    m[1] = 5000;
    m[2] = 1;
    m[3] = 2;
    m[4] = 5000;
    m[5] = 5000;
    m[6] = 3;
    // Erase all elements > 1000
    std::for_each(m.begin(), m.end(), [&](const decltype(m)::value_type& v){
        if (v.second > 1000) {
            m.erase(v.first);
        }
    });
    for (const auto& a: m) {
        cout << a.second << endl;
    }
    return 0;
}
it prints out
1
2
3
EDIT: I now see that if it actually increments the iterator before calling the function then it could work. But does this count as compiler-specific/undefined behavior?

It's undefined behaviour, and won't work reliably. After adding a line to print keys and values inside your erasing lambda function, I see:
1=5000
2=1
3=2
4=5000
2=1 // AGAIN!!!
3=2 // AGAIN!!!
5=5000
6=3
With my Standard library's map implementation, after erasing the element with key 4, iteration returns to the node with key 2! It then revisits the node with key 3. Because your lambda happily retested such nodes (v.second > 1000) and returned without any side effects, the broken iteration wasn't affecting the output.
You might reasonably ask: "but isn't it astronomically unlikely that it'd have managed to continue iteration (even if to the wrong next node) without crashing?"
Actually, it's quite likely.
Erasing a node causes delete to be called for the memory that node occupied, but in general the library code performing the delete will just:
invoke the destructor (which has no particular reason to waste time overwriting the left-child-, right-child- and parent-pointers), then
modify its records of which memory regions are allocated vs. available.
It's unlikely to "waste" time arbitrarily modifying the heap memory being deallocated (though some implementations will in memory-usage debugging modes).
So, the erased node probably sits there intact until some other heap allocation's performed.
And, when you erase an element in a map, the Standard requires that none of the container's other elements be moved in memory - iterators, pointers and references to other elements must remain valid. It can only modify nodes' left/right/parent pointers that maintain the binary tree.
Consequently, if you continue to use the iterator to the erased element, it is likely to see pointers to the left/right/parent nodes the erased element linked to before erasure, and operator++() will "iterate" to them using the same logic it would have employed if the erased element were still in the map.
If we consider an example map's internal binary tree, where N3 is a node with key 3:
      N5
     /  \
   N3    N7
  /  \   /
 N1  N4 N6
The way iteration is done will likely be:
initially, start at N1; the map must directly track where this is to ensure begin() is O(1)
if on a node with no children, repeat { Nfrom = where you are, move to parent, if nullptr or right != Nfrom break } (e.g. N1->N3, N4->N3->N5, N6->N7, and from the final node N7->N5->nullptr, i.e. end())
if on a node with right-hand child, take it then any number of left-hand links (e.g. N3->N4, N5->N7->N6)
So, if, say, N4 is removed (so N3->right = nullptr;) and no rebalancing occurs, then iteration records Nfrom=N4, moves to the parent N3, finds N3->right != Nfrom, and so thinks it should stop on the already-iterated-over N3 instead of moving on up to N5.
On the other hand, if the tree has been rebalanced after the erase, all bets are off and the invalidated iterator could repeat or skip elements or even iterate "as hoped".
This is not intended to let you reason about behaviour after an erase - it's undefined and shouldn't be relied on. Rather, I'm just showing that a sane implementation can account for your unexpected observations.
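To make the traversal rules above concrete, here is a minimal sketch (illustrative only - not any particular Standard library's code) of the in-order successor step that such an operator++() performs on a node with left/right/parent pointers, using nullptr to play the role of end():
#include <iostream>

struct Node {
    int key;
    Node* left = nullptr;
    Node* right = nullptr;
    Node* parent = nullptr;
};

// In-order successor: either descend (right child, then leftmost),
// or climb until we arrive at a parent from its left side.
Node* successor(Node* n) {
    if (n->right) {
        n = n->right;
        while (n->left) n = n->left;
        return n;
    }
    Node* from = n;
    n = n->parent;
    while (n && n->right == from) {
        from = n;
        n = n->parent;
    }
    return n;                          // nullptr means "end"
}

int main() {
    // Build the example tree: N5 at the root, N3/N7 below it, N1/N4/N6 as leaves.
    Node n1{1}, n3{3}, n4{4}, n5{5}, n6{6}, n7{7};
    n5.left = &n3;   n5.right = &n7;
    n3.left = &n1;   n3.right = &n4;   n3.parent = &n5;
    n7.left = &n6;   n7.parent = &n5;
    n1.parent = &n3; n4.parent = &n3;  n6.parent = &n7;

    for (Node* p = &n1; p; p = successor(p))   // prints 1 3 4 5 6 7
        std::cout << p->key << ' ';
    std::cout << '\n';
}
With this picture you can see why pointing an iterator at a just-erased node and incrementing it simply follows whatever stale links that node still holds.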

std::for_each is classified in the C++ standard (25.2.4 in the draft) as a non-modifying sequence operation. The fact that modifying the sequence happens to work with your implementation is just luck.
This is undefined behaviour, and you shouldn't be doing it. The standard expects that you don't modify the container inside the function object.
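For reference, a minimal sketch (assuming C++11 or later, where map::erase(iterator) returns the iterator following the erased element) of a plain loop that removes the same elements without relying on undefined behaviour:
#include <map>

int main() {
    std::map<int, long long> m{{1, 5000}, {2, 1}, {3, 2},
                               {4, 5000}, {5, 5000}, {6, 3}};
    for (auto it = m.begin(); it != m.end(); ) {
        if (it->second > 1000)
            it = m.erase(it);   // erase returns the next valid iterator
        else
            ++it;
    }
}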

This happened to work for you, but I wouldn't count on it -- it's likely undefined behavior.
Specifically, I'd be concerned that erasing a map element while running std::for_each would end up incrementing an invalidated iterator. For example, it looks like libstdc++ (GCC's Standard library) implements std::for_each like so:
template<typename _InputIterator, typename _Function>
  _Function
  for_each(_InputIterator __first, _InputIterator __last, _Function __f) {
      // concept requirements
      __glibcxx_function_requires(_InputIteratorConcept<_InputIterator>)
      __glibcxx_requires_valid_range(__first, __last);
      for (; __first != __last; ++__first)
          __f(*__first);
      return _GLIBCXX_MOVE(__f);
  }
If calling __f ends up performing an erase, it seems likely that __first would be invalidated. Attempting to subsequently increment an invalid iterator would then be undefined behavior.

I find this operation pretty common, so to avoid the undefined behaviour above I wrote a container-based algorithm:
void remove_erase_if( Container&&, Test&& );
To deal with both associative and non-associative containers, I tag-dispatch on a custom trait class is_associative_container -- the associative containers go to a manual while loop, while the others go to a remove_if/erase version.
In my case I just hard-code the 4 associative containers in the trait -- you could duck-type it, but it is a higher-level concept, so you would just be pattern matching anyhow.
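The answer doesn't show the code, but a rough sketch of that kind of tag dispatch (the trait and helper names here are illustrative, the four classic associative containers are hard-coded as described, and C++11 is assumed so that erase(iterator) returns the next iterator) might look like this:
#include <algorithm>
#include <map>
#include <set>
#include <type_traits>
#include <utility>

template <class T> struct is_associative_container : std::false_type {};
template <class K, class V, class C, class A>
struct is_associative_container<std::map<K, V, C, A>>      : std::true_type {};
template <class K, class V, class C, class A>
struct is_associative_container<std::multimap<K, V, C, A>> : std::true_type {};
template <class K, class C, class A>
struct is_associative_container<std::set<K, C, A>>         : std::true_type {};
template <class K, class C, class A>
struct is_associative_container<std::multiset<K, C, A>>    : std::true_type {};

// associative containers: manual erase loop
template <class Container, class Test>
void remove_erase_if_impl(Container& c, Test test, std::true_type) {
    for (auto it = c.begin(); it != c.end(); ) {
        if (test(*it)) it = c.erase(it);
        else           ++it;
    }
}

// everything else: remove_if followed by a single range erase
template <class Container, class Test>
void remove_erase_if_impl(Container& c, Test test, std::false_type) {
    c.erase(std::remove_if(c.begin(), c.end(), test), c.end());
}

template <class Container, class Test>
void remove_erase_if(Container&& c, Test&& test) {
    remove_erase_if_impl(c, std::forward<Test>(test),
                         is_associative_container<typename std::decay<Container>::type>{});
}
With that in place, remove_erase_if(m, pred) takes the erase-loop path for a std::map and the remove_if/erase path for a std::vector.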

Related

erasing nlohmann::json object during iteration causes segmentation fault

I have a simple database consisting of objects with strings containing Unix time as keys and strings containing instructions as values.
I want to iterate through the database and erase any object whose key is smaller than the current time (so erase objects with dates before the current date):
for (auto it = m_jsonData.begin(); it != m_jsonData.end(); it++) {
    if (std::stoi(it.key()) <= (std::time(NULL))) {
        std::cout << "command is behind schedule, removing\n";
        m_jsonData.erase(it);
    } else {
        /*
        */
    }
}
This code works fine as long as m_jsonData.erase(it); isn't invoked. When it is, std::stoi(it.key()) causes a segfault on the next iteration. After a bit of playing with it I came to the conclusion that it somehow loses track of what it's actually iterating over. Is my conclusion true? If not, what is going on? And how do I fix it?
It's extremely normal for mutating container operations to invalidate iterators. It's one of the first things you should check for.
Documentation for nlohmann::json::erase():
Notes
Invalidates iterators and references at or after the point of the erase, including the end() iterator.
References and iterators to the erased elements are invalidated. Other references and iterators are not affected.
That means after this line:
m_jsonData.erase(it);
the iterator it can't be used for anything including incrementing it to the next element. It is invalid.
Fortunately, the documentation also points out that the successor to the removed element is returned, so you can just write
for (auto it = m_jsonData.begin(); it != m_jsonData.end(); ) {
    if (std::stoi(it.key()) <= (std::time(NULL))) {
        it = m_jsonData.erase(it);
    } else {
        ++it;
    }
}
Note that when I say this is extremely normal, it's because the standard containers often have similar behaviour. See the documentation for examples, but this is something everyone should be aware of:
std::vector::erase Iterator invalidation
std::unordered_map::erase Iterator invalidation
etc.
This is exactly the reason std::erase and std::erase_if were added in C++20, and previously std::remove_if was provided to support the erase(remove_if(...), end()) idiom, instead of writing fragile mutating loops.
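For the standard containers, a small illustration (values made up) of the two approaches mentioned - the classic erase(remove_if(...), end()) idiom and the C++20 std::erase_if free function:
#include <algorithm>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3, 4, 5, 6};

    // pre-C++20: erase-remove idiom
    v.erase(std::remove_if(v.begin(), v.end(),
                           [](int x) { return x % 2 == 0; }),
            v.end());

    // C++20: std::erase_if does the same job in one call
    std::erase_if(v, [](int x) { return x > 3; });
}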

Is it okay to assign values to map.end()?

I'm wondering whether it's a bad practice to assign values 1 past the last element in a map like the example below.
using namespace std;
auto chances = map<int, int>{};
chances[0] = 20;
chances[1] = 10;
chances[2] = 30;
int last = 0;
for (auto it = chances.begin(); it != chances.end();) {
    last = it->second;
    (++it)->second += last;
}
Also, in a for loop is it faster to check against a variable than a function for termination (what is this part of the loop called?)
Yes, it's bad practice to assign to the end() iterator of any container (not just map)
In all standard C++ containers, the end() iterator is not dereferenceable. Any attempt to dereference (in this case, assign through) the end() iterator is undefined behavior.
In your example code, a dereference of this end() iterator occurs due to the pre-increment operator being used:
(++it)->second += last
When it is 1 before end() during the iteration, this will increment it and dereference the result (the end) for the assignment.
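If the goal is a running sum into the following element, a minimal sketch (not from the original question) of one way to do it without ever dereferencing end() is to advance first and only write when a next element actually exists:
#include <map>

int main() {
    std::map<int, int> chances{{0, 20}, {1, 10}, {2, 30}};

    for (auto it = chances.begin(); it != chances.end(); ) {
        int last = it->second;    // value of the current element
        ++it;                     // move on before touching the next one
        if (it != chances.end())  // only write if a next element exists
            it->second += last;
    }
}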
Also, in a for loop is it faster to check against a variable than a function for termination
Generally it's better to assign the termination condition to a constant first and compare against that. Although compilers can perform this transformation,
there are a number of factors that may result in the function call being repeatedly evaluated each iteration.
That said, benchmark for yourself, and don't prematurely optimize. Small things like this seldom make big differences unless they are in a tight loop.
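As a tiny self-contained illustration (the vector here is made up), caching the bound once looks like this:
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v(1000, 1);
    long long sum = 0;
    for (std::size_t i = 0, n = v.size(); i < n; ++i)   // bound computed once
        sum += v[i];
    std::cout << sum << '\n';
}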
Note: Please try to ask only 1 question per SO post to help for searchability and prevent it from being closed for being too broad.
Oh and yes, I'm using namespace std; ;)
You should train yourself not to because it's bad practice that only exists due to legacy. Plus your future coworkers will thank you.

Keeping std::list iterators valid through insertion

Note: This is not a question whether I should "use list or deque". It's a question about the validity of iterators in the face of insert().
This may be a simple question and I'm just too dense to see the right way to do this. I'm implementing (for better or worse) a network traffic buffer as a std::list<char> buf, and I'm maintaining my current read position as an iterator readpos.
When I add data, I do something like
buf.insert(buf.end(), newdata.begin(), newdata.end());
My question is now, how do I keep the readpos iterator valid? If it points to the middle of the old buf, then it should be fine (by the iterator guarantees for std::list), but typically I may have read and processed all data and I have readpos == buf.end(). After the insertion, I want readpos always to point to the next unread character, which in case of the insertion should be the first inserted one.
Any suggestions? (Short of changing the buffer to a std::deque<char>, which appears to be much better suited to the task, as suggested below.)
Update: From a quick test with GCC 4.4 I observe that deque and list behave differently with respect to readpos = buf.end(): after inserting at the end, readpos is broken for a list but points to the next element for a deque. Is this a standard guarantee?
(According to cplusplus.com, any deque::insert() invalidates all iterators. That's no good. Maybe using a counter is better than an iterator for tracking a position in a deque?)
if (readpos == buf.begin())
{
    buf.insert(buf.end(), newdata.begin(), newdata.end());
    readpos = buf.begin();
}
else
{
    --readpos;
    buf.insert(buf.end(), newdata.begin(), newdata.end());
    ++readpos;
}
Not elegant, but it should work.
From http://www.sgi.com/tech/stl/List.html
"Lists have the important property that insertion and splicing do not invalidate iterators to list elements, and that even removal invalidates only the iterators that point to the elements that are removed."
Therefore, readpos should still be valid after the insert.
However...
std::list<char> is a very inefficient way to solve this problem. Each byte you store ends up in its own list node, which also holds two pointers (to the previous and next nodes) plus padding, and each node is a separate heap allocation. That is at least 12 or 24 bytes (32- or 64-bit) of memory used to keep track of a single byte of data.
std::deque<char> is probably a better container for this. Like std::vector, it provides constant-time insertion at the back; unlike std::vector, it also provides constant-time removal at the front. Finally, like std::vector, std::deque is a random-access container, so you can use offsets/indexes instead of iterators. These three features make it an efficient choice.
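As a rough sketch (illustrative only) of that offset-based approach, the read position can simply be an index that you rebase whenever you discard consumed data:
#include <deque>
#include <iostream>
#include <string>

int main() {
    std::deque<char> buf;
    std::size_t readpos = 0;                 // index of the next unread char

    std::string newdata = "abc";
    buf.insert(buf.end(), newdata.begin(), newdata.end());

    while (readpos < buf.size())             // consume everything unread
        std::cout << buf[readpos++];
    std::cout << '\n';

    buf.erase(buf.begin(), buf.begin() + readpos);   // drop consumed data
    readpos = 0;                                     // and rebase the index
}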
I was indeed being dense. The standard gives us all the tools we need. Specifically, the sequence container requirements 23.2.3/9 say:
The iterator returned from a.insert(p, i, j) points to the copy of the first element inserted into a, or p if i == j.
Next, the description of list::insert says (23.3.5.4/1):
Does not affect the validity of iterators and references.
So in fact if pos is my current iterator inside the list which is being consumed, I can say:
auto it = buf.insert(buf.end(), newdata.begin(), newdata.end());
if (pos == buf.end()) { pos = it; }
The range of new elements in my list is [it, buf.end()), and the range of yet unprocessed elements is [pos, buf.end()). This works because if pos was equal to buf.end() before the insertion, then it still is after the insertion, since insertion does not invalidate any iterators, not even the end.
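Putting it together, a minimal self-contained sketch (buffer contents and names are illustrative) of the append-and-consume pattern:
#include <iostream>
#include <list>
#include <string>

int main() {
    std::list<char> buf;
    auto readpos = buf.begin();              // equals buf.end() while empty

    std::string newdata = "hello";
    auto it = buf.insert(buf.end(), newdata.begin(), newdata.end());
    if (readpos == buf.end())
        readpos = it;                        // first unread character

    while (readpos != buf.end())             // consume everything unread
        std::cout << *readpos++;
    std::cout << '\n';                       // prints "hello"
}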
list<char> is a very inefficient way to store a string. It is probably 10-20 times larger than the string itself, plus you are chasing a pointer for every character...
Have you considered using std::deque<char> instead?
[edit]
To answer your actual question, adding and removing elements does not invalidate iterators in a list... But end() is still going to be end(). So you would need to check for that as a special case at the point where you insert the new element in order to update your readpos iterator.

How can I stop iterating "n" before the end of a map when the iterators aren't random-access?

I would like to traverse a map in C++ with iterators but not all the way to the end.
The problem is that even if we can do basic operations with iterators, we cannot add or compare iterators with integers.
How can I write the following instructions? (final is a map; window, an integer)
for (it=final.begin(); it!=final.end()-window; it++)
You cannot write final.end() - window for a map, because map iterators are bidirectional rather than random-access; subtraction would be an expensive operation (in practice it amounts to doing --iter the required number of times), so it isn't provided. If you really want to do it anyway, you can use the standard library function std::advance.
map<...>::iterator end = final.end();
std::advance(end, -window);
That will give you the end of your window.
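A small self-contained sketch of using it (the map's contents are made up; it assumes window <= final.size(), otherwise the advance would step past begin()):
#include <iostream>
#include <iterator>
#include <map>

int main() {
    std::map<int, int> final{{1, 10}, {2, 20}, {3, 30}, {4, 40}, {5, 50}};
    int window = 2;

    std::map<int, int>::iterator stop = final.end();
    std::advance(stop, -window);             // step back `window` positions

    for (std::map<int, int>::iterator it = final.begin(); it != stop; ++it)
        std::cout << it->first << '\n';      // prints 1 2 3
}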
std::map<T1, T2>::iterator it = final.begin();
for (int i = 0; i < final.size()-window; ++i, ++it)
{
    // TODO: add your normal loop body
}
Replace T1 and T2 with the actual types of the keys and values of the map.
Why don't you make 'it' an iterator as well ?
See the example here : http://www.cplusplus.com/reference/stl/map/begin/
Another solution:
size_t count=final.size();
size_t processCount=(window<count?count-window:0);
for (it=final.begin(); processCount && it!=final.end(); ++it, --processCount)
{
    // loop body
}
This one is a bit safer:
It takes care of the case when your map is actually smaller than the value of window.
It will process at most processCount elements, even if you change the size of your map inside your loop (e.g. add new elements)
In the original STL specification (and in pre-C++11 standards), size() was allowed to take O(n) time to compute, although usual implementations do it in O(1); since C++11 it is required to be constant time. To be on the safe side, it is better not to call size() many times if it is not necessary.
'end()' on the other hand has amortized constant time, so it should be OK to have it in the for-loop condition
++it may be faster than it++: the post-increment operator creates a temporary object, while the pre-increment does not. When the variable is a simple integral type the compiler can optimise the difference away, but with iterators that is not always the case.

Problem with invalidation of STL iterators when calling erase

The C++ standard specifies that when an erase occurs on containers such as std::deque, std::list, etc., iterators are invalidated.
My question is as follows: assuming a list of integers contained in a std::deque, and a pair of indices indicating a range of elements in the std::deque, what is the correct way to delete all even elements?
So far I have the following; however, the problem here is that the assumed end is invalidated after an erase:
#include <cstddef>
#include <deque>
#include <utility>

int main()
{
    std::deque<int> deq;
    for (int i = 0; i < 100; deq.push_back(i++));

    // range, 11th to 51st element
    std::pair<std::size_t,std::size_t> r(10,50);

    std::deque<int>::iterator it = deq.begin() + r.first;
    std::deque<int>::iterator end = deq.begin() + r.second;

    while (it != end)
    {
        if (*it % 2 == 0)
        {
            it = deq.erase(it);
        }
        else
            ++it;
    }
    return 0;
}
Examining how std::remove_if is implemented, it seems there is a very costly copy/shift down process going on.
Is there a more efficient way of achieving the above without all the copy/shifts
In general is deleting/erasing an element more expensive than swapping it with the next nth value in the sequence (where n is the number of elements deleted/removed so far)
Note: Answers should assume the sequence size is quite large, +1mil elements and that on average 1/3 of elements would be up for erasure.
I'd use the Erase-Remove Idiom. I think the Wikipedia article linked even shows what you're doing -- removing odd elements.
The copying that remove_if does is no more costly than what happens when you delete elements from the middle of the container. It might even be more efficient.
Calling .erase() also results in "a very costly copy/shift down process going on.". When you erase an element from the middle of the container, every other element after that point must be shifted down one spot into the available space. If you erase multiple elements, you incur that cost for every erased element. Some of the non-erased elements will move several spots, but are forced to move one spot at a time instead of all at once. That is very inefficient.
The standard library algorithms std::remove and std::remove_if optimize this work. They use a clever trick to ensure that every moved element is only moved once. This is much, much faster than what you are doing yourself, contrary to your intuition.
The pseudocode is like this:
read_location <- beginning of range.
write_location <- beginning of range.
while read_location != end of range:
    if the element at read_location should be kept in the container:
        copy the element at the read_location to the write_location.
        increment the write_location.
    increment the read_location.
As you can see, every element in the original sequence is considered exactly once, and if it needs to be kept, it gets copied exactly once, to the current write_location. It will never be looked at again, because the write_location can never run in front of the read_location.
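Applied to the question's deque and range, a sketch of that approach (keeping the asker's setup, and erasing only the leftover tail of the range) would be:
#include <algorithm>
#include <deque>

int main() {
    std::deque<int> deq;
    for (int i = 0; i < 100; ++i) deq.push_back(i);

    std::deque<int>::iterator first = deq.begin() + 10;
    std::deque<int>::iterator last  = deq.begin() + 50;

    // compact the odd (kept) elements to the front of the range...
    std::deque<int>::iterator new_last =
        std::remove_if(first, last, [](int x) { return x % 2 == 0; });

    // ...then drop the leftover tail with a single range erase
    deq.erase(new_last, last);
}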
Remember that deque stores its elements in blocks of contiguous memory and, for mid-container erasure, behaves much like vector: removing elements mid-container necessarily means copying subsequent elements over the hole. You just want to make sure you're doing one iteration and copying each not-to-be-deleted object directly to its final position, rather than moving all objects one by one during each delete. remove_if is efficient and appropriate in this regard; your erase loop is not: it does massive amounts of unnecessary copying.
FWIW - alternatives:
add a "deleted" state to your objects and mark them deleted in place, but then every time you operate on the container you'll need to check yourself
use a list, which is implemented using pointers to previous and next elements, so that removing a list element alters the adjacent pointers to bypass that element: no copying, efficient iteration, just no random access, more small (i.e. inefficient) heap allocations and pointer overheads
What to choose depends on the nature, relative frequency, and performance requirements of specific operations (e.g. it may be that you can afford slow removals if they're done at non-critical times, but need fastest-possible iteration - whatever it is, make sure you understand your needs and the implications of the various data structures).
One alternative that hasn't been mentioned is to create a new deque, copy the elements that you want to keep into it, and swap it with the old deque.
#include <algorithm>
#include <deque>
#include <iterator>
#include <utility>

void filter(std::deque<int>& in, std::pair<std::size_t,std::size_t> range) {
    std::deque<int> out;   // note: std::deque has no reserve(), unlike std::vector
    std::deque<int>::iterator first = in.begin();
    std::deque<int>::iterator curr = first + range.first;
    std::deque<int>::iterator last = first + range.second;
    // copy the prefix before the range unchanged
    std::copy(first, curr, std::back_inserter(out));
    // keep only the odd elements inside the range
    while (curr != last) {
        if (*curr & 1) {
            out.push_back(*curr);
        }
        ++curr;
    }
    // copy the suffix after the range unchanged, then swap into place
    std::copy(last, in.end(), std::back_inserter(out));
    in.swap(out);
}
I'm not sure if you have enough memory to create a copy, but it usually is faster and easier to make a copy instead of trying to erase elements in place from a large collection. If you still see memory thrashing, figure out how many elements you are going to keep by calling std::count_if and size the output up front (std::deque itself has no reserve(), so for a single allocation you would build the result in a std::vector instead).