Difference between std::remove and erase for vector? - c++

I have a doubt that I would like to clarify in my head. I am aware of the different behavior for std::vector between erase and std::remove, where the first physically removes an element from the vector, reducing its size, and the other just moves elements around, leaving the size the same.
Is this just for efficiency reasons? By using erase, all elements in a std::vector will be shifted by 1, causing a large amount of copies; std::remove does just a 'logical' delete and leaves the vector unchanged by moving things around. If the objects are heavy, that difference might matter, right?

The reason for using this idiom is exactly that. There is a benefit in performance, but not in the case of a single erasure. Where it does matter is if you need to remove multiple elements from the vector. In this case, std::remove will copy each element that is kept only once to its final location, while the vector::erase approach would move all of the elements from the erased position to the end multiple times. Consider:
std::vector<int> v{ 1, 2, 3, 4, 5 };
// remove all elements < 5
If you went over the vector removing elements one by one, you would remove the 1, causing copies of the remaining elements that get shifted (4). Then you would remove 2 and shift all remaining elements by one (3)... if you see the pattern, this is an O(N^2) algorithm.
In the case of std::remove, the algorithm maintains read and write heads and iterates over the container. For the first 4 elements the read head is moved and the element tested, but no element is copied. Only for the fifth element is the object copied from the last to the first position, and the algorithm completes with a single copy, returning an iterator to the second position. This is an O(N) algorithm. The subsequent std::vector::erase with the range destroys all the remaining elements and shrinks the container's size.
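The single-pass behavior described above can be sketched roughly as follows. This is only an illustration for the "remove all elements < 5" example; `drop_below` and `drop_below_demo` are hypothetical names, not part of the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: erase every element smaller than `limit` in one pass.
void drop_below(std::vector<int>& v, int limit)
{
    // remove_if compacts the kept elements at the front and returns the new
    // logical end; v.size() is still unchanged at this point.
    auto new_end = std::remove_if(v.begin(), v.end(),
                                  [limit](int x) { return x < limit; });
    // A single range erase destroys the leftover tail and shrinks the size.
    v.erase(new_end, v.end());
}

void drop_below_demo()
{
    std::vector<int> v{1, 2, 3, 4, 5};
    drop_below(v, 5);
    assert(v.size() == 1 && v[0] == 5);
}
```

Each kept element is written at most once, which is where the linear cost comes from.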
As others have mentioned, in the standard library algorithms are applied to iterators, and lack knowledge of the sequence being iterated. This design is more flexible than other approaches in which algorithms are aware of the containers, in that a single implementation of the algorithm can be used with any sequence that complies with the iterator requirements. Consider, for example, std::remove_copy_if: it can be used even without containers, by using iterators that generate/accept sequences:
std::remove_copy_if(std::istream_iterator<int>(std::cin),
                    std::istream_iterator<int>(),
                    std::ostream_iterator<int>(std::cout, " "),
                    [](int x) { return !(x%2); } // is even
                    );
That single line of code will filter out all even numbers from standard input and dump the rest to standard output, without requiring the loading of all numbers into memory in a container. This is the advantage of the split; the disadvantage is that the algorithms cannot modify the container itself, only the values referred to by the iterators.
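For experimenting without typing into standard input, the same call can be pointed at string streams. This is a sketch under that assumption; `keep_odd` and `keep_odd_demo` are hypothetical names introduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper: reads whitespace-separated integers from `input` and
// returns the odd ones, each followed by a single space.
std::string keep_odd(const std::string& input)
{
    std::istringstream in(input);
    std::ostringstream out;
    std::remove_copy_if(std::istream_iterator<int>(in),
                        std::istream_iterator<int>(),
                        std::ostream_iterator<int>(out, " "),
                        [](int x) { return x % 2 == 0; }); // drop even numbers
    return out.str();
}

void keep_odd_demo()
{
    // Note the trailing space written by the ostream_iterator delimiter.
    assert(keep_odd("1 2 3 4 5") == "1 3 5 ");
}
```

No container holds the numbers at any point; the algorithm only sees the two iterator ranges.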

std::remove is an algorithm from the STL which is quite container agnostic. It requires some concept, true, but it has been designed to also work with C arrays, which are fixed in size.

std::remove simply returns a new end() iterator to point to one past the last non-removed element (the number of items from the returned value to end() will match the number of items to be removed, but there is no guarantee their values are the same as those you were removing - they are in a valid but unspecified state). This is done so that it can work for multiple container types (basically any container type that a ForwardIterator can iterate through).
std::vector::erase actually sets the new end() iterator after adjusting the size. This is because the vector's method actually knows how to handle adjusting its iterators (the same can be done with std::list::erase, std::deque::erase, etc.).
remove organizes a given container to remove unwanted objects. The container's erase function actually handles the "removing" the way that container needs it to be done. That is why they are separate.
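A small sketch of that division of labor on a built-in array, where no erase exists at all. The helper name `compact_without` is hypothetical; only std::remove, std::begin and std::end are standard.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>

// Hypothetical helper: compacts a built-in array in place and returns how
// many elements remain in the logical range [arr, arr + result).
template <std::size_t N>
std::size_t compact_without(int (&arr)[N], int value)
{
    int* new_end = std::remove(std::begin(arr), std::end(arr), value);
    // The array still has N elements; only the front part is meaningful now.
    return static_cast<std::size_t>(new_end - std::begin(arr));
}

void compact_demo()
{
    int data[] = {1, 0, 2, 0, 3};
    std::size_t kept = compact_without(data, 0);
    assert(kept == 3);
    assert(data[0] == 1 && data[1] == 2 && data[2] == 3);
    // data[3] and data[4] still exist but hold unspecified values.
}
```

The caller has to carry the returned count (or iterator) around, because the array itself cannot shrink.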

I think it has to do with needing direct access to the vector itself to be able to resize it. std::remove only has access to the iterators, so it has no way of telling the vector "Hey, you now have fewer elements".
See Yves Baumes's answer as to why std::remove is designed this way.

Yes, that's the gist of it. Note that erase is also supported by the other standard containers where its performance characteristics are different (e.g. list::erase is O(1)), while std::remove is container-agnostic and works with any type of forward iterator (so it works for e.g. bare arrays as well).

Kind of. Algorithms such as remove work on iterators (which are an abstraction to represent an element in a collection) which do not necessarily know which type of collection they are operating on - and therefore cannot call members on the collection to do the actual removal.
This is good because it allows algorithms to work generically on any container and also on ranges that are subsets of the entire collection.
Also, as you say, for performance - it may not be necessary to actually remove (and destroy) the elements if all you need is access to the logical end position to pass on to another algorithm.

Standard library algorithms operate on sequences. A sequence is defined by a pair of iterators; the first points at the first element in the sequence, and the second points one-past-the-end of the sequence. That's all; algorithms don't care where the sequence comes from.
Standard library containers hold data values, and provide a pair of iterators that specify a sequence for use by algorithms. They also provide member functions that may be able to do the same operations as an algorithm more efficiently by taking advantage of the internal data structure of the container.

Related

Why is there a "erase–remove idiom" for std::remove, std::remove_if with containers?

I was just looking into why the function std::remove_if wasn't working the way I expected, and learned about the C++ "erase-remove idiom" where I'm supposed to pass the result of remove or remove_if to erase to actually remove the items I want from a container.
This strikes me as quite unintuitive: this means remove and remove_if don't do what they say on the tin. It also makes for more verbose, less clear code.
Is there a justification for this? I figure there's some kind of trade-off with an upside balancing out the downsides I listed in the previous paragraph.
My first thought would be that there's some use-case for using remove or remove_if on their own, but since they leave the remaining items in a collection in an undefined state, I can't think of any possible use case for that.
This is a necessary function of the way the container/iterator/algorithm paradigm works. The basic concept of the model is as follows:
Containers contain and manage sequences of values.
Algorithms act on sequences of values.
Iterators represent moveable positions within a sequence of values.
Therefore, algorithms act on iterators, which represent locations within some sequence of values, usually provided by a container.
The problem is that removal of an item from a container doesn't fit that paradigm. The removal of an element from a container is not an act on a "sequence of values"; it fundamentally changes the nature of the sequence itself.
That is, "removal" ultimately finishes with a container operation, not an iterator operation. If algorithms only act on iterators, no pure algorithm can truly remove elements. Iterators don't know how to do that. Algorithms that only act on iterators can move values around within a sequence, but they cannot change the nature of the sequence such that the "removed" values no longer exist.
But while the removal of elements is a container operation... it's not a value agnostic operation. remove removes only values that compare equal to the given value. remove_if removes only values for which the predicate returns true. These are not container operations; they are algorithms that don't really care about the nature of the container.
Except for when it comes time to actually remove them from the container. From the perspective of the above paradigm, it is inherently two separate operations: an algorithm followed by a container operation.
That all being said, C++20 does give a number of containers non-member std::erase and std::erase_if overloads. These do the full job of erase-remove as a non-member function.
As for the concern that there is no use case for remove or remove_if on their own: there are uses for them. Multiple removal is the obvious one. You can perform a series of remove actions, so long as you pass the new end iterator to each subsequent removal (so that no operation examines removed elements). You can do a proper container erase at the end.
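A rough sketch of such chained removals, with a single erase at the end. The function names are hypothetical and the two criteria (zeros, then negatives) are chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: drops all zeros and all negative values from v.
void remove_zeros_and_negatives(std::vector<int>& v)
{
    // First pass: everything equal to 0 is compacted away.
    auto logical_end = std::remove(v.begin(), v.end(), 0);
    // Second pass: only look at [begin, logical_end), never at the leftovers.
    logical_end = std::remove_if(v.begin(), logical_end,
                                 [](int x) { return x < 0; });
    // One container operation at the very end.
    v.erase(logical_end, v.end());
}

void chained_remove_demo()
{
    std::vector<int> v{3, 0, -1, 4, 0, -2, 5};
    remove_zeros_and_negatives(v);
    assert((v == std::vector<int>{3, 4, 5}));
}
```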
It should also be noted that the C++20 std::erase and std::erase_if functions only take containers, not sub-sections of containers. That is, they don't allow you to erase from some range within a container. Only the erase/remove idiom allows for that.
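The sub-range case could look roughly like this. It is a sketch only; `erase_in_prefix` is a hypothetical helper that restricts the removal to the first `count` elements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: erase `value` only from the first `count` elements,
// leaving the rest of the vector untouched.
void erase_in_prefix(std::vector<int>& v, std::size_t count, int value)
{
    auto first = v.begin();
    auto last = v.begin()
              + static_cast<std::ptrdiff_t>(std::min(count, v.size()));
    // remove works on [first, last); erase then drops the leftovers of that
    // sub-range and shifts the untouched tail down.
    v.erase(std::remove(first, last, value), last);
}

void erase_in_prefix_demo()
{
    std::vector<int> v{1, 2, 1, 2, 1, 2};
    erase_in_prefix(v, 4, 1);
    assert((v == std::vector<int>{2, 2, 1, 2}));
}
```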
Also, not all containers can erase elements. std::array has a fixed size; truly erasing elements isn't allowed. But you can still std::remove with them, so long as you keep track of the new end iterator.
Many algorithms from the standard library operate on general iterators, which cannot be used to remove elements. erase is a method of the container and has access to more information, so it can be used to directly delete elements.

vector vs. list from stl - remove method

std::list has a remove method, while the std::vector doesn't. What is the reason for that?
std::list<>::remove is a physical removal method, which can be implemented by physically destroying list elements that satisfy certain criteria (by physical destruction I mean the end of element's storage duration). Physical removal is only applicable to lists. It cannot be applied to arrays, like std::vector<>. It simply is not possible to physically end storage duration of an individual element of an array. Arrays can only be created and destroyed as a whole. This is why std::vector<> does not have anything similar to std::list<>::remove.
The universal removal method applicable to all modifiable sequences is what one might call logical removal: the target elements are "removed" from the sequence by overwriting their values with values of elements located further down in the sequence. I.e. the sequence is shifted and compacted by copying the persistent data "to the left". Such logical removal is implemented by freestanding functions, like std::remove. Such functions are applicable in equal degree to both std::vector<> and std::list<>.
In cases where the approach based on immediate physical removal of specific elements applies, it will work more efficiently than the generic approach I referred to above as logical removal. That is why it was worth providing it specifically for std::list<>.
std::list::remove removes all items in a list that match the provided value.
std::list<int> myList;
// fill it with numbers
myList.remove(10); // physically removes all instances of 10 from the list
It has a similar function, std::list::remove_if, which allows you to specify some other predicate.
std::list::remove (which physically removes the elements) is required to be a member function as it needs to know about the memory structure (that is, it must update the previous and next pointers for each item that needs to be updated, and remove the items), and the entire function is done in linear time (a single iteration of the list can remove all of the requested elements without invalidating any of the iterators pointing to items that remain).
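The iterator guarantee mentioned above can be checked with a short sketch (the function name `list_remove_demo` is hypothetical):

```cpp
#include <cassert>
#include <iterator>
#include <list>

void list_remove_demo()
{
    std::list<int> values{10, 20, 10, 30};
    auto it = std::next(values.begin());  // points at the element 20

    values.remove(10);  // unlinks and destroys both nodes holding 10

    // The iterator to a surviving element is still valid and still refers
    // to the same node, which is now the first one.
    assert(*it == 20);
    assert(it == values.begin());
    assert(values.size() == 2);
}
```

Doing the same with a std::vector and the erase-remove idiom would invalidate iterators at or after the first removed position.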
You cannot physically remove a single element from a std::vector. You either reallocate the entire vector, or you move every element after the removed items and adjust the size member. The "cleanest" implementation of that set of operations would be to do
// within some instance of vector
void vector::remove(const T& t)
{
    // std::remove needs the range to work on as well as the value
    erase(std::remove(begin(), end(), t), end());
}
This would require std::vector to depend on <algorithm> (something that is currently not required), because the "sorting" done by std::remove is needed to remove the items without multiple allocations and copies. (You do not need to rearrange a list to physically remove elements.)
Contrary to what others are saying, it has nothing to do with speed. It has to do with the algorithm needing to know how the data is stored in memory.
As a side note: This is also a similar reason why std::remove (and friends) do not actually remove the items from the container they operate on; they just move all the ones that are not going to be removed to the "front" of the container. Without the knowledge of how to actually remove an object from a container, the generic algorithm cannot actually do the removing.
Consider the implementation details of both containers. A vector has to provide a contiguous memory block for storage. In order to remove an element at an index n other than the last (with N being the vector's length), all elements from n+1 to N-1 need to be moved. The various functions in the <algorithm> header implement that behavior, like std::remove or std::remove_if. The advantage of these being free-standing functions is that they can work for any type that offers the needed iterators.
A list on the other hand, is implemented as a linked list structure, so:
It's fast to remove an element from anywhere
It's impossible to do it as efficiently using iterators (since the internal structure has to be known and manipulated).
In general in STL the logic is "if it can be done efficiently - then it's a class member. If it's inefficient - then it's an outside function"
This way they make the distinction between "correct" (i.e. "efficient") use of classes vs. "incorrect" (inefficient) use.
For example, random access iterators have a += operator, while other iterators use the std::advance function.
And in this case - removing elements from an std::list is very efficient as you don't need to move the remaining values like you do in std::vector
It's all about efficiency AND reference/pointer/iterator validity. List items can be removed without disturbing any other pointers and iterators. This is not true for a vector and other containers in all but the most trivial cases. Nothing prevents you from using the external strategy, but you have a superior option. That said, another poster put it better than I could on a duplicate question:
The question is not why std::vector does not offer the operation, but
rather why does std::list offer it. The design of the STL is focused
on the separation of the containers and the algorithms by means of
iterators, and in all cases where an algorithm can be implemented
efficiently in terms of iterators, that is the option.
There are, however, cases where there are specific operations that can
be implemented much more efficiently with knowledge of the container.
That is the case of removing elements from a container. The cost of
using the remove-erase idiom is linear in the size of the container
(which cannot be reduced much), but that hides the fact that in the
worst case all but one of those operations are copies of the objects
(the only element that matches is the first), and those copies can
represent quite a big hidden cost.
By implementing the operation as a method in std::list the complexity
of the operation will still be linear, but the associated cost for
each one of the elements removed is very low, a couple of pointer
copies and releasing of a node in memory. At the same time, the
implementation as part of the list can offer stronger guarantees:
pointers, references and iterators to elements that are not erased do
not become invalidated in the operation.
Another example of an algorithm that is implemented in the specific
container is std::list::sort, that uses mergesort that is less
efficient than std::sort but does not require random-access iterators.
So basically, algorithms are implemented as free functions with
iterators unless there is a strong reason to provide a particular
implementation in a concrete container.
std::list is designed to work like a linked list. That is, it is designed (you might say optimized) for constant time insertion and removal ... but access is relatively slow (as it typically requires traversing the list).
std::vector is designed for constant-time access, like an array. So it is optimized for random access ... but insertion and removal are really only supposed to be done at the "tail" or "end", elsewhere they're typically going to be much slower.
Different data structures with different purposes ... therefore different operations.
To remove an element from a container you have to find it first. There's a big difference between sorted and unsorted vectors, so in general, it's not possible to implement an efficient remove method for the vector.

What is the efficient way to split a vector

Is there a constant-time way to split a vector, other than using the following?
std::vector<int> v_SplitVector(start , end);
This has a complexity of O(N), in this case O(end - start). Is there a constant-time operation to do this?
Or am I using the wrong container for the task?
"Splitting" a container like vector, whose elements sit in contiguous memory, necessarily requires copying or moving everything that needs to go to the other side.
Containers like list, whose elements each live in their own memory block, can be easily rearranged (see std::list::splice).
But having elements in non-contiguous memory may result in lower memory access performance due to more frequent cache misses.
In other words, the complexity of the algorithm may not be the only factor influencing performance: an infrequent linear copy may hurt you less than a frequent linear walk over dispersed elements.
The trade-off mostly depends on how the hardware manages caches, how the standard library implementation you are using takes care of that, and how well the compiler can optimize.
This is a copy rather than a split, hence the complexity. You can probably write a split for list which might perform better.
std::vector doesn't support the following, but if an efficient "split" operation is very important to you then you could perhaps write your own container. This would be quite a lot of work.
You could define "split" as follows:
removes an initial segment of the container, and returns a new container containing those elements. References to those elements continue to refer to the same elements in the new container. The old container contains the remaining elements. The capacity of the new container is equal to its size, and the capacity of the old container is reduced by the number of elements removed.
Then the old container and the new container would share a block of underlying storage (presumably with ref-counting). The new container would have to reallocate if you append to it (since the memory immediately at the end of its elements is in use), but so long as that happens rarely or never it could be a win.
Your example code takes a copy, though, it doesn't modify the original container. If a logical copy is a requirement then to do it without actually copying the elements you need either COW or immutable objects.
std::list has a splice() function that can move a range of elements from one list to another. This avoids copying the elements, but as of C++11 it is in effect guaranteed not to be O(1), because it needs to count how many elements it has moved. In C++03 implementations could choose whether they wanted this op to be O(1) or list::size() to be O(1), but in C++11 size() is required to be constant time for all containers.
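A minimal sketch of a splice-based split, assuming the goal is to move the first `count` elements into a new list. `split_front` is a hypothetical helper; note that finding the cut point is itself a linear walk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>

// Hypothetical helper: moves the first `count` nodes of `source` into a new
// list. No element is copied; the nodes are relinked.
std::list<int> split_front(std::list<int>& source, std::size_t count)
{
    std::list<int> front;
    auto cut = source.begin();
    // Walking to the cut point is linear in `count`.
    std::advance(cut, static_cast<std::ptrdiff_t>(std::min(count, source.size())));
    front.splice(front.end(), source, source.begin(), cut);
    return front;
}

void split_front_demo()
{
    std::list<int> source{1, 2, 3, 4, 5};
    const int* first_element = &source.front();

    std::list<int> front = split_front(source, 2);

    assert((front == std::list<int>{1, 2}));
    assert((source == std::list<int>{3, 4, 5}));
    // Same node as before: the element was moved between lists, not copied.
    assert(&front.front() == first_element);
}
```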
Comparing the performance of std::vector with std::list is usually about more than just one operation, though. You have to consider that list doesn't have random-access iterators, and so on.
Creating a new std::vector necessarily requires copying, since
vectors aren't allowed to share parts of their implementation.
A modification in the container from which you obtained start
and end shouldn't affect the values in splitVector.
What you can do, fairly easily, is create a View container,
which simply holds the two iterators, and maps all accesses
through them. Something along the lines of:
#include <cstddef>   // std::ptrdiff_t
#include <iterator>  // std::iterator_traits

template <typename Iterator>
class View
{
    Iterator myBegin;
    Iterator myEnd;
public:
    typedef typename std::iterator_traits<Iterator>::value_type value_type;
    // And the other usual typedefs, including...
    typedef Iterator iterator;
    View( Iterator begin, Iterator end )
        : myBegin( begin )
        , myEnd( end )
    {
    }
    iterator begin() { return myBegin; }
    iterator end() { return myEnd; }
    // Returns a copy of the element; requires random-access iterators.
    value_type operator[]( std::ptrdiff_t index ) { return *(myBegin + index ); }
    // ...
};
This requires a fair amount of boilerplate, because the
interface to something like vector is rather complete, but it's
all very straight forward and simple. The one thing you cannot
do with this, however, is modify the topology of either the
underlying container or of any View—anything which might
invalidate any iterators will, of course, wreak havoc.
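To show how such a view might be used for a constant-time "split", here is a compact, self-contained variant of the same idea. `RangeView` and `sum_of_tail` are hypothetical names for this sketch, and it assumes random-access iterators and that `mid` does not exceed the vector's size.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <vector>

// Minimal iterator-pair view; it owns nothing and aliases the container.
template <typename Iterator>
class RangeView
{
    Iterator first_;
    Iterator last_;
public:
    RangeView(Iterator first, Iterator last) : first_(first), last_(last) {}
    Iterator begin() const { return first_; }
    Iterator end() const { return last_; }
    std::size_t size() const
    {
        return static_cast<std::size_t>(std::distance(first_, last_));
    }
    // Returns a reference, so writes go through to the underlying container.
    typename std::iterator_traits<Iterator>::reference
    operator[](std::ptrdiff_t index) const { return *(first_ + index); }
};

// "Splits" v at `mid` in O(1) and sums the tail; no element is copied.
// Precondition (assumed): mid <= v.size().
int sum_of_tail(std::vector<int>& v, std::size_t mid)
{
    RangeView<std::vector<int>::iterator> tail(
        v.begin() + static_cast<std::ptrdiff_t>(mid), v.end());
    return std::accumulate(tail.begin(), tail.end(), 0);
}

void range_view_demo()
{
    std::vector<int> v{1, 2, 3, 4};
    assert(sum_of_tail(v, 2) == 7);
}
```

Any operation on the vector that invalidates iterators (such as push_back causing reallocation) also invalidates every view onto it.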
When adding or removing elements to/from a place other than the end, the vector must have complexity of at least O(n) due to the internal shifts required. The same follows when you want to not only remove, but move the elements out: for a vector, they must be copied, hence at least 1 op per element moved. That means that moving elements out of a vector is at least O(N), where N is the number of elements moved.
If you need near-constant time add/remove operations (be it adding/inserting one, or many elements) you should look at list/linkedlist containers, where all elements and sublists are easily 'detachable', especially if you know the pointer/iterator. Or trees, or any other dynamic structure.
Completely by the way, I sense what v_SplitVector does, but where did it come from? I do not remember such a function/method in stdlib or boost.

Inserting an element into vector in the middle

Is there a way of inserting/deleting an element from a vector other than the following?
The formal method of using 'push_back'
Using 'find()' in this way... find(v.begin(), v.end(), int)
I have read somewhere that inserting in the middle can be achieved by inclusive insertion/deletion.
So, is it really possible?
You can use std::vector::insert; however, note that this operation is linear in the size of the vector. If your code needs to perform insertions in the middle frequently, you may want to switch to a linked-list structure.
Yes, there is another way: you can use std::vector::insert() to insert an element at a specified position.
Because vectors use an array as their underlying storage, inserting elements in positions other than the vector end causes the container to move all the elements that were after position to their new positions. This is generally an inefficient operation compared to the one performed for the same operation by other kinds of sequence containers (such as std::list).
std::vector is a standard container, so you can apply standard STL algorithms to it.
vector::insert seems to be what you want.
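A short sketch of inserting and deleting in the middle by index. `insert_at` and `erase_at` are hypothetical helper names; std::vector::insert and std::vector::erase are the actual member functions doing the work.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: inserts `value` before position `index`.
// Every element from `index` onward is shifted one slot to the right.
void insert_at(std::vector<int>& v, std::size_t index, int value)
{
    assert(index <= v.size());
    v.insert(v.begin() + static_cast<std::ptrdiff_t>(index), value);
}

// Hypothetical helper: erases the element at position `index`.
// Every element after `index` is shifted one slot to the left.
void erase_at(std::vector<int>& v, std::size_t index)
{
    assert(index < v.size());
    v.erase(v.begin() + static_cast<std::ptrdiff_t>(index));
}

void insert_erase_demo()
{
    std::vector<int> v{1, 2, 4, 5};
    insert_at(v, 2, 3);
    assert((v == std::vector<int>{1, 2, 3, 4, 5}));
    erase_at(v, 0);
    assert((v == std::vector<int>{2, 3, 4, 5}));
}
```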

removing elements by value in C++ - does the preferred idiom really consist of a double negative?

I came across this answer to the question of removing elements by value in C++:
C++ Erase vector element by value rather than by position?
Basically:
vec.erase(std::remove(vec.begin(), vec.end(), valueToRemove), vec.end());
The answer makes sense, but isn't this bad style? The logic consists of a double negative... is there a cleaner way to do this?
Deleting an element from a collection consists of two steps:
Moving down all subsequent elements to fill in the holes created by matches
Marking the new end
With the C++ standard library, these are two separate functions, remove and erase, respectively.
One could certainly imagine an erase_if type of function which would be easier to use, but evidently the current code is considered good enough. Of course you can write your own erase_if.
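Such a hand-written wrapper might look like the sketch below. The name `erase_matching` is hypothetical and chosen to avoid clashing with any standard name; it simply packages the two steps for std::vector.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: erases every element for which pred returns true.
template <typename T, typename Predicate>
void erase_matching(std::vector<T>& v, Predicate pred)
{
    // Step 1: move the kept elements down; step 2: mark the new end.
    v.erase(std::remove_if(v.begin(), v.end(), pred), v.end());
}

void erase_matching_demo()
{
    std::vector<int> v{1, 2, 3, 4, 5, 6};
    erase_matching(v, [](int x) { return x % 2 == 0; });
    assert((v == std::vector<int>{1, 3, 5}));
}
```

At the call site the intent reads as a single operation, while the implementation is still the usual remove followed by erase.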
This isn't bad; in fact it is an efficient way of removing elements from a vector based on a condition in linear time. Watch this video from the 35th minute: STL explanation for the Erase and Remove Idiom
Remember that there are different types of containers: Contiguous vs node-based, and sequential vs associative.
Node-based containers allow efficient erase/insert. Sequential containers organize elements by insertion order (i.e. position), while associative containers arrange them by (key) value.
All current associative containers (map/set/unordered) are node-based, and with them you can erase elements directly, and you should use the element-wise member erase function directly. Lists are node-based sequence containers, so you can erase individual elements efficiently, but finding an element by value takes linear time, which is why lists offer a member remove function. Only the random-access sequence containers (vector and deque) have no easy way to erase elements by value, and that's where the free remove algorithm comes in, which first rearranges the sequence to then allow the container's member erase to perform an efficient erasure at the end of the container.
Unlike the many generic aspects of the standard library which work without any knowledge of the underlying container, the remove/erase idiom is one of those things which require a bit of detailed knowledge about the differences between the containers.