How does sorting in c++ work? - c++

I'm a Java developer. I'm currently learning C++. I've been looking at code samples for sorting. In Java, one normally gives a sorting method the container it needs to sort e.g
sort(Object[] someArray)
I've noticed in C++ that you pass two args, the start and end of the container. My question is that how is the actual container accessed then?
Here's sample code taken from Wikipedia illustrating the the sort method
#include <iostream>
#include <algorithm>
#include <vector>
int main() {
std::vector<int> vec;
vec.push_back(10); vec.push_back(5); vec.push_back(100);
std::sort(vec.begin(), vec.end());
for (int i = 0; i < vec.size(); ++i)
std::cout << vec[i] << ' ';
}

vec.begin() and vec.end() are returning iterators iterators. The iterators are kind of pointers on the elements, you can read them and modify them using iterators. That is what sort is doing using the iterators.
If it is an iterator, you can directly modify the object the iterator is referring to:
*it = X;
The sort function does not have to know about the containers, which is the power of the iterators. By manipulating the pointers, it can sort the complete container without even knowing exactly what container it is.
You should learn about iterators (http://www.cprogramming.com/tutorial/stl/iterators.html)

vec.begin() and vec.end() do not return the first and last elements of the vector. They actually return what is known as an iterator. An iterator behaves very much like a pointer to the elements. If you have an iterator i that you initialised with vec.begin(), you can get a pointer to the second element in the vector just by doing i++ - the same as you would if you had a point to the first element in an array. Likewise you can do i-- to go backwards. For some iterators (known as random access iterators), you can even do i + 5 to get an iterator to the 5th element after i.
This is how the algorithm accesses the container. It knows that all of the elements that it should be sorting are between begin() and end(). It navigates around the elements by doing simple iterator operations. It can then modify the elements by doing *i, which gives the algorithm a reference to the element that i is pointing at. For example, if i is set to vec.begin(), and you do *i = 5;, you will change the value of the first element of vec.
This approach allows you to pass only part of a vector to be sorted. Let's say you only wanted to sort the first 5 elements of your vector. You could do:
std::sort(vec.begin(), vec.begin() + 5);
This is very powerful. Since iterators behave very much like pointers, you can actually pass plain old pointers too. Let's say you have an array int array[] = {4, 3, 2, 5, 1};, you could easily call std::sort(array, array + 5) (because the name of an array will decay to a pointer to its first element).

The container doesn't have to be accessed. That's the whole point of the design behind the Standard Template Library (which became part of the C++ standard library): The algorithms don't know anything about containers, just iterators.
This means they can work with anything that provides a pair of iterators. Of course all STL containers provide begin() and end() methods, but you can also use a regular old C array, or an MFC or glib container, or anything else, just by writing your own iterators for it. (And for C arrays, it's as simple as a and a+a_len for the begin and end iterators.)
As for how it works under the covers: Iterators follow an implicit protocol: you can do things like ++it to advance an iterator to the next element, or *it to get the value of the current element, or *it = 3 to set the value of the current element. (It's a bit more complicated than this, because there are a few different protocols—iterators can be random-access or forward-only, const or writable, etc. But that's the basic idea.) So, if `sort is coded to restrict itself to the iterator protocol (and, of course, it is), it works with anything that conforms to that protocol.
To learn more, there are many tutorials on the internet (and in the bookstore); there's only so much an SO answer can explain.

begin() and end() return iterators. See e.g. http://www.cprogramming.com/tutorial/stl/iterators.html

Iterators act like references into part of a container. That is, *iter = z; actually changes one of the elements in the container.
std::sort actually uses a swap function on references to the contained objects, so that any iterators you have already initialized remain in the same order but the values those iterators refer to are changed.
Note that std::list also has member functions called sort. It works the other way around: any iterators you have already initialized keep the same values, but the order of those iterators changes.

Related

What is the advantage of using (it != vector.end()) instead of (it < vector.end()) in for loops? [duplicate]

I'm used to writing loops like this:
for (std::size_t index = 0; index < foo.size(); index++)
{
// Do stuff with foo[index].
}
But when I see iterator loops in others' code, they look like this:
for (Foo::Iterator iterator = foo.begin(); iterator != foo.end(); iterator++)
{
// Do stuff with *Iterator.
}
I find the iterator != foo.end() to be offputting. It can also be dangerous if iterator is incremented by more than one.
It seems more "correct" to use iterator < foo.end(), but I never see that in real code. Why not?
All iterators are equality comparable. Only random access iterators are relationally comparable. Input iterators, forward iterators, and bidirectional iterators are not relationally comparable.
Thus, the comparison using != is more generic and flexible than the comparison using <.
There are different categories of iterators because not all ranges of elements have the same access properties. For example,
if you have an iterators into an array (a contiguous sequence of elements), it's trivial to relationally compare them; you just have to compare the indices of the pointed to elements (or the pointers to them, since the iterators likely just contain pointers to the elements);
if you have iterators into a linked list and you want to test whether one iterator is "less than" another iterator, you have to walk the nodes of the linked list from the one iterator until either you reach the other iterator or you reach the end of the list.
The rule is that all operations on an iterator should have constant time complexity (or, at a minimum, sublinear time complexity). You can always perform an equality comparison in constant time since you just have to compare whether the iterators point to the same object. So, all iterators are equality comparable.
Further, you aren't allowed to increment an iterator past the end of the range into which it points. So, if you end up in a scenario where it != foo.end() does not do the same thing as it < foo.end(), you already have undefined behavior because you've iterated past the end of the range.
The same is true for pointers into an array: you aren't allowed to increment a pointer beyond one-past-the-end of the array; a program that does so exhibits undefined behavior. (The same is obviously not true for indices, since indices are just integers.)
Some Standard Library implementations (like the Visual C++ Standard Library implementation) have helpful debug code that will raise an assertion when you do something illegal with an iterator like this.
Short answer: Because Iterator is not a number, it's an object.
Longer answer: There are more collections than linear arrays. Trees and hashes, for example, don't really lend themselves to "this index is before this other index". For a tree, two indices that live on separate branches, for example. Or, any two indices in a hash -- they have no order at all, so any order you impose on them is arbitrary.
You don't have to worry about "missing" End(). It is also not a number, it is an object that represents the end of the collection. It doesn't make sense to have an iterator that goes past it, and indeed it cannot.

Is it!=container.end() design mistake/feature or just necessity?

recently I was thinking about how it would be nice if iterators implicitly converted to bool so you could do
auto it = find(begin(x),end(x), 42);
if (it) //not it!=x.end();
{
}
but thinking about it I realized that this would mean that either it would had to be set to "NULL", so that you couldnt use the it directly if you want to do something with it (you would have to use x.end()) or you could use it but iter size would have to be bigger(to store if what it points to was .end() or not).
So my questions are:
Is the syntax in my example achievable without breaking current code and without increasing the sizeof iterator?
would implicitly conversion to bool cause some problems?
You are working on the assumption that iterators are a way of accessing a container. They allow you to do that, but they also allow many more things that would clearly not fit your intended operations:
auto it = std::find(std::begin(x), std::next(std::begin(x),10), 42 );
// Is 42 among the first 10 elements of 'x'?
auto it = std::find(std::istream_iterator<int>(std::cout),
std::istream_iterator<int>(), 42 );
// Is 42 one of the numbers from standard input?
In the first case the iterator does refer to a container, but the range where you are finding does not enclose the whole container, so it cannot be tested against end(x). In the second case there is no container at all.
Note that an efficient implementation of an iterator for many containers holds just a pointer, so any other state would increase the size of the iterator.
Regarding conversions to any-type or bool, they do cause many problems, but they can be circumvented in C++11 by means of explicit conversions, or in C++03 by using the safe-bool idiom.
You are probably more interested on a different concept: ranges. There are multiple approaches to ranges, so it is not so clear what the precise semantics should be. The first two that come to mind are Boost.Iterator and an article by Alexandrescu I read recently called On Iteration.
Two reasons this wouldn't work:
First, it's possible to use raw pointers as iterators (usually into an array):
int data[] = { 50, 42, 37, 5 };
auto it = find(begin(data), end(data), 42);
Second, you don't have to pass the actual end of the container to find; to e.g. find the first space character before a period:
auto sentence = "Hello, world.";
auto it1 = find(begin(sentence), end(sentence), '.');
auto it2 = find(begin(sentence), it1, ' ');
There's very little that can be done with a single iterator. A pair of iterators defines a sequence consisting of the elements; the first iterator points to the first element and the second iterator points one past the end of the last element. There's no way, in general, for the first iterator to know when it's been incremented to match the second iterator. Algorithms do that, because they have both iterators and can tell when the work is done. For example:
std::vector<int> vec;
vec.push_back(1);
vec.push_back(2);
vec.push_back(3);
// copy the contents of the vector:
std::copy(somewhere, vec.begin(), vec.end());
// copy the first two elements of the vector:
std::copy(somewhere, vec.begin(), vec.begin() + 2);
In both calls to copy, vec.begin() is the same iterator; the algorithm does different things because it got the second iterator that tells it when to stop.
Granted, it's possible to design a different kind of iterator that contains both the beginning and the end of the sequence (as Java does), but that's not how C++ iterators are designed. There's discussion of standardizing the notion of a "range" that holds two iterators (the new range-based for loop is a first step toward that).
Well, you probably don't want an implicit conversion, but requiring two
separate objects to determine when iteration is done is clearly a design
error. It's not so much because of the if or the for (although
using a single iterator would make these clearer as well); it's really
because it makes functional decomposition and filtering iterators
extreamly difficult, if not impossible.
Fundamentally, STL iterators are closer to smart pointers than they are
to iterators. There are times when such pointers are appropriate, but
they aren't a good replacement for iterators.

vector::erase and reverse_iterator

I have a collection of elements in a std::vector that are sorted in a descending order starting from the first element. I have to use a vector because I need to have the elements in a contiguous chunk of memory. And I have a collection holding many instances of vectors with the described characteristics (always sorted in a descending order).
Now, sometimes, when I find out that I have too many elements in the greater collection (the one that holds these vectors), I discard the smallest elements from these vectors some way similar to this pseudo-code:
grand_collection: collection that holds these vectors
T: type argument of my vector
C: the type that is a member of T, that participates in the < comparison (this is what sorts data before they hit any of the vectors).
std::map<C, std::pair<T::const_reverse_iterator, std::vector<T>&>> what_to_delete;
iterate(it = grand_collection.begin() -> grand_collection.end())
{
iterate(vect_rit = it->rbegin() -> it->rend())
{
// ...
what_to_delete <- (vect_rit->C, pair(vect_rit, *it))
if (what_to_delete.size() > threshold)
what_to_delete.erase(what_to_delete.begin());
// ...
}
}
Now, after running this code, in what_to_delete I have a collection of iterators pointing to the original vectors that I want to remove from these vectors (overall smallest values). Remember, the original vectors are sorted before they hit this code, which means that for any what_to_delete[0 - n] there is no way that an iterator on position n - m would point to an element further from the beginning of the same vector than n, where m > 0.
When erasing elements from the original vectors, I have to convert a reverse_iterator to iterator. To do this, I rely on C++11's §24.4.1/1:
The relationship between reverse_iterator and iterator is
&*(reverse_iterator(i)) == &*(i- 1)
Which means that to delete a vect_rit, I use:
vector.erase(--vect_rit.base());
Now, according to C++11 standard §23.3.6.5/3:
iterator erase(const_iterator position); Effects: Invalidates
iterators and references at or after the point of the erase.
How does this work with reverse_iterators? Are reverse_iterators internally implemented with a reference to a vector's real beginning (vector[0]) and transforming that vect_rit to a classic iterator so then erasing would be safe? Or does reverse_iterator use rbegin() (which is vector[vector.size()]) as a reference point and deleting anything that is further from vector's 0-index would still invalidate my reverse iterator?
Edit:
Looks like reverse_iterator uses rbegin() as its reference point. Erasing elements the way I described was giving me errors about a non-deferenceable iterator after the first element was deleted. Whereas when storing classic iterators (converting to const_iterator) while inserting to what_to_delete worked correctly.
Now, for future reference, does The Standard specify what should be treated as a reference point in case of a random-access reverse_iterator? Or this is an implementation detail?
Thanks!
In the question you have already quoted exactly what the standard says a reverse_iterator is:
The relationship between reverse_iterator and iterator is &*(reverse_iterator(i)) == &*(i- 1)
Remember that a reverse_iterator is just an 'adaptor' on top of the underlying iterator (reverse_iterator::current). The 'reference point', as you put it, for a reverse_iterator is that wrapped iterator, current. All operations on the reverse_iterator really occur on that underlying iterator. You can obtain that iterator using the reverse_iterator::base() function.
If you erase --vect_rit.base(), you are in effect erasing --current, so current will be invalidated.
As a side note, the expression --vect_rit.base() might not always compile. If the iterator is actually just a raw pointer (as might be the case for a vector), then vect_rit.base() returns an rvalue (a prvalue in C++11 terms), so the pre-decrement operator won't work on it since that operator needs a modifiable lvalue. See "Item 28: Understand how to use a reverse_iterator's base iterator" in "Effective STL" by Scott Meyers. (an early version of the item can be found online in "Guideline 3" of http://www.drdobbs.com/three-guidelines-for-effective-iterator/184401406).
You can use the even uglier expression, (++vect_rit).base(), to avoid that problem. Or since you're dealing with a vector and random access iterators: vect_rit.base() - 1
Either way, vect_rit is invalidated by the erase because vect_rit.current is invalidated.
However, remember that vector::erase() returns a valid iterator to the new location of the element that followed the one that was just erased. You can use that to 're-synchronize' vect_rit:
vect_rit = vector_type::reverse_iterator( vector.erase(vect_rit.base() - 1));
From a standardese point of view (and I'll admit, I'm not an expert on the standard): From §24.5.1.1:
namespace std {
template <class Iterator>
class reverse_iterator ...
{
...
Iterator base() const; // explicit
...
protected:
Iterator current;
...
};
}
And from §24.5.1.3.3:
Iterator base() const; // explicit
Returns: current.
Thus it seems to me that so long as you don't erase anything in the vector before what one of your reverse_iterators points to, said reverse_iterator should remain valid.
Of course, given your description, there is one catch: if you have two contiguous elements in your vector that you end up wanting to delete, the fact that you vector.erase(--vector_rit.base()) means that you've invalidated the reverse_iterator "pointing" to the immediately preceeding element, and so your next vector.erase(...) is undefined behavior.
Just in case that's clear as mud, let me say that differently:
std::vector<T> v=...;
...
// it_1 and it_2 are contiguous
std::vector<T>::reverse_iterator it_1=v.rend();
std::vector<T>::reverse_iterator it_2=it_1;
--it_2;
// Erase everything after it_1's pointee:
// convert from reverse_iterator to iterator
std::vector<T>::iterator tmp_it=it_1.base();
// but that points one too far in, so decrement;
--tmp_it;
// of course, now tmp_it points at it_2's base:
assert(tmp_it == it_2.base());
// perform erasure
v.erase(tmp_it); // invalidates all iterators pointing at or past *tmp_it
// (like, say it_2.base()...)
// now delete it_2's pointee:
std::vector<T>::iterator tmp_it_2=it_2.base(); // note, invalid iterator!
// undefined behavior:
--tmp_it_2;
v.erase(tmp_it_2);
In practice, I suspect that you'll run into two possible implementations: more commonly, the underlying iterator will be little more than a (suitably wrapped) raw pointer, and so everything will work perfectly happily. Less commonly, the iterator might actually try to track invalidations/perform bounds checking (didn't Dinkumware STL do such things when compiled in debug mode at one point?), and just might yell at you.
The reverse_iterator, just like the normal iterator, points to a certain position in the vector. Implementation details are irrelevant, but if you must know, they both are (in a typical implementation) just plain old pointers inside. The difference is the direction. The reverse iterator has its + and - reversed w.r.t. the regular iterator (and also ++ and --, > and < etc).
This is interesting to know, but doesn't really imply an answer to the main question.
If you read the language carefully, it says:
Invalidates iterators and references at or after the point of the erase.
References do not have a built-in sense of direction. Hence, the language clearly refers to the container's own sense of direction. Positions after the point of the erase are those with higher indices. Hence, the iterator's direction is irrelevant here.

Why is "!=" used with iterators instead of "<"?

I'm used to writing loops like this:
for (std::size_t index = 0; index < foo.size(); index++)
{
// Do stuff with foo[index].
}
But when I see iterator loops in others' code, they look like this:
for (Foo::Iterator iterator = foo.begin(); iterator != foo.end(); iterator++)
{
// Do stuff with *Iterator.
}
I find the iterator != foo.end() to be offputting. It can also be dangerous if iterator is incremented by more than one.
It seems more "correct" to use iterator < foo.end(), but I never see that in real code. Why not?
All iterators are equality comparable. Only random access iterators are relationally comparable. Input iterators, forward iterators, and bidirectional iterators are not relationally comparable.
Thus, the comparison using != is more generic and flexible than the comparison using <.
There are different categories of iterators because not all ranges of elements have the same access properties. For example,
if you have an iterators into an array (a contiguous sequence of elements), it's trivial to relationally compare them; you just have to compare the indices of the pointed to elements (or the pointers to them, since the iterators likely just contain pointers to the elements);
if you have iterators into a linked list and you want to test whether one iterator is "less than" another iterator, you have to walk the nodes of the linked list from the one iterator until either you reach the other iterator or you reach the end of the list.
The rule is that all operations on an iterator should have constant time complexity (or, at a minimum, sublinear time complexity). You can always perform an equality comparison in constant time since you just have to compare whether the iterators point to the same object. So, all iterators are equality comparable.
Further, you aren't allowed to increment an iterator past the end of the range into which it points. So, if you end up in a scenario where it != foo.end() does not do the same thing as it < foo.end(), you already have undefined behavior because you've iterated past the end of the range.
The same is true for pointers into an array: you aren't allowed to increment a pointer beyond one-past-the-end of the array; a program that does so exhibits undefined behavior. (The same is obviously not true for indices, since indices are just integers.)
Some Standard Library implementations (like the Visual C++ Standard Library implementation) have helpful debug code that will raise an assertion when you do something illegal with an iterator like this.
Short answer: Because Iterator is not a number, it's an object.
Longer answer: There are more collections than linear arrays. Trees and hashes, for example, don't really lend themselves to "this index is before this other index". For a tree, two indices that live on separate branches, for example. Or, any two indices in a hash -- they have no order at all, so any order you impose on them is arbitrary.
You don't have to worry about "missing" End(). It is also not a number, it is an object that represents the end of the collection. It doesn't make sense to have an iterator that goes past it, and indeed it cannot.

C++ vector insights

I am a little bit frustrated of how to use vectors in C++. I use them widely though I am not exactly certain of how I use them. Below are the questions?
If I have a vector lets say: std::vector<CString> v_strMyVector, with (int)v_strMyVector.size > i can I access the i member: v_strMyVector[i] == "xxxx"; ? (it works, though why?)
Do i always need to define an iterator to acces to go to the beginning of the vector, and lop on its members ?
What is the purpose of an iterator if I have access to all members of the vector directly (see 1)?
Thanks in advance,
Sun
It works only because there's no bounds checking for operator[], for performance reason. Doing so will result in undefined behavior. If you use the safer v_strMyVector.at(i), it will throw an OutOfRange exception.
It's because the operator[] returns a reference.
Since vectors can be accessed randomly in O(1) time, looping by index or iterator makes no performance difference.
The iterator lets you write an algorithm independent of the container. This iterator pattern is used a lot in the <algorithm> library to allow writing generic code easier, e.g. instead of needing N members for each of the M containers (i.e. writing M*N functions)
std::vector<T>::find(x)
std::list<T>::find(x)
std::deque<T>::find(x)
...
std::vector<T>::count(x)
std::list<T>::count(x)
std::deque<T>::count(x)
...
we just need N templates
find(iter_begin, iter_end, x);
count(iter_begin, iter_end, x);
...
and each of the M container provide the iterators, reducing the number of function needed to just M+N.
It returns a reference.
No,, because vector has random access. However, you do for other types (e.g. list, which is a doubly-linked list)
To unify all the collections (along with other types, like arrays). That way you can use algorithms like std::copy on any type that meets the requirements.
Regarding your second point, the idiomatic C++ way is not to loop at all, but to use algorithms (if feasible).
Manual looping for output:
for (std::vector<std::string>::iterator it = vec.begin(); it != end(); ++it)
{
std::cout << *it << "\n";
}
Algorithm:
std::copy(vec.begin(), vec.end(),
std::ostream_iterator<std::string>(std::cout, "\n"));
Manual looping for calling a member function:
for (std::vector<Drawable*>::iterator it = vec.begin(); it != end(); ++it)
{
(*it)->draw();
}
Algorithm:
std::for_each(vec.begin(), vec.end(), std::mem_fun(&Drawable::draw));
Hope that helps.
Workd because the [] operator is overloaded:
reference operator[](size_type n)
See http://www.sgi.com/tech/stl/Vector.html
Traversing any collection in STL using iterator is a de facto.
I think one advantage is if you replace vector by another collection, all of your code would continue to work.
That's the idea of vectors, they provide direct access to all items, much as regular arrays. Internally, vectors are represented as dynamically allocated, contiguous memory areas. The operator [] is defined to mimic semantics of the regular array.
Having an iterator is not really required, you may as well use an index variable that goes from 0 to v_strMtVector.size()-1, as you would do with regular array:
for (int i = 0; i < v_strMtVector.size(); ++i) {
...
}
That said, using an iterator is considered to be a good style by many, because...
Using an iterator makes it easier to replace underlying container type, e.g. from std::vector<> to std::list<>. Iterators may also be used with STL algorithms, such as std::sort().
std::vector is a type of sequence that provides constant time random access. You can access a reference to any item by reference in constant time but you pay for it when inserting into and deleting from the vector as these can be very expensive operations. You do not need to use iterators when accessing the contents of the vector, but it does support them.