Efficiency of iterators in unordered_map (C++) - c++

I can't seem to find any information on this, so I turn to stackoverflow. How efficient are the iterators of std::tr1::unordered_map in C++? Especially compared to, say, list iterators. Would it make sense to make a wrapper class that also holds all the keys in a list to allow for efficient iteration (my code does use a lot of iteration over the keys in an unordered_map). For those who will recommend boost, I can't use it (for whatever reasons).

I haven't checked TR1, but N3035 (C++0x draft) says this:
All the categories of iterators
require only those functions that are
realizable for a given category in
constant time (amortized). Therefore,
requirement tables for the iterators
do not have a complexity column.
The standard isn't going to give an efficiency guarantee other than in terms of complexity, so you have no guaranteed comparison of list and unordered_map other than that they're both amortized constant time (i.e., linear time for a complete iteration over the container).
In practice, I'd expect an unordered_map iterator to be at least in the vicinity of list, unless your hashmap is very sparsely populated. There could be an O(number of buckets) term in the complexity of the complete iteration. But I've never looked at even one implementation specifically of unordered_map for C++, so I don't know what adornments to expect on a simplistic "array of linked lists" hashtable implementation. If you have a "typical" platform test it, if you're trying to write code that will definitely be the fastest possible on all C++ implementations then tough luck, you can't ;-)

The unordered_map iterator basically just has to walk over the internal tree structure of the hashtable. This just means doing some pointer following, and so should be pretty efficient. Of course, if you are walking an unordered_map a lot, you may be using the wrong data structure in the first place. In any case, the answer to this, as it is for all performance questions, is for you to time your specific code to see if it is fast enough.

Unfortunately, you can't say for sure if something is efficient enough unless you've tried it and measured the results. I can tell you that the standard library, TR1, and Boost classes have had tons of eyes on them. They're probably as fast as they're going to get for most common use cases. Walking an container is certainly a common use case.
With all that said, you need to ask yourself a few questions:
What's the clearest way to say what I want? It may be that writing a wrapper class adds unneeded complexity to your code. Make it correct first, then make it fast.
Can I afford the extra memory and time to maintain a list in parallel with the unordered_map?
Is unordered_map really the right data structure? If your most common use case is traversal from beginning to end, you might be better off with vector because the memory is guaranteed to be contiguous.

Answered by the benchmarks here https://stackoverflow.com/a/25027750/1085128
unordered_map is partway between vector and map for iteration. It is significantly faster than map.

Related

How to keep track of visited points in C++

I am doing a problem in c++ that has to keep track of points that are visited in a traversal. The point is basically,
struct Point {
int x;
int y;
};
My first thought to solving something like this would be to use something like
std::set<Point> visited_points;
or maybe
std::map<Point, bool> visited_points;
However, I am a beginner in c++, and I realized you have to implement a Compare, which I didn't know how to do. When I asked, I was told said that using a map was "overkill" in a problem like this. He said the better solution was to do something like
std::vector<std::vector<bool>> visited_points;
He said std::map was not the best solution, since using a vector was faster.
I'm wondering why using a double vector is better in terms of style and performance. Is it because implementing a Compare is hard for a Point? A double vector feels hacky to me, and I also think it looks uglier than using a set or map. Is it really the best way to approach this problem, or is there a better solution I don't know about?
If someone asks you, in abstract, "What is the best way of keeping track of objects I've visited?", then you would be forgiven for replying "Use an std::unordered_set<Object>" (usually called a hash table for languages other than C++). That's a nice simple answer and it is often correct if you don't know anything at all about the objects. After all, a hash lookup is (expected) O(1), and in practice is usually quite fast.
There are a few caveats, the biggest one being that you will need to be able to compute a hash for each object. The C++ standard library does not (yet) come with a framework for computing hashes of arbitrary objects, not even PODs, and rendering an object as a string in order to be able to take advantage of std::hash<std::basic_string> is usually way too much work (unless the object is already a string, of course).
If you can't figure out how to write a hash function for you object, you might then think about using an ordered associative container (aka a balanced BST). However, that is not a good idea. Not because it is difficult to write a comparison function. Writing comparison functions is usually trivial, particularly for PODs; you can leverage the fact that std::tuple implements a comparison function for every tuple whose element types are all comparable.
The real problem with ordered associative containers is that they are high overhead. Element access is slow: O(log n), not O(1), and the constant is not small either. And the bookkeeping data required to maintain the balanced tree is much larger than the two-pointer hash-table node (and even that is quite big for small objects). So ordered associative containers really only make sense if you need to be able to traverse them in order. Generally, "visited" maps don't need to be traversed at all -- they are just used for lookup.
Both ordered and unordered containers have another problem: the objects in the container are individual dynamic memory allocations (the API requires that references to the objects in the container must be stable), so over time the individual objects end up getting scattered across dynamic memory, leading to a lot of cache misses.
But, really, even before you start thinking about how easy (or difficult) it will be to hash your objects in order to keep them in a hash-set, you should think about the nature of the objects you are tracking. In particular, can they be easily indexed with a small(-ish) integer? If so, you could just use a vector of bits, one bit per possible object. That's an efficient representation, both for access speed (definitely O(1)) and for space, and it is optimal for memory caching.
If your objects are easily numbered then bit-vectors will be an attractive alternative. One bit per object is (literally) two orders of magnitude less space than a hash-map, so unless you expect your visited map to be extremely sparse (rarely the case in algorithms which need a visited map), it's going to be a big win.
In the case of your problem, which I gather has to do with keeping track of points visited in a rectangular array such as a gameboard or an image, it is clear that the bit vector approach is going to work out well. It's true that you require two levels of indexing (unless you reduce the two indices into a single integer, which is quite easy if you know the dimensions), but that doesn't add much overhead.
Although there are doubts about how good an idea it was, the C++ standard library special cases std::vector<bool> to really be a bit vector. That makes it impossible to create a native pointer to a single element of the vector (which is why many people consider std::vector<bool> to be a hack), and creates some other odd issues when you try to use it as a vector. But if all you want is a bitmask -- as in the case of a visited map -- then it is a pretty good solution.
C++ also offers real bit vectors -- std::bitset -- but unfortunately these need to have their size known at compile time. Boost offers dynamic_bitset, which is a kind of std::vector<bool> written with hindsight, so it's also worth looking at.

std::vector::insert vs std::list::operator[]

I know that std::list::operator[] is not implemented as it has bad performance. But what about std::vector::insert it is as much inefficient as much std::list::operator[] is. What is the explanation behind?
std::vector::insert is implemented because std::vector has to meet requirements of SequenceContainer concept, while operator[] is not required by any concepts (that I know of), possible that will be added in ContiguousContainer concept in c++17. So operator[] added to containers that can be used like arrays, while insert is required by interface specification, so containers that meet certain concept can be used in generic algorithms.
I think Slava's answer does the best job, but I'd like to offer a supplementary explanation. Generally with data structures, it's far more common to have more accesses than insertions, than vice versa. There are many data structures that may try to optimize access at the cost of insertion, but the inverse is much more rare, because it tends to be less useful in real life.
operator[], if implemented for a linked list, would access an existing value presumably (similar to how it does for vector). insert adds new values. It's much more likely you will be willing to take a big performance hit to insert some new elements into a vector, provided that the subsequent accesses are extremely fast. In many cases, element insertion may be outside of the critical path entirely whereas the critical path consists of a single traversal, or random access of already-present data. So it's simply convenient to have insert take care of the details for you in that case (it's actually a bit annoying to write efficiently and correctly). This is actually a not-uncommon use of a vector.
On the other hand, using operator[] on a linked list would almost always be a sign that you are using the wrong data structure.
std::list::operator[] would require an O(N) traversal and is not really in accordance with what a list is designed to do. If you need operator[] then use a different container type. When C++ folk see a [] they assume an O(1) (or, at worse, an O(Log N)) operation. Supplying [] for a list would break that.
But although std::vector::insert is also O(N), it can be optimised: an at-end insertion can be readily optimised by having the vector's capacity grow in large chunks. An insertion in the middle requires an element-by-element move, but that again can be performed very quickly on modern chipsets.
The [] operator is inherited from plain arrays. It has always been understood as a fast (sub linear time) accessor of the underlying container. Since list is does not support sub linear time access, it does not make sense for it to implement the operator.
auto a = Container [10]; // Ideally I can assume this is quick
The equivalent of your std::list <>::operator [] is std::next <std::list<>::iterator>. Documented at cpp-reference.
auto a = *std::next (Container.begin (), 10); // This may take a little while
This is the truly generic way index a container. If the container supports random access, it will be constant time, other wise it will be linear.

Should I use std::set or std::unordered_set for a set of pointers?

I have a set of pointers. In the first step, I insert data pointers, and in the second step, I iterate over the whole set and do something with the elements. The order is not important, I just need to avoid duplicates, which works fine with pointer comparison.
My question is, whether it might be advantageous to use an unordered set for the same purpose. Is insertion faster for an unordered set?
As Ami Tavory commented, if you don't need order, then it's usually best to go for unordered containers. The reason being that if order somehow improved performance, unordered containers would still be free to use it, and hence get the same or better complexity anyhow.
A downside of unordered collections is that they usually require a hash function for the key type. If it's too hard or expensive to make one, then containers which don't use hashes might be better.
In C++'s standard library, the average insertion complexity for std::set is O(log(N)), whereas for std::unordered_set it's O(1). Aside from that, there are probably less cache misses on average when using std::unordered_set.
At the end of the day though, this is just theory. You should try something that sounds good enough and profile it to see if it really is.

Prefer unordered_set over vector

Is it safe to say that if I don't want duplicates in my container, and I don't care about element position as I only want to iterate through the container, then I should use an unordered_set instead of vector?
Is it safe to say that if I don't want duplicates in my container, and I don't care about element position as I only want to iterate through the container, then I should use an unordered_set instead of vector?
No, it is not. It depends on many factors. For example if you seldom add new elements but iterate over container quite often it would be preferable to use std::vector and maintain uniqueness manually. There also could be other factors affecting your decision. But normally yes you may prefer std::unordered_set as it simplifies your program.
Not entirely. unordered_sets are not required to be contiguous containers; in the case where you'd frequently want to read all numerous values contained in the set, you may prefer std::vector on time-critic application.
std::unordered_set:
Internally, the elements are not sorted in any particular order, but organized into buckets. Which bucket an element is placed into depends entirely on the hash of its value. This allows fast access to individual elements, since once a hash is computed, it refers to the exact bucket the element is placed into.
But in the general case, I'd say Yes.
I generally prefer vector or map. (or in your case, std::set).
Hash tables can be faster than maps/sets (red-black trees), but red-black trees have guaranteed performance 100% of the time. And logarithmic performance is REALLY fast! A hash table kan kill performance when it starts rehashing.
std::vector is the workhorse of the STL and should be your default choice. Vector is very straightforward, and is very cache-friendly
This article by Matt Austern is related to this topic and it is worth reading:
Why you shouldn't use set (and what you should use instead) by Matt Austern
This thread is trying to identify conditions under which unordered_set is preferable over vectors. Similarly, in the above article, the author clearly identifies four conditions, which all need to be satisfied in order to prefer set over a custom but simpler data structure called sorted_vector (last section: What is set good for?). It will be interesting to clearly state a set of conditions for preferring unordered_set over vector.
also, the last paragraph of the article summarizes a useful rule to keep in mind:
Every component in the standard C++ library is there because it's useful for some purpose, but sometimes that purpose is narrowly defined and rare. As a general rule you should always use the simplest data structure that meets your needs. The more complicated a data structure, the more likely that it's not as widely useful as it might seem.
Of course yes. If you do not want duplicates, you have to use a key-aware container, and since unordered_* totally win over their tree-based counterparts, this is pretty much your only choice.

sort() function in C++

I found two forms of sort() in C++:
1) sort(begin, end);
2) XXX.sort();
One can be used directly without an object, and one is working with an object.
Is that all? What's the differences between these two sort()? They are from the same library or not? Is the second a method of XXX?
Can I use it like this
vector<int> myvector
myvector.sort();
or
list<int> mylist;
mylist.sort();
std::sort is a function template that works with any pair of random-access iterators. Consequently, the algorithm implemented by std::sort is tailored (optimized) for random access. Most of the time it is going to be some flavor of quick-sort.
std::list::sort is a dedicated list-oriented version of sort. std::list is a container that does not support [efficient] random access, which means that you can't use std::sort with it. This creates the need for a dedicated sorting algorithm. Most of the time it will be implemented as some flavor of merge-sort.
std::list includes a sort() member, because there are ways to do sorting efficiently that work for lists, but almost nothing else. If you tried to use the same sorting implementation for a list as everything else, one of them wouldn't work very well, so they provide a special one for list.
Most of the time, you should use std::sort, and forget that std::list even exists. It's useful in a few very specific circumstances, but they're sufficiently rare that using it is a mistake at least 90% of the time that people think they should.
Edit: The situation is not that using std::list::sort is a mistake, but that using std::list is usually a mistake. There are a couple of reasons for this. First of all, std::list requires a couple of pointers in addition to space for each item you store, so unless your data is fairly large, it can impose a substantial space overhead. Second, since each node is allocated individually, traversing nodes tends to be fairly slow (locality of reference is low, so cache utilization tends to be poor). Finally, although you can insert or delete an element from the middle of a list with constant complexity, you (normally) need to use a linear-complexity traversal of the list to find the right point at which to insert or delete that item. Couple that with the poor cache utilization, and you get insertions and deletions in the middle of a linked list usually being slower than with something like a vector or deque.
The other 10% of the time (or maybe somewhat less than that) is when you can reasonably store a position in the list in the long term, and (at least typically) plan on making a number of insertions and deletions at (close to) that same point, without traversing the linked list each time to find the right place. If you make enough modifications to the list at (or close to) the same place, you can amortize the list traversal over a number of insertions and deletions. If you make enough modifications at (close to) the same place, you can come out ahead.
That can work out well in a few cases, but forces you to store more "state" data for a long period of time, mostly to make up for the deficiencies in the data structure itself. To the extent possible, I prefer to make individual operations close to pure and atomic, so they minimize dependence (and effect) on long-term data. In the single-threaded case, this mostly helped modularity, but with multi-core, multithreaded (etc.) programming, it can also have a substantial effect on efficiency and difficulty of development. Access to that long-term data generally requires thread synchronization.
That's not to say that avoiding list will automatically make your programs lots faster, or anything like that -- just that I find situations where it's really useful to be quite rare, and a lot of situations that seem to fit what people have been told about when a linked list will be useful, really aren't.
std::sort only works with random-access iterators. That limits it to arrays, vector and deque (and user-defined types which provide random access).
std::list is a linked list which doesn't support random access. The generic sort function doesn't work for it. And even if it did, it would be suboptimal. A linked list is not sorted by copying data from node to node - it is sorted by relinking the nodes.
std::sort also wouldn't work for other containers (sets and maps), but those are always sorted anyway.