Boost provides boost::container::set/map/multiset/multimap, whose underlying balanced binary search tree (BST) can be configured; in particular, it can be chosen to be an AVL tree.
One reason (maybe the most crucial one) why one would prefer AVL trees over red-black trees is merge and split operations with O(log N) complexity. Surprisingly, however, it seems boost::container doesn't provide these operations. The documentation describes merge as an element-wise operation of O(N log N) complexity (regardless of the underlying BST implementation!?), and it doesn't even mention split!
I can't speak for merge, but as for split, I assume its absence might be justified by the constant-time size() requirement: an O(log N) split can't know the sizes of the two resulting parts. But this could be fixed with an intrusive container that stores a subtree node count in each node.
There is also boost::intrusive::avl_set, but I couldn't find the AVL merge and split algorithms in the documentation.
So the questions are:
Is there a fully functional, ready-to-go AVL-based implementation of set/map/multiset/multimap that provides merge and split operations with O(log N) complexity?
If not, how can I build one using boost::intrusive::avl_set?
Related
Inspired by this question: Why isn't std::set just called std::binary_tree? I came up with one of my own. Is a red-black tree the only possible data structure fulfilling the requirements of std::set, or are there others? For instance, another self-balancing tree, the AVL tree, seems to be a good alternative with very similar properties. Is it theoretically possible to replace the underlying data structure of std::set, or is there a group of requirements that makes the red-black tree the only viable choice?
AVL trees have worse performance (not to be confused with asymptotic complexity) than RB trees in most real world situations. You can base std::set on AVL trees and be fully standard-compliant, but it will not win you any customers.
This question already has answers here:
Binary search tree over AVL tree
(4 answers)
Closed 7 years ago.
It seems to me like an AVL tree is always more efficient than a plain BST. So why do people still use BSTs? Are there costs incurred in an AVL implementation?
AVL Trees have their advantages and disadvantages over other trees like Red-Black trees or 2-3 Trees or just plain BST.
AVL Trees:
Advantages:
Simple: fairly easy to code and understand.
Extra storage: fairly minimal, 2 bits per node (to store the balance factor +1, 0, or -1). There is also a trick where you can get down to 1 bit per node by borrowing a single bit from each child.
The constant for lookup (look in your favorite analysis book: Knuth, etc.) is 1.5 (so 1.5 log n). Red-black trees have a constant of 2 (so 2 log n for a lookup).
Disadvantages:
Deletions are expensive-ish. Deleting a node is still logarithmic, but you may have to rotate all the way up to the root of the tree. In other words, a bigger constant. Red-black trees only have to do a constant number of rotations (at most three per deletion).
Not simple to code. They are probably the "simplest" of the tree family, but there are still a lot of corner cases.
If you expect your data to be sorted, a BST devolves into a linked list. BUT if you expect your data to be fairly random, "on average", all of your operations for a BST (lookup, deletion, insertion) will be about logarithmic. It's VERY easy to code up BSTs: AVL trees, although fairly straightforward to code up, have a lot of corner cases and testing can be tricky.
In summary, plain Binary Search Trees are easy to code and get right, and if your data is fairly random, should perform very well (on average, all operations would be logarithmic). AVL Tree are harder to code, but guarantee logarithmic performance, at the price of some extra space and more complex code.
If we insert random integers into a std::set and then read the set back, we get an ordered sequence. Basically, we have implicit sorting. But what kind of sorting algorithm do we have here? Is it heapsort?
At least normally, it's a tree sort. That is, the items are inserted into a balanced binary search tree (usually a red-black tree), and that tree is traversed in order.
std::set and std::map are usually implemented using self-balancing binary search trees, usually red-black trees because they tend to be the fastest in practice. For detailed information about these data structures, you might want to consult a textbook such as Introduction to Algorithms by Cormen et al. or Algorithms by Sedgewick.
The C++ standard doesn't enforce any kind of sorting algorithm for std::set or std::map. So their implementations might differ among different platforms.
With that said, they are commonly implemented as a red-black tree, which is a self-balancing binary search tree. They don't sort their contents, they maintain the order of their contents as new items are inserted. Inserting a single item to them is usually O(logn).
I'm looking for a binary data structure (tree, list) that enables very fast searching. I'll only add/remove items at the beginning/end of the program, all at once. So it's gonna be fixed-sized, thus I don't really care about the insertion/deletion speed. Basically what I'm looking for is a structure that provides fast searching and doesn't use much memory.
Thanks
Look up the Unordered set in the Boost C++ library here. Unlike red-black trees, which are O(log n) for searching, the unordered set is based on a hash, and on average gives you O(1) search performance.
One container not to be overlooked is a sorted std::vector.
It definitely wins on the memory consumption, especially if you can reserve() the correct amount up front.
So the key can be a simple type and the value is a smallish structure of five pointers.
With only 50 elements, the collection starts getting small enough that Big-O theoretical performance may be overshadowed, or at least measurably affected, by the fixed overhead of the algorithm or structure.
For example, a plain array or a vector with linear search is often the fastest with fewer than ten elements because of its simple structure and tight memory layout.
I would wrap the container and run real data on it with timing. Start with STL's vector, go to the standard STL map, upgrade to unordered_map and maybe even try Google's dense or sparse_hash_map:
http://google-sparsehash.googlecode.com/svn/trunk/doc/performance.html
One efficient (albeit a teeny bit confusing) algorithm is the Red-Black tree.
Internally, the c++ standard library uses Red-Black trees to implement std::map - see this question
The std::map and hash map are good choices. They also have constructors to ease one time construction.
The hash map puts key data into a function that returns an array index. This may be slower than an std::map, but only profiling will tell.
My preference would be std::map, as it is usually implemented as a type of binary tree.
The fastest tends to be a trie. I implemented one that ran 3 to 15 times faster than std::unordered_map, though tries tend to use more RAM unless you store a large number of elements.
I apologize if this question is a bit broad, but I'm having a difficult time trying to understand how I would create a minimum cost spanning tree. This is in C++ if it matters at all.
From what I understand, you would use Kruskal's to select the minimum cost edges for building the spanning tree. My thinking is to read the edges into a minheap and that way you can remove from the top in order to get the edge with the minimum cost.
So far I've only been able to implement the minheap and sets for union-find, I am still unsure of the purpose of union-find and a sorting algorithm for the purpose of creating a spanning tree.
I would greatly appreciate any advice.
EDIT: I am not limited to union-find, a min-heap, Kruskal's, and a sorting algorithm, nor am I required to use any of them. These were just the items suggested by the instructor.
These two structures serve different purposes in the algorithm. Kruskal's algorithm works by adding the cheapest possible edge at each point that doesn't form a cycle. It can be shown using some not particularly complex math that this guarantees that the resulting spanning tree is minimal.

The intuition behind why this works is as follows. Suppose that Kruskal's algorithm is not optimal and that there is a cheaper spanning tree. Sort all of the edges in that tree by weight, then compare those edges in sorted order to the edges chosen by Kruskal's algorithm in sorted order. Since we assume for contradiction that Kruskal's algorithm isn't optimal, there must be some place in the sequences where there's a disagreement.

If in this disagreement Kruskal's algorithm has a lighter edge than the optimal solution, then we can make the optimal solution even better by adding that edge in, finding the cycle it creates, then deleting the heaviest edge in the cycle. That edge can't be the edge we just added, because otherwise that would have created a cycle in the MST produced by Kruskal's algorithm, and Kruskal's algorithm never adds an edge that creates a cycle. So this means that Kruskal's algorithm must have diverged from the optimal solution by not adding some light edge. But the only reason Kruskal's algorithm skips an edge is if it creates a cycle, and this means that there must be a cycle in the optimal MST, which is also a contradiction. This means that our assumption was wrong and that Kruskal's algorithm must be optimal.
Hopefully, this motivates why Kruskal's algorithm needs the heap and the union-find structure. We need the heap so that we can get back all the edges in sorted order. If we don't visit the edges in this order, then the above proof breaks down and all bets are off. Interestingly, you don't actually need a heap; you just need some way of visiting all the edges in sorted order. If you want, you can just dump all the edges into a giant array and then sort the array. This doesn't change the runtime of the algorithm from the binary heap case if you use a fast sort.
The union-find structure is a bit trickier. At each point in Kruskal's algorithm you need to be able to tell whether adding an edge would create a cycle in the graph. One way to do this is to store a structure that keeps track of what nodes are already connected to one another. That way, when adding an edge, you can check whether the endpoints are already connected. If they are, then the edge would form a cycle and should be ignored. The union-find structure is a way of maintaining this information efficiently. In particular, its two operations - union and find - correspond to the act of connecting together two distinct groups of nodes that were previously not connected, as would be the case if you added an edge that connected two trees contained in different parts of the spanning forest. The find step gives you a way to check if two nodes are already connected; if so you should skip the current edge.
Hope this helps!