STLish lower_bound function for Radix/Patricia Trie - c++

Lately I've been studying Patricia tries, and working with a really good C++ implementation which can be used as an STL Sorted Associative Container. Patricia tries differ from normal binary trees because leaf nodes have back pointers which point back to internal nodes. Nonetheless, it's possible to traverse a Patricia trie in alphabetical order by doing an in-order traversal, if you only visit internal nodes through leaf-node back pointers.
Which brings me to the question: is it possible to implement the STL lower_bound and upper_bound functions with a Patricia trie? The implementation I'm using does, in fact, implement these functions, but they don't work as expected.
For example:
typedef uxn::patl::trie_set<std::string> trie;
trie ts;
ts.insert("LR");
ts.insert("BLQ");
ts.insert("HCDA");
trie::iterator it = ts.lower_bound("GG");
std::cout << *it << std::endl;
This outputs BLQ, when I would expect it to output HCDA. (A std::set, for example, would certainly output HCDA here.)
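To make the expectation concrete, here is a minimal sketch of the std::set behaviour using the same three keys. The helper names first_not_less and check_expected_result are invented for this example and are not part of any library.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical helper: first stored key that is not less than `key`,
// or an empty string if every stored key is smaller.
std::string first_not_less(const std::set<std::string>& keys, const std::string& key) {
    std::set<std::string>::const_iterator it = keys.lower_bound(key);
    return it == keys.end() ? std::string() : *it;
}

// Same three keys as the trie snippet; "GG" sorts between "BLQ" and "HCDA".
void check_expected_result() {
    const std::set<std::string> keys = {"LR", "BLQ", "HCDA"};
    assert(first_not_less(keys, "GG") == "HCDA");
}
```

Calling check_expected_result() passes silently, which is the behaviour the trie's lower_bound would need to reproduce.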
I emailed the developer who made this library, but never got a response. Regardless, I feel I have a pretty good understanding of how Patricia tries work, and I can't figure out how something like lower_bound would even be possible. The problem is that lower_bound seems to rely on the ability to lexicographically compare the two strings. Since "GG" doesn't exist in the tree, we'd need to find the first element that is >= "GG". But Radix/Patricia tries don't use lexicographical comparison to move from node to node; rather, each node stores a bit index which is used to perform a bit comparison on the search key. The result of the bit comparison tells you whether to move left or right. This makes it easy to find a particular prefix in the tree. But if the prefix doesn't exist in the tree (as in the case of my search for "GG"), there doesn't seem to be any way, short of a lexicographical comparison, to get the lower_bound.
The fact that the C++ implementation I'm using doesn't seem to implement lower_bound properly confirms my suspicion that it may not be possible. Still, the fact that you can iterate over the tree in alphabetical order makes me think there might be a way to do it.
Does anyone have experience with this, or know if it is possible to implement a lower_bound functionality with a Patricia Trie?

Yes, it is possible. I have implemented a variant which does this, and D. J. Bernstein's page describes that as one of the fast operations.
http://cr.yp.to/critbit.html
In principle, you keep matching the prefix until you can't match any more, and then you go to the next value, and there's the node you're after.
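The idea can be sketched on a plain character trie, which is simpler than a Patricia or crit-bit trie but follows the same principle: match the key as far as possible, and where matching fails, step to the next larger branch and take its smallest key. This is only an illustration under that simplification, not the code from the library in the question or from the crit-bit page. SimpleTrie, TrieNode, and every member name are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// A plain character trie (NOT a Patricia/crit-bit trie); all names are hypothetical.
struct TrieNode {
    bool terminal = false;  // true if a stored key ends at this node
    std::map<unsigned char, std::unique_ptr<TrieNode>> children;  // ordered by byte
};

class SimpleTrie {
public:
    void insert(const std::string& key) {
        TrieNode* node = &root_;
        for (unsigned char c : key) {
            std::unique_ptr<TrieNode>& child = node->children[c];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->terminal = true;
    }

    // Writes the smallest stored key that is >= `key` to `out`.
    // Returns false (and leaves `out` empty) if no such key exists.
    bool lower_bound(const std::string& key, std::string& out) const {
        out.clear();
        return search(&root_, key, 0, out);
    }

private:
    // Appends the path to the smallest key in the subtree of `node` to `out`.
    static bool smallest(const TrieNode* node, std::string& out) {
        assert(node != nullptr);
        const std::size_t mark = out.size();
        while (!node->terminal) {
            if (node->children.empty()) {  // only possible for an empty trie
                out.resize(mark);
                return false;
            }
            auto first = node->children.begin();
            out.push_back(static_cast<char>(first->first));
            node = first->second.get();
        }
        return true;
    }

    static bool search(const TrieNode* node, const std::string& key,
                       std::size_t depth, std::string& out) {
        // Whole key matched: every key below this node has `key` as a prefix.
        if (depth == key.size()) return smallest(node, out);

        const unsigned char c = static_cast<unsigned char>(key[depth]);
        // 1. Keep matching the prefix while we can.
        auto same = node->children.find(c);
        if (same != node->children.end()) {
            out.push_back(static_cast<char>(c));
            if (search(same->second.get(), key, depth + 1, out)) return true;
            out.pop_back();
        }
        // 2. Otherwise move to the next larger branch and take its smallest key.
        auto next = node->children.upper_bound(c);
        if (next == node->children.end()) return false;  // caller tries its own next branch
        out.push_back(static_cast<char>(next->first));
        if (smallest(next->second.get(), out)) return true;
        out.pop_back();
        return false;
    }

    TrieNode root_;
};
```

With the keys "LR", "BLQ", and "HCDA" inserted, looking up "GG" fails to match at the root, steps to the 'H' branch, and yields "HCDA". The sketch assumes keys are only ever inserted, never erased, so every leaf is a stored key.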

Related

C++ Difference between std::lower_bound and std::set::lower_bound?

Recently, while working on a programming problem in C++, I came across something interesting. My algorithm used a really large set and called std::lower_bound on it a great many times. However, after submitting my solution, contrary to the math I did on paper to prove that my code was fast enough, it ended up being far too slow. The code looked something like this:
using namespace std;
set<int> s;
int x;
//code code code
set<int>::iterator it = lower_bound(s.begin(),s.end(),x);
However, after getting a hint from a buddy to use set::lower_bound, the algorithm in question worked way faster than before, and it matched my math. The binary search after the change:
set<int>::iterator it = s.lower_bound(x);
My question is what's the difference between these two? Why does one work much, much faster than the other? Isn't lower_bound supposed to be a binary search function that has the complexity O(log2(n))? In my code it ended up being way slower than that.
std::set is typically implemented as a self-balancing tree with some list-like structure tied into it. Knowing this structure, std::set::lower_bound will traverse the tree knowing the properties of the tree structure. Each step in this just means following a left or right child branch.
std::lower_bound needs to run something akin to a binary search over the data. However, since std::set::iterator is bidirectional (not random access), this is much slower: a lot of increments need to be done between the checked elements, so the work done between comparisons is much more intense. In this case the algorithm will check the element halfway between A and B, then adjust one of A or B, find the element halfway between them, and repeat.
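One way to see those extra increments is to wrap the set iterator in a small counting adaptor and hand it to std::lower_bound. The sketch below is only an illustration: CountingIterator and increments_for_generic_lower_bound are hypothetical names, and the exact count depends on the standard library, although typical implementations first walk the whole range to measure its length.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>

// Hypothetical forward-iterator wrapper that counts every increment.
struct CountingIterator {
    using Base = std::set<int>::const_iterator;
    using iterator_category = std::forward_iterator_tag;
    using value_type = int;
    using difference_type = std::ptrdiff_t;
    using pointer = const int*;
    using reference = const int&;

    Base it{};
    std::size_t* steps = nullptr;

    CountingIterator() = default;
    CountingIterator(Base i, std::size_t* s) : it(i), steps(s) {}

    reference operator*() const { return *it; }
    CountingIterator& operator++() { ++it; ++*steps; return *this; }
    CountingIterator operator++(int) { CountingIterator old = *this; ++*this; return old; }
    friend bool operator==(const CountingIterator& a, const CountingIterator& b) { return a.it == b.it; }
    friend bool operator!=(const CountingIterator& a, const CountingIterator& b) { return !(a == b); }
};

// Number of iterator increments std::lower_bound performs on a set of n ints.
std::size_t increments_for_generic_lower_bound(int n, int key) {
    std::set<int> s;
    for (int i = 0; i < n; ++i) s.insert(s.end(), i);

    std::size_t steps = 0;
    CountingIterator first(s.cbegin(), &steps);
    CountingIterator last(s.cend(), &steps);
    CountingIterator found = std::lower_bound(first, last, key);

    // Both approaches agree on the answer...
    assert(found.it == s.lower_bound(key));
    // ...but the generic algorithm had to step through nodes to get there.
    return steps;
}
```

For a set of 1000 elements, the returned count should be at least 1000 on the common implementations, whereas s.lower_bound(key) only follows one root-to-leaf path.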
After reading the API documentation of std::lower_bound:
On non-random-access iterators, the iterator advances produce themselves an additional linear complexity in N on average.
And I think the STL set uses non-random-access iterators, so std::lower_bound is not doing an O(log N) binary search when used on an STL set.
std::lower_bound is a generic binary search algorithm, meant to work with most STL containers. set::lower_bound is designed to work with std::set, so it takes advantage of the unique properties of std::set.
As std::set is often implemented as a red-black tree, one can imagine std::lower_bound iterating across all nodes, while set::lower_bound just traverses down the tree.
std::lower_bound always guarantees O(log n) comparisons, but it only guarantees O(log n) time if passed a RandomAccessIterator, not just a ForwardIterator, which does not provide a constant-time std::advance.
The std::set::lower_bound implementation of the same algorithm is able to use internal details of the structure to avoid this problem.

Why is the complexity of the C++ STL map container O(log(n))?

For C++ STL containers such as vector and list, the complexity of finding elements and inserting or removing them is self-explanatory. However, for the map container, even though I know from my reading that the access and insertion complexity/performance is O(log(n)), I can't work out why. I clearly don't understand maps as much as I need to, so some enlightenment on this topic would be very much appreciated.
The elements of a map or set are contained in a tree structure; every time you examine a node of the tree, you determine if the element you're trying to find/insert is less than or greater than the node. The number of times you need to do this (for a properly balanced tree) is log2(N) because each comparison throws out half of the possibilities.
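The halving argument can be checked empirically by giving std::map a comparator that counts its calls. This is a rough sketch: CountingLess and comparisons_for_find are hypothetical names, and the exact count depends on the library's tree implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical comparator that counts how many key comparisons the map makes.
struct CountingLess {
    std::size_t* count;
    bool operator()(int a, int b) const { ++*count; return a < b; }
};

// Number of comparisons a single find() needs in a map holding keys 0..n-1.
std::size_t comparisons_for_find(int n, int key) {
    assert(n > 0);
    std::size_t count = 0;
    std::map<int, int, CountingLess> m(CountingLess{&count});
    for (int i = 0; i < n; ++i) m.emplace_hint(m.end(), i, i);

    count = 0;  // ignore the comparisons made while building the map
    m.find(key);
    return count;
}
```

With 65,536 keys, log2(N) is 16, so a lookup should need a few dozen comparisons at most rather than tens of thousands.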
As slavik262 points out, maps are usually implemented with red-black trees, which are self-balancing.
For example, check the complexity of a red-black tree on Wikipedia.
I don't know any implementation of a map with a binary tree; if Mark Ransom knows one, I'd be pleased to know which one.

BST implementations in STL

C++ STL "set" and "map" support logarithmic worst-case time for insert, erase, and find operations. Consequently, the underlying implementation is a binary search tree. An important issue in implementing set and map is providing support for the iterator class. Of course, internally the iterator maintains a pointer to the "current" node. The hard part is efficiently advancing to the next node.
My questions are:
If "set" and "map" are implemented as binary search trees, how do we advance to the next node using the iterator class, i.e., what do ++ and -- return: is it the left subtree or the right subtree?
In general, how do most STL implementations implement the BST, and how is ++ or -- of the iterator implemented?
Thanks!
There is nothing in the C++ specification that requires the use of a binary tree. Because of this, ++ and -- are defined in terms of the sequence of elements. map and set store an ordered sequence of elements. Therefore, ++ will go to the next element, and -- will go to the previous element. The next and previous elements are defined by the order of the elements.
Note that while the C++ specification doesn't require the use of a binary tree, the particular performance requirements pretty much force some use of a binary tree.
Generally, they use some form of red-black self-balancing binary tree.
It mostly depends on the particular implementation.
One way (my description only: not necessarily the one you have) can be the following:
a node has 3 pointers: a left, a right and an up one.
what ++ does is:
if there is a "right", go right, then go as deep left as possible.
Otherwise, go up for as long as we arrive from the right, then go up once more.
what -- does is exactly the inverse.
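The steps above can be sketched as follows. The Node layout and the function names are assumptions made for illustration. Real standard library implementations typically add a header or sentinel node so that end() is a valid position, which this sketch replaces with nullptr.

```cpp
#include <cassert>

// Hypothetical node layout: a left, a right, and an up (parent) pointer.
struct Node {
    int value = 0;
    Node* left = nullptr;
    Node* right = nullptr;
    Node* up = nullptr;
};

// Small helpers for wiring nodes together in examples.
void attach_left(Node& parent, Node& child) { parent.left = &child; child.up = &parent; }
void attach_right(Node& parent, Node& child) { parent.right = &child; child.up = &parent; }

// What ++ does: the in-order successor, or nullptr after the last node.
Node* successor(Node* n) {
    assert(n != nullptr);
    if (n->right) {
        n = n->right;                 // go right...
        while (n->left) n = n->left;  // ...then as deep left as possible
        return n;
    }
    // Go up for as long as we arrive from the right, then up once more.
    while (n->up && n == n->up->right) n = n->up;
    return n->up;
}

// What -- does: the mirror image of successor().
Node* predecessor(Node* n) {
    assert(n != nullptr);
    if (n->left) {
        n = n->left;
        while (n->right) n = n->right;
        return n;
    }
    while (n->up && n == n->up->left) n = n->up;
    return n->up;
}
```

Starting from the leftmost node and calling successor() repeatedly visits the values in sorted order, which is exactly the in-order traversal.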
Besides the worst-case operational complexities, both map & set (and all Ordered/Sorted Associative Containers in the C++ Standard Library) have a very important property: their elements are always sorted in order by key (according to a comparator).
A self-balancing binary search tree satisfies the operational complexities and the only traversal that will give you the elements in sorted order is, you've guessed it, the in-order one.
Nicol Bolas' answer gives you more details about the usual implementation. If you're curious to actually see such an implementation, you can take a gander at the RDE STL implementation.
It's a lot easier to read than what you'll find in your average C++ Standard Library implementation. As a side note, both the set & map implementations in RDESTL have only forward iterators (not bidirectional ones, as the standard requires), but it should be pretty easy to figure out the -- part.

Should I randomly shuffle before inserting into STL set?

I need to insert 10-million strings into a C++ STL set. The strings are sorted. Will I have a pathological problem if I insert the strings in sorted order? Should I randomize first? Or will the G++ STL implementation automatically rebalance for me?
The set implementation typically uses a red-black tree, which will rebalance for you. However, insertion may be faster (or it may not) if you randomise the data before inserting - the only way to be sure is to do a test with your set implementation and specific data. Retrieval times will be the same, either way.
The implementation will re-balance automatically. Given that you know the input is sorted, however, you can give it a bit of assistance: You can supply a "hint" when you do an insertion, and in this case supplying the iterator to the previously inserted item will be exactly the right hint to supply for the next insertion. In this case, each insertion will have amortized constant complexity instead of the logarithmic complexity you'd otherwise expect.
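A minimal sketch of hinted insertion for input that is already in ascending order follows. Since C++11 the hint names the position just before which the new element should go, so for ascending input end() plays the role of "right after the previously inserted item". The function name build_from_sorted is made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Builds a set from strings that are already sorted in ascending order.
std::set<std::string> build_from_sorted(const std::vector<std::string>& sorted) {
    assert(std::is_sorted(sorted.begin(), sorted.end()));
    std::set<std::string> result;
    for (const std::string& s : sorted) {
        // Hint: the new key belongs just before end(), i.e. after everything so far.
        result.insert(result.end(), s);
    }
    return result;
}
```

Constructing the set directly from the iterator range, as in std::set<std::string>(v.begin(), v.end()), is also specified to take linear time when the range is already sorted.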
The only question I have: do you really need a set ?
If the data is already sorted and you don't need to insert / delete elements after the creation, a deque would be better:
you'll have the same big-O complexity using a binary search for retrieval
you'll get less memory overhead... and better cache locality
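As a rough sketch of that alternative, a sorted deque can be searched with the standard binary search algorithms. The function name contains_sorted is hypothetical, and the sortedness check is only there to document the precondition; it costs linear time and disappears when NDEBUG is defined.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <string>

// Membership test on a deque that is already sorted in ascending order.
bool contains_sorted(const std::deque<std::string>& sorted, const std::string& key) {
    assert(std::is_sorted(sorted.begin(), sorted.end()));  // debug-only precondition
    return std::binary_search(sorted.begin(), sorted.end(), key);
}
```

Because std::deque provides random-access iterators, the search takes O(log n) steps as well as O(log n) comparisons.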
On binary_search: I suspect you need more than a ForwardIterator for a binary search, guess this site is off again :(
http://en.wikipedia.org/wiki/Standard_Template_Library
set: "Implemented using a self-balancing binary search tree."
g++'s libstdc++ uses red black trees for sets and maps.
http://en.wikipedia.org/wiki/Red-black_tree
This is a self balancing tree, and insertions are always O(log n). The C++ standard also requires that all implementations have this characteristic, so in practice, they are almost always red black trees, or something very similar.
So don't worry about the order you put the elements in.
A very cheap and simple solution is to insert from both ends of your collections of strings. That is to say, first add "A", then "ZZZZZ", then "AA", then "ZZZZY", etcetera until you meet in the middle. It doesn't require the hefty cost of shuffling, yet it is likely to sidestep pathological cases.
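The alternating order described here could look like the sketch below. The function name insert_from_both_ends is invented, and the input is assumed to be sorted already. With a self-balancing tree the final set is the same whatever the insertion order, so this only changes the order of the work.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Inserts sorted input alternately from the front and the back:
// first, last, second, second to last, and so on until the ends meet.
std::set<std::string> insert_from_both_ends(const std::vector<std::string>& sorted) {
    std::set<std::string> result;
    std::size_t lo = 0;
    std::size_t hi = sorted.size();
    while (lo < hi) {
        result.insert(sorted[lo++]);
        if (lo < hi) result.insert(sorted[--hi]);
    }
    assert(result.size() <= sorted.size());  // duplicates collapse, nothing is invented
    return result;
}
```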
Maybe 'unordered_set' can be an alternative.

What's a good and stable C++ tree implementation?

I'm wondering if anyone can recommend a good C++ tree implementation, hopefully one that is STL compatible if at all possible.
For the record, I've written tree algorithms many times before, and I know it can be fun, but I want to be pragmatic and lazy if at all possible. So an actual link to a working solution is the goal here.
Note: I'm looking for a generic tree, not a balanced tree or a map/set; the structure itself and the connectivity of the tree are important in this case, not only the data within.
So each branch needs to be able to hold arbitrary amounts of data, and each branch should be separately iterable.
I don't know about your requirements, but wouldn't you be better off with a graph (implementations for example in Boost Graph) if you're interested mostly in the structure and not so much in tree-specific benefits like speed through balancing? You can 'emulate' a tree through a graph, and maybe it'll be (conceptually) closer to what you're looking for.
Take a look at this.
The tree.hh library for C++ provides an STL-like container class for n-ary trees, templated over the data stored at the nodes. Various types of iterators are provided (post-order, pre-order, and others). Where possible the access methods are compatible with the STL or alternative algorithms are available.
HTH
I am going to suggest using std::map instead of a tree.
The complexity characteristics of a tree are:
Insert: O(ln(n))
Removal: O(ln(n))
Find: O(ln(n))
These are the same characteristics the std::map guarantees.
As a result, most implementations of std::map use a tree (a red-black tree) under the covers (though technically this is not required).
If you don't have (key, value) pairs, but simply keys, use std::set. That uses the same Red-Black tree as std::map.
Ok folks, I found another tree library; stlplus.ntree. But haven't tried it out yet.
Let's suppose the question is about balanced binary trees (in some form, mostly red-black trees), even if that is not the case.
Balanced binary trees, like vectors, allow you to manage an ordering of elements without any need for a key (for example, by inserting elements anywhere, as in a vector), but:
With optimal O(log(n)) or better complexity for every modification of a single element (add/remove at the beginning, at the end, and before or after any iterator).
With persistence of iterators through any modification except direct destruction of the element pointed to by the iterator.
Optionally, one may support access by index as in a vector (at a cost of one size_t per element), with O(log(n)) complexity. If used, iterators will be random access.
Optionally, order can be enforced by some comparison function, but persistence of iterators allows the use of non-repeatable comparison schemes (e.g., arbitrary lane changes by cars during a traffic jam).
In practice, a balanced binary tree can offer the interface of vector, list, doubly linked list, map, multimap, deque, queue, priority_queue... while attaining the theoretically optimal O(log(n)) complexity for all single-element operations.
<sarcastic> this is probably why c++ stl does not propose it </sarcastic>
Individuals tend not to implement a general balanced tree by themselves, because it is difficult to get the balancing right, especially during element extraction.
There is no widely available implementation of a balanced binary tree because the known implementation of the state-of-the-art red-black tree (at this time the best type of balanced tree, due to the fixed number of costly tree reorganizations during removal), slavishly copied by every implementer from the initial code of the structure's inventor, does not allow iterator persistence. That is probably the reason for the absence of a fully functional tree template.