I've been using std::vector, but it has become unwieldy as the data it iterates through has grown, and I would like to be able to filter out arbitrary elements when they become redundant. I have this behaviour elsewhere with std::list, but I can't get binary_search to play nicely with that.
Is there some code I could use to get binary_search working again, or must I resort to still more elaborate containers and syntax?
if (binary_search(iter + 1, myLines.end(), line)) {
    firstFound.assign(line);
    if (numFinds++) break;
}
std::set's lookup is O(log N), exactly as for binary_search, and deletion is O(1) provided you already have an iterator, or O(log N) if you don't (lookup + deletion). The set will store its elements sorted, but that should be fine for you, since binary_search also works only on sorted ranges.
It sounds like what you're looking for is a binary search tree (BST).
It trivially allows binary searching (it's in the name after all), and deletion is relatively straightforward.
Many kinds of binary search trees exist. Some common ones:
simple BST: large spread between worst-case and best-case search performance, depending on the specific usage (inserting and deleting) of the BST. The higher the tree, the worse the performance.
self-balancing BST: on every insert/delete, nodes are shifted around if needed to keep the tree height minimal, and hence the search performance optimal. This comes at the cost of added overhead (which can become prohibitively high) when inserting and deleting.
red-black tree: a kind of self-balancing BST that doesn't always strive for the optimal height, but is still good enough to keep lookups O(log n). It performs very consistently across a wide range of use cases (i.e. it has good general-purpose performance), which is why it was picked for the implementation of std::set (see Armen Tsirunyan's answer).
splay tree: a BST that keeps recently accessed items at the top of the tree, where they can be accessed the fastest. It is not self-balancing, so the height is not kept minimal. This kind of tree is very useful when recently accessed data is likely to be accessed again soon.
treap: a BST in which random priorities are assigned to items, in order to probabilistically keep the height of the tree close to minimal. It doesn't guarantee optimal search performance, but in practice it usually comes close, without the added overhead of self-balancing or similar reorganizing algorithms.
Which of these is best for you will depend on your specific use case.
Related
In a C++ std::set (often implemented using red-black binary search trees), the elements are automatically sorted, and key lookups and deletions at arbitrary positions take O(log n) time.
In a sorted C++ std::vector, lookups are also fast (actually probably a bit faster than std::set), but insertions are slow (since maintaining sortedness takes time O(n)).
However, sorted C++ std::vectors have another property: they can find the number of elements in a range quickly (in time O(log n)).
I.e., a sorted C++ std::vector can quickly answer: how many elements lie between given x and y?
std::set can quickly find iterators to the start and end of the range, but gives no clue how many elements are within.
So, is there a data structure that allows all the speed of a C++ std::set (fast lookups and deletions), but also allows fast computation of the number of elements in a given range?
(By fast, I mean time O(log n), or maybe a polynomial in log n, or maybe even sqrt(n). Just as long as it's faster than O(n), since O(n) is almost the same as the trivial O(n log n) to search through everything).
(If not possible, even an estimate of the number to within a fixed factor would be useful. For integers a trivial upper bound is y-x+1, but how to get a lower bound? For arbitrary objects with an ordering there's no such estimate).
EDIT: I have just seen the related question, which essentially asks whether one can compute the number of preceding elements. (Sorry, my fault for not seeing it before.) This is clearly equivalent to this question: to get the number in a range, just compute the counts for the start/end elements and subtract, etc.
However, that question also allows the data to be computed once and then be fixed, unlike here, so that question (and the sorted vector answer) isn't actually a duplicate of this one.
The data structure you're looking for is an Order Statistic Tree.
It's typically implemented as a binary search tree in which each node additionally stores the size of its subtree.
Unfortunately, I'm pretty sure the STL doesn't provide one.
All data structures have their pros and cons, which is why the standard library offers a whole range of containers.
The rule is that there is usually a trade-off between the speed of modification and the speed of data extraction. Here you would like cheap access to the number of elements in a range. One possibility in a tree-based structure is to cache in each node the number of elements of its subtree. That adds an average of log(N) operations (the height of the tree) to each insertion or deletion, but greatly speeds up computing the number of elements in a range. Unfortunately, few classes of the C++ standard library are designed for derivation (and AFAIK std::set is not), so you will have to implement your tree from scratch.
Maybe you are looking for a C++ alternative to Java's LinkedHashSet: https://docs.oracle.com/javase/7/docs/api/java/util/LinkedHashSet.html.
Background
I'm building a performance-minded application, and I came across a place where I have to use std::set. It works like a charm. But then I started reading the documentation (which you can find here), and the first thing I noticed was that
Search, removal, and insertion operations have logarithmic complexity. Sets are usually implemented as red-black trees.
The logarithmic search, removal and insertion make perfect sense to me, since they use some kind of tree structure (the documentation does not guarantee that it is a red-black tree). But the problem is: why should they?
I made my own alternative to std::set, which uses a std::vector to store all the entries. Then I ran some basic benchmarks; here are the results:
Iterations: 100000
// Insertion
VectorSet : 211464us
std::set : 1272864us
// Find/ Lookup
VectorSet : 404264us
std::set : 551464us
// Removal
VectorSet : 254321964us
std::set : 834664us
// Traversal (iterating through all the elements; 100000 elements, 100000 iterations)
VectorSet : 2464us
std::set : 4374174264us
According to these results, my implementation (VectorSet) outperformed std::set in insertions and lookups, and in traversal it was over 1,800,000 times faster. But std::set outperformed VectorSet in removal by a significant margin (which is understandable, since we are dealing with a vector).
I can see why removal is slower in VectorSet but fast in std::set, and why std::set takes so long to iterate through the entries. Some things which would affect std::set's performance (correct me if I'm wrong):
Cache misses.
Pointer dereferences.
Better locality.
For the vector being slower in removal,
Finding the element.
Removal of the element.
Possible resize.
Question
As far as I can see, using a std::vector to store entries rather than a tree structure performs better in 3 out of 4 cases. And even in the one place where std::set performed better (removal), the margin is small compared to its traversal cost. In my opinion, developers use the other operations (lookups, insertions and iteration) more than removal. Even though these numbers are in the range of microseconds, the slightest improvement matters.
So my question is, why does std::set use a tree structure when they can use something like a vector to improve their efficiency?
Note: The container will be filled with an average of 1000 elements, will be iterated repeatedly throughout the application lifetime, and will directly affect the application runtime.
The standard set has some guarantees that you can't provide with your implementation:
inserting/erasing doesn't invalidate other iterators/references/pointers.
inserting/erasing elements has (at most) logarithmic complexity, as opposed to linear in your implementation.
If these don't matter to you, you're welcome to use a sorted vector and binary search. The standard provides std::sort, std::vector and std::binary_search, so you are good to go. The thing to notice is that each container has a specific use case and there is no "one size fits all" container.
The standard also provides unordered_set, which is a hash table. It is often criticized for being slow and causing cache misses. Well, if that degrades your performance in a way you identified as a bottleneck, go ahead and use some other hash-set implementation from other libraries. If you believe that you can do better, go ahead. Many projects build their own containers that are more specialized to that project. They could be faster, use less memory, or give different guarantees about iterator invalidation and/or the complexity of operations. They all solve different problems.
Another point is that profiling and benchmarking are hard. Make sure you get them right. Performance comparison is usually done at scale (with a varying number of inputs). Picking a constant and relatively small size won't tell the whole story.
There are a couple questions on this site about accessing elements in a std::set by index, but the answers I saw were old and unenlightening.
Ordered sets can be (and often are) implemented as binary search trees. In a binary search tree, by storing the number of nodes in the tree rooted at each node, we can access the kth element in sorted order in O(log n) time without increasing the algorithmic complexity of other operations (please correct me if this is the error in my thinking).
Nevertheless, if I want the kth element in sorted order from a std::set, I must walk from begin() all the way to k, taking O(k) time instead. In general, this may equate to O(n) time.
So, my questions are:
Is it correct that we could maintain a height-balanced binary search tree in which it's possible to find the kth element in O(log n) time without damaging the time complexity of other operations?
If so, is there a function or alternate data structure in the C++ standard library which I could utilize to this effect?
If yes to the former but no to the latter, has it been or is it being considered? Is it not implemented because of some technical barrier or simply because the implementation cost is deemed too expensive for the potential utility of this feature?
It is possible to augment a (balanced) search tree with extra information that can be used to implement searching by index in logarithmic time. Such an augmented search tree is called an order statistic tree.
Augmenting the tree doesn't affect the worst-case asymptotic complexity of the main operations (insert, lookup, erase): their worst case is still logarithmic. I don't know whether it prevents the amortized constant complexity that is required for the erase and hinted-insert operations of the ordered associative containers.
Asymptotic complexity is not the only criterion for a feature, however. Augmenting the tree increases the constant factor of the logarithmic operations, making all (or most) other operations slower. It also increases the spatial overhead of the data structure. So, just because such a data structure is possible doesn't necessarily mean it would be a good idea to use it to implement the generic associative containers provided by the standard library.
No. There is no container based on a search tree with logarithmic index lookup in the standard library.
I found a proposal, N3700, based on the Boost tree component library, which proposes adding generic tree structures. It includes the class rank_tree, which appears to be an augmented search tree providing the operation you are looking for.
Which is better (space/time) to find certain strings:
To use a vector of strings (alphabetically ordered) and a binary search.
To use a BST of strings (also ordered alphabetically).
Or are both similar?
Both have advantages, and it is going to depend on what your usage scenario is.
A sorted vector will be more efficient if your usage scenario can be broken into phases: load everything, then sort it once, then look things up without changing anything.
A tree structure will work better for you if your scenario involves inserting, searching, and removing things at different times, and you can't break it into phases. (In this case, a vector can add overhead, since inserting in the middle is expensive.)
There's a really good discussion of this in Effective STL, and there's a sorted vector implementation in Loki.
Assuming the binary search tree is balanced (which it will be if you are using std::set), then both of these are O(n) space and O(log n) time. So theoretically they are comparable.
In practice, the vector will take up somewhat less space and thus might be slightly faster thanks to locality effects. But probably not enough to matter, and since std::set supports O(log n) insertion, O(log n) deletion, and has a straightforward interface, I would recommend std::set.
That said... If all you care about is queries and you do not need to enumerate the strings in order, std::tr1::unordered_set (or boost::unordered_set or C++0x std::unordered_set) will be much faster than either, especially if the set is large. Hash tables rock.
I'm wondering if anyone can recommend a good C++ tree implementation, hopefully one that is STL-compatible if at all possible.
For the record, I've written tree algorithms many times before, and I know it can be fun, but I want to be pragmatic and lazy if at all possible. So an actual link to a working solution is the goal here.
Note: I'm looking for a generic tree, not a balanced tree or a map/set, the structure itself and the connectivity of the tree is important in this case, not only the data within.
So each branch needs to be able to hold arbitrary amounts of data, and each branch should be separately iterable.
I don't know your requirements, but wouldn't you be better off with a graph (implemented, for example, in the Boost Graph Library) if you're interested mostly in the structure and not so much in tree-specific benefits like speed through balancing? You can 'emulate' a tree through a graph, and maybe it'll be (conceptually) closer to what you're looking for.
Take a look at this.
The tree.hh library for C++ provides an STL-like container class for n-ary trees, templated over the data stored at the nodes. Various types of iterators are provided (post-order, pre-order, and others). Where possible the access methods are compatible with the STL or alternative algorithms are available.
HTH
I am going to suggest using std::map instead of a tree.
The complexity characteristics of a tree are:
Insert: O(log n)
Removal: O(log n)
Find: O(log n)
These are the same characteristics the std::map guarantees.
Thus, as a result, most implementations of std::map use a tree (a red-black tree) under the covers (though technically this is not required).
If you don't have (key, value) pairs, but simply keys, use std::set. That uses the same Red-Black tree as std::map.
Ok folks, I found another tree library; stlplus.ntree. But haven't tried it out yet.
Let's suppose the question is about balanced binary trees (in some form, mostly red-black trees), even if that's not quite what was asked.
Balanced binary trees, like vectors, make it possible to maintain an ordering of elements without any need for a key (e.g. by inserting elements anywhere, as in a vector), but:
with optimal O(log n) or better complexity for every single-element modification (add/remove at the beginning, at the end, and before or after any iterator);
with persistence of iterators through any modification, except the direct destruction of the element pointed to by an iterator.
Optionally, one may support access by index as in a vector (at a cost of one size_t per element), with O(log n) complexity. If supported, iterators become random access.
Optionally, the order can be enforced by some comparison function, but the persistence of iterators allows the use of non-repeatable comparison schemes (e.g. arbitrary car-lane changes during a traffic jam).
In practice, a balanced binary tree can provide the interfaces of vector, list, doubly linked list, map, multimap, deque, queue, priority_queue... while attaining the theoretically optimal O(log n) complexity for all single-element operations.
<sarcastic> this is probably why c++ stl does not propose it </sarcastic>
Individuals rarely implement a general balanced tree themselves, owing to the difficulty of getting the rebalancing right, especially during element removal.
There is no widely available implementation of such a general balanced binary tree because the state-of-the-art red-black tree implementation (at present the best type of balanced tree, due to its fixed number of costly tree reorganizations during removal), slavishly copied by every implementer from the structure inventor's original code, does not allow iterator persistence. That is probably the reason for the absence of a fully functional tree template.