What is the best data structure for a generic step function? - c++

I need to implement a step function (i.e. piecewise constant). There are a few requirements that it will need to have.
It will have to be evaluated repeatedly at random locations, then evaluated sequentially over an interval.
It will have to be easily updated, i.e. adding an increase/decrease over an interval.
So my question is what is the best data structure for this sort of thing? I was thinking that due to the random access nature a Binary tree is the most likely, but I'm hoping I'm not missing something. Also is there a good implementation already out there for C++.

Yes, since the intervals cannot overlap, a binary tree is the right data structure. If you did not need fast updates, it could be as simple as a sorted sequence of the interval endpoints; querying the function value at a given location could be done with one of the STL binary search functions, std::lower_bound or std::upper_bound, and evaluating over a range would take two searches. However, for fast updates a self-balancing binary tree is naturally better.
The standard library does define a container with tree-like access: std::map (although it is not required to be implemented using a binary tree, it usually is).
For completeness, it should be mentioned that the Boost ICL library has an interval_map, a std::map-like container that uses intervals as keys and has aggregate-on-overlap insertion semantics.
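A minimal sketch of the std::map approach, assuming double-valued breakpoints and a function that is 0 everywhere until updated. The class name StepFunction and its members are hypothetical, chosen only for illustration. Each map entry (x, v) means "the function equals v from x up to the next key".

```cpp
#include <cassert>
#include <iterator>
#include <limits>
#include <map>

// Hypothetical helper class: a piecewise-constant function on the real line.
class StepFunction {
public:
    StepFunction() {
        // Sentinel breakpoint so every finite x has a predecessor entry.
        steps_[-std::numeric_limits<double>::infinity()] = 0.0;
    }

    // Value at x: find the last breakpoint <= x. O(log n).
    double eval(double x) const {
        auto it = steps_.upper_bound(x);  // first breakpoint strictly greater than x
        return std::prev(it)->second;     // safe: the sentinel is never greater than x
    }

    // Add delta on the half-open interval [lo, hi).
    // O(log n + k), where k is the number of breakpoints inside the interval.
    void add(double lo, double hi, double delta) {
        if (!(lo < hi)) return;           // also rejects NaN bounds
        auto first = split(lo);
        auto last = split(hi);            // map insertion keeps 'first' valid
        for (auto it = first; it != last; ++it) it->second += delta;
    }

    // Read-only access for sequential evaluation over an interval.
    const std::map<double, double>& steps() const { return steps_; }

private:
    // Make sure a breakpoint exists exactly at x and return an iterator to it.
    std::map<double, double>::iterator split(double x) {
        auto it = steps_.lower_bound(x);  // first breakpoint >= x
        if (it != steps_.end() && it->first == x) return it;
        double inherited = std::prev(it)->second;  // value of the piece being cut
        return steps_.emplace_hint(it, x, inherited);
    }

    std::map<double, double> steps_;
};
```

Random evaluation is one upper_bound call; sequential evaluation over an interval can walk the map from the iterator found for the left end. The sketch never merges adjacent pieces with equal values, so the map only grows.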


Is there a data structure like a C++ std set which also quickly returns the number of elements in a range?

In a C++ std::set (often implemented using red-black binary search trees), the elements are automatically sorted, and key lookups and deletions in arbitrary positions take time O(log n) [amortised, i.e. ignoring reallocations when the size gets too big for the current capacity].
In a sorted C++ std::vector, lookups are also fast (actually probably a bit faster than std::set), but insertions are slow (since maintaining sortedness takes time O(n)).
However, sorted C++ std::vectors have another property: they can find the number of elements in a range quickly (in time O(log n)).
i.e., a sorted C++ std::vector can quickly answer: how many elements lie between given x,y?
std::set can quickly find iterators to the start and end of the range, but gives no clue how many elements are within.
So, is there a data structure that allows all the speed of a C++ std::set (fast lookups and deletions), but also allows fast computation of the number of elements in a given range?
(By fast, I mean time O(log n), or maybe a polynomial in log n, or maybe even sqrt(n). Just as long as it's faster than O(n), since O(n) is almost the same as the trivial O(n log n) to search through everything).
(If not possible, even an estimate of the number to within a fixed factor would be useful. For integers a trivial upper bound is y-x+1, but how to get a lower bound? For arbitrary objects with an ordering there's no such estimate).
EDIT: I have just seen the related question, which essentially asks whether one can compute the number of preceding elements. (Sorry, my fault for not seeing it before). This is clearly trivially equivalent to this question (to get the number in a range, just compute the start/end elements and subtract, etc.)
However, that question also allows the data to be computed once and then be fixed, unlike here, so that question (and the sorted vector answer) isn't actually a duplicate of this one.
The data structure you're looking for is an Order Statistic Tree.
It's typically implemented as a binary search tree in which each node additionally stores the size of its subtree.
Unfortunately, I'm pretty sure the STL doesn't provide one.
All data structures have their pros and cons, which is why the standard library offers a range of containers.
As a rule, there is a trade-off between the speed of modifications and the speed of data extraction. Here you would like to easily access the number of elements in a range. One possibility in a tree-based structure is to cache in each node the number of elements in its subtree. That adds on the order of log(N) operations (the height of the tree) to each insertion or deletion, but greatly speeds up computing the number of elements in a range. Unfortunately, few classes from the C++ standard library are designed for derivation (and AFAIK std::set is not), so you will have to implement your tree from scratch.
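The subtree-size idea described in these answers can be sketched as follows. This is an illustration, not a library API: the class name CountingSet is hypothetical, keys are assumed to be int, and a treap (random priorities) stands in for whichever balancing scheme a real implementation would use. Ranges are half-open, [lo, hi).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <random>
#include <utility>

// Hypothetical class: a set of ints whose nodes cache their subtree size.
class CountingSet {
    struct Node {
        int key;
        unsigned prio;     // random heap priority: balanced in expectation
        std::size_t size;  // number of nodes in this subtree
        std::unique_ptr<Node> left, right;
        Node(int k, unsigned p) : key(k), prio(p), size(1) {}
    };
    using Ptr = std::unique_ptr<Node>;

    Ptr root_;
    std::mt19937 rng_{12345};  // fixed seed, an arbitrary choice

    static std::size_t sz(const Ptr& t) { return t ? t->size : 0; }
    static void update(Node& n) { n.size = 1 + sz(n.left) + sz(n.right); }

    // Split t into l (keys < key) and r (keys >= key).
    static void split(Ptr t, int key, Ptr& l, Ptr& r) {
        if (!t) { l.reset(); r.reset(); return; }
        if (t->key < key) {
            Ptr sub = std::move(t->right);
            split(std::move(sub), key, t->right, r);
            update(*t);
            l = std::move(t);
        } else {
            Ptr sub = std::move(t->left);
            split(std::move(sub), key, l, t->left);
            update(*t);
            r = std::move(t);
        }
    }

    // Merge two treaps where every key in a is smaller than every key in b.
    static Ptr merge(Ptr a, Ptr b) {
        if (!a) return b;
        if (!b) return a;
        if (a->prio > b->prio) {
            a->right = merge(std::move(a->right), std::move(b));
            update(*a);
            return a;
        }
        b->left = merge(std::move(a), std::move(b->left));
        update(*b);
        return b;
    }

    static bool eraseFrom(Ptr& t, int key) {
        if (!t) return false;
        if (t->key == key) {
            t = merge(std::move(t->left), std::move(t->right));
            return true;
        }
        bool removed = eraseFrom(key < t->key ? t->left : t->right, key);
        if (removed) update(*t);
        return removed;
    }

public:
    std::size_t size() const { return sz(root_); }

    bool contains(int key) const {
        const Node* t = root_.get();
        while (t && t->key != key) t = (key < t->key ? t->left : t->right).get();
        return t != nullptr;
    }

    bool insert(int key) {
        if (contains(key)) return false;
        Ptr l, r;
        split(std::move(root_), key, l, r);
        Ptr node(new Node(key, static_cast<unsigned>(rng_())));
        root_ = merge(merge(std::move(l), std::move(node)), std::move(r));
        return true;
    }

    bool erase(int key) { return eraseFrom(root_, key); }

    // Number of stored keys strictly smaller than key: one root-to-leaf walk.
    std::size_t countLess(int key) const {
        std::size_t n = 0;
        const Node* t = root_.get();
        while (t) {
            if (t->key < key) { n += sz(t->left) + 1; t = t->right.get(); }
            else t = t->left.get();
        }
        return n;
    }

    // Number of stored keys in the half-open range [lo, hi).
    std::size_t countInRange(int lo, int hi) const {
        return lo < hi ? countLess(hi) - countLess(lo) : 0;
    }
};
```

Insert, erase, lookup and range counting each follow one or two root-to-leaf paths, so they run in expected O(log n) time. The range count is the "subtract two ranks" reduction mentioned in the question's EDIT.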

Confusion in the differences between hashmap and hashtable

I have a confusion:
I have read in many posts that Hash-maps are implemented as binary search trees which makes the various operations time complexity of logarithmic order.
Hashtables on the other hand provide constant time fetching.
But, as I read in this post, no difference has been provided in terms of the complexity for retrieval/searching of elements in the two data structures.
So, here is my question-
Since hashtables are guaranteed to provide constant searching time complexity, their implementation must differ from those of hash-maps. So, why will someone ever use hash-maps if they do not provide constant time searching. Also, why in the first place, they are implemented as binary search trees?
I know hash maps store the keys in sorted form and provide iteration through the map. But the same could also be provided in the hashtable.
Your confusion stems from the following:
Hash-maps are implemented as binary search trees
They are not. No sensible naming convention would use the term "hashmap" to describe a structure based on a tree.
For example, in Java, HashMap is based on a hash table and TreeMap is based on a tree.
C++ uses neither "hash" nor "tree" in the naming of its standard containers. The two containers that broadly correspond to HashMap and TreeMap are std::unordered_map and std::map.
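A small sketch of the practical difference between the two C++ containers. The helper name keysInIterationOrder is hypothetical; it works on either container because both expose the same iteration interface.

```cpp
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: collect the keys of a map-like container
// in whatever order the container iterates them.
template <typename Map>
std::vector<std::string> keysInIterationOrder(const Map& m) {
    std::vector<std::string> keys;
    for (const auto& kv : m) keys.push_back(kv.first);
    return keys;
}

// std::map<std::string, int>:           keys come back sorted, lookup is O(log n).
// std::unordered_map<std::string, int>: keys come back in an unspecified order,
//                                       lookup is O(1) on average.
```

Passing a std::map yields the keys in ascending order; passing a std::unordered_map with the same contents yields the same keys in an order that depends on the implementation and the hash function.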

Which is the fastest STL container for find?

Alright as a preface I have a need to cache a relatively small subset of rarely modified data to avoid querying the database as frequently for performance reasons. This data is heavily used in a read-only sense as it is referenced often by a much larger set of data in other tables.
I've written a class which will have the ability to store basically the entirety of the two tables in question in memory while listening for commit changes in conjunction with a thread safe callback mechanism for updating the cached objects.
My current implementation has two std::vectors one for the elements of each table. The class provides both access to the entirety of each vector as well as convenience methods for searching for a specific element of table data via std::find, std::find_if, etc.
Does anyone know if using std::list, std::set, or std::map over std::vector for searching would be preferable? Most of the time that is what will be requested of these containers after populating once from the database when a new connection is made.
I'm also open to using C++0x features supported by VS2010 or Boost.
For searching for a particular value, std::set and std::map take O(log N) time, while the other two take O(N) time, so std::set or std::map are probably better. Since you have access to C++0x, you could also use std::unordered_set or std::unordered_map, which take constant time on average.
For find_if, there's little difference between them, because it takes an arbitrary predicate and containers cannot optimize arbitrarily, of course.
However if you will be calling find_if frequently with a certain predicate, you can optimize yourself: use a std::map or std::set with a custom comparator or special keys and use find instead.
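One way to replace a frequent find_if predicate with find is sketched below, assuming C++14 (for heterogeneous lookup through is_transparent) and a hypothetical Row type keyed by an integer id. None of these names come from the question.

```cpp
#include <set>
#include <string>

// Hypothetical cached table row.
struct Row {
    int id;
    std::string name;
};

// Orders rows by id and also allows comparing a Row against a bare id,
// so the set can be searched without building a dummy Row.
struct ById {
    using is_transparent = void;  // enables set::find(int) since C++14
    bool operator()(const Row& a, const Row& b) const { return a.id < b.id; }
    bool operator()(const Row& a, int id) const { return a.id < id; }
    bool operator()(int id, const Row& b) const { return id < b.id; }
};

using RowSet = std::set<Row, ById>;

// O(log n) lookup instead of an O(n) find_if over a vector.
const Row* findById(const RowSet& rows, int id) {
    auto it = rows.find(id);
    return it == rows.end() ? nullptr : &*it;
}
```

The comparator defines what "equal" means for the set, so two rows with the same id count as duplicates here even if their names differ.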
A sorted vector using std::lower_bound can be just as fast as std::set if you're not updating very often; they're both O(log n). It's worth trying both to see which is better for your own situation.
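A minimal sketch of the sorted-vector variant, assuming int elements; the function names are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// O(log n) membership test on a vector that is already sorted ascending.
bool containsSorted(const std::vector<int>& sorted, int value) {
    auto it = std::lower_bound(sorted.begin(), sorted.end(), value);
    return it != sorted.end() && *it == value;
}

// Insert while keeping the vector sorted. The search is O(log n), but the
// insertion shifts elements, so the whole call is O(n).
void insertSorted(std::vector<int>& sorted, int value) {
    sorted.insert(std::upper_bound(sorted.begin(), sorted.end(), value), value);
}
```

The contiguous storage is what tends to make this competitive with std::set for read-mostly data, though only measurement on the real workload can settle it.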
Since from your (extended) requirements you need to search on multiple fields, I would point you to Boost.MultiIndex.
This Boost library lets you build one container (holding only one copy of each element) and index it over an arbitrary number of indices. It also lets you specify which index to use for each operation.
To determine the kind of index to use, you'll need extensive benchmarks. 500 is a relatively low number of entries, so constant factors won't play nicely. Furthermore, there can be a noticeable difference between single-thread and multi-thread usage (most hash-table implementations can collapse on MT usage because they do not use linear-rehashing, and thus a single thread ends up rehashing the table, blocking all others).
I would recommend a sorted index (skip-list-like, if possible) to accommodate range requests (all names beginning with Abc?) if the performance difference is either unnoticeable or simply does not matter.
If you only want to search for distinct values in one specific column of the table, then a hash-based container (std::unordered_map or std::unordered_set) is fastest.
If you want to be able to search using several different predicates, you will need some kind of index structure. It can be implemented by extending your current vector based approach with several hash tables or maps, one for each field to search for, where the value is either an index into the vector, or a direct pointer to the element in the vector.
Going further, if you want to be able to search for ranges, such as all occasions having a date in July, you need an ordered data structure from which you can extract a range.
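The "vector plus one index per field" layout can be sketched like this. Row, RowCache and the field names are hypothetical, ids are assumed to be unique, and rows are assumed never to be removed. The indices store positions rather than pointers, because a growing vector may reallocate and invalidate pointers to its elements.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical cached table row.
struct Row {
    int id;
    std::string name;
};

// Hypothetical cache: rows live in a vector, hash indices map field values
// to positions in that vector.
class RowCache {
public:
    void add(Row row) {
        std::size_t pos = rows_.size();
        byId_[row.id] = pos;             // assumes ids are unique
        byName_.emplace(row.name, pos);  // names may repeat
        rows_.push_back(std::move(row));
    }

    // Average O(1) lookup by id. The pointer is valid until the next add().
    const Row* findById(int id) const {
        auto it = byId_.find(id);
        return it == byId_.end() ? nullptr : &rows_[it->second];
    }

    // All rows with the given name, average O(1 + matches).
    std::vector<const Row*> findByName(const std::string& name) const {
        std::vector<const Row*> out;
        auto range = byName_.equal_range(name);
        for (auto it = range.first; it != range.second; ++it)
            out.push_back(&rows_[it->second]);
        return out;
    }

    const std::vector<Row>& all() const { return rows_; }

private:
    std::vector<Row> rows_;
    std::unordered_map<int, std::size_t> byId_;
    std::unordered_multimap<std::string, std::size_t> byName_;
};
```

For range queries, the hash index on a field would be swapped for a std::multimap on that field, whose lower_bound and upper_bound delimit the range.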
Not an answer per se, but be sure to use a typedef to refer to the container type you do use, something like typedef std::vector< itemtype > data_table_cache; Then use your typedef type everywhere.

What is the difference between set and hashset in C++ STL?

When should I choose one over the other?
Are there any pointers that you would recommend for using the right STL containers?
hash_set is an extension that is not part of the C++ standard. Lookups should be O(1) rather than O(log n) for set, so it will be faster in most circumstances.
Another difference will be seen when you iterate through the containers. set will deliver the contents in sorted order, while hash_set will be essentially random (Thanks Lou Franco).
Edit: The C++11 update to the C++ standard introduced unordered_set which should be preferred instead of hash_set. The performance will be similar and is guaranteed by the standard. The "unordered" in the name stresses that iterating it will produce results in no particular order.
std::set is implemented as a binary search tree.
hash_set is implemented as a hash table.
The main issue here is that many people use std::set thinking it is a hash table with O(1) look-up, which it is not. It really has O(log(n)) look-ups. Other than that, read about binary trees vs hash tables to get a better idea of the data structures.
Another thing to keep in mind is that with hash_set you have to provide the hash function, whereas a set only requires a comparison function ('<') which is easier to define (and predefined for native types).
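To illustrate what each container asks of a user-defined key, here is a sketch using the standard std::unordered_set rather than the non-standard hash_set. The Point type and the functor names are hypothetical, and the hash-combining formula is one common choice, not a requirement.

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <tuple>
#include <unordered_set>

// Hypothetical key type.
struct Point {
    int x;
    int y;
};

// std::set only needs a strict weak ordering.
struct PointLess {
    bool operator()(const Point& a, const Point& b) const {
        return std::tie(a.x, a.y) < std::tie(b.x, b.y);
    }
};

// std::unordered_set needs a hash function and an equality test.
struct PointHash {
    std::size_t operator()(const Point& p) const {
        std::size_t h1 = std::hash<int>()(p.x);
        std::size_t h2 = std::hash<int>()(p.y);
        // Mix the two hashes; any reasonable combiner would do.
        return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
    }
};

struct PointEqual {
    bool operator()(const Point& a, const Point& b) const {
        return a.x == b.x && a.y == b.y;
    }
};

using PointSet = std::set<Point, PointLess>;
using PointHashSet = std::unordered_set<Point, PointHash, PointEqual>;
```

For built-in types and std::string neither functor has to be written, since the standard library supplies both operator< and std::hash for them.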
I don't think anyone has answered the other part of the question yet.
The reason to use hash_set or unordered_set is the usually O(1) lookup time. I say usually because every so often, depending on implementation, a hash may have to be copied to a larger hash array, or a hash bucket may end up containing thousands of entries.
The reason to use a set is if you often need the largest or smallest member of a set. A hash has no order so there is no quick way to find the smallest item. A tree has order, so largest or smallest is very quick. O(log n) for a simple tree, O(1) if it holds pointers to the ends.
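The difference for smallest/largest queries can be sketched as follows, using std::unordered_set as the standard counterpart of hash_set and assuming non-empty containers of int. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <unordered_set>

// Ordered set: the extremes sit at the two ends of the iteration order,
// and begin()/rbegin() are constant-time operations.
int smallestOrdered(const std::set<int>& s) {
    assert(!s.empty());
    return *s.begin();
}

int largestOrdered(const std::set<int>& s) {
    assert(!s.empty());
    return *s.rbegin();
}

// Hash set: no order is maintained, so every element must be inspected. O(n).
int smallestUnordered(const std::unordered_set<int>& s) {
    assert(!s.empty());
    return *std::min_element(s.begin(), s.end());
}
```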
A hash_set would be implemented by a hash table, which has mostly O(1) operations, whereas a set is implemented by a tree of some sort (AVL, red black, etc.) which have O(log n) operations, but are in sorted order.
Edit: I had written that trees are O(n). That's completely wrong.

What's a good and stable C++ tree implementation?

I'm wondering if anyone can recommend a good C++ tree implementation, hopefully one that is STL-compatible if at all possible.
For the record, I've written tree algorithms many times before, and I know it can be fun, but I want to be pragmatic and lazy if at all possible. So an actual link to a working solution is the goal here.
Note: I'm looking for a generic tree, not a balanced tree or a map/set, the structure itself and the connectivity of the tree is important in this case, not only the data within.
So each branch needs to be able to hold arbitrary amounts of data, and each branch should be separately iterable.
I don't know about your requirements, but wouldn't you be better off with a graph (implementations for example in Boost Graph) if you're interested mostly in the structure and not so much in tree-specific benefits like speed through balancing? You can 'emulate' a tree through a graph, and maybe it'll be (conceptually) closer to what you're looking for.
Take a look at this.
The tree.hh library for C++ provides an STL-like container class for n-ary trees, templated over the data stored at the nodes. Various types of iterators are provided (post-order, pre-order, and others). Where possible the access methods are compatible with the STL or alternative algorithms are available.
I am going to suggest using std::map instead of a tree.
The complexity characteristics of a tree are:
Insert: O(ln(n))
Removal: O(ln(n))
Find: O(ln(n))
These are the same characteristics the std::map guarantees.
Thus as a result most implementations of std::map use a tree (Red-Black Tree) underneath the covers (though technically this is not required).
If you don't have (key, value) pairs, but simply keys, use std::set. That uses the same Red-Black tree as std::map.
Ok folks, I found another tree library; stlplus.ntree. But haven't tried it out yet.
Let's suppose the question is about balanced binary trees (in some form, mostly red-black trees), even if that is not the case.
Balanced binary trees, like vector, can maintain an ordering of elements without any key (as when inserting elements anywhere in a vector), but:
With optimal O(log(n)) or better complexity for every modification of one element (add/remove at the beginning, at the end, and before or after any iterator)
With persistence of iterators through any modification except direct destruction of the element pointed to by the iterator.
Optionally, one may support access by index as in vector (at a cost of one size_t per element), with O(log(n)) complexity. If used, the iterators will be random access.
Optionally, order can be enforced by some comparison function, but persistence of iterators allows the use of a non-repeatable comparison scheme (e.g. arbitrary lane changes by cars during a traffic jam).
In practice, a balanced binary tree can offer the interface of vector, list, doubly linked list, map, multimap, deque, queue, priority_queue... while attaining the theoretically optimal O(log(n)) complexity for all single-element operations.
<sarcastic> this is probably why c++ stl does not propose it </sarcastic>
Individuals tend not to implement a general balanced tree themselves, because it is difficult to get the balancing right, especially during element removal.
There is no widely available implementation of a balanced binary tree because the known state-of-the-art red-black tree implementation (at this time the best type of balanced tree, due to the fixed number of costly tree reorganizations during removal), slavishly copied by every implementer from the initial code of the structure's inventor, does not allow iterator persistence. This is probably the reason for the absence of a fully functional tree template.