I am confused about the runtimes of the find_min operation on a binary search tree versus a binary heap. I understand that returning the min in a binary heap is an O(1) operation, and I also understand why, in theory, returning the minimum element of a binary search tree is an O(log N) operation. Much to my surprise, when I read up on the data structure in the C++ STL, the documentation states that returning an iterator to the first element of a map (which is the same as returning the minimum element) takes constant time! Shouldn't this be returned in logarithmic time? I need someone to help me understand what C++ is doing under the hood to return this in constant time. Because if so, there is little point in using a binary heap in C++: the map data structure would support retrieving the min and max in constant time, deletion and search in O(log N), and it keeps everything sorted. That would mean the data structure has the benefits of both a BST and a binary heap tied up in one!
I had a disagreement about this with an interviewer: I was trying to explain to him that in C++, returning the min and max from a map (which is a self-balancing binary search tree) takes constant time. He was baffled and kept saying I was wrong and that a binary heap was the way to go. Clarification would be much appreciated.
The constant-time lookup of the minimum and maximum is achieved by storing references to the leftmost and rightmost nodes of the RB-tree in the header structure of the map. Here is a comment from the source code of the RB-tree, the template from which the implementations of std::set, std::map, and std::multimap are derived:
the header cell is maintained with links not only to the root but also to the leftmost node of the tree, to enable constant time begin(), and to the rightmost node of the tree, to enable linear time performance when used with the generic set algorithms (set_union, etc.)
The tradeoff here is that these pointers need to be maintained, so insertion and deletion operations would have another "housekeeping operation" to do. However, insertions and deletions are already done in logarithmic time, so there is no additional asymptotic cost for maintaining these pointers.
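As a toy illustration of the same idea (a minimal sketch, not the actual libstdc++ code, and deliberately without any rebalancing), a BST can cache a pointer to its leftmost node and update it as part of the "housekeeping" on each insertion:

```cpp
#include <cassert>

// Sketch only: an unbalanced BST that caches a pointer to its leftmost
// node so that find_min() is O(1). Real std::map does the same trick
// inside the red-black tree header.
struct Node {
    int key;
    Node* left = nullptr;
    Node* right = nullptr;
};

struct Tree {
    Node* root = nullptr;
    Node* leftmost = nullptr;  // maintained on every insert

    void insert(int key) {
        Node** cur = &root;
        while (*cur)
            cur = key < (*cur)->key ? &(*cur)->left : &(*cur)->right;
        *cur = new Node{key};
        // Housekeeping: update the cached minimum if necessary (O(1)).
        if (!leftmost || key < leftmost->key)
            leftmost = *cur;
    }

    // Constant time: no walk down the left spine is needed.
    int find_min() const { return leftmost->key; }
};
```

Deletion would need the symmetric bit of housekeeping (recomputing the cached pointer when the minimum is removed), but since deletion is already O(log N), that doesn't change the asymptotics.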
At least in a typical implementation, std::set (and std::map) will be implemented as a threaded binary tree [1]. In other words, each node contains not only a pointer to its (up to) two children, but also pointers to the previous and next node in order. The set class itself then has pointers not only to the root of the tree, but also to the beginning and end of the threaded list of nodes.
To search for a node by key, the normal binary pointers are used. To traverse the tree in order, the threading pointers are used.
This does have a number of disadvantages compared to a binary heap. The most obvious is that it stores four pointers for each data item, where a binary heap can store just data, with no pointers (the relationships between nodes are implicit in the positions of the data). In an extreme case (e.g., std::set<char>) this could end up using a lot more storage for the pointers than for the data you actually care about (e.g., on a 64-bit system you could end up with 4 pointers of 64-bits apiece, to store each 8-bit char). This can lead to poor cache utilization, which (in turn) tends to hurt speed.
In addition, each node will typically be allocated individually, which can reduce locality of reference quite badly, again hurting cache usage and further reducing speed.
As such, even though the threaded tree can find the minimum or maximum, or traverse to the next or previous node in O(1), and search for any given item in O(log N), the constants can be substantially higher than doing the same with a heap. Depending on the size of items being stored, total storage used may be substantially greater than with a heap as well (worst case is obviously when only a little data is stored in each node).
[1] With some balancing algorithm applied, most often red-black, but sometimes AVL trees or B-trees. Any number of other balanced trees could be used (e.g., alpha-balanced trees, k-neighbor trees, binary B-trees, general balanced trees).
I'm not an expert on maps, but the first element of a map is effectively a 'root' of sorts: there is always a pointer to it, so looking it up is instant. The same goes for a BST, since it clearly has a root node with two nodes off of it, and so on. (By the way, I would look into using an AVL tree, as its worst-case lookup time is much better than a plain BST's.)
O(log N) is normally only used as an estimate of the worst-case scenario. If you have a completely unbalanced BST, you'll actually get O(N): searching for the last node requires a comparison against every node.
I'm not too sure about your last statement, though. A map is different from a self-balancing tree; those are called AVL trees (or that's what I was taught). A map uses 'keys' to organize objects in a specific way: the key is found by serializing the data into a number, and that number is, for the most part, placed in a list.
In a C++ std::set (often implemented using red-black binary search trees), the elements are automatically sorted, and key lookups and deletions at arbitrary positions take O(log n) time.
In a sorted C++ std::vector, lookups are also fast (actually probably a bit faster than std::set), but insertions are slow (since maintaining sortedness takes time O(n)).
However, sorted C++ std::vectors have another property: they can find the number of elements in a range quickly (in time O(log n)).
i.e., a sorted C++ std::vector can quickly answer: how many elements lie between given x,y?
std::set can quickly find iterators to the start and end of the range, but gives no clue how many elements are within.
So, is there a data structure that allows all the speed of a C++ std::set (fast lookups and deletions), but also allows fast computation of the number of elements in a given range?
(By fast, I mean O(log n) time, or maybe a polynomial in log n, or maybe even sqrt(n); just as long as it's faster than O(n), since an O(n) query is barely better than the trivial approach of searching through everything.)
(If that is not possible, even an estimate of the number to within a constant factor would be useful. For integers there is a trivial upper bound of y-x+1, but how do you get a lower bound? For arbitrary objects with an ordering, there's no such estimate.)
EDIT: I have just seen the related question, which essentially asks whether one can compute the number of preceding elements. (Sorry, my fault for not seeing it before.) This is trivially equivalent to this question: to get the number of elements in a range, compute the counts at the start and end elements and subtract.
However, that question also allows the data to be computed once and then fixed, unlike here, so that question (and the sorted-vector answer) isn't actually a duplicate of this one.
The data structure you're looking for is an Order Statistic Tree.
It's typically implemented as a binary search tree in which each node additionally stores the size of its subtree.
Unfortunately, I'm pretty sure the STL doesn't provide one.
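Here is a minimal sketch of the subtree-size idea (hypothetical code, unbalanced for brevity; a real order statistic tree would layer this on top of red-black balancing to guarantee O(log n)):

```cpp
#include <cassert>

// Each node stores the size of its subtree, which lets us count the
// elements below any key in O(height) time.
struct OstNode {
    int key;
    int size = 1;            // nodes in this subtree, including this one
    OstNode* left = nullptr;
    OstNode* right = nullptr;
};

int size_of(const OstNode* n) { return n ? n->size : 0; }

OstNode* insert(OstNode* n, int key) {
    if (!n) return new OstNode{key};
    if (key < n->key) n->left = insert(n->left, key);
    else              n->right = insert(n->right, key);
    n->size = 1 + size_of(n->left) + size_of(n->right);
    return n;
}

// Number of keys strictly less than `key`.
int count_less(const OstNode* n, int key) {
    if (!n) return 0;
    if (key <= n->key) return count_less(n->left, key);
    return 1 + size_of(n->left) + count_less(n->right, key);
}

// Number of elements in the half-open range [lo, hi): two descents,
// so still O(height) per query.
int count_range(const OstNode* n, int lo, int hi) {
    return count_less(n, hi) - count_less(n, lo);
}
```

With balancing added, both lookups/deletions and range counts run in O(log n), which is exactly the combination the question asks for.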
All data structures have their pros and cons, which is the reason the standard library offers a bunch of containers.
The rule is that there is usually a trade-off between the speed of modification and the speed of data extraction. Here you would like easy access to the number of elements in a range. One possibility in a tree-based structure is to cache in each node the number of elements of its subtree. That would add roughly O(log N) extra work (the height of the tree) to each insertion or deletion, but would greatly speed up computing the number of elements in a range. Unfortunately, few classes in the C++ standard library are tailored for derivation (and AFAIK std::set is not), so you will have to implement your tree from scratch.
Maybe you are looking for a C++ alternative to Java's LinkedHashSet: https://docs.oracle.com/javase/7/docs/api/java/util/LinkedHashSet.html
I've been using std::vector, but it has become unwieldy as the data it iterates through has grown, and I would like to be able to filter out random elements when they become redundant. I get this behaviour elsewhere with std::list, but I can't get binary_search to play nicely with that.
Is there some code I could use to get binary_search to work again, or must I employ still more elaborate containers and syntax?
if (binary_search(iter + 1, myLines.end(), line)) {
    firstFound.assign(line);
    if (numFinds++) break;
}
std::set's lookup is O(log N), exactly as for binary_search, and deletion is amortized O(1) provided you have an iterator, or O(log N) if you don't (lookup plus deletion). The set will store the elements sorted, but that must be fine for you, since binary_search also works only on sorted ranges.
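A small sketch of how the pattern from the question might look with a multiset (the container name `myLines` is carried over from the question for illustration; the helper names are my own):

```cpp
#include <cassert>
#include <set>
#include <string>

// O(log N) membership test, same order of growth as binary_search on a
// sorted vector.
bool contains_line(const std::multiset<std::string>& lines,
                   const std::string& line) {
    return lines.find(line) != lines.end();
}

// Erase by iterator: amortized O(1), with no shifting of elements as a
// vector would require.
void erase_one(std::multiset<std::string>& lines, const std::string& line) {
    auto it = lines.find(line);   // O(log N)
    if (it != lines.end())
        lines.erase(it);          // removes exactly one copy
}
```

A multiset is used here so duplicate lines are allowed, matching the list-of-lines scenario; plain std::set works the same way if duplicates should collapse.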
It sounds like what you're looking for is a binary search tree (BST).
It trivially allows binary searching (it's in the name after all), and deletion is relatively straightforward.
Many kinds of binary search trees exist. Some common ones:
simple BST: a large spread between worst-case and best-case search performance, depending on the specific insertion and deletion pattern. The higher the tree, the worse the performance.
self-balancing BST: on every insert/delete, nodes are shifted around if needed to keep the tree height minimal, and hence the search performance optimal. This comes at the cost of added overhead (which can become prohibitively high) when inserting and deleting.
red-black tree: a kind of self-balancing BST that doesn't always strive for the optimal height, but is still good enough to keep lookups O(log n). It performs very consistently across a wide range of use cases (i.e. it has good general-purpose performance), which is why it was picked for the implementation of std::set (see Armen Tsirunyan's answer).
splay tree: a BST that keeps recently accessed items at the top of the tree, where they can be accessed the fastest. It is not self-balancing, so the height is not kept minimal. This kind of tree is very useful when recently accessed data is likely to be accessed again soon.
treap: a BST in which random priorities are assigned to items in order to keep the height of the tree close to minimal with high probability. It doesn't guarantee optimal search performance, but in practice it usually delivers it, without the added overhead of self-balancing or similar reorganizing algorithms.
Which of these is best for you will depend on your specific use case.
So I am trying to understand the data types and Big O notation of some functions for a BST and Hashing.
So first off, how are BSTs and hash tables stored? Are BSTs usually arrays, or are they linked structures, since each node has to point to its left and right children?
What about hashing? I've had the most trouble finding clear information about hashing in terms of computation-based searching. I understand that hashing is best implemented with an array of chains. Is this for faster searching, or to decrease the overhead of creating the allocated data type?
This following question might be just bad interpretation on my part, but what makes a traversal function different from a search function in BSTs, Hashing, and STL containers?
Is traversal Big O(N) for BSTS because you're actually visiting each node/data member, whereas search() can reduce its time by eliminating half the searching field?
And somewhat related, why is it that in the STL, list.insert() and list.erase() have a Big O(1) whereas the vector and deque counterparts are O(N)?
Lastly, why would vector.push_back() be O(N)? I thought the operation could be done in O(1) with something along these lines, but I've come across text saying it is O(N):
vector<int> vic(2, 3);
vector<int>::iterator IT = vic.end();
// want to insert 4 at the end using push_back
IT++;
(*IT) = 4;
Hopefully this works. I'm a bit tired, but I would love any explanation of why something similar to that wouldn't be efficient or plausible. Thanks.
BSTs (ordered binary trees) are a series of nodes where each parent node points to its two children, which in turn point to their (at most) two children, and so on. They're traversed in O(n) time because traversal visits every node. Lookups take O(log n) time. Inserts take O(log n) to find the position, but the insertion itself is O(1): there is no need to move a bunch of existing nodes, just allocate some memory and re-aim the pointers. :)
Hashes (unordered_map) use a hashing algorithm to assign elements to buckets. Usually each bucket contains a linked list, so that hash collisions just result in several elements in the same bucket. Traversal will again be O(n), as expected. Lookups and inserts will be amortized O(1). Amortized means on average O(1), though an individual insert might trigger a rehash (a redistribution of buckets to minimize collisions); over time the average complexity is still O(1). Note, however, that big-O notation doesn't really deal with constant factors, only the order of growth. The constant overhead in the hashing algorithm can be high enough that for some data sets the O(log n) binary trees outperform the hashes. Nevertheless, the hash's advantage is that its operations have (amortized) constant time complexity.
Search functions take advantage (in the case of binary trees) of the notion of "order"; a search through a BST has the same characteristics as a basic binary search over an ordered array. O(log n) growth. Hashes don't really "search". They compute the bucket, and then quickly run through the collisions to find the target. That's why lookups are constant time.
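The traversal-vs-search distinction above can be made concrete with std::set (a small sketch; the function names are my own):

```cpp
#include <cassert>
#include <set>

// Traversal: an in-order walk that touches every one of the N elements,
// hence O(N).
int sum_all(const std::set<int>& s) {
    int total = 0;
    for (int x : s)
        total += x;
    return total;
}

// Search: descends a single root-to-leaf path of the balanced tree,
// hence O(log N).
bool has(const std::set<int>& s, int key) {
    return s.find(key) != s.end();
}
```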
As for insert and erase; in array-based sequence containers, all elements that come after the target have to be bumped over to the right. Move semantics in C++11 can improve upon the performance, but the operation is still O(n). For linked sequence containers (list, forward_list, trees), insertion and erasing just means fiddling with some pointers internally. It's a constant-time process.
push_back() will be O(1) until you exceed the existing allocated capacity of the vector. Once the capacity is exceeded, a new allocation takes place to produce a container that is large enough to accept more elements. All the elements need to then be moved into the larger memory region, which is an O(n) process. I believe Move Semantics can help here as well, but it's still going to be O(n). Vectors and strings are implemented such that as they allocate space for a growing data set, they allocate more than they need, in anticipation of additional growth. This is an efficiency safeguard; it means that the typical push_back() won't trigger a new allocation and move of the entire data set into a larger container. But eventually after enough push_backs, the limit will be reached, and the vector's elements will be copied into a larger container, which again has some extra headroom left over for more efficient push_backs.
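The growth behaviour described above can be observed directly by watching capacity() (a sketch; the exact growth factor is implementation-defined):

```cpp
#include <cassert>
#include <vector>

// Counts how many times push_back would trigger a reallocation while
// appending n elements. With geometric capacity growth, this number
// grows roughly like log(n), not like n, which is why push_back is
// amortized O(1).
int count_reallocations(int n) {
    std::vector<int> v;
    int reallocations = 0;
    for (int i = 0; i < n; ++i) {
        if (v.size() == v.capacity())   // next push must allocate
            ++reallocations;
        v.push_back(i);
    }
    return reallocations;
}
```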
Traversal refers to visiting every node, whereas search is only to find a particular node, so your intuition is spot on there. O(N) complexity because you need to visit N nodes.
std::vector::insert is for insertion in the middle, and it involves copying all subsequent elements over by one slot in order to make room for the element being inserted, hence O(N). A linked list doesn't have this issue, hence O(1). Similar logic applies to erase. deque's properties are similar to vector's.
std::vector::push_back is an O(1) operation for the most part; it only deviates when capacity is exceeded and a reallocation plus copy is needed.
I am looking for the best data structure for C++ in which insertion and deletion can take place very efficiently and fast.
Traversal should also be very easy for this data structure. Which one should i go with?
What about std::set in C++?
A linked list provides efficient insertion and deletion of arbitrary elements. Deletion here is deletion by iterator, not by value. Traversal is quite fast.
A deque provides efficient insertion and deletion only at the ends, but those are faster than for a linked list, and traversal is faster as well.
A set only makes sense if you want to find elements by their value, e.g. to remove them. Otherwise the overhead of checking for duplicates, as well as that of keeping things sorted, will be wasted.
It depends on what you want to put into this data structure. If the items are unsorted, or you care about their insertion order, list<> could be used. If you want them kept in sorted order, set<> or multiset<> (the latter allows multiple identical elements) could be an alternative.
list<> is typically a doubly linked list, so insertion and deletion can be done in constant time, provided you know the position. Traversal over all elements is also fast, but accessing a specific element (either by value or by position) can be slow.
set<> and its family are typically binary trees, so insertion, deletion and searching for elements are mostly in logarithmic time (when you know where to insert/delete, it's constant time). Traversal over all elements is also fast.
(Note: boost and C++11 both have data structures based on hash-tables, which could also be an option)
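One detail worth illustrating from the point above: when you already know where an element belongs, set<> insertion drops to amortized constant time via the hinted overload of insert (a sketch; the function name is my own):

```cpp
#include <cassert>
#include <set>

// Builds {0, 1, ..., n-1} using hinted insertion. Because each new
// element belongs at the end, passing s.end() as the hint makes each
// insert amortized O(1) instead of O(log n).
std::set<int> build_sorted_range(int n) {
    std::set<int> s;
    for (int i = 0; i < n; ++i)
        s.insert(s.end(), i);   // hint: insert just before end()
    return s;
}
```

When the hint is wrong, the insertion simply falls back to the usual O(log n) search, so the hint can never hurt correctness.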
I would say a linked list, depending on whether or not your deletions are specific and frequent. Iterate over it.
It occurs to me, that you need a tree.
I'm not sure about the exact structure (since you didn't provide detailed info), but if you can put your data into a binary tree, you can achieve decent speed at searching, deleting, and inserting elements (O(log n) on average and O(n) in the worst case).
Note that I'm talking about the data structure here, you can implement it in different ways.
I am trying to implement my own heap with a method for removing any element (not only the min or max), but I can't solve one problem. To write that removal function I need pointers to the elements in my heap (to get O(log n) removal of a given element). But when I tried to do it this way:
vector<int*> ptr(n);
it of course did not work.
Maybe I should insert another class or structure containing an int into my heap, but for now I would like to find a solution with plain int (because I have already implemented it using int)?
When you need to remove (or change the priority of) objects other than the root, a d-heap isn't necessarily the ideal data structure: the nodes keep changing position, and you need to keep track of the various moves. It is doable, however. To use a heap like this, you would return a handle for each newly inserted object that identifies some sort of node which stays put. Since the d-heap algorithm relies on the tree being perfectly balanced, you effectively need to implement it using an array. These two requirements (using an array and having nodes stay put) are mutually exclusive, so you need both: an index from each node into the array (so you can find the object's position in the array) and a pointer from each array slot back to its node (so you can update the node when the position changes). Almost certainly you don't want to move your nodes a lot, i.e. you would rather accept finding the proper direction to move a node by examining multiple children, i.e. you want to use d > 2.
There are alternative approaches to implementing a heap which are inherently node-based. In particular, Fibonacci heaps yield, for certain usage patterns, a better amortized complexity than the usual O(log n). However, they are somewhat harder to implement, and the actual efficiency only pays off if you either need to change the priority of nodes frequently or have fairly large data sets.
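The handle-plus-index scheme described above can be sketched for plain ints like this (a minimal binary min-heap, d = 2 for brevity; all the names are illustrative, not a standard API):

```cpp
#include <cassert>
#include <initializer_list>
#include <utility>
#include <vector>

// Indexed binary min-heap: insert() returns a stable integer handle,
// and remove(handle) deletes that element in O(log n). Two maps are
// kept in sync: slot_of_ (handle -> array slot) and handle_at_
// (array slot -> handle), mirroring the index/pointer pair from the
// answer above.
class IndexedHeap {
    std::vector<int> keys_;       // heap-ordered keys
    std::vector<int> handle_at_;  // heap slot -> handle
    std::vector<int> slot_of_;    // handle -> heap slot (-1 if removed)

    void swap_slots(int a, int b) {
        std::swap(keys_[a], keys_[b]);
        std::swap(handle_at_[a], handle_at_[b]);
        slot_of_[handle_at_[a]] = a;   // keep both maps consistent
        slot_of_[handle_at_[b]] = b;
    }
    void sift_up(int i) {
        while (i > 0 && keys_[i] < keys_[(i - 1) / 2]) {
            swap_slots(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
    }
    void sift_down(int i) {
        for (;;) {
            int smallest = i;
            for (int c : {2 * i + 1, 2 * i + 2})
                if (c < (int)keys_.size() && keys_[c] < keys_[smallest])
                    smallest = c;
            if (smallest == i) return;
            swap_slots(i, smallest);
            i = smallest;
        }
    }

public:
    int insert(int key) {                  // returns a stable handle
        int handle = (int)slot_of_.size();
        keys_.push_back(key);
        handle_at_.push_back(handle);
        slot_of_.push_back((int)keys_.size() - 1);
        sift_up((int)keys_.size() - 1);
        return handle;
    }
    void remove(int handle) {              // O(log n)
        int i = slot_of_[handle];
        swap_slots(i, (int)keys_.size() - 1);  // move victim to the end
        keys_.pop_back();
        handle_at_.pop_back();
        slot_of_[handle] = -1;
        if (i < (int)keys_.size()) {       // restore heap order at slot i
            sift_up(i);
            sift_down(i);
        }
    }
    int min() const { return keys_.front(); }
    std::size_t size() const { return keys_.size(); }
};
```

This keeps the stored elements as plain ints, as the question asks; the extra bookkeeping lives in the two side vectors rather than in the elements themselves.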
A heap is a particular sort of data structure: the elements are stored in a binary tree, and there are well-established procedures for adding and removing elements. Many implementations use an array to hold the tree nodes, and removing an element involves moving O(log n) elements around. Commonly, the children of the node at array location n are stored at locations 2n and 2n+1, with element 0 left empty.
This Wikipedia page does a fine job of explaining the algorithms.