Data structure (in STL or Boost) for retrieving kth smallest/largest item in a set? - c++

I am looking for a data structure in C++ STL or boost with the following properties:
Retrieval of kth largest item in O(log n) time
Searching in O(log n) time
Deletion in O(log n) time
If such a data structure implementation doesn't exist, is there a way to adapt a different data structure with extra data (e.g., set) so that the above is possible?
Note: I've found is-there-any-data-structure-in-c-stl-for-performing-insertion-searching-and-r, but this is 5 years old and doesn't mention boost.

For the moment I assume that the elements are unique and that there are at least k elements. If not, you can use multiset similarly.
You can accomplish this using two sets in C++:
#include <set>
Set 1: Let's call this large. It keeps the k largest elements only.
Set 2: Let's call this rest. It keeps the rest of the elements.
Searching: Just search both sets; this takes O(log n) since both sets are red-black trees.
Deleting: If the element is in rest, just delete it. If not, delete it from large, then remove the largest element from rest and put it into large. Deleting from a red-black tree takes O(log n).
Inserting new elements (initializing): Each time a new element comes: (1) If large has fewer than k elements, add it to large. (2) Otherwise, if the element is greater than the minimum element in large, remove that minimum, move it to rest, and add the new element to large. (3) Otherwise, simply add the new element to rest. Deleting and inserting in a red-black tree take O(log n).
This way, large always has the k largest elements, and the minimum of those is the k-th largest which you want.
I leave it to you to figure out how to do search, insert, find-min, find-max, and delete on a set; it's not that hard, and all of these operations take O(log n) on a balanced binary search tree.
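For concreteness, here is a minimal sketch of the idea for unique int elements (the struct name KthTracker and its member functions are just illustrative, not from the STL or Boost):

#include <cstddef>
#include <iterator>
#include <set>

struct KthTracker {
    std::size_t k;
    std::set<int> large;  // the k largest elements seen so far
    std::set<int> rest;   // everything else

    // The k-th largest is the smallest element of large
    // (valid once large.size() == k).
    int kth_largest() const { return *large.begin(); }

    void insert_value(int x) {
        if (large.size() < k) {
            large.insert(x);
        } else if (x > *large.begin()) {
            // x belongs in large: demote the current minimum of large to rest
            rest.insert(*large.begin());
            large.erase(large.begin());
            large.insert(x);
        } else {
            rest.insert(x);
        }
    }

    void erase_value(int x) {
        if (rest.erase(x) > 0) return;        // it was in rest: done
        if (large.erase(x) > 0 && !rest.empty()) {
            // it was in large: promote the largest element of rest
            auto it = std::prev(rest.end());
            large.insert(*it);
            rest.erase(it);
        }
    }
};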

Related

Is there a sorted data structure with logarithmic time insertion, deletion and find (with distance)?

I have a sorted array in which I find the number of items less than a particular value using binary search (std::upper_bound) in O(log n) time.
Now I want to insert into and delete from this array while keeping it sorted. I want the overall complexity to be O(log n).
I know that using a binary search tree or std::multiset I can do insertion, deletion and upper_bound in O(log n), but I am not able to get the distance/index (std::distance is O(n) for sets) in logarithmic time.
So is there a way to achieve what I want to do?
You can augment any balanced-binary-search-tree data structure (e.g. a red-black tree) by including a "subtree size" data-member in each node (alongside the standard "left child", "right child", and "value" members). You can then calculate the number of elements less than a given element as you navigate downward from the root to that element.
It adds a fair bit of bookkeeping, and of course it means you need to use your own balanced-binary-search-tree implementation instead of one from the standard library; but it's quite doable, and it doesn't affect the asymptotic complexities of any of the operations.
You can use a balanced BST with the size of the left subtree stored in each node to calculate the distance.
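A minimal sketch of the walk described above, assuming the size fields are kept correct on insert/delete and during rebalancing rotations (Node and count_less are illustrative names, not from any library):

#include <cstddef>

struct Node {
    int value;
    std::size_t size = 1;   // number of nodes in this node's subtree
    Node* left = nullptr;
    Node* right = nullptr;
};

std::size_t subtree_size(const Node* n) { return n ? n->size : 0; }

// Number of elements strictly less than x, i.e. x's index in sorted order.
std::size_t count_less(const Node* root, int x) {
    std::size_t count = 0;
    while (root) {
        if (x <= root->value) {
            // root and everything to its right are >= x; look left
            root = root->left;
        } else {
            // root and its entire left subtree are < x
            count += subtree_size(root->left) + 1;
            root = root->right;
        }
    }
    return count;
}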

Structure like priority queue but with something like lower bound

I want a structure to store (for example) numbers in which I can insert and remove elements while the structure always stays sorted (like a priority queue), BUT with the possibility of knowing where a given number is, and with every operation in logarithmic time.
Maybe with lower_bound, upper_bound, or just a binary search, but what prevents me from doing a binary search on a priority_queue is that I cannot access the elements by index, only the first one.
I think you’re looking for an order statistics tree, an augmented BST that supports all the regular BST operations in time O(log n), along with two others:
rank(elem): return which index elem would occupy in the sorted sequence.
index(k): given an index k, return the element at that index in the sorted sequence.
The two above operations run in O(log n) time, making them extremely fast.
You can treat an order statistics tree as a priority queue. Insertions work as normal BST insertions, and to extract the lowest/highest element you just remove the smallest/greatest element from the tree, which you can do in time O(log n) by just walking down the left or right spines of the tree.
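If you are using GCC, an order statistics tree is available out of the box as a libstdc++ extension (the policy-based data structures); this is not standard C++ or Boost, but it shows the two operations in action:

#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
#include <functional>

using ordered_set = __gnu_pbds::tree<
    int, __gnu_pbds::null_type, std::less<int>,
    __gnu_pbds::rb_tree_tag,
    __gnu_pbds::tree_order_statistics_node_update>;

int main() {
    ordered_set s;
    s.insert(5); s.insert(1); s.insert(9); s.insert(3);

    int at_index_2 = *s.find_by_order(2);   // index(k): element at index 2 -> 5
    auto rank_of_9 = s.order_of_key(9);     // rank(elem): elements strictly less than 9 -> 3

    (void)at_index_2; (void)rank_of_9;
}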
A priority queue does not keep things in sorted order. At least, not typically. A priority queue makes it possible for you to quickly obtain the next item in the sequence. But you can't efficiently access, say, the 5th item in the queue.
I know of three different ways to build a priority queue in which you can efficiently access items by key:
Use a balanced binary search tree to implement the queue. Although all operations are O(log n), typical running time is slower than a binary heap.
Implement the heap priority queue as a skip list. This is a good option. I've seen some people report that a skip list priority queue outperforms a binary heap. A search for [C++ skip list] will return you lots of implementations.
What I call an indexed binary heap also works. Basically, you marry a hash map or dictionary with a binary heap. The map is indexed by key, and its value contains the index of the item in the heap array. Such a thing is not difficult to build, and is quite effective.
Come to think of it, you can make an indexed version of any type of heap.
You have a number of options. I rather like the skip list, myself, but your mileage may vary.
The indexed binary heap, as I pointed out, is a hybrid data structure that maintains a dictionary (hash map) and a binary heap. Briefly how it works:
The dictionary key is the field that you use to look up an item that you put into the heap. The value is an integer: the index of that item in the heap.
The heap itself is a standard binary heap implemented in an array. The only difference is that every time you move an item from one place to another in the heap, you update its location in the dictionary. So, for example, if you swap two items, you have to swap not only the items themselves in the array, but also their positions as stored in the dictionary. For example:
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::vector<std::string> heap;                        // the array backing the binary heap
std::unordered_map<std::string, std::size_t> dict;    // item -> its current index in heap

// Swap the item at heap[a] with the item at heap[b], keeping dict in sync.
void swap_items(std::size_t a, std::size_t b)
{
    std::swap(heap[a], heap[b]);
    // update their positions in the dictionary:
    // after the swap, the item now at index a must map to a, and likewise for b
    dict[heap[a]] = a;
    dict[heap[b]] = b;
}
It's a pretty simple modification of a standard binary heap implementation. You just have to be careful to update the position every time you move an item.
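For example, a sift-up written in terms of swap_items keeps the dictionary consistent automatically (a sketch assuming a min-heap; sift_up itself is not part of the answer above):

// Restore the heap property after the item at index i was inserted at the
// bottom or had its key decreased (min-heap assumed). Every move goes
// through swap_items, so dict always reflects the current positions.
void sift_up(std::size_t i)
{
    while (i > 0) {
        std::size_t parent = (i - 1) / 2;
        if (heap[i] < heap[parent]) {
            swap_items(i, parent);
            i = parent;
        } else {
            break;
        }
    }
}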
You can also do this with node-based heaps like Pairing heap, Fibonacci heap, Skew heap, etc.

container for binary search as well as insert and delete

I am looking for an STL container or any other data structure that can do binary search in O(log n) time and also supports insert and delete in at most O(log n) time. Is that possible? Here n is the number of elements in the container.
Note - Binary Search here means knowing the indices and accessing values based on index.

complexity of set::insert

I have read that insert operation in a set takes only log(n) time. How is that possible?
To insert, first we have to find the location in the sorted array where the new element must sit. Using binary search this takes O(log n). Then, to insert at that location, all the elements succeeding it should be shifted one place to the right, which takes another O(n).
My doubt is based on my understanding that set is implemented as an array and elements are stored in sorted order. Please correct me if my understanding is wrong.
std::set is commonly implemented as a red-black binary search tree. Insertion on this data structure has a worst-case complexity of O(log n), as the tree is kept balanced.
Things do not get shifted over when inserting into a set. It is usually not stored as a vector or array, but rather as a binary tree. Finding the insertion point takes O(log n), and attaching the new leaf node takes O(1) (plus at most O(log n) of rebalancing), so in total insertion takes O(log n).

Difference between std::set and std::priority_queue

Since both std::priority_queue and std::set (and std::multiset) are data containers that store elements and allow you to access them in an ordered fashion, and have same insertion complexity O(log n), what are the advantages of using one over the other (or, what kind of situations call for the one or the other?)?
While I know that the underlying structures are different, I am not as interested in the difference in their implementation as I am in a comparison of their performance and suitability for various uses.
Note: I know about the no-duplicates rule in a set. That's why I also mentioned std::multiset, since it has exactly the same behavior as std::set but can be used where the stored data is allowed to compare as equal elements. So please, don't comment on the single/multiple keys issue.
A priority queue only gives you access to one element in sorted order -- i.e., you can get the highest priority item, and when you remove that, you can get the next highest priority, and so on. A priority queue also allows duplicate elements, so it's more like a multiset than a set. [Edit: As @Tadeusz Kopec pointed out, building a heap is also linear in the number of items in the heap, whereas building a set is O(N log N) unless it is being built from a sequence that is already ordered (in which case it is also linear).]
A set allows you full access in sorted order, so you can, for example, find two elements somewhere in the middle of the set, then traverse in order from one to the other.
std::priority_queue allows you to do the following (by default the largest element is on top; pass std::greater as the comparator to get the smallest, as this list assumes; a short sketch of both containers follows the lists below):
Insert an element O(log n)
Get the smallest element O(1)
Erase the smallest element O(log n)
while std::set has more possibilities:
Insert any element O(log n) and the constant is greater than in std::priority_queue
Find any element O(log n)
Find the first element >= the one you are looking for, O(log n) (lower_bound)
Erase any element O(log n)
Erase any element by its iterator, amortized O(1)
Move to previous/next element in sorted order O(1)
Get the smallest element O(1)
Get the largest element O(1)
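A short sketch of the operations listed above (purely illustrative; the values and the std::greater comparator for a min-priority-queue are assumptions):

#include <functional>
#include <queue>
#include <set>
#include <vector>

int main() {
    // min-priority-queue: only the top element is accessible
    std::priority_queue<int, std::vector<int>, std::greater<int>> pq;
    pq.push(42);              // insert an element, O(log n)
    int smallest = pq.top();  // get the smallest element, O(1)
    pq.pop();                 // erase the smallest element, O(log n)

    std::set<int> s{1, 3, 5, 7};
    s.insert(4);                   // insert any element, O(log n)
    auto it = s.find(5);           // find any element, O(log n)
    auto lb = s.lower_bound(2);    // first element >= 2, O(log n)
    ++it;                          // move to the next element in sorted order
    s.erase(it);                   // erase by iterator (erases 7)
    s.erase(4);                    // erase by value, O(log n)
    int lo = *s.begin();           // smallest element, O(1)
    int hi = *s.rbegin();          // largest element, O(1)

    (void)smallest; (void)lb; (void)lo; (void)hi;
}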
set/multiset are generally backed by a binary tree. http://en.wikipedia.org/wiki/Binary_tree
priority_queue is generally backed by a heap. http://en.wikipedia.org/wiki/Heap_(data_structure)
So the question is really when should you use a binary tree instead of a heap?
Both structures are laid out as a tree; however, the rules about the relationship between ancestors and descendants are different.
We will call the positions P for parent, L for left child, and R for right child.
In a binary (search) tree, L < P < R.
In a heap, P < L and P < R (for a min-heap; a max-heap reverses this).
So binary trees sort "sideways" and heaps sort "upwards".
So if we look at this as a triangle, then in the binary tree L, P, R are completely sorted, whereas in the heap the relationship between L and R is unknown (only their relationship to P is known).
This has the following effects:
If you have an unsorted array and want to turn it into a binary search tree, it takes O(n log n) time. If you want to turn it into a heap, it only takes O(n) time, since it only needs enough comparisons to establish each parent-child relationship rather than a total order (see the small illustration after this list).
Heaps are more efficient if you only need the extreme element (lowest or highest by some comparison function). Heaps only do the comparisons (lazily) necessary to determine the extreme element.
Binary trees perform the comparisons necessary to order the entire collection, and keep the entire collection sorted all-the-time.
Heaps have constant-time lookup (peek) of lowest element, binary trees have logarithmic time lookup of lowest element.
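A small illustration of the build-cost difference using the standard library (the values are arbitrary):

#include <algorithm>
#include <set>
#include <vector>

int main() {
    std::vector<int> v{5, 1, 9, 3, 7};

    // Turn the unsorted array into a (max-)heap in O(n)
    std::make_heap(v.begin(), v.end());
    int largest = v.front();              // peek at the extreme element, O(1)

    // Building a fully ordered set from the same data costs O(n log n)
    std::set<int> s(v.begin(), v.end());
    int smallest = *s.begin();

    (void)largest; (void)smallest;
}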
Since both std::priority_queue and std::set (and std::multiset) are data containers that store elements and allow you to access them in an ordered fashion, and have same insertion complexity O(log n), what are the advantages of using one over the other (or, what kind of situations call for the one or the other?)?
Even though the insert and erase operations for both containers have the same complexity O(log n), these operations are slower for std::set than for std::priority_queue. That's because std::set makes many memory allocations: every element of std::set is stored in its own allocation, while std::priority_queue (with the default underlying std::vector container) uses a single allocation to store all elements. On the other hand, std::priority_queue performs many swap operations on its elements, whereas std::set just swaps pointers. So if swapping is a very slow operation for the element type, using std::set may be more efficient. Moreover, the element type may not be swappable at all.
The memory overhead of std::set is also much bigger, because it has to store many pointers between its nodes.