c++ a fixed sized priority queue to store k-nearest neighbors

c++ a fixed sized priority queue to store k-nearest neighbors - c++

I am implementing k nearest neighbor search in a tree data structure. I store the results in a priority queue, which will automatically sort the elements in a ascending order, and so the first k elements are the results. The priority_queue container in STL is really not a good option here because it support only a few functions such as push(), pop(), top(), size() empty(), etc. A big problem here is that when searching the whole tree, I need to visit a lot of nodes, and using push() will make the priority queue longer and longer, which will increase time cost for later operations. What I really want is a fixed-length priority queue, so when push() a new element into the queue, some elements with larger values will be automatically deleted. How can I implement this? Or is there any standard container I can use? Thank you.

What about using std::set? It stores elements in order, and if it grows above k elements you can just remove the largest one (in constant time). Each insertion is O(log k).

One way with priority_queue but changing the ordering (ascending to descending), and if it grows above k elements, remove the top element (which is the farther).

Related

Is the overhead for utilizing a vector with insert() as a priority queue relative to utilizing a heap? (c++)

I am currently working on a project where I implemented a vector of structure pointers to use as a priority queue. I use a for loop to determine position in the vector (if not less than the back) and then use insert() to place the struct pointer in position in queue. I am using back() as the front of queue so I can maintain the pop functionality of the vector.
I was just trying to determine if using the heap library instead would add a speed increase, as this project is dependent on time. Can provide code if you'd like, size of the heap/vector may increase tremendously as this is a tower of hanoi A* search algorithm.
Figured I would ask for future knowledge as well to save me some debug breakpoint shuffles if anyone knew offhand.

I've made a simple benchmark of first inserting N random ints into a priority queue and then popping N top elements.
Expectedly, sorted std::vector with linear search wins when the queue size is small, and std::priority_queue, which is implemented as a max-heap with O(log N) worst-case insertion time, wins when the queue size is large.
Benchmark code can be found here.

Structure like priority queue but with something like lower bound

I want a structure to store (for example) numbers where I can insert and remove elements, my structure remains sorted always (like a priority queue) BUT with the possibility of knowing where is a given number, and every operation in logarithmic time.
Maybe with lower_bound, upper_bound, or just a binary search, but in priority_queue what blocks me to do binary search is that I cannot access the elements with an index, only the first one.

I think you’re looking for an order statistics tree, an augmented BST that supports all the regular BST operations in time O(log n), along with two others:
rank(elem): return which index elem would occupy in the sorted sequence.
index(k): given an index k, return the element at that index in the sorted sequence.
The two above operations run in O(log n) time, making them extremely fast.
You can treat an order statistics tree as a priority queue. Insertions work as normal BST insertions, and to extract the lowest/highest element you just remove the smallest/greatest element from the tree, which you can do in time O(log n) by just walking down the left or right spines of the tree.

A priority queue does not keep things in sorted order. At least, not typically. A priority queue makes it possible for you to quickly obtain the next item in the sequence. But you can't efficiently access, say, the 5th item in the queue.
I know of three different ways to build a priority queue in which you can efficiently access items by key:
Use a balanced binary search tree to implement the queue. Although all operations are O(log n), typical running time is slower than a binary heap.
Implement the heap priority queue as a skip list. This is a good option. I've seen some people report that a skip list priority queue outperforms a binary heap. A search for [C++ skip list] will return you lots of implementations.
What I call an indexed binary heap also works. Basically, you marry a hash map or dictionary with a binary heap. The map is indexed by key, and its value contains the index of the item in the heap array. Such a thing is not difficult to build, and is quite effective.
Come to think of it, you can make an indexed version of any type of heap.
You have a number of options. I rather like the skip list, myself, but your mileage may vary.
The indexed binary heap, as I pointed out, is a hybrid data structure that maintains a dictionary (hash map) and a binary heap. Briefly how it works:
The dictionary key is the field that you use to look up an item that you put into the heap. The value is an integer: the index of that item in the heap.
The heap itself is a standard binary heap implemented in an array. The only difference is that every time you move an item from one place to another in the heap, you update its location in the dictionary. So, for example, if you swap two items, you have to swap not only the items themselves in the array, but also their positions as stored in the dictionary. For example:
heap is an array of string references
dict is a dictionary, keyed by string
swap (a, b)
{
// swap item at heap[a] with item at heap[b]
temp = heap[a]
heap[a] = heap[b]
heap[b] = temp
// update their positions in the dictionary
dict[heap[a]] = b
dict[heap[b]] = a
}
It's a pretty simple modification of a standard binary heap implementation. You just have to be careful to update the position every time you move an item.
You can also do this with node-based heaps like Pairing heap, Fibonacci heap, Skew heap, etc.

Randomly Access in a Priority Queue

How i can access/search randomly in a priority queue?. for example, if have a priority queue like q={5,4,3,2,1} for example, i want to access 3rd value directly which is 3, i could not do this,is there any process to access randomly in priority queue?

Most priority queue implementations, including the C++ std::priority_queue type, don't support random access. The idea behind a priority queue is to sacrifice random access for fast access to the smallest element.
Depending on what you're trying to do, there are a number of other approaches you could use. If you always want access to the third element in the queue (and not any other arbitrary positions), it's probably fast enough to just dequeue two elements, cache them, then dequeue the value you want and put the other two elements back.
If you want access to the kth-smallest element at any point in time, where k is larger, one option is to store two different priority queues: a reverse-sorted priority queue that holds k elements (call it the left queue) and a regular priority queue holding the remaining n-k elements (call it the right queue). To get the kth-smallest element, dequeue from the left queue (giving back the kth-smallest element), then dequeue an element from the right and enqueue into the left to get it back up to k total elements. To do an enqueue, check if the number is less than the top of the left queue. If so, dequeue from the left queue, enqueue the removed element into the right queue, then enqueue the original element into the left. Otherwise, enqueue into the right. This guarantees O(log n) runtimes for each operation.
If you need true random access to a sorted sequence, consider using an order statistics tree. This is an augmented binary search tree that supports O(log n) access to elements by index. You can use this to build a priority queue - the minimum element is always at index 0. The catch (of course there's a catch) is that it's hard to find a good implementation of one and the constant factors hidden in the O(log n) terms are much higher than in a standard binary heap.

To add to the answer of #templatetypedef:
You cannot combine random access of elements with a priority queue, unless you use a very inefficient priority queue. Here's a few options, depending on what you need:
1- An inefficient priority queue would be a std::vector that you keep sorted. Pushing an element means finding where it should be inserted, and move all subsequent elements forward. Popping an element would be simply reading and deleting the last element (back and pop_back, these are efficient). Random access is efficient as well, of course.
2- You could use a std::multiset (or std::multimap) instead of a priority queue. It is a tree structure that keeps things sorted. You can insert instead of push, then read and remove the first (or last) element using a begin (or rbegin) iterator with erase. Insertion and finding the first/last element are log(n) operations. The data structure allows for reading all elements in order, though it doesn't give random access.
3- You could hack your own version of std::priority_queue using a std::vector and the std::push_heap and std::pop_heap algorithms (together with the push_back and pop_back methods of the std::vector). You'd get the same efficient priority queue, but also random access to all elements of the priority_queue. They are not sorted, all you know is that the first element in your array is the top priority element, and that the other elements are stored in a way that the heap property is satisfied. If you only occasionally want to read all elements in order, you can use the function std::sort_heap to sort all elements in your array by priority. The function std::make_heap will return your array to its heap status.
Note that std::priority_queue uses a std::vector by default to store its data. It might be possible to reinterpret_cast the std::priority_queue to a std::vector, so you get random access to the other elements in the queue. But if it works on your implementation of the standard library, it might not on others, or on future versions of the same library, so I don't recommend you do this! It's safer to create your own heap class using the algorithms in the standard library as per #3 above.

Difference between std::set and std::priority_queue

Since both std::priority_queue and std::set (and std::multiset) are data containers that store elements and allow you to access them in an ordered fashion, and have same insertion complexity O(log n), what are the advantages of using one over the other (or, what kind of situations call for the one or the other?)?
While I know that the underlying structures are different, I am not as much interested in the difference in their implementation as I am in the comparison their performance and suitability for various uses.
Note: I know about the no-duplicates in a set. That's why I also mentioned std::multiset since it has the exactly same behavior as the std::set but can be used where the data stored is allowed to compare as equal elements. So please, don't comment on single/multiple keys issue.

A priority queue only gives you access to one element in sorted order -- i.e., you can get the highest priority item, and when you remove that, you can get the next highest priority, and so on. A priority queue also allows duplicate elements, so it's more like a multiset than a set. [Edit: As #Tadeusz Kopec pointed out, building a heap is also linear on the number of items in the heap, where building a set is O(N log N) unless it's being built from a sequence that's already ordered (in which case it is also linear).]
A set allows you full access in sorted order, so you can, for example, find two elements somewhere in the middle of the set, then traverse in order from one to the other.

std::priority_queue allows to do the following:
Insert an element O(log n)
Get the smallest element O(1)
Erase the smallest element O(log n)
while std::set has more possibilities:
Insert any element O(log n) and the constant is greater than in std::priority_queue
Find any element O(log n)
Find an element, >= than the one your are looking for O(log n) (lower_bound)
Erase any element O(log n)
Erase any element by its iterator O(1)
Move to previous/next element in sorted order O(1)
Get the smallest element O(1)
Get the largest element O(1)

set/multiset are generally backed by a binary tree. http://en.wikipedia.org/wiki/Binary_tree
priority_queue is generally backed by a heap. http://en.wikipedia.org/wiki/Heap_(data_structure)
So the question is really when should you use a binary tree instead of a heap?
Both structures are laid out in a tree, however the rules about the relationship between anscestors are different.
We will call the positions P for parent, L for left child, and R for right child.
In a binary tree L < P < R.
In a heap P < L and P < R
So binary trees sort "sideways" and heaps sort "upwards".
So if we look at this as a triangle than in the binary tree L,P,R are completely sorted, whereas in the heap the relationship between L and R is unknown (only their relationship to P).
This has the following effects:
If you have an unsorted array and want to turn it into a binary tree it takes O(nlogn) time. If you want to turn it into a heap it only takes O(n) time, (as it just compares to find the extreme element)
Heaps are more efficient if you only need the extreme element (lowest or highest by some comparison function). Heaps only do the comparisons (lazily) necessary to determine the extreme element.
Binary trees perform the comparisons necessary to order the entire collection, and keep the entire collection sorted all-the-time.
Heaps have constant-time lookup (peek) of lowest element, binary trees have logarithmic time lookup of lowest element.

Since both std::priority_queue and std::set (and std::multiset) are data containers that store elements and allow you to access them in an ordered fashion, and have same insertion complexity O(log n), what are the advantages of using one over the other (or, what kind of situations call for the one or the other?)?
Even though insert and erase operations for both containers have the same complexity O(log n), these operations for std::set are slower than for std::priority_queue. That's because std::set makes many memory allocations. Every element of std::set is stored at its own allocation. std::priority_queue (with underlying std::vector container by default) uses single allocation to store all elements. On other hand std::priority_queue uses many swap operations on its elements whereas std::set uses just pointers swapping. So if swapping is very slow operation for element type, using std::set may be more efficient. Moreover element may be non-swappable at all.
Memory overhead for std::set is much bigger also because it has to store many pointers between its nodes.

Which STL Container?

I need a container (not necessarily a STL container) which let me do the following easily:
Insertion and removal of elements at any position
Accessing elements by their index
Iterate over the elements in any order
I used std::list, but it won't let me insert at any position (it does, but for that I'll have to iterate over all elements and then insert at the position I want, which is slow, as the list may be huge). So can you recommend any efficient solution?

It's not completely clear to me what you mean by "Iterate over the elements in any order" - does this mean you don't care about the order, as long as you can iterate, or that you want to be able to iterate using arbitrarily defined criteria? These are very different conditions!
Assuming you meant iteration order doesn't matter, several possible containers come to mind:
std::map [a red-black tree, typically]
Insertion, removal, and access are O(log(n))
Iteration is ordered by index
hash_map or std::tr1::unordered_map [a hash table]
Insertion, removal, and access are all (approx) O(1)
Iteration is 'random'

This diagram will help you a lot, I think so.

Either a vector or a deque will suit. vector will provide faster accesses, but deque will provide faster instertions and removals.

Well, you can't have all of those in constant time, unfortunately. Decide if you are going to do more insertions or reads, and base your decision on that.
For example, a vector will let you access any element by index in constant time, iterate over the elements in linear time (all containers should allow this), but insertion and removal takes linear time (slower than a list).

You can try std::deque, but it will not provide the constant time removal of elements in middle but it supports
random access to elements
constant time insertion and removal
of elements at the end of the
sequence
linear time insertion and removal of
elements in the middle.

A vector. When you erase any item, copy the last item over one to be erased (or swap them, whichever is faster) and pop_back. To insert at a position (but why should you, if the order doesn't matter!?), push_back the item at that position and overwrite (or swap) with item to be inserted.

By "iterating over the elements in any order", do you mean you need support for both forward and backwards by index, or do you mean order doesn't matter?
You want a special tree called a unsorted counted tree. This allows O(log(n)) indexed insertion, O(log(n)) indexed removal, and O(log(n)) indexed lookup. It also allows O(n) iteration in either the forward or reverse direction. One example where these are used is text editors, where each line of text in the editor is a node.
Here are some references:
Counted B-Trees
Rope (computer science)

An order statistic tree might be useful here. It's basically just a normal tree, except that every node in the tree includes a count of the nodes in its left sub-tree. This supports all the basic operations with no worse than logarithmic complexity. During insertion, anytime you insert an item in a left sub-tree, you increment the node's count. During deletion, anytime you delete from the left sub-tree, you decrement the node's count. To index to node N, you start from the root. The root has a count of nodes in its left sub-tree, so you check whether N is less than, equal to, or greater than the count for the root. If it's less, you search in the left subtree in the same way. If it's greater, you descend the right sub-tree, add the root's count to that node's count, and compare that to N. Continue until A) you've found the correct node, or B) you've determined that there are fewer than N items in the tree.

(source: adrinael.net)
But it sounds like you're looking for a single container with the following properties:
All the best benefits of various containers
None of their ensuing downsides
And that's impossible. One benefit causes a detriment. Choosing a container is about compromise.

std::vector
[padding for "15 chars" here]

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js