Efficiency of the STL priority_queue - c++

I have an application (C++) that I think would be well served by an STL priority_queue. The documentation says:
Priority_queue is a container adaptor, meaning that it is implemented on top of some underlying container type. By default that underlying type is vector, but a different type may be selected explicitly.
and
Priority queues are a standard concept, and can be implemented in many different ways; this implementation uses heaps.
I had previously assumed that top() is O(1) and that push() would be O(log n) (the two reasons I chose the priority_queue in the first place), but the documentation neither confirms nor denies this assumption.
Digging deeper, the docs for the Sequence concept say:
The complexities of single-element insert and erase are sequence dependent.
The priority_queue uses a vector (by default) as a heap, which:
... supports random access to elements, constant time insertion and removal of elements at the end, and linear time insertion and removal of elements at the beginning or in the middle.
I'm inferring that, using the default priority_queue, top() is O(1) and push() is O(n).
Question 1: Is this correct? (top() access is O(1) and push() is O(n)?)
Question 2: Would I be able to achieve O(logn) efficiency on push() if I used a set (or multiset) instead of a vector for the implementation of the priority_queue? What would the consequences be of doing this? What other operations would suffer as a consequence?
N.B.: I'm worried about time efficiency here, not space.

The priority queue adaptor uses the standard library heap algorithms to build and access the queue - it's the complexity of those algorithms you should be looking up in the documentation.
The top() operation is obviously O(1) but presumably you want to pop() the heap after calling it which (according to Josuttis) is O(2*log(N)) and push() is O(log(N)) - same source.
And from the C++ Standard, 25.6.3.1, push_heap :
Complexity: At most log(last - first) comparisons.
and pop_heap:
Complexity: At most 2 * log(last - first) comparisons.
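One rough way to see these bounds in practice is to count how often the comparator is called. The following is only a sketch: the helper names count_push_comparisons and count_pop_comparisons are made up for illustration, and the exact counts will vary between standard library implementations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: builds a max-heap holding 0..n-1, pushes one more
// element that is larger than all the others (so it travels the full height),
// and returns how many comparisons std::push_heap made.
std::size_t count_push_comparisons(std::size_t n) {
    assert(n > 0);
    std::vector<int> heap(n);
    std::iota(heap.begin(), heap.end(), 0);
    std::make_heap(heap.begin(), heap.end());

    std::size_t comparisons = 0;
    auto counting_less = [&comparisons](int a, int b) {
        ++comparisons;
        return a < b;
    };
    heap.push_back(static_cast<int>(n));  // the new maximum
    std::push_heap(heap.begin(), heap.end(), counting_less);
    return comparisons;
}

// Hypothetical helper: same setup, but counts the comparisons made by a
// single std::pop_heap on a heap of n elements.
std::size_t count_pop_comparisons(std::size_t n) {
    assert(n > 0);
    std::vector<int> heap(n);
    std::iota(heap.begin(), heap.end(), 0);
    std::make_heap(heap.begin(), heap.end());

    std::size_t comparisons = 0;
    auto counting_less = [&comparisons](int a, int b) {
        ++comparisons;
        return a < b;
    };
    std::pop_heap(heap.begin(), heap.end(), counting_less);
    heap.pop_back();
    return comparisons;
}
```

For a heap of 65536 elements, both counts should come out in the tens rather than the tens of thousands, which is what logarithmic behaviour looks like.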

top() is O(1), as it just returns the element at the front.
push() consists of two steps:
insert into the vector (push_back): O(1) amortized
push_heap: at most log(n) comparisons, i.e. O(log n)
So the complexity of push() is O(log n).
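To make the two steps concrete, here is a minimal sketch of a queue that does by hand what the adaptor is specified to do. SketchQueue and drain are hypothetical names used only for this illustration; this is not the library's actual source.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical minimal max-priority queue, written only to illustrate the
// two-step push and pop.
class SketchQueue {
public:
    void push(int value) {
        data_.push_back(value);                      // amortized O(1)
        std::push_heap(data_.begin(), data_.end());  // O(log n) comparisons
    }
    int top() const {
        assert(!data_.empty());
        return data_.front();                        // O(1)
    }
    void pop() {
        assert(!data_.empty());
        std::pop_heap(data_.begin(), data_.end());   // moves the max to the back
        data_.pop_back();                            // O(1)
    }
    bool empty() const { return data_.empty(); }

private:
    std::vector<int> data_;
};

// Empties a copy of the queue and returns the elements in the order that
// top()/pop() yields them.
std::vector<int> drain(SketchQueue q) {
    std::vector<int> out;
    while (!q.empty()) {
        out.push_back(q.top());
        q.pop();
    }
    return out;
}
```

Pushing 5, 9, 3, 10, 12, 4 and draining should give the values back in descending order, the same order a default std::priority_queue<int> would produce.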

No. This is not correct. top() is O(1) and push() is O(log n). Read note 2 in the documentation to see that this adapter does not allow iterating through the vector. Neil is correct about pop(), but still this allows working with the heap doing insertions and removals in O(log n) time.

If the underlying data structure is a heap, top() will be constant time, and push (EDIT: and pop) will be logarithmic (like you are saying). The vector is just used to store these things like this:
HEAP:
1
2 3
8 12 11 9
VECTOR (used to store)
1 2 3 8 12 11 9
You can think of it this way: with positions numbered from 1, the children of the element at position i are at positions 2i and 2i+1.
They use the vector because it stores the data contiguously (which is much more efficient and cache-friendly than separately allocated nodes).
Regardless of how it is stored, a heap should always be implemented (especially by the gods who made the standard library) so that top is constant and push and pop are logarithmic.

The underlying data structure of the C++ STL priority_queue is a heap. A heap is a non-linear ADT based on a complete binary tree, and the complete binary tree is implemented on top of a vector (or array) container.
Example input data: 5 9 3 10 12 4.
The heap (considering a min heap) would be:
[3]
[9] [4]
[10] [12] [5]
Now we store this min heap in a vector:
[3][9][4][10][12][5].
Using the formulas for the element at 0-based index n:
Parent: floor((n-1)/2)
Left child: 2n+1
Right child: 2n+2
Hence the time complexities are:
top() = O(1): get the element from the root node.
pop() = O(log n): deleting the root node may violate the heap order, and restoring it takes at most O(log n) time (an element might move down the full height of the tree).
push() = O(log n): insertion may also violate the heap order, and restoring it takes at most O(log n) time (an element might move up from a leaf node to the root).
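The index arithmetic can be checked with a few lines of code. This is a small sketch assuming 0-based indices; the function names parent, left_child, right_child, is_min_heap and is_min_heap_std are chosen here for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// 0-based index arithmetic for a binary heap stored in an array.
std::size_t parent(std::size_t i) {
    assert(i > 0);  // the root has no parent
    return (i - 1) / 2;
}
std::size_t left_child(std::size_t i) { return 2 * i + 1; }
std::size_t right_child(std::size_t i) { return 2 * i + 2; }

// Checks the min-heap property by hand: every non-root element must be
// greater than or equal to its parent.
bool is_min_heap(const std::vector<int>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        if (v[i] < v[parent(i)]) return false;
    }
    return true;
}

// The same check using the standard library: with std::greater as the
// comparator, std::is_heap tests for a min-heap.
bool is_min_heap_std(const std::vector<int>& v) {
    return std::is_heap(v.begin(), v.end(), std::greater<int>());
}
```

Applied to the vector 3 9 4 10 12 5 from the example, both checks should agree that it is a valid min-heap, and parent(4) gives 1, the index of the 9 that sits above the 12.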

Q1: I didn't look in the standard, but AFAIK, using vector (or deque btw), the complexity would be O(1) for top(), O(log n) for push() and pop(). I once compared std::priority_queue with my own heap with O(1) push() and top() and O(log n) pop() and it certainly wasn't as slow as O(n).
Q2: set is not usable as the underlying container for priority_queue (it is not a Sequence), but it would be possible to use a set to implement a priority queue with O(log n) push() and pop(). However, this probably wouldn't outperform the std::priority_queue over std::vector implementation.
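For what it's worth, a set-based priority queue could look roughly like the sketch below. SetQueue is a hypothetical wrapper invented for this example, and it uses std::multiset so that duplicate priorities are kept.

```cpp
#include <cassert>
#include <iterator>
#include <set>

// Hypothetical wrapper: a max-priority queue built on std::multiset.
class SetQueue {
public:
    void push(int value) { items_.insert(value); }  // O(log n)
    int top() const {
        assert(!items_.empty());
        return *items_.rbegin();                    // largest element
    }
    void pop() {
        assert(!items_.empty());
        // Erase by iterator so that only one copy of a duplicate is removed.
        items_.erase(std::prev(items_.end()));
    }
    bool empty() const { return items_.empty(); }

private:
    std::multiset<int> items_;
};
```

Each element lives in its own tree node here, so every push() involves an allocation, which is one reason to expect the vector-backed heap to be faster in practice.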

Related

Why do both insertion and extraction into/from a std::priority_queue take logarithmic time?

A [std::priority_queue] is a container adaptor that provides constant time lookup of the largest (by default) element, at the expense of logarithmic insertion and extraction.
Why is that? I think the sorting either happens on insertion or extraction.
For example, if the sorting happens on insertion and the internal container remains sorted, wouldn't the extraction be able to happen in constant time? The top element to be removed is known and so is the next smaller one.
However, both std::priority_queue::push and std::priority_queue::pop mention in their complexity descriptions:
Complexity
Logarithmic number of comparisons
Why would both have to perform comparisons? With an internal container that stays sorted, extraction should be easy or vice versa, with sorting upon extraction, insertion should be easy.
I guess my assumption about when and how the sorting happens (or if there's any sorting happening at all) is just wrong. Could somebody please shed some light on this?
For example, if the sorting happens on insertion and the internal container remains sorted, wouldn't the extraction be able to happen in constant time?
Extract could happen in constant time, but insertion would become O(n). You'd have to search for the place in the list to insert the new element and then shift all the other elements. O(1) extraction and O(n) insertion might be good for some use-cases, but not the problem that priority_queue is trying to solve.
If sorting, on the other hand, happened on extraction, then you'd have O(n lg n) extraction and O(1) insertion. Which, again, is good for some use-cases, but that's not what priority_queue does.
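To illustrate the first alternative, here is a sketch of a queue that keeps its storage sorted at all times. SortedVectorQueue is a hypothetical class made up for this comparison; it is not how std::priority_queue works.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical queue that keeps its storage sorted in ascending order, so the
// largest element is always at the back.
class SortedVectorQueue {
public:
    void push(int value) {
        // Finding the spot is O(log n), but inserting there shifts every
        // later element over by one slot, which is O(n).
        auto pos = std::upper_bound(data_.begin(), data_.end(), value);
        data_.insert(pos, value);
    }
    int top() const {
        assert(!data_.empty());
        return data_.back();  // O(1)
    }
    void pop() {
        assert(!data_.empty());
        data_.pop_back();     // O(1)
    }
    const std::vector<int>& storage() const { return data_; }

private:
    std::vector<int> data_;
};
```

Extraction is as cheap as it gets, but every push() may have to move most of the vector, which is the trade-off described above.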
Rather than sorting elements, std::priority_queue stores its elements† in a heap, which by construction has O(lg n) insertion and extraction. The structure is a tree, and insertion/extraction simply maintain the tree invariant. For some problems (like, say, search), in cases where we need to insert and extract many nodes, having O(lg n) for both operations is far superior to O(n)/O(1).
As an example (illustrated with images from Wikipedia in the original), inserting the element 15 into the heap would initially place it in the first free position at the bottom of the tree,
then swap it with the 8 (because the heap invariant is broken),
then finally swap it with the 11.
In array form, the initial heap would be stored as:
[11, 5, 8, 3, 4]
and we would end up at:
[15, 5, 11, 3, 4, 8]
Extraction is just the reverse operation - bubbling down instead of bubbling up. As you see, there's no explicit "sorting" going on. We're not even touching most of the elements most of the time.
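The bubbling-up step can be written out in a few lines. This is a hand-rolled sketch for illustration (heap_insert is a made-up name), assuming a 0-based max-heap stored in a vector; in real code std::push_heap does this job.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hand-written sift-up insertion into a 0-based max-heap.
void heap_insert(std::vector<int>& heap, int value) {
    heap.push_back(value);            // place the new element at the bottom
    std::size_t i = heap.size() - 1;
    while (i > 0) {
        std::size_t p = (i - 1) / 2;  // index of the parent
        if (heap[p] >= heap[i]) break;
        std::swap(heap[p], heap[i]);  // bubble up one level
        i = p;
    }
    // The new element is now either the root or no larger than its parent.
    assert(i == 0 || heap[(i - 1) / 2] >= heap[i]);
}
```

Starting from [11, 5, 8, 3, 4] and inserting 15, the loop swaps with the 8 and then with the 11, leaving [15, 5, 11, 3, 4, 8]. Only two of the five existing elements are touched.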
†std::priority_queue is a container adapter, but the container you provide should be a random access container with O(1) complexities for indexing, push_back, pop_back, front, back, etc. So the choice of container (unless you make a bad one) does not affect the overall complexity of priority_queue's operations.
The pop operation removes the top element. There are several ways to implement a priority queue, but in the heap-based ones deleting the top is logarithmic. From wiki - Priority queue.

Looking for clarification on Hashing and BST functions and Big O notation

So I am trying to understand the data types and Big O notation of some functions for a BST and Hashing.
So first off, how are BSTs and Hashing stored? Are BSTs usually arrays, or are they linked lists because they have to point to their left and right leaves?
What about Hashing? I've had the most trouble finding clear information regarding Hashing in terms of computation-based searching. I understand that Hashing is best implemented with an array of chains. Is this for faster searching or to decrease overhead on creating the allocated data type?
This following question might be just bad interpretation on my part, but what makes a traversal function different from a search function in BSTs, Hashing, and STL containers?
Is traversal Big O(N) for BSTS because you're actually visiting each node/data member, whereas search() can reduce its time by eliminating half the searching field?
And somewhat related, why is it that in the STL, list.insert() and list.erase() have a Big O(1) whereas the vector and deque counterparts are O(N)?
Lastly, why would a vector.push_back() be O(N)? I thought the function could be done something along the lines of this like O(1), but I've come across text saying it is O(N):
vector<int> vic(2,3);
vector<int>::iterator IT = vic.end();
//wanna insert 4 to the end using push_back
IT++;
(*IT) = 4;
hopefully this works. I'm a bit tired but I would love any explanations why something similar to that wouldn't be efficient or plausible. Thanks
BSTs (ordered binary trees) are a series of nodes where a parent node points to its two children, which in turn point to their own (at most two) children, etc. They're traversed in O(n) time because traversal visits every node. Lookups take O(log n) time in a balanced tree. Inserts also take O(log n) time, because the insertion point has to be found first; once it is found, the insert itself is cheap because it doesn't need to move a bunch of existing nodes: just allocate some memory and re-aim the pointers. :)
Hashes (unordered_map) use a hashing algorithm to assign elements to buckets. Usually buckets contain a linked list so that hash collisions just result in several elements in the same bucket. Traversal will again be O(n), as expected. Lookups and inserts will be amortized O(1). Amortized means that on average, O(1), though an individual insert might result in a rehashing (redistribution of buckets to minimize collisions). But over time the average complexity is O(1). Note, however, that big-O notation doesn't really deal with the "constant" aspect; only order of growth. The constant overhead in the hashing algorithms can be high enough that for some data-sets the O(log n) binary trees outperform the hashes. Nevertheless, the hash's advantage is that its operations are constant time-complexity.
Search functions take advantage (in the case of binary trees) of the notion of "order"; a search through a BST has the same characteristics as a basic binary search over an ordered array. O(log n) growth. Hashes don't really "search". They compute the bucket, and then quickly run through the collisions to find the target. That's why lookups are constant time.
As for insert and erase: in array-based sequence containers, all elements that come after the target have to be bumped over to the right. Move semantics in C++11 can improve upon the performance, but the operation is still O(n). For linked sequence containers (list, forward_list), insertion and erasing just means fiddling with some pointers internally. Once you have an iterator to the position, it's a constant-time process.
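A short sketch of the difference, with two made-up helper functions (list_insert_at and vector_insert_at). Note that for the list, walking to the position is itself linear; only the insert at a known iterator is O(1).

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <vector>

// Inserts `value` before the element at position `index` in a std::list.
// Reaching the position is a linear walk; the insert itself only relinks
// neighbouring nodes.
void list_insert_at(std::list<int>& items, std::size_t index, int value) {
    assert(index <= items.size());
    auto pos = std::next(items.begin(), static_cast<std::ptrdiff_t>(index));
    items.insert(pos, value);  // O(1) once `pos` is known
}

// Inserts `value` before the element at position `index` in a std::vector.
// Reaching the position is O(1), but every element from `index` onwards has
// to be shifted one slot to the right, which is O(n).
void vector_insert_at(std::vector<int>& items, std::size_t index, int value) {
    assert(index <= items.size());
    items.insert(items.begin() + static_cast<std::ptrdiff_t>(index), value);
}
```

Both helpers produce the same sequence; what differs is where the linear cost is paid.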
push_back() will be O(1) until you exceed the existing allocated capacity of the vector. Once the capacity is exceeded, a new allocation takes place to produce a container that is large enough to accept more elements. All the elements need to then be moved into the larger memory region, which is an O(n) process. I believe Move Semantics can help here as well, but it's still going to be O(n). Vectors and strings are implemented such that as they allocate space for a growing data set, they allocate more than they need, in anticipation of additional growth. This is an efficiency safeguard; it means that the typical push_back() won't trigger a new allocation and move of the entire data set into a larger container. But eventually after enough push_backs, the limit will be reached, and the vector's elements will be copied into a larger container, which again has some extra headroom left over for more efficient push_backs.
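You can watch the reallocation behaviour by tracking capacity() while appending. This is just a sketch with a made-up helper name (count_reallocations); the growth factor is implementation-defined, so the exact count will differ between standard libraries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many times a vector's capacity changes while `n` elements are
// appended one at a time with push_back.
std::size_t count_reallocations(std::size_t n) {
    std::vector<int> v;
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {  // storage was replaced
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    assert(v.size() == n);
    return reallocations;
}
```

With the geometric growth that common implementations use, appending 100000 elements triggers only a few dozen reallocations, which is why the cost averages out to O(1) per push_back.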
Traversal refers to visiting every node, whereas search is only to find a particular node, so your intuition is spot on there. O(N) complexity because you need to visit N nodes.
std::vector::insert is for inserting in the middle, and it involves moving all subsequent elements over by one slot in order to make room for the element being inserted, hence O(N). A linked list doesn't have this issue, hence O(1). Similar logic applies to erase. deque properties are similar to vector.
std::vector::push_back is an O(1) operation for the most part; it only deviates if the capacity is exceeded and a reallocation plus copy is needed.

Priority Queue - Binary Heap

I'm trying to implement a priority queue as a sorted array backed minimum binary heap. I'm trying to get the update_key function to run in logarithmic time, but to do this I have to know the position of the item in the array. Is there any way to do this without the use of a map? If so, how? Thank you
If you really want to be able to change the key of an arbitrary element, a heap is not the best choice of data structure. What it gives you is the combination of:
1) compact representation (no pointers, just an array and an implicit indexing scheme)
2) logarithmic insertion, rebalancing
3) logarithmic removal of the smallest (largest) element
4) O(1) access to the value of the smallest (largest) element
A side benefit of 1. is that the lack of pointers means you do substantially fewer calls to malloc/free (new/delete).
A map (represented in the standard library as a balanced binary tree) gives you the middle two of these, adding in
logarithmic find() on any key.
So while you could attach another data structure to your heap, storing pointers in the heap and then making the comparison operator dereference through the pointer, you'd pretty soon find yourself with the complexity in time and space of just using a map in the first place.
Your find key function should operate in log(n) time. Your updating (changing the key) should be constant time. Your remove function should run in log(n) time. Your insert function should be log(n) time.
If these assumptions are true try this:
1) Find your item in your heap (IE: binary search, since it is a sorted array).
2) Update your key (you're just changing a value, constant time)
3) Remove the item from the heap log(n) to reheapify.
4) Insert your item into the heap log(n).
So, you'd have log(n) + 1 + log(n) + log(n) which reduces to log(n).
Note: this is amortized, because if you have to realloc your array, etc... that adds overhead. But you shouldn't do that very often anyway.
That's the tradeoff of the array-backed heap: you get excellent memory use (good locality and minimal overhead), but you lose track of the elements. To solve it, you have to add back some overhead.
One solution would be this. The heap contains objects of type C*. C is a class with an int member heap_index, which is the index of the object in the heap array. Whenever you move an element inside the heap array, you'll have to update its heap_index to set it to the new index.
Update_key (as well as removal of an arbitrary element) is then log(n) time because it takes constant time to find the element (via heap_index), and log(n) time to bubble it into the correct position.
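A minimal sketch of that idea follows. Item stands in for the class C described above, and IndexedMinHeap and its member names are made up for this example; the heap stores non-owning pointers and keeps each item's heap_index in sync on every move.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element type that remembers where it currently sits in the heap array.
struct Item {
    int key;
    std::size_t heap_index;
};

// Min-heap of Item pointers; the heap does not own the items.
class IndexedMinHeap {
public:
    void push(Item* item) {
        item->heap_index = slots_.size();
        slots_.push_back(item);
        sift_up(item->heap_index);
    }
    Item* top() const {
        assert(!slots_.empty());
        return slots_.front();
    }
    void pop() {
        assert(!slots_.empty());
        place(0, slots_.back());  // move the last item to the root
        slots_.pop_back();
        if (!slots_.empty()) sift_down(0);
    }
    // O(1) to locate the item via heap_index, O(log n) to restore heap order.
    void update_key(Item* item, int new_key) {
        assert(item->heap_index < slots_.size());
        assert(slots_[item->heap_index] == item);
        item->key = new_key;
        sift_up(item->heap_index);
        sift_down(item->heap_index);
    }

private:
    // Every move goes through place() so that heap_index stays correct.
    void place(std::size_t i, Item* item) {
        slots_[i] = item;
        item->heap_index = i;
    }
    void swap_slots(std::size_t a, std::size_t b) {
        Item* tmp = slots_[a];
        place(a, slots_[b]);
        place(b, tmp);
    }
    void sift_up(std::size_t i) {
        while (i > 0) {
            std::size_t p = (i - 1) / 2;
            if (slots_[p]->key <= slots_[i]->key) break;
            swap_slots(p, i);
            i = p;
        }
    }
    void sift_down(std::size_t i) {
        for (;;) {
            std::size_t smallest = i;
            std::size_t l = 2 * i + 1;
            std::size_t r = 2 * i + 2;
            if (l < slots_.size() && slots_[l]->key < slots_[smallest]->key) smallest = l;
            if (r < slots_.size() && slots_[r]->key < slots_[smallest]->key) smallest = r;
            if (smallest == i) break;
            swap_slots(i, smallest);
            i = smallest;
        }
    }
    std::vector<Item*> slots_;
};
```

The caller keeps the Item objects alive and should read top() before calling pop(), since a popped item's heap_index is no longer meaningful. The extra field per element is the overhead mentioned above.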

Difference between std::set and std::priority_queue

Since both std::priority_queue and std::set (and std::multiset) are data containers that store elements and allow you to access them in an ordered fashion, and have same insertion complexity O(log n), what are the advantages of using one over the other (or, what kind of situations call for the one or the other?)?
While I know that the underlying structures are different, I am not as much interested in the difference in their implementation as I am in the comparison of their performance and suitability for various uses.
Note: I know about the no-duplicates in a set. That's why I also mentioned std::multiset since it has the exactly same behavior as the std::set but can be used where the data stored is allowed to compare as equal elements. So please, don't comment on single/multiple keys issue.
A priority queue only gives you access to one element in sorted order -- i.e., you can get the highest priority item, and when you remove that, you can get the next highest priority, and so on. A priority queue also allows duplicate elements, so it's more like a multiset than a set. [Edit: As @Tadeusz Kopec pointed out, building a heap is also linear on the number of items in the heap, where building a set is O(N log N) unless it's being built from a sequence that's already ordered (in which case it is also linear).]
A set allows you full access in sorted order, so you can, for example, find two elements somewhere in the middle of the set, then traverse in order from one to the other.
std::priority_queue allows you to do the following:
Insert an element O(log n)
Get the smallest element O(1)
Erase the smallest element O(log n)
while std::set has more possibilities:
Insert any element O(log n) and the constant is greater than in std::priority_queue
Find any element O(log n)
Find the first element >= the one you are looking for O(log n) (lower_bound)
Erase any element O(log n)
Erase any element by its iterator O(1)
Move to previous/next element in sorted order O(1)
Get the smallest element O(1)
Get the largest element O(1)
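As an illustration of what the extra operations buy you, here is a sketch of a range query that a priority_queue cannot answer without being emptied. The function name range_query is made up for this example.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Returns every element of `items` in the half-open range [low, high), in
// sorted order, by locating `low` with lower_bound and walking forward.
std::vector<int> range_query(const std::multiset<int>& items, int low, int high) {
    assert(low <= high);
    std::vector<int> out;
    for (auto it = items.lower_bound(low); it != items.end() && *it < high; ++it) {
        out.push_back(*it);
    }
    return out;
}
```

On a multiset holding 1 4 4 7 9 12, asking for [4, 10) yields 4 4 7 9. Erasing the 7 through an iterator obtained from find() and repeating the query yields 4 4 9, while *begin() and *rbegin() still give the smallest and largest elements directly.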
set/multiset are generally backed by a binary tree. http://en.wikipedia.org/wiki/Binary_tree
priority_queue is generally backed by a heap. http://en.wikipedia.org/wiki/Heap_(data_structure)
So the question is really when should you use a binary tree instead of a heap?
Both structures are laid out in a tree, however the rules about the relationship between a parent and its children are different.
We will call the positions P for parent, L for left child, and R for right child.
In a binary tree L < P < R.
In a heap P < L and P < R
So binary trees sort "sideways" and heaps sort "upwards".
So if we look at this as a triangle, then in the binary tree L, P, R are completely sorted, whereas in the heap the relationship between L and R is unknown (only their relationship to P is known).
This has the following effects:
If you have an unsorted array and want to turn it into a binary tree it takes O(nlogn) time. If you want to turn it into a heap it only takes O(n) time, (as it just compares to find the extreme element)
Heaps are more efficient if you only need the extreme element (lowest or highest by some comparison function). Heaps only do the comparisons (lazily) necessary to determine the extreme element.
Binary trees perform the comparisons necessary to order the entire collection, and keep the entire collection sorted all-the-time.
Heaps have constant-time lookup (peek) of lowest element, binary trees have logarithmic time lookup of lowest element.
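The linear-time heap construction can be checked by counting comparisons in std::make_heap, for which the Standard specifies at most 3N comparisons. The helper below (count_make_heap_comparisons, a made-up name) is only a sketch, and the exact count depends on the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Counts the comparisons std::make_heap performs when turning the ascending
// sequence 0..n-1 into a max-heap.
std::size_t count_make_heap_comparisons(std::size_t n) {
    std::vector<int> data(n);
    std::iota(data.begin(), data.end(), 0);

    std::size_t comparisons = 0;
    auto counting_less = [&comparisons](int a, int b) {
        ++comparisons;
        return a < b;
    };
    std::make_heap(data.begin(), data.end(), counting_less);
    assert(std::is_heap(data.begin(), data.end()));  // sanity check, not counted
    return comparisons;
}
```

For 100000 elements the count stays within 300000, whereas inserting the same elements one by one into a set costs on the order of N log N comparisons. The iterator-range constructor of std::priority_queue is specified in terms of make_heap, so it benefits from the same bound.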
Since both std::priority_queue and std::set (and std::multiset) are data containers that store elements and allow you to access them in an ordered fashion, and have same insertion complexity O(log n), what are the advantages of using one over the other (or, what kind of situations call for the one or the other?)?
Even though insert and erase operations for both containers have the same complexity O(log n), these operations are slower for std::set than for std::priority_queue. That's because std::set makes many memory allocations: every element of a std::set is stored in its own allocation. std::priority_queue (with the underlying std::vector container by default) uses a single allocation to store all elements. On the other hand, std::priority_queue performs many swap operations on its elements, whereas std::set just swaps pointers. So if swapping is a very slow operation for the element type, using std::set may be more efficient. Moreover, the element type may not be swappable at all.
Memory overhead for std::set is much bigger also because it has to store many pointers between its nodes.

Priority queue structure used?

While searching for some functions in the C++ standard library documentation I read that push and pop for priority queues need constant time.
http://www.cplusplus.com/reference/stl/priority_queue/push/
Constant (in the priority_queue). Although notice that push_heap operates in logarithmic time.
My question is what kind of data structure is used to maintain a priority queue with O(1) for push and pop ?
I assume you are referring to cplusplus.com's page.
Earlier on the page it says:
This member function effectively calls the member function push_back of the underlying container object, and then calls the push_heap algorithm to keep the heap property of priority_queues.
So, after the O(1) push, the data structure has lost its heap property invariant and will then always follow that with an O(log N) call to restore that property. In other words, the O(1) operation is only one part of the entire operation; the full operation is O(1) + O(log N) which is the same as O(log N).
I guess the reason they mention that is that priority queue is an adapter and they are trying to emphasize the difference between what the underlying container does vs what the adapter does.
The underlying storage for priority_queue can be a vector or a deque or anything similar that supports random access iterators. The storage is maintained as a heap, which is not O(N) for push, so I suspect you have read this wrong
Push and Pop run in Logarithmic time according to
http://www.cppreference.com/wiki/stl/priority_queue/pop
http://www.cppreference.com/wiki/stl/priority_queue/push
I'm skeptical about the O(1) claim... where did you see it?
In any case, here are some notes on gcc's implementation of a priority queue.
There is no such kind of heap. They have written that it calls push_heap, which is logarithmic, so it is O(log n).
The standard defines those members in terms of push_heap and pop_heap, which implies the complexity.
If what that reference says is true (it says top is also constant), why doesn't anybody implement general-purpose O(N) sorting using std::priority_queue?
On second thought, this is what the reference may be trying to say, in a very misleading way: the complexity is that of push_back (O(1)) + push_heap (O(log N)).