I am trying to implement my own heap with a method that removes any element (not only the min or max), but I can't solve one problem. To write that removal function I need pointers to the elements in my heap (to get O(log n) removal of a given element). But when I tried to do it this way:
vector<int*> ptr(n);
it of course did not work.
Maybe I should store another class or structure containing an int in my heap, but for now I would like to find a solution that keeps plain int (because I have already implemented it using int).
When you need to remove (or change the priority of) objects other than the root, a d-heap isn't necessarily the ideal data structure: the nodes keep changing their position and you need to keep track of the various moves. It is doable, however. To use a heap like this you would return a handle for each newly inserted object which identifies some sort of node that stays put. Since the d-heap algorithm relies on the tree being perfectly balanced, you effectively need to implement it using an array. Since these two requirements (using an array and having nodes stay put) are mutually exclusive, you need to do both and keep an index from the nodes into the array (so you can find the position of an object in the array) and a pointer from the array to the node (so you can update the node when the position changes). Almost certainly you don't want to move your nodes a lot, i.e. you would rather accept finding the proper direction to move a node by searching multiple children, i.e. you want to use d > 2.
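As a rough sketch of that bookkeeping (using a binary heap, i.e. d = 2, and int keys for brevity; the class and member names are made up for illustration, and handles are never reused):

#include <utility>
#include <vector>

// Sketch of a min-heap that hands out stable handles. heap_ stores handles;
// pos_[h] records where handle h currently sits in heap_, so both directions
// of the mapping (handle -> index, index -> handle) are kept in sync.
class HandleHeap {
public:
    int push(int key) {                       // returns a handle that stays valid
        int h = (int)key_.size();
        key_.push_back(key);
        pos_.push_back((int)heap_.size());
        heap_.push_back(h);
        sift_up(pos_[h]);
        return h;
    }
    void erase(int h) {                       // remove an arbitrary element in O(log n)
        int i = pos_[h];
        swap_nodes(i, (int)heap_.size() - 1); // move it to the end
        heap_.pop_back();
        if (i < (int)heap_.size()) { sift_down(i); sift_up(i); }
    }
    int top_key() const { return key_[heap_[0]]; }

private:
    std::vector<int> heap_, pos_, key_;
    void swap_nodes(int i, int j) {
        std::swap(heap_[i], heap_[j]);
        pos_[heap_[i]] = i;                   // keep the handle -> index map current
        pos_[heap_[j]] = j;
    }
    void sift_up(int i) {
        while (i > 0 && key_[heap_[i]] < key_[heap_[(i - 1) / 2]]) {
            swap_nodes(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
    }
    void sift_down(int i) {
        for (;;) {
            int best = i, l = 2 * i + 1, r = 2 * i + 2, n = (int)heap_.size();
            if (l < n && key_[heap_[l]] < key_[heap_[best]]) best = l;
            if (r < n && key_[heap_[r]] < key_[heap_[best]]) best = r;
            if (best == i) return;
            swap_nodes(i, best);
            i = best;
        }
    }
};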
There are alternative approaches to implementing a heap which are inherently node-based, in particular Fibonacci heaps, which for certain usage patterns yield a better amortized complexity than the usual O(log n). However, they are somewhat harder to implement, and the extra efficiency only pays off if you either need to change the priority of a node frequently or you have fairly large data sets.
A heap is a particular sort of data structure; the elements are stored in a binary tree, and there are well-established procedures for adding or removing elements. Many implementations use an array to hold the tree nodes, and removing an element involves moving O(log n) elements around. Normally, with that array layout, the children of the node at array location n are stored at locations 2n and 2n+1; element 0 is left empty.
This Wikipedia page does a fine job of explaining the algorithms.
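As a small illustration of the array layout just described (a sketch using the 1-based convention above, with slot 0 left unused):

// For the node stored at index i in the array:
int parent(int i) { return i / 2; }      // integer division
int left(int i)   { return 2 * i; }
int right(int i)  { return 2 * i + 1; }
// Removing an arbitrary element then means: swap it with the last element,
// shrink the array by one, and sift the moved element up or down to restore
// the heap property -- which touches O(log n) elements.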
I was going through Dijkstra's algorithm when I noticed that I could update keys in a heap (with n keys) in O(log n) time (last line in the pseudocode). How do I update keys in heaps in C++? Is there any method in priority_queue to do this, or do I have to write my own heap class to achieve updates in O(log n) like this?
Edit 1:
Clarifying my need - for a binary heap with n elements -
1) Should insert new values and find & pop the minimum value in O(log n)
2) Should update already present keys in O(log n)
I tried to come up with a way to implement this using make_heap, push_heap, pop_heap, and a custom function for update as John Ding suggested.
However, I am facing a problem in writing that function: I first need to find the location of the key in the heap. Doing this in O(log n) in a heap requires a lookup array mapping keys to their positions in the heap, see here (I don't know of any other way). However, this lookup table won't be updated when I call push_heap or pop_heap.
You can optimize Dijkstra's algorithm with priority_queue. It is implemented on top of a binary heap, where you can pop the top or push an element in O(log N) time. However, due to the encapsulation of priority_queue, you cannot modify the key (more precisely, decrease the key) of any element.
So our method is to push multiple elements into the heap, regardless of whether we end up with several entries referring to the same node.
for example, when
Node N : distance = 30, GraphNode = A(where A refers to one node in the graph, while N is one node in the heap)
is already in the heap, then the priority_queue cannot help you perform such an operation when we try to relax node N:
decrease_key_to(N, 20)
Decreasing the key would keep the heap at no more than N elements, but it cannot be done with priority_queue.
What we can do instead is to add another node to the heap:
Node N2 : distance = 20, GraphNode = A
push N2 into the heap
That corresponds to priority_queue::push.
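A minimal sketch of that approach with std::priority_queue, assuming the graph is stored as an adjacency list of (neighbour, weight) pairs (the function and variable names here are illustrative only):

#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Dijkstra with "lazy" duplicates: instead of decreasing a key we push a second
// (distance, node) pair and skip stale entries when they reach the top.
std::vector<long long> dijkstra(const std::vector<std::vector<std::pair<int, int>>>& adj, int src)
{
    const long long INF = std::numeric_limits<long long>::max();
    std::vector<long long> dist(adj.size(), INF);
    using Entry = std::pair<long long, int>;                                 // (distance, node)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;  // min-heap
    dist[src] = 0;
    pq.push({0, src});
    while (!pq.empty()) {
        auto [d, u] = pq.top();
        pq.pop();
        if (d != dist[u]) continue;                  // stale duplicate (an "N" superseded by an "N2"), skip it
        for (auto [v, w] : adj[u]) {
            if (d + w < dist[v]) {                   // relax the edge
                dist[v] = d + w;
                pq.push({dist[v], v});               // push N2; the old entry stays but will be skipped
            }
        }
    }
    return dist;
}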
So you may need to implement a binary heap supporting decrease_key yourself, or find an implementation online, and store a table of pointers to every element in the heap so you can reach heap elements through the nodes of the graph.
As an extension, using a Fibonacci heap can make decrease_key even faster; that's the ultimate level of Dijkstra, haha :)
Problem with the last version of my answer:
We cannot locate an element pushed into the heap using push_heap.
In order to do this, you need more than the priority_queue provides: you need to know where in the underlying container the element to be updated is stored. In a binary heap, for example, you need to know the position of the element for which you want to change the priority. With this you can change the priority of the element and then restore the heap property in O(log n) by bubbling the modified element up or down.
For Dijkstra, if I remember correctly, something called a Fibonacci heap is more efficient than a binary heap.
Unfortunately, std::priority_queue doesn't support updates of entries in the heap, but you may be able to invalidate entries in the heap and wait for them to percolate up to the root from where they can eventually be deleted. So instead of changing an existing entry, you invalidate it and insert another one with the new priority. Whether you can live with the consequences of having invalid entries filling up the heap, is for you to judge.
For an idea of how this might work in practice, see here.
If you have a fifo queue implemented using a linked list, what would be the most efficient way to pop a node with the highest value?
Mergesort would be O(n log n).
Scanning through the list would be O(n).
Can anyone suggest more efficient ways of doing this?
The queue must retain the fifo ordering that operates in the usual manner with enqueue and dequeue, but has an extra method, such as popMax, which pops and returns the node with the highest value.
No code is needed, just some ideas! Thanks!
Is popMax frequent enough that changing it from O(N) to O(log N) justifies the extra storage (per node: two pointers plus an index), the extra complexity, AND changing enqueue and dequeue from O(1) to O(log N)?
In the many times I have solved this problem (for different reasons and different employers) the answer to the above has pretty consistently been "Yes". But it might be "no" for you. So first make that decision.
Any improvement on the O(N) needs to be able to remove from the middle of the primary sequence. If the primary sequence was a forward-only linked list, it now needs links in both directions: one extra pointer.
A heap of pointers costs another extra pointer (per node, though not stored in the node). But then dequeue needs to be able to remove from the middle of the heap, which takes an index within the node as a back pointer to its position in the heap.
If it is worth all that, you can easily find (online, free) source code for a templated priority queue/heap, and it should be obvious how to make the heap's objects be node* and have the less function given to the heap compare the values inside the nodes pointed to.
Next you change that heap source code (this is the reason you don't simply use std::priority_queue) so that each time it positions an "object" (meaning a node*) in the heap, it does some kind of callback to notify the object of its new index.
You also need to expose some internals of the heap code. There is a point in any decent version of heap code at which the code deals with a hole (missing element) at index x within the heap by checking whether the last element of the heap could be correctly moved there (and if so, doing that), or otherwise moving the correct child of the hole into the hole and repeating for the new hole where that child was. Typically that code is not exposed to external callers with x as an input, but it easily can be. Your dequeue function needs that in order to remove from the heap the same element being removed from the list.
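A rough sketch of that bookkeeping, with the list node carrying a back-index into the heap and the heap doing the "notify" step whenever it moves an entry (the names are made up; the circular dummy node and error handling are left out):

#include <vector>

struct Node {
    int value;
    Node *prev = nullptr, *next = nullptr;  // FIFO links
    int heapIndex = -1;                     // position of this node inside the heap
};

struct MaxHeap {
    std::vector<Node*> a;
    void place(Node* n, int i) { a[i] = n; n->heapIndex = i; }  // "notify" on every move
    void siftUp(int i) {
        while (i > 0 && a[(i - 1) / 2]->value < a[i]->value) {
            Node *p = a[(i - 1) / 2], *c = a[i];
            place(c, (i - 1) / 2); place(p, i);
            i = (i - 1) / 2;
        }
    }
    void siftDown(int i) {
        for (;;) {
            int best = i, l = 2 * i + 1, r = 2 * i + 2, n = (int)a.size();
            if (l < n && a[l]->value > a[best]->value) best = l;
            if (r < n && a[r]->value > a[best]->value) best = r;
            if (best == i) return;
            Node *x = a[i], *y = a[best];
            place(y, i); place(x, best);
            i = best;
        }
    }
    void push(Node* n) { a.push_back(nullptr); place(n, (int)a.size() - 1); siftUp((int)a.size() - 1); }
    void removeAt(int i) {                  // the exposed "fill a hole at index i" step
        place(a.back(), i); a.pop_back();
        if (i < (int)a.size()) { siftDown(i); siftUp(i); }
    }
};
// dequeue: unlink the head node from the list, then heap.removeAt(node->heapIndex);
// popMax:  take heap.a[0], heap.removeAt(0), then unlink that node from the list.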
For less extra storage, but likely more execution time (though still O(log N) for each function), you could have a heap of nodes instead of a heap of node*. That is the reason that, in coding this kind of heap, you should code the notify callback generically (similar to less). Then your doubly linked list has indexes instead of pointers (so growth is robust) and the notify function updates the predecessor's forward index and the successor's back index. To avoid lots of special casing, you want a full circle (maybe including a dummy node) of double links, rather than just end to end.
I haven't worked through the details myself (since I've never redone any of this in the post-C++11 world), but I think a more elegant alternative to the notify function discussed above would be to wrap the object (that will be in the heap) in a wrapper that allows it to be moved but not copied. The action done by notify would instead be done during the move. That makes std::priority_queue even closer to what you need, but as far as I understand, it still doesn't expose that key internal point in the code for filling a hole at an arbitrary location.
I am having some confusion about the runtimes of the find_min operation on a binary search tree and a binary heap. I understand that returning the min in a binary heap is an O(1) operation. I also understand why, in theory, returning the minimum element of a binary search tree is an O(log N) operation. Much to my surprise, when I read up on the data structure in the C++ STL, the documentation states that returning an iterator to the first element in a map (which is the same as returning the minimum element) occurs in constant time! Shouldn't this be returned in logarithmic time? I need someone to help me understand what C++ is doing under the hood to return this in constant time. Because then there is really no point in using a binary heap in C++: the map data structure would support retrieving the min and max in constant time, delete and search in O(log N), and it keeps everything sorted. This means the data structure has the benefits of both a BST and a binary heap all tied up in one!
I had an argument about this with an interviewer (not really an argument), but I was trying to explain to him that in C++, returning the min and max from a map (which is a self-balancing binary search tree) occurs in constant time. He was baffled and kept saying I was wrong and that a binary heap was the way to go. Clarification would be much appreciated.
The constant-time lookup of the minimum and maximum is achieved by storing references to the leftmost and the rightmost nodes of the RB-tree in the header structure of the map. Here is a comment from the source code of the RB-tree, a template from which the implementation of std::set, std::map, and std::multimap are derived:
the header cell is maintained with links not only to the root but also to the leftmost node of the tree, to enable constant time begin(), and to the rightmost node of the tree, to enable linear time performance when used with the generic set algorithms (set_union, etc.)
The tradeoff here is that these pointers need to be maintained, so insertion and deletion operations would have another "housekeeping operation" to do. However, insertions and deletions are already done in logarithmic time, so there is no additional asymptotic cost for maintaining these pointers.
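For example (a small illustration; both calls are constant time precisely because those leftmost/rightmost links are kept up to date):

#include <cassert>
#include <map>

int main()
{
    std::map<int, const char*> m{{3, "c"}, {1, "a"}, {2, "b"}};
    // begin() follows the stored leftmost link, rbegin() the rightmost one,
    // so both the minimum and the maximum key are available in O(1).
    assert(m.begin()->first == 1);
    assert(m.rbegin()->first == 3);
}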
At least in a typical implementation, std::set (and std::map) will be implemented as a threaded binary tree1. In other words, each node contains not only a pointer to its (up to) two children, but also to the previous and next node in order. The set class itself then has pointers to not only the root of the tree, but also to the beginning and end of the threaded list of nodes.
To search for a node by key, the normal binary pointers are used. To traverse the tree in order, the threading pointers are used.
This does have a number of disadvantages compared to a binary heap. The most obvious is that it stores four pointers for each data item, where a binary heap can store just data, with no pointers (the relationships between nodes are implicit in the positions of the data). In an extreme case (e.g., std::set<char>) this could end up using a lot more storage for the pointers than for the data you actually care about (e.g., on a 64-bit system you could end up with 4 pointers of 64-bits apiece, to store each 8-bit char). This can lead to poor cache utilization, which (in turn) tends to hurt speed.
In addition, each node will typically be allocated individually, which can reduce locality of reference quite badly, again hurting cache usage and further reducing speed.
As such, even though the threaded tree can find the minimum or maximum, or traverse to the next or previous node in O(1), and search for any given item in O(log N), the constants can be substantially higher than doing the same with a heap. Depending on the size of items being stored, total storage used may be substantially greater than with a heap as well (worst case is obviously when only a little data is stored in each node).
1. With some balancing algorithm applied--most often red-black, but sometimes AVL trees or B-trees. Any number of other balanced trees could be used (e.g., alpha-balanced trees, k-neighbor trees, binary B-trees, general balanced trees).
I'm not an expert at maps, but returning the first element of a map would be considered a 'root' of sorts: there is always a pointer to it, so looking it up would be instant. The same would go for a BST, as it clearly has a root node, then two nodes off of it, and so on (by the way, I would look into using an AVL tree, as its worst-case lookup time is much better than a plain BST's).
The O(log(N)) is normally only used as an estimate of the worst-case scenario. So if you have a completely unbalanced BST you'll actually get O(N): if you're searching for the last node you have to do a comparison against every node.
I'm not too sure about your last statement, though: a map is different from a self-balancing tree (those are called AVL trees, or that's what I was taught). A map uses 'keys' to organize objects in a specific way; the key is found by serializing the data into a number, and the number is for the most part placed in a list.
Is there a data structure like a queue which also supports removal of elements at arbitrary points? Enqueueing and dequeueing occur most frequently, but mid-queue element removal must be similar in speed terms since there may be periods where that is the most common operation. Consistency of performance is more important than absolute speed. Time is more important than memory. Queue length is small, under 1,000 elements at absolute peak load. In case it's not obvious, I'll state it explicitly: random insertion is not required.
Have tagged C++ since that is my implementation language, but I'm not using (and don't want to use) any STL or Boost. Pure C or C++ only (I will convert C solutions to a C++ class.)
Edit: I think what I want is a kind of dictionary that also has a queue interface (or a queue that also has a dictionary interface) so that I can do things like this:
Container.enqueue(myObjPtr1);
MyObj *myObjPtr2 = Container.dequeue();
Container.remove(myObjPtr3);
I think that a doubly linked list is exactly what you want (assuming you do not want a priority queue):
Easy and fast adding elements to both ends
Easy and fast removal of elements from anywhere
You can use the std::list container, but (in your case) it is difficult to remove an element from the middle of the list if you only have a pointer (or reference) to the element (wrapped in the STL's list node) but do not have an iterator. If using iterators (e.g. storing them) is not an option, then implementing a doubly linked list (even with an element counter) should be pretty easy. If you implement your own list, you can operate directly on pointers to elements (each of them contains pointers to both of its neighbours). If you do not want to use Boost or the STL this is probably the best option (and the simplest), and you have control of everything (you can even write your own block allocator for list elements to speed things up).
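If you roll your own, a minimal sketch might look like this (the names are illustrative; the caller keeps the Node* returned by enqueue and hands it back to remove):

// Hand-rolled doubly linked queue with O(1) enqueue, dequeue and removal of an
// arbitrary node, given its pointer. A circular sentinel avoids end special cases.
struct Node {
    void* payload;                 // whatever the queue stores (e.g. MyObj*)
    Node *prev, *next;
};

struct Queue {
    Node sentinel;                 // sentinel.next is the front, sentinel.prev the back
    Queue() { sentinel.prev = sentinel.next = &sentinel; }

    Node* enqueue(void* p) {       // push at the back, return a stable handle
        Node* n = new Node{p, sentinel.prev, &sentinel};
        sentinel.prev->next = n;
        sentinel.prev = n;
        return n;
    }
    void* dequeue() {              // pop from the front (assumes a non-empty queue)
        Node* n = sentinel.next;
        void* p = n->payload;
        remove(n);
        return p;
    }
    void remove(Node* n) {         // O(1) removal from anywhere, given the node
        n->prev->next = n->next;
        n->next->prev = n->prev;
        delete n;
    }
};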
One option is to use an order statistic tree, an augmented tree structure that supports O(log n) random access to each element, along with O(log n) insertion and deletion at arbitrary points. Internally, the order statistic tree is implemented as a balanced binary search tree with extra information associated with it. As a result, lookups are slower than in a standard dynamic array, but insertions are much faster.
Hope this helps!
You can use a combination of a linked list and a hash table. In Java it is called a LinkedHashSet.
The idea is simple: have a linked list of elements, and also maintain a hash map of (key, node) pairs, where node is a pointer to the relevant node in the linked list and key is the key representing that node.
Note that the basic implementation is a set, and some extra work will be needed to make this data structure allow dupes.
This data structure gives you both O(1) head/tail access and O(1) access to any element in the list [all on average, amortized].
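In C++ the same idea might look roughly like this, using the standard containers just to show the shape of it (a sketch assuming unique keys; the question rules out the STL, but a hand-rolled list and hash table would have the same structure):

#include <list>
#include <unordered_map>

// A list keeps FIFO order; the map goes from key to the list iterator, so any
// element can be unlinked in O(1) on average. std::list iterators stay valid
// when other elements are erased, which is what makes this work.
template <class Key>
struct IndexedQueue {
    std::list<Key> order;
    std::unordered_map<Key, typename std::list<Key>::iterator> where;

    void enqueue(const Key& k) { where[k] = order.insert(order.end(), k); }
    Key dequeue() {                              // assumes a non-empty queue
        Key k = order.front();
        where.erase(k);
        order.pop_front();
        return k;
    }
    void remove(const Key& k) {                  // O(1) average, from anywhere
        auto it = where.find(k);
        if (it == where.end()) return;
        order.erase(it->second);
        where.erase(it);
    }
};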
I am maintaining a fixed-length table of 10 entries. Each item is a structure of like 4 fields. There will be insert, update and delete operations, specified by numeric position. I am wondering which is the best data structure to use to maintain this table of information:
array - insert/delete takes linear time due to shifting; update takes constant time; no space is used for pointers; accessing an item using [] is faster.
stl vector - insert/delete takes linear time due to shifting; update takes constant time; no space is used for pointers; accessing an item is slower than an array since it is a call to operator[].
stl list - insert and delete take linear time since you need to iterate to a specific position before applying the insert/delete; additional space is needed for pointers; accessing an item is slower than an array since it requires a linear traversal of the linked list.
Right now, my choice is to use an array. Is it justifiable? Or did I miss something?
Which is faster: traversing a list, then inserting a node or shifting items in an array to produce an empty position then inserting the item in that position?
What is the best way to measure this performance? Can I just display the timestamp before and after the operations?
Use an STL vector. It provides an interface as rich as a list and removes the pain of managing memory that arrays require.
You will have to try very hard to expose the performance cost of operator[] - it usually gets inlined.
I do not have any numbers to give you, but I remember reading a performance analysis that described how vector<int> was faster than list<int> even for inserts and deletes (under a certain size, of course). The truth of the matter is that the processors we use are very fast, and if your vector fits in the L2 cache then it's going to go really, really fast. Lists, on the other hand, have to manage heap objects, which will kill your L2.
Premature optimization is the root of all evil.
Based on your post, I'd say there's no reason to make your choice of data structure here a performance based one. Pick whatever is most convenient and return to change it if and only if performance testing demonstrates it's a problem.
It is really worth investing some time in understanding the fundamental differences between lists and vectors.
The most significant difference between the two is the way they store elements and keep track of them.
- Lists -
A list contains elements which store the address of the previous and next element. This means that you can INSERT or DELETE an element anywhere in the list at constant speed, O(1), regardless of the list size. You can also splice (insert another list) into the existing list anywhere at constant speed. The reason is that the list only needs to rewire the previous and next pointers around the element being inserted.
Lists are not good if you need random access: if one plans to access the nth element in the list, one has to traverse the list element by element - O(n) speed.
- Vectors -
A vector contains elements in sequence, just like an array. This is very convenient for random access: accessing the "nth" element in a vector is a simple pointer calculation (O(1) speed). Adding elements to a vector is, however, different. If one wants to add an element in the middle of a vector, all the elements that come after it have to be shifted over to make room for the new entry. The speed will depend on the vector size and on the position of the new element: the worst case is inserting an element at the beginning of the vector, the best is appending a new element at the end. Therefore insert works in O(n), where "n" is the number of elements that need to be moved - not necessarily the size of the vector.
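A small illustration of the difference (a sketch; the list insert assumes you already hold an iterator to the position):

#include <iterator>
#include <list>
#include <vector>

int main()
{
    std::list<int> l = {1, 2, 4, 5};
    auto pos = std::next(l.begin(), 2); // an iterator you already hold on to
    l.insert(pos, 3);                   // O(1): only the neighbouring links change

    std::vector<int> v = {1, 2, 4, 5};
    v.insert(v.begin() + 2, 3);         // O(n): the elements after that position are shifted over
}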
There are other differences that involve memory requirements etc., but understanding these basic principles of how lists and vectors actually work is really worth spending some time on.
As always, "Premature optimization is the root of all evil", so first consider what is more convenient and make things work exactly the way you want them, then optimize. For the 10 entries that you mention it really does not matter what you use - you will never be able to see any kind of performance difference.
Prefer std::vector over an array. Some advantages of vector are:
They allocate memory from the free store when increasing in size.
They are NOT a pointer in disguise.
They can increase/decrease in size run-time.
They can do range checking using at().
A vector knows its size, so you don't have to count elements.
The most compelling reason to use a vector is that it frees you from explicit memory management, and it does not leak memory. A vector keeps track of the memory it uses to store its elements. When a vector needs more memory for elements, it allocates more; when a vector goes out of scope, it frees that memory. Therefore, the user need not be concerned with the allocation and deallocation of memory for vector elements.
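For instance, a short sketch of those points:

#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v;                 // no explicit size, no new[]/delete[]
    for (int i = 0; i < 5; ++i)
        v.push_back(i * i);             // grows at run time
    std::cout << v.size() << '\n';      // the vector knows how many elements it holds
    std::cout << v.at(2) << '\n';       // at() throws std::out_of_range on a bad index
}                                       // storage is released automatically here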
You're making assumptions you shouldn't be making, such as "accessing an item is slower than an array since it is a call to operator[]." I can understand the logic behind it, but neither you nor I can know until we profile it.
If you do, you'll find there is no overhead at all when optimizations are turned on: the compiler inlines the function calls. There is a difference in memory behaviour, though. An array is statically allocated, while a vector allocates dynamically. A list allocates per node, which can thrash the cache if you're not careful.
Some solutions are to have the vector allocate from the stack, and to use a pool allocator for the list so that the nodes can fit into the cache.
So rather than worrying about unsupported claims, you should worry about making your design as clean as possible. So, which makes more sense: an array, a vector, or a list? I don't know what you're trying to do, so I can't answer you.
The "default" container tends to be a vector. Sometimes an array is perfectly acceptable too.
First a couple of notes:
A good rule of thumb about selecting data structures: Generally, if you examined all the possibilities and determined that an array is your best choice, start over. You did something very wrong.
STL lists don't support operator[], and if they did, the reason it would be slower than indexing an array would have nothing to do with the overhead of a function call.
Those things being said, vector is the clear winner here. The call to operator[] is essentially negligible since the contents of a vector are guaranteed to be contiguous in memory. It supports insert() and erase() operations which you would essentially have to write yourself if you used an array. Basically it boils down to the fact that a vector is essentially an upgraded array which already supports all the operations you need.
I am maintaining a fixed-length table of 10 entries. Each item is a structure of like 4 fields. There will be insert, update and delete operations, specified by numeric position. I am wondering which is the best data structure to use to maintain this table of information:
Based on this description it seems like list might be the better choice, since it's O(1) for inserting and deleting in the middle of the data structure. Unfortunately you cannot use numeric positions with lists to do inserts and deletes the way you can with arrays/vectors. This dilemma leads to a slew of questions which can be used to make an initial decision about which structure may be best; the structure can later be changed if testing clearly shows it was the wrong choice.
The questions you need to ask are threefold. First, how often are you planning to do deletes/inserts in the middle relative to random reads? Second, how important is using a numeric position compared to an iterator? Finally, is order in your structure important?
If the answer to the first question is that random reads will be more prevalent, then a vector/array will probably work well. Note that iterating through a data structure is not considered a random read, even if the operator[] notation is used. For the second question, if you absolutely require numeric positions, then a vector/array will be required, even though this may lead to a performance hit. Later testing may show that this performance hit is negligible relative to the easier coding with numeric positions. Finally, if order is unimportant, you can insert and delete in a vector/array with an O(1) algorithm. A sample algorithm is shown below.
#include <vector>
using std::vector;

template <class T>
void erase(vector<T>& vect, int index)   // note: the vector cannot be const since you are changing it
{
    vect[index] = vect.back();           // move the item at the back into slot index
    vect.pop_back();                     // delete the item at the back
}

template <class T>
void insert(vector<T>& vect, int index, T value)  // note: the vector cannot be const since you are changing it
{
    vect.push_back(vect[index]);         // copy the item at index to the back of the vector
    vect[index] = value;                 // replace the item at index with value
}
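A short usage sketch, assuming the two helpers above are in scope; note that the order of the remaining elements changes, which is why this trick only applies when order is unimportant:

#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v = {10, 20, 30, 40};
    erase(v, 1);      // v becomes {10, 40, 30}: the last element moved into slot 1
    insert(v, 0, 5);  // v becomes {5, 40, 30, 10}: the old v[0] moved to the back
    for (int x : v) std::cout << x << ' ';
}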
I believe it depends on your needs: if you do more inserts/deletes at the beginning or in the middle, use a list (doubly linked internally); if you need random access to the data and mostly add at the end, use an array (a vector allocates dynamically, so if you also need operations such as sort, resize, etc., use a vector).