I need a container (not necessarily an STL container) which lets me do the following easily:
Insertion and removal of elements at any position
Accessing elements by their index
Iterate over the elements in any order
I used std::list, but it won't let me insert at any position efficiently: it can, but I'd first have to iterate over the elements to reach the position I want, which is slow since the list may be huge. So can you recommend an efficient solution?
It's not completely clear to me what you mean by "Iterate over the elements in any order" - does this mean you don't care about the order, as long as you can iterate, or that you want to be able to iterate using arbitrarily defined criteria? These are very different conditions!
Assuming you meant iteration order doesn't matter, several possible containers come to mind:
std::map [a red-black tree, typically]
Insertion, removal, and access are O(log(n))
Iteration is ordered by index
hash_map or std::tr1::unordered_map [a hash table]
Insertion, removal, and access are all (approx) O(1)
Iteration is 'random'
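For illustration, a minimal sketch of both options (keys and values here are arbitrary):

#include <map>
#include <string>
#include <unordered_map>

int main() {
    std::map<int, std::string> m;             // balanced tree: O(log n) insert/lookup, iteration ordered by key
    std::unordered_map<int, std::string> um;  // hash table: ~O(1) insert/lookup, iteration order unspecified

    m[42] = "answer";                         // insertion
    um[42] = "answer";
    bool inBoth = m.count(42) && um.count(42); // lookup
    (void)inBoth;
}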
This diagram will help you a lot, I think.
Either a vector or a deque will suit. vector will provide faster accesses, but deque will provide faster insertions and removals.
Well, you can't have all of those in constant time, unfortunately. Decide if you are going to do more insertions or reads, and base your decision on that.
For example, a vector will let you access any element by index in constant time, iterate over the elements in linear time (all containers should allow this), but insertion and removal takes linear time (slower than a list).
You can try std::deque. It will not provide constant-time removal of elements in the middle, but it supports:
random access to elements
constant time insertion and removal of elements at the end of the sequence
linear time insertion and removal of elements in the middle
A vector. When you erase any item, copy the last item over the one to be erased (or swap them, whichever is faster) and pop_back. To insert at a position (but why should you, if the order doesn't matter!?), push_back the item currently at that position and overwrite (or swap) the position with the item to be inserted.
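A minimal sketch of both tricks, assuming int elements and that order really doesn't matter:

#include <cstddef>
#include <vector>

// Erase element i in O(1) by overwriting it with the last element.
void unordered_erase(std::vector<int>& v, std::size_t i) {
    v[i] = v.back();   // or std::swap(v[i], v.back())
    v.pop_back();
}

// Insert x at position pos in amortized O(1), moving the occupant to the back.
void unordered_insert(std::vector<int>& v, std::size_t pos, int x) {
    v.push_back(v[pos]);   // push_back of the vector's own element is safe
    v[pos] = x;
}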
By "iterating over the elements in any order", do you mean you need support for both forward and backwards by index, or do you mean order doesn't matter?
You want a special tree called an unsorted counted tree. This allows O(log(n)) indexed insertion, O(log(n)) indexed removal, and O(log(n)) indexed lookup. It also allows O(n) iteration in either the forward or reverse direction. One example of where these are used is text editors, where each line of text in the editor is a node.
Here are some references:
Counted B-Trees
Rope (computer science)
An order statistic tree might be useful here. It's basically just a normal tree, except that every node in the tree includes a count of the nodes in its left sub-tree. This supports all the basic operations with no worse than logarithmic complexity. During insertion, any time you insert an item in a left sub-tree, you increment that node's count. During deletion, any time you delete from a left sub-tree, you decrement that node's count. To index to node N, you start from the root. The root has a count of the nodes in its left sub-tree, so you check whether N is less than, equal to, or greater than that count. If it's less, you search the left sub-tree in the same way. If it's greater, you descend into the right sub-tree, add the root's count plus one (for the root itself) to that node's count, and compare that to N. Continue until A) you've found the correct node, or B) you've determined that there are fewer than N items in the tree.
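A minimal sketch of the indexed lookup just described; the Node layout and select name are illustrative, and a real tree would also rebalance and maintain leftCount on insert/delete:

#include <cstddef>

struct Node {
    int value;
    std::size_t leftCount = 0;   // number of nodes in the left sub-tree
    Node* left = nullptr;
    Node* right = nullptr;
};

// Returns the node at zero-based index n, or nullptr if the tree has fewer than n+1 items.
Node* select(Node* root, std::size_t n) {
    while (root) {
        if (n < root->leftCount) {
            root = root->left;            // target lies in the left sub-tree
        } else if (n == root->leftCount) {
            return root;                  // exactly leftCount nodes precede root
        } else {
            n -= root->leftCount + 1;     // skip the left sub-tree and the root itself
            root = root->right;
        }
    }
    return nullptr;
}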
[container choice diagram] (source: adrinael.net)
But it sounds like you're looking for a single container with the following properties:
All the best benefits of various containers
None of their ensuing downsides
And that's impossible. Every benefit comes with a corresponding cost. Choosing a container is about compromise.
std::vector
Assume I have a std::set (which is by definition sorted), and I have another range of sorted elements (for the sake of simplicity, in a different std::set object). Also, I have a guarantee that all values in the second set are larger than all the values in the first set.
I know I can efficiently insert one element into a std::set: if I pass a correct hint, this will be amortized O(1). I know I can insert any range into a std::set, but as no hint is passed, this will be O(k log N) (where k is the number of new elements, and N the number of old elements).
Can I insert a range in a std::set and provide a hint? The only way I can think of is to do k single inserts with a hint, which does push the complexity of the insert operations in my case down to O(k):
std::set <int> bigSet{1,2,5,7,10,15,18};
std::set <int> biggerSet{50,60,70};
for(auto bigElem : biggerSet)
bigSet.insert(bigSet.end(), bigElem);
First of all, to do the merge you're talking about, you probably want to use set's (or map's) merge member function, which will let you merge some existing set into this one. The advantage of doing this (and the reason you might not want to, depending on your usage pattern) is that the items being merged in are actually moved from one set to the other, so you don't have to allocate new nodes (which can save a fair amount of time). The disadvantage is that the nodes then disappear from the source set, so if you need each local histogram to remain intact after being merged into the global histogram, you don't want to do this.
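For instance, a minimal sketch (std::set::merge requires C++17; the set names are illustrative):

#include <set>

int main() {
    std::set<int> global{1, 2, 5};
    std::set<int> local{2, 3, 7};

    global.merge(local);  // nodes are moved, not copied or reallocated
    // global is now {1, 2, 3, 5, 7}; local keeps {2}, the would-be duplicate
}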
You can typically do better than O(log N) when searching a sorted vector. Assuming reasonably predictable distribution you can use an interpolating search to do a search in (typically) around O(log log N), often called "pseudo-constant" complexity.
Given that you only do insertions relatively infrequently, you might also consider a hybrid structure. This starts with a small chunk of data that you don't keep sorted. When you reach an upper bound on its size, you sort it and insert it into a sorted vector. Then you go back to adding items to your unsorted area. When it reaches the limit, again sort it and merge it with the existing sorted data.
Assuming you limit the unsorted chunk to no larger than log(N), search complexity is still O(log N): one O(log N) binary search (or O(log log N) interpolating search) on the sorted chunk, and one O(log N) linear search on the unsorted chunk. Once you've verified that an item doesn't exist yet, adding it has constant complexity (just tack it onto the end of the unsorted chunk). The big advantage is that this can still easily use a contiguous structure such as a vector, so it's much more cache friendly than a typical tree structure.
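A minimal sketch of that hybrid, assuming unique int elements and an illustrative fixed tail limit (a real implementation would scale the limit with log N):

#include <algorithm>
#include <cstddef>
#include <vector>

class HybridSet {
    std::vector<int> data_;                      // [0, sorted_) is sorted, the rest is the unsorted tail
    std::size_t sorted_ = 0;
    static constexpr std::size_t kMaxTail = 32;  // stand-in for ~log N

public:
    bool contains(int x) const {
        if (std::binary_search(data_.begin(), data_.begin() + sorted_, x))
            return true;                         // O(log N) on the sorted chunk
        return std::find(data_.begin() + sorted_, data_.end(), x)
               != data_.end();                   // linear scan of the small tail
    }

    void insert(int x) {
        if (contains(x)) return;                 // keep elements unique
        data_.push_back(x);                      // O(1): tack onto the tail
        if (data_.size() - sorted_ >= kMaxTail) {
            std::sort(data_.begin() + sorted_, data_.end());
            std::inplace_merge(data_.begin(),    // merge the tail into the sorted prefix
                               data_.begin() + sorted_, data_.end());
            sorted_ = data_.size();
        }
    }
};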
Since your global histogram is (apparently) only ever populated with data coming from the local histograms, it might be worth considering just keeping it in a vector, and when you need to merge in the data from one of the local chunks, just use std::merge to take the existing global histogram and the local histogram, and merge them together into a new global histogram. This has O(N + M) complexity (N = size of global histogram, M = size of local histogram). Depending on the typical size of a local histogram, this could pretty easily work out as a win.
Merging two sorted containers is much quicker than sorting. Its complexity is O(N), so in theory what you say makes sense. It's the reason why merge sort is one of the quickest sorting algorithms. If you follow the link, you will also find pseudo-code; what you are doing is just one pass of the main loop.
You will also find the algorithm implemented in the STL as std::merge. This takes any pair of iterator ranges as input; I would suggest using std::vector as the default container for the new elements. Sorting a vector is a very fast operation. You may even find it better to use a sorted vector instead of a set for the output. You can always use std::lower_bound to get O(log N) lookups from a sorted vector.
Vectors have many advantages compared with set/map. Not least of which is they are very easy to visualise in a debugger :-)
(The code at the bottom of the std::merge shows an example of using vectors)
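A minimal sketch of merging two sorted vectors this way (the histogram names are illustrative):

#include <algorithm>
#include <iterator>
#include <vector>

int main() {
    std::vector<int> global{1, 3, 5, 9};   // sorted
    std::vector<int> local{2, 3, 8};       // sorted

    std::vector<int> merged;
    merged.reserve(global.size() + local.size());
    std::merge(global.begin(), global.end(),
               local.begin(), local.end(),
               std::back_inserter(merged));  // O(N + M), single pass
    global.swap(merged);                     // becomes the new global histogram
}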
You can merge the sets more efficiently using special functions for that.
In case you insist, insert returns an iterator to the inserted element, which you can carry forward as the hint for the next insertion.
iterator insert( const_iterator hint, const value_type& value );
Code:
std::set <int> bigSet{1,2,5,7,10,15,18};
std::set <int> biggerSet{50,60,70};
auto hint = bigSet.cend();
for(auto& bigElem : biggerSet)
hint = bigSet.insert(hint, bigElem);
This assumes, of course, that you are inserting new elements that will end up together or close in the final set. Otherwise there is not much to gain, only the fact that since the source is a set (it is ordered), about half of the tree will not be looked up.
There is also a member function
template< class InputIt > void insert( InputIt first, InputIt last );
That might or might not do something like this internally.
In data structures, we say that inserting an element before a given node in a singly-linked list is an O(n) operation: since there are no backward pointers, we have to walk through the elements from the head to reach the node we want to insert before. Therefore, it has a linear run time.
Then, when we introduce doubly-linked lists, we say the problem is resolved: now that we have pointers in both directions, inserting before a node becomes a constant time operation, O(1).
I understand the logic, but something still confuses me. We do NOT have constant time access to the elements of the list, so to find the element we want to insert before, we still have to walk to it! It is true that in a doubly-linked list the insert-before operation itself is faster to implement, but the action of finding the key of interest is still O(n). So why do we say that with a doubly-linked list insert-before becomes O(1)?
Thanks,
In C++, the std::list::insert() function takes an iterator to indicate where the insert should occur. That means the caller already has this iterator, and the insert operation is not doing a search and therefore runs in constant time.
The find() algorithm, however, is linear, and is the normal way to search for a list element. If you need to find+insert, the combination is O(n).
However, there is no requirement to do a search before an insert. For example, if you have a cached (valid) iterator, you can insert in front of (or delete) the element it corresponds with in constant time.
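A minimal sketch of the cached-iterator case:

#include <iterator>
#include <list>

int main() {
    std::list<int> xs{1, 2, 4};

    auto it = std::next(xs.begin(), 2);  // one-time O(n) positioning, before element 4
    xs.insert(it, 3);                    // O(1): no search, the iterator is already known
    // it remains valid; list insertions never invalidate existing iterators
}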
Since both std::priority_queue and std::set (and std::multiset) are data containers that store elements and allow you to access them in an ordered fashion, and have same insertion complexity O(log n), what are the advantages of using one over the other (or, what kind of situations call for the one or the other?)?
While I know that the underlying structures are different, I am not as much interested in the difference in their implementation as I am in the comparison their performance and suitability for various uses.
Note: I know about the no-duplicates in a set. That's why I also mentioned std::multiset since it has the exactly same behavior as the std::set but can be used where the data stored is allowed to compare as equal elements. So please, don't comment on single/multiple keys issue.
A priority queue only gives you access to one element in sorted order -- i.e., you can get the highest priority item, and when you remove that, you can get the next highest priority, and so on. A priority queue also allows duplicate elements, so it's more like a multiset than a set. [Edit: As #Tadeusz Kopec pointed out, building a heap is also linear on the number of items in the heap, where building a set is O(N log N) unless it's being built from a sequence that's already ordered (in which case it is also linear).]
A set allows you full access in sorted order, so you can, for example, find two elements somewhere in the middle of the set, then traverse in order from one to the other.
std::priority_queue allows to do the following:
Insert an element O(log n)
Get the smallest element O(1)
Erase the smallest element O(log n)
while std::set has more possibilities:
Insert any element O(log n) and the constant is greater than in std::priority_queue
Find any element O(log n)
Find an element >= the one you are looking for O(log n) (lower_bound)
Erase any element O(log n)
Erase any element by its iterator, amortized O(1)
Move to previous/next element in sorted order O(1)
Get the smallest element O(1)
Get the largest element O(1)
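For example:

#include <set>

int main() {
    std::set<int> s{1, 3, 5, 7};

    auto it = s.lower_bound(4);   // O(log n): first element >= 4, i.e. 5
    --it;                         // amortized O(1): previous element in sorted order, i.e. 3
    int smallest = *s.begin();    // O(1)
    int largest  = *s.rbegin();   // O(1)
    (void)it; (void)smallest; (void)largest;
}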
set/multiset are generally backed by a binary tree. http://en.wikipedia.org/wiki/Binary_tree
priority_queue is generally backed by a heap. http://en.wikipedia.org/wiki/Heap_(data_structure)
So the question is really when should you use a binary tree instead of a heap?
Both structures are laid out as a tree; however, the rules about the relationship between a node and its children are different.
We will call the positions P for parent, L for left child, and R for right child.
In a binary tree L < P < R.
In a heap, P < L and P < R.
So binary trees sort "sideways" and heaps sort "upwards".
So if we look at this as a triangle, then in the binary tree L, P, R are completely sorted, whereas in the heap the relationship between L and R is unknown (only their relationship to P).
This has the following effects:
If you have an unsorted array and want to turn it into a binary tree, it takes O(n log n) time. If you want to turn it into a heap, it only takes O(n) time, since it only needs to establish the partial order of the heap property, not a total order.
Heaps are more efficient if you only need the extreme element (lowest or highest by some comparison function). Heaps only do the comparisons (lazily) necessary to determine the extreme element.
Binary trees perform the comparisons necessary to order the entire collection, and keep the entire collection sorted all-the-time.
Heaps have constant-time lookup (peek) of lowest element, binary trees have logarithmic time lookup of lowest element.
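A minimal sketch contrasting the two construction costs:

#include <algorithm>
#include <set>
#include <vector>

int main() {
    std::vector<int> v{5, 1, 4, 2, 3};

    std::make_heap(v.begin(), v.end());   // O(n): establishes only the heap property
    int largest = v.front();              // O(1) peek at the extreme element

    std::set<int> s(v.begin(), v.end());  // O(n log n): establishes a total order
    int smallest = *s.begin();            // O(1): begin() of a set is constant time
    (void)largest; (void)smallest;
}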
Even though insert and erase operations for both containers have the same complexity O(log n), these operations are slower for std::set than for std::priority_queue. That's because std::set makes many memory allocations: every element of std::set is stored in its own allocation, while std::priority_queue (with the default underlying std::vector container) uses a single allocation to store all elements. On the other hand, std::priority_queue performs many swap operations on its elements, whereas std::set just swaps pointers. So if swapping is a very slow operation for the element type, using std::set may be more efficient. Moreover, the element type may not be swappable at all.
The memory overhead of std::set is also much bigger, because it has to store many pointers between its nodes.
Here's an interesting problem:
Let's say we have a set A for which the following are permitted:
Insert x
Find-min x
Delete the n-th inserted element in A
Create a data structure to permit these in logarithmic time.
The most common solution is with a heap. AFAIK, heaps with decrease-key (keyed on a value, generally the index at which an element was added) keep a table Pos[1...N], where Pos[i] is the current heap position of the i-th added value, so the key to decrease can be found in O(1). Can someone confirm this?
Another question is how we solve the problem with STL containers, i.e. with sets, maps or priority queues. A partial solution I found is to have a priority queue of indexes, but ordered by the values at those indexes: A[1..N] are our added elements in order of insertion, and the priority queue holds 1..N compared by (A[i], A[j]). Now we keep a table of the deleted indexes and check whether the min-value index has been deleted. Unfortunately, find-min becomes proportional to the number of deleted values.
Any alternative ideas?
Now I thought how to formulate a more general problem.
Create a data structure similar to a multimap with <key, value> elements. Keys are not unique. Values are. Insert, find one (based on key), find (based on value), delete one (based on key) and delete (based on value) must all run in O(log N).
Perhaps a bit oddly, this is possible with a manually implemented Binary Search Tree with a modification: for every node operation a hash-table or a map based on value is updated with the new pointer to the node.
Similar to having a strictly ordered std::set (if equal key order by value) with a hash-table on value giving the iterator to the element containing that value.
Possible with std::set and a (std::map/hash table) as described by Chong Luo.
You can use a combination of two containers to solve your problem: a vector in which you append each consecutive element, and a set (see the sketch below):
Use the set to execute find-min.
When you insert an element, push_back into the vector and insert into the set.
When you delete the n-th element, look up its value in the vector and erase that value from the set. Here I assume the numbering of the elements does not change after executing delete n-th element.
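A minimal sketch of that combination; the class and member names are hypothetical, and values are assumed distinct so a plain std::set suffices (use std::multiset with find + erase-by-iterator otherwise):

#include <cstddef>
#include <set>
#include <vector>

class MinWithDeleteNth {
    std::vector<int> byInsertion_;   // value of the i-th inserted element (never shrinks)
    std::set<int> live_;             // values currently in the structure, sorted

public:
    void insert(int x) {             // O(log n)
        byInsertion_.push_back(x);
        live_.insert(x);
    }
    int findMin() const {            // O(1): begin() of a set is constant time
        return *live_.begin();
    }
    void deleteNth(std::size_t n) {  // O(log n); n counts from the first insertion
        live_.erase(byInsertion_[n]);
    }
};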
I don't think you can solve the problem with only one container from the STL. However, there are some data structures that can solve your problem:
Skip list - can find the minimum in constant time and will perform the other two operations with amortized complexity O(log(n)). It is relatively easy to implement.
Tiered vector is easy to implement and will perform find_min in constant time and the other two operations in O(sqrt(n))
And of course the approach you propose - write your own heap that keeps track of where the n-th inserted element resides.
I'm looking for a C++ implementation of a data structure ( or a combination of data structures ) that meet the following criteria:
items are accessed in the same way as in std::vector
provides random access iterator ( along with iterator comparison <,> )
average item access(:lookup) time is at worst of O(log(n)) complexity
items are iterated over in the same order as they were added to the container
given an iterator, i can find out the ordinal position of the item pointed to in the container, at worst of O(log(n)) complexity
provides item insertion and removal at specific position of at worst O(log(n)) complexity
removal/insertion of items does not invalidate previously obtained iterators
Thank you in advance for any suggestions
Dalibor
(Edit) Answers:
The answer I selected describes a data structure that meets all these requirements. However, boost::multi_index, as suggested by Maxim Yegorushkin, provides features very close to those above.
(Edit) Some of the requirements were not correctly specified. They are modified according to correction(:original)
(Edit) I've found an implementation of the data structure described in the accepted answer. So far, it works as expected. It's called counter tree
(Edit) Consider using the AVL-Array suggested by sp2danny
Based on your requirements boost::multi_index with two indices does the trick.
The first index is an ordered index. It allows O(log(n)) insert/lookup/remove. The second index is a random access index. It allows random access, and the elements are stored in the order of insertion. For both indices, iterators are not invalidated when other elements are removed. Converting from one iterator to another is an O(1) operation.
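A minimal sketch of that setup; the element type and index choices here are assumptions for illustration:

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/identity.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/random_access_index.hpp>

namespace bmi = boost::multi_index;

// Index 0: insertion order with random access; index 1: sorted, for O(log n) lookup.
using Container = bmi::multi_index_container<
    int,
    bmi::indexed_by<
        bmi::random_access<>,
        bmi::ordered_non_unique<bmi::identity<int>>>>;

int main() {
    Container c;
    c.push_back(3);                   // the container's default interface is index 0
    c.push_back(1);
    c.push_back(2);

    const auto& sorted = c.get<1>();  // the ordered view of the same elements
    auto it = sorted.find(2);         // O(log n) lookup
    auto rit = c.project<0>(it);      // O(1) conversion between the two indices
    auto pos = rit - c.begin();       // ordinal (insertion-order) position, here 2
    (void)pos;
}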
Let's go through these...
average item lookup time is at worst of O(log(n)) complexity
removal/insertion of items does not invalidate previously obtained iterators
provides item insertion and removal of at worst O(log(n)) complexity
That pretty much screams "tree".
provides random access iterator ( along with iterator comparison <,> )
given an iterator, i can find out the ordinal position of the item pointed to in the container, at worst of O(log(n)) complexity
items are iterated over in the same order as they were added to the container
I'm assuming that the index you're providing your random-access iterator is by order of insertion, so [0] would be the oldest element in the container, [1] would be the next oldest, etc. This means that, on deletion, for the iterators to be valid, the iterator internally cannot store the index, since it could change without notice. So just using a map with the key being the insertion order isn't going to work.
Given that, each node of your tree needs to keep track of how many elements are in each subtree, in addition to its usual members. This will allow random access in O(log(N)) time. I don't know of a ready-to-go set of code, but subclassing the internal red-black tree machinery that ships with your standard library (there is no standard std::rb_tree; libstdc++, for example, has an internal _Rb_tree) would be my starting point.
See here: STL Containers (scroll down the page to see information on algorithmic complexity) and I think std::deque fits your requirements.
AVL-Array should fit the bill.
Here's my "lv" container that fits the requirements, with O(log n) insert/delete/access time.
https://github.com/xhawk18/lv
The container is a header-only C++ library, and has the same iterators and functions as other C++ containers, such as list and vector.
The "lv" container is based on an rb-tree, each node of which stores the number of nodes in its subtree. By checking the sizes of a node's left/right children, we can quickly access any node by index.