C++ - List with logarithmic read, insertion at given position - c++

I'm looking for data structure that behaves like a list, where we can insert an element at ANY given position and then read an element at ANY given position, where insertion and reading should be in logarithmic time. Is there something like this in the standard library or maybe I'm stuck with having to write this on my own (I know it can be implemented as a tree)?

std::multiset behaves pretty much like the logarithmic std::list that you are looking for
iteration is bidirectional
insertion / reading are O(log N)
Note however (as pointed out by @SergeRogatch) that the "price" you pay for O(log N) lookup (instead of O(N) for std::list) is that std::multiset keeps its elements sorted by value, so you cannot choose the position at which an element is inserted. This behaves differently from std::list. It also means that your elements need to be comparable using std::less<>, or you need to provide your own comparator.
An alternative would be to use std::unordered_multiset (i.e. a hash table), which has average-case O(1) element access, but then there is no deterministic order either. Again, your elements need to be usable with std::hash<>, or you need to write your own hash function.
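To make the ordering caveat concrete, here is a small sketch (the helper names are made up for this illustration) that feeds the same values into a std::multiset and into a std::list with caller-chosen positions, and returns the resulting iteration order:

```cpp
#include <list>
#include <set>
#include <vector>

// Returns the elements in the order a std::multiset iterates over them.
// The container decides the position: elements are always sorted by value.
std::vector<int> multiset_order(const std::vector<int>& input) {
    std::multiset<int> ms(input.begin(), input.end());
    return std::vector<int>(ms.begin(), ms.end());
}

// Returns the elements in the order a std::list iterates over them
// when every element is inserted at the front.
// Here the caller decides the position of each element.
std::vector<int> list_front_insert_order(const std::vector<int>& input) {
    std::list<int> l;
    for (int x : input) {
        l.insert(l.begin(), x);
    }
    return std::vector<int>(l.begin(), l.end());
}
```

For the input 3, 1, 2 the multiset yields 1, 2, 3 while the list yields 2, 1, 3, so a multiset is only a substitute when sorted order is what you actually want.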

Related

find() vs binary_search() in STL

Which function is more efficient in searching an element in vector find() or binary_search() ?
The simple answer is: std::find for unsorted data and std::binary_search for sorted data. But I think there's much more to this:
Both methods take a range [start, end) with n elements and a value x that is to be found as input. But note the important difference that std::binary_search only returns a bool that tells you whether the range contained the element or not, while std::find() returns an iterator. So both have different, but overlapping use cases.
std::find() is pretty straightforward: O(n) iterator increments and O(n) comparisons. Also, it doesn't matter whether the input data is sorted or not.
For std::binary_search() you need to consider multiple factors:
It only works on sorted data. You need to take the cost of sorting into account.
The number of comparisons is always O(log n).
If the iterators do not satisfy LegacyRandomAccessIterator, the number of iterator increments is O(n); it is logarithmic when they do satisfy this requirement.
Conclusion (a bit opinionated):
when you operate on unsorted data or need the location of the item you searched for, use std::find() (for sorted data, std::lower_bound can also give you the location)
when your data is already sorted or needs to be sorted anyway and you simply want to check if an element is present or not, use std::binary_search()
If you want to search containers like std::set, std::map or their unordered counterparts also consider their builtin methods like std::set::find
When you are not sure whether the data is sorted, you have to use find(); if the data is known to be sorted, you should use binary_search().
For more information, you can refer to the documentation of find() and binary_search().
If your input is sorted, then you can use binary_search, as it will take O(lg n) time. If your input is unsorted, you can use find, which will take O(n) time.
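The difference in return types discussed above can be sketched as follows. The function names are hypothetical; only std::find, std::binary_search and std::lower_bound come from the standard library:

```cpp
#include <algorithm>
#include <vector>

// Linear search: works on unsorted data and returns the index, or -1 if absent.
int index_by_find(const std::vector<int>& v, int x) {
    std::vector<int>::const_iterator it = std::find(v.begin(), v.end(), x);
    return it == v.end() ? -1 : static_cast<int>(it - v.begin());
}

// Binary search: requires sorted input and only answers yes or no.
bool contains_sorted(const std::vector<int>& sorted, int x) {
    return std::binary_search(sorted.begin(), sorted.end(), x);
}

// Binary search that also yields a position: std::lower_bound returns the
// first element that is not less than x, so we still have to check for equality.
int index_by_lower_bound(const std::vector<int>& sorted, int x) {
    std::vector<int>::const_iterator it =
        std::lower_bound(sorted.begin(), sorted.end(), x);
    if (it != sorted.end() && *it == x) {
        return static_cast<int>(it - sorted.begin());
    }
    return -1;
}
```

Calling contains_sorted or index_by_lower_bound on unsorted input compiles fine but gives meaningless results, which is why the "is my data sorted?" question has to be settled first.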

What is the most efficient std container for non-duplicated items?

What is the most efficient way of adding non-repeated elements into an STL container, and what kind of container is the fastest? I have a large amount of data and I'm afraid that checking each time whether an element is new takes a lot of time. I hope map is very fast.
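// 1- Map
map<int, int> Map;
...
// add the element only if it is not in the map yet
if(Map.find(Element)==Map.end()) Map[Element]=ID;
// 2-Vector
vector<int> Vec;
...
// add the element only if it is not in the vector yet
if(find(Vec.begin(), Vec.end(), Element)==Vec.end()) Vec.push_back(Element);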
// 3-Set
// Edit: I made a mistake: set::find is O(LogN) not O(N)
Both set and map have O(log(N)) performance for looking up keys. vector has O(N).
The difference between set and map, as far as you should be concerned, is whether you need to associate a key with a value, or just store a value directly. If you need the former, use a map, if you need the latter, use a set.
In both cases, you should just use insert() instead of doing a find().
The reason is insert() will insert the value into the container if and only if the container does not already contain that value (in the case of map, if the container does not contain that key). This might look like
Map.insert(std::make_pair(Element, ID));
for a map or
Set.insert(Element);
for a set.
You can consult the return value to determine whether or not an insertion was actually performed.
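As a minimal sketch of consulting that return value (the helper names add_if_new and id_for are invented for this example), insert() on std::set and std::map returns a pair of an iterator and a bool:

```cpp
#include <map>
#include <set>
#include <utility>

// Returns true if `element` was new and has been added to `seen`.
bool add_if_new(std::set<int>& seen, int element) {
    // .second is true only when an insertion actually took place.
    return seen.insert(element).second;
}

// Returns the ID stored for `element` after the call:
// `id` if the element was new, the previously stored ID otherwise.
int id_for(std::map<int, int>& ids, int element, int id) {
    std::pair<std::map<int, int>::iterator, bool> result =
        ids.insert(std::make_pair(element, id));
    // .first points at the entry with that key, whether new or pre-existing.
    return result.first->second;
}
```

Note that map::insert never overwrites an existing value, unlike Map[Element] = ID, so the first ID assigned to an element wins.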
If you're using C++11, you have two more choices, which are std::unordered_map and std::unordered_set. These both have average-case O(1) performance for insertions and lookups. However, they also require that the key (or value, in the case of set) be hashable, which means you'll need to specialize std::hash<> for your key if it is a user-defined type. Conversely, std::map and std::set require that your key (or value, in the case of set) respond to operator<().
If you're using C++11, you can use std::unordered_set. That would allow you O(1) existence-checking (technically average-case O(1), with O(n) in the worst case).
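For a user-defined key, the hashing requirement might look roughly like this. Employee is a hypothetical type invented for the example, and the way the two member hashes are combined is just one simple choice, not a recommendation:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Hypothetical key type, used only for this illustration.
struct Employee {
    std::string name;
    int badge;
};

// Unordered containers compare keys with operator== by default.
bool operator==(const Employee& a, const Employee& b) {
    return a.badge == b.badge && a.name == b.name;
}

namespace std {
// Specialization of std::hash for the user-defined key type.
template <>
struct hash<Employee> {
    std::size_t operator()(const Employee& e) const {
        // Simple combination of the member hashes (an assumption of this
        // sketch); a real project may want a stronger mixing function.
        return std::hash<std::string>()(e.name) ^
               (std::hash<int>()(e.badge) << 1);
    }
};
}  // namespace std

// Returns true if the employee was not in the set yet.
bool add_employee(std::unordered_set<Employee>& staff, const Employee& e) {
    return staff.insert(e).second;
}
```

With std::set the same type would instead need an operator< (or a custom comparator) and no hash at all.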
std::set would probably be your second choice with O(lg n).
Basically, std::unordered_set is a hash table and std::set is a tree structure (a red-black tree in every implementation I've ever seen) [1].
Depending on how well your hashes distribute and how many items you have, a std::set might actually be faster. If it's truly performance critical, then as always, you'll want to do benchmarking.
[1] Technically speaking, I don't believe either is required to be implemented as a hash table or as a balanced BST. If I remember correctly, the standard just mandates the run time bounds, not the implementation -- it just turns out that those are the only viable implementations that fit the bounds.
You should use a std::set; it is a container designed to hold a single (equivalent) copy of an object and is implemented as a binary search tree. Therefore, it is O(log N), not O(N), in the size of the container.
std::set and std::map often share a large part of their underlying implementation; you should check out your local STL implementation.
Having said all this, complexity is only one measure of performance. You may get better performance using a sorted vector, as it keeps the data contiguous in memory and is therefore more likely to hit the caches. Cache locality is a large part of data structure design these days.
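A sorted vector that rejects duplicates could be sketched like this (insert_unique_sorted is a made-up helper name). The lookup is O(log N), but the insertion itself is O(N) because later elements have to shift, so whether it beats std::set depends on the workload and should be measured:

```cpp
#include <algorithm>
#include <vector>

// Inserts `x` into the sorted vector `v` unless it is already present.
// Returns true if an insertion took place. Assumes `v` is sorted ascending.
bool insert_unique_sorted(std::vector<int>& v, int x) {
    // Binary search for the first element that is not less than x.
    std::vector<int>::iterator pos = std::lower_bound(v.begin(), v.end(), x);
    if (pos != v.end() && *pos == x) {
        return false;  // duplicate, nothing to do
    }
    v.insert(pos, x);  // keeps the vector sorted; shifts the tail
    return true;
}
```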
Sounds like you want to use a std::set. Its elements are unique, so you don't need to care about uniqueness when adding elements, and a.find(k) (where a is a std::set and k is a value) is defined as being logarithmic in complexity.
If your elements can be hashed in O(1), then it is better to keep them in an unordered_map or unordered_set (not in a map/set, because those use a red-black tree in their implementation, which has O(log N) find complexity).
Your examples show a definite pattern:
check if the value is already in container
if not, add the value to the container.
Both of these operations would potentially take some time. First, looking up an element can be done in O(N) time (linear search) if the elements are not arranged in any particular manner (e.g., just a plain std::vector), it could be done in O(log N) time (binary search) if the elements are sorted (e.g., either std::map or std::set), and it could be done in O(1) time on average if the elements are hashed (e.g., either std::unordered_map or std::unordered_set).
The insertion will be O(1) (amortized) for a plain vector or an unordered container (hash container), although the hash container will be a bit slower. For a sorted container like set or map, you'll have log-time insertions because it needs to look for the place to insert it before inserting it.
So, the conclusion: use std::unordered_set or std::unordered_map (if you need the key-value feature). And you don't need to check before doing the insertion; these are unique-key containers, so they don't allow duplicates.
If std::unordered_set / std::unordered_map (from C++11) or std::tr1::unordered_set / std::tr1::unordered_map (since 2007) are not available to you (or any equivalent), then the next best alternative is std::set / std::map.

Difference between std::set and std::priority_queue

Since both std::priority_queue and std::set (and std::multiset) are data containers that store elements and allow you to access them in an ordered fashion, and have same insertion complexity O(log n), what are the advantages of using one over the other (or, what kind of situations call for the one or the other?)?
While I know that the underlying structures are different, I am not as much interested in the difference in their implementation as I am in the comparison of their performance and suitability for various uses.
Note: I know about the no-duplicates in a set. That's why I also mentioned std::multiset since it has the exactly same behavior as the std::set but can be used where the data stored is allowed to compare as equal elements. So please, don't comment on single/multiple keys issue.
A priority queue only gives you access to one element in sorted order -- i.e., you can get the highest priority item, and when you remove that, you can get the next highest priority, and so on. A priority queue also allows duplicate elements, so it's more like a multiset than a set. [Edit: As @Tadeusz Kopec pointed out, building a heap is also linear in the number of items in the heap, whereas building a set is O(N log N) unless it's being built from a sequence that's already ordered (in which case it is also linear).]
A set allows you full access in sorted order, so you can, for example, find two elements somewhere in the middle of the set, then traverse in order from one to the other.
std::priority_queue allows you to do the following:
Insert an element O(log n)
Get the top element O(1) (the largest by default, or the smallest with std::greater as the comparator)
Erase the top element O(log n)
while std::set has more possibilities:
Insert any element O(log n) and the constant is greater than in std::priority_queue
Find any element O(log n)
Find the first element >= the one you are looking for O(log n) (lower_bound)
Erase any element O(log n)
Erase any element by its iterator, amortized O(1)
Move to previous/next element in sorted order, amortized O(1)
Get the smallest element O(1)
Get the largest element O(1)
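The two capability lists can be illustrated with a short sketch. The helper names drain and elements_in_range are made up for this example; std::multiset is used so that duplicates are allowed in both containers:

```cpp
#include <queue>
#include <set>
#include <vector>

// Drains a priority_queue: elements are only reachable one at a time,
// from the top. With the default comparator the largest comes out first.
std::vector<int> drain(std::priority_queue<int> pq) {
    std::vector<int> out;
    while (!pq.empty()) {
        out.push_back(pq.top());  // O(1) peek at the top element
        pq.pop();                 // O(log n) removal of the top element
    }
    return out;
}

// Returns all elements of `s` in the half-open value range [lo, hi).
// Assumes lo <= hi. A priority_queue has no equivalent of this.
std::vector<int> elements_in_range(const std::multiset<int>& s, int lo, int hi) {
    // Two O(log n) searches, then an in-order walk between the iterators.
    return std::vector<int>(s.lower_bound(lo), s.lower_bound(hi));
}
```

On a set you can also read both extremes directly with *s.begin() and *s.rbegin(), whereas a priority_queue exposes only one of them.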
set/multiset are generally backed by a binary tree. http://en.wikipedia.org/wiki/Binary_tree
priority_queue is generally backed by a heap. http://en.wikipedia.org/wiki/Heap_(data_structure)
So the question is really when should you use a binary tree instead of a heap?
Both structures are laid out in a tree; however, the rules about the relationship between ancestors are different.
We will call the positions P for parent, L for left child, and R for right child.
In a binary tree L < P < R.
In a heap P < L and P < R
So binary trees sort "sideways" and heaps sort "upwards".
So if we look at this as a triangle, then in the binary tree L, P, R are completely sorted, whereas in the heap the relationship between L and R is unknown (only their relationship to P is known).
This has the following effects:
If you have an unsorted array and want to turn it into a binary search tree, it takes O(n log n) time. If you want to turn it into a heap, it only takes O(n) time (as it just compares to find the extreme element).
Heaps are more efficient if you only need the extreme element (lowest or highest by some comparison function). Heaps only do the comparisons (lazily) necessary to determine the extreme element.
Binary search trees perform the comparisons necessary to order the entire collection, and keep the entire collection sorted at all times.
Heaps have constant-time lookup (peek) of the lowest element; a plain binary search tree has logarithmic-time lookup of the lowest element (although std::set::begin() is required to be constant time).
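The "heap only orders upwards" point can be tried out with the standard heap algorithms. This is a sketch with made-up function names; note that std::make_heap builds a max-heap by default, so the extreme element here is the largest one:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Builds a max-heap in place (linear time) and returns the largest element.
// Only the heap property is established, not a full sort.
int largest_via_heap(std::vector<int> v) {
    assert(!v.empty());  // precondition of this sketch
    std::make_heap(v.begin(), v.end());
    return v.front();  // the extreme element sits at the root
}

// Fully sorts the data: O(n log n), but every element ends up in order.
std::vector<int> fully_sorted(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    return v;
}
```

If all you ever need is the extreme element, the heap does less work; if you need to walk the data in order, you end up paying for the full sort anyway.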
Even though insert and erase operations for both containers have the same complexity O(log n), these operations are slower for std::set than for std::priority_queue. That's because std::set makes many memory allocations: every element of a std::set is stored in its own allocation, whereas std::priority_queue (with std::vector as the default underlying container) stores all elements in a single contiguous allocation. On the other hand, std::priority_queue performs many swap operations on its elements, whereas std::set only swaps pointers. So if swapping is a very slow operation for the element type, using std::set may be more efficient. Moreover, the element type may not be swappable at all.
The memory overhead of std::set is also much bigger, because it has to store many pointers between its nodes.

Set find member vs. using find on list

Since the items in a Standard Library set container are sorted, will using the find member on the set, in general, perform faster than using the find algorithm on the same items in a sorted list?
Since the list is linear and the set is often implemented using a sorted tree, it seems as though the set-find should be faster.
With a linked list, even a sorted one, finding an element is O(n). A set can be searched in O(log n). Therefore yes, finding an element in a set is asymptotically faster.
A sorted array/vector can be searched in O(log n) by using binary search. Unfortunately, since a linked list doesn't support random access, the same method can't be used to search a sorted linked list in O(log n).
It's actually in the standard: std::set::find() has complexity O(log n), where n is the number of elements in the set. std::find() on the other hand is linear in the length of the search range.
If your generic search range happens to be sorted and has random access (e.g. a sorted vector), then you can use std::lower_bound() to find an element (or rather a position) efficiently.
Note that std::set comes with its own member function lower_bound(), which works the same way. Having an insertion position may be useful even in a set, because insert() with a correct hint has amortized O(1) complexity.
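A minimal sketch of combining the member lower_bound() with a hinted insert (insert_with_hint is a hypothetical helper name) could look like this:

```cpp
#include <set>

// Inserts `x` into `s`, reusing the position found by lower_bound as a hint.
// Returns true if `x` was not in the set before.
bool insert_with_hint(std::set<int>& s, int x) {
    std::set<int>::iterator pos = s.lower_bound(x);  // O(log n) search
    if (pos != s.end() && *pos == x) {
        return false;  // already present
    }
    // `pos` is the element that will follow x, so the hint is exact and
    // the insertion itself does not need a second tree search.
    s.insert(pos, x);
    return true;
}
```

This pattern pays off when you need the lookup result anyway; if you only want to insert, a plain s.insert(x) does the same job in one call.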
You can generally expect a find operation to be faster on a Set than on a List, since lists are linear access (O(n)), while sets may have near-constant access for HashSets (O(1)), or logarithmic access for TreeSets (O(log n)).
set::find has a complexity of O(log(n)), while std::find has a complexity of O(n). This means that std::set::find() is asymptotically faster than std::find(std::list), but that doesn't mean it is faster for any particular data set, or for any particular search.
I found this article helpful on the topic. http://lafstern.org/matt/col1.pdf
You could reconsider your requirements for just a "list" vs. a "set". According to that article, if your program consists primarily of a bunch of insertions at the start, and then after that, only comparisons to what you have stored, then you are better off with adding everything to a vector, using std::sort (vector.begin(), vector.end()) once, and then using lower_bound. In my particular application, I load from a text file a list of names when the program starts up, and then during program execution I determine if a user is in that list. If they are, I do something, otherwise, I do nothing. In other words, I had a single discrete insertion phase, then I sorted, then after that I used std::binary_search (vector.begin(), vector.end(), std::string username) to determine whether the user is in the list.
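The "load once, sort once, then only look up" pattern described above might be sketched as follows. NameList is a hypothetical class name, and reading the names from a text file is left out to keep the example self-contained:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Holds a fixed list of names: one insertion phase, one sort, then lookups.
class NameList {
public:
    explicit NameList(std::vector<std::string> names)
        : names_(std::move(names)) {
        std::sort(names_.begin(), names_.end());  // done exactly once
    }

    // O(log n) membership test on the sorted vector.
    bool contains(const std::string& user) const {
        return std::binary_search(names_.begin(), names_.end(), user);
    }

private:
    std::vector<std::string> names_;
};
```

If names had to be added during execution as well, every insertion would have to keep the vector sorted, and a std::set could become the better fit again.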

STL like container with O(1) performance

I couldn't find an answer but I am pretty sure I am not the first one looking for this.
Has anyone seen or used an STL-like container with a bidirectional iterator that has O(1) complexity for Insert/Erase/Lookup?
Thank you.
There is no abstract data type with O(1) complexity for Insert, Erase AND Lookup which also provides a bi-directional access iterator.
Edit:
This is true for an arbitrarily large domain. Given a sufficiently small domain you can implement a set with O(1) complexity for Insert, Erase and Lookup and a bidirectional access iterator using an array and a doubly linked list:
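// element type int is assumed here; values must lie in [0, MAX_VALUE)
std::list<int>::iterator array[MAX_VALUE];
std::list<int> list;
Initialise:
for (int i = 0; i < MAX_VALUE; i++)
    array[i] = list.end();   // end() marks "value not present"
Insert:
if (array[value] == list.end())   // only insert if not already present
    array[value] = list.insert(list.end(), value);
Erase:
if (array[value] != list.end()) {
    list.erase(array[value]);
    array[value] = list.end();
}
Lookup:
array[value] != list.end()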
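Wrapped into a class, the array-plus-list idea could look like the sketch below. SmallDomainSet and its member names are invented for this illustration, the element type is assumed to be int, and iteration follows insertion order rather than sorted order:

```cpp
#include <cassert>
#include <list>
#include <vector>

// Set of integers in [0, max_value) with O(1) insert, erase and lookup,
// plus bidirectional iteration over the stored values via items().
class SmallDomainSet {
public:
    // Assumes max_value >= 0.
    explicit SmallDomainSet(int max_value)
        : slots_(max_value, items_.cend()) {}

    // Copying would leave iterators pointing into the other object's list.
    SmallDomainSet(const SmallDomainSet&) = delete;
    SmallDomainSet& operator=(const SmallDomainSet&) = delete;

    bool insert(int value) {
        assert(in_range(value));
        if (slots_[value] != items_.cend()) return false;  // already present
        slots_[value] = items_.insert(items_.cend(), value);
        return true;
    }

    bool erase(int value) {
        assert(in_range(value));
        if (slots_[value] == items_.cend()) return false;  // not present
        items_.erase(slots_[value]);
        slots_[value] = items_.cend();
        return true;
    }

    bool contains(int value) const {
        assert(in_range(value));
        return slots_[value] != items_.cend();
    }

    const std::list<int>& items() const { return items_; }

private:
    bool in_range(int value) const {
        return value >= 0 && value < static_cast<int>(slots_.size());
    }

    // items_ is declared first so that items_.cend() is valid when
    // slots_ is initialised; cend() doubles as the "absent" marker.
    std::list<int> items_;
    std::vector<std::list<int>::const_iterator> slots_;
};
```

The memory cost is proportional to the size of the domain rather than to the number of stored elements, which is why this only works for a sufficiently small domain.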
tr1's unordered_set (also available in boost) is probably what you are looking for. You don't specify whether you want a sequence container, and you don't specify what you are using to get O(1) lookup (i.e. vectors have O(1) lookup by index, while the unordered_set mentioned above has O(1) average-case lookup based on the element itself).
In practice, it may be sufficient to use array (vector) and defer costs of inserts and deletes.
Delete element by marking it as deleted, insert element into bin at desired position and remember offset for larger indices.
Inserts and deletes will be O(1) plus an O(N) cleanup at a convenient time; lookup will be O(1) on average, and O(number of changes since last cleanup) in the worst case.
Associative arrays (hashtable) have O(1) lookup complexity, while doubly linked lists have O(1) bidi iteration.
One trick I've used when messing about with storage optimization is to implement a linked list with an add of O(1)[1], then have a caching operation which provides a structure with a faster O(n) lookup[2]. The actual cache takes some O(n) time to build, and I didn't focus on erase. So I 'cheated' a bit and pushed the work into another operation. But if you don't have to do a ton of adds/deletes, it's not a bad way to do it.
[1] Store end pointer and only add onto the end. No traversal required.
[2] I created a dynamic array[3] and searched against it. Since the data wasn't sorted, I couldn't binsearch against it for O(lg n) time. Although I suppose I could have sorted it.
[3]Arrays have better cache performance than lists.
A full list of all the complexity guarantees for the STL can be found here:
What are the complexity guarantees of the standard containers?
Summary:
Insert: No container guarantees O(1) for generic insert.
The only container type with a generic insert guarantee is the 'Associative Container', and it is O(log(n)).
There are containers that provide limited insert guarantees:
Front Sequence guarantees an insert at the head in O(1)
Back Sequence guarantees an insert at the tail in O(1)
Erase:
The Associative Containers guarantee amortized O(1) for erase (if you have an iterator).
Lookup:
If by lookup you mean element access (as no container has O(1) find capabilities), then the Random Access Container is the only container type with O(1) access.
So the answer is based on container types.
These are the concepts the standard's guarantees are defined for. How does this translate to real containers?
std::vector: Sequence, Back Sequence, Forward/Reverse/Random Container
std::deque: Sequence, Front/Back Sequence, Forward/Reverse/Random Container
std::list: Sequence, Front/Back Sequence, Forward/Reverse Container
std::set: Sorted/Simple/Unique Associative Container, Forward Container
std::map: Sorted/Pair/Unique Associative Container, Forward Container
std::multiset: Sorted/Simple/Multiple Associative Container, Forward Container
std::multimap: Sorted/Pair/Multiple Associative Container, Forward Container
You won't be able to fit all of your requirements into one container... something's gotta give ;)
However, maybe this is interesting for you:
http://www.cplusplus.com/reference/stl/