Optimal way to search a std::set - c++

How should one search a std::set, when speed is the critical criterion for his/her project?
set::find?
Complexity:
Logarithmic in size.
std::binary_search?
Complexity:
On average, logarithmic in the distance between first and last: Performs approximately log2(N)+2 element comparisons (where N is this distance).
On non-random-access iterators, the iterator advances produce themselves an additional linear complexity in N on average.
Or a hand-written binary search? Or is the STL's version good enough?
Is there a way to answer this theoretically, or do we have to test it ourselves? If someone already has, it would be nice if they shared the results with us (if not, we are not lazy :) ).

The iterator type provided by std::set is a bidirectional_iterator, a category which does not require random access to elements, but only element-wise movements in both directions. All random_access_iterator's are bidirectional_iterators, but not vice versa.
Using std::binary_search on a std::set can therefore take O(n) time, as per the remarks you quoted (the number of comparisons stays logarithmic, but advancing the iterators is linear), while std::set::find is guaranteed to be O(log n).
So to search a set, use set::find.
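As a minimal sketch of the two options compared above (the function names contains_member and contains_generic are made up for illustration), both calls return the same answer, but only the member function avoids stepping through the set node by node:

```cpp
#include <algorithm>
#include <set>

// Membership test via the member function: descends the tree from the root,
// so it needs O(log n) comparisons and O(log n) node visits.
bool contains_member(const std::set<int>& s, int key) {
    return s.find(key) != s.end();
}

// Same answer via the generic algorithm. The number of comparisons is still
// logarithmic, but std::set iterators are only bidirectional, so the
// algorithm has to advance through the range step by step: O(n) iterator moves.
bool contains_generic(const std::set<int>& s, int key) {
    return std::binary_search(s.begin(), s.end(), key);
}
```

Whether the difference matters for a given set size is something to measure; the sketch only shows that the two calls are interchangeable in result, not in cost.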

std::set does not provide a random access iterator. Even if it did, std::binary_search would access at least as many nodes as .find, since .find accesses only the ancestors of the target node.

Related

Time complexity of std::lower_bound on a sorted vector

I was studying std::upper_bound from http://www.cplusplus.com/reference/algorithm/upper_bound/
and I came across the fact that this might run in linear time on non-random access iterators.
I need to use this for a sorted vector. I don't know what non-random-access iterators are, or whether this will run in logarithmic time on a sorted vector.
Can anyone clear this up for me?
§ 23.3.6.1 [vector.overview]/p1:
A vector is a sequence container that supports random access iterators.
A random access iterator is one that can compute the offset of an arbitrary element in constant time, without needing to iterate from one place to another (which would result in linear complexity).
std::lower_bound itself provides generic implementation of the binary search algorithm, that doesn't care much about what iterator is used to indicate ranges (it only requires the iterator to be of at least a forward category). It uses helper functions like std::advance to iteratively limit the ranges in its binary search. With std::vector<T>::iterator which is of a random access category, std::lower_bound runs with logarithmic time complexity with regards to the number of steps required to iterate over elements, as it can partition the range by half in each step in a constant time.
§ 25.4.3 [alg.binary.search]/p1:
All of the algorithms in this section are versions of binary search and assume that the sequence being searched is partitioned with respect to an expression formed by binding the search key to an argument of the implied or explicit comparison function. They work on non-random access iterators minimizing the number of comparisons, which will be logarithmic for all types of iterators. They are especially appropriate for random access iterators, because these algorithms do a logarithmic number of steps through the data structure. For non-random access iterators they execute a linear number of steps.
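The claim in the quoted paragraph, that the number of comparisons is logarithmic for every iterator category, can be checked with a counting comparator. This is only a sketch: the helper name comparisons_for_lower_bound is an assumption, and it counts comparisons only, not the iterator steps that make the list case linear.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <vector>

// Runs std::lower_bound over a sorted range of int and reports how many
// times the comparison was invoked. Accepts any forward-iterator range.
template <typename Iter>
std::size_t comparisons_for_lower_bound(Iter first, Iter last, int key) {
    std::size_t count = 0;
    Iter pos = std::lower_bound(first, last, key,
                                [&count](int element, int value) {
                                    ++count;            // one more comparison
                                    return element < value;
                                });
    (void)pos;  // the position itself is not needed in this sketch
    return count;
}
```

Called on a sorted std::vector<int> and on a std::list<int> with the same 1024 elements, both counts should stay around log2(1024), even though the list version additionally walks a linear number of nodes to move its iterators.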

std::map get the lowest n elements time

According to the documentation, std::map is typically implemented as a binary search tree and keeps its elements sorted.
I need to insert and retrieve elements quickly. I also need to get the lowest N elements from time to time.
I was thinking about using a std::map. Is it a good choice? If it is, how long would it take to retrieve the lowest N elements? O(n log n)?
Given you need both retrieval and n smallest, I would say std::map is reasonable choice. But depending on the exact access pattern std::vector with sorting might be a good choice too.
I am not sure what you mean by retrieve. Reading k elements takes O(k) time (provided you do it sequentially using an iterator). Removing them takes O(k log n) if each one is erased by key (n is the total number of elements); erasing through iterators is amortized constant time per element.
You can use iterators to rapidly read through the lowest N elements. Going from begin() to the (N-1)th element will take O(N) time (getting the next element is amortised constant time for a std::map).
I'd note, however, that it is often actually faster to use a sorted std::vector with a binary chop search method to implement what it sounds like you are doing so depending on your exact requirements this might be worth investigating.
The C++ standard requires that all required iterator operations (including iterator increment) be amortized constant time. Consequently, getting the first N items in a container must take amortized O(N) time.
I would say yes to both questions.
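The iterator-based retrieval described in the answers above might look like the following sketch. The function name lowest_n and the key/value types are assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Copies the n smallest entries (by key) out of the map.
// Each ++it is amortized constant time, so the loop costs O(n) for the
// n entries read, independent of the total size of the map.
std::vector<std::pair<int, std::string>>
lowest_n(const std::map<int, std::string>& m, std::size_t n) {
    std::vector<std::pair<int, std::string>> out;
    for (auto it = m.begin(); it != m.end() && out.size() < n; ++it) {
        out.emplace_back(it->first, it->second);
    }
    return out;
}
```

If the map holds fewer than n entries, the sketch simply returns all of them.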

Set find member vs. using find on list

Since the items in a Standard Library set container are sorted, will using the find member on the set, in general, perform faster than using the find algorithm on the same items in a sorted list?
Since the list is linear and the set is often implemented using a sorted tree, it seems as though the set-find should be faster.
With a linked list, even a sorted one, finding an element is O(n). A set can be searched in O(log n). Therefore yes, finding an element in a set is asymptotically faster.
A sorted array/vector can be searched in O(log n) by using binary search. Unfortunately, since a linked list doesn't support random access, the same method can't be used to search a sorted linked list in O(log n).
It's actually in the standard: std::set::find() has complexity O(log n), where n is the number of elements in the set. std::find() on the other hand is linear in the length of the search range.
If your generic search range happens to be sorted and has random access (e.g. a sorted vector), then you can use std::lower_bound() to find an element (or rather a position) efficiently.
Note that std::set comes with its own member-lower_bound(), which works the same way. Having an insertion position may be useful even in a set, because insert() with a correct hint has complexity O(1).
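A short sketch of the hinted insertion mentioned above, assuming a std::set<int> and a hypothetical helper named insert_if_absent: the position returned by the member lower_bound() is reused as the hint, so the tree is searched only once.

```cpp
#include <set>

// Inserts key unless it is already present, reusing the position found by
// the member lower_bound() as an insertion hint. Returns true if inserted.
bool insert_if_absent(std::set<int>& s, int key) {
    auto pos = s.lower_bound(key);   // O(log n): first element not less than key
    if (pos != s.end() && *pos == key) {
        return false;                // already present, nothing to do
    }
    s.insert(pos, key);              // amortized constant: key goes right before pos
    return true;
}
```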
You can generally expect a find operation to be faster on a Set than on a List, since lists are linear access (O(n)), while sets may have near-constant access for HashSets (O(1)), or logarithmic access for TreeSets (O(log n)).
set::find has a complexity of O(log(n)), while std::find has a complexity of O(n). This means that std::set::find() is asymptotically faster than std::find(std::list), but that doesn't mean it is faster for any particular data set, or for any particular search.
I found this article helpful on the topic. http://lafstern.org/matt/col1.pdf
You could reconsider your requirements for just a "list" vs. a "set". According to that article, if your program consists primarily of a bunch of insertions at the start, and then after that, only comparisons to what you have stored, then you are better off with adding everything to a vector, using std::sort (vector.begin(), vector.end()) once, and then using lower_bound. In my particular application, I load from a text file a list of names when the program starts up, and then during program execution I determine if a user is in that list. If they are, I do something, otherwise, I do nothing. In other words, I had a single discrete insertion phase, then I sorted, then after that I used std::binary_search (vector.begin(), vector.end(), std::string username) to determine whether the user is in the list.
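The load-once, look-up-many pattern described above could be sketched roughly as follows. The class name NameList is hypothetical, and reading the names from a text file is left out.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: takes all names in one insertion phase, sorts them
// once, and afterwards only answers membership queries.
class NameList {
public:
    explicit NameList(std::vector<std::string> names)
        : names_(std::move(names)) {
        std::sort(names_.begin(), names_.end());   // one-off O(n log n)
    }

    // O(log n) per query, because vector iterators are random access.
    bool contains(const std::string& name) const {
        return std::binary_search(names_.begin(), names_.end(), name);
    }

private:
    std::vector<std::string> names_;
};
```

The trade-off is the one the article points at: later insertions would require re-sorting or an O(n) insert at the right position, so this only pays off when lookups dominate.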

Looking for special C++ data structure

I'm looking for a C++ implementation of a data structure ( or a combination of data structures ) that meet the following criteria:
items are accessed in the same way as in std::vector
provides random access iterator ( along with iterator comparison <,> )
average item access(:lookup) time is at worst of O(log(n)) complexity
items are iterated over in the same order as they were added to the container
given an iterator, i can find out the ordinal position of the item pointed to in the container, at worst of O(log(n)) complexity
provides item insertion and removal at specific position of at worst O(log(n)) complexity
removal/insertion of items does not invalidate previously obtained iterators
Thank you in advance for any suggestions
Dalibor
(Edit) Answers:
The answer I selected describes a data structure that meet all these requirements. However, boost::multi_index, as suggested by Maxim Yegorushkin, provides features very close to those above.
(Edit) Some of the requirements were not correctly specified. They are modified according to correction(:original)
(Edit) I've found an implementation of the data structure described in the accepted answer. So far, it works as expected. It's called counter tree
(Edit) Consider using the AVL-Array suggested by sp2danny
Based on your requirements boost::multi_index with two indices does the trick.
The first index is ordered index. It allows for O(log(n)) insert/lookup/remove. The second index is random access index. It allows for random access and the elements are stored in the order of insertion. For both indices iterators don't get invalidated when other elements are removed. Converting from one iterator to another is O(1) operation.
Let's go through these...
average item lookup time is at worst of O(log(n)) complexity
removal/insertion of items does not invalidate previously obtained iterators
provides item insertion and removal of at worst O(log(n)) complexity
That pretty much screams "tree".
provides random access iterator ( along with iterator comparison <,> )
given an iterator, i can find out the ordinal position of the item pointed to in the container, at worst of O(log(n)) complexity
items are iterated over in the same order as they were added to the container
I'm assuming that the index you're providing your random-access iterator is by order of insertion, so [0] would be the oldest element in the container, [1] would be the next oldest, etc. This means that, on deletion, for the iterators to be valid, the iterator internally cannot store the index, since it could change without notice. So just using a map with the key being the insertion order isn't going to work.
Given that, each node of your tree needs to keep track of how many elements are in each subtree, in addition to its usual members. This will allow random access in O(log(N)) time. I don't know of a ready-to-go set of code, and the standard library does not expose its tree classes publicly, so an existing red-black tree implementation (for example the internal one behind your standard library's std::set) would be my starting point.
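To make the subtree-count idea concrete, here is a rough sketch of a sequence stored in a tree with cached subtree sizes. Note the substitutions: it balances with random priorities (a treap) rather than a red-black or AVL tree, all names (Node, insert_at, at) are invented for illustration, memory is never freed, and finding the position of a given node (which would need parent pointers) is not covered.

```cpp
#include <cstddef>
#include <cstdlib>

// One tree node; size caches the number of nodes in its subtree.
struct Node {
    int value;
    int priority;                 // random heap priority used for balancing
    std::size_t size = 1;
    Node* left = nullptr;
    Node* right = nullptr;
    explicit Node(int v) : value(v), priority(std::rand()) {}
};

std::size_t size_of(const Node* n) { return n ? n->size : 0; }
void update(Node* n) { n->size = 1 + size_of(n->left) + size_of(n->right); }

// Splits t by position: a receives the first k elements, b the rest.
void split(Node* t, std::size_t k, Node*& a, Node*& b) {
    if (!t) { a = b = nullptr; return; }
    if (size_of(t->left) < k) {   // t itself is among the first k elements
        split(t->right, k - size_of(t->left) - 1, t->right, b);
        a = t;
    } else {                      // t belongs to the remainder
        split(t->left, k, a, t->left);
        b = t;
    }
    update(t);
}

// Concatenates two sequences, keeping the higher priority nearer the root.
Node* merge(Node* a, Node* b) {
    if (!a) return b;
    if (!b) return a;
    if (a->priority > b->priority) {
        a->right = merge(a->right, b);
        update(a);
        return a;
    }
    b->left = merge(a, b->left);
    update(b);
    return b;
}

// Inserts value so that it ends up at index pos (0-based); returns the new root.
Node* insert_at(Node* root, std::size_t pos, int value) {
    Node* a = nullptr;
    Node* b = nullptr;
    split(root, pos, a, b);
    return merge(merge(a, new Node(value)), b);
}

// Finds the element at index pos by descending with the cached sizes.
const Node* at(const Node* t, std::size_t pos) {
    while (t) {
        std::size_t left_count = size_of(t->left);
        if (pos < left_count) {
            t = t->left;
        } else if (pos == left_count) {
            return t;
        } else {
            pos -= left_count + 1;
            t = t->right;
        }
    }
    return nullptr;               // pos is out of range
}
```

With random priorities the expected depth is logarithmic, so insert_at and at run in expected O(log n). Because nodes are never moved in memory, a pointer to a node stays valid across other insertions, which is the property the iterator requirement relies on.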
See here: STL Containers (scroll down the page to see information on algorithmic complexity) and I think std::deque fits your requirements.
AVL-Array should fit the bill.
Here's my "lv" container, which fits the requirements with O(log n) insert/delete/access time.
https://github.com/xhawk18/lv
The container is a header-only C++ library,
and has the same iterators and functions as other C++ containers, such as list and vector.
The "lv" container is based on a red-black tree in which each node stores the number of nodes in its subtree. By checking the sizes of a node's left and right children, we can quickly reach any element by index.

Complexity of STL max_element

So according to the link here: http://www.cplusplus.com/reference/algorithm/max_element/ , the max_element function is O(n), apparently for all STL containers. Is this correct? Shouldn't it be O(log n) for a set (implemented as a binary tree)?
On a somewhat related note, I've always used cplusplus.com for questions which are easier to answer, but I would be curious what others think of the site.
It's linear because it touches every element.
It's pointless to even use it on a set or other ordered container using the same comparator because you can just use .rbegin() in constant time.
If you're not using the same comparison function there's no guarantee that the orders will coincide so, again, it has to touch every element and has to be at least linear.
Although algorithms may be specialized for different iterator categories, there is no way to specialize them based on whether an iterator range is ordered.
Most algorithms work on unordered ranges (max_element included), a few require the ranges to be ordered (e.g. set_union, set_intersection) some require other properties for the range (e.g. push_heap, pop_heap).
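For a set that uses the default ordering, the shortcut mentioned above can be sketched like this (both function names are made up for illustration, and both assume a non-empty set):

```cpp
#include <algorithm>
#include <cassert>
#include <set>

// Largest element of a non-empty set ordered by the default std::less:
// it is the last element, reachable in constant time.
int max_via_rbegin(const std::set<int>& s) {
    assert(!s.empty());
    return *s.rbegin();
}

// Same result through the generic algorithm, which visits every element
// because it cannot know that the range is already ordered.
int max_via_algorithm(const std::set<int>& s) {
    assert(!s.empty());
    return *std::max_element(s.begin(), s.end());
}
```

The two agree only because the set's comparator matches the one max_element uses by default; with a different comparator the last element need not be the maximum.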
The max_element function is O(n) for all STL containers.
This is incorrect, because max_element applies to iterators, not containers. Should you give it iterators from a set, it has no way of knowing they come from a set and will therefore traverse all of them in order looking for the maximum. So the correct sentence is:
The max_element function is O(n) for all forward iterators
Besides, if you know that you're manipulating a set, you already have access to methods that give you the max element faster than O(n), so why use max_element ?
It is an STL algorithm, so it does not know anything about the container. A linear search is the best it can do on a pair of forward iterators.
STL algorithms do not know what container you took the iterators from, whether or not it is ordered and what order constraints were used. It is a linear algorithm that checks all elements in the range while keeping track of the maximum value seen so far.
Note that even if you could use metaprogramming techniques to detect what type of container the iterators were obtained from, that is no guarantee that you can just skip to the last element to obtain the maximum:
int values[] = { 1, 2, 3, 4, 5 };
std::set<int, std::greater<int> > the_set( values, values+5 ); // needs <set> and <functional>
std::max_element( the_set.begin(), the_set.end() ); // needs <algorithm>; returns begin() here, not the last element
Even though the iterators come from a set, it is the first element, not the last, that holds the maximum. With more complex data types the set can be ordered by some other key that is unrelated to the min/max values.