Set find member vs. using find on list - c++

Since the items in a Standard Library set container are sorted, will using the find member on the set, in general, perform faster than using the find algorithm on the same items in a sorted list?
Since the list is linear and the set is often implemented using a sorted tree, it seems as though the set-find should be faster.

With a linked list, even a sorted one, finding an element is O(n). A set can be searched in O(log n). Therefore yes, finding an element in a set is asymptotically faster.
A sorted array/vector can be searched in O(log n) by using binary search. Unfortunately, since a linked list doesn't support random access, the same method can't be used to search a sorted linked list in O(log n).

It's actually in the standard: std::set::find() has complexity O(log n), where n is the number of elements in the set. std::find() on the other hand is linear in the length of the search range.
If your generic search range happens to be sorted and has random access (e.g. a sorted vector), then you can use std::lower_bound() to find an element (or rather a position) efficiently.
Note that std::set comes with its own member lower_bound(), which returns the same kind of position in O(log n). Having an insertion position may be useful even in a set, because insert() with a correct hint has amortized constant complexity.
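To make the comparison concrete, here is a minimal sketch of the lookups discussed above. The function names (in_set, in_list, in_sorted_vector, insert_with_hint) are made up for illustration and int keys are assumed; only the standard library calls come from the answers.

```cpp
#include <algorithm>
#include <list>
#include <set>
#include <vector>

// Membership test via the tree-aware member function: O(log n).
bool in_set(const std::set<int>& s, int key) {
    return s.find(key) != s.end();
}

// Membership test via the generic algorithm: O(n), even if the list is sorted.
bool in_list(const std::list<int>& l, int key) {
    return std::find(l.begin(), l.end(), key) != l.end();
}

// Binary search on a sorted vector: O(log n) thanks to random access.
bool in_sorted_vector(const std::vector<int>& v, int key) {
    auto pos = std::lower_bound(v.begin(), v.end(), key);
    return pos != v.end() && *pos == key;
}

// Insert using the position from the member lower_bound() as a hint.
// Returns true if the key was actually inserted.
bool insert_with_hint(std::set<int>& s, int key) {
    auto hint = s.lower_bound(key);  // first element not less than key
    auto before = s.size();
    s.insert(hint, key);             // amortized O(1) when the hint is right
    return s.size() != before;
}
```

The hinted insert only pays off when you already have the position for another reason, since obtaining it through lower_bound() costs O(log n) by itself.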

You can generally expect a find operation to be faster on a set than on a list, since lists are linear access (O(n)), while sets offer near-constant access for hash sets (std::unordered_set in C++, O(1) on average) or logarithmic access for tree sets (std::set, O(log n)).

set::find has a complexity of O(log(n)), while std::find has a complexity of O(n). This means that std::set::find() is asymptotically faster than std::find(std::list), but that doesn't mean it is faster for any particular data set, or for any particular search.

I found this article helpful on the topic. http://lafstern.org/matt/col1.pdf
You could reconsider your requirements for just a "list" vs. a "set". According to that article, if your program consists primarily of a bunch of insertions at the start, and then after that, only comparisons to what you have stored, then you are better off with adding everything to a vector, using std::sort (vector.begin(), vector.end()) once, and then using lower_bound. In my particular application, I load from a text file a list of names when the program starts up, and then during program execution I determine if a user is in that list. If they are, I do something, otherwise, I do nothing. In other words, I had a single discrete insertion phase, then I sorted, then after that I used std::binary_search (vector.begin(), vector.end(), std::string username) to determine whether the user is in the list.
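A rough sketch of that load-then-lookup pattern, assuming the names are already in a std::vector<std::string>. The helper names finish_loading and is_known_user are hypothetical, not taken from the article.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: sort once after the bulk-load phase is finished.
void finish_loading(std::vector<std::string>& names) {
    std::sort(names.begin(), names.end());
}

// O(log n) membership test; valid only while `names` stays sorted.
bool is_known_user(const std::vector<std::string>& names,
                   const std::string& username) {
    return std::binary_search(names.begin(), names.end(), username);
}
```

If names have to be added later, each insertion into the sorted vector costs O(n), which is the point where a std::set starts to look better again.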

Related

Is there a data structure like a C++ std set which also quickly returns the number of elements in a range?

In a C++ std::set (often implemented using red-black binary search trees), the elements are automatically sorted, and key lookups and deletions in arbitrary positions take time O(log n). (No amortisation is needed here: std::set is node-based, so it never reallocates the way a std::vector does when it outgrows its capacity.)
In a sorted C++ std::vector, lookups are also fast (actually probably a bit faster than std::set), but insertions are slow (since maintaining sortedness takes time O(n)).
However, sorted C++ std::vectors have another property: they can find the number of elements in a range quickly (in time O(log n)).
i.e., a sorted C++ std::vector can quickly answer: how many elements lie between given x,y?
std::set can quickly find iterators to the start and end of the range, but gives no clue how many elements are within.
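To illustrate the difference, a small sketch of the same range-count query on both containers, assuming int elements and an inclusive range [x, y]. The name count_in_range is made up for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Number of elements e with x <= e <= y in a sorted vector: O(log n).
std::size_t count_in_range(const std::vector<int>& sorted, int x, int y) {
    if (y < x) return 0;
    auto first = std::lower_bound(sorted.begin(), sorted.end(), x);
    auto last = std::upper_bound(first, sorted.end(), y);
    return static_cast<std::size_t>(last - first);  // constant-time subtraction
}

// Same query on a std::set: finding the bounds is O(log n), but
// std::distance has to walk the range, so the total is O(log n + k)
// for k elements in the range.
std::size_t count_in_range(const std::set<int>& s, int x, int y) {
    if (y < x) return 0;
    auto first = s.lower_bound(x);
    auto last = s.upper_bound(y);
    return static_cast<std::size_t>(std::distance(first, last));
}
```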
So, is there a data structure that allows all the speed of a C++ std::set (fast lookups and deletions), but also allows fast computation of the number of elements in a given range?
(By fast, I mean time O(log n), or maybe a polynomial in log n, or maybe even sqrt(n). Just as long as it's faster than O(n), since O(n) is no better than the trivial approach of scanning through everything).
(If not possible, even an estimate of the number to within a fixed factor would be useful. For integers a trivial upper bound is y-x+1, but how to get a lower bound? For arbitrary objects with an ordering there's no such estimate).
EDIT: I have just seen the related question, which essentially asks whether one can compute the number of preceding elements. (Sorry, my fault for not seeing it before). This is clearly trivially equivalent to this question (to get the number in a range, just compute the start/end elements and subtract, etc.)
However, that question also allows the data to be computed once and then be fixed, unlike here, so that question (and the sorted vector answer) isn't actually a duplicate of this one.
The data structure you're looking for is an order statistic tree.
It's typically implemented as a binary search tree in which each node additionally stores the size of its subtree.
Unfortunately, I'm pretty sure the STL doesn't provide one.
All data structures have their pros and cons, which is why the standard library offers a bunch of containers.
And the rule is that there is often a trade-off between the speed of modifications and the speed of data extraction. Here you would like to easily access the number of elements in a range. A possibility in a tree-based structure would be to cache in each node the number of elements of its subtree. That would add on the order of log(N) operations (the height of the tree) to each insertion or deletion, but would greatly speed up the computation of the number of elements in a range. Unfortunately, few classes from the C++ standard library are tailored for derivation (and AFAIK std::set is not), so you will have to implement your tree from scratch.
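As a rough illustration of the subtree-size idea, here is a minimal sketch with int keys. The class name CountingTree and its members are invented for this example, and the tree is deliberately left unbalanced to keep it short, so the O(log n) bounds only hold when the tree happens to stay shallow; a real implementation would add rebalancing (red-black, AVL, treap, ...) and update the cached sizes during rotations.

```cpp
#include <cstddef>
#include <memory>

// Sketch of an order statistic tree: an UNBALANCED binary search tree
// whose nodes cache the size of their subtree.
class CountingTree {
    struct Node {
        int key;
        std::size_t size = 1;  // nodes in this subtree, including this one
        std::unique_ptr<Node> left, right;
        explicit Node(int k) : key(k) {}
    };
    std::unique_ptr<Node> root_;

    static std::size_t size_of(const std::unique_ptr<Node>& n) {
        return n ? n->size : 0;
    }

    // Returns true if key was inserted (duplicates are ignored, as in std::set).
    static bool insert(std::unique_ptr<Node>& n, int key) {
        if (!n) {
            n = std::make_unique<Node>(key);
            return true;
        }
        if (key == n->key) return false;
        bool inserted = insert(key < n->key ? n->left : n->right, key);
        if (inserted) ++n->size;  // keep the cached size on the path up to date
        return inserted;
    }

    // Counts keys k with k < key (inclusive == false) or k <= key (inclusive == true).
    std::size_t count_below(int key, bool inclusive) const {
        std::size_t count = 0;
        const Node* n = root_.get();
        while (n) {
            bool node_counts = inclusive ? n->key <= key : n->key < key;
            if (node_counts) {
                count += size_of(n->left) + 1;  // node plus its whole left subtree
                n = n->right.get();
            } else {
                n = n->left.get();
            }
        }
        return count;
    }

public:
    bool insert(int key) { return insert(root_, key); }
    std::size_t size() const { return size_of(root_); }

    // Number of stored keys k with x <= k <= y, in time proportional to the height.
    std::size_t count_in_range(int x, int y) const {
        if (y < x) return 0;
        return count_below(y, true) - count_below(x, false);
    }
};
```

The range count is just the difference of two rank queries, which is the "compute the start/end and subtract" idea from the question. Deletion is omitted here; it would decrement the cached sizes along the path in the same way insertion increments them.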
Maybe you are looking for a C++ alternative to Java's LinkedHashSet: https://docs.oracle.com/javase/7/docs/api/java/util/LinkedHashSet.html

Binary Search on a Doubly Linked List

Is it possible to perform binary search on a doubly-linked list in Θ(log 𝑛) time?
My answer is yes because if the list is already somewhat ordered it could be faster than just O(n).
In order to do a binary search on a doubly-linked list, you're going to have to first iterate to the halfway-point of the list, so that you can do your first recursion on the two halves of the list.
Iterating to the halfway-point of a linked list is already an O(n) operation, since the time it takes to iterate to the halfway-point will grow linearly as the list itself gets longer.
So you're already at O(n) time, even before you've done any actual searching. Hence, the answer is no.
As you asked the question, the answer is no. You cannot get O(lg(n)) time on a linked list: traversal is linear, so the search cannot be better than O(n) in general. Binary search would in fact be slower than a linear scan in that case, since it must iterate repeatedly to "jump" around. It would be better to do a single linear scan to find the element.
However, the C++ standard specifies that std::lower_bound algorithm (which does a binary search) has the following complexity:
[lower.bound]
Complexity: At most log2(last - first) + O(1) comparisons and projections.
That is, the standard counts element comparisons, not time. If you measure time by the number of iterator advancements, the picture changes: std::lower_bound finds the proper place by calling std::advance() on an iterator many times, with a corresponding call to the comparator for each call to advance. On random access containers each advance is constant time, but on a list each of those calls takes O(N) iterator advancements.
That's why it is always so important to be clear what big-oh notation is measuring. Often the comparisons are a proxy for time, but not always!
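One way to see this is to count the comparisons yourself. The sketch below is only an illustration with made-up function names: it wraps the comparison in a counting lambda and runs std::lower_bound and a plain linear scan over the same sorted std::list<int>. It does not count iterator advancements, which is where the list version of lower_bound actually spends its time.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>

// Runs std::lower_bound on a sorted std::list and reports how many
// element comparisons it made (not how many iterator steps it took).
std::size_t comparisons_for_lower_bound(const std::list<int>& sorted, int key) {
    std::size_t comparisons = 0;
    auto counting_less = [&comparisons](int a, int b) {
        ++comparisons;
        return a < b;
    };
    (void)std::lower_bound(sorted.begin(), sorted.end(), key, counting_less);
    return comparisons;
}

// Linear scan over the same list: one comparison per visited element.
std::size_t comparisons_for_find(const std::list<int>& sorted, int key) {
    std::size_t comparisons = 0;
    (void)std::find_if(sorted.begin(), sorted.end(), [&](int value) {
        ++comparisons;
        return value == key;
    });
    return comparisons;
}
```

On a list of 1024 elements the first function reports only about log2(1024) comparisons, while the scan can report up to 1024. Both still walk O(n) list nodes, so which one is faster in practice depends on how expensive a comparison is relative to an iterator step.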

Is find() function efficient for sets?

As far as I know, binary search is the most efficient way to determine whether a certain element x exists in a sorted array. Thus, I was wondering whether it is a good idea to use the find() or count() functions of a set to look for an element, or whether it is more reasonable to use a sorted array rather than a set and apply binary search.
Yes it is efficient.
A set contains unique, sorted elements, typically stored in a balanced binary search tree. find() therefore walks down the tree much like a binary search and has O(log N) complexity in a set of N elements. Insertion is logarithmic too, in order to keep the set sorted and unique.
set::find() is fairly efficient, O(log n).
If you don't need to access the elements in order, you should consider using an unordered_set. unordered_set::find() is O(1) on average.
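For comparison, a minimal sketch of both lookups, assuming std::string keys. The function names are hypothetical.

```cpp
#include <set>
#include <string>
#include <unordered_set>

// Ordered lookup: O(log n), and the container can be traversed in sorted order.
bool contains_ordered(const std::set<std::string>& s, const std::string& key) {
    return s.find(key) != s.end();  // s.count(key) != 0 gives the same answer
}

// Hashed lookup: O(1) on average, O(n) in the worst case; no ordering.
bool contains_hashed(const std::unordered_set<std::string>& s,
                     const std::string& key) {
    return s.find(key) != s.end();
}
```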

C++ - List with logarithmic read, insertion at given position

I'm looking for data structure that behaves like a list, where we can insert an element at ANY given position and then read an element at ANY given position, where insertion and reading should be in logarithmic time. Is there something like this in the standard library or maybe I'm stuck with having to write this on my own (I know it can be implemented as a tree)?
std::multiset behaves pretty much like the logarithmic std::list that you are looking for
iteration is bidirectional
insertion / reading are O(log N)
Note however (as pointed out by @SergeRogatch) that the "price" you pay for O(log N) lookup (instead of O(N) for list) is that multiset orders elements as they are inserted. This behaves differently from std::list. It also means that your elements need to be comparable using std::less<> or you need to provide your own comparator.
An alternative would be to use std::unordered_multiset (i.e. a hash table), which has average-case O(1) element access, but then there is no deterministic order either. And again, your elements then need to be usable with std::hash<> or you need to write your own hash function.

Optimal way to search a std::set

How should one search a std::set, when speed is the critical criterion for his/her project?
set::find?
Complexity:
Logarithmic in size.
std::binary_search?
Complexity:
On average, logarithmic in the distance between first and last: Performs approximately log2(N)+2 element comparisons (where N is this distance).
On non-random-access iterators, the iterator advances produce themselves an additional linear complexity in N on average.
Just a binary search implemented by him/her (like this one)? Or the STL's one is good enough?
Is there a way to answer this theoretically? Or we have to test ourselves? If someone has, it would be nice if (s)he would share this information with us (if no, we are not lazy :) ).
The iterator type provided by std::set is a bidirectional_iterator, a category which does not require random access to elements, but only element-wise movements in both directions. All random_access_iterators are bidirectional_iterators, but not vice versa.
Using std::binary_search on a std::set can therefore yield O(n) runtime as per the remarks you quoted, while std::set::find is guaranteed to be O(log n).
So to search a set, use set::find.
It's unlikely that std::set has a random access iterator. Even if it did, std::binary_search would access at least as many nodes as .find, since .find accesses only the ancestors of the target node.