Binary Search on a Doubly Linked List - c++

It is possible to perform binary search on a doubly-linked list in Θ(log 𝑛) time?
My answer is yes because if the list is already somewhat ordered it could be faster than just O(n).

In order to do a binary search on a doubly-linked list, you're going to have to first iterate to the halfway-point of the list, so that you can do your first recursion on the two halves of the list.
Iterating to the halfway-point of a linked list is already an O(n) operation, since the time it takes to iterate to the halfway-point will grow linearly as the list itself gets longer.
So you're already at O(n) time, even before you've done any actual searching. Hence, the answer is no.

As you asked the question, the answer is no. You cannot have O(lg(n)) time for a linked list since traversal is linear, it cannot be better than O(n) in general, but binary search would be worse than a linear scan in that case since it must iterate multiple times to "jump" around. It would be better to do a single linear scan to find the element.
However, the C++ standard specifies that std::lower_bound algorithm (which does a binary search) has the following complexity:
[lower.bound]
Complexity: At most log2(last - first) + O(1) comparisons and projections.
That is, it is counting the element comparisons, not time, if you are measuring time by number of iterator advancements. That is, it finds the proper place by calling std::advance() on an iterator many times, but each of those calls on a list will be O(N) iterator advancements but on random access containers it's a constant, and for each call to advance there would be a corresponding call to the comparator.
That's why it is always so important to be clear what big-oh notation is measuring. Often the comparisons are a proxy for time, but not always!

Related

Is there a data structure like a C++ std set which also quickly returns the number of elements in a range?

In a C++ std::set (often implemented using red-black binary search trees), the elements are automatically sorted, and key lookups and deletions in arbitrary positions take time O(log n) [amortised, i.e. ignoring reallocations when the size gets too big for the current capacity].
In a sorted C++ std::vector, lookups are also fast (actually probably a bit faster than std::set), but insertions are slow (since maintaining sortedness takes time O(n)).
However, sorted C++ std::vectors have another property: they can find the number of elements in a range quickly (in time O(log n)).
i.e., a sorted C++ std::vector can quickly answer: how many elements lie between given x,y?
std::set can quickly find iterators to the start and end of the range, but gives no clue how many elements are within.
So, is there a data structure that allows all the speed of a C++ std::set (fast lookups and deletions), but also allows fast computation of the number of elements in a given range?
(By fast, I mean time O(log n), or maybe a polynomial in log n, or maybe even sqrt(n). Just as long as it's faster than O(n), since O(n) is almost the same as the trivial O(n log n) to search through everything).
(If not possible, even an estimate of the number to within a fixed factor would be useful. For integers a trivial upper bound is y-x+1, but how to get a lower bound? For arbitrary objects with an ordering there's no such estimate).
EDIT: I have just seen the
related question, which essentially asks whether one can compute the number of preceding elements. (Sorry, my fault for not seeing it before). This is clearly trivially equivalent to this question (to get the number in a range, just compute the start/end elements and subtract, etc.)
However, that question also allows the data to be computed once and then be fixed, unlike here, so that question (and the sorted vector answer) isn't actually a duplicate of this one.
The data structure you're looking for is an Order Statistic Tree
It's typically implemented as a binary search tree in which each node additionally stores the size of its subtree.
Unfortunately, I'm pretty sure the STL doesn't provide one.
All data structures have their pros and cons, the reason why the standard library offers a bunch of containers.
And the rule is that there is often a balance between quickness of modifications and quickness of data extraction. Here you would like to easily access the number of elements in a range. A possibility in a tree based structure would be to cache in each node the number of elements of its subtree. That would add an average log(N) operations (the height of the tree) on each insertion or deletion, but would highly speedup the computation of the number of elements in a range. Unfortunately, few classes from the C++ standard library are tailored for derivation (and AFAIK std::set is not) so you will have to implement your tree from scratch.
Maybe you are looking for LinkedHashSet alternate for C++ https://docs.oracle.com/javase/7/docs/api/java/util/LinkedHashSet.html.

Is find() function efficient for sets?

As far as I am concerned, binary search stands for the most efficient way to determine whethere there exists a certain element x in a sorted array. Thus, I was wondering if it is a good idea to make use of the find() or count() functions in order to perform this process of seeking for an element or it is more reasonable to use a sorted array rather than a set and apply the binary search method.
Yes it is efficient.
A set contains unique and sorted elements. Therefore find() uses binary search and has a O(logN) complexity in a set of N elements. Insertion is logarithmic too, in order to keep it sorted and unique.
set::find() is fairly efficient, O(log n).
If you don't need to access the elements in order, you should consider using an unordered_set. unordered_set::find() is O(1) on average.

Looking for clarification on Hashing and BST functions and Big O notation

So I am trying to understand the data types and Big O notation of some functions for a BST and Hashing.
So first off, how are BSTs and Hashing stored? Are BSTs usually arrays, or are they linked lists because they have to point to their left and right leaves?
What about Hashing? I've had the most trouble finding clear information regarding Hashing in terms of computation-based searching. I understand that Hashing is best implemented with an array of chains. Is this for faster searching or to decrease overhead on creating the allocated data type?
This following question might be just bad interpretation on my part, but what makes a traversal function different from a search function in BSTs, Hashing, and STL containers?
Is traversal Big O(N) for BSTS because you're actually visiting each node/data member, whereas search() can reduce its time by eliminating half the searching field?
And somewhat related, why is it that in the STL, list.insert() and list.erase() have a Big O(1) whereas the vector and deque counterparts are O(N)?
Lastly, why would a vector.push_back() be O(N)? I thought the function could be done something along the lines of this like O(1), but I've come across text saying it is O(N):
vector<int> vic(2,3);
vector<int>::const iterator IT = vic.end();
//wanna insert 4 to the end using push_back
IT++;
(*IT) = 4;
hopefully this works. I'm a bit tired but I would love any explanations why something similar to that wouldn't be efficient or plausible. Thanks
BST's (Ordered Binary Trees) are a series of nodes where a parent node points to its two children, which in turn point to their max-two children, etc. They're traversed in O(n) time because traversal visits every node. Lookups take O(log n) time. Inserts take O(1) time because internally they don't need to a bunch of existing nodes; just allocate some memory and re-aim the pointers. :)
Hashes (unordered_map) use a hashing algorithm to assign elements to buckets. Usually buckets contain a linked list so that hash collisions just result in several elements in the same bucket. Traversal will again be O(n), as expected. Lookups and inserts will be amortized O(1). Amortized means that on average, O(1), though an individual insert might result in a rehashing (redistribution of buckets to minimize collisions). But over time the average complexity is O(1). Note, however, that big-O notation doesn't really deal with the "constant" aspect; only order of growth. The constant overhead in the hashing algorithms can be high enough that for some data-sets the O(log n) binary trees outperform the hashes. Nevertheless, the hash's advantage is that its operations are constant time-complexity.
Search functions take advantage (in the case of binary trees) of the notion of "order"; a search through a BST has the same characteristics as a basic binary search over an ordered array. O(log n) growth. Hashes don't really "search". They compute the bucket, and then quickly run through the collisions to find the target. That's why lookups are constant time.
As for insert and erase; in array-based sequence containers, all elements that come after the target have to be bumped over to the right. Move semantics in C++11 can improve upon the performance, but the operation is still O(n). For linked sequence containers (list, forward_list, trees), insertion and erasing just means fiddling with some pointers internally. It's a constant-time process.
push_back() will be O(1) until you exceed the existing allocated capacity of the vector. Once the capacity is exceeded, a new allocation takes place to produce a container that is large enough to accept more elements. All the elements need to then be moved into the larger memory region, which is an O(n) process. I believe Move Semantics can help here as well, but it's still going to be O(n). Vectors and strings are implemented such that as they allocate space for a growing data set, they allocate more than they need, in anticipation of additional growth. This is an efficiency safeguard; it means that the typical push_back() won't trigger a new allocation and move of the entire data set into a larger container. But eventually after enough push_backs, the limit will be reached, and the vector's elements will be copied into a larger container, which again has some extra headroom left over for more efficient push_backs.
Traversal refers to visiting every node, whereas search is only to find a particular node, so your intuition is spot on there. O(N) complexity because you need to visit N nodes.
std::vector::insert is for insert in the middle, and it involves copying all subsequent elements over by one slot, inorder to make room for the element being inserted, hence O(N). Linked list doesnt have this issue, hence O(1). Similar logic for erase. deque properties are similar to vector
std::vector::push_back is a O(1) operation, for the most part, only deviates if capacity is exceeded and reallocations + copy are needed.

trivial singly linked list complexity query

We know that lookup on a singly linked list is O(n) given a head pointer. Let us say I maintain a pointer at half the linked list at all times. Would I be improving any lookup times?
Yes, it can reduce the complexity by a constant factor of 2, provided you have some way of determining whether to start from the beginning or middle of the list (typically, but not necessarily, the list being sorted). This is, however, a constant factor, so in terms of big-O complexity, it's irrelevant.
To be relevant to big-O complexity, you need more than a constant factor change. If, for example, you had a pointer to bisect each half, and again each half of that, and so on, you'd end up with logarithmic complexity instead of linear -- and you'd have transformed your "linked list" into an (already well known) threaded tree.
Nice thought, but this still does not improve the search operation. No matter how many pointers you have at different portions of the list, you still have to analyze each element in the list. However, you -could- two threads to search each half of the list making the operation twice as fast in theory.
Only if your linked list's data is sorted. Otherwise, as already said in the other reply.
It would, but asymptotically it would be still the same. However, there is a data structure that uses this idea, it is called skip list. Skip list is a linked list where some nodes have more pointers that are pointing in some sense to the middle of the rest of list. The idea is well illustrated on this image. This structure usually has logarithmic insert find and delete.

Set find member vs. using find on list

Since the items in a Standard Library set container are sorted, will using the find member on the set, in general, perform faster than using the find algorithm on the same items in a sorted list?
Since the list is linear and the set is often implemented using a sorted tree, it seems as though the set-find should be faster.
With a linked list, even a sorted one, finding an element is O(n). A set can be searched in O(log n). Therefore yes, finding an element in a set is asymptotically faster.
A sorted array/vector can be searched in O(log n) by using binary search. Unfortunately, since a linked list doesn't support random access, the same method can't be used to search a sorted linked list in O(log n).
It's actually in the standard: std::set::find() has complexity O(log n), where n is the number of elements in the set. std::find() on the other hand is linear in the length of the search range.
If your generic search range happens to be sorted and has random access (e.g. a sorted vector), then you can use std::lower_bound() to find an element (or rather a position) efficiently.
Note that std::set comes with its own member-lower_bound(), which works the same way. Having an insertion position may be useful even in a set, because insert() with a correct hint has complexity O(1).
You can generally expect a find operation to be faster on a Set than on a List, since lists are linear access (O(n)), while sets may have near-constant access for HashSets (O(1)), or logarithmic access for TreeSets (O(log n)).
set::find has a complexity of O(log(n)), while std::find has a complexity of O(n). This means that std::set::find() is asymptotically faster than std::find(std::list), but that doesn't mean it is faster for any particular data set, or for any particular search.
I found this article helpful on the topic. http://lafstern.org/matt/col1.pdf
You could reconsider your requirements for just a "list" vs. a "set". According to that article, if your program consists primarily of a bunch of insertions at the start, and then after that, only comparisons to what you have stored, then you are better off with adding everything to a vector, using std::sort (vector.begin(), vector.end()) once, and then using lower_bound. In my particular application, I load from a text file a list of names when the program starts up, and then during program execution I determine if a user is in that list. If they are, I do something, otherwise, I do nothing. In other words, I had a single discrete insertion phase, then I sorted, then after that I used std::binary_search (vector.begin(), vector.end(), std::string username) to determine whether the user is in the list.