When is a hash table better to use than a search tree? - c++

When is a hash table better to use than a search tree?

Depends on what you want to do with the data structure.
Operation        Hash table   Search Tree
Search           O(1)         O(log(N))
Insert           O(1)         O(log(N))
Delete           O(1)         O(log(N))
Traversal        O(N)         O(N)
Min/Max-Key      -hard-       O(log(N))
Find-Next-Key    -hard-       O(1)
Insert and Search on a hash table depend on the load factor of the hash table and its design. A poorly designed hash table can have O(N) search and insert; the same is true for your search tree.
Deleting from a hash table can be cumbersome depending on your collision resolution strategy.
Traversing the container and operations like find-min/max and find-next/prev are better on a search tree because of its ordering.
All of the search-tree estimates above are for 'balanced' search trees.
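As a rough illustration of the ordered-query rows above, here is a small sketch (assuming a C++11 or later compiler; the variable names are mine) showing that min-key and find-next-key fall out naturally from std::map, while std::unordered_map offers no ordered view short of scanning everything:

    #include <iostream>
    #include <map>
    #include <string>
    #include <unordered_map>

    int main() {
        std::map<int, std::string> tree = {{3, "c"}, {1, "a"}, {2, "b"}};
        std::unordered_map<int, std::string> hash = {{3, "c"}, {1, "a"}, {2, "b"}};

        // Search: expected O(1) on the hash table, O(log N) on the tree.
        std::cout << hash.at(2) << ' ' << tree.at(2) << '\n';

        // Min/Max-Key: the tree's leftmost node is the minimum, reachable in
        // O(log N); the hash table has no ordered view, so finding the minimum
        // would mean scanning all N elements.
        std::cout << "min key in tree: " << tree.begin()->first << '\n';

        // Find-Next-Key: upper_bound gives the successor on the tree; again
        // there is no hash-table equivalent short of a full scan.
        auto next = tree.upper_bound(1);
        if (next != tree.end())
            std::cout << "key after 1: " << next->first << '\n';
    }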

When the average access and insertion time matter more than the worst-case access and insertion time. Practically, I think search trees are usually as good a solution as hash tables, because even though in theory Θ(1) is better than Θ(log n), log n grows very slowly, and as n gets large the practical difference shrinks. Also, Θ(1) says nothing about the constant factor. Granted, this holds for the complexity of trees as well, but the constant factors of trees are much more consistent across implementations, and usually very small, compared with those of hash tables.
Again, I know theorists will disagree with me here, but it's computers we're dealing with, and for log n to be a significant burden for a computer, n must be unrealistically large. If n is a trillion, then log n is about 40, and a computer today can perform 40 iterations rather quickly. For log n to reach 50 you already need over a quadrillion elements.
The C++ standard before C++11 didn't provide a hash table among its containers, and I think there's a reason people were fine with that for over a decade.

My take on things:
Operation                    Hash table (1)   SBB Search Tree (2)
.find(obj) -> obj            O(1)             O(1)*
.insert(obj)                 O(1)             O(log(N))
.delete(obj)                 O(1)             O(log(N))
.traverse / for x in ...     O(N)             O(N)
.largerThan(obj) -> {objs}   unsupported      O(log(N))
                                              (union right O(1) + parent O(1))
.sorted() -> [obj]           unsupported      no need: already sorted, so to
                                              print it out .traverse() is O(N)
.findMin() -> obj            unsupported**    O(log(N)), maybe O(1): descend from
                                              the root, e.g.
                                              root.left.left.left...left -> O(log(N));
                                              might be able to cache this for O(1)
.findNext(obj) -> obj        unsupported      O(log(N)): first perform x = .find(obj),
                                              which is O(1), then descend from that
                                              node, e.g.
                                              x.right.left.left...right -> O(log(N))
(1) http://en.wikipedia.org/wiki/Hash_table
(2) http://en.wikipedia.org/wiki/Self-balancing_binary_search_tree, e.g. http://en.wikipedia.org/wiki/Tango_tree or http://en.wikipedia.org/wiki/Splay_tree
(*) You can use a hash table in conjunction with a search tree to obtain this. There is no asymptotic speed or space penalty. Otherwise, it's O(log(N)).
(**) Unless you never delete, in which case just cache smallest and largest elements and it's O(1).
These costs may be amortized.
Conclusion:
You want to use trees when the ordering matters.
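For footnote (*), one way to pair a hash table with the search tree so that .find(obj) is expected O(1) while the ordered queries stay O(log(N)) might look like the sketch below; the class and member names are my own, and std::set / std::unordered_map stand in for the tree and the hash table:

    #include <set>
    #include <unordered_map>

    // Rough sketch of footnote (*): pair a hash table with an ordered set so
    // that find is expected O(1) while ordered queries stay O(log N).
    class IndexedOrderedSet {
    public:
        void insert(int value) {
            auto result = ordered_.insert(value);        // O(log N)
            if (result.second)
                index_.emplace(value, result.first);     // expected O(1)
        }

        bool find(int value) const {                     // expected O(1)
            return index_.count(value) != 0;
        }

        void erase(int value) {
            auto hit = index_.find(value);               // expected O(1)
            if (hit == index_.end()) return;
            ordered_.erase(hit->second);                 // amortized O(1) given the iterator
            index_.erase(hit);
        }

        int findMin() const { return *ordered_.begin(); }   // O(1); assumes non-empty

        // Smallest element strictly greater than value, as in .findNext(obj).
        const int* findNext(int value) const {
            auto it = ordered_.upper_bound(value);       // O(log N)
            return it == ordered_.end() ? nullptr : &*it;
        }

    private:
        std::set<int> ordered_;
        std::unordered_map<int, std::set<int>::iterator> index_;
    };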

Among many issues, it depends on how expensive the hash function is. In my experience, hashes are generally about twice as fast as balanced trees for a sensible hash function, but it's certainly possible for them to be slower.

Related

How is binary search faster than linear search?

We need a sorted array to perform a binary search, so the time complexity (including sorting) is already greater than that of a linear search. Isn't linear search the better option in that case?
A linear search runs in O(N) time, because it scans through the array from start to end.
Binary search, on the other hand, requires the array to be sorted first, which takes O(N log N) time (if it isn't already sorted); each lookup then takes O(log N) time.
For a small number of lookups, using a linear search would be faster than sorting and then using binary search. However, once the number of lookups exceeds roughly log N, binary search will theoretically have the upper hand in performance.
So, the answer to your question is: linear search and binary search perform lookups in different ways. Linear search scans through the whole array, while binary search needs the array sorted and then repeatedly halves the search range. These two techniques have different time complexities, but that does not mean one will always be better than the other.
Specifically, linear search works well when the size of the list is small and/or you only need to perform a small number of lookups. Binary search should perform better in all other situations.
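To make the trade-off concrete, here is a small sketch of the two approaches in C++ (function and class names are mine): std::find for the one-off linear lookup, and sort-once-then-std::binary_search when you expect more than about log N lookups:

    #include <algorithm>
    #include <utility>
    #include <vector>

    // One-off lookup on unsorted data: linear search, O(N).
    bool containsLinear(const std::vector<int>& values, int key) {
        return std::find(values.begin(), values.end(), key) != values.end();
    }

    // Many lookups: pay O(N log N) once to sort, then O(log N) per lookup.
    class SortedIndex {
    public:
        explicit SortedIndex(std::vector<int> values) : sorted_(std::move(values)) {
            std::sort(sorted_.begin(), sorted_.end());
        }
        bool contains(int key) const {
            return std::binary_search(sorted_.begin(), sorted_.end(), key);
        }
    private:
        std::vector<int> sorted_;
    };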
Binary search will be better if your container is already sorted or if you want to search for many values.
First of all, the precondition for binary search is that the array is sorted, which means you do not need to sort it again. Secondly, if you are talking about integer arrays, you can use radix sort, O(d*n), or counting sort, O(n+k) where k is the key range, which are close to linear search in terms of complexity...
Binary search is faster than linear search when the given array is already sorted.
For a sorted array, binary search offers an average of O(log n), while linear search offers O(n).
For an array that is not sorted, linear search becomes the better choice, since O(n) is better than sorting the array first (using quicksort, for example, at O(n log n)) and then applying binary search, which gives O(n log n + log n) overall.

Order-maintenance data structure in C++

I'm looking for a data structure which would efficiently solve the order-maintenance problem. In other words, I need to efficiently
insert (in the middle),
delete (in the middle),
compare positions of elements in the container.
I found good articles which discuss this problem:
Two Algorithms for Maintaining Order in a List,
Two Simplified Algorithms for Maintaining Order in a List.
The algorithms are quite efficient (some are stated to be O(1) for all operations), but they do not seem trivial, and I'm wondering if there is an open-source C++ implementation of these or similar data structures.
I've seen a related topic where some simpler approaches with O(log n) time complexity for all operations were suggested, but here I'm looking for an existing implementation.
An example in some other popular language would be good too; that way I could at least experiment with it before trying to implement it myself.
Details
I'm going to
maintain a list of pointers to objects,
from time to time I will need to change an object's order (delete + insert),
given a subset of objects, I need to be able to quickly sort them and process them in the correct order.
Note
The standard ordered containers (std::set, std::map) are not what I'm looking for, because they maintain an order for me, whereas I need to order the elements myself. It would be similar to what I would do with std::list, but there position comparison is linear, which is not acceptable.
If you are looking for a solution that is both easy to implement and efficient, you could build this structure on top of a balanced binary search tree (an AVL or red-black tree). You could implement the operations as follows:
insert(X, Y) (inserts Y immediately after X in the total order): if X doesn't have a right child, set X's right child to Y; otherwise, let Z be the leftmost node of the subtree rooted at X.right (that is, the lowest Z = X.right.left.left.left... that is not NULL) and set Z's left child to Y. Rebalance if you have to. You can see that the total complexity is O(log n).
delete(X): just delete the node X as you normally would from the tree. Complexity O(log n).
compare(X, Y): find the path from X to the root and the path from Y to the root. From those two paths you can find Z, the lowest common ancestor of X and Y. Now you can compare X and Y depending on whether they are in the left or in the right subtree of Z (they can't be in the same subtree at the same time, since then Z wouldn't be their lowest common ancestor). Complexity O(log n).
So you can see that the advantage of this implementation is that all operations would have complexity O(log n) and it's easy to implement.
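A minimal sketch of the compare operation described above, assuming each node stores a parent pointer (the struct and function names are mine; rebalancing and the insert/delete plumbing are omitted):

    #include <vector>

    struct Node {
        Node* parent = nullptr;
        Node* left = nullptr;
        Node* right = nullptr;
    };

    // Collect the path from x up to the root (x first, root last).
    static std::vector<Node*> pathToRoot(Node* x) {
        std::vector<Node*> path;
        for (; x != nullptr; x = x->parent) path.push_back(x);
        return path;
    }

    // Returns true if x precedes y in the in-order (left-to-right) ordering.
    // Assumes x != y and that both are in the same tree.
    bool precedes(Node* x, Node* y) {
        std::vector<Node*> px = pathToRoot(x);   // x ... root
        std::vector<Node*> py = pathToRoot(y);   // y ... root

        // Walk the two paths from the root until they diverge; the last
        // common node is the lowest common ancestor Z.
        auto ix = px.rbegin(), iy = py.rbegin();
        Node* z = nullptr;
        while (ix != px.rend() && iy != py.rend() && *ix == *iy) {
            z = *ix;
            ++ix; ++iy;
        }

        // If one node is an ancestor of the other, the answer depends on which
        // child branch the descendant hangs off.
        if (ix == px.rend()) return *iy == z->right;  // x == z, y is below z
        if (iy == py.rend()) return *ix == z->left;   // y == z, x is below z

        // Otherwise x and y sit in different subtrees of Z.
        return *ix == z->left;   // x precedes y iff x is in Z's left subtree
    }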
You can use a skip list similarly to how you would use std::list.
Skip lists were first described in 1989 by William Pugh.
To quote the author:
Skip lists are a probabilistic data structure that seem likely to supplant balanced trees as the implementation method of choice for many applications. Skip list algorithms have the same asymptotic expected time bounds as balanced trees and are simpler, faster and use less space.
http://drum.lib.umd.edu/handle/1903/542
The STL is the solution to your problem.
It provides standard, proven, and efficient containers, plus the algorithms that support them. Almost all of the containers in the STL support the actions you have mentioned.
It seems like std::deque has the best qualities for the tasks you are referring to:
1) Insertion: both to the back and to the front in O(1) complexity
2) Deletion: unlike contiguous containers, std::deque::erase is O(N) where N is the number of items deleted, meaning that erasing only one item has O(1) complexity
3) Position comparison: using std::advance, the complexity on std::deque is O(N)
4) Sorting: using std::sort, which usually uses quicksort for the task and runs in O(n log n). In MSVC++ at least, the implementation tries to pick the best sorting strategy for the given range.
Do not try to use an open-source solution or build your own library before you have tried the STL thoroughly!

Data structure with constant time operations

I need to use a data structure, implementable in C++, that can do basic operations, such as lookup, insertion and deletion, in constant time. I, however, also need to be able to find the maximum value in constant time.
This data structure should probably be sorted so that the maximum can be found, and I have looked into red-black trees; however, they have logarithmic-time operations.
I would propose the following:
You could use a hash table, which gives O(1) expected time for the basic operations.
Regarding the maximum, you could store it in an attribute and check at each insertion whether the maximum changes. Deletion is a bit more complicated, because if the maximum is deleted you must perform a linear search to find the new one, but this only happens when the maximum itself is deleted. Any other element can be deleted in O(1) expected time.
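A minimal sketch of that idea, using std::unordered_multiset plus a cached maximum (the class name is mine); note that the linear rescan only happens when the current maximum is removed:

    #include <algorithm>
    #include <unordered_set>

    // Hash-backed container with O(1) expected lookup/insert/delete and an
    // O(1) max() query; deleting the current maximum falls back to a rescan.
    class MaxTrackingSet {
    public:
        void insert(int value) {
            values_.insert(value);                       // expected O(1)
            if (values_.size() == 1 || value > max_) max_ = value;
        }

        bool contains(int value) const {                 // expected O(1)
            return values_.count(value) != 0;
        }

        void erase(int value) {
            auto it = values_.find(value);               // expected O(1)
            if (it == values_.end()) return;
            values_.erase(it);
            if (value == max_ && !values_.empty())       // only case needing O(N)
                max_ = *std::max_element(values_.begin(), values_.end());
        }

        int max() const { return max_; }                 // O(1); assumes non-empty

    private:
        std::unordered_multiset<int> values_;
        int max_ = 0;
    };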
Yes, I agree with Irleon. You can use a hash table to perform these operations. Let us analyze this step by step:
1. If we take arrays, the time complexity of insertion at the end will be O(1).
2. With linked lists it will be O(n), due to the traversal that you need to do.
3. With binary search trees it will be O(log n), where log n is the height of the tree.
4. Now we can use hash tables. We know that they work on keys and values. So here the key will be 'number_to_be_inserted % n', where 'n' is the size of the hash table.
But as the list at a given index grows, you will need to traverse that list, so it will be O(numbers_at_that_index).
The same is the case for the deletion operation.
Of course there are other cases to consider for collisions, but we can ignore those for now, and we will get our basic hash table.
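A bare-bones sketch of the chained hash table described above (names are mine; no resizing or load-factor handling), where the bucket index is simply the value modulo the table size:

    #include <cstddef>
    #include <list>
    #include <vector>

    // Chained hash table: bucket index is 'number % bucket_count', and
    // collisions are chained in a per-bucket list.
    class BasicHashTable {
    public:
        explicit BasicHashTable(std::size_t bucket_count)   // bucket_count must be > 0
            : buckets_(bucket_count) {}

        void insert(int number) {
            buckets_[bucket(number)].push_back(number);     // O(1) plus chain growth
        }

        bool contains(int number) const {
            for (int x : buckets_[bucket(number)])           // O(length of this chain)
                if (x == number) return true;
            return false;
        }

        void erase(int number) {
            buckets_[bucket(number)].remove(number);         // O(length of this chain)
        }

    private:
        std::size_t bucket(int number) const {
            return static_cast<std::size_t>(number) % buckets_.size();
        }
        std::vector<std::list<int>> buckets_;
    };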
If you could do such a thing, then you could sort in linear time: simply insert all of your items, then, do the following n times:
Find maximum
Print maximum
Delete maximum
Therefore, in a model of computation in which you can't sort in linear time (such as the comparison model), you also can't solve your problem with all operations in O(1) time.

Heaps and Binary Search Trees

What is the run-time of max-heapify when it is implemented on a k-ary heap?
Is a k-ary heap more efficient than a binary heap asymptotically speaking?
Is a k-ary heap more efficient than a binary heap in practice?
Can a search tree be implemented as k-ary?
You've asked a lot of questions, so I'll try to answer all of them in turn.
The runtime of the heapify operation on a k-ary heap is O(n), which is independent of k. This isn't immediately obvious, but most introductory algorithms textbooks have a proof of this result for the case where k = 2.
Let's do the analysis for a k-ary heap in general, which we can then compare against a binary heap by just setting k = 2.
In a k-ary heap, the cost of a find-min operation is O(1) (just look at the top of the heap), and the cost of a heapify operation is O(n), as mentioned above.
When adding a new element to a k-ary heap, the runtime is proportional to the height of the heap, which is O(log_k n) = O(log n / log k) (by the change-of-base formula for logarithms). It's not common to include the base of a logarithm inside big-O notation, but in this case, because k is a parameter, we can't ignore its contribution.
In an extract-min operation, we need to work from the top of the tree down to the bottom. At each level, we look at up to k children of the current node to find the smallest, then potentially do a swap down. This means there is O(k) work per layer and there are O(log n / log k) layers, so the work done is O(k log n / log k).
Asymptotically, for any fixed k, the runtimes of these operations are O(1), O(n), O(log n), and O(log n), respectively, so there's no asymptotic difference between a k-ary heap and a binary heap.
In practice, though, there are differences. One good way to see this is to make k really, really big (say, 10^100). In that case, the cost of a deletion will be quite large because there will be up to 10^100 children per node, which will dwarf the height of the corresponding binary tree. For middling values of k (k = 3 or 4), there's a chance that it may actually be faster to use a 3-ary or 4-ary tree over a binary tree, but really the best way to find out would be to profile it and see what happens. The interactions of factors like locality of reference, caching, and division speed will all be competing with one another to affect the runtime.
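To make the per-layer O(k) factor concrete, here is a minimal k-ary min-heap sketch (my own illustration, not from the question); the inner loop of siftDown is where up to k children per level are examined:

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Minimal k-ary min-heap; K is the arity (K >= 2).
    template <std::size_t K>
    class KaryHeap {
    public:
        void push(int value) {
            data_.push_back(value);
            siftUp(data_.size() - 1);                    // O(log_K n) swaps
        }

        int popMin() {                                   // assumes non-empty
            int top = data_.front();
            data_.front() = data_.back();
            data_.pop_back();
            if (!data_.empty()) siftDown(0);             // O(K * log_K n) comparisons
            return top;
        }

    private:
        void siftUp(std::size_t i) {
            while (i > 0) {
                std::size_t parent = (i - 1) / K;
                if (data_[parent] <= data_[i]) break;
                std::swap(data_[parent], data_[i]);
                i = parent;
            }
        }

        void siftDown(std::size_t i) {
            for (;;) {
                std::size_t best = i;
                // Look at up to K children per layer: this is the O(k) factor.
                for (std::size_t c = K * i + 1; c <= K * i + K && c < data_.size(); ++c)
                    if (data_[c] < data_[best]) best = c;
                if (best == i) break;
                std::swap(data_[i], data_[best]);
                i = best;
            }
        }

        std::vector<int> data_;
    };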
Yes! There are such things as multiway search trees. One of the most famous of these is the B-tree, which is actually a pretty fun data structure to read up on.
Hope this helps!

An efficient sorting algorithm for almost sorted list containing time data?

The name says it all, really. I suspect that insertion sort is best, since it's the best sort for mostly-sorted data in general. However, since I know more about the data, there is a chance other sorts are worth looking at. So the other relevant pieces of information are:
1) This is time data, which means I presumably could create an effective hash for ordering the data.
2) The data won't all exist at one time. Instead I'll be reading in records which may contain a single vector, or dozens or hundreds of vectors. I want to output all times within a 5-second window. So it's possible that a structure that does the sorting as I insert the data would be a better option.
3) Memory is not a big issue, but CPU speed is, as this may be a bottleneck of the system.
Given these conditions, can anyone suggest an algorithm that may be worth considering in addition to insertion sort? Also, how does one define 'mostly sorted' when deciding what a good sort option is? What I mean by that is: how do I look at my data and decide 'this isn't as sorted as I thought it was, maybe insertion sort is no longer the best option'? Any link to an article that defines complexity relative to how sorted the data is would be appreciated.
Thanks
Edit:
Thank you everyone for the information. I will be going with a simple insertion or merge sort (whichever I have already pre-written) for now. However, I'll try some of the other methods once we're closer to the optimization phase (since they take more effort to implement). I appreciate the help.
You could adopt option (2) you suggested: sort the data while you insert the elements.
Use a skip list, sorted by time in ascending order, to maintain your data.
Once a new entry arrives, check whether it is larger than the last element (easy and quick); if it is, simply append it (easy to do in a skip list). The skip list will need to add 2 nodes on average for these cases, so it is O(1) on average for them.
If the element is not larger than the last element, add it to the skip list as a standard insert operation, which will be O(log n).
This approach yields an O(n + k log n) algorithm, where k is the number of elements inserted out of order.
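If you would rather not implement a skip list, a std::multiset gives the same shape of cost: inserting with the end() hint is amortized O(1) when the new timestamp is already the largest, and O(log n) otherwise. A minimal sketch (assuming the timestamps are plain doubles; names are mine):

    #include <set>

    // Keep timestamps ordered as they arrive; in-order arrivals cost amortized
    // O(1) thanks to the end() hint, out-of-order arrivals cost O(log n).
    void addTimestamp(std::multiset<double>& timeline, double t) {
        if (timeline.empty() || t >= *timeline.rbegin())
            timeline.insert(timeline.end(), t);   // append at the end
        else
            timeline.insert(t);                   // standard ordered insert
    }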
I would throw in merge sort: if you implement the natural version, you get a best case of O(N), with a typical case and worst case of O(N log N) if you run into any problems. With insertion sort you get a worst case of O(N^2) and a best case of O(N).
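For reference, a sketch of what the 'natural' variant could look like (function name is mine): it detects the ascending runs already present and merges adjacent runs until one remains, so already-sorted input costs a single O(N) pass:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Natural merge sort: find maximal ascending runs, then merge adjacent
    // runs repeatedly until only one run is left.
    void naturalMergeSort(std::vector<int>& v) {
        if (v.size() < 2) return;
        for (;;) {
            // Collect the start indices of maximal ascending runs (plus a sentinel).
            std::vector<std::size_t> runs{0};
            for (std::size_t i = 1; i < v.size(); ++i)
                if (v[i] < v[i - 1]) runs.push_back(i);
            runs.push_back(v.size());
            if (runs.size() == 2) return;          // a single run: already sorted

            // Merge adjacent pairs of runs.
            for (std::size_t r = 0; r + 2 < runs.size(); r += 2)
                std::inplace_merge(v.begin() + runs[r],
                                   v.begin() + runs[r + 1],
                                   v.begin() + runs[r + 2]);
        }
    }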
You can sort a list of size n with k elements out of place in O(n + k lg k) time.
See: http://www.quora.com/How-can-I-quickly-sort-an-array-of-elements-that-is-already-sorted-except-for-a-small-number-of-elements-say-up-to-1-4-of-the-total-whose-positions-are-known/answer/Mark-Gordon-6?share=1
The basic idea is this:
Iterate over the elements of the array, building an increasing subsequence (if the current element is greater than or equal to the last element of the subsequence, append it to the end of the subsequence. Otherwise, discard both the current element and the last element of the subsequence). This takes O(n) time.
You will have discarded no more than 2k elements since k elements are out of place.
Sort the 2k elements that were discarded using an O(k lg k) sorting algorithm like merge sort or heapsort.
You now have two sorted lists. Merge the lists in O(n) time like you would in the merge step of merge sort.
Overall time complexity = O(n + k lg k)
Overall space complexity = O(n)
(this can be modified to run in O(1) space if you can merge in O(1) space, but it's by no means trivial)
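A minimal sketch of that procedure (function and variable names are mine), relying on the claim above that at most about 2k elements land in the discard pile:

    #include <algorithm>
    #include <iterator>
    #include <vector>

    std::vector<int> sortNearlySorted(const std::vector<int>& input) {
        std::vector<int> keep;       // increasing subsequence, built in O(n)
        std::vector<int> discarded;  // at most ~2k elements end up here

        for (int x : input) {
            if (keep.empty() || x >= keep.back()) {
                keep.push_back(x);
            } else {
                // Out of order: discard both the current element and
                // the last element of the subsequence.
                discarded.push_back(keep.back());
                keep.pop_back();
                discarded.push_back(x);
            }
        }

        // Sort the small discarded pile in O(k lg k).
        std::sort(discarded.begin(), discarded.end());

        // Merge the two sorted sequences in O(n).
        std::vector<int> result;
        result.reserve(input.size());
        std::merge(keep.begin(), keep.end(),
                   discarded.begin(), discarded.end(),
                   std::back_inserter(result));
        return result;
    }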
Without fully understanding the problem, Timsort may fit the bill, since you say your data is mostly sorted already.
There are many adaptive sorting algorithms out there that are specifically designed to sort mostly-sorted data. Ignoring the fact that you're storing dates, you might want to look at smoothsort or Cartesian tree sort as algorithms that can sort reasonably-sorted data in worst-case O(n log n) time and best-case O(n) time. Smoothsort also has the advantage of requiring only O(1) space, like insertion sort.
Using the fact that everything is a date and therefore can be converted into an integer, you might want to look at binary quicksort (MSD radix sort) using a median-of-three pivot selection. This algorithm has best-case O(n log n) performance, but has a very low constant factor that makes it pretty competitive. Its worst case is O(n log U), where U is the number of bits in each date (probably 64), which isn't too bad.
Hope this helps!
If your OS or C library provides a mergesort function, it is very likely that it already handles the case where the data given is partially ordered (in any direction) running in O(N) time.
Otherwise, you can just copy the mergesort available from your favorite BSD operating system.