I need a data structure, implementable in C++, that can do the basic operations, such as lookup, insertion and deletion, in constant time. However, I also need to be able to find the maximum value in constant time.
This data structure should probably be kept sorted so the maximum can be found, and I have looked into red-black trees; however, their operations take logarithmic time.
I would propose a hash table, which gives O(1) expected time for lookup, insertion and deletion.
Regarding the maximum, you could store it in an attribute and check at each insertion whether it changes. Deletion is a bit more complicated: if the maximum itself is deleted, you must perform a linear search to find the new one, but this only happens when you delete the maximum. Any other element can be deleted in O(1) expected time.
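A minimal sketch of that idea, using std::unordered_set plus a cached maximum (the struct and its member names are mine, purely for illustration):

    #include <algorithm>
    #include <climits>
    #include <unordered_set>

    // Hash set plus a cached maximum: lookup/insert/erase stay O(1) expected,
    // and only erasing the current maximum triggers a linear rescan.
    struct SetWithMax {
        std::unordered_set<int> values;
        int maxValue = INT_MIN;                      // only meaningful when non-empty

        void insert(int x) {
            values.insert(x);
            maxValue = std::max(maxValue, x);        // O(1)
        }
        bool contains(int x) const { return values.count(x) != 0; }  // O(1) expected
        void erase(int x) {
            values.erase(x);                          // O(1) expected
            if (x == maxValue) {                      // the one expensive case
                maxValue = INT_MIN;
                for (int v : values) maxValue = std::max(maxValue, v);
            }
        }
    };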
Yes, I agree with Irleon. You can use a hash table to perform these operations. Let us analyze this step by step:
1. If we take arrays, the time complexity of insertion at the end will be O(1).
2. Take linked lists and it will be O(n), due to the traversal that you need to do.
3. Take binary search trees and it will be O(log n), where log n is the height of the tree.
4. Now we can use hash tables. We know that they work on keys and values. So here the key will be 'number_to_be_inserted % n', where 'n' is the number of buckets we have.
But as the list at a given index grows, you will need to traverse that list, so it will be O(numbers_at_that_index).
The same is the case for the deletion operation.
Of course there are other collision cases to consider, but we can ignore those for now and we get our basic hash table.
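A bare-bones separate-chaining table along those lines might look like this (fixed bucket count, no rehashing; the class and its names are just illustrative):

    #include <cstddef>
    #include <list>
    #include <vector>

    // Minimal separate-chaining hash table: key maps to bucket value % bucket_count.
    class IntHashTable {
        std::vector<std::list<int>> buckets;
    public:
        explicit IntHashTable(std::size_t n = 1024) : buckets(n) {}

        std::size_t index(int key) const {
            return static_cast<std::size_t>(key) % buckets.size();
        }
        void insert(int key) { buckets[index(key)].push_back(key); }   // O(1) expected
        bool contains(int key) const {                                  // O(chain length)
            for (int v : buckets[index(key)]) if (v == key) return true;
            return false;
        }
        void erase(int key) { buckets[index(key)].remove(key); }        // O(chain length)
    };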
If you could do such a thing, then you could sort in linear time: simply insert all of your items, then, do the following n times:
Find maximum
Print maximum
Delete maximum
Therefore, in a model of computation in which you can't sort in linear time, you also can't solve your problem with all operations in O(1) time.
Given an input stream of numbers ranging from 1 to 10^5 (non-repeating), we need to be able to tell, at each point, how many of the previously encountered numbers are smaller than the current one.
I tried using std::set in C++ to maintain the elements already encountered and then calling upper_bound on the set for the current number. But upper_bound gives me an iterator to the element, and then I again have to iterate through the set or use std::distance, which is linear in time.
Can I maintain some other data structure or follow some other algorithm in order to achieve this task more efficiently?
EDIT: Found an older question related to Fenwick trees that is helpful here. By the way, I have now solved this problem using segment trees, taking hints from @doynax's comment.
How to use Binary Indexed tree to count the number of elements that is smaller than the value at index?
Regardless of the container you are using, it is a very good idea to keep the elements in sorted order, so that at any point you can get an element's index or iterator and know how many elements come before it.
You need to implement your own binary search tree. Each node should store two counters holding the sizes of its left and right subtrees.
Insertion into the binary tree takes O(log n). During the insertion, the counters of all ancestors of the new element are updated, which is also O(log n).
The number of elements smaller than the new element can then be derived from the stored counters in O(log n).
So the total running time is O(n log n).
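Since the values here are bounded to 1..10^5, a Fenwick tree (the binary indexed tree mentioned in the question's edit) gives the same O(log n) counting with less code; a minimal sketch, with helper names of my own choosing:

    #include <cstdio>
    #include <vector>

    const int MAXV = 100000;                 // values are in 1..10^5
    std::vector<int> tree(MAXV + 1, 0);      // Fenwick tree over the value range

    void markSeen(int i) {                   // record value i, O(log MAXV)
        for (; i <= MAXV; i += i & -i) tree[i] += 1;
    }

    int countSmaller(int i) {                // how many recorded values are < i
        int sum = 0;
        for (--i; i > 0; i -= i & -i) sum += tree[i];
        return sum;
    }

    int main() {
        int stream[] = {5, 1, 4, 2, 3};      // sample input stream
        for (int x : stream) {
            std::printf("%d smaller than %d seen so far\n", countSmaller(x), x);
            markSeen(x);
        }
    }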
Keep your table sorted at each step and use binary search. When you look up the number just given to you by the input stream, binary search finds either the next greater or the next smaller element; from that comparison you get the index where the current input belongs, and that index is the count of numbers smaller than it. Because each insertion into the sorted table is linear, this algorithm takes O(n^2) time overall.
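For example (a sketch with a sorted std::vector standing in for the "table"; the index returned by std::lower_bound is exactly the count of smaller elements, since the inputs do not repeat):

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> seen;                           // kept sorted at all times
        int stream[] = {5, 1, 4, 2, 3};                  // sample input stream
        for (int x : stream) {
            auto pos = std::lower_bound(seen.begin(), seen.end(), x);
            std::printf("%ld numbers smaller than %d seen so far\n",
                        static_cast<long>(pos - seen.begin()), x);
            seen.insert(pos, x);                         // O(n) shift -> O(n^2) overall
        }
    }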
What if you used insertion sort to store each number into a linked list? Then you can count the number of elements less than the new one when finding where to put it in the list.
It depends on whether you want to use std or not. In certain situations, some parts of std are inefficient. (For example, std::vector can be considered inefficient in some cases due to the amount of dynamic allocation that occurs.) It's a case-by-case type of thing.
One possible solution here might be to use a skip list (a relative of the linked list), as it is easier and more efficient to insert an element into a skip list than into an array.
With the skip-list approach you can use a binary-search-style descent to insert each new element (which is not possible with a plain linked list). If you also track the length with an accumulator, returning the number of larger elements is as simple as length - index.
One more possible point in favour of this approach is that std::set::insert() is only O(log n) even without a hint, so its efficiency is already in question.
I have to maintain a list of unordered integers, where the number of integers is unknown and may increase or decrease over time. I need to update this list frequently. I have tried using a vector, but it is really slow. An array appears to be faster, but since the length of the list is not fixed, it takes a significant amount of time to resize it. Please suggest another option.
Use a hash table if the order of the values is unimportant. Time is O(1). I'm pretty sure you'll find an implementation in the standard template library.
Failing that, a splay tree is extremely fast, especially if you want to keep the list ordered: amortized cost of O(log n) per operation, with a very low constant factor. I think the C++ stdlib map is something like this.
Know thy data structures.
If you are interested in dynamically growing the array, you can do this:

    int current = 0;
    int **x = (int**)malloc(1024 * sizeof(int*));             /* room for up to 1024 chunks */
    x[current] = (int*)malloc(RequiredLength * sizeof(int));  /* first chunk */

Add elements to the array, and when x[current] is full you can make room for RequiredLength more elements by allocating another chunk:

    x[++current] = (int*)malloc(RequiredLength * sizeof(int));

You can repeat this up to 1024 times, which means 1024 * RequiredLength elements can be accommodated; this gives you the chance to increase the size of the array whenever you want.
You can always access the n-th element as x[n / RequiredLength][n % RequiredLength].
Considering your comments, it looks like std::set or std::unordered_set fits your needs better than std::vector.
If sequential data structures fail to meet your requirements, you could try looking at trees (binary, AVL, m-way, red-black, etc.). I would suggest you try an AVL tree, since it yields a balanced (or nearly balanced) binary search tree, which would keep your operations fast. For more on AVL trees: http://en.wikipedia.org/wiki/AVL_tree
Well, a deque has no resize cost, but if it is unordered its search time is linear, and its insert and delete operations in the middle are even worse than a vector's.
If you don't need to search by the value of the number, a hashmap or map may be your choice: there is no resize cost. Set the key of the map to the number's index and the value to the number's value; search and insert are then better than linear.
std::list is definitely made for such problems: adding and deleting elements in a list does not necessitate memory reallocations as in a vector. However, due to the non-contiguous memory allocation of the list, searching for elements may prove to be a painful experience, of course, but if you do not search its entries frequently it can be used.
The name says it all, really. I suspect that insertion sort is best, since it's the best sort for mostly-sorted data in general. However, since I know more about the data, there is a chance other sorts are worth looking at. So the other relevant pieces of information are:
1) This is time data, which means I presumably could create an effective hash for ordering the data.
2) The data won't all exist at one time. Instead I'll be reading in records, which may contain a single vector, or dozens or hundreds of vectors. I want to output all times within a 5-second window, so it's possible that a sort which does the sorting as I insert the data would be a better option.
3) Memory is not a big issue, but CPU speed is, as this may be a bottleneck of the system.
Given these conditions, can anyone suggest an algorithm that may be worth considering in addition to insertion sort? Also, how does one define 'mostly sorted' when deciding what is a good sort option? What I mean is: how do I look at my data and decide 'this isn't as sorted as I thought it was, maybe insertion sort is no longer the best option'? Any link to an article that defines complexity relative to how sorted the data is would be appreciated.
Thanks
Edit:
Thank you everyone for your information. I will be going with an easy insertion or merge sort (whichever I have already pre-written) for now. However, I'll be trying some of the other methods once we're closer to the optimization phase (since they take more effort to implement). I appreciate the help.
You could adopt option (2) you suggested - sort the data while you insert elements.
Use a skip list, sorted by time in ascending order, to maintain your data.
Once a new entry arrives, check whether it is larger than the last element (easy and quick). If it is, simply append it (easy to do in a skip list); the skip list needs to add 2 nodes on average in this case, so it is O(1) on average.
If the element is not larger than the last element, add it to the skip list as a standard insert operation, which will be O(log n).
This approach yields an O(n + k log n) algorithm, where k is the number of elements inserted out of order.
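The same append-if-in-order idea can be sketched with std::multiset standing in for the skip list; a hinted insert at end() is amortized O(1) when the new timestamp really is the largest so far:

    #include <set>

    // std::multiset as a stand-in for the skip list described above.
    void insertTimestamp(std::multiset<double>& sorted, double t) {
        if (sorted.empty() || t >= *sorted.rbegin()) {
            sorted.insert(sorted.end(), t);   // in order: amortized O(1) with a correct hint
        } else {
            sorted.insert(t);                 // out of order: the usual O(log n) insert
        }
    }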
I would throw in merge sort: if you implement the natural (run-detecting) version, you get a best case of O(N), and a typical and worst case of O(N log N) if the data gives you any problems. With insertion sort you get a worst case of O(N^2) and a best case of O(N).
You can sort a list of size n with k elements out of place in O(n + k lg k) time.
See: http://www.quora.com/How-can-I-quickly-sort-an-array-of-elements-that-is-already-sorted-except-for-a-small-number-of-elements-say-up-to-1-4-of-the-total-whose-positions-are-known/answer/Mark-Gordon-6?share=1
The basic idea is this:
Iterate over the elements of the array, building an increasing subsequence (if the current element is greater than or equal to the last element of the subsequence, append it to the end of the subsequence. Otherwise, discard both the current element and the last element of the subsequence). This takes O(n) time.
You will have discarded no more than 2k elements since k elements are out of place.
Sort the 2k elements that were discarded using an O(k lg k) sorting algorithm like merge sort or heapsort.
You now have two sorted lists. Merge the lists in O(n) time like you would in the merge step of merge sort.
Overall time complexity = O(n + k lg k)
Overall space complexity = O(n)
(this can be modified to run in O(1) space if you can merge in O(1) space, but it's by no means trivial)
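A sketch of that procedure in C++ (the function name is mine; std::merge handles the final merge step):

    #include <algorithm>
    #include <vector>

    // Split the input into an already-increasing subsequence plus the
    // discarded (out-of-place) elements, sort the small discarded part,
    // then merge: O(n + k lg k) time, O(n) space.
    std::vector<int> sortNearlySorted(const std::vector<int>& a) {
        std::vector<int> inc;        // increasing subsequence, kept in input order
        std::vector<int> discarded;  // at most ~2k elements
        for (int x : a) {
            if (inc.empty() || x >= inc.back()) {
                inc.push_back(x);
            } else {
                discarded.push_back(x);          // discard the current element...
                discarded.push_back(inc.back()); // ...and the last of the subsequence
                inc.pop_back();
            }
        }
        std::sort(discarded.begin(), discarded.end());       // O(k lg k)
        std::vector<int> result(a.size());
        std::merge(inc.begin(), inc.end(),
                   discarded.begin(), discarded.end(),
                   result.begin());                           // O(n)
        return result;
    }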
Without knowing the full problem, Timsort may fit the bill, since you're saying your data is mostly sorted already.
There are many adaptive sorting algorithms out there that are specifically designed to sort mostly-sorted data. Ignoring the fact that you're storing dates, you might want to look at smoothsort or Cartesian tree sort as algorithms that can sort reasonably-sorted data in worst-case O(n log n) time and best-case O(n) time. Smoothsort also has the advantage of requiring only O(1) space, like insertion sort.
Using the fact that everything is a date and therefore can be converted into an integer, you might want to look at binary quicksort (MSD radix sort) using a median-of-three pivot selection. This algorithm has best-case O(n log n) performance, but has a very low constant factor that makes it pretty competitive. Its worst case is O(n log U), where U is the number of bits in each date (probably 64), which isn't too bad.
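For reference, a bare-bones radix-exchange (binary MSD) sort might look like the following; it omits the median-of-three refinement mentioned above and assumes the dates have already been converted to unsigned integers:

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Binary MSD radix sort (radix-exchange sort): partition the range on the
    // current bit, then recurse on both halves with the next lower bit.
    void radixExchangeSort(std::vector<std::uint64_t>& a, int lo, int hi, int bit) {
        if (hi - lo < 2 || bit < 0) return;
        const std::uint64_t mask = std::uint64_t{1} << bit;
        int i = lo, j = hi - 1;
        while (i <= j) {
            while (i <= j && (a[i] & mask) == 0) ++i;   // bit clear: stays left
            while (i <= j && (a[j] & mask) != 0) --j;   // bit set: stays right
            if (i < j) std::swap(a[i++], a[j--]);
        }
        radixExchangeSort(a, lo, i, bit - 1);
        radixExchangeSort(a, i, hi, bit - 1);
    }
    // Call as: radixExchangeSort(timestamps, 0, (int)timestamps.size(), 63);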
Hope this helps!
If your OS or C library provides a mergesort function, it is very likely that it already handles the case where the given data is partially ordered (in either direction), running in O(N) time.
Otherwise, you can just copy the mergesort available from your favorite BSD operating system.
I do not have formal CS training, so bear with me.
I need to do a simulation, which can be abstracted to the following (omitting the details):
We have a list of real numbers representing the times of events. In each step, we
remove the first event, and
as a result of "processing" it, a few other events may get inserted into the list at strictly later times,
and we repeat this many times.
Questions
What data structure / algorithm can I use to implement this as efficiently as possible? I need to increase the number of events/numbers in the list significantly. The priority is to make this as fast as possible for a long list.
Since I'm doing this in C++, what data structures are already available in the STL or boost that will make it simple to implement this?
More details:
The number of events in the list is variable, but it's guaranteed to be between n and 2*n where n is some simulation parameter. While the event times are increasing, the time-difference of the latest and earliest events is also guaranteed to be less than a constant T. Finally, I suspect that the density of events in time, while not constant, also has an upper and lower bound (i.e. all the events will never be strongly clustered around a single point in time)
Efforts so far:
As the title of the question says, I was thinking of using a sorted list of numbers. If I use a linked list for constant time insertion, then I have trouble finding the position where to insert new events in a fast (sublinear) way.
Right now I am using an approximation where I divide time into buckets and keep track of how many events are in each bucket. Then I process the buckets one by one as time "passes", always adding a new bucket at the end when removing one from the front, thus keeping the number of buckets constant. This is fast, but only an approximation.
A min-heap might suit your needs. There's an explanation here, and I think the STL provides std::priority_queue for you.
Insertion time is O(log N), removal is O(log N).
It sounds like you need/want a priority queue. If memory serves, the priority queue adapter in the standard library is written to retrieve the largest items instead of the smallest, so you'll have to specify that it use std::greater for comparison.
Other than that, it provides just about exactly what you've asked for: the ability to quickly access/remove the smallest/largest item, and the ability to insert new items quickly. While it doesn't maintain all the items in order, it does maintain enough order that it can still find/remove the one smallest (or largest) item quickly.
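A small sketch of that setup, with event times as doubles (the sample values are arbitrary):

    #include <functional>
    #include <queue>
    #include <vector>

    int main() {
        // Min-heap of event times: std::greater puts the smallest time on top.
        std::priority_queue<double, std::vector<double>, std::greater<double>> events;

        events.push(3.5);
        events.push(1.2);
        events.push(2.8);

        while (!events.empty()) {
            double t = events.top();   // earliest pending event, O(1)
            events.pop();              // remove it, O(log N)
            (void)t;                   // "processing" t may push strictly later events here
        }
    }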
I would start with a basic priority queue, and see if that's fast enough.
If not, then you can look at writing something custom.
http://en.wikipedia.org/wiki/Priority_queue
A binary search tree keeps its elements in sorted order and has faster access times than a linear list; search, insert and delete are O(log(n)) as long as the tree is balanced.
But it depends on whether the items have to be sorted all the time or only after the process is finished. In the latter case a hash table is probably faster; at the end of the process you would then copy the items to an array or a list and sort them.
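A sketch of the sort-at-the-end variant, using std::unordered_set during processing and a single sort afterwards:

    #include <algorithm>
    #include <unordered_set>
    #include <vector>

    int main() {
        std::unordered_set<int> items;              // O(1) average insert/lookup/erase
        int data[] = {42, 7, 19, 3};
        for (int x : data) items.insert(x);

        // Only once processing is finished: copy out and sort in one go.
        std::vector<int> sorted(items.begin(), items.end());
        std::sort(sorted.begin(), sorted.end());    // O(N log N), done once
    }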
When is a hash table better to use than a search tree?
Depends on what you want to do with the data structure.
Operation        Hash table    Search Tree
Search           O(1)          O(log(N))
Insert           O(1)          O(log(N))
Delete           O(1)          O(log(N))
Traversal        O(N)          O(N)
Min/Max-Key      -hard-        O(log(N))
Find-Next-Key    -hard-        O(1)
Insert and search on a hash table depend on the load factor of the table and its design; poorly designed hash tables can have O(N) search and insert. The same is true for your search tree.
Deleting from a hash table can be cumbersome, depending on your collision-resolution strategy.
Traversing the container and finding the min/max or next/previous key are better on a search tree because of its ordering.
All estimates for the search tree above assume a 'balanced' search tree.
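To make the last two rows concrete, std::map (a balanced search tree) exposes the ordered queries directly, while std::unordered_map has no equivalent:

    #include <cstdio>
    #include <map>

    int main() {
        std::map<int, const char*> tree = {{3, "c"}, {1, "a"}, {2, "b"}};

        // Min/Max-Key: first and last element of the ordered container.
        std::printf("min key: %d, max key: %d\n",
                    tree.begin()->first, tree.rbegin()->first);

        // Find-Next-Key: first key strictly greater than 1.
        auto next = tree.upper_bound(1);
        if (next != tree.end()) std::printf("key after 1: %d\n", next->first);

        // std::unordered_map offers none of these ordered queries; you would
        // have to scan all N elements to answer them.
    }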
When the average access and insertion times matter more than the best-case access and insertion times. Practically, I think search trees are usually as good a solution as hash tables, because even though in theory big theta of one is better than big theta of log n, log n is very fast, and as you start dealing with large values of n the practical difference shrinks. Also, big theta of one says nothing about the value of the constant. Granted, this holds for the complexity of trees as well, but the constant factors of trees are much more consistent across implementations, and usually very low, compared with those of hash tables.
Again, I know theorists will disagree with me here, but it's computers we're dealing with, and for log n to be a significant burden for a computer, n must be unrealistically large. If n is a trillion then log n is 40, and a computer today can perform 40 iterations rather quickly. For log n to grow to 50 you already need over a quadrillion elements.
The C++ standard as it stands today doesn't provide a hash table among its containers, and I think there's a reason people were fine with it as it is for over a decade.
My take on things:
Operation                     Hash table(1)    SBB Search Tree(2)
.find(obj) -> obj             O(1)             O(1)*
.insert(obj)                  O(1)             O(log(N))
.delete(obj)                  O(1)             O(log(N))
.traverse / for x in ...      O(N)             O(N)
.largerThan(obj) -> {objs}    unsupported      O(log(N))
                                               (union of right subtree O(1) + parent O(1))
.sorted() -> [obj]            unsupported      no need
                                               (already sorted, so just .traverse(), which is O(N))
.findMin() -> obj             unsupported**    O(log(N)), maybe O(1)
                                               (descend from the root, e.g. root.left.left.left...left,
                                               which is O(log(N)); might be able to cache for O(1))
.findNext(obj) -> obj         unsupported      O(log(N))
                                               (first perform x = .find(obj), which is O(1), then
                                               descend from that node, e.g. x.right.left.left...right,
                                               which is O(log(N)))
(1) http://en.wikipedia.org/wiki/Hash_table
(2) http://en.wikipedia.org/wiki/Self-balancing_binary_search_tree , e.g. http://en.wikipedia.org/wiki/Tango_tree or http://en.wikipedia.org/wiki/Splay_tree
(*) You can use a hash table in conjunction with a search tree to obtain this. There is no asymptotic speed or space penalty. Otherwise, it's O(log(N)).
(**) Unless you never delete, in which case just cache smallest and largest elements and it's O(1).
These costs may be amortized.
Conclusion:
You want to use trees when the ordering matters.
Among many issues, it depends on how expensive the hash function is. In my experience, hashes are generally about twice as fast as balanced trees for a sensible hash function, but it's certainly possible for them to be slower.