Why does inserting sequential elements in a tree require more time than inserting random elements into a tree? - c++

This is not homework. I'm taking a data structures class and we recently finished trees. At the end of class, my professor showed this image.
ConcreteBTree is a binary tree that doesn't self-balance. I have a few questions about the times it took to complete these procedures.
Why does it take so much more time to insert 100,000 sequential elements into ConcreteBTree than it takes to insert random elements into it? My intuition would be that since elements are sequential, it should take less time than it takes to insert 1,000,000 random elements.
Why are the times of insert() and find() of ConcreteBTree with random elements so close together? Is it because both have the same time complexity? I thought insert was O(1) and find was O(n)
I'd really like to understand what is going on here, any explanation would be greatly appreciated. Thanks

Inserting sequential items (1, 2, 3, 4, ...) into a binary tree will cause it to always add the nodes to the same side (the left, for example).
When you insert random items, you will add nodes randomly to the left and right.
Adding sequentially will cause the tree to behave like an ordinary linked list (for the sequential items), because each new item has to visit every previously added item, and that takes O(n) steps. When adding randomly, it takes O(log N) steps on average.
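To make the difference concrete, here is a minimal sketch that counts how many nodes each insert has to walk past. The Node type and the helper names are made up for illustration; this is not the ConcreteBTree from the question, just a plain binary search tree with no rebalancing.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical minimal node, not the ConcreteBTree from the question.
// Destruction is recursive, so keep n modest when the tree degenerates.
struct Node {
    int key;
    std::unique_ptr<Node> left, right;
    explicit Node(int k) : key(k) {}
};

// Plain BST insert with no rebalancing. Returns the number of nodes
// visited on the way down, which is the cost of this insert.
int insert(std::unique_ptr<Node>& root, int key) {
    int visited = 0;
    std::unique_ptr<Node>* cur = &root;
    while (*cur) {
        ++visited;
        cur = (key < (*cur)->key) ? &(*cur)->left : &(*cur)->right;
    }
    *cur = std::make_unique<Node>(key);
    return visited;
}

// Total nodes visited while inserting all keys in the given order.
long long total_insert_cost(const std::vector<int>& keys) {
    std::unique_ptr<Node> root;
    long long total = 0;
    for (int k : keys) total += insert(root, k);
    return total;
}

// Returns {cost of inserting 1..n in order, cost of the same keys shuffled}.
std::pair<long long, long long> compare_orders(int n) {
    assert(n > 0);
    std::vector<int> keys(n);
    std::iota(keys.begin(), keys.end(), 1);
    long long sequential = total_insert_cost(keys);
    std::mt19937 rng(42);  // fixed seed so runs are repeatable
    std::shuffle(keys.begin(), keys.end(), rng);
    return {sequential, total_insert_cost(keys)};
}
```

For sorted input the i-th insert walks past all i-1 earlier nodes, so the total is n(n-1)/2 visits. With shuffled input the total should come out far smaller, in line with the O(log N) average per insert.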

Armin's answered Q1.
2.Why are the times of insert() and find() of ConcreteBTree with random elements so close together? Is it because both have the same time complexity? I thought insert was O(1) and find was O(n)
insert and find have to do the same work - they go down through whatever weird tree you've put together looking for that last node under which the value either is linked or would be (and will be in the case of insert), so they do the same number of comparisons and node traversals, taking similar time.
Insertion of random elements into a balanced tree is O(log2 N). Your insertions of random values into a tree that doesn't self-rebalance will be a bit worse, but not dramatically so, as some branches will end up considerably longer than others - you'll probably get some kind of bell curve of branch lengths. insert is only O(1) if you already know the node in the tree under which the insert is to be done (i.e. the find step above is normally needed). find is only O(n) if every node in the tree has to be visited, which is only the case for a pathologically unbalanced tree, effectively forming a linked list, as you've already been told you can generate by inserting pre-sorted elements.
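A rough sketch of why the two operations cost about the same: both can be written on top of one shared descent. The names locate, find and insert and the Node layout are assumptions for illustration, not the professor's code, and duplicates are simply rejected here.

```cpp
#include <cassert>
#include <memory>

// Hypothetical node type for illustration only.
struct Node {
    int key;
    std::unique_ptr<Node> left, right;
    explicit Node(int k) : key(k) {}
};

// Walks down from the root and returns the link that holds `key`, or,
// if the key is absent, the empty link where it would be attached.
// `steps` is incremented once per node that is passed on the way.
std::unique_ptr<Node>& locate(std::unique_ptr<Node>& root, int key, int& steps) {
    std::unique_ptr<Node>* cur = &root;
    while (*cur && (*cur)->key != key) {
        ++steps;
        cur = (key < (*cur)->key) ? &(*cur)->left : &(*cur)->right;
    }
    return *cur;
}

// find is just the descent.
bool find(std::unique_ptr<Node>& root, int key, int& steps) {
    return locate(root, key, steps) != nullptr;
}

// insert is the same descent plus one O(1) attach at the end.
bool insert(std::unique_ptr<Node>& root, int key, int& steps) {
    std::unique_ptr<Node>& slot = locate(root, key, steps);
    if (slot) return false;              // key already present
    slot = std::make_unique<Node>(key);  // the only O(1) part
    assert(slot->key == key);
    return true;
}
```

Inserting a key and later finding it pass the same nodes, so the step counts match. The O(1) attach is tiny next to the walk.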

Related

How to find specific string in array or linkedlist

I want to find a string that contains a specific short string in an array or linked list. I am making a small program in C++ that searches for conferences or workshops, like http://dblp.uni-trier.de/. What I wonder is how to search for a string quickly in an array or linked list. When I use the string.find() function, I think the search has O(n) time complexity if the array's length is n. Can I get better performance than O(n)? Please help.
For an array, unless it is sorted, the best you can do is O(n) average/worst case because you have to look linearly until you find the desired string. If it is sorted (which would take O(nlog(n)) to do the sorting), you can make it O(log(n)) searching using a binary search. For linked lists, the best you can do, regardless of sorted-ness, is O(n).
If you really want to complicate your code, on each insert into the list, also store a pointer to the node in a balanced tree, where nodes are ordered by comparing the strings they point to. Then you can find a string in O(log n) time.
If you want fast retrievals, use a hash map; it gives you O(1) lookup time on average.
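As a small sketch of the two faster options mentioned in these answers, using only standard library calls: a binary search over a sorted std::vector, and a std::unordered_set for hashing. The function names are made up. Note that both only handle exact matches on the whole string; neither helps with substring search.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// O(log n) string comparisons. The vector must already be sorted.
bool contains_sorted(const std::vector<std::string>& sorted_names,
                     const std::string& name) {
    assert(std::is_sorted(sorted_names.begin(), sorted_names.end()));
    return std::binary_search(sorted_names.begin(), sorted_names.end(), name);
}

// O(1) lookups on average, at the cost of extra memory for the table.
bool contains_hashed(const std::unordered_set<std::string>& names,
                     const std::string& name) {
    return names.count(name) > 0;
}
```

Sorting once costs O(n log n), so the sorted vector pays off when there are many lookups against data that rarely changes.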

Binary Tree Questions

Currently studying for an exam, and whilst reading through some notes, I had a few questions.
I know that the height of a Binary Search Tree is Log(n). Does this mean the depth is also Log(n)?
What is maximum depth of a node in a full binary tree with n nodes? This related to the first question; if the height of a Binary Tree is Log(n), would the maximum depth also be Log(n)?
I know that the time complexity of searching for a node in a binary search tree is O(Log(n)), which I understand. However, I read that the worst case time complexity is O(N). In what scenario would it take O(N) time to find an element?
THIS IS A PRIORITY QUEUE/ HEAP QUESTION. In my lecture notes, it says the following statement:
If we use an array for Priority Queues, en-queuing takes O(1) and de-queuing takes O(n). In a sorted Array, en-queue takes O(N) and de-queue takes O(1).
I'm having a hard time understanding this. Can anyone explain?
Sorry for all the questions, really need some clarity on a few of these topics.
Caveat: I'm a little rusty, but here goes ...
Height and depth of a binary tree are synonymous--more or less. height is the maximum depth along any path from root to leaf. But, when you traverse a tree, you have a concept of current depth. root node has depth 0, its children have depth 1, its grandchildren depth 2. If we stop here, the height of the tree is 3, but the maximum depth [we visited] is 2. Otherwise, they are often interchanged when talking about the tree overall.
Before we get to some more of your questions, it's important to note that binary trees come in various flavors. Balanced or unbalanced. With a perfectly balanced tree, all nodes except those at maximum height will have their left/right links non-null. For example, with n nodes in the tree, let n = 1024. Perfectly balanced the height is log2(n) which is 10 (e.g. 1024 == 2^10).
When you search a perfectly balanced tree, the search is O(log2(n)) because starting from the root node, you choose to follow either left or right, and each time you do, you eliminate 1/2 of the nodes. In such a tree with 1024 elements, the depth is 10 and you make 10 such left/right decisions.
Most tree algorithms, when you add a new node, will rebalance the tree on the fly (e.g. AVL or RB (red-black) trees). So, you get a balanced tree, all the time, more or less.
But ...
Let's consider a really bad algorithm. When you add a new node, it just appends it to the left link on the child with the greatest depth [or the new node becomes the new root]. The idea is fast append, and "we'll rebalance later".
If we search this "bad" tree, if we've added n nodes, the tree looks like a doubly linked list using the parent link and the left link [remember all right links are NULL]. This is linear time search or O(n).
We did this deliberately, but it can still happen with some tree algorithm and/or combinations of data. That is, the data is such that it gets naturally placed on the left link because that's where it's correct to place it based on the algorithm's placement function.
Priority queues are like regular queues except each piece of data has a priority number associated with it.
In an ordinary queue, you just push/append onto the end. When you dequeue, you shift/pop from the front. You never need to insert anything in the middle. Thus, enqueue and dequeue are both O(1) operations.
The O(n) comes from the fact that if you have to do an insertion into the middle of an array, you have to "part the waters" to make space for the element you want to insert. For example, if you need to insert after the first element [which is array[0]], you will be placing the new element at array[1], but first you have to move array[1] to array[2], array[2] to array[3], ... For an array of n, this is O(n) effort.
When removing an element from an array, it is similar, but in reverse. If you want to remove array[1], you grab it, then you must "close the gap" left by your removal by array[1] = array[2], array[2] = array[3], ... Once again, an O(n) operation.
In a sorted array, you just pop off the end. It's the one you want already. Hence O(1). To add an element, it's an insertion into the correct place. If your array is 1,2,3,7,9,12,17 and you want to add 6, that's the new value for array[3], and you have to move 7,9,12,17 out of the way as above.
A priority queue on an unsorted array just appends to the array, hence O(1). But to find the correct element to dequeue, you scan the array array[0], array[1], ... remembering the position of the best priority seen so far; if you find a better priority, you remember that instead. When you hit the end, you know which element you need, say it's j. Now you have to remove j from the array, and that's an O(n) operation as above.
It's slightly more complex than all that, but not by too much.
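Here is a minimal sketch of the two array strategies from the lecture notes. It assumes a bare int stands in for the priority and that a larger value means higher priority; the struct names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Unsorted array: O(1) enqueue, O(n) dequeue.
struct UnsortedPQ {
    std::vector<int> items;

    void enqueue(int priority) { items.push_back(priority); }  // amortized O(1)

    int dequeue() {
        assert(!items.empty());
        auto best = std::max_element(items.begin(), items.end());  // O(n) scan
        int value = *best;
        items.erase(best);  // O(n): closes the gap left by the removal
        return value;
    }
};

// Sorted array (ascending): O(n) enqueue, O(1) dequeue from the back.
struct SortedPQ {
    std::vector<int> items;

    void enqueue(int priority) {
        // Finding the slot is O(log n), but making room for it is O(n).
        auto pos = std::upper_bound(items.begin(), items.end(), priority);
        items.insert(pos, priority);
    }

    int dequeue() {
        assert(!items.empty());
        int value = items.back();  // highest priority is already at the end
        items.pop_back();
        return value;
    }
};
```

Both hand back the same elements in the same order. They only differ in which of the two operations pays the O(n) shifting cost.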

Count of previously smaller elements encountered in an input stream of integers?

Given an input stream of numbers ranging from 1 to 10^5 (non-repeating) we need to be able to tell at each point how many numbers smaller than this have been previously encountered.
I tried to use the set in C++ to maintain the elements already encountered and then taking upper_bound on the set for the current number. But upper_bound gives me the iterator of the element and then again I have to iterate through the set or use std::distance which is again linear in time.
Can I maintain some other data structure or follow some other algorithm in order to achieve this task more efficiently?
EDIT: Found an older question related to Fenwick trees that is helpful here. By the way, I have now solved this problem using segment trees, taking hints from @doynax's comment.
How to use Binary Indexed tree to count the number of elements that is smaller than the value at index?
Regardless of the container you are using, it is a very good idea to keep the elements as a sorted set, so that at any point we can just get an element's index or iterator to know how many elements are before it.
You need to implement your own binary search tree. Each node should store two counters with the total number of nodes in its left and right subtrees.
Insertion into a balanced binary tree takes O(log n). During the insertion, the counters of all parents of the new element should be incremented, which is also O(log n).
The number of elements that are smaller than the new element can be derived from the stored counters in O(log n).
So, the total running time is O(n log n).
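For comparison, here is a sketch that swaps in the Fenwick tree (binary indexed tree) mentioned in the question's edit, not the counter-augmented search tree described in this answer. It relies on the values being bounded (1 to 10^5 in the question) and non-repeating. The class and method names are made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Fenwick tree indexed by value: tree_ tracks how many times each
// value in 1..max_value has been seen so far.
class SmallerCounter {
public:
    explicit SmallerCounter(int max_value) : tree_(max_value + 1, 0) {}

    // Returns how many previously seen values are smaller than v,
    // then records v. Both steps are O(log max_value).
    int count_then_add(int v) {
        assert(v >= 1 && v < static_cast<int>(tree_.size()));
        int smaller = prefix_sum(v - 1);
        for (int i = v; i < static_cast<int>(tree_.size()); i += i & -i)
            ++tree_[i];
        return smaller;
    }

private:
    // Number of recorded values in the range 1..i.
    int prefix_sum(int i) const {
        int sum = 0;
        for (; i > 0; i -= i & -i) sum += tree_[i];
        return sum;
    }

    std::vector<int> tree_;
};
```

Each number in the stream costs O(log M) where M is the largest possible value, so the whole stream is O(n log M) with O(M) memory.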
Keep your table sorted at each step. Use binary search. At each point, when you are searching for the number that was just given to you by the input stream, binary search is going to find either the next greatest number or the next smallest one. Using the comparison, you can find the current input's index, and its index is the count of numbers that are less than the current one. This algorithm takes O(n^2) time overall, because inserting into the sorted table shifts up to n elements each time.
What if you used insertion sort to store each number into a linked list? Then you can count the number of elements less than the new one when finding where to put it in the list.
It depends on whether you want to use std or not. In certain situations, some parts of std are inefficient. (For example, std::vector can be considered inefficient in some cases due to the amount of dynamic allocation that occurs.) It's a case-by-case type of thing.
One possible solution here might be to use a skip list (relative of linked lists), as it is easier and more efficient to insert an element into a skip list than into an array.
You have to use the skip list approach, so you can use a binary search to insert each new element. (One cannot use binary search on a normal linked list.) If you're tracking the length with an accumulator, returning the number of larger elements would be as simple as length-index.
One more thing to weigh with this approach is that std::set::insert() is already O(log n) without a hint, so the efficiency gain is in question.

Find the median of binary search tree, C++

Once I was interviewed by "One well known company" and the interviewer asked me to find the median of BST.
int median(treeNode* root)
{
}
I started to implement the first brute-force solution that I came up with. I fill all the data into a std::vector<int> with inorder traversal (to get everything sorted in the vector) and got the middle element.
So my algo is O(N) for inserting every element in the vector and query of middle element with O(1), + O(N) of memory.
So is there a more efficient way (in terms of memory or in terms of complexity) to do the same thing?
Thanks in advance.
It can be done in O(n) time and O(log N) space by doing an in-order traversal and stopping when you reach the (n/2)-th node; just carry a counter that tells you how many nodes have already been traversed. There is no need to actually populate any vector.
If you can modify your tree into a rank tree (each node also stores the number of nodes in the subtree it is the root of), you can easily solve it in O(log N) time by descending at each node toward the side that contains the (n/2)-th element.
Since you know that the median is the middle element of a sorted list of elements, you can just take the middle element of your inorder traversal and stop there, without storing the values in a vector. You might need two traversals if you don't know the number of nodes, but it will make the solution use less memory (O(h) where h is the height of your tree; h = O(log n) for balanced search trees).
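The two-traversal idea could look roughly like this. The field names of treeNode (data, left, right) are an assumption, since the question only shows the function signature, and picking the lower median for an even node count is also just a choice made for this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Assumed layout; the question does not show treeNode's fields.
struct treeNode {
    int data;
    treeNode* left;
    treeNode* right;
};

// First traversal: count the nodes.
std::size_t count_nodes(const treeNode* n) {
    return n ? 1 + count_nodes(n->left) + count_nodes(n->right) : 0;
}

// Second traversal: in-order walk that stops at the k-th node (0-based).
// k is decremented as nodes are passed, so nothing is stored in a vector.
const treeNode* kth_in_order(const treeNode* n, std::size_t& k) {
    if (!n) return nullptr;
    if (const treeNode* hit = kth_in_order(n->left, k)) return hit;
    if (k == 0) return n;
    --k;
    return kth_in_order(n->right, k);
}

// Returns the lower median when the node count is even.
int median(treeNode* root) {
    assert(root != nullptr);
    std::size_t k = (count_nodes(root) - 1) / 2;
    return kth_in_order(root, k)->data;
}
```

The only extra memory is the recursion stack, which is O(h) for a tree of height h.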
If you can augment the tree, you can use the solution I gave here to get an O(log n) algorithm.
The binary tree offers a sorted view for your data but in order to take advantage of it, you need to know how many elements are in each subtree. So without this knowledge your algorithm is fast enough.
If you know the size of each subtree, you select each time to visit the left or the right subtree, and this gives an O(log n) algorithm if the binary tree is balanced.
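A sketch of that descent, assuming each node carries a size field that is kept up to date on insert and delete (not shown here). SizedNode, size_of and select_kth are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical node that also stores the size of its own subtree.
struct SizedNode {
    int data;
    std::size_t size;  // nodes in this subtree, including this one
    SizedNode* left;
    SizedNode* right;
};

std::size_t size_of(const SizedNode* n) { return n ? n->size : 0; }

// Returns the k-th smallest value (0-based) in O(height) steps by
// comparing k with the size of the left subtree at each node.
int select_kth(const SizedNode* n, std::size_t k) {
    assert(n != nullptr && k < n->size);
    while (true) {
        std::size_t left = size_of(n->left);
        if (k < left) {
            n = n->left;       // target is in the left subtree
        } else if (k == left) {
            return n->data;    // exactly `left` values are smaller
        } else {
            k -= left + 1;     // skip the left subtree and this node
            n = n->right;
        }
    }
}

// Lower median when the node count is even.
int median(const SizedNode* root) {
    return select_kth(root, (size_of(root) - 1) / 2);
}
```

Only one root-to-node path is visited, which is where the O(log n) bound for a balanced tree comes from.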

Insertion into a skip list

A skip list is a data structure in which the elements are stored in sorted order and each node of the list may contain more than 1 pointer, and is used to reduce the time required for a search operation from O(n) in a singly linked list to O(lg n) for the average case. It looks like this:
Reference: "Skip list" by Wojciech Muła - Own work. Licensed under Public domain via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Skip_list.svg#mediaviewer/File:Skip_list.svg
It can be seen as an analogy to a ruler:
In a skip list, searching an element and deleting one is fine, but when it comes to insertion, it becomes difficult, because according to Data Structures and Algorithms in C++: 2nd edition by Adam Drozdek:
To insert a new element, all nodes following the node just inserted have to be restructured; the number of pointers and the value of pointers have to be changed.
I can construe from this that although choosing a node with random number of pointers based on the likelihood of nodes to insert a new element, doesn't create a perfect skip list, it gets really close to it when large number of elements (~9 million for example) are considered.
My question is: Why can't we insert the new element in a new node, determine its number of pointers based on the previous node, attach it to the end of the list, and then use efficient sorting algorithms to sort just the data present in the nodes, thereby maintaining the perfect structure of the skip list and also achieving the O(lg n) insert complexity?
Edit: I haven't tried any code yet, I'm just presenting a view. Simply because implementing a skip list is somewhat difficult. Sorry.
There is no need to modify any following nodes when you insert a node. See the original paper, Skip Lists: A Probabilistic Alternative to Balanced Trees, for details.
I've implemented a skip list from that reference, and I can assure you that my insertion and deletion routines do not modify any nodes forward of the insertion point.
I haven't read the book you're referring to, but out of context the passage you highlight is just flat wrong.
You have a problem on this point: "and then use efficient sorting algorithms to sort just the data present in the nodes". Sorting the data will have complexity O(n*lg(n)) and thus it will increase the complexity of insertion. In theory you can choose a "perfect" number of links for each node being inserted, but even if you do that, when you perform remove operations, the perfectness will be "broken". Using the randomized approach is close enough to a perfect structure to perform well.
You need to have a function/method that searches for the location.
It needs to do the following:
If you insert unique keys, it needs to locate the node. Then you keep everything and just change the data (baggage), e.g. node->data = data.
If you allow duplicates, or if the key is not found, then this function/method needs to give you the previous node on each height (lane). Then you determine the height of the new node and insert it after the found nodes.
Here is my C implementation:
https://github.com/nmmmnu/HM2/blob/master/hm_skiplist.c
You need to check the following function:
static const hm_skiplist_node_t *_hm_skiplist_locate(const hm_skiplist_t *l, const char *key, int complete_evaluation);
It stores the position inside the hm_skiplist_t struct.
complete_evaluation is used to save time in case you need the data and do not intend to insert/delete.
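As a rough C++ sketch of the locate-then-splice idea (this is not the linked C code): find the previous node on each lane, pick a random height, and relink only those predecessors. The class layout, the cap of 16 lanes and the fixed seed are all assumptions made for this example, and duplicate keys are rejected for simplicity.

```cpp
#include <cassert>
#include <random>
#include <vector>

class SkipList {
    static constexpr int kMaxHeight = 16;  // arbitrary cap for this sketch

    struct Node {
        int key;
        std::vector<Node*> next;  // next[i] is the successor on lane i
        Node(int k, int height) : key(k), next(height, nullptr) {}
    };

    Node head_{0, kMaxHeight};  // sentinel; its key is never compared
    std::mt19937 rng_{12345};   // fixed seed so runs are repeatable

    // Coin flips: each extra lane is added with probability 1/2.
    int random_height() {
        int h = 1;
        while (h < kMaxHeight && (rng_() & 1u)) ++h;
        return h;
    }

public:
    SkipList() = default;
    SkipList(const SkipList&) = delete;
    SkipList& operator=(const SkipList&) = delete;
    ~SkipList() {
        Node* n = head_.next[0];
        while (n) { Node* dead = n; n = n->next[0]; delete dead; }
    }

    // Returns false if the key was already present.
    bool insert(int key) {
        Node* prev[kMaxHeight];  // prev[i]: last node on lane i with key < `key`
        Node* cur = &head_;
        for (int lane = kMaxHeight - 1; lane >= 0; --lane) {
            while (cur->next[lane] && cur->next[lane]->key < key)
                cur = cur->next[lane];
            prev[lane] = cur;
        }
        Node* found = prev[0]->next[0];
        if (found && found->key == key) return false;

        const int height = random_height();
        assert(height >= 1 && height <= kMaxHeight);
        Node* node = new Node(key, height);
        // Only the predecessors are relinked. Nodes after the insertion
        // point are never modified.
        for (int lane = 0; lane < height; ++lane) {
            node->next[lane] = prev[lane]->next[lane];
            prev[lane]->next[lane] = node;
        }
        return true;
    }

    bool contains(int key) const {
        const Node* cur = &head_;
        for (int lane = kMaxHeight - 1; lane >= 0; --lane)
            while (cur->next[lane] && cur->next[lane]->key < key)
                cur = cur->next[lane];
        const Node* hit = cur->next[0];
        return hit && hit->key == key;
    }

    // Keys in sorted order, read off lane 0.
    std::vector<int> keys() const {
        std::vector<int> out;
        for (const Node* n = head_.next[0]; n; n = n->next[0]) out.push_back(n->key);
        return out;
    }
};
```

The insert touches at most one predecessor per lane, so no sorting pass and no restructuring of later nodes is needed to keep lane 0 in order.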