Upper Bound for Scapegoat tree - c++

Pat Morin's free textbook: Open Data Structures: Scapegoat trees.
http://opendatastructures.org/ods-cpp.pdf
Pages 174-175
Scapegoat trees track n = number of nodes and q = upper bound.
What is this upper bound? I thought it was the maximum number of nodes that could be in the tree given its height, but it is not. How do I find the upper bound so that I can build this tree?

In this context, q is what the Wikipedia article calls MaxNodeCount:
[..] MaxNodeCount simply represents the highest achieved NodeCount. It is set to NodeCount whenever the entire tree is rebalanced, and after insertion is set to max(MaxNodeCount, NodeCount).
(where NodeCount is n in the book)
Also, if after deletion
NodeCount <= α * MaxNodeCount
then the whole tree is rebalanced, and MaxNodeCount is reset to the value of NodeCount. In the book, the value of α is 0.5.
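The bookkeeping described above could be sketched roughly as follows. This is only a sketch under assumptions: the struct name ScapegoatCounters and its member functions are made up for illustration, the tree itself and the rebuild routine are left out, and the update rules follow the Wikipedia wording quoted above rather than the book's exact code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical bookkeeping only: the actual tree and its rebuild routine are omitted.
struct ScapegoatCounters {
    std::size_t n = 0;   // NodeCount: nodes currently in the tree
    std::size_t q = 0;   // MaxNodeCount: largest n seen since the last full rebuild
    double alpha = 0.5;  // the value the answer attributes to the book

    // Call after a node has been inserted.
    void onInsert() {
        ++n;
        q = std::max(q, n);
    }

    // Call after a node has been removed.
    // Returns true when the caller should rebuild the whole tree.
    bool onRemove() {
        assert(n > 0);
        --n;
        if (static_cast<double>(n) <= alpha * static_cast<double>(q)) {
            q = n;  // a full rebuild resets the upper bound
            return true;
        }
        return false;
    }
};
```

So q is never computed from the shape of the tree; it is just a counter that is maintained next to n.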

Related

unbalanced avl tree check function

I'm implementing an AVL tree and wrote this function to calculate the balance factor of a given tree:
int avlTree::balanceFactor(avlNode *tree){
    return height(tree->left) - height(tree->right);
}
It does return the right balance factor for the tree, but it doesn't let me determine whether the tree is AVL balanced, because according to the definition the balance factor has to be checked for every subtree. For example, this tree:
would have, according to the function, a balance factor of 0, which doesn't tell me much when it comes to balancing the tree. What can I add?
Your balanceFactor function is correct. You just apply it to nodes starting from the root, going down the chain of unbalanced nodes, as described here, for example.
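If the goal is only to test whether a whole tree satisfies the AVL condition, one possible sketch is to compute heights bottom-up and give up as soon as any subtree is out of balance. The Node type and the names checkedHeight and isAvlBalanced are assumptions for illustration; the question's avlNode and height are not shown in full.

```cpp
#include <algorithm>
#include <cstdlib>

// Hypothetical node type; only the child links matter for this check.
struct Node {
    Node* left = nullptr;
    Node* right = nullptr;
};

// Returns the height of the subtree (empty tree = 0), or -1 as soon as any
// subtree violates |height(left) - height(right)| <= 1.
int checkedHeight(const Node* t) {
    if (t == nullptr) return 0;
    int hl = checkedHeight(t->left);
    if (hl < 0) return -1;
    int hr = checkedHeight(t->right);
    if (hr < 0) return -1;
    if (std::abs(hl - hr) > 1) return -1;
    return 1 + std::max(hl, hr);
}

// True when every subtree, not just the root, has a balance factor in [-1, 1].
bool isAvlBalanced(const Node* root) {
    return checkedHeight(root) >= 0;
}
```

A tree whose root has a long left chain and an equally long right chain has a balance factor of 0 at the root, yet this check rejects it because the chains themselves are unbalanced.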

What if the root of a tree is changed?

Given a tree (by tree I mean N nodes, N-1 edges, and connected), the root of the tree is changed to, let's say, r. Now, given another node, let's say t, you have to find the sum of all nodes in the subtree rooted at t.
I was trying to implement it in C++.
std::map<int, std::vector< pair<int,int> > > map;
If I iterate over the vector map[t], I have to ensure that it does not follow a path which leads to r. How would I ensure that?
Also, is there a better way to store a tree in C++, given that the root of the tree might change? I think there must be, because the map does not convey anything about a root. :)
The problem is that you have your tree stored as a general graph. Hence for a given vertex you don't know which is the arc leading towards the root (i.e. the parent).
The solution very much depends on the context. A simple solution would be a depth first search starting from r and looking for t. As soon as you find t you have found also the parent of t, and you can easily identify the subtree starting from t.
Alternatively you can start from t and look for r. When you find r you have found the parent of t and you can traverse all other arcs to find the subtree of t.
About an alternative representation of the graph, usually it is better to keep a list of vertices and for each vertex keep a list of neighbour vertices as in:
std::map<int, std::list<int> >
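The first suggestion (search from r for t, remember t's parent, then sum everything reachable from t without going back through that parent) might look like the sketch below. Several things here are assumptions: the Graph alias uses std::vector instead of std::list, each vertex is given a value in a separate map, vertex ids are taken to be non-negative so that -1 can mean "no parent", and all function names are made up.

```cpp
#include <map>
#include <vector>

using Graph = std::map<int, std::vector<int>>;  // vertex -> neighbours

// Sums the values in the subtree under `node`, never stepping back to `parent`.
long subtreeSum(const Graph& g, const std::map<int, long>& value,
                int node, int parent) {
    long sum = value.at(node);
    for (int next : g.at(node))
        if (next != parent) sum += subtreeSum(g, value, next, node);
    return sum;
}

// Depth-first search from `node` for `target`; on success stores target's parent.
bool findParent(const Graph& g, int node, int parent, int target, int& parentOut) {
    if (node == target) { parentOut = parent; return true; }
    for (int next : g.at(node))
        if (next != parent && findParent(g, next, node, target, parentOut))
            return true;
    return false;
}

// Sum of the subtree rooted at t when the whole tree is rooted at r.
long rerootedSubtreeSum(const Graph& g, const std::map<int, long>& value,
                        int r, int t) {
    int parent = -1;  // stays -1 when t == r, so the whole tree is summed
    findParent(g, r, -1, t, parent);
    return subtreeSum(g, value, t, parent);
}
```

Each query costs O(N) this way, and the recursion depth equals the depth of the tree, so for very deep trees an explicit stack would be safer.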

AVL Insert and balance loop

I am implementing AVL trees in C++ in my own code, but this problem is more about understanding AVL trees than about the code itself. I am sorry if it doesn't fit here, but I have crawled through the Internet and I still have not found a solution to my problem.
My code works as expected with relatively small inputs (~25-30 digits), so I suppose it should work for more. I am using an array in which I hold the nodes I have visited during insertion, and then, using a while loop, I raise the height of each node when needed. I know that this procedure has to end when I find a node whose heights are equal (that is, their difference is 0).
The problem comes with balancing. While I can find the balance factor of each node and balance the tree correctly, I am not sure whether I should stop adjusting the heights after balancing and just end the insertion loop, or keep going until the condition is met, and I just can't figure it out. I know that after deleting a node and re-balancing the tree I should keep checking, but I am not sure about insertion and balancing.
Can anyone provide any insight into this, and perhaps some documentation?
If you insert only one item at a time: Only one (single or double) rotation is needed to readjust an AVL tree after an insertion throws it out of balance. http://cis.stvincent.edu/html/tutorials/swd/avltrees/avltrees.html You can probably prove it by yourself after you know the conclusion.
Just for reference of future readers there is no need to edit the heights of the nodes above the node you balanced if you have implemented the binary tree like my example:
      10
 (1)/    \(2)
   8      16
     (1)/    \(0)
       11
(Numbers in parentheses are the height of each subtree)
Suppose that we insert a node with data=15 into the tree above. Then the resulting tree is as follows:
      10
 (1)/    \(2)
   8      16
     (1)/    \(0)
       11
   (0)/    \(1)
            15
Notice how the previous heights of the subtrees are not yet edited. After a successful insertion we run back through the insertion path, in this case (11, 16, 10), and edit the heights when needed. That means the left height of the subtree of 16 will be 2 while its right height is 0, resulting in an imbalanced AVL tree. After balancing the tree with a double rotation the subtree is:
        15
   (1)/    \(1)
    11      16
So the maximum height of this subtree is 1, as it was before the insertion; therefore the heights above the root of this subtree are unchanged, and the function that updates the heights can return at this point.
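For readers who want to see the early stop in code, here is a rough recursive sketch. It is not the array-of-visited-nodes implementation from the question, and it stores one height per node (leaf = 1) instead of separate left and right heights. All names (AvlNode, Link, avlInsert and the helpers) are assumptions for illustration. The return value of avlInsert tells the caller whether it still needs to re-check its own height; after a rebalance it returns false, so nothing above is touched.

```cpp
#include <algorithm>
#include <memory>
#include <utility>

struct AvlNode {
    int key;
    int height = 1;  // height of the subtree rooted here (leaf = 1)
    std::unique_ptr<AvlNode> left, right;
    explicit AvlNode(int k) : key(k) {}
};
using Link = std::unique_ptr<AvlNode>;

int heightOf(const Link& n) { return n ? n->height : 0; }
void updateHeight(AvlNode& n) {
    n.height = 1 + std::max(heightOf(n.left), heightOf(n.right));
}

void rotateRight(Link& root) {
    Link pivot = std::move(root->left);
    root->left = std::move(pivot->right);
    updateHeight(*root);
    pivot->right = std::move(root);
    updateHeight(*pivot);
    root = std::move(pivot);
}

void rotateLeft(Link& root) {
    Link pivot = std::move(root->right);
    root->right = std::move(pivot->left);
    updateHeight(*root);
    pivot->left = std::move(root);
    updateHeight(*pivot);
    root = std::move(pivot);
}

// Inserts key (duplicates are ignored). Returns true while the caller still
// has to re-check its own height. `rebalances` counts the nodes at which a
// single or double rotation was applied.
bool avlInsert(Link& node, int key, int& rebalances) {
    if (!node) { node = std::make_unique<AvlNode>(key); return true; }
    if (key == node->key) return false;
    Link& child = key < node->key ? node->left : node->right;
    if (!avlInsert(child, key, rebalances)) return false;  // nothing changed below

    int before = node->height;
    updateHeight(*node);
    int balance = heightOf(node->left) - heightOf(node->right);
    if (balance > 1) {
        if (heightOf(node->left->left) < heightOf(node->left->right))
            rotateLeft(node->left);   // first half of a double rotation
        rotateRight(node);
        ++rebalances;
        return false;  // subtree height is back to its pre-insertion value
    }
    if (balance < -1) {
        if (heightOf(node->right->right) < heightOf(node->right->left))
            rotateRight(node->right); // first half of a double rotation
        rotateLeft(node);
        ++rebalances;
        return false;
    }
    return node->height != before;  // unchanged height: ancestors are unaffected
}
```

Inserting 10, 8, 16, 11 and then 15 with this sketch reproduces the example above: the double rotation happens at 16, and the height stored at 10 is never modified.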

O(log n) index update and search

I need to keep track of indexes in a large text file. I have been keeping a std::map of indexes and accompanying data as a quick hack. If the user is on character 230,400 in the text, I can display any meta-data for the text.
Now that my maps are getting larger, I'm hitting some speed issues (as expected).
For example, if the text is modified at the beginning, I need to increment the indexes after that position in the map by one, an O(N) operation.
What's a good way to change this to O(log N) complexity? I've been looking at AVL Arrays, which is close.
I'm hoping for O(log n) time for updates and searches. For example, if the user is on character 500,000 in the text array, I want to very quickly find if there is any meta data for that character.
(Forgot to add: The user can add meta data whenever they like)
Easy. Make a binary tree of offsets.
The value of any offset is computed by traversing the tree from the leaf to the root adding offsets any time a node is a right child.
Then if you add text early in the file, you only need to update the offsets for nodes which are parents of the offsets that change. For example, say you added text before the very first offset: you add the number of characters added to the root node, and now one half of your offsets have been corrected. Now traverse to the left child and add the offset again; now 3/4 of the offsets have been updated. Continue traversing left children, adding the offset, until all the offsets are updated.
#OP:
Say you have a text buffer with 8 characters, and 4 offsets into the odd bytes:
the tree:                 5
                        /   \
                       3     2
                      / \   / \
                     1   0 0   0
sum of right
children (indices)   1   3 5   7
Now say you inserted 2 bytes at offset 4. Buffer was:
01234567
Now it is
0123xx4567
So you modify just the nodes that dominate the parts of the array that changed. In this case, only the root node needs to be modified.
the tree:                 7
                        /   \
                       3     2
                      / \   / \
                     1   0 0   0
sum of right
children (indices)   1   3 7   9
The summation rule is: walking from leaf to root, I add to my own value the value of my parent whenever I am that parent's right child.
To find whether there is an index at my current location, I start at the root with a running sum of zero and ask: is the running sum plus this node's offset greater than my location? If yes, I traverse left and add nothing. If no, I traverse right and add the node's value to the running sum. If at the end of the traversal the running sum plus the leaf's value equals my location, then yes, there is an annotation. You can do a similar traversal with a minimum and maximum index to find the node that dominates all the indices in the range, which finds all the indices for the text being displayed.
Oh, and this is just a toy example. In reality you need to periodically rebalance the tree; otherwise, if you keep adding new indices in just one part of the file, you can end up with a tree which is way out of balance, and worst-case performance would no longer be O(log n) but O(n). To keep the tree balanced you would need to implement a balanced binary tree such as a red/black tree. That would guarantee O(log n) performance, where n is the number of metadata entries.
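To make the toy example concrete, here is a rough sketch of it as a fixed, perfect binary tree stored in an array. It is not a balanced, growable structure, so it does not cover adding new metadata. The class name OffsetTree, its member names, the requirement that the number of indices be a power of two, and the choice that an index exactly at the insertion point also moves are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class OffsetTree {
public:
    // `sorted` holds the initial indices in increasing order.
    explicit OffsetTree(const std::vector<long>& sorted)
        : leaves_(sorted.size()), val_(2 * sorted.size(), 0) {
        assert(leaves_ > 0 && (leaves_ & (leaves_ - 1)) == 0);
        build(1, 0, leaves_, 0, sorted);
    }

    // Index stored in leaf i: walk leaf to root, adding the parent's value
    // whenever the current node is a right child.
    long position(std::size_t i) const {
        std::size_t node = leaves_ + i;  // heap layout: leaves are [leaves_, 2*leaves_)
        long sum = val_[node];
        for (; node > 1; node /= 2)
            if (node % 2 == 1) sum += val_[node / 2];
        return sum;
    }

    // Text of length delta > 0 was inserted at pos: move every index >= pos.
    void shift(long pos, long delta) {
        std::size_t node = 1;
        long base = 0;
        while (node < leaves_) {
            if (pos <= base + val_[node]) {
                val_[node] += delta;  // the whole right subtree moves at once
                node = 2 * node;      // the left subtree may be partly affected
            } else {
                base += val_[node];   // the left subtree lies entirely before pos
                node = 2 * node + 1;
            }
        }
        if (base + val_[node] >= pos) val_[node] += delta;
    }

    // Is there an index exactly at pos?
    bool contains(long pos) const {
        std::size_t node = 1;
        long base = 0;
        while (node < leaves_) {
            if (pos < base + val_[node]) node = 2 * node;
            else { base += val_[node]; node = 2 * node + 1; }
        }
        return base + val_[node] == pos;
    }

private:
    void build(std::size_t node, std::size_t lo, std::size_t hi, long base,
               const std::vector<long>& s) {
        if (hi - lo == 1) { val_[node] = s[lo] - base; return; }
        std::size_t mid = lo + (hi - lo) / 2;
        val_[node] = s[mid] - base;  // offset of the right subtree's first index
        build(2 * node, lo, mid, base, s);
        build(2 * node + 1, mid, hi, s[mid], s);
    }

    std::size_t leaves_;
    std::vector<long> val_;
};
```

Building it from the indices 1, 3, 5, 7 gives exactly the node values in the first picture (5 / 3 2 / 1 0 0 0), and shift(4, 2) changes only the root, as in the second picture.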
Don't store indices! There's no way to do that and simultaneously have performance better than O(n): add a character at the beginning of the array and you'll have to increment n - 1 indices, no way around it.
But if you store substring lengths instead, you only have to change one per level of the tree structure, bringing you down to O(log n). My (untested) solution would be to use a Rope with metadata attached to the nodes. You might need to play around with that a bit, but I think it's a solid foundation.
Hope that helps!

find mode in a rolling window of a long sequence of data with duplicates

Given a sequence of data (with duplicates), move a fixed-size window along the sequence and find the mode in the window at each iteration, where the oldest value is removed and a new value is inserted into the window.
I cannot find better solutions here.
My idea:
Use a hashtable in which the key is the data value and the associated value is the frequency with which it occurs in the window.
At the first iteration, go through each value in the window and put it into the hashtable, counting the frequency of each value along the way. After that, traverse the hashtable and return the value with the highest frequency.
For each following iteration, look up the oldest value in the hashtable and reduce its frequency by 1; if the frequency becomes 0, replace that entry with the new value. Otherwise, just insert the new value into the hashtable. Then traverse the table and return the mode.
It is O(n * m), where n is the size of the data sequence and m is the window size.
The drawbacks are: the hashtable size is not fixed, so it may incur resize overhead, and the table has to be traversed at each iteration, which is not efficient.
Is it possible to do it in O(n lg m) or O(n)?
Any help is appreciated.
thanks
Another solution:
At the first iteration, build a hashtable with each data value as the key and its frequency as the associated value. Based on the hashtable, build a multimap with the frequency as the key and the data value as the associated value.
After that, at each iteration, remove the oldest value from the window and update the hashtable, then update the multimap to match the updated hashtable entry. If the multimap key holds multiple values, leave under it only the values whose frequency has not changed, and add a new pair with the new frequency and the value.
Then take the new value into the window, update the hashtable, and update the multimap in the same way.
The rightmost entry of the multimap (a binary search tree) is the mode, because its key is the highest frequency in the current window.
Time: O(m + m lg m + n lg m); if n >> m, this is O(n lg m).
Space: O(m)
Any better idea?
Space O(M):
One ring buffer to hold the M values.
One BST holding M {value, PQ pointer} pairs.
One Priority Queue holding M Counts.
Update in time O(lg M):
Find the departing value in the ring buffer O(1),
Find the same value in the BST O(lg M),
Adjust the count in the linked PQ node.
Do an adjust Priority on that node O(lg M)
Replace the old ring buffer entry with the new one O(1)
Find the new value in the BST O(lg M),
Adjust the count in the linked PQ node.
Do an adjust Priority on that node O(lg M)
GetFirst on PQ to find mode O(1)
You could get rid of the ring buffer by adding a nextItem pointer to the BST structure, and keeping an external pointer to the oldest item. This speeds it up by one BST lookup, and may be a space win if the value size is larger than the pointer size. But the algorithm becomes more complicated to code.
Recalling the solution to the previous question...
Maintain a ring buffer of data values. Make a type that is a frequency/data value pair.
Maintain a map keyed on data to these pairs; make a multimap, keyed on frequency, also containing such pairs.
At each step advancing through the data, the mode is the last entry in the map keyed on frequency. There may be a tie - what to do with that is an exercise for the reader. After reporting the mode, you need to use the map to find the pair belonging to the value to delete; then retrieve from the multimap all the entries that have the frequency of the value being deleted and find the one with the right value. Delete, alter, and re-insert nodes into the map and multimap to deal with the removed data; use a similar process to deal with the added data. Depending on your expected data, it might be worth checking first whether the value to be inserted matches the one to delete.
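A compact sketch of this idea is shown below, with two deliberate simplifications that are not part of the answers above. Instead of a multimap keyed on frequency it uses a std::set of (frequency, value) pairs, so the entry to delete can be found directly without scanning all values that share a frequency. Since the whole input is assumed to be available as a vector, no separate ring buffer is kept. The function name slidingMode is made up, and ties are broken in favour of the largest value simply because of how the pairs are ordered.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Returns the mode of every window of size m, left to right.
std::vector<int> slidingMode(const std::vector<int>& data, std::size_t m) {
    std::vector<int> modes;
    if (m == 0 || data.size() < m) return modes;

    std::map<int, int> freq;               // value -> count inside the window
    std::set<std::pair<int, int>> byFreq;  // (count, value), highest count last

    // Changes the count of `value` by delta and keeps both containers in sync.
    auto adjust = [&](int value, int delta) {
        int old = freq[value];
        if (old > 0) byFreq.erase({old, value});
        int now = old + delta;
        if (now > 0) {
            freq[value] = now;
            byFreq.insert({now, value});
        } else {
            freq.erase(value);
        }
    };

    for (std::size_t i = 0; i < data.size(); ++i) {
        adjust(data[i], +1);                  // value entering the window
        if (i >= m) adjust(data[i - m], -1);  // value leaving the window
        if (i + 1 >= m) modes.push_back(byFreq.rbegin()->second);
    }
    return modes;
}
```

Each step performs a constant number of O(lg m) container operations, which gives O(n lg m) time and O(m) space overall.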