Efficient Huffman tree search while remembering path taken - c++

As a follow-up to my question about an efficient way of storing Huffman trees, I was wondering what the fastest and most efficient way would be to search a binary tree (based on the Huffman coding output) while storing the path taken to a particular node.
This is what I currently have:
Add the root node to a queue.
While the queue is not empty, pop an item off the queue:
    Check whether it is the node we are looking for.
    If yes:
        Follow the head pointer back to the root node, noting at each node we visit whether it is the left or right child.
        Break out of the search.
    Otherwise, enqueue its left and right nodes.
Since this is a Huffman tree, all of the entries that I am looking for will exist. The above is a breadth-first search, which is considered the best fit for Huffman trees, since items that occur more often in the source sit higher up in the tree to get better compression. However, I can't figure out a good way to keep track of how we got to a particular node without backtracking using the head pointer I put in the node.
In this case, I am also getting all of the right/left paths in reverse order. For example, if the path from the root is right, left, left, then following the head pointers back to the root yields left, left, right, or 001 in binary, when what I want is to get 100 in an efficient way.
Storing the path from root to the node as a separate value inside the node was also suggested, however this would break down if we ever had a tree that was larger than however many bits the variable we created for that purpose could hold, and at that point storing the data would also take up huge amounts of memory.

Create a dictionary of value -> bit-string; that would give you the fastest lookup.
If the values are of a known size, you can probably get by with just an array of bit-strings and look up the values by their index.
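One way the dictionary could be filled is with a single depth-first walk that carries the path along, so the bits come out in root-to-leaf order and no head pointer is needed. This is only a sketch: the `HuffNode` layout, the `makeCodeTable` name, and the choice of '0' for left and '1' for right (which matches the 100 example in the question) are assumptions, not the asker's actual code.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical node layout; the question does not show its own node type.
struct HuffNode {
    char symbol = 0;                        // meaningful only for leaves
    std::unique_ptr<HuffNode> left, right;  // both null for a leaf
};

// Walk the tree once, appending '0' for a left step and '1' for a right step.
// The prefix grows from the root downwards, so no reversal is needed.
void buildCodes(const HuffNode* node, std::string& prefix,
                std::map<char, std::string>& table) {
    if (!node) return;
    if (!node->left && !node->right) {
        // A tree consisting of a single leaf still needs a one-bit code.
        table[node->symbol] = prefix.empty() ? "0" : prefix;
        return;
    }
    prefix.push_back('0');
    buildCodes(node->left.get(), prefix, table);
    prefix.back() = '1';
    buildCodes(node->right.get(), prefix, table);
    prefix.pop_back();
}

std::map<char, std::string> makeCodeTable(const HuffNode* root) {
    std::map<char, std::string> table;
    std::string prefix;
    buildCodes(root, prefix, table);
    return table;
}
```

After one call to `makeCodeTable(root)`, encoding a symbol is a plain map lookup. Storing codes as strings sidesteps the fixed-width concern raised in the question, at the cost of one byte per bit; a packed representation would be the obvious next step.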

If you're decoding Huffman-encoded data one bit at a time, your performance will be poor. As much as you'd like to avoid using lookup tables, they are the only way to go if you care about performance. Because of the way Huffman codes are created, they are left-to-right unique (no code is a prefix of another) and lend themselves perfectly to a fast table lookup.
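A minimal sketch of such a table lookup follows. It assumes the codes are already available as a symbol -> bit-string map and that the longest code is short enough for a table of 2^maxLen slots. The names `Entry`, `buildDecodeTable`, and `decode` are made up for illustration, and the input is kept as a string of '0'/'1' characters purely for readability; a real decoder would read packed bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Entry {
    char symbol;
    std::uint8_t length;  // number of bits the code consumes; 0 = unused slot
};

// Every slot whose leading bits equal a code points at that code's symbol,
// so a single index operation decodes one symbol.
std::vector<Entry> buildDecodeTable(const std::map<char, std::string>& codes,
                                    unsigned maxLen) {
    assert(maxLen > 0 && maxLen <= 16);
    std::vector<Entry> table(std::size_t{1} << maxLen, Entry{0, 0});
    for (const auto& kv : codes) {
        const std::string& bits = kv.second;
        assert(!bits.empty() && bits.size() <= maxLen);
        unsigned value = 0;
        for (char b : bits) value = (value << 1) | (b == '1' ? 1u : 0u);
        const unsigned pad = maxLen - static_cast<unsigned>(bits.size());
        for (unsigned fill = 0; fill < (1u << pad); ++fill)
            table[(value << pad) | fill] =
                Entry{kv.first, static_cast<std::uint8_t>(bits.size())};
    }
    return table;
}

std::string decode(const std::string& bitString, const std::vector<Entry>& table,
                   unsigned maxLen) {
    std::string out;
    std::size_t pos = 0;
    while (pos < bitString.size()) {
        // Peek at the next maxLen bits, padding with zeros past the end.
        unsigned window = 0;
        for (unsigned i = 0; i < maxLen; ++i) {
            window <<= 1;
            if (pos + i < bitString.size() && bitString[pos + i] == '1') window |= 1u;
        }
        const Entry e = table[window];
        // Stop on an unused slot or on a code that would run past the input.
        if (e.length == 0 || pos + e.length > bitString.size()) break;
        out.push_back(e.symbol);
        pos += e.length;
    }
    return out;
}
```

The prefix property is what makes the fill loop safe: no two codes can claim the same slot, because that would require one code to be a prefix of the other.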


Ukkonen's suffix tree algorithm, what is necessary?

Yes I have read this: Ukkonen's suffix tree algorithm in plain English?
It is a great explanation of the algorithm but it is not so much the algorithm itself that is killing me but rather the data structure used to implement it.
I need the data structure to be as minimal and as fast as possible. I have seen many implementations: some using only nodes, some only edges, some both edges and nodes, and so on. Then there are variations: a website I was reading claimed that a node need not have a pointer to its parent, and other places don't explain how the children of a node are managed.
My idea is to have a Node structure with int start, and int * end (points to the current end or phase i). Each node will have a suffix_link pointer, a pointer to its parent, and a pointer to a vector containing its child nodes.
My question is, are these things sufficient and necessary to implement a suffix tree? Can I minimize it in any way? I haven't seen an implementation with children in vectors yet so I am skeptical as to my own thinking. Could someone explain what one would need to implement a suffix tree in this manner using only nodes?
Following may be helpful:
Ukkonen’s Suffix Tree Construction
Here we have
1. start, end to represent edge label
2. suffix link
3. an array for children
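The three fields listed above could be laid out roughly as follows. This is a sketch under assumptions: a byte alphabet of 256 symbols, raw pointers, and the names `SuffixNode`, `kAlphabet`, and `edgeLength` are all illustrative. Note that it carries no parent pointer, in line with the list.

```cpp
#include <array>
#include <cassert>

constexpr int kAlphabet = 256;  // assumption: one child slot per byte value

struct SuffixNode {
    int start = -1;        // index of the first character of the incoming edge label
    int* end = nullptr;    // leaves can share a pointer to one global end that
                           // advances each phase; internal nodes point at a fixed value
    SuffixNode* suffixLink = nullptr;
    std::array<SuffixNode*, kAlphabet> children{};  // indexed by the first character
                                                    // of each child's edge label

    // Number of characters on the edge leading into this node.
    int edgeLength() const { return *end - start + 1; }
};
```

A fixed array makes child lookup a single index operation but costs one pointer per alphabet symbol in every node; a vector or map of children, as proposed in the question, trades some lookup speed for memory.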
When I had to implement that algorithm, the best-explained document was the original Ukkonen paper, and there is a newer one with images.
Yes, these documents contain all the insight needed to implement Ukkonen's suffix tree algorithm.

Kd tree: data stored only in leaves vs stored in leaves and nodes

I am trying to implement a Kd tree to perform the nearest neighbor and approximate nearest neighbor search in C++. So far I came across 2 versions of the most basic Kd tree.
The one, where data is stored in nodes and in leaves, such as here
The one, where data is stored only in leaves, such as here
They seem to be fundamentally the same, having the same asymptotic properties.
My question is: are there reasons to choose one over the other?
I have come up with two reasons so far:
The tree which stores data in nodes too is shallower by 1 level.
The tree which stores data only in leaves has an easier-to-implement delete function.
Are there some other reasons I should consider before deciding which one to make?
You can just mark nodes as deleted, and postpone any structural changes to the next tree rebuild. k-d-trees degrade over time, so you'll need to do frequent tree rebuilds. k-d-trees are great for low-dimensional data sets that do not change, or where you can easily afford to rebuild an (approximately) optimal tree.
As for implementing the tree, I recommend using a minimalistic structure. I usually do not use nodes. I use an array of data object references. The axis is defined by the current search depth, no need to store it anywhere. Left and right neighbors are given by the binary search tree of the array. (Otherwise, just add an array of byte, half the size of your dataset, for storing the axes you used). Loading the tree is done by a specialized QuickSort. In theory it's O(n^2) worst-case, but with a good heuristic such as median-of-5 you can get O(n log n) quite reliably and with minimal constant overhead.
While it doesn't hold as much for C/C++, in many other languages you will pay quite a price for managing a lot of objects. A type*[] is the cheapest data structure you'll find, and in particular it does not require a lot of management effort. To mark an element as deleted, you can null it, and search both sides when you encounter a null. For insertions, I'd first collect them in a buffer. And when the modification counter reaches a threshold, rebuild.
And that's the whole point of it: if your tree is really cheap to rebuild (as cheap as resorting an almost pre-sorted array!) then it does not harm to frequently rebuild the tree.
Linear scanning over a short "insertion list" is very CPU cache friendly. Skipping nulls is very cheap, too.
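The array-only layout described above might look like the sketch below. It makes several assumptions: 2-D points, `std::nth_element` standing in for the hand-written QuickSort variant the answer mentions, and illustrative names (`buildKd`, `nearestIndex`). The deleted-element marking and the insertion buffer are left out to keep it short.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::array<double, 2>;  // assumption: 2-D points

// Rearrange pts[lo, hi) so the median along the current axis sits in the
// middle slot, then recurse into both halves. The array itself is the tree.
void buildKd(std::vector<Point>& pts, std::size_t lo, std::size_t hi,
             std::size_t depth = 0) {
    if (hi - lo <= 1) return;
    const std::size_t axis = depth % 2;  // axis follows from depth, never stored
    const std::size_t mid = lo + (hi - lo) / 2;
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                     [axis](const Point& a, const Point& b) { return a[axis] < b[axis]; });
    buildKd(pts, lo, mid, depth + 1);
    buildKd(pts, mid + 1, hi, depth + 1);
}

double dist2(const Point& a, const Point& b) {
    const double dx = a[0] - b[0], dy = a[1] - b[1];
    return dx * dx + dy * dy;
}

void nearestRec(const std::vector<Point>& pts, std::size_t lo, std::size_t hi,
                std::size_t depth, const Point& q, std::size_t& best, double& bestD2) {
    if (lo >= hi) return;
    const std::size_t axis = depth % 2;
    const std::size_t mid = lo + (hi - lo) / 2;
    const double d2 = dist2(pts[mid], q);
    if (d2 < bestD2) { bestD2 = d2; best = mid; }
    const double delta = q[axis] - pts[mid][axis];
    const bool goLeft = delta < 0;
    // Search the half that contains the query first.
    if (goLeft) nearestRec(pts, lo, mid, depth + 1, q, best, bestD2);
    else        nearestRec(pts, mid + 1, hi, depth + 1, q, best, bestD2);
    // Cross the splitting plane only if it is closer than the best hit so far.
    if (delta * delta < bestD2) {
        if (goLeft) nearestRec(pts, mid + 1, hi, depth + 1, q, best, bestD2);
        else        nearestRec(pts, lo, mid, depth + 1, q, best, bestD2);
    }
}

// Returns the index (into the rearranged array) of the point nearest to q.
std::size_t nearestIndex(const std::vector<Point>& pts, const Point& q) {
    assert(!pts.empty());
    std::size_t best = 0;
    double bestD2 = std::numeric_limits<double>::infinity();
    nearestRec(pts, 0, pts.size(), 0, q, best, bestD2);
    return best;
}
```

Rebuilding is just another call to `buildKd(pts, 0, pts.size())` after appending the buffered insertions, which is what makes the rebuild-often strategy cheap.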
If you want a more dynamic structure, I recommend looking at R*-trees. They are actually designed to balance on inserts and deletions, and organize the data in a disk-oriented block structure. But even for R-trees, there have been reports that keeping an insertion buffer etc. to postpone structural changes improves performance. And bulk loading in many situations helps a lot, too!

How to store a tree on the disk and make add/delete/swap operations easy

All right, this question requires a bit of reading on your side. I'll try to keep this short and simple.
I have a tree (not a binary tree, just a tree) with data associated to each node (binary data, I don't know what they are AND I don't know how long they are)
Each node of the tree also has an index which isn't related to how it appears in the tree, to make it short it could be like that:
The index number represents the order the user WANTS the tree to be navigated and cannot be duplicated.
I need to store this structure in a file on the disk.
My problem is: how to design a flexible disk storing format that can make loading and working on the tree as easy as possible.
In fact the user should be allowed to
Create a child block to an element (and this should be easy enough, it's sufficient to add data to the file paying attention to avoiding duplicated indices)
Delete a child (I should prompt the user "do you want to delete all this node's children as well? or should I add its children to its parent?"). The tricky part about this is that deleting a node could also free up an index, and I can't let the user use that index again when adding another node (or the order he set could be messed up), I need to update the entire tree!
Swap an index with another one
I'm using C++ and Qt and by now I thought of a lot of structures with a lot of fields like this one
struct dataToBeStoredInTheFile
{
    long data_size;
    byte *data;              // ... the data here
    int index;
    int number_of_children;
    int *children_indices;   // ... array of integers
};
This has the advantage of identifying each node by its respective index, but it is very slow when swapping indices between two nodes, or when deleting a node and updating every other node's index, because you have to traverse all the nodes and all of their "children_indices" arrays.
Would using something like a "hash" to identify each node be more flexible?
Should I use two indices, one for the position in the tree and one for the user's index? If you have any better idea to store the data, you're welcome
I would suggest using something like Boost.Serialization; then you don't have to worry about the actual format when saving to disk, and can concentrate on an effective in-memory solution.
Edit: Re-reading your question, I see you are using Qt; in that case it should have its own serialization framework that you can use.
If it doesn't have to be a SINGLE file, you could use the file/directory structure to represent your tree, where each node corresponds to a single file (w/ a directory for each interior node). Maybe not the most efficient, but incredibly easy to do.
Again, if you have some flexibility on the number of files (but not as much as above), you could have one file for the tree structure (so that each node is a fixed size, simplifying its manipulation) and a separate one for storing node contents. To speed up working with the "content file", you could treat it the way a garbage collecting system would: just keep adding new/updated nodes on the end, marking old nodes as no longer in use, and periodically clearing things out.
Better yet, follow @JoachimPileborg's advice :)
I don't think you should use the user-specified index to identify the nodes, as that's not directly related to the way you're storing the tree, and you don't have an efficient way of accessing the nodes by index. You should either keep two indices for each node - the user-specified one, and another one that's implementation dependent; or maintain an array mapping the user-specified index to one you're using for the implementation.
Also, it might be better if you use a different structure to store the tree. For each node, store the following:
the index of the parent
the index of the leftmost son
the index of the left brother
the index of the right brother
This way adding a node and swapping two nodes could be done with some simple pointer manipulations (I don't mean explicit pointers - the indices are somewhat like pointers anyway). Deleting a node would still probably be slow as you have to visit all the children.
As a bonus, if you use this structure, every node has a fixed size (unlike with the linked list you're proposing). This means that you can access a node directly by seeking in the file.
You should also maintain the smallest index the user can use for new nodes - so, for example, even if the largest index was 5 and it was deleted, you still keep 6 as the next free index so 5 cannot be reused.
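A fixed-size record along these lines might look like the sketch below. The field names, the `reserved` filler, and the `dataOffset`/`dataSize` pair pointing into a separate content area are all assumptions made for illustration. Dumping the struct bytes directly is also an assumption that only holds when the file is read back on the same platform.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <sstream>
#include <type_traits>
#include <utility>

// One record per node; "slot" is the implementation-side index, so the
// record for slot n starts at byte n * sizeof(NodeRecord).
struct NodeRecord {
    std::int32_t parent;        // slot of the parent, -1 for the root
    std::int32_t firstChild;    // slot of the leftmost child, -1 if none
    std::int32_t leftSibling;   // -1 if none
    std::int32_t rightSibling;  // -1 if none
    std::int32_t userIndex;     // navigation order chosen by the user
    std::int32_t reserved;      // filler so the layout has no padding
    std::int64_t dataOffset;    // hypothetical: where the payload lives
    std::int64_t dataSize;      // hypothetical: payload length in bytes
};
static_assert(std::is_trivially_copyable<NodeRecord>::value, "raw I/O needs a trivial type");
static_assert(sizeof(NodeRecord) == 40, "unexpected padding in NodeRecord");

void writeRecord(std::ostream& out, std::int32_t slot, const NodeRecord& rec) {
    out.seekp(static_cast<std::streamoff>(slot) *
              static_cast<std::streamoff>(sizeof(NodeRecord)), std::ios::beg);
    out.write(reinterpret_cast<const char*>(&rec), sizeof rec);
}

NodeRecord readRecord(std::istream& in, std::int32_t slot) {
    NodeRecord rec{};
    in.seekg(static_cast<std::streamoff>(slot) *
             static_cast<std::streamoff>(sizeof(NodeRecord)), std::ios::beg);
    in.read(reinterpret_cast<char*>(&rec), sizeof rec);
    return rec;
}

// Swapping two user indices touches exactly two records and leaves the
// tree links alone.
void swapUserIndex(std::iostream& file, std::int32_t slotA, std::int32_t slotB) {
    NodeRecord a = readRecord(file, slotA);
    NodeRecord b = readRecord(file, slotB);
    std::swap(a.userIndex, b.userIndex);
    writeRecord(file, slotA, a);
    writeRecord(file, slotB, b);
}
```

The functions take generic streams, so the same code works on a `std::fstream` opened in binary mode or on an in-memory `std::stringstream` for testing.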

Given a binary search tree and a number, find a path whose node's data added to be the given number.

Given a binary search tree and a number, find whether there is a path from the root to a leaf such that all numbers on the path add up to the given number.
I know how to do it recursively, but I would prefer an iterative solution.
If we iterate from the root to a leaf each time, work will be repeated, because some paths share a common prefix.
What if the tree is not a binary search tree?
Thanks
Basically this problem can be solved using Dynamic Programming on tree to avoid those overlapping paths.
The basic idea is to keep track of the possible lengths from each leaf to a given node in a table f[node]. If we implement it as a 2-dimensional boolean array, it is something like f[node][len], which indicates whether there is a path from a leaf to node with length equal to len. We can also use a vector<int> to store the values in each f[node] instead of using a boolean array. No matter what kind of representation you use, the way you compute one f from the others is straightforward, in the form of
f[node] is the union of f[node->left] + len_left[node] and f[node->right] + len_right[node].
This is the case of binary tree, but it is really easy to extend it to non-binary-tree cases.
If there is anything unclear, please feel free to comment.
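As a rough illustration of the recurrence, the sketch below computes f[node] as a `std::set<int>` per node. It adapts the idea to the question's setting, where the numbers live in the nodes and not on the edges, so each child's sums are shifted by the node's own value. `TreeNode` and `leafSums` are hypothetical names.

```cpp
#include <cassert>
#include <initializer_list>
#include <set>

struct TreeNode {
    int value;
    TreeNode* left = nullptr;
    TreeNode* right = nullptr;
};

// f[node]: every sum obtainable on a path from this node down to a leaf.
// Each child's set is computed once and merged, shifted by node->value.
std::set<int> leafSums(const TreeNode* node) {
    std::set<int> result;
    if (!node) return result;
    if (!node->left && !node->right) {
        result.insert(node->value);  // a leaf contributes only itself
        return result;
    }
    for (const TreeNode* child : {node->left, node->right})
        for (int s : leafSums(child))
            result.insert(s + node->value);
    return result;
}
```

Checking the target is then `leafSums(root).count(target) > 0`. Extending this to non-binary trees only changes the list of children that the outer loop visits.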
Anything you can do recursively, you can also do iteratively. However, if you are not having performance issues with the recursive solution, then I would leave it as is. It would more likely than not be more difficult to code and read if you try to do it iteratively.
However if you insist, you can transform your recursive solution into an iterative one by using a stack. Every time you make a recursive call, push the state variables in your current function call onto the stack. When you are done with a call, pop off the variables.
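A minimal sketch of that transformation for the root-to-leaf sum problem: each stack entry holds a node together with the sum of the values above it, which is exactly the state a recursive call would have carried. The `TreeNode` type and the `hasPathSum` name are assumptions.

```cpp
#include <cassert>
#include <stack>
#include <utility>

struct TreeNode {
    int value;
    TreeNode* left = nullptr;
    TreeNode* right = nullptr;
};

// Iterative depth-first search with an explicit stack in place of recursion.
bool hasPathSum(const TreeNode* root, int target) {
    if (!root) return false;
    std::stack<std::pair<const TreeNode*, int>> pending;  // (node, sum above it)
    pending.push({root, 0});
    while (!pending.empty()) {
        const TreeNode* node = pending.top().first;
        const int sum = pending.top().second + node->value;
        pending.pop();
        // Only a leaf can complete a root-to-leaf path.
        if (!node->left && !node->right) {
            if (sum == target) return true;
            continue;
        }
        if (node->left)  pending.push({node->left, sum});
        if (node->right) pending.push({node->right, sum});
    }
    return false;
}
```

Because the running sum travels with each stack entry, the shared prefix of two paths is added up only once. Nothing here relies on the BST ordering, so the same loop works for an arbitrary binary tree.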
For BST:
Node current, final = (initialize)
List nodesInPath;
nodesInPath.add(current);
while (current != final) {
    List childrenNodes = current.children;
    if (noChildren) noSolution;
    if (current < final) {
        // choose right child if there is one, otherwise no solution
        current = childrenNodes[right];
    } else if (current > final) {
        // choose left child if there is one, otherwise no solution
        current = childrenNodes[left];
    }
    nodesInPath.add(current);
}
check sum in the nodesInPath
However, for a non-BST you should apply a solution using dynamic programming, as derekhh suggests, if you don't want to calculate the same paths over and over again. I think you can store the total length between each processed node and the root node, calculating the distances as you expand nodes. Then you would apply breadth-first search so as not to traverse the same paths again, reusing the previously computed total distances. The algorithm that comes to mind is a little complex; sorry, but I don't have enough time to write it.

Binary tree number of nodes with a given level

I need to write a program that counts the number of nodes at a certain given level of a binary tree.
I mean < numberofnodes(int level){} >
I tried writing it without any success, because I don't know how to get to a certain level and then count the nodes there.
Do it with a recursive function which only descends to a certain level.
Well, there are many ways you can do this. The best is to have a one-dimensional array that keeps track of the number of nodes at each level as you add and remove them. Considering your requirement, that would be the simplest way.
However, if provided with just a binary tree, you WILL have to traverse and go to that many levels and count the nodes, I do not see any other alternative.
To go to a certain level, you will typically need a variable such as 'current_depth' which tracks the level you are in. Once you reach your level of interest, and provided each node is visited only once (typically an in-order traversal would suffice), you can increment your count. Hope this helped.
I'm assuming that your binary tree is not necessarily complete (i.e., not every node has two or zero children or this becomes trivial). I'm also assuming that you are supposed to count just the nodes at a certain level, not at that level or deeper.
There are many ways to do what you're asked, but you can think of this as a graph search problem: you are given a starting node (the root of the tree), a way to traverse edges (the child links), and a criterion (a certain distance from the root).
At this point you probably learned graph search algorithms - which algorithm sounds like a natural fit for your task?
In general terms:
Recursion.
In each step of the recursion, you need some way to measure which level you are on, and therefore how much further down the tree you need to go from where you are now.
Recursion parts:
What are your base cases? Under what conditions do you say "Okay, time to stop the recursion"?
How can you count something in a recursion without any global count?
I think the easiest is to simply follow the recursive nature of the tree with an accumulator to track the current level (or maybe the number of levels remaining to be descended).
The non-recursive branch of that function is when you hit the level in question. At that point you simply return 1 (because you found one node at that level).
The recursive branch, simply sums the number of nodes at that level returned from the left and right recursive invocations.
This is the pseudo-code; it assumes the root has level 0:
int count(x, currentLevel, desiredLevel)
    if currentLevel == desiredLevel
        return 1
    left = 0
    right = 0
    if x.left != null
        left = count(x.left, currentLevel + 1, desiredLevel)
    if x.right != null
        right = count(x.right, currentLevel + 1, desiredLevel)
    return left + right
So to get the number of nodes for level 3 you call
count(root,0,3)
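In C++ the same idea might be written as below. The `Node` type and the `countAtLevel` name are assumptions, and a null check at the top of the function replaces the per-child checks in the pseudo-code, so an empty tree is handled as well.

```cpp
#include <cassert>

struct Node {
    Node* left = nullptr;
    Node* right = nullptr;
};

// Counts the nodes that sit exactly desiredLevel steps below the root,
// with the root itself at level 0.
int countAtLevel(const Node* x, int currentLevel, int desiredLevel) {
    if (!x) return 0;                            // ran off the tree
    if (currentLevel == desiredLevel) return 1;  // found one node, stop descending
    return countAtLevel(x->left, currentLevel + 1, desiredLevel) +
           countAtLevel(x->right, currentLevel + 1, desiredLevel);
}
```

The call `countAtLevel(root, 0, 3)` mirrors `count(root,0,3)` above. The recursion never descends below the requested level, so the work is bounded by the number of nodes at or above it.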