alphabetic binary search tree BST algorithm - c++

I want to declare a class for an alphabetic BST whose nodes are keyed by name (strings or char arrays). What is the best algorithm for the insertion method in order to get the best search time and an ideal-case BST?
Note that the names are not all the same length and may start with the same words; they will not be sorted before entering the BST.

Insertion is fast in balanced binary search trees, so either implement a Red-Black tree or an AVL tree. You may also go for B-Trees if you wish to.
Next, you need to look at what to store in a node of the BST. In your case, store the string or char array. To compare two keys, you already have functions defined for both strings and char arrays, i.e. std::string::compare and strcmp, respectively.
These two things are all you need to do what you asked, a balanced BST and a datatype for nodes which is comparable.
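If you don't need to write the balancing yourself, a minimal sketch of the same idea can lean on std::set, which keeps its keys ordered and is typically implemented as a red-black tree. The NameIndex class and its member names are made up for illustration.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical wrapper: names are kept in a std::set, which orders
// std::string keys lexicographically and gives O(log n) insert and lookup.
class NameIndex {
public:
    // Returns true if the name was newly inserted, false if it was already there.
    bool insert(const std::string& name) { return names_.insert(name).second; }

    // O(log n) membership test.
    bool contains(const std::string& name) const { return names_.count(name) != 0; }

    // All names in ascending lexicographic order.
    std::vector<std::string> sorted() const {
        return std::vector<std::string>(names_.begin(), names_.end());
    }

private:
    std::set<std::string> names_;
};
```

Names of different lengths or with shared prefixes need no special handling here: "Ann" simply sorts before "Anna".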

Related

Is there any data structure in C++ STL for performing insertion, searching and retrieval of kth element in log(n)?

I need a data structure in c++ STL for performing insertion, searching and retrieval of kth element in log(n)
(Note: k is a variable and not a constant)
I have a class like
class myClass {
    int id;
    // other variables
};
and my comparator is just based on this id and no two elements will have the same id.
Is there a way to do this using the STL, or do I have to write the log(n) functions manually to keep the array in sorted order at all times?
Afaik, there is no such data structure in the STL. Of course, std::set is close to this, but not quite. It is typically a red-black tree. If each node of this red-black tree were annotated with the tree weight (the number of nodes in the subtree rooted at this node), then a retrieve(k) query would be possible. As there is no such weight annotation (it takes valuable memory and makes insert/delete more complex, since the weights have to be updated), std::set cannot answer such a query efficiently.
If you want to build such a data structure, use a conventional search tree implementation (red-black, AVL, B-tree, ...) and add a weight field to each node that counts the number of entries in its subtree. Then searching for the k-th entry is quite simple:
Sketch:
Check the weight of the child nodes, and find the child c which has the largest weight (accumulated from left) that is not greater than k
Subtract from k all weights of children that are left of c.
Descend down to c and call this procedure recursively.
In the case of a binary search tree, the algorithm is quite simple since each node has only two children. For a B-tree (which is likely more efficient), you have to account for as many children as the node contains.
Of course, you must update the weights on insert/delete: go up the tree from the insert/delete position and increment/decrement the weight of each node up to the root. Also, you must recompute the weights of the affected nodes when you do rotations (or splits/merges in the B-tree case).
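A minimal sketch of the weight idea for the binary case. To keep it short it uses a plain, unbalanced BST, so the log(n) bound only holds if you add balancing (and fix up the weights in the rotations). OSNode, os_insert and os_kth are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical node type: `size` is the "weight" described above, i.e. the
// number of nodes in the subtree rooted here, including the node itself.
struct OSNode {
    int key;
    std::size_t size = 1;
    std::unique_ptr<OSNode> left, right;
    explicit OSNode(int k) : key(k) {}
};

inline std::size_t weight(const std::unique_ptr<OSNode>& n) {
    return n ? n->size : 0;
}

// Unbalanced BST insert that rejects duplicate keys and bumps the weight
// of every node on the search path. Returns true if a node was added.
inline bool os_insert(std::unique_ptr<OSNode>& root, int key) {
    if (!root) {
        root = std::make_unique<OSNode>(key);
        return true;
    }
    if (key == root->key) return false;
    bool added = os_insert(key < root->key ? root->left : root->right, key);
    if (added) ++root->size;
    return added;
}

// Returns the k-th smallest key (0-based). Precondition: k < weight(root).
inline int os_kth(const std::unique_ptr<OSNode>& root, std::size_t k) {
    assert(root && k < root->size);
    const OSNode* n = root.get();
    for (;;) {
        std::size_t left = weight(n->left);
        if (k < left) {
            n = n->left.get();          // answer lies in the left subtree
        } else if (k == left) {
            return n->key;              // exactly `left` keys are smaller
        } else {
            k -= left + 1;              // skip the left subtree and this node
            n = n->right.get();
        }
    }
}
```

Each step of os_kth looks only at the weight of the left child, so the cost is proportional to the height of the tree.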
Another idea would be a skip-list where the skips are annotated with the number of elements they skip. But this implementation is not trivial, since you have to update the skip length of each skip above an element that is inserted or deleted, so adjusting a binary search tree is less hassle IMHO.
Edit: I found a C implementation of a 2-3-4 tree (B-tree), check out the links at the bottom of this page: http://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtree.html
You cannot achieve what you want with a simple array or any of the other built-in containers. You can use a more advanced data structure, for instance a skip list or a modified red-black tree (the backing data structure of std::set).
You can get the k-th element of an arbitrary array in linear time and if the array is sorted you can do that in constant time, but still the insert will require shifting all the subsequent elements which is linear in the worst case.
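To make that trade-off concrete, here is a small sketch of the sorted-array variant using std::vector and std::lower_bound. It keys on a bare int id for brevity; sorted_insert and kth_sorted are made-up helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Inserts `id` into an already-sorted vector and keeps it sorted.
// Finding the position is O(log n), but vector::insert shifts the tail,
// so the whole operation is O(n) in the worst case.
// Returns false if the id is already present.
inline bool sorted_insert(std::vector<int>& v, int id) {
    auto pos = std::lower_bound(v.begin(), v.end(), id);
    if (pos != v.end() && *pos == id) return false;
    v.insert(pos, id);
    return true;
}

// k-th smallest element (0-based) of a sorted vector, O(1).
inline int kth_sorted(const std::vector<int>& v, std::size_t k) {
    assert(k < v.size());
    return v[k];
}
```

Searching is a binary search and retrieval is a plain index, so only insertion misses the log(n) target.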
As for std::set, you would need additional data stored at each node to get the k-th element efficiently, and unfortunately you cannot modify its node structure.

Is a Trie a K-ary tree?

If you look at the node definitions for a simple Trie and a simple K-ary tree, they look the same.
(using C++ notation)
template <size_t K>
struct trieNode
{
    trieNode *children[K];
};
template <size_t K>
struct KaryNode
{
    KaryNode *children[K];
};
At its simplest a K-ary tree has multiple children per node (2 for a binary tree)
And a Trie has "multiple children per node"
It seems that a K-ary tree makes its choice of child based on comparison (< or >) of keys,
while a trie makes its choice of child based on (unary) equality of sub-spans of the key.
Since neither data structure has made it into any standard, what would be the best definition of each, and how would they be differentiated?
From the point of view of the shape of the data structure, a trie is clearly an N-ary tree, in the same way that a balanced binary search tree is a binary tree, the difference being in how the data structure manages the data.
A binary search tree is a binary tree with the additional constraint that the keys in the nodes are ordered; a balanced binary tree adds on top of that a constraint on the difference between the lengths of different branches.
Similarly, a trie is an N-ary tree with additional constraints that determine how the keys are managed.
Let's try a definition of what a trie is:
A trie is an efficient data structure used to implement a dictionary in which keys are sequences, ordered lexicographically. The implementation uses an N-ary tree where the branching factor is the range of valid values for each element in the key sequence [1] and each node may or may not hold a value, but always holds a subsequence of the key being stored [2]. For each node in the tree, the concatenation of the subsequences of keys stored in the nodes from the root to that node represents the key for the value stored, if the node holds a value, and/or a common prefix for all nodes in its subtree.
This layout of data allows lookups in time linear in the length of the key, and sharing prefixes allows compact representations for many natural languages (like Spanish, where the different forms of each verb differ only in the last few characters).
1: That keys are sequences is an important premise, as the main advantage of tries is that they split the key into different nodes along the path.
2: Depending on the implementation, each node might hold a single element (character) of the sequence or a run of several.
A binary tree refers to the shape of the tree without saying anything about how the tree will be used. A binary search tree is a binary tree that is being used in a particular way.
Similarly, a k-ary tree = n-ary tree = multi-way tree refers to the shape of the tree. A trie is a multiway tree that is being used in a particular way.
(But, be careful, just like there are many variations on binary search trees, there are many different variations on tries.)
So, what makes a trie a trie?
A trie is usually used to represent a collection of sequences, such as strings. A particular key is stored, not in a single node like in a binary search tree, but rather split up across many levels of the tree. Here's a picture of a trie containing the strings "can", "car", "cat", and "do".
        .
       / \
     c/   \d
     /     \
    .       .
    |       |
   a|       |o
    |       |
    .       .
   /|\
 n/r| \t
 /  |  \
.   .   .
As you can see, it may be easier to think of the characters as being associated with the edges instead of the nodes, but any particular implementation might represent it either way.
The many varieties of tries differ in things like how they handle cases where one key is a prefix of another (eg, "cat" and "catastrophe"), and how/whether to compress long common substrings.
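Here is one possible way to code that picture, as a sketch rather than a canonical trie. The characters label the edges (the map keys), and an is_key flag (my own name) is one way to handle the case where one key is a prefix of another. There is no compression of long common substrings.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical trie node: each map entry is an edge labelled with one
// character; `is_key` marks nodes where a stored string ends, so "cat"
// can be stored alongside "catastrophe".
struct TrieNode {
    bool is_key = false;
    std::map<char, std::unique_ptr<TrieNode>> children;
};

// Walks down from the root one character at a time, creating missing nodes.
inline void trie_insert(TrieNode& root, const std::string& key) {
    TrieNode* node = &root;
    for (char c : key) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->is_key = true;
}

// True only if `key` itself was inserted, not merely a longer string
// that starts with it.
inline bool trie_contains(const TrieNode& root, const std::string& key) {
    const TrieNode* node = &root;
    for (char c : key) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return false;
        node = it->second.get();
    }
    return node->is_key;
}
```

After inserting "can", "car", "cat" and "do", the root has two outgoing edges (c and d) and the node reached via c, a has three (n, r, t), matching the drawing. Using a std::map for the children instead of a fixed array of K pointers is a design choice that trades lookup speed for memory.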
K-ary tree: each node has at most K children.
Trie: the number of children of each node is not limited (theoretically). In practice, of course, there is always a limit. For example, in a trie of Asian-language words, the number of children of each node is bounded by the size of the character set, which is probably, say, 5000 or 10000.
Thanks to user534498's comment about Knuth's "TAOCP volume 3, chapter 6.2 & 6.3"
Knuth claims - Ch 6.3
A trie is essentially an M-ary tree, whose nodes are M-place vectors
with components corresponding to digits or characters. Each node on
level l represents the set of all keys that begin with a certain
sequence of l characters; the node specifies an M-way branch,
depending on the (l +1)st character.
K-ary, M-ary and N-ary being synonyms, it seems the answer is yes.

Range tree construction

Let us consider the following picture
This is a so-called range tree. I don't understand one thing: it looks similar to a binary search tree, so if we are inserting elements, we can use the same procedure as for binary search tree insertion. So what is the difference?
I have read a tutorial and guess that it is a variation of kd-trees or query search trees (for geometric point searching and so on), but how do I construct it? Like a binary search tree, or does it need additional parameters? Maybe like this
struct range
{
    int lowerbound;
    int upperbound;
    int element;
};
and during insertion we have to check
if(element>lowerbound && element <upperbound)
then insert node
Please help me to understand correctly how to construct a range tree.
In binary search tree when you insert a value you insert a new node.
The range search tree is more similar to a binary indexed tree. These two data structures have a fixed structure. When you add a point to, or subtract a point from, a given range, you update the values in the nodes but do not introduce new nodes.
The construction of this structure is very similar to that of a KD-tree: based on the given points, you choose the most appropriate splitting points.
When you learn a new data structure, always consider the supported operations; this will help you understand the structure faster. In your case it would have helped you distinguish between a binary search tree and a range tree.
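A rough one-dimensional sketch of that kind of construction, under my own simplifying assumptions: the tree is built once from all the points, the split at each level is the median, and every node stores a point (textbook range trees usually keep the points in the leaves and, for 2D and up, attach a secondary structure to each node). RangeNode, build_range_tree and range_query are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical node: `split` is the median of the points in this subtree.
struct RangeNode {
    int split = 0;
    std::unique_ptr<RangeNode> left, right;
};

// Builds a balanced tree over the sorted points pts[lo, hi).
inline std::unique_ptr<RangeNode> build_subtree(const std::vector<int>& pts,
                                                std::size_t lo, std::size_t hi) {
    if (lo >= hi) return nullptr;
    std::size_t mid = lo + (hi - lo) / 2;       // median as splitting point
    auto node = std::make_unique<RangeNode>();
    node->split = pts[mid];
    node->left = build_subtree(pts, lo, mid);
    node->right = build_subtree(pts, mid + 1, hi);
    return node;
}

// Sorts a copy of the points, then builds the whole tree in one go.
inline std::unique_ptr<RangeNode> build_range_tree(std::vector<int> pts) {
    std::sort(pts.begin(), pts.end());
    return build_subtree(pts, 0, pts.size());
}

// Appends every stored point p with lo <= p <= hi to `out`, in ascending
// order, skipping subtrees that cannot intersect the query range.
inline void range_query(const RangeNode* n, int lo, int hi, std::vector<int>& out) {
    if (!n) return;
    if (lo <= n->split) range_query(n->left.get(), lo, hi, out);
    if (lo <= n->split && n->split <= hi) out.push_back(n->split);
    if (n->split <= hi) range_query(n->right.get(), lo, hi, out);
}
```

The supported operation here is "report all points in [lo, hi]", not "insert one key", which is the practical difference from the BST insertion procedure in the question.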

Inserting strings into an AVL tree in C++?

I understand how an AVL tree works with integers, but I'm having a hard time figuring out a way to insert strings into one instead. How would the strings be compared?
I've thought of just using the ASCII total value and sorting that way, but in that situation, inserting two words with the same ASCII total (such as "tied" and "diet") would seem to return an error.
How do you get around this? Am I thinking about it in the wrong way, and do I need a different way to sort the nodes?
And no, they don't need to be alphabetical or anything, just in an AVL tree so I can search for them quickly.
When working with strings, you normally use a lexicographical comparison -- i.e., you start with the first character of each string. If one is less than the other (e.g., with "diet" vs. "tied", "d" is less than "t"), the comparison is decided by that letter. If and only if the first letters are equal, you go to the second letter, and so on. If one string runs out first, the shorter one is the smaller. The two are equal only if every character (in order) from beginning to end of the strings is equal.
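A small sketch contrasting the two orderings. ascii_total is the summing idea from the question and comes_before just wraps std::string's operator<; both names are made up here.

```cpp
#include <cassert>
#include <numeric>
#include <string>

// Sum of the character codes: the ordering idea from the question.
// Anagrams such as "tied" and "diet" get the same total.
inline int ascii_total(const std::string& s) {
    return std::accumulate(s.begin(), s.end(), 0,
        [](int sum, char c) { return sum + static_cast<unsigned char>(c); });
}

// Lexicographical order, as provided by std::string's operator<.
// Two strings compare equal only if they are character-for-character identical.
inline bool comes_before(const std::string& a, const std::string& b) {
    return a < b;
}
```

With this, "tied" and "diet" have the same total but comes_before("diet", "tied") holds, so an AVL tree keyed on the string itself can tell them apart.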
Well, since an AVL tree is an ordered structure, the int string::compare(const string&) const routine should be able to give you an indication of how to order the strings.
If order of the items is actually irrelevant, you'll get better performance out of an unordered structure that can take better advantage of what you're trying to do: a hash table.
The mapping of something like a string to a fixed-size key is called a hash function, and the phenomenon where multiple keys are mapped to the same value is called a collision. Collisions are expected to happen occasionally when hashing, and a basic data structure would need to be extended to handle them, perhaps by making each node a "bucket" (linked list, vector, array, what have you) of all the items that have colliding hash values, which is then searched linearly.
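As a toy illustration of the bucket idea, the sketch below deliberately uses the ASCII total as its hash function so that "tied" and "diet" collide. ChainedSet is a hypothetical class; in real code you would more likely reach for std::unordered_set, the standard library's hash-based container.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical hash set with separate chaining: each bucket is a vector
// that is searched linearly.
class ChainedSet {
public:
    explicit ChainedSet(std::size_t bucket_count) : buckets_(bucket_count) {
        assert(bucket_count > 0);
    }

    // Toy hash function: ASCII total modulo the number of buckets.
    std::size_t index(const std::string& s) const {
        std::size_t sum = 0;
        for (unsigned char c : s) sum += c;
        return sum % buckets_.size();
    }

    // Returns false if the string is already stored.
    bool insert(const std::string& s) {
        std::vector<std::string>& b = buckets_[index(s)];
        if (std::find(b.begin(), b.end(), s) != b.end()) return false;
        b.push_back(s);
        return true;
    }

    bool contains(const std::string& s) const {
        const std::vector<std::string>& b = buckets_[index(s)];
        return std::find(b.begin(), b.end(), s) != b.end();
    }

private:
    std::vector<std::vector<std::string>> buckets_;
};
```

The collision is harmless: both words land in the same bucket and the final string comparison inside the bucket keeps them distinct.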

Binary Search Tree - to Make a dictionary

I wanted to make a dictionary using a BST, but I did not have any idea how to store the words in the tree.
struct node
{
    char word[50];
    char meaning[256];
    struct node *left, *right;
};
I started like that, but I don't know which words to put on the left and which on the right...
Instead of a binary tree, you should use something like a trie (prefix tree). BSTs are really more for "greater/less-than" relationships, which would be hard to map with words. With tries, your nodes are characters and branches eventually lead to leaves representing an actual word.
Which words to put left and which to put right would still follow the basic rules of a BST: All nodes to the left of a given root are guaranteed to be less than that root's value, and all nodes to the right of a given root are guaranteed to be greater than or equal to that root's value.
Apply that same principle to your dictionary. I don't know if you're using C or C++, but if you're using C++, I would recommend making a "Word" struct and overloading its comparison operators. Then in your "node" struct, just have a Word, a left Node, and a right Node.
A BST is not the best choice of a data structure for a dictionary though. I would look into different types of maps and hashing.
Typically, the words that are lexicographically smaller than the word in the current node go left, and the rest go right. Use < to do the comparison on C++'s std::string, and strcmp for C-style strings (NUL-terminated char arrays).
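Put together, a minimal C++ sketch of that rule might look like this. It swaps the char arrays from the question for std::string, does no balancing, and overwrites the meaning if a word is inserted twice (that last part is my own choice). DictNode, dict_insert and dict_find are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical dictionary node, keyed by `word`.
struct DictNode {
    std::string word;
    std::string meaning;
    std::unique_ptr<DictNode> left, right;
};

// Smaller words go left, larger words go right; an existing word has
// its meaning replaced.
inline void dict_insert(std::unique_ptr<DictNode>& root,
                        const std::string& word, const std::string& meaning) {
    if (!root) {
        root = std::make_unique<DictNode>();
        root->word = word;
        root->meaning = meaning;
    } else if (word < root->word) {
        dict_insert(root->left, word, meaning);
    } else if (root->word < word) {
        dict_insert(root->right, word, meaning);
    } else {
        root->meaning = meaning;
    }
}

// Returns a pointer to the stored meaning, or nullptr if the word is absent.
inline const std::string* dict_find(const std::unique_ptr<DictNode>& root,
                                    const std::string& word) {
    const DictNode* n = root.get();
    while (n) {
        if (word < n->word) n = n->left.get();
        else if (n->word < word) n = n->right.get();
        else return &n->meaning;
    }
    return nullptr;
}
```

Inserting "mango", then "apple", then "pear" puts "apple" in the left child of the root and "pear" in the right. If the words arrive already sorted, this unbalanced version degenerates into a linked list, which is where AVL or red-black balancing comes in.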