Inserting strings into an AVL tree in C++?

I understand how an AVL tree works with integers, but I'm having a hard time figuring out a way to insert strings into one instead. How would the strings be compared?
I've thought of just using the ASCII total value and sorting that way, but in that situation, inserting two words with identical ASCII totals (such as "tied" and "diet") would seem to return an error.
How do you get around this? Am I thinking about it in the wrong way, and need a different way to sort the nodes?
And no, they don't need to be alphabetical or anything; they just need to be in an AVL tree so I can search for them quickly.

When working with strings, you normally use a lexical comparison -- i.e., you start with the first character of each string. If one is less than the other (e.g., with "diet" vs. "tied", "d" is less than "t") the comparison is based on that letter. If and only if the first letters are equal, you go to the second letter, and so on. The two are equal only if they have the same length and every character (in order) from beginning to end of the strings is equal.
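A minimal sketch of that character-by-character comparison, written out by hand. The function name lexicalCompare is made up for illustration; in real code std::string::compare or operator< already does this for you.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns -1, 0 or 1 when `a` sorts before, equal to,
// or after `b`, comparing one character at a time from the front.
int lexicalCompare(const std::string& a, const std::string& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        // Compare as unsigned char so characters above 127 order consistently.
        const unsigned char ca = static_cast<unsigned char>(a[i]);
        const unsigned char cb = static_cast<unsigned char>(b[i]);
        if (ca != cb) return ca < cb ? -1 : 1;  // first difference decides
    }
    // All shared characters match: the shorter string sorts first.
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}
```

With this, "diet" sorts before "tied" because 'd' is less than 't', and the two never compare equal, so both can live in the same ordered tree.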

Well, since an AVL tree is an ordered structure, the int string::compare(const string&) const routine should be able to give you an indication of how to order the strings.
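Here is a rough sketch of what an AVL insert keyed on std::string could look like, using std::string::compare to pick the branch. AvlNode, avlInsert and avlContains are illustrative names, not an existing library; it assumes C++14 (for std::make_unique) and silently ignores duplicates.

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <utility>

struct AvlNode {
    std::string key;
    int height = 1;
    std::unique_ptr<AvlNode> left, right;
    explicit AvlNode(std::string k) : key(std::move(k)) {}
};
using AvlPtr = std::unique_ptr<AvlNode>;

int avlHeight(const AvlPtr& n) { return n ? n->height : 0; }
void avlUpdate(AvlNode& n) { n.height = 1 + std::max(avlHeight(n.left), avlHeight(n.right)); }

AvlPtr rotateRight(AvlPtr y) {
    AvlPtr x = std::move(y->left);
    y->left = std::move(x->right);
    avlUpdate(*y);
    x->right = std::move(y);
    avlUpdate(*x);
    return x;
}

AvlPtr rotateLeft(AvlPtr x) {
    AvlPtr y = std::move(x->right);
    x->right = std::move(y->left);
    avlUpdate(*x);
    y->left = std::move(x);
    avlUpdate(*y);
    return y;
}

// Inserts `key` and returns the (possibly new) root of the subtree.
AvlPtr avlInsert(AvlPtr n, const std::string& key) {
    if (!n) return std::make_unique<AvlNode>(key);
    const int c = key.compare(n->key);  // <0, 0 or >0, like strcmp
    if (c < 0) n->left = avlInsert(std::move(n->left), key);
    else if (c > 0) n->right = avlInsert(std::move(n->right), key);
    else return n;  // key already present
    avlUpdate(*n);
    const int balance = avlHeight(n->left) - avlHeight(n->right);
    if (balance > 1) {  // left-heavy
        if (avlHeight(n->left->left) < avlHeight(n->left->right))
            n->left = rotateLeft(std::move(n->left));  // left-right case
        return rotateRight(std::move(n));
    }
    if (balance < -1) {  // right-heavy
        if (avlHeight(n->right->right) < avlHeight(n->right->left))
            n->right = rotateRight(std::move(n->right));  // right-left case
        return rotateLeft(std::move(n));
    }
    return n;
}

bool avlContains(const AvlNode* n, const std::string& key) {
    while (n) {
        const int c = key.compare(n->key);
        if (c == 0) return true;
        n = c < 0 ? n->left.get() : n->right.get();
    }
    return false;
}
```

The only difference from an integer AVL tree is the comparison; the rotations and height bookkeeping are unchanged.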
If order of the items is actually irrelevant, you'll get better performance out of an unordered structure that can take better advantage of what you're trying to do: a hash table.
The mapping of something like a string to a fixed-size key is called a hash function, and the phenomenon where multiple keys are mapped to the same value is called a collision. Collisions are expected to happen occasionally when hashing, and a basic data structure needs to be extended to handle them, perhaps by making each node a "bucket" (linked list, vector, array, what have you) of all the items that have colliding hash values, which is then searched linearly.
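To make the bucket idea concrete, here is a small sketch that deliberately uses the weak "ASCII total" hash from the question, so that "tied" and "diet" collide and still both get stored. ChainedStringSet and asciiSum are hypothetical names, and the bucket count of 101 is an arbitrary assumption (it must be greater than zero).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// The hash from the question: sum of the character codes.
std::size_t asciiSum(const std::string& s) {
    std::size_t sum = 0;
    for (unsigned char c : s) sum += c;
    return sum;
}

class ChainedStringSet {
public:
    // Assumption: bucketCount > 0.
    explicit ChainedStringSet(std::size_t bucketCount = 101) : buckets_(bucketCount) {}

    // Returns false if the string was already stored.
    bool insert(const std::string& s) {
        std::vector<std::string>& bucket = bucketFor(s);
        if (std::find(bucket.begin(), bucket.end(), s) != bucket.end()) return false;
        bucket.push_back(s);  // colliding strings simply share the bucket
        return true;
    }

    bool contains(const std::string& s) const {
        const std::vector<std::string>& bucket = buckets_[asciiSum(s) % buckets_.size()];
        // Linear search inside the bucket, comparing the full strings.
        return std::find(bucket.begin(), bucket.end(), s) != bucket.end();
    }

private:
    std::vector<std::string>& bucketFor(const std::string& s) {
        return buckets_[asciiSum(s) % buckets_.size()];
    }
    std::vector<std::vector<std::string>> buckets_;
};
```

In practice std::unordered_set<std::string> does the same job with a much better hash function.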

Related

how does a language *know* when a list is sorted?

Forgive me if this question is dumb, but it occurred to me I don't know how a language knows a list is sorted or not.
Say I have a list:
["Apple","Apricot","Blueberry","Cardamom","Cumin"]
and I want to insert "Cinnamon".
AFAIK The language I'm using doesn't know the list is sorted; it's just a list. And it doesn't have a "wide screen" field of view like we do, so it doesn't know where the A-chunk ends and the C-chunk begins from outside the list. So it goes through and compares the first letter of each array string to the first letter of the insert string. If the insert char is greater, it moves to the next string. If the chars match, it moves to the next letter. If it moves on to the next string and the array's char is greater than the insert's char, then the char is inserted there.
My question is, can a language KNOW when a list is sorted?
If the process for combing through an unsorted and a sorted list is the same, and the list is still iterated through, then how does sorting save time?
EDIT:
I understand that "sorting allows algorithms that rely on sorting to work"; I apologize for not making that clear. I guess I'm asking if there's anything intrinsic about sorting inside computer languages, or if it's a strategy that people built on top of it. I think it's the latter and you guys have confirmed it. A language doesn't know if it's sorting something or not, but we recognize the performance difference.
Here's the key. The language doesn't / can't / shouldn't know whether your data structure is sorted or unsorted. In fact it doesn't even care what data structure it really is.
Now consider this: What does insertion or deletion really mean? What exact steps need to be taken to insert a new item or delete an existing one? It turns out that the exact meaning of these operations depends on the data structure that you're using. An array will insert a new element very differently than a linked list.
So it stands to reason that these operations must be defined in the context of the data structure on which these are being applied. The language in general does not supply any keywords to deal with these data structures. Rather the accompanying libraries provide built-in implementations of these structures that contain methods to perform these operations.
Now to the original question: How does the language "know" if a list is sorted or not and why is it more efficient to deal with sorted lists? The answer is, as evident from what I said above, the language doesn't and shouldn't know about the internals of a list. It is the implementation of the list structure that you're using that knows if it is sorted or not, and how to insert a new node in an ordered manner. For example, a certain data structure may use an index (much like the index of a book) to locate the position of the words starting with a certain letter, thus reducing the amount of time that an unsorted list would require to traverse through the entire list, one element at a time.
Hope this makes it a bit clearer.
Languages don't know such things.
Some programming languages come with a standard library containing data structures, and if they don't, you generally can link to libraries that do.
Some of those data structures may be collection types that maintain order.
So given you have a data structure that represents an ordered collection type, then it is that data structure that maintains the order of the collection, whenever an item is added or removed.
If you're using the most basic collection type, an array, then neither the language nor the runtime nor the data structure itself (the array) care in the slightest at what point you insert an item.
can a language KNOW when a list is sorted
Do you mean a language interpreter? Of course it can check whether a list is sorted, simply by checking that each element is not "smaller" than the previous one. I doubt that interpreters do this; why should they care if the list is sorted or not?
In general, if you want to insert "Cinnamon" into your list, you need to either specify where to insert it, or just append it at the end. It doesn't matter to the interpreter if the list is sorted beforehand or not. It's how you use the list that determines whether a sorted list will remain sorted, and whether or not it needs to be sorted to begin with. (For example, if you try to find something in the list using a binary search, then the list must be sorted. But you must arrange for this to be the case.)
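As a rough illustration in C++ (an assumption, since the question never names its language): the library will happily binary-search for the insertion point, but it takes your word for it that the list is sorted. insertSorted is a made-up helper name.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Precondition, checked by nobody but the caller: `sorted` is in ascending order.
// std::lower_bound finds the first element not less than `value` using a
// binary search, so only O(log n) string comparisons are needed.
void insertSorted(std::vector<std::string>& sorted, const std::string& value) {
    auto pos = std::lower_bound(sorted.begin(), sorted.end(), value);
    sorted.insert(pos, value);  // keeps the vector sorted if it was sorted before
}
```

If you are unsure whether the precondition holds, std::is_sorted(v.begin(), v.end()) checks it explicitly, at the cost of one linear pass. On an unsorted vector, insertSorted still compiles and runs; it just puts the element somewhere meaningless.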
AFAIK The language I'm using ...
(which is?)
... doesn't know the list is sorted; it's just a list. And it doesn't have a "wide screen" field of view like we do, so it doesn't know where the A-chunk ends and the C-chunk begins from outside the list. So it goes through and compares the first letter of each array string to the first letter of the insert string. If the insert char is greater, it moves to the next string. If the chars match, it moves to the next letter. If it moves on to the next string and the array's char is greater than the insert's char, then the char is inserted there.
What you're saying, I think, is that it looks for the first element that is "bigger than" the one being inserted, and inserts the new element just before it. That implies that it maintains the "sorted" property of the list, if it is already sorted. This is horribly inefficient for the case of unsorted lists. Also, the technique you describe for finding the insertion point (linear search) would be inefficient, if that is truly what is happening. I suspect that your understanding of the list/language semantics is not correct.
It would help a lot if you gave a concrete example in a specific language.

Unique Property of Strings to build an efficient Hash Table

What is the unique property of strings in C++? Why can they be compared by relational operators (e.g. when trying to sort an array of strings alphabetically)? I am trying to capitalize on this "property" in order to build a fine hashing function for a table with no collisions for every possible string. Also, what data structure would work for this? I'm thinking a vector because I will have to go through a document without knowing how many unique words are in it, and I want to go through the document just once.
C++ standard strings are essentially vectors of characters. Comparing strings thus means to compare them character by character from the beginning.
I'm not sure what you mean by 'unique property', but for your use case any hashing algorithm should do.
If I understand your use case correctly, you might want to use a std::set< YourHashType > or std::map. That way you wouldn't have to take care of finding out whether a word was already added or not.
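The simplest algorithm that calculates the hash key for a null-terminated C-style string is the following:
// Assumption: UINT is an unsigned 32-bit integer, as in the Windows headers.
typedef unsigned int UINT;

// Written as a free function here; the trailing const is only valid on a member function.
UINT HashKey(const char* key)
{
    UINT nHash = 0;
    while (*key)
        // nHash * 33 + next character (cast avoids sign extension of negative chars)
        nHash = (nHash << 5) + nHash + static_cast<unsigned char>(*key++);
    return nHash;
}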
I am trying to capitalize on this "property" in order to build a fine hashing function for a table with no collisions for every possible string.
By the pigeonhole principle, you can't have a collision-free hash function for every possible string: there are more possible strings than there are hash values. Strings sort uniquely when you compare them lexically (i.e., letter by letter) using a function like std::strcmp, but that only gives you a unique ordering under comparison, not an intrinsic unique property of a string.
If you have a finite set of keys, you can design a collision free hash function though, which is referred to as perfect hashing.
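A tiny demonstration of such a collision, assuming a multiply-by-33 string hash (hash = hash * 33 + character). The function name hashTimes33 is made up here.

```cpp
#include <string>

// Multiply-by-33 hash: (h << 5) + h is h * 33.
unsigned int hashTimes33(const std::string& key) {
    unsigned int h = 0;
    for (unsigned char c : key) h = (h << 5) + h + c;
    return h;
}

// "Az" -> 33 * 65 + 122 = 2267
// "BY" -> 33 * 66 +  89 = 2267
// Two different strings, one hash value: a table must still compare the
// strings themselves after the hashes match.
```

Any fixed-width hash has pairs like this; a perfect hash only avoids them for a key set that is known in advance.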

Not sure which data structure to use

Assuming I have the following text:
today was a good day and today was a sunny day.
I break up this text into words, separated by white space, one per line, which gives
Today
was
a
good
etc.
Now I use the vector data structure to simply count the number of words in a text via .size(). That's done.
However, I also want to check if a word comes up more than once, and if so, how many times. In my example "today" comes up 2 times.
I want to store that "today" and append a 2/x (depending on how often it comes up in a large text). Now that's not just for "today" but for every word in the text. I want to look up how often a word appears, append a counter, and sort it (the word + counters) in descending order (that's another thing, but not important right now).
I'm not sure which data structure to use here. Map perhaps? But I can't add counters to map.
Edit: This is what I've done so far: http://pastebin.com/JncR4kw9
You should use a map. In fact, you should use an unordered_map.
unordered_map<string,int> will give you a hash table which will use strings as keys, and you can augment the integer to keep count.
unordered_map has the advantage of average O(1) lookup and insertion over the O(log n) lookup and insertion of a map. This is because the former is a hash table (an array of buckets), whereas the latter uses some implementation of balanced trees (red-black, I think).
The only disadvantage of an unordered_map is that as mentioned in its name, you can't iterate over all the elements in lexical order. This should be clear from the explanation of their structure above. However, you don't seem to need such a traversal, and hence it shouldn't be an issue.
unordered_map<string,int> mymap;
mymap[word]++; // increments the counter for this word (a missing word is first inserted with a count of 0)
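Putting that together with the "sort in descending order" part of the question, a minimal sketch might look like this. The function name wordFrequencies is invented for illustration, words are split on whitespace only, and punctuation and letter case are left untouched (so "Today" and "today" would count separately).

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using WordCount = std::pair<std::string, int>;

std::vector<WordCount> wordFrequencies(const std::string& text) {
    // One pass over the text: count each whitespace-separated word.
    std::unordered_map<std::string, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) ++counts[word];

    // Copy into a vector, because a hash table has no useful order of its own.
    std::vector<WordCount> result(counts.begin(), counts.end());
    std::sort(result.begin(), result.end(),
              [](const WordCount& a, const WordCount& b) {
                  if (a.second != b.second) return a.second > b.second;  // most frequent first
                  return a.first < b.first;  // ties broken alphabetically
              });
    return result;
}
```

Counting is a single pass over the document, and the sort runs over the unique words only, not over every word in the text.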
Why not use two data structures? The vector you have now, and a map, using the string as the key, and an integer as data, which then will be the number of times the word was found in the text.
Sort the vector in alphabetical order.
Scan it and compare every word to those that follow, until you find a different one, and so on.
a, a, and, day, day, good, sunny, today, today, was, was
2     1    2         1     1      2             2
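The sort-then-scan approach can be sketched like this. countRuns is a hypothetical name; it takes the vector by value so the caller's original word order is left alone.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Sorts the words, then counts each run of identical neighbours.
std::vector<std::pair<std::string, int>> countRuns(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());
    std::vector<std::pair<std::string, int>> runs;
    for (const std::string& w : words) {
        if (!runs.empty() && runs.back().first == w)
            ++runs.back().second;   // same word as the previous one: extend the run
        else
            runs.emplace_back(w, 1);  // a new word starts a new run
    }
    return runs;
}
```

The sort costs O(n log n) comparisons, but no extra container is needed beyond the result.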
A better option to consider is a radix tree: https://en.wikipedia.org/wiki/Radix_tree
It is quite memory efficient and, for large text inputs, it can perform better than the alternative data structures.
One can store the frequency of each word in the nodes of the tree. It will also reap the benefits of locality of reference (for any text document).

Hash function for String Data

I'm working on a hash table in C++ and I need a hash function for string data. One hash function that I have tried is to add up the ASCII codes and take the result modulo 100 (%100).
My actual requirement is to find the words which exactly match or start with a given pattern.
Ex: Given pattern is "comp". Then I want to get all the words starting with comp (e.g. company, computer, comp, etc.). Can I do this using a hash? The hash function I tried can only find exact matches.
So can anyone suggest a hash function suitable for this requirement?
Prefix matching is better handled with a trie.
Basically this is a tree structure that holds one character of the key on each node. Concatenating the characters from the different nodes on the path from the root to a given node produces the key for that node.
Searching is a matter of descending the trie, comparing each character of the searched key with the child nodes. Once you have consumed all the characters, the remaining subtree holds all the keys that have the searched key as a prefix.
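A compact sketch of that structure, under a few assumptions: children are kept in a std::map so results come back in sorted order, and the names Trie, TrieNode and withPrefix are invented for this example.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    bool isWord = false;  // true if the path from the root spells a stored key
    std::map<char, std::unique_ptr<TrieNode>> children;
};

class Trie {
public:
    void insert(const std::string& word) {
        TrieNode* node = &root_;
        for (char c : word) {
            std::unique_ptr<TrieNode>& child = node->children[c];
            if (!child) child.reset(new TrieNode());
            node = child.get();
        }
        node->isWord = true;
    }

    // Returns every stored key that starts with `prefix`.
    std::vector<std::string> withPrefix(const std::string& prefix) const {
        const TrieNode* node = &root_;
        for (char c : prefix) {  // descend one node per prefix character
            auto it = node->children.find(c);
            if (it == node->children.end()) return {};
            node = it->second.get();
        }
        std::vector<std::string> out;
        std::string current = prefix;
        collect(*node, current, out);  // everything below here matches
        return out;
    }

private:
    static void collect(const TrieNode& node, std::string& current,
                        std::vector<std::string>& out) {
        if (node.isWord) out.push_back(current);
        for (const auto& entry : node.children) {
            current.push_back(entry.first);
            collect(*entry.second, current, out);
            current.pop_back();
        }
    }

    TrieNode root_;
};
```

Reaching the prefix node costs one step per character of the pattern, regardless of how many words are stored.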
Sounds like what you really need is lexicographical sorting. You can do that by using a sorted data structure, like a std::set or std::map, or by using a vector and the std::sort algorithm. Note that std::sort is typically faster than C's qsort.
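With a sorted container, all words sharing a prefix sit next to each other, so a prefix query is one lower_bound plus a short scan. A sketch, with prefixRange as a made-up helper name:

```cpp
#include <set>
#include <string>
#include <vector>

// Returns all words in the sorted set that start with `prefix`.
std::vector<std::string> prefixRange(const std::set<std::string>& words,
                                     const std::string& prefix) {
    std::vector<std::string> out;
    // lower_bound jumps to the first word that is not less than the prefix;
    // matching words, if any, follow it contiguously.
    for (auto it = words.lower_bound(prefix);
         it != words.end() && it->compare(0, prefix.size(), prefix) == 0;
         ++it) {
        out.push_back(*it);
    }
    return out;
}
```

The same idea works on a sorted std::vector with std::lower_bound.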

How to improve linked list searching. C++

I have a simple method in C++ which searches for a string in a linked list. It works well, but I need to make it faster. Is that possible? Maybe I need to insert items into the list in alphabetical order? But I don't think that would help when searching the list. The list holds about 300 000 items (words).
int GetItemPosition(const char* stringToFind)
{
    int i = 0;
    MyList* Tmp = FistListItem;
    while (Tmp) {
        if (!strcmp(Tmp->Value, stringToFind))
        {
            return i;
        }
        Tmp = Tmp->NextItem;
        i++;
    }
    return -1;
}
The method returns the position number if the item is found, otherwise it returns -1.
Any suggestion will be helpful.
Thanks for the answers. I can change the structure. I have only one constraint: the code must implement the following interface:
int Count(void);
int AddItem(const char* StringValue, int WordOccurrence);
int GetItemPosition(const char* StringValue);
char* GetString(int Index);
int GetOccurrenceNum(int Index);
void SetInteger(int Index, int WordOccurrence);
So which structure would, in your opinion, be the most suitable?
Searching a linked list is linear: you need to iterate from the beginning, one element at a time, so it is O(n). Linked lists are not the best choice if you mainly use them for searching; you can use more suitable data structures such as binary trees.
Ordering elements does not help much because still you need to iterate each element anyway.
Wikipedia article says:
In an unordered list, one simple heuristic for decreasing average search time is the move-to-front heuristic, which simply moves an element to the beginning of the list once it is found. This scheme, handy for creating simple caches, ensures that the most recently used items are also the quickest to find again.
Another common approach is to "index" a linked list using a more efficient external data structure. For example, one can build a red-black tree or hash table whose elements are references to the linked list nodes. Multiple such indexes can be built on a single list. The disadvantage is that these indexes may need to be updated each time a node is added or removed (or at least, before that index is used again).
So in the first case you can slightly improve your search performance (under statistical assumptions) by moving previously found items closer to the beginning of the list. This assumes that previously found elements will be searched for more frequently.
The second method requires using other data structures.
If using linked lists is not a hard requirement, consider using hash tables, sorted arrays (random access) or balanced trees.
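The move-to-front heuristic from the quote is only a few lines with std::list. This is a sketch with a made-up function name (findMoveToFront); note that it changes item positions, so it does not fit an interface that hands out stable indexes.

```cpp
#include <algorithm>
#include <list>
#include <string>

// Linear search; on a hit, relink the found node to the front of the list.
bool findMoveToFront(std::list<std::string>& items, const std::string& target) {
    auto it = std::find(items.begin(), items.end(), target);
    if (it == items.end()) return false;
    // splice only rewires pointers: no string is copied or reallocated.
    items.splice(items.begin(), items, it);
    return true;
}
```

The worst case is still O(n); the gain only shows up when a small set of words is looked up again and again.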
Consider using an array or std::vector as storage instead of a linked list, and use binary search to find a particular string, or even better, std::set, if you don't need a numerical index. If for some reason it is not possible to use other containers, there is not much you can do: you may want to speed up the comparisons by storing a hash of the string alongside it in each node.
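For the sorted-vector variant, a lookup that still returns a position (or -1, like the original GetItemPosition) could be sketched as follows. findPosition is a hypothetical name, and the vector is assumed to be sorted already.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Assumes `sortedWords` is sorted ascending. Returns the index of `target`,
// or -1 if it is not present. Uses O(log n) string comparisons.
int findPosition(const std::vector<std::string>& sortedWords, const std::string& target) {
    auto it = std::lower_bound(sortedWords.begin(), sortedWords.end(), target);
    if (it != sortedWords.end() && *it == target)
        return static_cast<int>(it - sortedWords.begin());
    return -1;
}
```

For 300 000 words that is roughly 18 or 19 comparisons per lookup instead of up to 300 000.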
I suggest hashing.
Since you've already got a linked list of your own, you can try chaining with linked lists for collision resolution.
Rather than using a linear linked list, you may want to use a binary search tree, or a red/black tree. These trees are designed to minimize the number of nodes visited to find an item.
You could also store "short cut links". For example, if the list is of strings, you could have an array of links of where to start searching based on the first letter.
For example, shortcut['B'] would return a pointer to the first link to start searching for strings starting with 'B'.
The answer is no, you cannot improve the search without changing your data-structure.
As it stands, sorting the list will not give you a faster search for any random item.
It will only allow you to quickly decide if the given item is in the list by testing against the first item (which will be either the smallest or the largest entry) and this improvement is not likely to make a big difference.
So can you please edit your question and explain to us your constraints?
Can you use a completely different data structure, like an array or a tree? (as others have suggested)
If not, can you modify the way your linked list is linked?
If not, we are unlikely to be able to help you...
The best option is to use a faster data structure for storing strings:
std::map - red-black tree behind the scenes. Has O(log n) search/insert/delete operations. Suitable if you want to store additional values with strings (for example, positions).
std::set - basically the same tree but without values. Best for case when you need only contains operation.
std::unordered_map - hash table. O(1) access on average.
std::unordered_set - hash set. Also O(1) access on average.
Note that there is a catch in all of these cases. The complexity is calculated only in terms of n (the number of strings). In reality string comparison is not free, so O(1) becomes O(m) and O(log n) becomes O(m log n), where m is the maximal string length. This does not matter for relatively short strings, but if your strings are long, consider using a trie. In practice a trie can be even faster than a hash table: each character of the query string is accessed only once instead of multiple times. For a hash table/set it's at least once for the hash calculation and at least once for the actual string comparison (depending on the collision resolution strategy; not sure how it is implemented in C++).
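One way to keep the positional interface from the question while getting hash-table lookups is to store the items in a vector and keep an unordered_map from string to index next to it. This is only a sketch: the class name IndexedWordList is invented, it uses std::string instead of the raw char pointers in the original interface, and it assumes that AddItem returns the item's position and leaves an existing entry untouched (the question does not say what it should return).

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class IndexedWordList {
public:
    int Count() const { return static_cast<int>(items_.size()); }

    // Assumption: returns the position of the item, adding it if missing.
    int AddItem(const std::string& value, int wordOccurrence) {
        auto found = index_.find(value);
        if (found != index_.end()) return found->second;
        const int position = Count();
        items_.push_back({value, wordOccurrence});
        index_.emplace(value, position);  // items are only appended, so positions stay valid
        return position;
    }

    // Average O(1) hash lookup instead of walking a list.
    int GetItemPosition(const std::string& value) const {
        auto found = index_.find(value);
        return found == index_.end() ? -1 : found->second;
    }

    // at() throws std::out_of_range for an invalid index.
    const std::string& GetString(int index) const { return items_.at(index).first; }
    int GetOccurrenceNum(int index) const { return items_.at(index).second; }
    void SetInteger(int index, int wordOccurrence) { items_.at(index).second = wordOccurrence; }

private:
    std::vector<std::pair<std::string, int>> items_;
    std::unordered_map<std::string, int> index_;
};
```

The price is storing every string twice; removing items would also require updating the stored positions, which this sketch does not attempt.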