Store characters in C++ tree - c++

How is it possible to store character values in a binary tree? I have a CSV file with data, and I have to retrieve that data, search the database, then insert the search results. I did that using the C++ map from the Standard Template Library, but now my task is to do it using a tree structure. I searched the web, but haven't found anything about characters, just integers, like this: http://www.cprogramming.com/tutorial/lesson18.html
Thanks.
Edin.

Just use the code from your link and replace int with char.
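For instance, a minimal sketch of a binary search tree node keyed on std::string rather than int (the Node/insert names are just illustrative, not from the tutorial; a string key makes sense since, as the next answer notes, "character" probably means "string"):

#include <memory>
#include <string>

// Minimal BST node keyed on a string instead of an int.
struct Node {
    std::string key;
    std::unique_ptr<Node> left, right;
    explicit Node(std::string k) : key(std::move(k)) {}
};

void insert(std::unique_ptr<Node>& root, const std::string& key) {
    if (!root)
        root = std::make_unique<Node>(key);   // new leaf
    else if (key < root->key)
        insert(root->left, key);
    else if (key > root->key)
        insert(root->right, key);
    // duplicate keys are ignored
}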

I wouldn't use "my own" binary tree.
I would suggest you use std::map or std::vector (depending on the amount of data, and many other factors). Start with vector, as that's the "easiest"; if that proves to be "bad", then change it. If you write your code well, it shouldn't change much.
But more importantly, when you say "character", I suspect you actually mean "string". So a vector with a class or struct containing your elements from the CSV file would be a suitable solution.
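For example (the field names here are made up; use whatever columns your CSV actually has):

#include <string>
#include <vector>

// One CSV row; adjust the fields to match your file.
struct Record {
    std::string name;
    std::string value;
};

std::vector<Record> records;   // fill while parsing the CSV, then search it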

Related

Treat only unique strings - which is faster, vector<std::string> or just std::string?

I read some strings from a file, and I need to ignore strings that I have already treated. My first thought was to create a vector<std::string> where I store the strings, and after receiving a new one, check whether it is already in the vector. But then I thought I could do the same using just a std::string: I think that would be faster and use less memory, but this way is less obvious than using a vector. Which approach is better?
A better solution would be to store the strings that you have read in a std::set<string>.
Set lookups are generally faster than lookups in a vector, because sets in the C++ standard library are organized as binary trees. If you put all your strings in a single long string, your search would remain linear, and you would have one more problem to solve: dealing with word aliasing. You wouldn't be able to concatenate the strings as-is, without a separator, because you wouldn't be able to distinguish between "abc" + "xyz" and "abcxyz".
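A minimal sketch of that approach (reading words from standard input here just for illustration):

#include <iostream>
#include <set>
#include <string>

int main() {
    std::set<std::string> seen;
    std::string word;
    while (std::cin >> word) {
        // insert() returns {iterator, bool}; the bool is false for duplicates
        if (seen.insert(word).second)
            std::cout << "new: " << word << '\n';
    }
}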

what data structure to use for range searches?

Trying to make a simple program to catalogue books. Something like this, for example:
struct book {
    string author;
    string title;
    int catalogNumber;
};
Ultimately, I want to be able to do title searches based on a range. So the user could ask to display books whose titles begin with anything from "aa" through "be". Ideally, the average search case would be logarithmic.
Is there something in the STL that could help me? Otherwise, what is the best way to go about this?
Thanks!
You can store them in an std::set, and use std::lower_bound and std::upper_bound to find a range (and yes, that should be logarithmic). To do that, you'll need to define operator< to operate on just the field(s) you care about (title, in this case).
If you're (virtually) always treating the title as the key, you might prefer to use an std::map<std::string, info>, with info defined like:
struct info {
    string author;
    int catalogNumber;
    info() = default;   // needed so map::operator[] can default-construct a value
    info(string a, int c) : author(a), catalogNumber(c) {}
};
This makes a few operations a little easier, such as:
books["Moby Dick"] = info("Herman Melville", 1234);
If you want to support searching by title or author (for example) consider using something like a Boost bimap or multi_index.
For what it's worth, I'd also give serious thought to using a string instead of an int for the catalog number. Almost none of the standard numbering systems (e.g., Dewey Decimal, Library of Congress, ISBN) will store very nicely in an int.
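Putting the pieces above together, a rough sketch of the range query (titles are stored lower-case here to keep the comparison simple; a real catalogue would want a case-insensitive comparator):

#include <iostream>
#include <map>
#include <string>

using std::string;

struct info {
    string author;
    int catalogNumber;
    info() = default;
    info(string a, int c) : author(a), catalogNumber(c) {}
};

int main() {
    std::map<string, info> books;
    books["barchester towers"] = info("Anthony Trollope", 111);
    books["bleak house"]       = info("Charles Dickens",  222);
    books["moby dick"]         = info("Herman Melville",  333);

    // Titles beginning with "aa" through "be" form the half-open range ["aa", "bf").
    auto first = books.lower_bound("aa");
    auto last  = books.lower_bound("bf");
    for (auto it = first; it != last; ++it)
        std::cout << it->first << " -- " << it->second.author << '\n';
}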
You can use a trie [expanding on #smarinov's suggestion here]:
Finding the set of relevant words with a common prefix is fairly easy in a trie: just follow pointers down the trie until you reach the node representing the desired common prefix. The subtrie rooted at that node contains all the words with the desired common prefix.
In your example, you will need:
range("aa","be") = prefix("a") + (prefix("b[a-e]")
The complexity expected for this OP is O(|S|), where |S| is the length of the common prefix. Note that any algorithm is expected to be not better then it [O(logn) algorithms are actually O(|S| * logn) because the compare op depends on the length of the string.
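A rough sketch of that prefix lookup (children kept in a std::map; all names here are illustrative):

#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool isWord = false;
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& child = node->children[c];
        if (!child)
            child = std::make_unique<TrieNode>();   // create missing child
        node = child.get();
    }
    node->isWord = true;
}

// Depth-first walk collecting every word below `node`.
void collect(const TrieNode& node, std::string& prefix, std::vector<std::string>& out) {
    if (node.isWord)
        out.push_back(prefix);
    for (const auto& kv : node.children) {
        prefix.push_back(kv.first);
        collect(*kv.second, prefix, out);
        prefix.pop_back();
    }
}

// Follow the prefix character by character, then collect the whole subtrie.
std::vector<std::string> withPrefix(const TrieNode& root, std::string prefix) {
    const TrieNode* node = &root;
    for (char c : prefix) {
        auto it = node->children.find(c);
        if (it == node->children.end())
            return {};                  // no word has this prefix
        node = it->second.get();
    }
    std::vector<std::string> out;
    collect(*node, prefix, out);
    return out;
}

int main() {
    TrieNode root;
    for (const std::string& w : {"aardvark", "banana", "berry", "zebra"})
        insert(root, w);
    for (const std::string& w : withPrefix(root, "b"))
        std::cout << w << '\n';         // banana, berry
}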
You can put your elements in a std::set. The problem with that is that you'd probably like your users to be able to search by title as well as by author. A solution is just to maintain two sets, but if your data changes this can be tricky to maintain and you need twice as much space.
You can always write something like a trie, but chances are your data will change, and it becomes harder to maintain the logarithmic search time. You can implement any kind of self-balancing binary search tree, but that's essentially what a set is - a red-black tree. Writing one is not the easiest task, however...
Update: You can hash everything and implement some form of the Rabin-Karp string search algorithm, but you should be aware that collisions are possible if you do. You can reduce the probability of one by double hashing and/or using good hash functions.

What are some good methods to replace string names with integer hashes

Usually, entities and components or other parts of the game code in data-driven design will have names that get checked when you want to find out exactly which object you're dealing with.
void Player::Interact(Entity *myEntity)
{
    if (myEntity->isNearEnough(this) && myEntity->GetFamilyName() == "guard")
    {
        static_cast<Guard*>(myEntity)->Say("No mention of arrows and knees here");
    }
}
If you ignore the possibility that this might be premature optimization, it's pretty clear that looking up entities would be a lot faster if their "name" was a simple 32-bit value instead of an actual string.
Computing hashes of the string names is one possible option. I haven't actually tried it, but with a 32-bit range and a good hashing function, the risk of collision should be minimal.
The question is this: Obviously we need some way to convert in-code (or in some kind of external file) string-names to those integers, since the person working on these named objects will still want to refer to the object as "guard" instead of "0x2315f21a".
Assuming we're using C++ and want to replace all strings that appear in the code, can this even be achieved with built-in language features, or do we have to build an external tool that goes through all the files and replaces the values?
Jason Gregory wrote this in his book:
At Naughty Dog, we used a variant of the CRC-32 algorithm to hash our strings, and we didn't encounter a single collision in over two years of development on Uncharted: Drake's Fortune.
So you may want to look into that.
And about the build step you mentioned, he also talked about it. They basically encapsulate the strings that need to be hashed in something like:
_ID("string literal")
And use an external tool at build time to hash all the occurrences. This way you avoid any runtime costs.
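In modern C++ you can also get the compile-time effect without an external tool by making the hash function constexpr. A sketch of the idea (FNV-1a here, not the CRC-32 variant the book mentions):

#include <cstdint>

// Compile-time FNV-1a hash of a string literal (sketch only; not CRC-32).
constexpr std::uint32_t fnv1a(const char* s, std::uint32_t h = 2166136261u) {
    return *s ? fnv1a(s + 1, (h ^ static_cast<unsigned char>(*s)) * 16777619u)
              : h;
}

constexpr std::uint32_t guardId = fnv1a("guard");   // evaluated at compile time
static_assert(fnv1a("guard") == guardId, "hash is usable in constant expressions");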
This is what enums are for. I wouldn't dare to decide which resource is best for the topic, but there are plenty to choose from: https://www.google.com/search?q=c%2B%2B+enum
I'd say go with enums!
But if you already have a lot of code using strings, well, either just keep it that way (simple, and usually fast enough on a PC anyway) or hash it into an integer using some kind of CRC or MD5.
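For example, a self-contained sketch of the enum approach (the FamilyName values are hypothetical, standing in for the string comparison in the original snippet):

#include <iostream>

// Hypothetical family identifiers replacing the string names.
enum class FamilyName { Guard, Merchant, Villager };

int main() {
    FamilyName family = FamilyName::Guard;   // e.g. what a GetFamily() accessor might return
    if (family == FamilyName::Guard)
        std::cout << "No mention of arrows and knees here\n";
}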
This is basically solved by adding an indirection on top of a hash map.
Say you want to convert strings to integers:
Write a class that wraps both an array and a hash map; I call these classes dictionaries (a rough sketch follows after the notes below).
The array contains the strings.
The hash map's key is the string (shared pointers, or stable arrays where raw pointers are safe, work as well).
The hash map's value is the index into the array where the string is located, which is also the opaque handle returned to calling code.
When adding a new string to the system, first look it up in the hash map and return the existing handle if it is present.
If it is not present, add the string to the array; its index is the new handle.
Set the string and the handle in the map, and return the handle.
Notes/Caveats:
This strategy makes getting the string back from the handle run in constant time (it is merely an array dereference).
Handle identifiers are handed out first come, first served, but if you serialize the strings instead of the raw values it won't matter.
operator[] overloads for both the key and the value are fairly simple (registering new strings, or getting a string back), but wrapping the handle in a user-defined class (around an integer) adds a lot of much-needed type safety, and also avoids ambiguity if you want the keys and the values to be the same type (otherwise the overloaded operator[]s won't compile, etc.).
You have to store the strings in RAM, which can be a problem.
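A rough sketch of the dictionary class described above (all names are made up; for brevity the string is simply stored twice rather than shared between the array and the map):

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hands out integer handles for strings, and maps a handle back to its
// string with a plain array lookup.
class StringDictionary {
public:
    using Handle = std::size_t;

    Handle intern(const std::string& s) {
        auto it = lookup_.find(s);
        if (it != lookup_.end())
            return it->second;              // already registered
        strings_.push_back(s);
        Handle h = strings_.size() - 1;     // the index is the handle
        lookup_.emplace(s, h);
        return h;
    }

    const std::string& str(Handle h) const { return strings_[h]; }

private:
    std::vector<std::string> strings_;
    std::unordered_map<std::string, Handle> lookup_;
};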

Reading from a file into a data structure in C++

So, I have a text file (data.txt). It's a story, so just sentence after sentence, and fairly long. What I'm trying to do is take every individual word from the file and store it in a data structure of some kind. As user input I'm going to get a word, and then I need to find the 10 closest words (in data.txt) to that input word, using a function that finds the Levenshtein distance between two strings (I figured that function out, though). So I figured I'd use getline() with " " as the delimiter to store the words individually. But I don't know what I should store these words in so that I can access them easily. And there's also the fact that I don't know how many words are in the data.txt file.
I may have explained this badly, sorry; I'll answer any questions you have to clarify.
In C++ you can store the words in a vector of strings:
#include <fstream>
#include <string>
#include <vector>
//....
std::vector<std::string> wordsArray;
std::ifstream in("data.txt");
std::string oneWord;
while (in >> oneWord)           // operator>> splits on whitespace
    wordsArray.push_back(oneWord);
You need a data structure capable of "containing" the strings you read.
The standard library offers a number of "container" classes, such as:
vector
deque
list
set
map
Check the containers library at http://en.cppreference.com/w/cpp and find the one that best fits your needs.
The proper answer depends not only on the fact that you have to "store them", but also on what you have to do with them afterwards.

Best string search algorithm around

I have code in which I compare a large chunk of data, say the source of a web page, against some words in a file. What is the best algorithm to use?
There can be 2 scenarios:
I have a large number of words to compare against the source. In that case, a normal string search algorithm would have to take a word, compare it against the data, take the next word, compare it against the data, and so on until all are done.
I have only a couple of words in the file, and a normal string search would be OK, but I still need to reduce the time as much as possible.
What algorithm is best? I know about Boyer-Moore and also Rabin-Karp search algorithms.
Although Boyer-Moore search seems fast, I would also like the names of other algorithms and comparisons between them.
In both cases, I think you probably want to construct a patricia trie (also called a radix tree). Most importantly, lookup time would be O(k), where k is the max length of a string in the trie.
Note that Boyer-Moore is for searching for a piece of text (several words) within a larger text.
If all you want is identifying some individual words, then it's much easier to:
put each searched word in a dictionary structure (whatever it is)
look-up each word of the text in the dictionary
This most notably means that you can read the text as a stream, and need not hold it all in memory at once (which works great with the typical example of a file cursor).
As for the structure of the dictionary, I would recommend a simple hash table. It does well memory-wise compared to tree structures.
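A minimal sketch of that (the file name and the searched words are just examples):

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

int main() {
    // The dictionary of words we are looking for.
    std::unordered_set<std::string> wanted = {"arrow", "knee", "guard"};

    // Stream the text word by word; the whole page never has to sit in memory.
    std::ifstream text("page.txt");
    std::string word;
    while (text >> word)
        if (wanted.count(word))
            std::cout << "found: " << word << '\n';
}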