Autocompletion library in C++

I need an auto-completion routine or library in C++ for 1 million words. I guess I could find a routine on the net, like Rabin–Karp. Do you know of a library that does this? I don't see one in Boost.
Also, is it a crazy idea to use a MySQL LIKE query to do that?
Thank you
EDIT: It is true that it is suggestions rather than auto-completion that I need (propose ten words when the user has typed the first two letters). I also have expressions such as "Nikon digital camera", but for a first version I only need suggestions on the "Ni" of Nikon, not on "digital camera".

You don't have to use any crazy algorithm if you begin by preparing an index.
A simple trie / binary search tree structure that keeps the words ordered alphabetically would allow efficient prefix searches.
In C++, for example, the std::map class has a lower_bound member that returns, in O(log N), an iterator to the first element that could possibly extend your word.
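For instance, a minimal sketch of that prefix scan (a std::set is used here since it offers the same ordered lower_bound interface as std::map, and the sample words are made up):

#include <iostream>
#include <set>
#include <string>

int main()
{
    std::set<std::string> words = {"nikon", "nine", "ninja", "note", "apple"};

    const std::string prefix = "ni";
    // O(log N) jump to the first word not lexicographically smaller than
    // the prefix, then scan forward while the prefix still matches.
    for (auto it = words.lower_bound(prefix);
         it != words.end() && it->compare(0, prefix.size(), prefix) == 0;
         ++it)
    {
        std::cout << *it << '\n';   // prints: nikon, nine, ninja
    }
}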

Hmmm, if you're thinking about using LIKE, it most probably means you want classical autocompletion (matching the beginning of a word).
What about organising your data into a 26-way tree (one entry per letter, or a well-chosen x-way tree if you support characters other than letters)? That way you organize your data once, and then get quick results by tree traversal. If you want to limit the number of results proposed in your autocompletion, you can adapt the traversal algorithm. This seems simple and efficient: a LIKE query in SQL has to compare every item in your table each time, whereas this solution is much quicker once the data is correctly set up.
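A minimal sketch of such a fixed-fanout node (assuming lowercase a-z input only; Node26 is an illustrative name, and the nodes are intentionally never freed):

#include <array>
#include <string>

// 26-way tree node: one child slot per letter of the alphabet.
struct Node26
{
    std::array<Node26*, 26> child{};  // null when a letter is absent
    bool terminal = false;            // true if a stored word ends here

    void insert(const std::string& w)
    {
        Node26* n = this;
        for (char c : w) {
            int i = c - 'a';          // assumes c is in [a, z]
            if (!n->child[i]) n->child[i] = new Node26();  // leaked in this sketch
            n = n->child[i];
        }
        n->terminal = true;
    }
};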
Another solution: you can peek at Qt's implementation of QCompleter (it might be overkill to depend on Qt in your code, I don't know).

I worked on a project once that did something like this using CLucene. It worked fine.

You can use a trie (prefix tree) to store your words.
#include <map>
#include <string>

struct trie
{
    std::map<char, trie*> next;  // children, keyed by the next character
    bool is_word = false;        // true when a word ends at this node

    void insert(const std::string& w)
    {
        trie* n = this;
        for (char c : w) {
            if (n->next.find(c) == n->next.end()) {
                n->next[c] = new trie();  // nodes are never freed in this sketch
            }
            n = n->next[c];
        }
        n->is_word = true;
    }
};
Then you can easily get prefix matches by iterating over subtrees.
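For instance, a completion walk building on the node above (a sketch; complete is an illustrative helper name, capped at ten matches to fit the question):

#include <functional>
#include <string>
#include <vector>

// Collect up to `limit` words from the trie that start with `prefix`.
void complete(const trie* root, const std::string& prefix,
              std::vector<std::string>& out, std::size_t limit = 10)
{
    const trie* n = root;
    for (char c : prefix) {              // walk down to the prefix node
        auto it = n->next.find(c);
        if (it == n->next.end()) return; // nothing starts with this prefix
        n = it->second;
    }
    // depth-first search below the prefix node, rebuilding words as we go
    std::function<void(const trie*, const std::string&)> dfs =
        [&](const trie* t, const std::string& word) {
            if (out.size() >= limit) return;
            if (t->is_word) out.push_back(word);
            for (const auto& kv : t->next)
                dfs(kv.second, word + kv.first);
        };
    dfs(n, prefix);
}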

You could write your own simple auto-completion function using Damerau–Levenshtein distance to rank near-miss suggestions.
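A minimal sketch of that distance, using the restricted (optimal string alignment) variant, which adds adjacent transpositions to the classic edit operations; osa_distance is an illustrative name:

#include <algorithm>
#include <string>
#include <vector>

// Restricted Damerau-Levenshtein distance: insertions, deletions,
// substitutions, and adjacent transpositions. A scoring sketch, not a
// tuned implementation.
int osa_distance(const std::string& a, const std::string& b)
{
    const std::size_t m = a.size(), n = b.size();
    std::vector<std::vector<int>> d(m + 1, std::vector<int>(n + 1));

    for (std::size_t i = 0; i <= m; ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= n; ++j) d[0][j] = static_cast<int>(j);

    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            d[i][j] = std::min({ d[i - 1][j] + 1,          // deletion
                                 d[i][j - 1] + 1,          // insertion
                                 d[i - 1][j - 1] + cost }); // substitution
            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                d[i][j] = std::min(d[i][j], d[i - 2][j - 2] + 1); // transposition
        }
    }
    return d[m][n];
}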

Related

If-else statement trouble

My project uses an if-else statement to determine whether a random image from a list of images matches an image in another list. I'm trying to find a piece of code that will let me write the if-else statement as: if randomImage == list[?]. In place of the question mark I need code that will go through the entire list and see if randomImage matches ANY of the elements in the list. Here's a snippet of code: trash[randomTrash] generates a random image from the list trash. I need to check whether that random image of trash is equal to an image in another list: it needs to go through the recycle list and determine if any element is equal to it.
There is probably an easier way to do this depending on your project specifics, but you should be able to use a for loop that loops through each element in your list.
boolean match = false;  // whether randomImage was found in the list
for (int element = 0; element < list.length; element++) {
    if (randomImage == list[element]) {
        match = true;   // randomImage matches an element in the list
    }
}
It's really helpful if you tag the language you're using, so people know exactly how to address your issue in particular.
The most common and straightforward approach is looping through the whole list by index and comparing each image to yours, and that's a solid, working one.
If you're on a language supporting list comprehensions, you could take an approach similar to this,
[x for x in some_list if x == image]
then check whether the resulting list is empty for your if/else condition.
Please let us know in particular if it's something more specific you're looking for.
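For what it's worth, in C++ (which most of this thread uses) the same membership check collapses to std::any_of. This is an illustrative sketch only: Image here is a placeholder alias standing in for whatever type the images really have.

#include <algorithm>
#include <string>
#include <vector>

using Image = std::string;  // placeholder for the real image type

// True if randomImage equals any element of the list.
bool matchesAny(const Image& randomImage, const std::vector<Image>& recycle)
{
    return std::any_of(recycle.begin(), recycle.end(),
                       [&](const Image& img) { return img == randomImage; });
}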

What's a better way of doing this in Visual C++?

I don't normally work in Visual C++, but I was wondering what I could do to speed up this logic, and whether there's a better way of doing it.
I have a map<wstring, wstring> with contents like this:
\Device\CdRom0\, E:\
\Device\CdRom1\, F:\
\Device\HarddiskVolume1\,
\Device\HarddiskVolume4\, C:\
\Device\HarddiskVolume5\, D:\
And I have a huge list of strings that have the following format:
L"\\Device\\HarddiskVolume4\\Users\\User\\Documents\\Visual Studio 2013\\Projects\\FileLocker\\FileLocker\\bin\\Debug\\Test.txt";
My whole purpose is to take strings in the above format, use the map as a type of lookup, and convert these strings into the following format (example converting the above string to a drive path):
L"C:\\Users\\User\\Documents\\Visual Studio 2013\\Projects\\FileLocker\\FileLocker\\bin\\Debug\\Test.txt";
The way I am doing it currently is as follows (for each string):
std::wstring test = ...
for (map<wstring, wstring>::iterator i = volumes.begin(); i != volumes.end(); ++i)
{
    if (test.find((*i).first.c_str()) == 0)
    {
        test = test.replace(0, wcslen((*i).first.c_str()), (*i).second.c_str());
    }
}
But there's a lot of strings here, and performance can really take a hit! What are some better ways of performing this lookup and assigning to the string at hand?
If you know there are always exactly two \ separated terms to match, extract just that part of the string then search for that in the map - or try a hashmap.
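For instance, a sketch under that assumption: the key is everything up to and including the third backslash, e.g. L"\Device\HarddiskVolume4\". resolve is a hypothetical helper, and volumes is assumed to have been re-keyed into a hash map.

#include <string>
#include <unordered_map>

std::wstring resolve(const std::wstring& path,
                     const std::unordered_map<std::wstring, std::wstring>& volumes)
{
    std::size_t second = path.find(L'\\', 1);
    if (second == std::wstring::npos) return path;     // not in the expected form
    std::size_t third = path.find(L'\\', second + 1);
    if (third == std::wstring::npos) return path;

    auto it = volumes.find(path.substr(0, third + 1)); // one O(1) average lookup
    if (it == volumes.end()) return path;              // unknown device
    return it->second + path.substr(third + 1);        // e.g. L"C:\Users\..."
}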
If you want to stick to the map and same style of logic, you could replace...
if (test.find((*i).first.c_str()) == 0)
...with test.compare(0, i->first.size(), i->first) == 0, so it doesn't try to match at every later position along the string (note that compare returns 0 on a match).
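Put together, the loop becomes (same logic as before, just with the comparison anchored at position 0):

for (map<wstring, wstring>::iterator i = volumes.begin(); i != volumes.end(); ++i)
{
    // compare(0, n, s) only tests the start of the string and returns 0 on
    // a match; find() keeps scanning every later offset before failing.
    if (test.compare(0, i->first.size(), i->first) == 0)
    {
        test.replace(0, i->first.size(), i->second);
    }
}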
You could also build a tree of resolution steps:
\Device\ ---> CdRom ----------> 0
          |                     1
          |
          +-> HarddiskVolume -> 1
                                4
                                5
The C++ Standard library doesn't provide a convenient container type for modelling this though - if the depth is always 3 you can hardcode a few maps (last numeric one could even be an array), otherwise there's e.g. boost graph.
After a successful replace, use break; to exit the for loop. Since at most one device prefix can match, that eliminates the attempts to match the remaining drives and can roughly double the performance on average. If the frequencies with which the drives appear are roughly known, ordering the entries by that frequency adds to the effectiveness of the break.

How to remove all words in a list from a fixed list of candidates?

I'm working on code which includes comprehensive text preprocessing: stopword removal, stemming, boilerplate removal/substitution (URLs, emails, numbers, money amounts, tags, etc.), building an inverted index, LCA, etc. Not exactly surprising: removing stopwords is the bottleneck, the most expensive part of the procedure.
What I have now is pretty simple:
I have around 500 stop-terms stored in a static array: static const std::wstring stopwords[].
Then for each document (std::vector<wstring>):
for (const auto& term : stopwords)
{
    doc.erase(std::remove(doc.begin(), doc.end(), term), doc.end());
}
Any suggestion how to improve this code's performance?
Your algorithm is O(n*m), searching the document once per stopword. Instead you should loop over the words in doc, checking whether each is a stopword, and your stopwords should be in a hash table (not a map) so you can do an O(1) check for whether a given word is a stop word. That reduces the time to O(n), where n is the size of the document.
For example, C++11 provides an unordered set container that you can use for your hash table.
std::unordered_set<std::wstring> stopwords; // keep your stop words in here.
Once you have that, the trivial solution becomes:
doc.erase(std::remove_if(
              doc.begin(),
              doc.end(),
              // capture by reference: needed if stopwords is not a global
              [&](const std::wstring& s) { return stopwords.find(s) != stopwords.end(); }),
          doc.end());
Case-sensitivity checking notwithstanding (which your original sample did not account for, so we haven't here either), this will perform significantly better than what you had before, assuming your words have a reasonable hash distribution.
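Populating the set once from the existing static array is simple, for example (a sketch; stopword_array stands in for the asker's array):

#include <iterator>
#include <string>
#include <unordered_set>

static const std::wstring stopword_array[] = { L"the", L"a", L"of" /* ... */ };

// Build the hash set once, up front; lookups afterwards are O(1) on average.
static const std::unordered_set<std::wstring> stopwords(
    std::begin(stopword_array), std::end(stopword_array));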

Most effective way to create a naive text summarizing algorithm

I'm building a simple naive text summary algorithm. The algorithm works like this:
First step of my algorithm is to remove all stop words(stop words in English).
After my text contains only words with actual meaning I'm going to see how many times each word is used in the text to find the frequency of the word. For example if the word "supercomputer" is used 5 times, it will have frequency = 5.
Then I'm going to calculate each sentence's weight by dividing the sum of the frequencies of all the words in the sentence by the number of words in the sentence.
In the last step I'm going to sort the sentences by their length.
I need to write this algorithm in C++ (as a V8 NodeJS module), but the problem is that in the past few years I've been working mostly with high-level scripting languages like Javascript, and I'm not that experienced in C++. In Javascript I could just use a regex to remove all the stop words and then find the frequencies, but in C++ it seems to be much more complex.
I came up with the following idea:
struct words {
    std::string word;
    int freq;
};
std::vector<words> Words;
The stop words are going to be preloaded in a V8 Local Array or std::vector.
For each word in the text I'm going to loop through all the stop words; if the current word is not a stop word, I check whether it's already in the vector: if not, I add a new entry to the Words vector, and if it exists I increase its freq by 1.
After I have found all the frequencies of all words, I'm going to loop through the text again to find the weight of each sentence.
And with this idea few problems came to my mind:
My texts will mostly be 1000+ words. Looping through 100+ stop words for each word means 100,000+ iterations just to filter out the stop words. This seems really inefficient.
After I have the frequencies, I will need one more pass through the 1000+ words of text, against the 300+ entries in the frequency vector, to calculate each sentence's weight.
My idea seems inefficient, but I'm not very familiar with C++.
So my question is: are there better ways to do this, or ways to optimize my algorithm, especially regarding the problems I listed above?
I'm worried about the performance of my algorithm and any tips/suggestions will be greatly appreciated.
For the stopwords, have a look at std::unordered_set. You can store all of your stopword strings in a std::unordered_set<string>, then when you have a string you want to compare, call count(string) to see if it exists.
For the word/frequency pairs, use a std::unordered_map as in some of the comments. It would be fastest if you perform both the find and insert in a single map lookup. Try something like this:
#include <string>
#include <unordered_map>

struct Frequency
{
    int val;
    Frequency() : val(0) {}
    void increment() { ++val; }
};

std::unordered_map<std::string, Frequency> words;

void processWord(const std::string& str)
{
    words[str].increment();
}
words[str] searches for the word in the map, adding it if it doesn't exist. A new word calls Frequency's constructor, which initializes the count to zero. So all you have to do is call processWord for every word.
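For example, feeding the counter from whitespace-tokenized text (a sketch; note a plain int mapped type behaves the same way, since operator[] value-initializes new values to zero):

#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

int main()
{
    std::unordered_map<std::string, int> words;
    std::istringstream text("the quick brown fox jumps over the lazy dog the end");

    std::string w;
    while (text >> w)
        ++words[w];                    // find-or-insert in a single lookup

    std::cout << words["the"] << '\n'; // prints 3
}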

Given a string, find all its permutations that are a word in dictionary

This is an interview question:
Given a string, find all its permutations that are a word in dictionary.
My solution:
Put all words of the dictionary into a suffix tree and then search each permutation of the string in the tree.
The search time is O(n), where n is the size of the string. But the string may have n! permutations.
How do I improve the efficiency?
Your general approach isn't bad.
However, you can avoid having to search for each permutation by rearranging your word so that all its characters are in alphabetical order, then searching a dictionary where each word is similarly rearranged into alphabetical order and mapped to the original word.
I realise that might be a little hard to grasp as is, so here's an example. Say your word is leap. Rearrange this to aelp.
Now in your dictionary you might have the words plea and pale. Having done as suggested, your dictionary will (among other things) contain the following mappings:
...
aelp -> pale
aelp -> plea
...
So now, to find your anagrams you need only find entries for aelp (using, for example, a suffix-tree approach as suggested), rather than for all 4! = 24 permutations of leap.
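A sketch of that dictionary preprocessing with a std::multimap (the dictionary contents here are made up, and sorted is a local helper):

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Return the word's letters in sorted order, e.g. "leap" -> "aelp".
std::string sorted(std::string s)
{
    std::sort(s.begin(), s.end());
    return s;
}

int main()
{
    std::vector<std::string> dictionary = {"pale", "plea", "leap", "peal", "tale"};

    // Key every dictionary word by its sorted letters.
    std::multimap<std::string, std::string> index;
    for (const auto& w : dictionary)
        index.emplace(sorted(w), w);        // e.g. "aelp" -> "pale"

    // Anagram lookup is then a single equal_range query.
    auto range = index.equal_range(sorted("leap"));
    for (auto it = range.first; it != range.second; ++it)
        std::cout << it->second << '\n';    // pale, plea, leap, peal
}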
A quick alternative solution - all depends on the sizes of data structures in question.
If the dictionary is reasonably small and the string is reasonably long, you can go over each entry in the dictionary and figure out whether it is a permutation of the string. You can be smarter: you can sort the dictionary and skip certain entries.
You can build a map from a sorted list of characters to a list of words.
For example, given these:
Array (him, hip, his, hit, hob, hoc, hod, hoe, hog, hon, hop, hos, hot)
you would sort them internally:
Array (him, hip, his, hit, bho, cho, dho, eho, gho, hno, hop, hos, hot)
sort the result:
Array (bho, cho, dho, eho, gho, him, hip, his, hit, hno, hop, hos, hot)
In this small sample we don't have a match, but for a particular word you would sort it internally and, with the result as key, look into your map.
Why don't you use a hash map to store the dictionary words? That gives you O(1) lookup time. And if your input is in English, you can build another table recording all the letters that appear in your dictionary; using this table, you can reject some inputs at the start. Here is an example:
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

std::vector<std::string> lookup(std::string input,
                                const std::unordered_set<std::string>& dictionary_hash_table,
                                const std::unordered_set<char>& letter_table)
{
    std::vector<std::string> result_list;

    // reject early if the input contains a letter the dictionary never uses
    for (char c : input) {
        if (letter_table.count(c) == 0)
            return result_list;
    }

    // try each distinct permutation against the dictionary hash table
    std::sort(input.begin(), input.end());
    do {
        if (dictionary_hash_table.count(input))
            result_list.push_back(input);
    } while (std::next_permutation(input.begin(), input.end()));

    return result_list;
}
You should put the words into a trie. Then you can look up words as you generate the permutations, and skip over whole blocks of permutations whose prefix is not in the trie.
http://en.wikipedia.org/wiki/Trie
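A self-contained sketch of that pruning (the node type and names here are illustrative; nodes are intentionally never freed):

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Walk the trie and the pool of remaining letters together, so whole blocks
// of permutations are skipped as soon as no dictionary word continues with
// any of the letters still available.
struct Node
{
    std::map<char, Node*> next;
    bool word = false;
};

void search(const Node* n, std::map<char, int>& avail,
            std::string& path, std::size_t target, std::vector<std::string>& out)
{
    if (path.size() == target && n->word)
        out.push_back(path);
    for (const auto& kv : n->next) {           // only branches the trie knows
        auto it = avail.find(kv.first);
        if (it == avail.end() || it->second == 0) continue;
        --it->second;                          // consume the letter
        path.push_back(kv.first);
        search(kv.second, avail, path, target, out);
        path.pop_back();                       // backtrack
        ++it->second;
    }
}

int main()
{
    Node root;
    for (const std::string& w : {"pale", "plea", "peal", "tale"}) {
        Node* n = &root;
        for (char c : w) {
            if (!n->next.count(c)) n->next[c] = new Node();  // leaked in this sketch
            n = n->next[c];
        }
        n->word = true;
    }

    const std::string s = "leap";
    std::map<char, int> avail;
    for (char c : s) ++avail[c];

    std::string path;
    std::vector<std::string> out;
    search(&root, avail, path, s.size(), out);
    for (const auto& w : out) std::cout << w << '\n';  // pale, peal, plea
}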
Another simple solution could be the algorithm below:
1) Use next_permutation to generate each distinct permutation.
2) Use find/find_if to look it up in the dictionary.
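A sketch of that two-step recipe (dictionary contents made up; note that next_permutation only enumerates every permutation if you start from the sorted string):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    const std::vector<std::string> dictionary = {"pale", "plea", "peal", "tale"};

    std::string s = "leap";
    std::sort(s.begin(), s.end());           // start from "aelp"
    do {
        // step 2: linear lookup with std::find, as suggested above
        if (std::find(dictionary.begin(), dictionary.end(), s) != dictionary.end())
            std::cout << s << '\n';           // pale, peal, plea
    } while (std::next_permutation(s.begin(), s.end()));
}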