Autocorrect algorithm - C++

I want to implement the following in C++:
1) Check whether a given word exists in a dictionary. The dictionary file is huge; assume around 100 MB, or 3-4 million words.
2) Suggest corrections for the incorrect word.
3) Autocomplete feature.
My Approach
1) I am planning to build a tree so that searching will be efficient.
2) I do not know how to implement the autocorrection feature.
3) I can implement the autocomplete feature using trees.
What's the best data structure and algorithm to implement all the above features?

I have been working on the same problem. So far, the best solution I have come across is using a ternary search tree for autocompletion. Ternary search trees are more space-efficient than tries.
If I'm unable to find the looked-up string in my ternary search tree, I then use an already-built BK-tree to find the closest match. The BK-tree internally uses Levenshtein distance.
Metaphone is also something you might want to explore; however, I haven't gone into it in depth.
I have a Java solution for the BK-tree + ternary search tree if you'd like.
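To make the BK-tree idea concrete, here is a minimal C++ sketch (names like `bkInsert`/`bkQuery` are my own, not from any library): each child edge is labelled with the Levenshtein distance between the child's word and its parent's word, and a lookup prunes children using the triangle inequality.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Classic dynamic-programming Levenshtein distance.
int levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> d(a.size() + 1, std::vector<int>(b.size() + 1));
    for (size_t i = 0; i <= a.size(); ++i) d[i][0] = (int)i;
    for (size_t j = 0; j <= b.size(); ++j) d[0][j] = (int)j;
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            d[i][j] = std::min({ d[i-1][j] + 1, d[i][j-1] + 1,
                                 d[i-1][j-1] + (a[i-1] != b[j-1]) });
    return d[a.size()][b.size()];
}

// Minimal BK-tree node: each child edge carries the distance
// between the child's word and this node's word.
// (Nodes are leaked for brevity; a real implementation would own them.)
struct BkNode {
    std::string word;
    std::map<int, BkNode*> children;
};

void bkInsert(BkNode*& root, const std::string& word) {
    if (!root) { root = new BkNode{word, {}}; return; }
    BkNode* node = root;
    for (;;) {
        int d = levenshtein(word, node->word);
        if (d == 0) return;  // word already present
        auto it = node->children.find(d);
        if (it == node->children.end()) {
            node->children[d] = new BkNode{word, {}};
            return;
        }
        node = it->second;
    }
}

// Collect all words within maxDist of the query. The triangle
// inequality lets us skip child edges outside [d-maxDist, d+maxDist].
void bkQuery(const BkNode* node, const std::string& query, int maxDist,
             std::vector<std::string>& out) {
    if (!node) return;
    int d = levenshtein(query, node->word);
    if (d <= maxDist) out.push_back(node->word);
    for (const auto& [edge, child] : node->children)
        if (edge >= d - maxDist && edge <= d + maxDist)
            bkQuery(child, query, maxDist, out);
}
```

The pruning is why the BK-tree beats a linear scan: most subtrees are never visited when `maxDist` is small.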

You can do autocomplete by looking at all the strings in a given subtree; some kind of score to help you rank them would also help. It works like this: if you have "te", you go down that path in the trie and then traverse the entire subtree there to get all the possible endings.
For corrections, you need to implement something like http://en.wikipedia.org/wiki/Levenshtein_distance over the tree. You can use the fact that once you have processed a given path in the trie, you can reuse the result for all the strings in the subtree rooted at the end of that path.
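A minimal sketch of that subtree walk, assuming a plain pointer-based trie (the identifiers are illustrative, not from any library):

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Minimal trie; autocomplete walks down to the prefix node, then
// collects every word stored in that node's subtree.
// (Nodes are leaked for brevity; a real trie would own its children.)
struct TrieNode {
    std::map<char, TrieNode*> children;
    bool isWord = false;
};

void trieInsert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        TrieNode*& child = node->children[c];
        if (!child) child = new TrieNode();
        node = child;
    }
    node->isWord = true;
}

// Depth-first walk that emits every complete word below `node`.
void collect(const TrieNode& node, std::string& prefix,
             std::vector<std::string>& out) {
    if (node.isWord) out.push_back(prefix);
    for (const auto& [c, child] : node.children) {
        prefix.push_back(c);
        collect(*child, prefix, out);
        prefix.pop_back();
    }
}

std::vector<std::string> autocomplete(const TrieNode& root, std::string prefix) {
    const TrieNode* node = &root;
    for (char c : prefix) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return {};  // no word has this prefix
        node = it->second;
    }
    std::vector<std::string> out;
    collect(*node, prefix, out);
    return out;
}
```

In practice you would cap the traversal (or precompute top-k completions per node) rather than enumerate the whole subtree for short prefixes.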

1) Aside from trees, another interesting method is the BWT:
http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
A BWT suffix array can easily be used to track words with a given prefix.
2) For error correction, a modern approach is LSH (locality-sensitive hashing):
http://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search
Some links turned up by a Google search:
https://cs.stackexchange.com/questions/2093/efficient-map-data-structure-supporting-approximate-lookup
https://code.google.com/p/likelike/
http://aspguy.wordpress.com/2012/02/18/the-magic-behind-the-google-search/

Related

Levenshtein distance, multiple paths

Edit: TL;DR version: how do I get all possible backtraces for the Damerau–Levenshtein distance between two words? I'm using https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm to compute the distance, and a trivial backtrace algorithm (illustrated below) to reconstruct the list of corrections.
More details below:
I just got stuck with optimal string alignment (a variant of Damerau–Levenshtein distance) while trying to get the complete set of possible alignments.
The goal is to align two strings for further comparison in an auto-suggestion algorithm. In particular, I'd like to ignore insertions past the end of the first word.
The problem is that in some cases multiple "optimal" alignments are possible, e.g.
align("goto", "go to home")
1) go to
go to home
2) go t o
go to home
Unfortunately, my implementation finds only the second variant, while I need both, or at least the first one.
I've tried to perform some kind of A* or BFS path finding, but it looks like the cost matrix is "tuned" for variant (2) only. In the screenshot below I can find the red path, but it looks like there is no green path:
However, someone made a web demo which implements exactly what I want:
What am I missing here?
My implementation is probably too long to post here, so here is a link to GitHub: https://github.com/victor-istomin/incrementalSpellCheck/blob/f_improvement/spellCheck.hpp
The distance implementation is located in the optimalStringAlignementDistance() and optimalStringAlignmentBacktrace() methods.
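For reference, the usual way to recover *all* optimal alignments is to branch on every predecessor cell that attains the optimum, not just the first one found. A stripped-down sketch (plain Levenshtein only, without the transposition case of the OSA variant in the question, so treat it as a simplification; 'M'/'S'/'D'/'I' mark match/substitute/delete/insert):

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Standard Wagner-Fischer cost matrix.
std::vector<std::vector<int>> editMatrix(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> d(a.size() + 1, std::vector<int>(b.size() + 1));
    for (size_t i = 0; i <= a.size(); ++i) d[i][0] = (int)i;
    for (size_t j = 0; j <= b.size(); ++j) d[0][j] = (int)j;
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            d[i][j] = std::min({ d[i-1][j] + 1, d[i][j-1] + 1,
                                 d[i-1][j-1] + (a[i-1] != b[j-1]) });
    return d;
}

// Walk back from (i, j), branching whenever more than one move is
// consistent with the optimal cost; every minimal edit script is
// collected in `out` (reversed at the end so it reads left-to-right).
void backtraces(const std::vector<std::vector<int>>& d,
                const std::string& a, const std::string& b,
                size_t i, size_t j, std::string script,
                std::set<std::string>& out) {
    if (i == 0 && j == 0) {
        std::reverse(script.begin(), script.end());
        out.insert(script);
        return;
    }
    if (i > 0 && j > 0 && d[i][j] == d[i-1][j-1] + (a[i-1] != b[j-1]))
        backtraces(d, a, b, i - 1, j - 1,
                   script + (a[i-1] == b[j-1] ? 'M' : 'S'), out);
    if (i > 0 && d[i][j] == d[i-1][j] + 1)       // deletion from a
        backtraces(d, a, b, i - 1, j, script + 'D', out);
    if (j > 0 && d[i][j] == d[i][j-1] + 1)       // insertion from b
        backtraces(d, a, b, i, j - 1, script + 'I', out);
}
```

The OSA variant would add one more predecessor check (the `d[i-2][j-2] + 1` transposition cell) alongside the three above.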

Does the RandomTree algorithm in WEKA use information gain?

I'm using Weka with some databases and I'm trying to understand how the RandomTree algorithm works. I don't understand whether, at every node, it chooses a random attribute to split on, or whether it uses the information gain of an attribute to decide which one to split on.
Thank you
Regards,
Andrea

What exactly is pairwise matching and how does it work?

I'm working on multiple image stitching and I came across the term "pairwise matching." I have searched almost every site but am unable to get a CLEAR description of what exactly it is and how it works.
I'm working in Visual Studio 2012 with OpenCV. I have modified stitching_detailed.cpp according to my requirements and have been very successful in maintaining quality with significantly less time, except for pairwise matching. I'm using ORB to find feature points. BestOf2NearestMatcher is used in stitching_detailed.cpp for pairwise matching.
What I know about Pairwise Matching and BestOf2NearestMatcher:
(Correct me if I'm wrong somewhere)
1) Pairwise matching works similarly to other matchers such as the brute-force matcher, the FLANN-based matcher, etc.
2) Pairwise matching works with multiple images, unlike the matchers above; with those, you have to go pair by pair if you want to use them for multiple images.
3) In pairwise matching, the features of one image are matched against those of every other image in the data set.
4) BestOf2NearestMatcher finds the two best matches for each feature and keeps the best one only if the ratio between the descriptor distances is greater than the threshold match_conf.
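The acceptance rule in (4) can be illustrated without OpenCV. This toy sketch applies the 2-NN ratio test to one query feature's precomputed distances to every candidate in the other image; the `(1 - matchConf)` form mirrors how the docs describe `match_conf`, but treat the details as an assumption rather than the actual OpenCV source:

```cpp
#include <cassert>
#include <vector>

// Returns the index of the accepted match, or -1 if the best match
// is not clearly better than the runner-up (the 2-NN ratio test).
int ratioTestMatch(const std::vector<double>& distances, double matchConf) {
    if (distances.size() < 2) return -1;
    int best = -1, second = -1;
    for (int i = 0; i < (int)distances.size(); ++i) {
        if (best < 0 || distances[i] < distances[best]) {
            second = best;
            best = i;
        } else if (second < 0 || distances[i] < distances[second]) {
            second = i;
        }
    }
    // Accept only when the best distance beats the second-best
    // by the required margin; otherwise discard as ambiguous.
    if (distances[best] < (1.0 - matchConf) * distances[second]) return best;
    return -1;
}
```

The point of the test is robustness: a feature that matches two candidates almost equally well is probably on a repetitive texture, and keeping it would pollute the homography estimation that follows.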
What I want to know:
1) I want to know more details about pairwise matching, if I'm missing some on it.
2) I want to know HOW pairwise matching works, the actual flow of it in detail.
3) I want to know HOW BestOf2NearestMatcher works, the actual flow of it in detail.
4) Where can I find code for BestOf2NearestMatcher? OR Where can I get similar code to BestOf2NearestMatcher?
5) Is there any alternative I can use for pairwise matching (or BestOf2NearestMatcher) which takes less time than the current one?
Why I want to know and what I'd do with it:
1) As I stated in the introduction, I want to reduce the time pairwise matching takes. If I can understand what pairwise matching actually is and how it works, I can create my own according to my requirements, or modify the existing one.
Here's where I posted a question in which I want to reduce the time for the entire program: here. I'm not asking the same question again; I'm asking about specifics here. There, I wanted to know how I could reduce the time spent in pairwise matching as well as in other code sections; here, I want to know what pairwise matching is and how it works.
Any help is much appreciated!
EDIT: I found the code for pairwise matching in matchers.cpp. I created my own function in the main code to optimize the time. It works well.

What are efficient approaches for manipulating a list of thousands of individual words?

I'm trying to design an application that manipulates a list of thousands of individual words stored in a txt file, for the following tasks:
1- Randomly picking some words.
2- Checking whether words entered by the user are actually in the list.
3- Retrieving the entire list from the txt file and storing it temporarily for subsequent manipulation.
I'm not asking for an implementation or for pseudocode. I'm looking for an efficient approach to dealing with a massive list of words. For the time being, I might go with a vector of strings; however, searching thousands of words will take some time. Of course there must be strategies to cope with this kind of task, but since my background is not computer science, I don't know which direction to go in. Any suggestions are welcome.
A vector of strings is fine for this problem. Just sort it, and then you can use binary search to find a string in the list.
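A minimal sketch of that approach, using the standard library's binary search over a sorted vector:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// After a one-time sort, each membership test costs O(log n) comparisons.
bool contains(const std::vector<std::string>& sortedWords, const std::string& w) {
    return std::binary_search(sortedWords.begin(), sortedWords.end(), w);
}
```

Load the file into the vector once, call `std::sort` on it, and reuse it for every lookup; for "thousands" of words this is more than fast enough.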
Radix trees are a good solution for searching through word lists for matches. They reduce storage space, but you'll have to write some custom code for getting words into and out of the list, and the text file won't necessarily be easy to read unless you rebuild the tree each time you load from the text file. Here's an implementation I committed to GitHub (I can't even remember the source material at this point) that might be of assistance to you.

Partition in Equivalence sets

I have a simple, undirected tree T. I need to find two paths, A and B, such that A and B have no common vertex. The purpose is to maximize Len(A)*Len(B).
I figured this problem is similar to the partition problem, except that in the partition problem you have a set, whereas here you have an equivalence set. The solution would be to find two non-crossing paths with Len(A) ~ Len(B) ~ [(n-1)/2]. Is this correct? How should I implement such an algorithm?
First of all, I think you are looking at this problem the wrong way. As I understand it, you have a graph-related problem. What you do is:
Build a maximum spanning tree and find its length L.
Now, you say that the two paths can't have any vertex in common, so we have to remove an edge to achieve this. I assume that every edge weight in your graph is 1, so the lengths of the two paths A and B sum to L-1 after you remove an edge. The problem is now to remove an edge such that the product of Len(A) and Len(B) is maximized. You do that by removing the edge at (or nearest to) the middle of L. Why? The problem is the same as maximizing the area of a rectangle with a fixed perimeter. A short YouTube video on the subject can be found here.
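The "cut near the middle" claim is just the discrete version of the fixed-perimeter rectangle argument; a tiny sketch:

```cpp
#include <cassert>

// With the two path lengths constrained to a fixed sum S = Len(A) + Len(B),
// the product Len(A) * Len(B) is largest when the split is as even as
// possible: a = floor(S/2), giving a * (S - a).
long long bestSplitProduct(long long S) {
    long long a = S / 2;  // one side; the other is S - a
    return a * (S - a);
}
```

Any other split k * (S - k) with k further from S/2 gives a strictly smaller product, which is why the edge nearest the middle is the right one to remove.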
Note that if your edge weights are not all equal to 1, you have a harder problem, because there may exist more than one maximum spanning tree. You may be able to split them in different ways; if this is the case, write me back and I will think about a solution, but I do not have one at hand.
Best of luck.
I think there is a dynamic programming solution that is just about tractable if path length is just the number of links in the paths (so links don't have weights).
Work from the leaves up. At each node you need to keep track of the best pair of solutions confined to the subtree rooted at that node and, for each k, the best solution consisting of a path of length k terminating at that node plus a second path of maximum length somewhere below that node, not touching the first path.
Given this information for all descendants of a node, you can produce the same information for that node, and so work your way up to the root.
You can see that this amount of information is required if you consider a tree that is in fact just a line of nodes. The best solution for a line of nodes is to split it in two, so if you have only worked out the best solution for a line of length 2n + 1, you won't have the building blocks you need for a line of length 2n + 3.