String matching algorithm - C++

Say I have 3 strings. And then 1 more string.
Is there an algorithm that would allow me to find which one of the first 3 strings matches the 4th string the most?
None of the strings are going to be exact matches, I'm just trying to find the closest match.
And if the algorithm already exists in STL, that would be nice.
Thanks in advance.

You don't specify what exactly you mean by "matches the most", so I assume you don't have precise requirements. In that case, Levenshtein distance is a reasonable metric. Simply compute the Levenshtein distance between each of the three strings and the fourth, and pick the one that gives the lowest distance.

You can implement the Levenshtein distance algorithm; it provides a very nice measure of how closely two strings match. It counts how many single-character edits (insertions, deletions, substitutions) you need to make in order to turn one string into the other. You can find a C++ implementation here.
Compute Levenshtein Distance between string #4 and the three strings that you have. Pick the string with the smallest distance.
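A minimal sketch of that approach (the helper and variable names here are mine, not from any library): a standard dynamic-programming Levenshtein implementation, then a loop that keeps the candidate with the lowest distance.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Classic dynamic-programming Levenshtein distance (insert/delete/substitute, all cost 1).
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, subst});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

int main() {
    std::vector<std::string> candidates = {"hello wrold", "hallo word", "goodbye world"};
    std::string query = "hello world";

    // Pick the candidate with the smallest edit distance to the query.
    std::string best;
    std::size_t bestDist = std::string::npos;  // npos doubles as "infinity" here
    for (const auto& c : candidates) {
        std::size_t d = levenshtein(c, query);
        if (d < bestDist) { bestDist = d; best = c; }
    }
    std::cout << "closest: " << best << " (distance " << bestDist << ")\n";
}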

There's nothing ready in the STL, but what you need is some kind of string metric.
Use Levenshtein distance if the strings are similar up to some typos, e.g. Hello vs Helol.
Use Jaccard distance/Dice's coefficient on the set of n-grams if the strings might change more drastically, like hello world versus world hello.
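For the n-gram route, here is a rough sketch (the names are mine) of Dice's coefficient on character bigrams; a higher score means a better match, so you would keep the candidate with the highest value rather than the lowest distance.

#include <iostream>
#include <set>
#include <string>

// Collect the set of character bigrams of a string.
std::set<std::string> bigrams(const std::string& s) {
    std::set<std::string> grams;
    for (std::size_t i = 0; i + 1 < s.size(); ++i)
        grams.insert(s.substr(i, 2));
    return grams;
}

// Dice's coefficient: 2 * |A intersect B| / (|A| + |B|), in [0, 1].
double dice(const std::string& a, const std::string& b) {
    std::set<std::string> ga = bigrams(a), gb = bigrams(b);
    if (ga.empty() && gb.empty()) return 1.0;
    std::size_t common = 0;
    for (const auto& g : ga)
        if (gb.count(g)) ++common;
    return 2.0 * common / (ga.size() + gb.size());
}

int main() {
    // Word order barely matters here, unlike with plain edit distance.
    std::cout << dice("hello world", "world hello") << "\n";
}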

You have an approximate string matching problem. Depending on what kind of matching you want to perform, you will use a different algorithm. There are many: SOUNDEX, Jaro-Winkler, Levenshtein distance, Metaphone, etc. Regarding the STL, I don't know of any functions that implement those algorithms, but you can take a look here for some source using C++. Also, note that if you are getting your strings from a database, it is very likely that your database engine implements some of those algorithms (most likely SOUNDEX).
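As an illustration of the phonetic option, here is a rough sketch of the classic SOUNDEX encoding in C++ (names are mine; it deliberately skips some edge-case rules, such as the special handling of 'h' and 'w' between same-coded letters):

#include <cctype>
#include <iostream>
#include <string>

// Map a letter to its SOUNDEX digit class ('0' = ignored: vowels, h, w, y).
char soundexCode(char c) {
    switch (std::tolower(static_cast<unsigned char>(c))) {
        case 'b': case 'f': case 'p': case 'v': return '1';
        case 'c': case 'g': case 'j': case 'k':
        case 'q': case 's': case 'x': case 'z': return '2';
        case 'd': case 't': return '3';
        case 'l': return '4';
        case 'm': case 'n': return '5';
        case 'r': return '6';
        default: return '0';
    }
}

// Simplified American SOUNDEX: keep the first letter, append digit codes of the
// remaining letters, collapse adjacent duplicates, pad/truncate to 4 characters.
std::string soundex(const std::string& word) {
    if (word.empty()) return "";
    std::string out(1, static_cast<char>(std::toupper(static_cast<unsigned char>(word[0]))));
    char prev = soundexCode(word[0]);
    for (std::size_t i = 1; i < word.size() && out.size() < 4; ++i) {
        char code = soundexCode(word[i]);
        if (code != '0' && code != prev) out += code;
        prev = code;
    }
    out.resize(4, '0');
    return out;
}

int main() {
    std::cout << soundex("Robert") << " " << soundex("Rupert") << "\n";  // both R163
}

Two strings "match" under SOUNDEX when their codes are equal, so it is a coarse, phonetic kind of matching rather than a graded distance.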

Related

Levenshtein distance in regular expression

Is it possible to include Levenshtein distance in a regular expression query?
(Except by making a union of variants, like this, to search for "hello" with Levenshtein distance 1:
.ello | h.llo | he.lo | hel.o | hell.
since this is stupid and unusable for larger Levenshtein distances.)
There are a couple of regex dialects out there with an approximate matching feature - namely the TRE library and the regex PyPI module for Python.
The TRE approximate matching syntax is described in the "Approximate matching settings" section at https://laurikari.net/tre/documentation/regex-syntax/. A TRE regex to match stuff within Levenshtein distance 1 of hello would be:
(hello){~1}
The regex module's approximate matching syntax is described at https://pypi.org/project/regex/ in the bullet point that begins with the text Approximate “fuzzy” matching. A regex using the regex module to match stuff within Levenshtein distance 1 of hello would be:
(hello){e<=1}
Perhaps one or the other of these syntaxes will in time be adopted by other regex implementations, but at present I only know of these two.
You can generate the regex programmatically. I will leave that as an exercise for the reader, but for the output of this hypothetical function (given an input of "word") you want something like this string:
"^(?>word|wodr|wrod|owrd|word.|wor.d|wo.rd|w.ord|.word|wor.?|wo.?d|w.?rd|.?ord)$"
In English, first you try to match on the word itself, then on every possible single transposition, then on every possible single insertion, then on every possible single omission or substitution (can be done simultaneously).
The length of that string, given a word of length n, is linear (and notably not exponential) in n.
Which is reasonable, I think.
You pass this to your regex generator (like in Ruby it would be Regexp.new(str)) and bam, you got a matcher for ANY word with a Damerau-Levenshtein distance of 1 from a given word.
(Damerau-Levenshtein distances of 2 are far more complicated.)
Note the use of the (?> non-backtracking construct, which means the order of the individual |'d expressions in that output matters.
I could not think of a way to "compact" that expression.
EDIT: I got it to work, at least in Elixir! https://github.com/pmarreck/elixir-snippets/blob/master/damerau_levenshtein_distance_1.exs
I wouldn't necessarily recommend this though (except for educational purposes) since it will only get you to distances of 1; a legit D-L library will let you compute distances > 1. Although since this is regex, it would probably work pretty fast once constructed (note that you should save the "compiled" regex somewhere since this code currently reconstructs it on EVERY comparison!)
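For a C++ flavor of the same idea, here is a rough sketch (names are mine, and it assumes the word contains no regex metacharacters) that builds the alternation described above: the word itself, every single transposition, every single insertion, and every single omission-or-substitution. Note that the (?> atomic group is not supported by std::regex; you would need an engine such as Boost.Regex or PCRE, or swap it for a plain non-capturing group and accept the ordering caveat.

#include <iostream>
#include <string>

// Build a regex source string matching any word within Damerau-Levenshtein
// distance 1 of `word`, following the alternation layout described above.
std::string fuzzyDistanceOneRegex(const std::string& word) {
    std::string alts = word;  // exact match first
    // Every single transposition of adjacent characters, taken from the end.
    for (std::size_t i = word.size(); i >= 2; --i) {
        std::string t = word;
        std::swap(t[i - 2], t[i - 1]);
        alts += "|" + t;
    }
    // Every single insertion: a '.' wildcard at each position, from the end.
    for (std::size_t i = word.size() + 1; i-- > 0;) {
        std::string t = word;
        t.insert(i, ".");
        alts += "|" + t;
    }
    // Every single omission or substitution: replace each character with ".?", from the end.
    for (std::size_t i = word.size(); i-- > 0;) {
        std::string t = word;
        t.replace(i, 1, ".?");
        alts += "|" + t;
    }
    // (?> ... ) is an atomic group, so the order of the alternatives matters.
    return "^(?>" + alts + ")$";
}

int main() {
    std::cout << fuzzyDistanceOneRegex("word") << "\n";
    // Expected: ^(?>word|wodr|wrod|owrd|word.|wor.d|wo.rd|w.ord|.word|wor.?|wo.?d|w.?rd|.?ord)$
}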
Is there any possibility of including Levenshtein distance in a regular expression query?
No, not in a sane way. Implementing - or using an existing - Levenshtein distance algorithm is the way to go.

Create regex from samples algorithm

AFAIK no one has implemented an algorithm that takes a set of strings and substrings and gives back one or more regular expressions that would match the given substrings inside the strings. So, for instance, if I gave my algorithm these two samples:
string1 = "fwef 1234 asdfd"
substring1 = "1234"
string2 = "asdf456fsdf"
substring2 = "456"
The algorithm would give me the regular expression "[0-9]*" back. I know it could give more than one regex or even no possible regex back, and you might find 1000 reasons why such an algorithm would be close to impossible to implement to perfection. But what's the closest thing?
I don't really care about the regex itself, either. Basically what I want is an algorithm that takes samples like the ones above and then finds a pattern in them that can be used to easily find the "kind" of text I want in a string, without having to write any regex or code manually.
I don't have proof, but I suspect no such discrete algorithm with a finite output could exist, since you are asking for the creation of a regular language which could be "large" with respect to the input size.
With that, I suggest you peek at txt2re which can break down sample texts one-by-one and help you build regexes.
FlashFill, a new feature of MS Excel 2013, does exactly the task you want, but it does not give you the regular expression. It's an NP-complete problem and an open question for practical purposes. If you're interested in how to synthesise string manipulation from multiple examples, go to the official Flash Fill website and read a few papers. They have pseudo-code and demo movies as well.
There are many such algorithms, in fact. This is a research area called "grammatical inference".
I know RPNI, for example (you could also look at the probabilistic branch: ALERGIA, MDI, DEES). These algorithms generate DFAs (deterministic finite automata). In fact, you absolutely don't need to enter the strings in your example, only the substrings.
There are also some algorithms that directly generate non-deterministic automata.
Of course, getting a regular expression from a non-deterministic automaton is easy.
The main ideas are simple:
Generate a PTA (prefix tree acceptor) from your sample.
Then, you have to try to "merge" some states. From these merges, loops will emerge (i.e., * in the regular expression). All the difficulty lies in choosing the right rule for merging.
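As a rough illustration of the first step only (names are mine, and the state-merging part is deliberately not shown): build a trie over the positive samples, where every sample's end node is accepting; merging states of this automaton is what then introduces loops.

#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

// A node of the prefix tree acceptor: one child per character, plus an accepting flag.
struct PtaNode {
    std::map<char, std::unique_ptr<PtaNode>> children;
    bool accepting = false;
};

// Build the prefix tree acceptor from positive samples (step 1 above).
std::unique_ptr<PtaNode> buildPta(const std::vector<std::string>& samples) {
    auto root = std::make_unique<PtaNode>();
    for (const auto& s : samples) {
        PtaNode* node = root.get();
        for (char c : s) {
            auto& child = node->children[c];
            if (!child) child = std::make_unique<PtaNode>();
            node = child.get();
        }
        node->accepting = true;
    }
    return root;
}

// Print the automaton as indented transitions, just to visualize it.
void dump(const PtaNode* node, int depth = 0) {
    for (const auto& [c, child] : node->children) {
        std::cout << std::string(depth * 2, ' ') << "--" << c << "-->"
                  << (child->accepting ? " (accept)" : "") << "\n";
        dump(child.get(), depth + 1);
    }
}

int main() {
    auto pta = buildPta({"1234", "456", "42"});
    dump(pta.get());
}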
Here you go:
re = '|'.join(substrings)
If you want anything more general, your algorithm is going to have to make educated guesses about what type of strings are acceptable as matches, and it's trivial to demonstrate that no procedure can account for all possible sets of possible inputs without simply enumerating them all. For instance, consider some of these scenarios:
Match all prime numbers
Match hexadecimal strings, but no strings containing 'f' are in the sample set
Match the same string repeated twice
Match any even-length string
The root problem is that your question is incompletely specified. If you have a more specific requirement, that might be solvable, depending on what it is.

string comparison with the most similar string

Does anyone know if there exists an algorithm that, given one string A and an array of strings B, compares A with all the strings in B and gives as output the most similar one?
For "the most similar one" I mean that for example,
if the A string is: "hello world how are you"
then
"asdf asdewr hello world how asfrqr you"
is more similar than:
"h2ll4 w1111 h11 111 111"
The usual measurement for this is the Levenshtein distance. Compute the Levenshtein distance from the original to each candidate, and take the smallest distance as the most likely candidate.
Define similarity. Algorithms that can do this include:
Levenshtein/LCS/n-gram distance (compare the string with each of the strings in your set, take the one with lowest distance)
tf-idf indexing
Levenshtein automata
Hopfield networks
BK-trees
All of which can feasibly be implemented in C or C++. Google "string similarity", "duplicate finding" or "record linkage" for the available metrics and algorithms.
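A compact sketch of one of those options, a BK-tree keyed on Levenshtein distance (all names here are mine): candidates are inserted once, and a query then only explores subtrees whose edge labels are compatible with the triangle inequality.

#include <algorithm>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Plain dynamic-programming Levenshtein distance.
int levenshtein(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
    for (int j = 0; j <= (int)b.size(); ++j) prev[j] = j;
    for (int i = 1; i <= (int)a.size(); ++i) {
        curr[0] = i;
        for (int j = 1; j <= (int)b.size(); ++j)
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)});
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// BK-tree node: children are indexed by their distance to this node's word.
struct BkNode {
    std::string word;
    std::map<int, std::unique_ptr<BkNode>> children;
};

void insert(std::unique_ptr<BkNode>& node, const std::string& word) {
    if (!node) {
        node = std::make_unique<BkNode>();
        node->word = word;
        return;
    }
    int d = levenshtein(word, node->word);
    if (d == 0) return;  // already present
    insert(node->children[d], word);
}

// Collect all words within `tolerance` of the query; the triangle inequality lets
// us skip children whose edge label is outside [d - tolerance, d + tolerance].
void search(const BkNode* node, const std::string& query, int tolerance,
            std::vector<std::string>& out) {
    if (!node) return;
    int d = levenshtein(query, node->word);
    if (d <= tolerance) out.push_back(node->word);
    for (const auto& [dist, child] : node->children)
        if (dist >= d - tolerance && dist <= d + tolerance)
            search(child.get(), query, tolerance, out);
}

int main() {
    std::unique_ptr<BkNode> root;
    for (const std::string& w : {"hello", "hallo", "hell", "world", "help"})
        insert(root, w);
    std::vector<std::string> hits;
    search(root.get(), "helo", 1, hits);
    for (const auto& w : hits) std::cout << w << "\n";  // hello, hell, help (order varies)
}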
This is usually done by checking a bunch of variations of the string that you have... take a look at spelling correction algorithms - e.g. here.

Using the Levenshtein distance in a spell checker

I am working on a spell checker in C++ and I'm stuck at a certain step in the implementation.
Let's say we have a text file with correctly spelled words and an inputted string we would like to check for spelling mistakes. If that string is a misspelled word, I can easily find its correct form by checking all words in the text file and choosing the one that differs from it with a minimum of letters. For that type of input, I've implemented a function that calculates the Levenshtein edit distance between 2 strings. So far so good.
Now, the tough part: what if the inputted string is a combination of misspelled words? For example, "iloevcokies". Taking into account that "i", "love" and "cookies" are words that can be found in the text file, how can I use the already-implemented Levenshtein function to determine which words from the file are suitable for a correction? Also, how would I insert blanks in the correct positions?
Any idea is welcome :)
Spelling correction for phrases can be done in a few ways. One way requires having an index of word bi-grams and tri-grams. These of course could be immense. Another option would be to try permutations of the word with spaces inserted, then doing a lookup on each word in the resulting phrase. Take a look at a simple implementation of a spell checker by Peter Norvig from Google. Either way, consider using an n-gram index for better performance; there are libraries available in C++ for reference.
Google and other search engines are able to do spelling correction on phrases because they have a large index of queries and associated result sets, which allows them to calculate a statistically good guess. Overall, the spelling correction problem can become very complex with methods like context-sensitive correction and phonetic correction. Given that using permutations of possible sub-terms can become expensive, you can utilize certain types of heuristics; however, this can get out of scope quickly.
You may also consider using an existing spelling library, such as aspell.
A starting point for an idea: one of the top hits of your L-distance for "iloevcokies" should be "cookies". If you can change your L-distance function to also track and return a min-index and max-index (i.e., this match is best starting from character 5 and going to character 10) then you could remove that substring and re-check L-distance for the string before it and after it, then concatenate those for a suggestion....
Just a thought, good luck....
I will suppose that you have an existing index on which you run your Levenshtein distance (for example, a trie, but any sorted index generally works well).
You can consider the addition of white-space as a regular edit operation; the twist is just that you then need to get back to the root of your index for the next word.
This way you get the same index, almost the same route, approximately the same traversal, and it should not even impact your running time that much.
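One way to make the space-insertion idea concrete (a sketch only; the dictionary, the brute-force split, and all names are mine): recursively try every split point, correct each piece against the dictionary with the Levenshtein function you already have, and keep the segmentation with the lowest total distance. It is exponential in the input length, so it is only reasonable for short inputs or as a starting point.

#include <algorithm>
#include <climits>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

int levenshtein(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
    for (int j = 0; j <= (int)b.size(); ++j) prev[j] = j;
    for (int i = 1; i <= (int)a.size(); ++i) {
        curr[0] = i;
        for (int j = 1; j <= (int)b.size(); ++j)
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)});
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Best single-word correction for `piece`: the dictionary word with minimal distance.
std::pair<int, std::string> correctWord(const std::string& piece,
                                        const std::vector<std::string>& dict) {
    std::pair<int, std::string> best{INT_MAX, ""};
    for (const auto& w : dict) {
        int d = levenshtein(piece, w);
        if (d < best.first) best = {d, w};
    }
    return best;
}

// Try every split of `s` into pieces, correcting each piece; return
// {total distance, corrected phrase}. Assumes a non-empty dictionary.
std::pair<int, std::string> segment(const std::string& s,
                                    const std::vector<std::string>& dict) {
    if (s.empty()) return {0, ""};
    std::pair<int, std::string> best{INT_MAX, ""};
    for (std::size_t len = 1; len <= s.size(); ++len) {
        auto [headCost, headWord] = correctWord(s.substr(0, len), dict);
        auto [tailCost, tailWords] = segment(s.substr(len), dict);
        if (headCost != INT_MAX && tailCost != INT_MAX && headCost + tailCost < best.first)
            best = {headCost + tailCost,
                    tailWords.empty() ? headWord : headWord + " " + tailWords};
    }
    return best;
}

int main() {
    std::vector<std::string> dict = {"i", "love", "cookies", "cook", "lives"};
    auto [cost, phrase] = segment("iloevcokies", dict);
    std::cout << phrase << " (total edits: " << cost << ")\n";  // e.g. "i love cookies"
}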

Is it possible to calculate the edit distance between a regexp and a string?

If so, please explain how.
Re: what is distance -- "The distance between two strings is defined as the minimal number of edits required to convert one into the other."
For example, xyz to XYZ would take 3 edits, so the string xYZ is closer to XYZ than xyz is.
If the pattern is [0-9]{3} or for instance 123, then a23 would be closer to the pattern than ab3.
How can you find the shortest distance between a regexp and a non-matching string?
The above is the Damerau–Levenshtein distance algorithm.
You can use Finite State Machines to do this efficiently (that is, linear in time). If you use a transducer, you can even write the specification of the transformation fairly compactly and do far more nuanced transformations than simply inserts or deletes - see wikipedia for Finite State Transducer as a starting point, and software such as the FSA toolkit or FSA6 (which has a not entirely stable web-demo) too. There are lots of libraries for FSA manipulation; I don't want to suggest the previous two are your only or best options, just two I've heard of.
If, however, you merely want efficient, approximate searching, a less flexible but already-implemented-for-you option exists: TRE, which has an approximate matching function that returns the cost of the match - i.e., the distance to the match, from your perspective.
If you mean the string with the smallest Levenshtein distance between the closest matched string and a sample, then I'm pretty sure it can be done, but you'd have to convert the regex to a DFA yourself, then try to match, and whenever something fails, non-deterministically continue as if it had passed and keep track of the number of differences. You could use A* search or something similar for this; it would be quite inefficient though (O(2^n) worst case).
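As a concrete version of that idea (a sketch only, with a hand-built DFA rather than an actual regex-to-DFA conversion, and all names mine): run Dijkstra over (DFA state, string position) pairs, where consuming a matching character is free and each insertion, deletion or substitution costs 1. The search itself is then polynomial in the string length times the number of DFA states, although constructing the DFA from a regex can of course blow up.

#include <algorithm>
#include <climits>
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <tuple>
#include <vector>

// A tiny hand-built DFA: each transition is a set of allowed characters plus a target state.
struct Transition { std::string chars; int target; };
struct State { std::vector<Transition> out; bool accepting = false; };

// Minimum number of edits (insert/delete/substitute) needed to turn `input`
// into some string accepted by the DFA (start state 0). Dijkstra over (state, position).
int editDistanceToDfa(const std::vector<State>& dfa, const std::string& input) {
    const int n = (int)input.size();
    std::vector<std::vector<int>> dist(dfa.size(), std::vector<int>(n + 1, INT_MAX));
    using Node = std::tuple<int, int, int>;  // (cost, state, position)
    std::priority_queue<Node, std::vector<Node>, std::greater<Node>> pq;
    dist[0][0] = 0;
    pq.emplace(0, 0, 0);
    while (!pq.empty()) {
        auto [cost, s, p] = pq.top();
        pq.pop();
        if (cost > dist[s][p]) continue;
        auto relax = [&](int ns, int np, int c) {
            if (c < dist[ns][np]) { dist[ns][np] = c; pq.emplace(c, ns, np); }
        };
        if (p < n) relax(s, p + 1, cost + 1);  // delete input[p]
        for (const auto& t : dfa[s].out) {
            relax(t.target, p, cost + 1);       // insert a character the DFA expects
            if (p < n) {                        // consume input[p]: free if it fits, else substitute
                bool ok = t.chars.find(input[p]) != std::string::npos;
                relax(t.target, p + 1, cost + (ok ? 0 : 1));
            }
        }
    }
    int best = INT_MAX;
    for (std::size_t s = 0; s < dfa.size(); ++s)
        if (dfa[s].accepting) best = std::min(best, dist[s][n]);
    return best;
}

int main() {
    // DFA for [0-9]{3}: states 0 -> 1 -> 2 -> 3 (accepting), each edge on any digit.
    std::vector<State> dfa(4);
    for (int i = 0; i < 3; ++i) dfa[i].out.push_back({"0123456789", i + 1});
    dfa[3].accepting = true;
    std::cout << editDistanceToDfa(dfa, "a23") << "\n";  // 1
    std::cout << editDistanceToDfa(dfa, "ab3") << "\n";  // 2
}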