string comparison with the most similar string - c++

Does anyone know if there exists an algorithm that, given a string A and an array of strings B, compares A with every string in B and outputs the most similar one?
For "the most similar one" I mean that for example,
if the A string is: "hello world how are you"
then
"asdf asdewr hello world how asfrqr you"
is more similar than:
"h2ll4 w1111 h11 111 111"

The usual measurement for this is the Levenshtein distance. Compute the Levenshtein distance from the original to each candidate, and take the candidate with the smallest distance as the most likely match.
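
For illustration, here is a minimal sketch of that approach: a textbook dynamic-programming Levenshtein distance plus a loop that keeps the candidate with the smallest distance. The strings are the ones from the question; the function name is only for the example.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Classic O(n*m) dynamic-programming edit distance.
int levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> d(a.size() + 1, std::vector<int>(b.size() + 1));
    for (size_t i = 0; i <= a.size(); ++i) d[i][0] = static_cast<int>(i);
    for (size_t j = 0; j <= b.size(); ++j) d[0][j] = static_cast<int>(j);
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            d[i][j] = std::min({ d[i - 1][j] + 1,                              // deletion
                                 d[i][j - 1] + 1,                              // insertion
                                 d[i - 1][j - 1] + (a[i - 1] != b[j - 1]) });  // substitution
    return d[a.size()][b.size()];
}

int main() {
    std::string a = "hello world how are you";
    std::vector<std::string> candidates = {
        "asdf asdewr hello world how asfrqr you",
        "h2ll4 w1111 h11 111 111"
    };
    std::string best;
    int bestDist = -1;
    for (const auto& c : candidates) {
        int dist = levenshtein(a, c);
        if (bestDist < 0 || dist < bestDist) { bestDist = dist; best = c; }
    }
    std::cout << "closest: " << best << " (distance " << bestDist << ")\n";
}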

Define similarity. Algorithms that can do this include:
Levenshtein/LCS/n-gram distance (compare the string with each of the strings in your set, take the one with lowest distance)
tf-idf indexing
Levenshtein automata
Hopfield networks
BK-trees
All of these can feasibly be implemented in C or C++. Google "string similarity", "duplicate finding" or "record linkage" for the available metrics and algorithms.

This is usually done by checking a bunch of variations of the string that you have ... take a look at spelling correction algorithms - e.g. here

Related

Deciding whether texts or sentences are equivalent in content

The classic example of measuring similarity as a distance is Word Mover's Distance, as for example here: https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html, using a word2vec model trained on GoogleNews-vectors-negative300.bin, with D1="Obama speaks to the media in Illinois", D2="The president greets the press in Chicago", D3="Oranges are my favorite fruit". The calculated WMD distances are distance(D1,D2)=3.3741 and distance(D1,D3)=4.3802, so we understand that (D1,D2) are more similar than (D1,D3). What is the threshold value of the WMD distance for deciding that two sentences actually contain almost the same information? Maybe in the case of sentences D1 and D2 the value of 3.3741 is too large, and in reality these sentences are different? Such decisions need to be made, for example, when there is a question, a sample correct answer, and a student's answer.
Addition after the answer by gojomo:
Let's postpone identification and automatic understanding of logic for later. Let's consider the case where two sentences each enumerate objects, or the properties and actions of one object, stated affirmatively, and we need to evaluate how similar the content of these two sentences is.
I don't believe there's any absolute threshold that could be used as you wish.
The "Word Mover's Distance" can offer some impressive results in finding highly-similar texts, especially in relative comparison to other candidate texts.
However, its magnitude may be affected by the sizes of the texts, and further it has no understanding of rigorous grammar/semantics. Thus things like subtle negations or contrasts, or things that would be nonsense to a native speaker, won't be highlighted as very "different" from other statements.
For example, the two phrases "Many historians agree Obama is absolutely positively the best President of the 21st century", and "Many historians agree Obama is absolutely positively not the best President of the 21st century", will be noted as incredibly similar by most measures based on word-statistics, such as Word Mover's Distance. Yet, the insertion of one word means they convey somewhat opposite ideas.

Parent string of two given strings

Given 2 strings, we have to find a string smallest in length such that both given strings are subsequences of it. In other words, we need to find a string from which deleting some characters yields each of the given strings. I was thinking of brute force and LCS, but in vain.
12345 and 11234 should result in 112345
WWA and WWS have an answer WWAS
LCS (the DP version) is pretty memory-inefficient and brute force is just childish. What should I do?
Perhaps you could do a global alignment with Needleman-Wunsch and a high mismatch penalty, to prefer indels. At the end, merge the alignment into a "parent string" by taking letters from matching positions, and then a letter from either of the inserted letters, e.g.:
WW-A
||
WWS-
WWSA
Or:
-12345
 ||||
11234-
112345
Memory is O(nm), but a modification narrows that down to O(min(n,m)).
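
If it helps, the same merge can be expressed as the textbook "shortest common supersequence" dynamic program; below is a sketch of that recurrence (full O(nm) table kept for the reconstruction), using the examples from the question.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// dp[i][j] = length of the shortest string containing a[0..i) and b[0..j)
// as subsequences; the answer is reconstructed by walking the table back.
std::string shortestCommonSupersequence(const std::string& a, const std::string& b) {
    const size_t n = a.size(), m = b.size();
    std::vector<std::vector<int>> dp(n + 1, std::vector<int>(m + 1));
    for (size_t i = 0; i <= n; ++i) dp[i][0] = static_cast<int>(i);
    for (size_t j = 0; j <= m; ++j) dp[0][j] = static_cast<int>(j);
    for (size_t i = 1; i <= n; ++i)
        for (size_t j = 1; j <= m; ++j)
            dp[i][j] = (a[i - 1] == b[j - 1])
                           ? dp[i - 1][j - 1] + 1
                           : std::min(dp[i - 1][j], dp[i][j - 1]) + 1;

    std::string out;
    size_t i = n, j = m;
    while (i > 0 && j > 0) {
        if (a[i - 1] == b[j - 1])              { out += a[--i]; --j; }
        else if (dp[i - 1][j] <= dp[i][j - 1]) { out += a[--i]; }
        else                                   { out += b[--j]; }
    }
    while (i > 0) out += a[--i];
    while (j > 0) out += b[--j];
    return std::string(out.rbegin(), out.rend());
}

int main() {
    std::cout << shortestCommonSupersequence("12345", "11234") << "\n"; // 112345
    std::cout << shortestCommonSupersequence("WWA", "WWS") << "\n";     // WWSA (WWAS is equally valid)
}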
There's a well-defined algorithm in the standard library which would serve your purpose:
set_union ();
The condition is that your input ranges must be sorted.
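
For what it's worth, a minimal sketch of that suggestion; note that the sorted-input precondition means it only applies directly to cases like the 12345/11234 example, where both inputs happen to be sorted.

#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>

int main() {
    std::string a = "12345", b = "11234";   // both already sorted
    std::string out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
    std::cout << out << "\n";               // prints 112345
}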

String matching algorithm

Say I have 3 strings. And then 1 more string.
Is there an algorithm that would allow me to find which one of the first 3 strings matches the 4th string the most?
None of the strings are going to be exact matches, I'm just trying to find the closest match.
And if the algorithm already exists in STL, that would be nice.
Thanks in advance.
You don't specify what exactly you mean by "matches the most", so I assume you don't have precise requirements. In that case, Levenshtein distance is a reasonable metric. Simply compute the Levenshtein distance between each of the three strings and the fourth, and pick the one that gives the lowest distance.
You can implement the Levenshtein Distance algorithm, it provides a very nice measure of how close a match between two strings you have. It measures how many keystrokes you need to make in order to turn one string into the other. You can find a C++ implementation here.
Compute Levenshtein Distance between string #4 and the three strings that you have. Pick the string with the smallest distance.
There's nothing ready in the STL, but what you need is some kind of string metric.
Use Levenshtein distance if the strings are similar up to some typos, e.g. Hello vs Helol.
Use Jaccard distance/Dice's coefficient on the set of n-grams if the strings might change more drastically, like hello world versus world hello.
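
As a rough sketch of the second option, Dice's coefficient on character-bigram sets is 2 * |common bigrams| / (|bigrams of A| + |bigrams of B|); a value near 1 means very similar, and word-order changes barely affect it. The helper names are just for the example.

#include <iostream>
#include <set>
#include <string>

// Collect the set of two-character substrings of s.
std::set<std::string> bigrams(const std::string& s) {
    std::set<std::string> out;
    for (size_t i = 0; i + 1 < s.size(); ++i) out.insert(s.substr(i, 2));
    return out;
}

// Dice's coefficient on bigram sets: 2 * |A intersect B| / (|A| + |B|).
double dice(const std::string& a, const std::string& b) {
    std::set<std::string> A = bigrams(a), B = bigrams(b);
    size_t common = 0;
    for (const auto& g : A) common += B.count(g);
    return (A.empty() && B.empty()) ? 1.0 : 2.0 * common / (A.size() + B.size());
}

int main() {
    std::cout << dice("hello world", "world hello") << "\n"; // high (0.8): word order barely matters
    std::cout << dice("hello world", "h2ll4 w1111") << "\n"; // much lower (about 0.22)
}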
You have an approximate string matching problem. Depending on what kind of matching you want to perform, you will use a different algorithm. There are many: SOUNDEX, Jaro-Winkler, Levenshtein Distance, Metaphone, etc. Regarding the STL, I don't know of any functions that implement those algorithms, but you can take a look here for some source using C++. Also, note that if you are getting your strings from a database, it is very likely that your database engine implements some of those algorithms (most likely SOUNDEX).

Using the Levenshtein distance in a spell checker

I am working on a spell checker in C++ and I'm stuck at a certain step in the implementation.
Let's say we have a text file with correctly spelled words and an inputted string we would like to check for spelling mistakes. If that string is a misspelled word, I can easily find its correct form by checking all words in the text file and choosing the one that differs from it by the fewest letters. For that type of input, I've implemented a function that calculates the Levenshtein edit distance between 2 strings. So far so good.
Now, the tough part: what if the inputted string is a combination of misspelled words? For example, "iloevcokies". Taking into account that "i", "love" and "cookies" are words that can be found in the text file, how can I use the already-implemented Levenshtein function to determine which words from the file are suitable for a correction? Also, how would I insert blanks in the correct positions?
Any idea is welcome :)
Spelling correction for phrases can be done in a few ways. One way requires having an index of word bi-grams and tri-grams. These of course could be immense. Another option would be to try permutations of the word with spaces inserted, then do a lookup on each word in the resulting phrase. Take a look at a simple implementation of a spell checker by Peter Norvig from Google. Either way, consider using an n-gram index for better performance; there are libraries available in C++ for reference.
Google and other search engines are able to do spelling correction on phrases because they have a large index of queries and associated result sets, which allows them to calculate a statistically good guess. Overall, the spelling correction problem can become very complex with methods like context-sensitive correction and phonetic correction. Given that using permutations of possible sub-terms can become expensive, you can use certain kinds of heuristics, but this can get out of hand quickly.
You may also consider using an existing spelling library, such as aspell.
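
As a very rough sketch of the "insert spaces, then look up each piece" idea: the recursion below splits the input against a tiny illustrative dictionary using exact lookups only, so it handles "ilovecookies"; for the misspelled "iloevcokies" you would replace the exact dict.count() check with a lookup that tolerates a small Levenshtein distance to each piece.

#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

// Try every split position; keep a prefix only if it is a known word,
// then recurse on the rest of the input.
bool segment(const std::string& s, size_t pos,
             const std::unordered_set<std::string>& dict,
             std::vector<std::string>& words) {
    if (pos == s.size()) return true;
    for (size_t len = 1; pos + len <= s.size(); ++len) {
        std::string piece = s.substr(pos, len);
        if (dict.count(piece)) {
            words.push_back(piece);
            if (segment(s, pos + len, dict, words)) return true;
            words.pop_back();
        }
    }
    return false;
}

int main() {
    std::unordered_set<std::string> dict = {"i", "love", "cookies"};
    std::vector<std::string> words;
    if (segment("ilovecookies", 0, dict, words)) {
        for (const auto& w : words) std::cout << w << ' ';   // prints: i love cookies
        std::cout << "\n";
    }
}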
A starting point for an idea: one of the top hits of your L-distance for "iloevcokies" should be "cookies". If you can change your L-distance function to also track and return a min-index and max-index (i.e., this match is best starting from character 5 and going to character 10) then you could remove that substring and re-check L-distance for the string before it and after it, then concatenate those for a suggestion....
Just a thought, good luck....
I will suppose that you have an existing index on which you run your Levenshtein distance (for example a Trie, but any sorted index generally works well).
You can consider the addition of white-space as a regular edit operation; it's just that there is a twist: you then need to get back to the root of your index for the next word.
This way you get the same index, almost the same route, approximately the same traversal, and it should not even impact your running time that much.

Is it possible to calculate the edit distance between a regexp and a string?

If so, please explain how.
Re: what is distance -- "The distance between two strings is defined as the minimal number of edits required to convert one into the other."
For example, xyz to XYZ would take 3 edits, so the string xYZ is closer to XYZ than xyz is.
If the pattern is [0-9]{3} or for instance 123, then a23 would be closer to the pattern than ab3.
How can you find the shortest distance between a regexp and a non-matching string?
The above is the Damerau–Levenshtein distance algorithm.
You can use Finite State Machines to do this efficiently (that is, linear in time). If you use a transducer, you can even write the specification of the transformation fairly compactly and do far more nuanced transformations than simply inserts or deletes - see wikipedia for Finite State Transducer as a starting point, and software such as the FSA toolkit or FSA6 (which has a not entirely stable web-demo) too. There are lots of libraries for FSA manipulation; I don't want to suggest the previous two are your only or best options, just two I've heard of.
If, however, you merely want efficient, approximate searching, a less flexible but already-implemented-for-you option exists: TRE, which has an approximate matching function that returns the cost of the match - i.e., the distance to the match, from your perspective.
If you mean the string with the smallest Levenshtein distance between the closest matched string and a sample, then I'm pretty sure it can be done, but you'd have to convert the regex to a DFA yourself, then try to match, and whenever something fails, non-deterministically continue as if it had passed and keep track of the number of differences. You could use A* search or something similar for this; it would be quite inefficient though (O(2^n) worst case).
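
To make the DFA idea concrete, here is a hedged sketch that computes the minimum number of edits needed to turn a string into some string accepted by a DFA, using a hand-built DFA for the [0-9]{3} pattern from the question (a real solution would construct the DFA from the regex automatically). All names are illustrative, not a library API.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    const std::string s = "a23";                 // sample input from the question
    const int START = 0, ACCEPT = 3, DEAD = 4;   // states 0..3 count matched digits
    const int NSTATES = 5;
    const int INF = 1 << 29;

    // Hand-built DFA transition for [0-9]{3}: each digit advances one state,
    // anything else (or a fourth character) goes to the dead state.
    auto delta = [&](int q, char c) {
        if (q >= ACCEPT) return DEAD;
        return (c >= '0' && c <= '9') ? q + 1 : DEAD;
    };

    // A small representative alphabet suffices for this DFA: the digits plus
    // whatever actually occurs in the input string.
    const std::string alphabet = std::string("0123456789") + s;

    // d[i][q] = minimum edits so that s[0..i) becomes some string that drives
    // the DFA from START to state q.
    std::vector<std::vector<int>> d(s.size() + 1, std::vector<int>(NSTATES, INF));
    d[0][START] = 0;

    for (size_t i = 0; i <= s.size(); ++i) {
        // Insertions: the DFA consumes a character while the input does not.
        // A few relaxation passes reach the fixpoint (chains are < NSTATES long).
        for (int pass = 0; pass < NSTATES; ++pass)
            for (int q = 0; q < NSTATES; ++q)
                if (d[i][q] < INF)
                    for (char c : alphabet) {
                        int r = delta(q, c);
                        d[i][r] = std::min(d[i][r], d[i][q] + 1);
                    }
        if (i == s.size()) break;
        for (int q = 0; q < NSTATES; ++q) {
            if (d[i][q] >= INF) continue;
            // Deletion: drop s[i] and stay in the same state.
            d[i + 1][q] = std::min(d[i + 1][q], d[i][q] + 1);
            // Match/substitution: the DFA consumes c while the input consumes s[i].
            for (char c : alphabet) {
                int r = delta(q, c);
                d[i + 1][r] = std::min(d[i + 1][r], d[i][q] + (c == s[i] ? 0 : 1));
            }
        }
    }
    std::cout << "edit distance from \"" << s << "\" to [0-9]{3}: "
              << d[s.size()][ACCEPT] << "\n";    // prints 1 (substitute the 'a')
}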