Damerau–Levenshtein distance for language-specific quirks

To Dutch-speaking people, the two characters "ij" are considered a single letter that is easily exchanged with "y".
For a project I'm working on, I would like a variant of the Damerau–Levenshtein distance that calculates the distance between "ij" and "y" as 1 instead of the current value of 2.
I've been trying this myself but failed. My problem is that I have no clue how to handle the fact that the two texts can have different lengths.
Does anyone have a suggestion/code fragment on how to solve this?
Thanks.

The Wikipedia article is rather loose with terminology. There are no such things as "strings" in "natural language". There are phonemes in natural language which can be represented by written characters and character-combinations.
Some character-combinations are vestiges of historical conventions which have survived into modern times, as in modern English "rough" where the "gh" can sound like -f- or make no sound at all. It seems to me that in focusing on raw "strings" the algorithm must be agnostic about the historical relationship of language and orthographic convention, which leads to some arbitrary metrics whenever character-combinations correlate to a single phoneme. How would it measure "rough" to "ruf"? Or "through" to "thru"?
Or German o-umlaut to "oe"?
In your case the -y- can be exchanged phonetically and orthographically with -ij-. So what is that according to the algorithm, two deletions followed by an insertion, or a single deletion of the -j- or of the -i- followed by a transposition of the remaining character to -y-? Or is -ij- being coalesced and the coalescence is followed by a transposition?
I would recommend that you substitute another, otherwise unused character for -ij- before applying the algorithm, perhaps U+00EC, Latin small letter i with grave accent.
How does the algorithm handle multi-codepoint characters?
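For what it's worth, the substitution idea is easy to prototype: collapse every "ij" into a single placeholder code unit before handing the strings to an unmodified distance routine, so that the placeholder and "y" become an ordinary one-cost substitution. A rough C++ sketch; the placeholder '\x01' is an arbitrary choice for illustration:

    #include <iostream>
    #include <string>

    // Collapse the digraph "ij" into a single unused code unit so that a plain
    // Damerau-Levenshtein routine sees it as one character.
    std::string collapse_ij(std::string s) {
        const char placeholder = '\x01';       // arbitrary; any code unit that cannot occur in the input
        std::string::size_type pos = 0;
        while ((pos = s.find("ij", pos)) != std::string::npos) {
            s.replace(pos, 2, 1, placeholder); // two characters become one
            ++pos;
        }
        return s;
    }

    int main() {
        std::string before = "vrij";           // Dutch for "free"
        std::string after  = collapse_ij(before);
        std::cout << before.size() << " -> " << after.size() << '\n';   // prints "4 -> 3"
    }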

Well, the D-L distance itself isn't going to handle this for you, due to the way it measures distances.
As there is no code (or language) involved here, I can only suggest that you ensure all strings adhere to the same structure.
To clarify the situation, since you're asking in general terms:
bear in mind that the D-L distance compares character for character and doesn't actually read your strings as such, so you'll have to parse before you compare; otherwise, cases where "ij" should not be exchanged with "y" will cause other issues instead.

One idea is to translate each string into some sort of constructed orthographemic representation, in which digraphs such as "ij" and the English "gh", "th" and friends are only one character long. The distance metric does not have to be equal for all types of replacements when doing Damerau–Levenshtein, so you can use whatever penalties you want; but the table needs to be filled locally, so you really want each sound to be one cell in the table.
This breaks, however, when the "ij" was not intended as a digraph but is a misspelling or sits at a word-segmentation border (I don't know if that can happen in Dutch), or in any other situation where it is not actually (meant as) a digraph.
Otherwise you will need to do some lookaround; this will complicate things but should not change the growth order of the algorithm (I believe), provided you only look at a constant number of cells around. The constant factors will still be much bigger, though.
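To make the idea concrete, here is a rough sketch (not tuned against real Dutch data) of the restricted Damerau-Levenshtein (optimal string alignment) variant run over strings in which "ij" has been collapsed to a single placeholder; the cost hook shows where language-specific penalties could be plugged in, and the values used are purely illustrative:

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    const char IJ = '\x01';                       // placeholder for the digraph "ij"

    std::string collapse(std::string s) {
        std::string::size_type pos = 0;
        while ((pos = s.find("ij", pos)) != std::string::npos)
            s.replace(pos++, 2, 1, IJ);
        return s;
    }

    int sub_cost(char a, char b) {
        if (a == b) return 0;
        if ((a == IJ && b == 'y') || (a == 'y' && b == IJ))
            return 1;                             // ij <-> y: one edit, as requested; lower it further if desired
        return 1;                                 // any other substitution
    }

    // Restricted Damerau-Levenshtein (optimal string alignment).
    int osa_distance(const std::string& s, const std::string& t) {
        const std::size_t m = s.size(), n = t.size();
        std::vector<std::vector<int>> d(m + 1, std::vector<int>(n + 1, 0));
        for (std::size_t i = 0; i <= m; ++i) d[i][0] = static_cast<int>(i);
        for (std::size_t j = 0; j <= n; ++j) d[0][j] = static_cast<int>(j);
        for (std::size_t i = 1; i <= m; ++i) {
            for (std::size_t j = 1; j <= n; ++j) {
                d[i][j] = std::min({ d[i - 1][j] + 1,                           // deletion
                                     d[i][j - 1] + 1,                           // insertion
                                     d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]) });
                if (i > 1 && j > 1 && s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1])
                    d[i][j] = std::min(d[i][j], d[i - 2][j - 2] + 1);           // transposition
            }
        }
        return d[m][n];
    }

    int main() {
        // prints 1; plain Damerau-Levenshtein on the raw strings would give 2
        std::cout << osa_distance(collapse("vrij"), collapse("vry")) << '\n';
    }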

Deciding whether texts or sentences are equivalent in content

The classic example of determining similarity as a distance is Word Mover's Distance, as shown for example here: https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html,
using a word2vec model on GoogleNews-vectors-negative300.bin, with D1="Obama speaks to the media in Illinois", D2="The president greets the press in Chicago", D3="Oranges are my favorite fruit". The calculated WMD distances are distance(D1,D2)=3.3741 and distance(D1,D3)=4.3802, so we understand that (D1,D2) are more similar than (D1,D3). What is the threshold value for the WMD distance to decide that two sentences actually contain almost the same information? Maybe in the case of sentences D1 and D2 the value of 3.3741 is too large, and in reality these sentences are different? Such decisions need to be made, for example, when there is a question, a sample of the correct answer, and a student's answer.
Addition after the answer by gojomo:
Let's postpone identification and automatic understanding of logic for later. Let's consider the case where two sentences each contain an enumeration of objects, or of the properties and actions of one object stated positively, and we need to evaluate how similar the content of these two sentences is.
I don't believe there's any absolute threshold that could be used as you wish.
The "Word Mover's Distance" can offer some impressive results in finding highly-similar texts, especially in relative comparison to other candidate texts.
However, its magnitude may be affected by the sizes of the texts, and further it has no understanding of rigorous grammar/semantics. Thus things like subtle negations or contrasts, or things that would be nonsense to a native speaker, won't be highlighted as very "different" from other statements.
For example, the two phrases "Many historians agree Obama is absolutely positively the best President of the 21st century", and "Many historians agree Obama is absolutely positively not the best President of the 21st century", will be noted as incredibly similar by most measures based on word-statistics, such as Word Mover's Distance. Yet, the insertion of one word means they convey somewhat opposite ideas.

How to compare a "basic_string" using an arbitary locale

I'm re-posting a question I submitted earlier today but I'm now citing a specific example in response to the feedback I received. The original question can be found here (note that it's not a homework assignment):
I'm simply trying to determine if C++ makes it impossible to perform an (efficient) case-INsensitive comparison of a basic_string object that also factors in any arbitrary locale object. For instance, it doesn't appear to be possible to write an efficient function such as the following:
bool AreStringsEqualIgnoreCase(const string &str1, const string &str2, const locale &loc);
Based on my current understanding (can someone confirm this?), this function has to call both ctype::toupper() and collate::compare() for the given locale (extracted as always using use_facet()). However, because collate::compare() in particular requires 4 pointer args, you either need to pass these 4 args for every char you need to compare (after first calling ctype::toupper()), or alternatively, convert both strings to uppercase first and then make a single call to collate::compare().
The 1st approach is obviously inefficient (4 pointers to pass for each char tested), and the 2nd requires you to convert both strings to uppercase in their entirety (requiring allocation of memory and needless copying/converting of both strings to uppercase). Am I correct about this, i.e., that it's not possible to do it efficiently (because there's no way around collate::compare())?
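For reference, the asker's second approach looks roughly like this (a sketch only; as the answer below explains, per-character upper-casing is not a correct case-fold for every locale):

    #include <locale>
    #include <string>

    // Upper-case both strings with the locale's ctype facet, then make a single
    // call to collate::compare(). Illustrative, not linguistically correct.
    bool AreStringsEqualIgnoreCase(const std::string& str1,
                                   const std::string& str2,
                                   const std::locale& loc)
    {
        const auto& ct  = std::use_facet<std::ctype<char>>(loc);
        const auto& col = std::use_facet<std::collate<char>>(loc);

        std::string a(str1), b(str2);
        if (!a.empty()) ct.toupper(&a[0], &a[0] + a.size());   // in-place range overload
        if (!b.empty()) ct.toupper(&b[0], &b[0] + b.size());

        return col.compare(a.data(), a.data() + a.size(),
                           b.data(), b.data() + b.size()) == 0;
    }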
One of the little annoyances about trying to deal in a consistent way with all the world's writing systems is that practically nothing you think you know about characters is actually correct. This makes it tricky to do things like "case-insensitive comparison". Indeed, it is tricky to do any form of locale-aware comparison, and case-insensitivity is additionally thorny.
With some constraints, though, it is possible to accomplish. The algorithm needed can be implemented "efficiently" using normal programming practices (and precomputation of some static data), but it cannot be implemented as efficiently as an incorrect algorithm. It is often possible to trade off correctness for speed, but the results are not pleasant. Incorrect but fast locale implementations may appeal to those whose locales are implemented correctly, but are clearly unsatisfactory for the part of the audience whose locales produce unexpected results.
Lexicographical ordering doesn't work for human beings
Most locales (other than the "C" locale) for languages which have case already handle letter case in the manner expected, which is to use case differences only after all other differences have been taken into account. That is, if a list of words are sorted in the locale's collation order, then words in the list which differ only in case are going to be consecutive. Whether the words with upper case come before or after words with lower case is locale-dependent, but there won't be other words in between.
That result cannot be achieved by any single-pass left-to-right character-by-character comparison ("lexicographical ordering"). And most locales have other collation quirks which also don't yield to naïve lexicographical ordering.
Standard C++ collation should be able to deal with all of these issues, if you have appropriate locale definitions. But it cannot be reduced to lexicographical comparison just using a comparison function over pairs of wchar_t, and consequently the C++ standard library doesn't provide that interface.
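What the standard library does provide is whole-string collation via the locale object itself, which is usable directly as a comparison functor. A minimal sketch, assuming the named locale (here "sv_SE.UTF-8", chosen arbitrarily) is installed on the system:

    #include <algorithm>
    #include <iostream>
    #include <locale>
    #include <stdexcept>
    #include <string>
    #include <vector>

    int main() {
        std::vector<std::string> words = {"zebra", "äpple", "apelsin", "öl"};
        try {
            std::locale sv("sv_SE.UTF-8");              // depends on the host system
            std::sort(words.begin(), words.end(), sv);  // std::locale has operator()(s1, s2) for collation
        } catch (const std::runtime_error&) {
            std::sort(words.begin(), words.end());      // fall back to plain bytewise order
        }
        for (const auto& w : words) std::cout << w << '\n';
    }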
The following is just a few examples of why locale-aware collation is complicated; a longer explanation, with a lot more examples, is found in Unicode Technical Standard 10.
Where do the accents go?
Most romance languages (and also English, when dealing with borrowed words) consider accents over vowels to be a secondary characteristic; that is, words are first sorted as though the accents weren't present, and then a second pass is made in which unaccented letters come before accented letters. A third pass is necessary to deal with case, which is ignored in the first two passes.
But that doesn't work for Northern European languages. The alphabets of Swedish, Norwegian and Danish have three extra vowels, which follow z in the alphabet. In Swedish, these vowels are written å, ä, and ö; in Norwegian and Danish, these letters are written å, æ, and ø, and in Danish å is sometimes written aa, making Aarhus the last entry in an alphabetical list of Danish cities.
In German, the letters ä, ö, and ü are generally alphabetised as with romance accents, but in German phonebooks (and sometimes other alphabetical lists), they are alphabetised as though they were written ae, oe and ue, which is the older style of writing the same phonemes. (There are many pairs of common surnames, such as "Müller" and "Mueller", which are pronounced the same and are often confused, so it makes sense to intercollate them. A similar convention was used for Scottish names in Canadian phonebooks when I was young; the spellings M', Mc and Mac were all clumped together since they are all phonetically identical.)
One symbol, two letters. Or two letters, one symbol
German also has the symbol ß which is collated as though it were written out as ss, although it is not quite identical phonetically. We'll meet this interesting symbol again a bit later.
In fact, many languages consider digraphs and even trigraphs to be single letters. The 44-letter Hungarian alphabet includes Cs, Dz, Dzs, Gy, Ly, Ny, Sz, Ty, and Zs, as well as a variety of accented vowels. However, the language most commonly referenced in articles about this phenomenon -- Spanish -- stopped treating the digraphs ch and ll as letters in 1994, presumably because it was easier to force Hispanic writers to conform to computer systems than to change the computer systems to deal with Spanish digraphs. (Wikipedia claims it was pressure from "UNESCO and other international organizations"; it took quite a while for everyone to accept the new alphabetization rules, and you still occasionally find "Chile" after "Colombia" in alphabetical lists of South American countries.)
Summary: comparing character strings requires multiple passes, and sometimes requires comparing groups of characters
Making it all case-insensitive
Since locales handle case correctly in comparison, it should not really be necessary to do case-insensitive ordering. It might be useful to do case-insensitive equivalence-class checking ("equality" testing), although that raises the question of what other imprecise equivalence classes might be useful. Unicode normalization, accent deletion, and even transcription to latin are all reasonable in some contexts, and highly annoying in others. But it turns out that case conversions are not as simple as you might think, either.
Because of the existence of di- and trigraphs, some of which have Unicode codepoints, the Unicode standard actually recognizes three cases, not two: lower-case, upper-case and title-case. The last is what you use to upper case the first letter of a word, and it's needed, for example, for the Croatian digraph dž (U+01C6; a single character), whose uppercase is DŽ (U+01C4) and whose title case is Dž (U+01C5). The theory of "case-insensitive" comparison is that we could transform (at least conceptually) any string in such a way that all members of the equivalence class defined by "ignoring case" are transformed to the same byte sequence. Traditionally this is done by "upper-casing" the string, but it turns out that that is not always possible or even correct; the Unicode standard prefers the use of the term "case-folding", as do I.
C++ locales aren't quite up to the job
So, getting back to C++, the sad truth is that C++ locales do not have sufficient information to do accurate case-folding, because C++ locales work on the assumption that case-folding a string consists of nothing more than sequentially and individually upper-casing each codepoint in the string using a function which maps a codepoint to another codepoint. As we'll see, that just doesn't work, and consequently the question of its efficiency is irrelevant. On the other hand, the ICU library has an interface which does case-folding as correctly as the Unicode database allows, and its implementation has been crafted by some pretty good coders so it is probably just about as efficient as possible within the constraints. So I'd definitely recommend using it.
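For illustration, a minimal ICU sketch (assuming ICU 59 or later, where UnicodeString accepts char16_t literals); both the explicit foldCase() route and the one-call caseCompare() route are shown:

    #include <unicode/uchar.h>    // U_FOLD_CASE_DEFAULT
    #include <unicode/unistr.h>   // icu::UnicodeString
    #include <iostream>

    int main() {
        icu::UnicodeString a(u"gr\u00FC\u00DFen");   // grüßen (to greet)
        icu::UnicodeString b(u"GR\u00DCSSEN");       // GRÜSSEN

        // Option 1: case-fold both strings, then compare the folded forms.
        icu::UnicodeString fa(a), fb(b);
        fa.foldCase();
        fb.foldCase();
        std::cout << (fa == fb ? "equal" : "different") << '\n';   // "equal": ß folds to ss

        // Option 2: let ICU fold on the fly during the comparison.
        std::cout << (a.caseCompare(b, U_FOLD_CASE_DEFAULT) == 0 ? "equal" : "different") << '\n';
    }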
If you want a good overview of the difficulty of case-folding, you should read sections 5.18 and 5.19 of the Unicode standard (PDF for chapter 5). The following is just a few examples.
A case transform is not a mapping from single character to single character
The simplest example is the German ß (U+00DF), which has no upper-case form because it never appears at the beginning of a word, and traditional German orthography didn't use all-caps. The standard upper-case transform is SS (or in some cases SZ) but that transform is not reversible; not all instances of ss are written as ß. Compare, for example, grüßen and küssen (to greet and to kiss, respectively). In v5.1, ẞ, an "upper-case ß", was added to Unicode as U+1E9E, but it is not commonly used except in all-caps street signs, where its use is legally mandated. The normal expectation when upper-casing ß would be the two letters SS.
Not all glyphs (visible characters) are single codepoints
Even when a case transform maps a single character to a single character, it may not be able to express that as a wchar→wchar mapping. For example, ǰ can easily be capitalized to J̌, but the former is a single combined glyph (U+01F0), while the second is a capital J with a combining caron (U+030C).
There is a further problem with glyphs like ǰ:
Naive character-by-character case-folding can denormalize
Suppose we upper-case ǰ as above. How do we capitalize ǰ̠ (which, in case it doesn't render properly on your system, is the same character with a bar underneath, another IPA convention)? That combination is U+01F0,U+0320 (j with caron, combining minus sign below), so we proceed to replace U+01F0 with U+004A,U+030C and then leave the U+0320 as is: J̠̌. That's fine, but it won't compare equal to a normalized capital J with caron and minus sign below, because in the normal form the minus sign diacritic comes first: U+004A,U+0320,U+030C (J̠̌, which should look identical). So sometimes (rarely, to be honest, but sometimes) it is necessary to renormalize.
Leaving aside Unicode weirdness, sometimes case-conversion is context-sensitive
Greek has a lot of examples of how marks get shuffled around depending on whether they are word-initial, word-final or word-interior -- you can read more about this in chapter 7 of the Unicode standard -- but a simple and common case is Σ, which has two lower-case versions: σ and ς. Non-Greeks with some maths background are probably familiar with σ, but might not be aware that it cannot be used at the end of a word, where you must use ς.
In short
The best available correct way to case-fold is to apply the Unicode case-folding algorithm, which requires creating a temporary string for each source string. You could then do a simple bytewise comparison between the two transformed strings in order to verify that the original strings were in the same equivalence class. Doing a collation ordering on the transformed strings, while possible, is rather less efficient than collation ordering the original strings, and for sorting purposes, the untransformed comparison is probably as good or better than the transformed comparison.
In theory, if you are only interested in case-folded equality, you could do the transformations linearly, bearing in mind that the transformation is not necessarily context-free and is not a simple character-to-character mapping function. Unfortunately, C++ locales don't provide you with the data you need to do this. The Unicode CLDR comes much closer, but it's a complex data structure.
All of this stuff is really complicated, and rife with edge cases. (See the note in the Unicode standard about accented Lithuanian i's, for example.) You're really better off just using a well-maintained existing solution, of which the best example is ICU.

Using the Levenshtein distance in a spell checker

I am working on a spell checker in C++ and I'm stuck at a certain step in the implementation.
Let's say we have a text file with correctly spelled words and an inputted string we would like to check for spelling mistakes. If that string is a misspelled word, I can easily find its correct form by checking all words in the text file and choosing the one that differs from it with a minimum of letters. For that type of input, I've implemented a function that calculates the Levenshtein edit distance between 2 strings. So far so good.
Now, the tough part: what if the inputted string is a combination of misspelled words? For example, "iloevcokies". Taking into account that "i", "love" and "cookies" are words that can be found in the text file, how can I use the already-implemented Levenshtein function to determine which words from the file are suitable for a correction? Also, how would I insert blanks in the correct positions?
Any idea is welcome :)
Spelling correction for phrases can be done in a few ways. One way requires having an index of word bi-grams and tri-grams. These can of course be immense. Another option would be to try permutations of the word with spaces inserted, then do a lookup on each word in the resulting phrase. Take a look at a simple implementation of a spell checker by Peter Norvig from Google. Either way, consider using an n-gram index for better performance; there are libraries available in C++ for reference.
Google and other search engines are able to do spelling correction on phrases because they have a large index of queries and associated result sets, which allows them to calculate a statistically good guess. Overall, the spelling correction problem can become very complex with methods like context-sensitive correction and phonetic correction. Given that using permutations of possible sub-terms can become expensive, you can use certain types of heuristics, but this can get out of hand quickly.
You may also consider using an existing spelling library, such as aspell.
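To illustrate the space-insertion idea mentioned above, here is a rough sketch that tries to segment the input into pieces that are each within a small edit distance of some dictionary word. The dictionary, the per-word edit bound and the scoring are placeholders, and the brute-force recursion is exponential without memoization; it only shows the shape of the approach:

    #include <algorithm>
    #include <climits>
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    // Standard two-row Levenshtein distance.
    int levenshtein(const std::string& a, const std::string& b) {
        std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
        for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
        for (std::size_t i = 1; i <= a.size(); ++i) {
            cur[0] = static_cast<int>(i);
            for (std::size_t j = 1; j <= b.size(); ++j) {
                int sub = prev[j - 1] + (a[i - 1] != b[j - 1]);
                cur[j] = std::min(sub, std::min(prev[j] + 1, cur[j - 1] + 1));
            }
            std::swap(prev, cur);
        }
        return prev[b.size()];
    }

    // Best correction of `input` as a sequence of dictionary words; returns total
    // edit cost, or INT_MAX if no segmentation stays within `maxEditsPerWord`.
    int segment(const std::string& input, const std::vector<std::string>& dict,
                int maxEditsPerWord, std::string& out) {
        if (input.empty()) { out.clear(); return 0; }
        int best = INT_MAX;
        for (std::size_t len = 1; len <= input.size(); ++len) {
            std::string piece = input.substr(0, len);
            for (const auto& word : dict) {
                int d = levenshtein(piece, word);
                if (d > maxEditsPerWord) continue;
                std::string rest;
                int restCost = segment(input.substr(len), dict, maxEditsPerWord, rest);
                if (restCost == INT_MAX || d + restCost >= best) continue;
                best = d + restCost;
                out = rest.empty() ? word : word + " " + rest;
            }
        }
        return best;
    }

    int main() {
        std::vector<std::string> dict = {"i", "love", "cookies"};
        std::string suggestion;
        if (segment("iloevcokies", dict, 2, suggestion) != INT_MAX)
            std::cout << suggestion << '\n';        // "i love cookies"
    }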
A starting point for an idea: one of the top hits of your L-distance for "iloevcokies" should be "cookies". If you can change your L-distance function to also track and return a min-index and max-index (i.e., this match is best starting from character 5 and going to character 10) then you could remove that substring and re-check L-distance for the string before it and after it, then concatenate those for a suggestion....
Just a thought, good luck....
I will suppose that you have an existing index on which you run your Levenshtein distance (for example a trie, but any sorted index generally works well).
You can consider the addition of white-spaces as a regular edit operation, it's just that there is a twist: you need (then) to get back to the root of your index for the next word.
This way you get the same index, almost the same route, approximately the same traversal, and it should not even impact your running time that much.

efficient algorithm for searching one of several strings in a text?

I need to search incoming not-very-long pieces of text for occurrences of given strings. The strings are constant for the whole session and are not many (~10). Additional simplification is that none of the strings is contained in any other.
I am currently using boost regex matching with str1 | str2 | .... The performance of this task is important, so I wonder if I can improve it. Not that I can program better than the boost guys, but perhaps a dedicated implementation is more efficient than a general one.
As the strings stay constant over long time, I can afford building a data structure, like a state transition table, upfront.
e.g., if the strings are abcx, bcy and cz, and I've read abc so far, I should be in a combined state that means you're either 3 chars into string 1, 2 chars into string 2 or 1 char into string 3. Then reading x next will move me to the string-1-matched state etc., and any char other than x, y or z will move to the initial state, and I will not need to retract back to b.
Any ideas or references are appreciated.
Check out the Aho–Corasick string matching algorithm!
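For the curious, here is a compact sketch of the idea, which is essentially the combined-state automaton described in the question: one trie over all the needles plus failure links, so the text is scanned once with no backtracking (the needles and text below are just the question's example):

    #include <iostream>
    #include <queue>
    #include <string>
    #include <vector>

    struct AhoCorasick {
        static constexpr int ALPHA = 256;
        struct Node {
            int next[ALPHA];
            int fail = 0;
            std::vector<int> out;                 // indices of needles ending here
            Node() { for (int& n : next) n = -1; }
        };
        std::vector<Node> nodes = {Node()};       // node 0 is the root

        void add(const std::string& needle, int id) {
            int cur = 0;
            for (unsigned char c : needle) {
                if (nodes[cur].next[c] == -1) {
                    nodes[cur].next[c] = static_cast<int>(nodes.size());
                    nodes.emplace_back();
                }
                cur = nodes[cur].next[c];
            }
            nodes[cur].out.push_back(id);
        }

        void build() {                            // BFS fills failure links and turns the trie into a DFA
            std::queue<int> q;
            for (int c = 0; c < ALPHA; ++c) {
                int v = nodes[0].next[c];
                if (v == -1) { nodes[0].next[c] = 0; continue; }
                nodes[v].fail = 0;
                q.push(v);
            }
            while (!q.empty()) {
                int u = q.front(); q.pop();
                for (int c = 0; c < ALPHA; ++c) {
                    int v = nodes[u].next[c];
                    if (v == -1) { nodes[u].next[c] = nodes[nodes[u].fail].next[c]; continue; }
                    nodes[v].fail = nodes[nodes[u].fail].next[c];
                    const auto& fo = nodes[nodes[v].fail].out;
                    nodes[v].out.insert(nodes[v].out.end(), fo.begin(), fo.end());
                    q.push(v);
                }
            }
        }

        void search(const std::string& text) const {
            int state = 0;
            for (std::size_t i = 0; i < text.size(); ++i) {
                state = nodes[state].next[static_cast<unsigned char>(text[i])];
                for (int id : nodes[state].out)
                    std::cout << "needle " << id << " ends at position " << i << '\n';
            }
        }
    };

    int main() {
        AhoCorasick ac;
        const std::vector<std::string> needles = {"abcx", "bcy", "cz"};
        for (std::size_t i = 0; i < needles.size(); ++i) ac.add(needles[i], static_cast<int>(i));
        ac.build();
        ac.search("zabcxbcycz");
    }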
Take a look at Suffix Tree.
Look at this: http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/configuration/algorithm.html
The existence of a recursive/non-recursive distinction is a pretty strong suggestion that BOOST is not necessarily a linear-time discrete finite-state machine. Therefore, there's a good chance you can do better for your particular problem.
The best answer depends quite a bit on how many haystacks you have and the minimum size of a needle. If the smallest needle is longer than a few characters, you may be able to do a little bit better than a generalized regex library.
Basically all string searches work by testing for a match at the current position (cursor), and if none is found, then trying again with the cursor slid farther to the right.
Knuth-Morris-Pratt builds a DFSM out of the string (or strings) for which you are searching, so that the test and the cursor motion are combined in a single operation. (Rabin-Karp, by contrast, uses a rolling hash rather than an automaton.) However, this construction was originally designed for a single needle, so you would need to support backtracking if one match could ever be a proper prefix of another. (Remember that for when you want to reuse your code.)
Another tactic is to slide the cursor more than one character to the right if at all possible. Boyer-Moore does this. It's normally built for a single needle. Construct a table of all characters and the rightmost position at which they appear in the needle (if at all). Now, position the cursor at len(needle)-1. The table entry will tell you (a) at what leftward offset from the cursor the needle might be found, or (b) that you can move the cursor len(needle) farther to the right.
When you have more than one needle, the construction and use of your table grows more complicated, but it still may possibly save you an order of magnitude on probes. You still might want to make a DFSM but instead of calling a general search method, you call does_this_DFSM_match_at_this_offset().
Another tactic is to test more than 8 bits at a time. There's a spam-killer tool that looks at 32-bit machine words at a time. It then computes a simple hash code to fit the result into 12 bits, and then looks in a table to see if there's a hit. You have four entries for each pattern (offsets of 0, 1, 2, and 3 from the start of the pattern), and this way, despite thousands of patterns in the table, only one or two tests are made per 32-bit word in the subject line.
So in general, yes, you can go faster than regexes WHEN THE NEEDLES ARE CONSTANT.
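As an illustration of the bad-character table described above, here is a sketch for a single needle (essentially Boyer-Moore-Horspool); extending the table to several needles is where the construction gets more complicated:

    #include <array>
    #include <cstddef>
    #include <iostream>
    #include <string>

    // For every byte, precompute how far the window may slide when that byte is
    // the one aligned with the last position of the needle.
    std::size_t horspool_find(const std::string& hay, const std::string& needle) {
        if (needle.empty() || needle.size() > hay.size()) return std::string::npos;
        std::array<std::size_t, 256> shift;
        shift.fill(needle.size());                       // bytes absent from the needle: full shift
        for (std::size_t i = 0; i + 1 < needle.size(); ++i)
            shift[static_cast<unsigned char>(needle[i])] = needle.size() - 1 - i;

        std::size_t pos = 0;
        while (pos + needle.size() <= hay.size()) {
            if (hay.compare(pos, needle.size(), needle) == 0) return pos;
            pos += shift[static_cast<unsigned char>(hay[pos + needle.size() - 1])];
        }
        return std::string::npos;
    }

    int main() {
        std::cout << horspool_find("say abracadabra twice", "cadabra") << '\n';  // prints 8
    }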
I've been looking at the answers but none seem quite explicit... and mostly boiled down to a couple of links.
What intrigues me here is the uniqueness of your problem: the solutions presented so far do not capitalize at all on the fact that we are looking for several needles at once in the haystack.
I would take a look at KMP / Boyer-Moore, for sure, but I would not apply them blindly (at least if you have some time on your hands), because they are tailored for a single needle, and I am pretty convinced we could capitalize on the fact that we have several strings and check all of them at once, with a custom state machine (or custom tables for BM).
Of course, it's unlikely to improve the big O (Boyer Moore runs in 3n for each string, so it'll be linear anyway), but you could probably gain on the constant factor.
Regex engine initialization is expected to have some overhead, so if there are no real regular expressions involved, plain C memcmp() should do fine. If you can tell the file sizes and give some specific use cases, we could build a benchmark (I consider this very interesting).
Interesting: memcmp explorations and timing differences
Regards, rbo
There is always Boyer Moore
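Since C++17 there is even a ready-made one in the standard library; a minimal sketch:

    #include <algorithm>
    #include <functional>
    #include <iostream>
    #include <string>

    int main() {
        const std::string hay = "we all scream for ice cream";
        const std::string needle = "ice cream";
        // std::boyer_moore_searcher (C++17) precomputes the shift tables once;
        // std::search then reuses them for the actual scan.
        auto it = std::search(hay.begin(), hay.end(),
                              std::boyer_moore_searcher(needle.begin(), needle.end()));
        if (it != hay.end())
            std::cout << "found at offset " << (it - hay.begin()) << '\n';   // 18
    }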
Besides the Rabin-Karp algorithm and the Knuth-Morris-Pratt algorithm, my algorithms book suggests a finite state machine for string matching. For every search string you need to build such a finite state machine.
You can do it with the very popular Lex & Yacc tools, or the compatible Flex and Bison tools.
You can use Lex to get the tokens of the string.
Compare your pre-defined strings with the tokens returned from the lexer.
When a match is found, perform the desired action.
There are many sites which describe about Lex and Yacc.
One such site is http://epaperpress.com/lexandyacc/

Matching unmatched strings based on an unknown pattern

Alright guys, I've really hurt my brain over this one and I'm curious whether you can give me any pointers towards the right direction to take.
The situation is this:
Let's say I have a collection of strings (let it be clear that the pattern of these strings is unknown; for a fact, I can say that the strings contain only characters from the ASCII table, and therefore I don't have to worry about weird Chinese characters).
For this example, I take the following collection of strings (note that the strings don't have to make any human sense, so don't try figuring them out :)):
"[001].[FOO].[TEST] - 'foofoo.test'",
"[002].[FOO].[TEST] - 'foofoo.test'",
"[003].[FOO].[TEST] - 'foofoo.test'",
"[001].[FOO].[TEST] - 'foofoo.test.sample'",
"[002].[FOO].[TEST] - 'foofoo.test.sample'",
"-001- BAR.[TEST] - 'bartest.xx1",
"-002- BAR.[TEST] - 'bartest.xx1"
Now, what I need is a way of finding logical groups (and subgroups) of this set of strings. In the above example, just by rational thinking, you can combine the first 3, the 2 after that, and the last 2. Also, the resulting groups from the first 5 can be combined into one main group with 2 subgroups, which should give you something like this:
{
{
"[001].[FOO].[TEST] - 'foofoo.test'",
"[002].[FOO].[TEST] - 'foofoo.test'",
"[003].[FOO].[TEST] - 'foofoo.test'",
}
{
"[001].[FOO].[TEST] - 'foofoo.test.sample'",
"[002].[FOO].[TEST] - 'foofoo.test.sample'",
}
}
{
{
"-001- BAR.[TEST] - 'bartest.xx1",
"-002- BAR.[TEST] - 'bartest.xx1"
}
}
Sorry for the layout above but indenting with 4 spaces doesn't seem to work correctly (or I'm frakk'n it up).
Anyway, I'm not sure how to approach this problem (how to get the result desired as indicated above).
First off, I thought of creating a huge set of regexes which would parse most known patterns, but the number of different patterns is so huge that this isn't realistic.
Another thing I thought of was parsing each individual word within a string (so strip all non-alphabetic and non-numeric characters and split on those), and if X% of the words match, I can assume the strings belong to the same group (where X will probably be around 80/90). However, I find the margin for error kinda big. For example, when matching strings of 20 words each, the chance of hitting above 80% is kinda big (that means 4 words can differ), whereas when matching only 8 words, at most 2 words can differ.
My question to you is, what would be a logical approach in the above situation?
As for a real-life example:
Thanks in advance!
Basically I would consider each string as a bag of characters. I would define a kind of distance between two strings, which would be something like "the number of characters belonging to both strings" divided by "the total number of characters in string 1 + the total number of characters in string 2" (well, it's not a distance mathematically speaking...), and then I would try to apply some algorithms to cluster your set of strings.
Well, this is just a basic idea, but I think it would be a good starting point for some experiments...
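A rough sketch of that character-bag similarity, just to have something to experiment with (the normalization and the 256-bucket table are arbitrary choices for the sketch):

    #include <algorithm>
    #include <array>
    #include <iostream>
    #include <string>

    // Count how many characters the two multisets share, normalized by the
    // combined length: ~0.5 means near-anagrams, ~0 means disjoint.
    double char_bag_similarity(const std::string& a, const std::string& b) {
        std::array<int, 256> ca{}, cb{};
        for (unsigned char c : a) ++ca[c];
        for (unsigned char c : b) ++cb[c];
        int common = 0;
        for (int i = 0; i < 256; ++i) common += std::min(ca[i], cb[i]);
        return (a.empty() && b.empty()) ? 0.0
             : static_cast<double>(common) / static_cast<double>(a.size() + b.size());
    }

    int main() {
        std::cout << char_bag_similarity("[001].[FOO].[TEST] - 'foofoo.test'",
                                         "[002].[FOO].[TEST] - 'foofoo.test'") << '\n';
        std::cout << char_bag_similarity("[001].[FOO].[TEST] - 'foofoo.test'",
                                         "-001- BAR.[TEST] - 'bartest.xx1") << '\n';
    }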
Building on PierrOz's answer, you might want to experiment with multiple measures and do a statistical cluster analysis on those measures.
For example, you could use four measures (sketched in code at the end of this answer):
How many letters (upper/lowercase)
How many digits
How many of the characters "[", "]" and "."
How many other characters (probably) not included above
You then have, in this example, four measures for each string, and you could, if you wished, apply a different weight to each measure.
R has a number of functions for cluster analysis. This might be a good starting point.
Afterthought: the measures can be almost anything you invent. Some more examples:
Binary: does the string contain a given character (0 or 1)?
Binary: does the string contain a given substring?
Count: how many times does the given substring appear?
Binary: does the string include all these characters?
Enough for at least a weekend's tinkering...
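A small sketch of computing the four measures above; the resulting feature vectors could then be handed to whatever clustering routine you prefer (the sample strings are taken from the question):

    #include <cctype>
    #include <iostream>
    #include <string>
    #include <vector>

    struct Features { int letters = 0, digits = 0, brackets_dots = 0, other = 0; };

    Features measure(const std::string& s) {
        Features f;
        for (unsigned char c : s) {
            if (std::isalpha(c))                       ++f.letters;
            else if (std::isdigit(c))                  ++f.digits;
            else if (c == '[' || c == ']' || c == '.') ++f.brackets_dots;
            else                                       ++f.other;
        }
        return f;
    }

    int main() {
        std::vector<std::string> samples = {
            "[001].[FOO].[TEST] - 'foofoo.test'",
            "-001- BAR.[TEST] - 'bartest.xx1"
        };
        for (const auto& s : samples) {
            Features f = measure(s);
            std::cout << f.letters << ' ' << f.digits << ' '
                      << f.brackets_dots << ' ' << f.other << '\n';
        }
    }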
I would recommend using the Hamming distance (http://en.wikipedia.org/wiki/Hamming_distance) as the distance.
Also, for files, a good heuristic would be to remove the checksum at the end of the filename before calculating the distance:
[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_[35218661].mkv
->
[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_.mkv
The check is simple: it's always 10 characters, the first being [, the last ], and the rest alphanumeric :)
With this heuristic and a maximum distance of 4, your stuff will work in the vast majority of cases.
Good luck!
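A sketch of the checksum-stripping heuristic plus Hamming distance; note that plain Hamming distance is only defined for equal-length strings, so the sketch bails out otherwise (the second filename is made up for the example):

    #include <cstddef>
    #include <iostream>
    #include <regex>
    #include <string>

    // Strip a trailing-style "[XXXXXXXX]" checksum: 8 alphanumeric characters in brackets.
    std::string strip_checksum(const std::string& name) {
        static const std::regex checksum("\\[[0-9A-Za-z]{8}\\]");
        return std::regex_replace(name, checksum, "");
    }

    int hamming(const std::string& a, const std::string& b) {
        if (a.size() != b.size()) return -1;             // Hamming distance is undefined here
        int d = 0;
        for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] != b[i]);
        return d;
    }

    int main() {
        std::string f1 = "[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_[35218661].mkv";
        // Second filename invented for the example.
        std::string f2 = "[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_02_[41B32A0C].mkv";
        std::cout << hamming(strip_checksum(f1), strip_checksum(f2)) << '\n';   // 1
    }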
Your question is not easy to understand, but I think what you ask is impossible to do in a satisfying way given any group of strings. Take these strings for instance:
[1].[2].[3].[4].[5]
[a].[2].[3].[4].[5]
[a].[b].[3].[4].[5]
[a].[b].[c].[4].[5]
[a].[b].[c].[d].[5]
[a].[b].[c].[d].[e]
Each is close to those listed next to it, so they should all group with their neighbours, but the first and the last are completely different, so it would not make sense to group them together. Given a more "groupable" dataset you might get pretty good results with a method like the one PierrOz describes, but there is no guarantee of meaningful results.
May I enquire what the purpose is? It would allow us all to better understand what errors might be tolerated, or perhaps even come up with a different approach to solving the problem.
Edit: I wonder, would it be OK if one string ends up in multiple different groups? That could make the problem a lot simpler, and more reliably give you useful information, but you would end up with a bigger grouping tree with the same node copied to different branches.
I'd be tempted to tackle this with cluster analysis techniques. Hit Wikipedia for an introduction. And the other answers probably fall within the domain of cluster analysis, but you might find some other useful approaches by reading a bit more widely.