Deciding whether texts or sentences are equivalent in content - word2vec

The classic example of measuring similarity as a distance is Word Mover's Distance, as shown for example here: https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html,
using a word2vec model trained on GoogleNews-vectors-negative300.bin, with D1="Obama speaks to the media in Illinois", D2="The president greets the press in Chicago", D3="Oranges are my favorite fruit". The computed WMD distances are distance(D1,D2)=3.3741 and distance(D1,D3)=4.3802, so we conclude that (D1,D2) are more similar than (D1,D3). What threshold value for the WMD distance lets us decide that two sentences actually contain almost the same information? Maybe in the case of sentences D1 and D2 the value 3.3741 is too large, and in reality these sentences are different? Such decisions need to be made, for example, when there is a question, a sample of the correct answer, and a student's answer.
Addition after the answer by gojomo:
Let's postpone identification and automatic understanding of logic for later, and consider the case where two sentences enumerate objects, or the properties and actions of one object, in a positive way, and we need to evaluate how similar the content of these two sentences is.
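For reference, the ranking behaviour (though not the absolute magnitudes) can be illustrated with a toy version of the idea. The sketch below uses made-up 2-D vectors and the "relaxed WMD" lower bound, where each word simply travels to its nearest neighbour in the other document, instead of the full optimal-transport computation, so the numbers are not comparable to gensim's:

```python
import math

# Made-up 2-D "embeddings" (an assumption for illustration; real
# word2vec vectors are 300-dimensional and learned from a corpus).
VEC = {
    "obama": (1.0, 1.0),    "president": (1.1, 1.0),
    "speaks": (2.0, 1.0),   "greets": (2.1, 1.1),
    "media": (3.0, 1.0),    "press": (3.1, 0.9),
    "illinois": (4.0, 1.0), "chicago": (4.1, 1.1),
    "oranges": (10.0, 10.0), "favorite": (11.0, 10.0), "fruit": (10.0, 11.0),
}

def rwmd(doc1, doc2):
    """Relaxed WMD lower bound: each word travels to its nearest
    neighbour in the other document; symmetrised with max()."""
    def one_way(a, b):
        return sum(min(math.dist(VEC[w], VEC[x]) for x in b) for w in a) / len(a)
    return max(one_way(doc1, doc2), one_way(doc2, doc1))

d1 = ["obama", "speaks", "media", "illinois"]
d2 = ["president", "greets", "press", "chicago"]
d3 = ["oranges", "favorite", "fruit"]
```

Only the relative order, rwmd(d1, d2) < rwmd(d1, d3), is meaningful here; the magnitudes depend entirely on the vectors, which is exactly why no absolute threshold transfers between models or corpora.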

I don't believe there's any absolute threshold that could be used as you wish.
The "Word Mover's Distance" can offer some impressive results in finding highly-similar texts, especially in relative comparison to other candidate texts.
However, its magnitude may be affected by the sizes of the texts, and further it has no understanding of rigorous grammar/semantics. Thus things like subtle negations or contrasts, or things that would be nonsense to a native speaker, won't be highlighted as very "different" from other statements.
For example, the two phrases "Many historians agree Obama is absolutely positively the best President of the 21st century", and "Many historians agree Obama is absolutely positively not the best President of the 21st century", will be noted as incredibly similar by most measures based on word-statistics, such as Word Mover's Distance. Yet, the insertion of one word means they convey somewhat opposite ideas.
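The negation problem is easy to demonstrate with any word-statistics measure; here Jaccard similarity over word sets stands in for WMD (an assumption for simplicity):

```python
def jaccard(s1, s2):
    """Similarity of two sentences as the overlap of their word sets."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b)

praise = ("Many historians agree Obama is absolutely positively "
          "the best President of the 21st century")
denial = ("Many historians agree Obama is absolutely positively "
          "not the best President of the 21st century")
```

The two sentences share every word except "not", so their word-set similarity is above 0.9 even though they assert opposite things.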

Related

Using the Levenshtein distance in a spell checker

I am working on a spell checker in C++ and I'm stuck at a certain step in the implementation.
Let's say we have a text file with correctly spelled words and an inputted string we would like to check for spelling mistakes. If that string is a misspelled word, I can easily find its correct form by checking all words in the text file and choosing the one that differs from it with a minimum of letters. For that type of input, I've implemented a function that calculates the Levenshtein edit distance between 2 strings. So far so good.
Now, the tough part: what if the inputted string is a combination of misspelled words? For example, "iloevcokies". Taking into account that "i", "love" and "cookies" are words that can be found in the text file, how can I use the already-implemented Levenshtein function to determine which words from the file are suitable for a correction? Also, how would I insert blanks in the correct positions?
Any idea is welcome :)
Spelling correction for phrases can be done in a few ways. One way requires having an index of word bi-grams and tri-grams. These of course could be immense. Another option would be to try permutations of the word with spaces inserted, then doing a lookup on each word in the resulting phrase. Take a look at a simple implementation of a spell checker by Peter Norvig from Google. Either way, consider using an n-gram index for better performance, there are libraries available in C++ for reference.
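The "insert spaces, then look up each word" idea can be sketched as a small dynamic program: choose split points so that the total edit distance of the chunks to their nearest dictionary words is minimal. The three-word dictionary here is of course an assumption for illustration:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def segment(s, words):
    """Split s into chunks minimising total edit distance to dictionary
    words; returns (total_cost, corrected_word_list)."""
    n = len(s)
    best = [(0, [])] + [(None, None)] * n   # best[i] covers s[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - 12), i):  # cap chunk length at 12
            if best[j][0] is None:
                continue
            chunk = s[j:i]
            word, c = min(((w, levenshtein(chunk, w)) for w in words),
                          key=lambda t: t[1])
            cost = best[j][0] + c
            if best[i][0] is None or cost < best[i][0]:
                best[i] = (cost, best[j][1] + [word])
    return best[n]
```

This is quadratic in the string length times the dictionary size, so for real use you would want a trie or n-gram index to prune candidate words, as suggested above.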
Google and other search engines are able to do spelling correction on phrases because they have a large index of queries and associated result sets, which allows them to calculate a statistically good guess. Overall, the spelling correction problem can become very complex with methods like context-sensitive correction and phonetic correction. Given that using permutations of possible sub-terms can become expensive you can utilize certain types of heuristics, however this can get out of scope quick.
You may also consider using an existing spelling library, such as aspell.
A starting point for an idea: one of the top hits of your L-distance for "iloevcokies" should be "cookies". If you can change your L-distance function to also track and return a min-index and max-index (i.e., this match is best starting from character 5 and going to character 10) then you could remove that substring and re-check L-distance for the string before it and after it, then concatenate those for a suggestion....
Just a thought, good luck....
I will suppose that you have an existing index on which you run your Levenshtein distance (for example, a trie, but any sorted index generally works well).
You can consider the addition of white-spaces as a regular edit operation, it's just that there is a twist: you need (then) to get back to the root of your index for the next word.
This way you get the same index, almost the same route, approximately the same traversal, and it should not even impact your running time that much.

Damerau–Levenshtein distance for language specific quirks

To Dutch speaking people the two characters "ij" are considered to be a single letter that is easily exchanged with "y".
For a project I'm working on I would like to have a variant of the Damerau–Levenshtein distance that calculates the distance between "ij" and "y" as 1 instead of the current value of 2.
I've been trying this myself but failed. My problem is that I do not have a clue on how to handle the fact that both texts are of different lengths.
Does anyone have a suggestion/code fragment on how to solve this?
Thanks.
The Wikipedia article is rather loose with terminology. There are no such things as "strings" in "natural language". There are phonemes in natural language which can be represented by written characters and character-combinations.
Some character-combinations are vestiges of historical conventions which have survived into modern times, as in modern English "rough" where the "gh" can sound like -f- or make no sound at all. It seems to me that in focusing on raw "strings" the algorithm must be agnostic about the historical relationship of language and orthographic convention, which leads to some arbitrary metrics whenever character-combinations correlate to a single phoneme. How would it measure "rough" to "ruf"? Or "through" to "thru"?
Or German o-umlaut to "oe"?
In your case the -y- can be exchanged phonetically and orthographically with -ij-. So what is that according to the algorithm, two deletions followed by an insertion, or a single deletion of the -j- or of the -i- followed by a transposition of the remaining character to -y-? Or is -ij- being coalesced and the coalescence is followed by a transposition?
I would recommend that you map -ij- to another unused combining character before applying the algorithm, perhaps U+00EC, Latin small letter i with grave accent.
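That substitution trick might be sketched as follows, using the restricted Damerau-Levenshtein (optimal string alignment) variant; the choice of U+00EC as the placeholder follows the suggestion above:

```python
def damerau_levenshtein(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

def normalize(s):
    # Coalesce the digraph "ij" into one placeholder codepoint
    # (U+00EC) so the algorithm treats it as a single letter.
    return s.replace("ij", "\u00ec")
```

With the placeholder, "ijs" versus "ys" becomes a single substitution (distance 1) instead of a deletion plus a substitution (distance 2), which also sidesteps the different-lengths problem the question raises.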
How does the algorithm handle multi-codepoint characters?
Well, the D-L distance itself isn't going to handle this for you, due to the way it measures distances.
As there is no code (or language) involved here, I can only leave you with a suggestion to ensure all strings adhere to the same structure.
To clarify, since you're asking in general terms:
bear in mind that the D-L distance compares character for character and doesn't actually read your strings themselves, so you'll have to parse before you compare; otherwise, cases where "ij" shouldn't be exchanged with "y" will cause other issues instead.
An idea is to translate each string into some sort of constructed orthographemic representation, where digraphs such as "ij" and the English "gh", "th" and friends are only one character long. The distance metric does not have to be equal for all types of replacements when doing Damerau-Levenshtein, so you can use whatever penalties you want, but the table needs to be filled locally, so you really want each sound to be one cell in the table.
This however breaks when the "ij" was not intended as "ij" but a misspelling or at a word-segmentation border (I don't know if that can happen in Dutch), or in any other situation it is not actually (meant as) a digraph.
Otherwise you will need to do some lookaround, this will complicate things but should not change the growth order of the algorithm (I believe), provided you only look at constant number of cells around. The constant factors will still be much bigger though.

Is it possible to calculate the edit distance between a regexp and a string?

If so, please explain how.
Re: what is distance -- "The distance between two strings is defined as the minimal number of edits required to convert one into the other."
For example, xyz to XYZ would take 3 edits, so the string xYZ is closer to XYZ than xyz is.
If the pattern is [0-9]{3} or for instance 123, then a23 would be closer to the pattern than ab3.
How can you find the shortest distance between a regexp and a non-matching string?
The above is the Damerau–Levenshtein distance algorithm.
You can use Finite State Machines to do this efficiently (that is, linear in time). If you use a transducer, you can even write the specification of the transformation fairly compactly and do far more nuanced transformations than simply inserts or deletes - see wikipedia for Finite State Transducer as a starting point, and software such as the FSA toolkit or FSA6 (which has a not entirely stable web-demo) too. There are lots of libraries for FSA manipulation; I don't want to suggest the previous two are your only or best options, just two I've heard of.
If, however, you merely want the efficient, approximate searching, a less flexibly but already-implemented-for-you option exists: TRE, which has an approximate matching function that returns the cost of the match - i.e., the distance to the match, from your perspective.
If you mean the string in the regex's language with the smallest Levenshtein distance to the sample, then I'm pretty sure it can be done: convert the regex to a DFA, then try to match, and whenever something fails, non-deterministically continue as if it had passed, keeping track of the number of differences. You could use A* search or something similar for this, though it would be quite inefficient (O(2^n) worst case).
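The DFA idea need not be exponential: a shortest-path search over (DFA state, characters consumed) pairs finds the minimum cost in polynomial time, where each edge is a match, substitution, insertion or deletion of one character. The hand-built DFA for [0-9]{3} below is an assumption for illustration; a real tool would compile the regex to a DFA first:

```python
import heapq

# Hand-built DFA for the pattern [0-9]{3}: states 0..3, state 3 accepts.
DIGITS = "0123456789"
ACCEPT = 3

def delta(state, ch):
    """DFA transition; None means the dead state."""
    return state + 1 if state < ACCEPT and ch in DIGITS else None

def regex_edit_distance(s):
    """Dijkstra over (dfa_state, chars_of_s_consumed) pairs."""
    start, goal = (0, 0), (ACCEPT, len(s))
    dist = {start: 0}
    pq = [(0, start)]
    while pq:
        d, (q, i) = heapq.heappop(pq)
        if (q, i) == goal:
            return d
        if d > dist.get((q, i), float("inf")):
            continue
        moves = []
        for ch in DIGITS:                        # alphabet of the pattern
            nq = delta(q, ch)
            if nq is not None:
                if i < len(s):                   # match (cost 0) or substitute (cost 1)
                    moves.append(((nq, i + 1), 0 if s[i] == ch else 1))
                moves.append(((nq, i), 1))       # insert ch into s
        if i < len(s):
            moves.append(((q, i + 1), 1))        # delete s[i]
        for nxt, c in moves:
            if d + c < dist.get(nxt, float("inf")):
                dist[nxt] = d + c
                heapq.heappush(pq, (d + c, nxt))
    return None
```

This reproduces the question's intuition: "a23" is one edit from matching [0-9]{3}, while "ab3" is two.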

Is there a fairly simple way for a script to tell (from context) whether "her" is a possessive pronoun?

I am writing a script to reverse all genders in a piece of text, so all gendered words are swapped - "man" is swapped with "woman", "she" is swapped with "he", etc. But there is an ambiguity as to whether "her" should be replaced with "him" or "his".
Okay. Let's look at this like a linguist might. I am thinking aloud here.
"Her" is a pronoun. It can either be a:
1. possessive pronoun
This is her book.
2. personal pronoun
Give it to her. (after preposition)
He wrote her a letter. (indirect object)
He treated her for a cold. (direct object)
So let's look at case (1), the possessive pronoun. That is, it is a pronoun in the "genitive" case (meaning it is a noun that is being "possessive". Okay, that detail isn't quite as important as the next one.)
In this case, "her" is acting as a "determiner". Determiners may occur in two places in a sentence (this is a simplification):
Det + Noun ("her book")
Det + Adj + Noun ("her nice book")
So to figure out if her is a determiner, you could have this logic:
a. If the word following "her" is a noun, then "her" is a determiner.
b. If the 2 words following "her" are an adjective and then a noun, then "her" is a determiner.
And if you establish that "her" is a determiner, then you know that you must replace it with "his", which is also a determiner (aka genitive noun, aka possessive pronoun).
If it doesn't match criteria (a) and (b) above, then you could possibly conclude that it is not a determiner, which means it must be a personal pronoun. In that case, you would replace "her" with "him".
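Tests (a) and (b) might look like the sketch below. The tiny hard-coded noun and adjective lists are an assumption for illustration; a real system would consult a dictionary or a POS tagger:

```python
# Tiny hard-coded word lists (an assumption; a real system would use a
# dictionary or part-of-speech tagger to classify the following words).
NOUNS = {"book", "letter", "cold", "car", "dog"}
ADJS = {"nice", "new", "red"}

def replace_her(tokens):
    """Apply determiner tests (a) and (b): 'her' + noun, or
    'her' + adjective + noun, becomes 'his'; otherwise 'him'."""
    out = []
    for i, w in enumerate(tokens):
        if w.lower() != "her":
            out.append(w)
            continue
        nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        nxt2 = tokens[i + 2].lower() if i + 2 < len(tokens) else ""
        if nxt in NOUNS or (nxt in ADJS and nxt2 in NOUNS):
            out.append("his")    # determiner -> possessive
        else:
            out.append("him")    # personal pronoun
    return out
```

Note that "her" at the end of a sentence has no following word, so it falls through to "him", which agrees with the punctuation rule suggested in other answers.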
You wouldn't even have to do the tests below, but I'll try to describe them anyway.
Looking at (2) from above: personal pronoun, rather than possessive. This gets trickier.
The examples above show "her" occurring in 3 ways:
(1) Give it to her. (after preposition. we call this the "object of a preposition".)
So you could maybe devise a rule: "If 'her' occurs immediately after a preposition, then it should be treated as a noun, so we would replace it with 'him'".
The next two are tricky. "her" can either be a direct object or an indirect object.
(2) He wrote her a letter. (indirect object)
(3) He treated her for a cold. (direct object)
Syntactically, how can we tell the difference?
A direct object occurs immediately after a verb.
If you have a verb, followed by a noun, then that noun is a direct object. eg:
He treated her.*
If you have a verb, followed by a noun, followed by a prepositional phrase, then the noun is a direct object.
He treated her for a cold. ("her" is a noun, and it comes immediately after the verb "treated". "for a cold" is a prepositional phrase.)
Which means that you could say "If you have Verb + Noun + Prep" then the noun is a direct object. Since the noun is a direct object, then it is a personal pronoun, so use "him". (note, you only have to check for a preposition, not the entire prep phrase, since the phrase will always begin with a preposition.)
If it is an indirect object, then you'll have the form "verb + noun + noun".
He wrote her a letter. ("her" is a noun, "letter" is a noun. well, "a letter" is a "noun phrase", so you'd have to account for determiners as well.)
So... if "her" is a direct object, indirect object, or obj of prep, you could change it to "him", otherwise, change it to "his".
This method seems a lot more complicated - so I'd just start by checking to see if "her" is a determiner (see above), and if it is a determiner, use "his" otherwise, just use "him".
So, the above has a lot of simplifications. It doesn't cover "interrupting phrases", or clause structures, or constituency tests, or embedded clauses, or punctuation, or anything like that.
Also, this solution requires a dictionary - a list of "nouns" and "verbs" and "prepositions" so that you can determine the lexical category of each word in the sentence.
And even there, man, natural language processing is hard. You'd want to do some sort of "training" for your model to have a good solution. BUT for very simple things, try some of the stuff described above.
Sorry for being so verbose! (None of the existing answers gave any hard data, or precise linguistic definitions, so here goes.)
Given the scope of your project: reversing all gender-related words, it appears that :
The "investment" in a more fundamental approach would be justified
No heuristic based on simple lookup/substitution will adequately serve all or even most cases.
Furthermore, Regex too seems a poor choice of tool; natural language is just not a regular language ;-).
Instead, you should consider introducing Part-of-Speech (POS) tagging, possibly with a hint of Named Entity Recognition, and then apply substitution rules based on the extra info the tagging supplied.
This may seem like a lot of work, but if for example your scripting language happens to be Python, you can leverage NLTK to implement all this with relatively small effort.
G'day,
This is one of those cases where you could invest an inordinate amount of time tracking down the automatic solution and finish up with a result that you're going to have to check through anyway.
I'd suggest making your script insert a piece of text that will really stand out at every instance of "her" and would be easily searchable. Maybe even make the script insert both "him" and "his" strings so that you only need to delete one of them after you've seen the context?
You're going to save a lot of time and effort this way. Not to mention blood, sweat and tears even! (-:
Coming up with a fully automatic solution is no mean feat as it will involve scanning a massive corpus of words to determine if the following word is an object.
Sometimes gaining that extra 5 or 10 percent improvement is just not worth the extra effort involved. Except of course as an "it is left as an interesting exercise for the reader..." type problem that some text books seem to love.
Edit: I forgot to mention that finding this "tipping point" is a true art. Definitely one skill that only comes with experience. (-:
Edit: Part II - The Revenge I also forgot to mention that you can eliminate one edge case though. If the word "her" is followed by punctuation, e.g. "... to her.", "... for her," etc., then you can eliminate the uncertainty for those cases and just replace them with "him". Similarly, if the word is followed by a certain class of words, e.g. "... for her to", the "her" can easily be replaced with "him". Edit 3: This is not a full list of exceptions but is merely intended as a suggestion for a starting point of the list of items you'll need to look for.
HTH
Trying to determine whether her is a possessive or personal pronoun is harder than trying to determine the class of him or his. However, you would expect both to be used in the same contexts given a large enough corpus. So why not reverse the problem? Take a large corpus and find all occurrences of him and his. Then look at the words surrounding them (just how many words you need to look at is left up to you). With enough training examples, you can estimate the probability that a given set of words in the vicinity of the word indicates him or his. Then you can use those probability estimates on an occurrence of her to determine whether you should be using him or his. As other responses have indicated, you're not going to be perfect. Also, figuring out how big of a neighborhood to use and how to calculate the probabilities is a fair bit of work. You could probably do fairly well using a simple classifier like Naive Bayes.
I suspect, though, you can get a decent bit of accuracy just by looking at patterns in parts of speech and writing some rules. Naturally, you'll miss some, but probably a dozen rules or so will account for the majority of occurrences. I just glanced through about fifty occurrences of her in "The Phantom Rickshaw" by Rudyard Kipling and you can easily get 90% accuracy just by the rule:
her_followed_by_noun ? possessive : personal
You can use an off-the-shelf part-of-speech (POS) tagger like the Stanford POS Tagger to automatically determine whether a word is a noun or something else in context. Again, it's not perfect, but it does pretty well.
Edge cases with odd clause structures are hard to get right, but they also occur fairly rarely in most text. It just depends on your data.
I don’t think so. You could check if the possessive pronoun is followed by a noun or an adjective and thereby conclude that it is indeed a possessive pronoun. But of course you would have to write a script that is able to do this, and even if you had a method it would still be wrong in some other cases. A simple pattern matching algorithm won’t help you here.
Good luck with analysing this: http://en.wikipedia.org/wiki/X-bar_theory
Definitely no. You would have to do syntactic analysis on your input text (parsing the English language, really; that's where the word “to parse” comes from). That's the only way you can determine with certainty what the “her” in your text stands for; you can't rely on search-and-replace. There are many ways to do that, but none would qualify as “fairly simple”, I think.
I will address regex, since that is one of the tags. Regular expressions are insufficiently powerful for parsing human language, because regex does not do recursion, and all human languages are recursive.
When this fact is combined with the other ambiguities in English, such as the way many words can serve multiple functions in a sentence, I think that a reliable automated solution will be a very difficult and costly project.
About the only one I can think of (and I'm sure someone in the comments will prove me wrong!) is that any instance of her followed by punctuation can most probably be replaced with him. But I still agree with the previous answers that you're probably best off doing a manual replace.
OK, based on some of the answers people gave I've got a better idea of how to approach this. Instead of trying to write a script that gets this right 100% of the time I'll just aim to get it right as often as possible. A quick search through some English-language texts shows that "his" appears (very roughly) twice as often as "him", so the default behaviour should be to convert "her" to "his". If I did this and nothing else it should be right about two thirds of the time.
Now I'm not interested in finding patterns that would show "her" should be converted to "his", since this is what I would do anyway; I'm only interested in finding patterns that would show "her" should be converted to "him", since these would allow me to lower the error rate. There are two rules I can implement fairly painlessly:
If "her" is followed immediately by a comma or period, it should be converted to "him", as Michael Itzoe said.
If 'her' occurs immediately after a preposition, then it should be treated as a noun, we would replace it with 'him', as Rasher said.
And I'll be able to do more than that if I use Part of Speech tagging software. I think I'll get on with doing the easy stuff first :-)

Are there any good / interesting analogs to regular expressions in 2d?

Are there any good (or at least interesting but flawed) analogs to regular expressions in two dimensions?
In one dimension I can write something like /aaac?(bc)*b?aaa/ to quickly pull out a region of alternating bs and cs with a border of at least three as. Perhaps as importantly, I can come back a month later and see at a glance what it's looking for.
I'm finding myself writing custom code for analogous problems in 2d (some much more complicated / constrained) and it would be nice to have a more concise and standardized notation, even if I have to write the engine behind it myself.
A second example might be called "find the +". The goal is to locate a b with a column of 3 or more as above it, 3 or more as bracketing it on the left and right, and three or more as below. It should match:
..7...hkj.k f
7...a h o j
----a--------
j .a,g- 8 9
.aaabaaaaa7 j
k .a,,g- h j
hh a----? j
a hjg
and might be written as [b^(a{3})v(a{3})>(a{3})<(a{3})] or...
Suggestions?
Not being a regex expert, but finding the problem interesting, I looked around and found this interesting blog entry. Especially the syntax used there for defining the 2D regex looks appealing. The paper linked there might tell you more than me.
Update from comment: Here is the link to the primary author's page where you can download the linked paper "Two-dimensional languages": http://www.mat.uniroma2.it/~giammarr/Research/pubbl.html
Nice problem.
First, ask yourself if you can constrain the pattern to a "+" pattern, or if it would need to test/match rectangles also. For instance, a rectangle of [bc] with a border of a would match the center rectangle below, but would also match a "+" shape of [c([bc]*a})v([bc]*a)>([bc]*a)<([bc]*a)] (in your syntax)
xxxaaaaaxxx
yzyabcba12d
defabcbass3
oreabcba3s3
s33aaaaas33
k388x.egiee
If you can constrain it to a "+" then your task is much easier. As ShuggyCoUk said, parsing a RE is usually equivalent to a DFSM -- but for a single, serial input which simplifies things greatly.
In your "RE+" engine, you'll have to debug not only the engine, but also each place that the expressions are entered. I'd like the compiler to catch any errors it could. Given that, you could also use something that was explicitly four RE's, like:
REPlus engine = new REPlus('b').North("a{3}")
.East("a{3}").South("a{3}").West("a{3}");
However, depending on your implementation this may be cumbersome.
And with regard to the traversal engine -- would the North/West patterns match RtL or LtR? It could matter if the patterns match differently with or w/o greedy sub-matches.
Incidentally, I think the '^' in your example is supposed to be one character to the left, outside the parenthesis.
Regular expressions are designed to model patterns in one dimension. As I understand it, you want to match patterns in a two dimensional array of characters.
Can you use regular expressions? That depends on whether the property that you are searching for is decomposable into components which can be matched in one dimension or not. If so, you can run your regular expressions on columns and rows, and look for intersections in the solution sets from those. Depending on the particular problem you have, this may either solve the problem, or it may find areas in your 2d array which are likely to be solutions.
Whether your problem is decomposable or not, I think writing some custom code will be inevitable. But at least it sounds like an interesting problem, so the work should be pleasant.
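For the "find the +" example, the row/column decomposition above comes down to checking four one-dimensional runs around each candidate centre. A sketch, assuming the grid is a list of strings (rows may be ragged, as in the question's example):

```python
def find_plus(grid, arm=3):
    """Return (row, col) of every 'b' that has runs of >= arm 'a's
    directly above, below, left and right of it."""
    hits = []
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch != "b":
                continue
            up = r >= arm and all(
                c < len(grid[r - k]) and grid[r - k][c] == "a"
                for k in range(1, arm + 1))
            down = r + arm < len(grid) and all(
                c < len(grid[r + k]) and grid[r + k][c] == "a"
                for k in range(1, arm + 1))
            left = c >= arm and row[c - arm:c] == "a" * arm
            right = row[c + 1:c + 1 + arm] == "a" * arm
            if up and down and left and right:
                hits.append((r, c))
    return hits
```

Each arm check is a plain 1-D pattern (here a string comparison, but it could equally be a regex run over the row or the extracted column), and the "+" is the intersection of the four solution sets, as the answer above describes.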
You're essentially talking about a spatial query language. There are plenty out there if you look up spatial query, geographic query and graphic querying. The spatial part generally comes down to points, lines and objects in a region that have other given attributes. Regions can be specified as polygons, distance from a point (e.g. circles), distance from a linear feature such as a road, all points on one side of a linear feature, etc. You can then get into more complex queries like the set of all nearest neighbours, shortest path, travelling salesman, and tessellations like Delaunay TINs and Voronoi diagrams.