I have a question about doing negative sampling in word2vec.
In order to solve the binary classification problem, I know that words around the target word are labeled positive and words in other ranges are labeled negative. At this time, there are too many negative words. So, negative sampling samples according to the frequency of words appearing in the entire document.
In this case, if a word that should be a positive example is also drawn as a negative sample, how is that handled?
For example, for the target word "love" in "I love pizza", (love, pizza) should get a positive label. But isn't it possible for (love, pizza) to also get a negative label afterwards, through negative sampling?
The negative-sampling isn't done based on words elsewhere in the sentence, but the word-frequencies across the entire training corpus.
That is, while the observed neighbors become positive examples, the negative examples are chosen at random from the entire distribution of all known words.
These negative examples might even, by bad luck, be the same as other nearby in-context-window intended-positive words! But this isn't fatal: across all training examples, across all passes, across all negative-samples drawn, the net effect of training remains roughly the same. The positive-examples have their predictions very-often up-weighted, the negative-examples down-weighted, and even in the occasional case where a negative-example is also a neighbor, that effect "washes out" across all the (non-biasing) errors, still leaving that real neighbor net-reinforced over the full training run.
So: you're right to note that the process seems a little sloppy, but it all evens out in the end.
FWIW, the word2vec implementations with which I'm most familiar do check if the negative-sample is exactly the positive word we're trying to predict, but don't do any other checking against the whole current window.
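To make that concrete, here is a minimal sketch (not taken from any particular library) of frequency-based negative sampling, using the frequency ** 0.75 weighting the original word2vec uses; the toy counts and the draw_negatives() helper are made up for illustration:

import numpy as np

# Hypothetical corpus-wide word counts; real word2vec tallies these over the whole training corpus.
counts = {"the": 500, "I": 300, "pizza": 40, "love": 30, "sky": 25, "blue": 20}
words = list(counts)
freqs = np.array([counts[w] for w in words], dtype=float)

# The original word2vec draws negatives proportional to frequency ** 0.75,
# which slightly damps the dominance of very frequent words.
probs = freqs ** 0.75
probs /= probs.sum()

def draw_negatives(positive_word, k=5):
    # Re-draw only if we hit the exact positive word being predicted;
    # other in-window words are NOT excluded, so (love, pizza) can still
    # show up as a negative pair purely by chance.
    negatives = []
    while len(negatives) < k:
        w = np.random.choice(words, p=probs)
        if w != positive_word:
            negatives.append(w)
    return negatives

print(draw_negatives("pizza"))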
If the vocabulary is ordered from the most frequent word to the least frequent, placing '[UNK]' at the beginning implies that it occurs most often. But what if '[UNK]' isn't the most frequent word? Should I put it somewhere else in the vocabulary, according to its frequency?
I ran into this issue while doing this tutorial -> https://www.tensorflow.org/tutorials/text/word2vec
When I'm doing negative sampling using the function tf.random.log_uniform_candidate_sampler, the negative samples with low token IDs (e.g. 0, 1, 2, ...) will be sampled most often. If '[UNK]' is the first (or second when using padding) entry in the vocabulary, which means it has token ID 0 (or 1 when using padding), then '[UNK]' will be heavily sampled as a negative sample. If '[UNK]' occurs a lot, there is no problem, but what if it doesn't? Then it should receive a higher token ID, shouldn't it?
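For what it's worth, you can see that bias directly by drawing repeatedly from the sampler and counting which IDs come back; this is just a small demonstration sketch (vocab_size, num_ns and the context ID are arbitrary):

import tensorflow as tf

vocab_size = 4096
num_ns = 4                                            # negatives per positive pair
context_class = tf.constant([[7]], dtype=tf.int64)    # arbitrary "true" context token id

counts = {}
for _ in range(1000):
    sampled, _, _ = tf.random.log_uniform_candidate_sampler(
        true_classes=context_class, num_true=1, num_sampled=num_ns,
        unique=True, range_max=vocab_size)
    for idx in sampled.numpy():
        counts[int(idx)] = counts.get(int(idx), 0) + 1

# The sampler assumes ids are assigned in order of decreasing frequency
# (roughly Zipfian), so ids 0, 1, 2, ... dominate the counts below.
print(sorted(counts.items(), key=lambda kv: -kv[1])[:10])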
If I have a sentence, ex: “get out of here”
And I want to use word2vec Embed. to represent it .. I found three different ways to do that:
1- for each word, we compute the AVG of its embedding vector values, so each word is replaced by a single value.
2- as in 1, but using the standard deviation of the embedding vector values.
3- or by adding the Embed. vectors as they are, so if I use a 300-length embedding vector, for the above example I will end up with a final vector of length (300 * 4 words) = 1200 to represent the sentence.
Which one of them is most suitable, specifically for sentence-similarity applications?
The way you describe option (1) makes it sound like each word becomes a single number. That wouldn't work.
The simple approach that's often used is to average all word-vectors for words in the sentence together - so with 300-dimensional word-vectors, you still wind up with a 300-dimensional sentence-average vector. Perhaps that's what you mean by your option (1).
(Sometimes, all vectors are normalized to unit-length before this operation, but sometimes not - because the non-normalized vector lengths can sometimes indicate the strength of a word's meaning. Sometimes, word-vectors are weighted by some other frequency-based indicator of their relative importance, such as TF/IDF.)
I've never seen your option (2) used and don't quite understand what you mean or how it could possibly work.
Your option (3) would be better described as "concatenating the word-vectors". It gives different-sized vectors depending on the number of words in the sentence. Slight differences in word placement, such as comparing "get out of here" and "of here get out", would result in very different vectors, which the usual methods of comparing vectors (like cosine-similarity) would not detect as being 'close' at all. So it doesn't make sense, and I've not seen it used.
So, only your option (1), as properly implemented to (weighted-)average word-vectors, is a good baseline for sentence-similarities.
But, it's still fairly basic and there are many other ways to compare sentences using text-vectors. Here are just a few:
One algorithm closely related to word2vec itself is called 'Paragraph Vectors', and is often called Doc2Vec. It uses a very word2vec-like process to train vectors for full ranges of text (whether they're phrases, sentences, paragraphs, or documents) that work kind of like 'floating document-ID words' over the full text. It sometimes offers a benefit over just averaging word-vectors, and in some modes can produce both doc-vectors and word-vectors that are also comparable to each other.
If your interest isn't just pairwise sentence similarities, but some sort of downstream classification task, then Facebook's 'FastText' refinement of word2vec has a classification mode, where the word-vectors are trained not just to predict neighboring words, but to be good at predicting known text classes, when simply added/averaged together. (Text-vectors constructed from such classification vectors might be good at similarities too, depending on how well the training-classes capture salient contrasts between texts.)
Another way to compute pairwise similarities, using just word-vectors, is "Word Mover's Distance". Rather than averaging all the word-vectors for a text together into a single text-vector, it considers each word-vector as a sort of "pile of meaning". Compared to another sentence, it calculates the minimum routing work (distance along lots of potential word-to-word paths) to move all the "piles" from one sentence into the configuration of another sentence. It can be expensive to calculate, but usually represents sentence-contrasts better than the simple single-vector-summary that naive word-vector averaging achieves.
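If you want to try Word Mover's Distance, gensim exposes it on its word-vectors object. A tiny sketch is below; the toy corpus is only there so the call runs (with real or pretrained vectors the distances are far more meaningful), and note that gensim's wmdistance needs an optimal-transport backend installed (POT in recent versions, pyemd in older ones):

from gensim.models import Word2Vec

toy_sentences = [["get", "out", "of", "here"],
                 ["of", "here", "get", "out"],
                 ["the", "sky", "is", "blue"]]
model = Word2Vec(toy_sentences, vector_size=50, min_count=1, epochs=50)

# Word Mover's Distance between two tokenized texts (lower = more similar).
d_same_words = model.wv.wmdistance(["get", "out", "of", "here"], ["of", "here", "get", "out"])
d_diff_words = model.wv.wmdistance(["get", "out", "of", "here"], ["the", "sky", "is", "blue"])
print(d_same_words, d_diff_words)   # the first should be (near) zero, the second larger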
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec(sentences, vector_size=100, min_count=1)

def sent_vectorizer(sent, model):
    # average the vectors of the in-vocabulary words in a sentence
    sent_vec = np.zeros(model.vector_size)
    numw = 0
    for w in sent:
        if w in model.wv:               # skip out-of-vocabulary words
            sent_vec = np.add(sent_vec, model.wv[w])
            numw += 1
    return sent_vec / numw if numw else sent_vec

X = []
for sentence in sentences:
    X.append(sent_vectorizer(sentence, model))

print("========================")
print(X)
I'm solving a matching problem with two vectors of a class
class matching
{
public:
int n;
char match;
};
This is the algorithm I'm trying to implement:
int augment(vector<matching> &left, vector<matching> &right)
{
while(there's no augmenting path)
if(condition for matching)
<augment>
return "number of matching";
}
For the initial rough matching, if left[i] matches with right[j], then left[i].n = j, left[i].match = 'M', right[j].n = i and right[j].match = 'M'; the unmatched ones have n = -1 and match = 'U'.
While finding the augmenting paths, if one exists for another pair (i, j), then the vertex that becomes unmatched has its match changed from 'M' to 'U' and its n set to -1, while the two vertices matched along the augmenting path have their match changed to 'A' and their n set according to their new partners' indices.
I don't know if this is the right approach; this is my first attempt at maximum matching, and I've read a lot of articles and watched tutorials online, but I can't get my 'code' to function appropriately.
I do not need code, I can write my own. I just want to understand this algorithm step by step. If someone can give me an algorithm like the one I was attempting above, I would appreciate it. Also, if I have been going in the wrong direction, please correct me.
I am not sure if you are finding the augmenting paths correctly. I suggest the following approach.
1. Find an initial matching in a greedy way. To obtain this we travel through every vertex on the left side and greedily try to match it with some free (unmatched) vertex on the right side.
2. Try to find an augmenting path P in the graph. For this we need to do a breadth-first search starting from all the free vertices on the left side and alternating through matched and unmatched edges in the search (i.e. the second level contains all the right-side vertices adjacent to level-1 vertices, the third level contains all the left-side vertices that are matched to level-2 vertices, the fourth level contains all the right-side vertices adjacent to level-3 vertices, etc.). We stop the search when we visit a free vertex in any future level and compute the augmenting path P using the breadth-first search tree computed so far.
3. If we can find an augmenting path P in the previous step: change the matched and unmatched edges in P to unmatched and matched edges respectively, and go to step 2.
Else: the resulting matching obtained is maximum.
This algorithm requires a breadth-first search for every augmentation, so its worst-case complexity is O(nm). Although the Hopcroft-Karp algorithm can perform multiple augmentations per breadth-first search and has a better worst-case complexity, it seems (from the Wikipedia article) that it isn't faster in practice.
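In case it helps to see those steps concretely, here is a rough Python sketch of the whole procedure (greedy start, then repeated BFS augmentation); the adjacency-list layout and helper names are just for illustration, not your matching class:

from collections import deque

def max_bipartite_matching(adj, n_left, n_right):
    # adj[u] = list of right-side vertices adjacent to left-side vertex u
    match_left = [-1] * n_left     # match_left[u] = matched right vertex, or -1 if free
    match_right = [-1] * n_right   # match_right[v] = matched left vertex, or -1 if free

    # Step 1: greedy initial matching
    for u in range(n_left):
        for v in adj[u]:
            if match_right[v] == -1:
                match_left[u], match_right[v] = v, u
                break

    def augment_once():
        # Step 2: BFS from every free left vertex, alternating unmatched/matched edges
        parent = {}                              # right vertex -> left vertex it was reached from
        queue = deque(u for u in range(n_left) if match_left[u] == -1)
        seen_left = set(queue)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in parent:
                    continue
                parent[v] = u
                if match_right[v] == -1:
                    # Step 3: free right vertex reached -- flip the edges back along the path
                    while v != -1:
                        u = parent[v]
                        v_next = match_left[u]   # u's old partner (or -1 if u was free)
                        match_left[u], match_right[v] = v, u
                        v = v_next
                    return True
                w = match_right[v]               # continue through the matched edge
                if w not in seen_left:
                    seen_left.add(w)
                    queue.append(w)
        return False                             # no augmenting path: matching is maximum

    while augment_once():
        pass
    size = sum(1 for u in range(n_left) if match_left[u] != -1)
    return size, match_left, match_right

# Example: left = {0,1,2}, right = {0,1}; only a matching of size 2 exists.
print(max_bipartite_matching([[0], [0, 1], [0]], 3, 2)[0])   # -> 2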
I'm looking for an algorithm, or at least theory of operation on how you would find similar text in two or more different strings...
Much like the question posed here: Algorithm to find articles with similar text, the difference being that my text strings will only ever be a handful of words.
Like say I have a string:
"Into the clear blue sky"
and I'm doing a compare with the following two strings:
"The color is sky blue" and
"In the blue clear sky"
I'm looking for an algorithm that can be used to match the text in the two, and decide how closely they match. In my case, spelling and punctuation variations are going to come up, but I don't want them to affect the ability to discover the real text. In the above example, if the color reference is stored as "sky-blue", I want it to still be able to match. However, the 3rd string listed should be a BETTER match than the second, etc.
I'm sure places like Google probably use something similar with the "Did you mean:" feature...
* EDIT *
While talking with a friend, I learned that he had worked with a guy who wrote a paper on this topic. I thought I might share it with everyone reading this, as there are some really good methods and processes described in it...
Here's the link to his paper; I hope it is helpful to those reading this question and interested in similar-string algorithms.
Levenshtein distance will not completely work, because you want to allow rearrangements. I think your best bet is going to be to find the best rearrangement, with Levenshtein distance as the cost for each word.
Finding the cost of rearrangement is kinda like the pancake sorting problem. So, you can permute every combination of words (filtering out exact matches) against every combination of the other string, trying to minimize a combination of permutation distance and Levenshtein distance on each word pair.
edit:
Now that I have a second I can post a quick example (all 'best' guesses are on inspection and not actually running the algorithms):
original strings | best rearrangement w/ lev distance per word
Into the clear blue sky | Into the c_lear blue sky
The color is sky blue | is__ the colo_r blue sky
R_dist = dist( 3 1 2 5 4 ) --> 3 1 2 *4 5* --> *2 1 3* 4 5 --> *1 2* 3 4 5 = 3
L_dist = (2D+S) + (I+D+S) (Total Substitutions: 2, deletions: 3, insertions: 1)
(notice all the flips include all elements in the range, and I use ranges where Xi - Xj = +/- 1)
Other example
original strings | best rearrangement w/ lev distance per word
Into the clear blue sky | Into the clear blue sky
In the blue clear sky | In__ the clear blue sky
R_dist = dist( 1 2 4 3 5 ) --> 1 2 *3 4* 5 = 1
L_dist = (2D) (Total Substitutions: 0, deletions: 2, insertions: 0)
And to show all possible combinations of the three...
The color is sky blue | The colo_r is sky blue
In the blue clear sky | the c_lear in sky blue
R_dist = dist( 2 4 1 3 5 ) --> *2 3 1 4* 5 --> *1 3 2* 4 5 --> 1 *2 3* 4 5 = 3
L_dist = (D+I+S) + (S) (Total Substitutions: 2, deletions: 1, insertions: 1)
Anyway, if you build the cost function this way, the second choice will have the lowest cost, which is what you expected!
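Here's a rough brute-force sketch of this idea in Python; the inversion count stands in for the flip/permute distance above (not exactly the pancake-flip cost), and swap_weight is a made-up knob for balancing the two costs. It's O(n!) in the number of words, which is only acceptable because the strings are a handful of words:

from itertools import permutations

def levenshtein(a, b):
    # classic dynamic-programming edit distance between two words
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def inversions(order):
    # number of out-of-order pairs, used as a simple rearrangement cost
    return sum(1 for i in range(len(order)) for j in range(i + 1, len(order))
               if order[i] > order[j])

def combined_cost(s, t, swap_weight=1.0):
    ws, wt = s.lower().split(), t.lower().split()
    while len(ws) < len(wt): ws.append("")   # pad so every word gets a partner
    while len(wt) < len(ws): wt.append("")
    best = None
    for perm in permutations(range(len(wt))):
        edits = sum(levenshtein(ws[i], wt[p]) for i, p in enumerate(perm))
        cost = edits + swap_weight * inversions(list(perm))
        best = cost if best is None else min(best, cost)
    return best

base = "Into the clear blue sky"
print(combined_cost(base, "In the blue clear sky"))   # lower cost (better match)
print(combined_cost(base, "The color is sky blue"))   # higher cost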
One way to determine a measure of "overall similarity without respect to order" is to use some kind of compression-based distance. Basically, the way most compression algorithms (e.g. gzip) work is to scan along a string looking for string segments that have appeared earlier -- any time such a segment is found, it is replaced with an (offset, length) pair identifying the earlier segment to use. You can use measures of how well two strings compress to detect similarities between them.
Suppose you have a function string comp(string s) that returns a compressed version of s. You can then use the following expression as a "similarity score" between two strings s and t:
len(comp(s)) + len(comp(t)) - len(comp(s . t))
where . is taken to be concatenation. The idea is that you are measuring how much further you can compress t by looking at s first. If s == t, then len(comp(s . t)) will be barely any larger than len(comp(s)) and you'll get a high score, while if they are completely different, len(comp(s . t)) will be very near len(comp(s)) + len(comp(t)) and you'll get a score near zero. Intermediate levels of similarity produce intermediate scores.
Actually the following formula is even better as it is symmetric (i.e. the score doesn't change depending on which string is s and which is t):
2 * (len(comp(s)) + len(comp(t))) - len(comp(s . t)) - len(comp(t . s))
This technique has its roots in information theory.
Advantages: good compression algorithms are already available, so you don't need to do much coding, and they run in linear time (or nearly so) so they're fast. By contrast, solutions involving all permutations of words grow super-exponentially in the number of words (although admittedly that may not be a problem in your case as you say you know there will only be a handful of words).
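Here's what that looks like with an off-the-shelf compressor (zlib here); note that for strings as short as these, the compressor's fixed overhead dominates, so the absolute scores are small and noisy, but the idea carries over directly to longer texts:

import zlib

def clen(s):
    return len(zlib.compress(s.encode("utf-8")))

def similarity(s, t):
    # symmetric version of the score above: higher = more similar
    return 2 * (clen(s) + clen(t)) - clen(s + t) - clen(t + s)

base = "Into the clear blue sky"
print(similarity(base, "In the blue clear sky"))
print(similarity(base, "The color is sky blue"))
print(similarity(base, base))   # maximal score: the string against itself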
One way (although this is perhaps better suited to a spellcheck-type algorithm) is the "edit distance", i.e., calculating how many edits it takes to transform one string into another. A common technique is found here:
http://en.wikipedia.org/wiki/Levenshtein_distance
You might want to look into the algorithms used by biologists to compare DNA sequences, since they have to cope with many of the same things (chunks may be missing, or have been inserted, or just moved to a different position in the string).
The Smith-Waterman algorithm would be one example that'd probably work fairly well, although it might be too slow for your uses. Might give you a starting point, though.
I had a similar problem: I needed to get the percentage of characters in a string that were similar. It needed exact sequences, so for example "hello sir" and "sir hello", when compared, needed to give me five characters that are the same, in this case the two "hello"s. It would then take the length of the longer of the two strings and give me a percentage of how similar they were. This is the code that I came up with:
#include <string>
using std::string;

int bigger(string a, string b); // forward declaration so compare() can call it

int compare(string a, string b){
    // always pass the longer string first
    return (a.size() > b.size() ? bigger(a,b) : bigger(b,a));
}

int bigger(string a, string b){
    int maxcount = 0, currentcount = 0; // longest / current run of matching characters
    for(size_t i = 0; i < a.size(); ++i){
        currentcount = 0; // restart the run for each new offset into a
        for(size_t j = 0; j < b.size() && i + j < a.size(); ++j){ // stay inside a
            if(a[i+j] == b[j]){
                ++currentcount;
                if(currentcount > maxcount){
                    maxcount = currentcount; // update as we go, so a run ending at the string end still counts
                }//end if
            }
            else{
                currentcount = 0;
            }//end else
        }//end inner for loop
    }//end outer for loop
    return ((int)(((float)maxcount/((float)a.size()))*100));
}
I can't mark two answers here, so I'm going to answer and mark my own. The Levenshtein distance appears to be the correct method in most cases for this. But it is worth mentioning j_random_hacker's answer as well. I have used an implementation of LZMA to test his theory, and it proves to be a sound solution.
In my original question I was looking for a method for short strings (2 to 200 chars), where the Levenshtein distance algorithm will work. But, not mentioned in the question was the need to compare two (larger) strings (in this case, text files of moderate size) and to perform a quick check to see how similar the two are. I believe that this compression technique will work well, but I have yet to study it to find at which point one becomes better than the other, in terms of the size of the sample data and the speed/cost of the operation in question.
I think a lot of the answers given to this question are valuable, and worth mentioning, for anyone looking to solve a similar string ordeal like I'm doing here. Thank you all for your great answers, and I hope they can be used to serve others well too.
There's another way: pattern recognition using convolution. Image A is run through a Fourier transform. Image B also. Superimposing F(A) over F(B) and then transforming this back gives you a black image with a few white spots. Those spots indicate where A matches B strongly. The total sum of the spots would indicate an overall similarity. Not sure how you'd run an FFT on strings, but I'm pretty sure it would work.
The difficulty would be to match the strings semantically.
You could generate some kind of value based on the lexical properties of the string, e.g. they both have "blue" and "sky", and they're in the same sentence, etc. But it won't handle cases like "Sky's jean is blue", or some other oddball English construction that uses the same words; for that you'd need to parse the English grammar...
To do anything beyond lexical similarity, you'd need to look at natural language processing, and there isn't going to be one single algorithm that will solve your problem.
Possible approach:
Construct a Dictionary with a string key of "word1|word2" for all combinations of words in the reference string. A single combination may happen multiple times, so the value of the Dictionary should be a list of numbers, each representing the distance between the words in the reference string.
When you do this, there will be duplication here: for every "word1|word2" dictionary entry, there will be a "word2|word1" entry with the same list of distance values, but negated.
For each combination of words in the comparison string (words 1 and 2, words 1 and 3, words 2 and 3, etc.), check the two keys (word1|word2 and word2|word1) in the reference string and find the closest value to the distance in the current string. Add the absolute value of the difference between the current distance and the closest distance to a counter.
If the closest reference distance between the words is in the opposite direction (word2|word1) as the comparison string, you may want to weight it smaller than if the closest value was in the same direction in both strings.
When you are finished, divide the sum by the square of the number of words in the comparison string.
This should provide some decimal value representing how closely each word/phrase matches some word/phrase in the original string.
Of course, if the original string is longer, it won't account for that, so it may be necessary to compute this both directions (using one as the reference, then the other) and average them.
I have absolutely no code for this, and I probably just re-invented a very crude wheel. YMMV.
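For what it's worth, here is a rough Python sketch of the approach described above, as I read it; the miss_penalty for word pairs that never appear in the reference is a made-up knob, and lower scores mean closer matches:

from collections import defaultdict
from itertools import combinations

def pair_distances(words):
    # "word1|word2" -> list of distances between the two words; each pair is
    # stored in both orders, with the distance negated for the reversed key
    d = defaultdict(list)
    for (i, w1), (j, w2) in combinations(enumerate(words), 2):
        d[w1 + "|" + w2].append(j - i)
        d[w2 + "|" + w1].append(i - j)
    return d

def phrase_distance(reference, comparison, miss_penalty=5.0):
    ref = pair_distances(reference.lower().split())
    words = comparison.lower().split()
    total = 0.0
    for (i, w1), (j, w2) in combinations(enumerate(words), 2):
        seen = ref.get(w1 + "|" + w2)
        if seen:
            # closest recorded distance; reversed-order occurrences are stored as
            # negative values, so they naturally sit farther from the (positive)
            # comparison distance and cost more
            total += min(abs((j - i) - c) for c in seen)
        else:
            total += miss_penalty
    return total / (len(words) ** 2)

ref = "Into the clear blue sky"
print(phrase_distance(ref, "In the blue clear sky"))   # smaller (closer)
print(phrase_distance(ref, "The color is sky blue"))   # larger (farther)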