How do you handle positive and negative samples when they are the same in word2vec's negative sampling? - word2vec

I have a question about doing negative sampling in word2vec.
To frame it as a binary classification problem, I understand that words around the target word are labeled positive and words outside that range are labeled negative. But there are far too many negative words, so negative sampling draws them according to the frequency with which words appear in the entire document.
In this case, if a word that is already a positive example also gets drawn as a negative sample, how is that handled?
For example, for the target word "love" in "I love pizza", (love, pizza) should get a positive label. But isn't it possible for (love, pizza) to also end up with a negative label through negative sampling later on in the same sentence?

The negative-sampling isn't done based on words elsewhere in the sentence, but on the word-frequencies across the entire training corpus.
That is, while the observed neighbors become positive examples, the negative examples are chosen at random from the entire distribution of all known words.
These negative examples might even, by bad luck, be the same as other nearby in-context-window intended-positive words! But this isn't fatal: across all training examples, all passes, and all negative-samples drawn, the net effect of training remains roughly the same. The positive examples have their predictions very often up-weighted, the negative examples down-weighted, and in the occasional case where a negative example is also a true neighbor, that effect "washes out" among all the other (non-biasing) errors, still leaving that real neighbor net-reinforced over the full training run.
So: you're right to note that the process seems a little sloppy, but it all evens out in the end.
FWIW, the word2vec implementations with which I'm most familiar do check if the negative-sample is exactly the positive word we're trying to predict, but don't do any other checking against the whole current window.
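For a rough illustration of that, here's a toy sketch (not any particular implementation's exact code, and the word counts are made up) of frequency-based negative sampling that only excludes the current positive word:
```
import numpy as np

# hypothetical corpus word counts; a real implementation tallies these over the whole corpus
counts = {"i": 50, "love": 10, "pizza": 8, "the": 120, "dog": 15}
words = list(counts)
weights = np.array([counts[w] ** 0.75 for w in words])   # word2vec's usual smoothing exponent
weights /= weights.sum()
rng = np.random.default_rng(0)

def draw_negatives(positive_word, k=5):
    """Draw k negatives by (smoothed) corpus frequency, skipping only the current positive word."""
    negatives = []
    while len(negatives) < k:
        candidate = rng.choice(words, p=weights)
        if candidate != positive_word:   # the only exclusion typically applied
            negatives.append(candidate)
    return negatives

# 'pizza' may still be drawn here, even though (love, pizza) was a positive pair in the sentence
print(draw_negatives("love"))
```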

Related

Why is '[UNK]' word the first in word2vec vocabulary?

If the vocabulary is ordered from the most frequent word to the least frequent, placing '[UNK]' at the beginning implies that it occurs most often. But what if '[UNK]' isn't the most frequent word? Should I put it somewhere else in the vocabulary, according to its frequency?
I ran into this issue while following this tutorial -> https://www.tensorflow.org/tutorials/text/word2vec
When I do negative sampling using the function tf.random.log_uniform_candidate_sampler, the negative samples with low token IDs (e.g. 0, 1, 2, ...) are sampled most often. If '[UNK]' is first (or second, when using padding) in the vocabulary, meaning it has token ID 0 (or 1 with padding), then '[UNK]' will be heavily sampled as a negative sample. If '[UNK]' occurs a lot, there is no problem, but what if it doesn't? Then it should receive a higher token ID, shouldn't it?
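For what it's worth, it's easy to see how skewed that sampler is toward low token IDs (the vocabulary size below is made up, not the tutorial's):
```
import tensorflow as tf

vocab_size = 4096   # hypothetical; the tutorial builds its own vocabulary
sampled, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=tf.constant([[10]], dtype=tf.int64),   # an arbitrary positive word ID
    num_true=1,
    num_sampled=1000,
    unique=False,
    range_max=vocab_size,
    seed=42)

# count how often each ID was drawn: token 0 (whatever word sits there) dominates
ids, _, counts = tf.unique_with_counts(sampled)
print(sorted(zip(counts.numpy(), ids.numpy()), reverse=True)[:5])
```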

converting a sentence to an embedding representation

If I have a sentence, e.g. "get out of here",
and I want to use word2vec embeddings to represent it, I have found three different ways to do that:
1- for each word, compute the average of its embedding vector, so each word is replaced by a single value.
2- as in 1, but using the standard deviation of the embedding vector's values.
3- or by appending the embedding vectors as they are. So if I use a 300-length embedding vector, for the above example I will end up with a final vector of length 1200 (300 × 4 words) to represent the sentence.
Which of them is most suitable, specifically for sentence-similarity applications?
The way you describe option (1) makes it sound like each word becomes a single number. That wouldn't work.
The simple approach that's often used is to average all word-vectors for words in the sentence together - so with 300-dimensional word-vectors, you still wind up with a 300-dimensional sentence-average vector. Perhaps that's what you mean by your option (1).
(Sometimes, all vectors are normalized to unit-length before this operation, but sometimes not - because the non-normalized vector lengths can sometimes indicate the strength of a word's meaning. Sometimes, word-vectors are weighted by some other frequency-based indicator of their relative importance, such as TF/IDF.)
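As a rough sketch of those two variants, working on whatever word-vectors and importance weights (e.g. TF/IDF scores) you already have; the function name is just illustrative:
```
import numpy as np

def average_vectors(vectors, weights=None, normalize=True):
    """Weighted average of word-vectors; optionally unit-normalize each vector first."""
    vectors = np.asarray(vectors, dtype=float)
    if normalize:
        vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    if weights is None:
        weights = np.ones(len(vectors))          # plain, unweighted average
    return np.average(vectors, axis=0, weights=weights)
```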
I've never seen your option (2) used and don't quite understand what you mean or how it could possibly work.
Your option (3) would be better described as "concatenating the word-vectors". It gives different-sized vectors depending on the number of words in the sentence. Slight differences in word placement, such as comparing "get out of here" and "of here get out", would result in very different vectors, that usual methods of comparing vectors (like cosine-similarity) would not detect as being 'close' at all. So it doesn't make sense, and I've not seen it used.
So, only your option (1), as properly implemented to (weighted-)average word-vectors, is a good baseline for sentence-similarities.
But, it's still fairly basic and there are many other ways to compare sentences using text-vectors. Here are just a few:
One algorithm closely related to word2vec itself is called 'Paragraph Vectors', and is often called Doc2Vec. It uses a very word2vec-like process to train vectors for full ranges of text (whether they're phrases, sentences, paragraphs, or documents) that work kind of like 'floating document-ID words' over the full text. It sometimes offers a benefit over just averaging word-vectors, and in some modes can produce both doc-vectors and word-vectors that are also comparable to each other.
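A minimal gensim sketch of that, with a placeholder corpus and arbitrary parameters:
```
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = ["get out of here", "get out of town"]   # placeholder corpus
corpus = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(sentences)]
model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=40)

# infer a vector for a new (or existing) sentence, then compare vectors as usual
vec = model.infer_vector("get out of here".split())
```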
If your interest isn't just pairwise sentence similarities, but some sort of downstream classification task, then Facebook's 'FastText' refinement of word2vec has a classification mode, where the word-vectors are trained not just to predict neighboring words, but to be good at predicting known text classes, when simply added/averaged together. (Text-vectors constructed from such classification vectors might be good at similarities too, depending on how well the training-classes capture salient contrasts between texts.)
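With Facebook's fasttext package, the supervised mode looks roughly like this (the training-file name and labels are hypothetical):
```
import fasttext

# each line of train.txt looks like: "__label__positive I love this pizza place"
model = fasttext.train_supervised(input="train.txt", dim=100, epoch=25)

print(model.predict("get out of here"))               # predicted label(s) plus probability
vec = model.get_sentence_vector("get out of here")    # averaged text vector, usable for similarity too
```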
Another way to compute pairwise similarities, using just word-vectors, is "Word Mover's Distance". Rather than averaging all the word-vectors for a text together into a single text-vector, it considers each word-vector as a sort of "pile of meaning". Compared to another sentence, it calculates the minimum routing work (distance along lots of potential word-to-word paths) to move all the "piles" from one sentence into the configuration of another sentence. It can be expensive to calculate, but usually represents sentence-contrasts better than the simple single-vector-summary that naive word-vector averaging achieves.
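With gensim word-vectors, Word Mover's Distance is available directly; a toy sketch (note it needs the optional POT/pyemd dependency installed):
```
from gensim.models import Word2Vec

toy_corpus = [["get", "out", "of", "here"], ["please", "leave", "this", "place"]]
model = Word2Vec(toy_corpus, vector_size=50, min_count=1)

# lower distance = more similar; arguments are token lists, not raw strings
print(model.wv.wmdistance(["get", "out", "of", "here"],
                          ["please", "leave", "this", "place"]))
```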
For example, plain averaging of gensim word-vectors looks like this (here `sentences` is the same list of tokenized sentences used to train the model):
```
from gensim.models import Word2Vec
import numpy as np

model = Word2Vec(sentences, vector_size=100, min_count=1)

def sent_vectorizer(sent, model):
    """Average the vectors of all in-vocabulary words in `sent`."""
    sent_vec = np.zeros(model.vector_size)
    numw = 0
    for w in sent:
        try:
            sent_vec = np.add(sent_vec, model.wv[w])   # gensim 4.x: vectors live on model.wv
            numw += 1
        except KeyError:
            pass                                       # skip out-of-vocabulary words
    return sent_vec / numw if numw else sent_vec

X = []
for sentence in sentences:
    X.append(sent_vectorizer(sentence, model))

print("========================")
print(X)
```

Understanding SpamAssassin HK_RANDOM regex

SpamAssassin has several rules that attempt to detect "random looking" values. For example:
/^(?!(?:mail|bounce)[_.-]|[^#]*(?:[+=^~\#]|mcgr|kpmg|nlpbr|ndqv|lcgc|cplpr|-mailer#)|[^#]{26}|.*?#.{0,20}\bcmp-info\.com$)[^#]*(?:[bcdfgjklmnpqrtvwxz]{5}|[aeiouy]{5}|([a-z]{1,2})(?:\1){3})/mi
I understand that the first part of the regex prevents certain cases from matching:
(?!(?:mail|bounce)[_.-]|[^#]*(?:[+=^~\#]|mcgr|kpmg|nlpbr|ndqv|lcgc|cplpr|-mailer#)|[^#]{26}|.*?#.{0,20}\bcmp-info\.com$)
However, I am not able to understand how the second part detects "randomness". Any help would be greatly appreciated!
/[^#]*(?:[bcdfgjklmnpqrtvwxz]{5}|[aeiouy]{5}|([a-z]{1,2})(?:\1){3})/mi
It will match strings containing 5 consecutive consonants (excluding h and s for some reason):
[bcdfgjklmnpqrtvwxz]{5}
or 5 consecutive vowels:
[aeiouy]{5}
or the same letter or couple of letters repeated 3 times (present 4 times):
([a-z]{1,2})(?:\1){3}
Here are a few examples of strings it will match:
somethingmkfkgkmsomething
aiaioe
totototo
aaaa
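A quick Python check of those examples against just the "randomness-detecting" part of the rule:
```
import re

pattern = re.compile(
    r"[^#]*(?:[bcdfgjklmnpqrtvwxz]{5}|[aeiouy]{5}|([a-z]{1,2})(?:\1){3})",
    re.IGNORECASE | re.MULTILINE)

for s in ["somethingmkfkgkmsomething", "aiaioe", "totototo", "aaaa", "ordinary words"]:
    print(s, "->", bool(pattern.search(s)))   # only the last one fails to match
```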
It obviously can't detect true randomness, but it can identify patterns that rarely occur in meaningful strings and flag them as looking random.
It is also possible that these patterns were constructed "from experience", after analysis of a number of emails crafted by spammers, and actually reflect the algorithms behind the tools these spammers use, or the process they follow to create these emails (e.g. some degree of keyboard mashing?).
The bottom line is that you can't detect randomness from a single piece of data. What you can do, however, is try to detect purpose, and if you don't find any, assume that to the best of your knowledge it is random. SpamAssassin assumes a few rules about human communication (which fit different languages better or worse: as is, it will flag a few forms of the French imperfect tense such as "échouaient"), and if the content doesn't match them, it reports it as "random".

Machine Learning Datasets - Negative vs Positive Vocabulary Dataset

I'm looking for a way to develop a ML dataset which compares positive vs negative vocabulary. For example "is valid" vs "not valid" or "can be used" vs "can't be used" or "not on Thursdays" vs "on Thursdays" would be the positive vs negative. It can be simplified by determining if the adverb is positive or negative. I was wondering if there are any available datasets for this or any existing solutions.
There are some sentiment dictionaries you can use.
Automated sentiment analysis is an application of text analytics techniques for the identification of subjective opinions in text data. It normally involves the classification of text into categories such as “positive”, “negative” and in some cases “neutral” [Source]
WordStat Sentiment Dictionary 1.2
Loughran and McDonald Financial Sentiment Dictionary
To create a dataset
Search for articles that debate some point; there you will find plenty of positive and negative sentences. To start with, pick small paragraphs and check the accuracy of your algorithm manually.
Solution
Start with a very basic approach: search for the keyword "not", then go for contractions like "can't", "won't", etc. Then check what you missed.
Now you can go for a more sophisticated approach, as in the sketch below. Take the sentence "I took precautions with the equipment, it won't harm me." It conveys a positive sense. The thing you should look for is "won't harm": "won't" is a negative word and "harm" is also a negative word, but combining the two gives a positive effect.
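Here's a very rough sketch of that escalating keyword approach; the word lists are illustrative only, not taken from any published dataset:
```
# naive polarity check: count negators and negative words, treat "double negatives"
# like "won't harm" as flipping back to positive
NEGATORS = {"not", "no", "never", "can't", "won't", "cannot", "don't", "isn't"}
NEGATIVE_WORDS = {"harm", "damage", "fail", "invalid", "broken"}

def rough_polarity(sentence):
    tokens = sentence.lower().replace(",", " ").replace(".", " ").split()
    signals = sum(t in NEGATORS or t in NEGATIVE_WORDS for t in tokens)
    # an even number of negative signals reads as positive, an odd number as negative
    return "positive" if signals % 2 == 0 else "negative"

print(rough_polarity("I took precautions with the equipment, it won't harm me."))  # positive
print(rough_polarity("It is not valid."))                                          # negative
```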

c/c++ produce different string(s) from substring variations

Suppose we have input strings of the form "I have/had alot/none/zero money".
I would like to have a set of output strings as follows (example 1):
I have alot money
I have none money
I have zero money
I had alot money
I had none money
I had zero money
But the real task here is to be able to choose one or more, or none, of the input substrings to ignore. So the output strings would look like this:
I money
or
first example
or
I alot money
I none money
I zero money
or
I
or
money
I hope you get the point.
How can I do this in the way that is friendliest to CPU cycles?
OK, to break the ice, this is what I'm not willing to do, but am considering until brighter ideas come along:
generate all the output strings (as in example 1 above);
iterating through the strings, filter out the ones that meet my criteria, replacing unwanted substrings with "";
put the resulting string into the final output array only if it is not already there.
Also, the answer to why I care about CPU cycles is simple: the longer this task takes, the longer it will block the worker thread.
The simplest way is to find all the spaces and '/' characters and put the individual words into a two-level list; you then get a structure like this:
I
have, had
a lot, none, zero
money
Now you just loop through the tree and generate a result string.
To push it to overkill, you could also write a token parser instead of doing strchr.
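The question asks for C/C++, but here is a quick Python sketch of that two-level-list idea, just to make the structure concrete (the function and parameter names are made up):
```
from itertools import product

def expand(template, skippable=()):
    # two-level list: one inner list of alternatives per word position
    slots = []
    for i, word in enumerate(template.split()):
        alternatives = word.split("/")
        if i in skippable:
            alternatives.append("")          # allow this slot to be ignored entirely
        slots.append(alternatives)
    # walk the "tree": one output string per combination of alternatives
    results = {" ".join(w for w in combo if w) for combo in product(*slots)}
    return sorted(results)

print(expand("I have/had alot/none/zero money"))
# making slots 1 and 2 skippable also yields "I money", "I alot money", ...
print(expand("I have/had alot/none/zero money", skippable=(1, 2)))
```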
How can I do this in the way that is friendliest to CPU cycles?
Why do you care?