Handling OOV words in GoogleNews-vectors-negative300.bin - word2vec

I need to calculate the word vectors for each word of a sentence that is tokenized as follows:
['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin'].
If I were using the pretrained [fastText][1] embeddings cc.en.300.bin.gz from Facebook, I could handle OOV words. However, when I use Google's word2vec from GoogleNews-vectors-negative300.bin, it raises a KeyError. My question is: how do we calculate the word vectors for words that are OOV? I searched online but could not find anything. Of course, one way to do this is to remove all the sentences that contain words not listed in Google's word2vec, but I noticed only 5550 out of 16134 sentences have all of their words in the embedding.
I also tried:
model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/My Drive/Colab Notebooks/GoogleNews-vectors-negative300.bin', binary=True)
model.train(sentences_with_OOV_words)
However, TensorFlow 2 returns an error.
Any help would be greatly appreciated.

If a word is not found in the vocabulary, initialize it with a zero vector of the same size (the GoogleNews word2vec vectors have 300 dimensions):
import numpy as np

try:
    # the object returned by load_word2vec_format() is a KeyedVectors instance
    word_vector = model.get_vector('your_word_here')
except KeyError:
    # out-of-vocabulary word: fall back to a 300-dimensional zero vector
    word_vector = np.zeros((300,))

The GoogleNews vector set is a plain mapping of words to vectors. There's no facility in it (or the algorithms that created it) for synthesizing vectors for unknown words.
(Similarly, if you load a plain vector-set into gensim as a KeyedVectors, there's no opportunity to run train() on the resulting object, as you show in your question code. It's not a full trainable model, just a collection of vectors.)
You can check if a word is available, using the in keyword. As other answers have noted, you can then choose to use some plug value (such as an all-zeros vector) for such words.
But it's often better to just ignore such words entirely – pretend they're not even in your text. (Using a zero-vector instead, then feeding that zero-vector into other parts of your system, can make those unknown-words essentially dilute the influence of other nearby word-vectors – which often isn't what you want.)
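For example, a minimal sketch of the ignore-OOV approach, assuming model is the loaded KeyedVectors and a recent gensim where the in operator checks vocabulary membership directly:
words = ['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin']
# keep only words the GoogleNews vectors actually contain, then look each one up
known = [w for w in words if w in model]
vectors = [model[w] for w in known]  # one 300-dimensional vector per known word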

Awesome! Thank you very much.
def get_vectorOOV(s):
    try:
        return np.array(model.get_vector(s))
    except KeyError:
        return np.zeros((300,))

Related

Word2vec build vocab adds TM to words

I'm trying to convert my text data to vectors. I would like to transform the word ultraram to a vector. I added the word to the model using model.build_vocab, but only ultraram™ is added. What did I do wrong?
model.save("word2vec.model2")
model = Word2Vec.load("word2vec.model2")
model.build_vocab(data_tokenized, update=True)
# Store just the words + their trained embeddings.
word_vectors = model.wv
word_vectors.save("word2vec.wordvectors2")
# Load back with memory-mapping = read-only, shared across processes.
self.wv = KeyedVectors.load("word2vec.wordvectors2", mmap='r')
for i in self.wv.key_to_index:
    if "ultrar" in i:
        print(i)
ultraram™
manufactureultraram™
ultrarobust
ultrarare
ultrarealistic
ultrarelativistic
It shows some words with a ™. What does this mean, and how can I add the word "ultraram" without the ™?
If there's a ™ at the end of some tokens, then those tokens, with the ™, were exactly what had been passed into the model when its vocabulary was first established.
If you don't want them, you'd have to strip them during your tokenization. (Your current question/code doesn't show how you might have tokenized your data.)
Separately:
Directly using .load() to replace a Word2Vec model's existing KeyedVectors won't in general be reliable: a Word2Vec isn't expecting that to change separate from its own initialization/training.
It may work in this limited case – reloading exactly the same word-vectors as were just saved – but in such a case, it's unclear why you'd want to do it. From the comment it seems the motivation here might be to save some memory. However, if you're only looking up word-vectors, you don't need the full Word2Vec model at all. You can just use the set of KeyedVectors alone (for even more memory savings).
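As a sketch of the tokenization fix (assuming your tokens arrive as plain Python strings in data_tokenized, as in the question's code), you could strip the trademark sign before build_vocab() ever sees it:
# remove the '™' character (U+2122) from every token before building the vocabulary
def strip_tm(tokens):
    return [t.replace('\u2122', '') for t in tokens]

data_tokenized_clean = [strip_tm(sentence) for sentence in data_tokenized]
model.build_vocab(data_tokenized_clean, update=True)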

What are the ways of Key-Value extraction from unstructured text?

I'm trying to figure out what the ways are (and which of them is the best one) to extract Values for predefined Keys from unstructured text.
Input:
The doctor prescribed me a drug called favipiravir.
His name is Yury.
Ilya has already told me about that.
The weather is cold today.
I am taking a medicine called nazivin.
Key list: ['drug', 'name', 'weather']
Output:
['drug=favipiravir', 'drug=nazivin', 'name=Yury', 'weather=cold']
So, as you can see, in the 3rd sentence there is no explicit key 'name' and therefore no value is extracted (I think this is the difference from NER). At the same time, 'drug' and 'medicine' are synonyms, so we should treat 'medicine' as the 'drug' key and extract its value as well.
And the next question: what if the key set is mutable?
Should I use a regexp approach as the base because of the predefined Keys, or is there a way to implement it with supervised learning/NNs? (But in that case, how do I deal with mutable keys?)
You can use a parser to tag words. Your problem is similar to Named Entity Recognition (NER). A lot of libraries, like NLTK in Python, have POS taggers available. You can try those. They are generally trained to identify names, locations, etc. Depending on the type of words you need, you may need to train the tagger yourself, so you'll also need some labeled data. Check out this link:
https://nlp.stanford.edu/software/CRF-NER.html
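As a starting point, here is a minimal sketch using NLTK's off-the-shelf POS tagger and named-entity chunker (the sentence comes from the question; mapping the tags back to your key list is left out and would need your own matching logic):
import nltk

# one-time downloads of the tokenizer, tagger, and chunker models
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

sentence = "The doctor prescribed me a drug called favipiravir."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)   # part-of-speech tags, e.g. ('drug', 'NN')
tree = nltk.ne_chunk(tagged)    # named-entity chunks such as PERSON or GPE
print(tagged)
print(tree)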

Why Word2Vec's most_similar() function is giving senseless results on training?

I am running the gensim word2vec code on a corpus of resumes (stopwords removed) to identify similar context words in the corpus from a list of pre-defined keywords.
Despite several iterations with input parameters, stopword removal, etc., the similar context words do not make sense at all (in terms of distance or context).
E.g. correlation and matrix occur in the same window several times, yet matrix doesn't fall in the most_similar results for correlation.
Following are the details of the system and code:
gensim 2.3.0, running on Python 2.7 (Anaconda)
Training resumes: 55,418 sentences
Average words per sentence: 3-4 words (post stopword removal)
Code:
wordvec_min_count=int()
size = 50
window=10
min_count=5
iter=50
sample=0.001
workers=multiprocessing.cpu_count()
sg=1
bigram = gensim.models.Phrases(sentences, min_count=10, threshold=5.0)
trigram = gensim.models.Phrases(bigram[sentences], min_count=10, threshold=5.0)
model=gensim.models.Word2Vec(sentences = trigram[sentences], size=size, alpha=0.005, window=window, min_count=min_count,max_vocab_size=None,sample=sample, seed=1, workers=workers, min_alpha=0.0001, sg=sg, hs=1, negative=0, cbow_mean=1,iter=iter)
model.wv.most_similar('correlation')
Out[20]:
[(u'rankings', 0.5009744167327881),
(u'salesmen', 0.4948525130748749),
(u'hackathon', 0.47931140661239624),
(u'sachin', 0.46358123421669006),
(u'surveys', 0.4472047984600067),
(u'anova', 0.44710394740104675),
(u'bass', 0.4449636936187744),
(u'goethe', 0.4413239061832428),
(u'sold', 0.43735259771347046),
(u'exceptional', 0.4313117265701294)]
I am lost as to why the results are so random. Is there any way to check the accuracy for word2vec?
Also, is there an alternative to word2vec for the most_similar() function? I read about GloVe but was not able to install the package.
Any information in this regard would be helpful.
Enable INFO-level logging and make sure that it indicates real training is happening. (That is, you see incremental progress taking time over the expected number of texts, over the expected number of iterations.)
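For example, with Python's standard logging module (which gensim uses for its progress messages):
import logging

# gensim reports vocabulary-scan and training progress at INFO level
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)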
You may be hitting this open bug issue in Phrases, where requesting the Phrase-promotion (as with trigram[sentences]) only offers a single-iteration, rather than the multiply-iterable collection object that Word2Vec needs.
Word2Vec needs to pass over the corpus once for vocabulary-discovery, then iter times again for training. If sentences or the phrasing-wrappers only support single-iteration, only the vocabulary will be discovered – training will end instantly, and the model will appear untrained.
As you'll see in that issue, a workaround is to perform the Phrases-transformation and save the results into an in-memory list (if it fits) or to a separate text corpus on disk (that's already been phrase-combined). Then, use a truly restartable iterable on that – which will also save some redundant processing.
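A minimal sketch of the in-memory workaround, reusing the trigram Phrases object and training parameters from the question's code:
# materialize the phrase-combined corpus so Word2Vec can iterate over it repeatedly
phrased_sentences = list(trigram[sentences])

model = gensim.models.Word2Vec(sentences=phrased_sentences, size=size, alpha=0.005,
                               window=window, min_count=min_count, sample=sample,
                               workers=workers, sg=sg, hs=1, negative=0, iter=iter)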

word2vec guesing word embeddings

Can word2vec be used for guessing words from just context?
Having trained the model with a large data set (e.g. Google News), how can I use word2vec to predict a similar word with only context, e.g. with the input ", who dominated chess for more than 15 years, will compete against nine top players in St Louis, Missouri."? The output should be Kasparov or maybe Carlsen.
I've only seen the similarity APIs, but I can't make sense of how to use them for this. Is this not how word2vec was intended to be used?
It is not the intended use of word2vec. The word2vec algorithm internally tries to predict exact words, using surrounding words, as a roundabout way to learn useful vectors for those surrounding words.
But even so, it's not forming exact predictions during training. It's just looking at a single narrow training example – context words and target word – and performing a very simple comparison and internal nudge to make its conformance to that one example slightly better. Over time, that self-adjusts towards useful vectors – even if the predictions remain of wildly-varying quality.
Most word2vec libraries don't offer a direct interface for showing ranked predictions, given context words. The Python gensim library, for the last few versions (as of current version 2.2.0 in July 2017), has offered a predict_output_word() method that roughly shows what the model would predict, given context-words, for some training modes. See:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.predict_output_word
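For example, a rough sketch (predict_output_word() only works for models trained with negative sampling, and the context below is just the query's tokens after minimal cleanup):
context = ['who', 'dominated', 'chess', 'for', 'more', 'than', '15', 'years',
           'will', 'compete', 'against', 'nine', 'top', 'players',
           'in', 'st', 'louis', 'missouri']

# returns a list of (word, probability) pairs for the most likely center words
print(model.predict_output_word(context, topn=10))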
However, considering your fill-in-the-blank query (also called a 'cloze deletion' in related education or machine-learning contexts):
_____, who dominated chess for more than 15 years, will compete against nine top players in St Louis, Missouri
A vanilla word2vec model is unlikely to get that right. It has little sense of the relative importance of words (except when some words are more narrowly predictive of others). It has no sense of grammar/ordering, or of the compositional meaning of connected phrases (like 'dominated chess' as opposed to the separate words 'dominated' and 'chess'). Even though words describing the same sorts of things are usually near each other, it doesn't know categories well enough to determine that the blank must be a 'person' and a 'chess player', and the fuzzy similarities of word2vec don't guarantee that words of a class will necessarily all be nearer to each other than to other words.
There has been a bunch of work to train word/concept vectors (aka 'dense embeddings') to be better at helping at such question-answering tasks. A random example might be "Creating Causal Embeddings for Question Answering with Minimal Supervision" but queries like [word2vec question answering] or [embeddings for question answering] will find lots more. I don't know of easy out-of-the-box libraries for doing this, with or without a core of word2vec, though.

Custom word weights for sentences when calling h2o transform and word2vec, instead of straight AVERAGE of words

I am using H2O machine learning package to do natural language predictions, including the functions h2o.word2vec and h2o.transform. I need sentence level aggregation, which is provided by the AVERAGE parameter value:
h2o.transform(word2vec, words, aggregate_method = c("NONE", "AVERAGE"))
However, in my case I strongly wish to avoid equal weighting of, for example, "the" and "platypus".
Here's a scheme I concocted to achieve custom word-weightings. If H2O's word2vec "AVERAGE" option uses all the words including duplicates that might appear, then I could effect a custom word weighting when calling h2o.transform by adding additional duplicates of certain words to my sentences, when I want to weight them more heavily than other words.
Can any H2O experts confirm that the word2vec AVERAGE parameter uses all the words, rather than just the unique words, when computing the AVERAGE of the words in a sentence?
Alternatively, is there a better way? I tried, but I find myself unable to imagine any correct math that would multiply the sentence average by some factor after it has already been computed.
Yes, h2o.transform will consider each occurrence of a word for the averaging, not just the unique words. Your trick will therefore work.
There is currently no direct way to provide user-defined weights. You could probably do an ugly hack and weight the word embeddings directly, but that wouldn't be a straightforward solution I could recommend.
We can add this feature to H2O. I would love to hear what API would work for you (how would you like to provide the weights).
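Outside of H2O, the math for a custom-weighted sentence vector is simply a weighted mean of the per-word vectors. A minimal numpy sketch, where word_vecs and weights are hypothetical placeholders for however you obtain the embeddings and choose the weights:
import numpy as np

# hypothetical inputs: one embedding per word, and one weight per word
word_vecs = {'the': np.random.rand(100), 'platypus': np.random.rand(100)}
weights = {'the': 0.1, 'platypus': 1.0}

sentence = ['the', 'platypus']
vecs = np.array([word_vecs[w] for w in sentence])
w = np.array([weights[w] for w in sentence])

# weighted mean: sum(w_i * v_i) / sum(w_i)
sentence_vec = (vecs * w[:, None]).sum(axis=0) / w.sum()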