How can I get a vector from the output matrix in FastText? - word2vec

In this study, the authors found that Word2Vec generates two kinds of embeddings (IN & OUT):
https://arxiv.org/abs/1602.01137
Well, you can easily get that using the syn1 attribute in gensim's word2vec. But in the case of gensim's fastText, syn1 does exist; however, since fastText is subword-based, it's not possible to get a vector for a word from the output matrix by matching indexes. Do you know any other way to calculate a word's vector using the output matrix?

In FastText, the vector for a word is the combination of:
the full-word vector, if it exists; and
all the subword vectors
You can view the gensim method that returns a vector, composed from subwords if necessary, at:
https://github.com/RaRe-Technologies/gensim/blob/2ccc82bf50bcfbee44932c160db076a873cf893e/gensim/models/keyedvectors.py#L1970
(I think this gensim method might have a bug, in comparison to the original FastText approach: perhaps it should also add the subword vectors to the whole-word vector, even when a whole-word vector is available.)
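As a rough, hypothetical illustration of that composition (not the linked gensim code itself), here is a minimal sketch assuming a gensim 4.x FastText model named model; ft_ngram_hashes is gensim's helper for mapping character n-grams to subword buckets:
import numpy as np
from gensim.models.fasttext import ft_ngram_hashes  # maps char n-grams to bucket indexes

def compose_input_vector(model, word):
    # Hypothetical helper mirroring the composition described above, for the
    # *input* vectors; attribute names assume gensim 4.x.
    wv = model.wv
    if word in wv.key_to_index:
        # a whole-word vector exists, so return it directly
        return wv.vectors_vocab[wv.key_to_index[word]]
    # otherwise, average the bucketed subword (character n-gram) vectors
    hashes = ft_ngram_hashes(word, wv.min_n, wv.max_n, wv.bucket)
    if not hashes:
        raise KeyError(f"no usable ngrams for {word!r}")
    return np.mean(wv.vectors_ngrams[hashes], axis=0)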

Related

Finding word similarities along the output or input vectors using gensim word2vec?

I know that you can use model.wv.most_similar(...) to get words by cosine similarity in gensim.
I also know gensim gives you input and output vectors in e.g. model.syn0 and model.syn1neg.
Is there a simple way to calculate cosine similarity and create a list of the most similar words using only the input or output vectors, one or the other? E.g., I want to try doing it using just the output vectors.
There's no built-in facility, but I believe you could achieve it by creating a separate KeyedVectors instance where you replace the usual ('input projection'/'syn0') vector-array with the same-sized output array (as exists in negative-sampling models).
Roughly the following may work:
full_w2v_model = ... # whatever training/loading is necessary
full_w2v_model.wv.save_word2vec_format(PATH) # saves just the word-vectors
out_vecs = KeyedVectors.load_word2vec_format(PATH) # reloads as separate object
out_vecs.vectors = full_w2v_model.syn1neg # clobber raw vecs with model's output layer
(Let me know if this works as-is or needs some further touch-up to work!)
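A hypothetical usage sketch of the resulting object, assuming the swap above works:
# 'apple' is just a placeholder word assumed to be in the vocabulary;
# similarities should now be computed over the output-layer (syn1neg) rows
print(out_vecs.most_similar('apple', topn=5))
(If out_vecs had already cached unit-normed vectors before the swap, those caches would also need to be refreshed.)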

Handling OOV words in GoogleNews-vectors-negative300.bin

I need to calculate the word vectors for each word of a sentence that is tokenized as follows:
['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin'].
If I were using the pretrained [fastText][1] embeddings cc.en.300.bin.gz from Facebook, I could handle OOV words. However, when I use Google's word2vec from GoogleNews-vectors-negative300.bin, it raises a KeyError. My question is: how do we calculate the word vectors that are OOV? I searched online but could not find anything. Of course, one way to do this is to remove all the sentences that have words not listed in Google's word2vec. However, I noticed only 5550 out of 16134 sentences have all of their words in the embedding.
I also tried:
model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/My Drive/Colab Notebooks/GoogleNews-vectors-negative300.bin', binary=True)
model.train(sentences_with_OOV_words)
However, TensorFlow 2 returns an error.
Any help would be greatly appreciate it.
If a word is not found in the vocabulary, initialize it with a zero vector of the same size (the Google word2vec vectors have 300 dimensions):
import numpy as np

try:
    word_vector = model.get_vector('your_word_here')
except KeyError:
    word_vector = np.zeros((300,))
The GoogleNews vector set is a plain mapping of words to vectors. There's no facility in it (or the algorithms that created it) for synthesizing vectors for unknown words.
(Similarly, if you load a plain vector-set into gensim as a KeyedVectors, there's no opportunity to run train() on the resulting object, as you show in your question code. It's not a full trainable model, just a collection of vectors.)
You can check if a word is available using the in keyword. As other answers have noted, you can then choose to use some plug value (such as an all-zeros vector) for such words.
But it's often better to just ignore such words entirely – pretend they're not even in your text. (Using a zero-vector instead, then feeding that zero-vector into other parts of your system, can make those unknown-words essentially dilute the influence of other nearby word-vectors – which often isn't what you want.)
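For instance, a minimal sketch of the skip-the-unknowns approach (my own illustration; it assumes the loaded GoogleNews KeyedVectors is named model and tokens is one tokenized sentence):
import numpy as np

# keep only words the vector set actually contains, via the `in` membership test
known = [w for w in tokens if w in model]
if known:
    sentence_vec = np.mean([model[w] for w in known], axis=0)
else:
    sentence_vec = np.zeros(model.vector_size)  # the sentence had no known words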
Awesome! Thank you very much.
def get_vectorOOV(s):
    try:
        return np.array(model.get_vector(s))
    except KeyError:
        return np.zeros((300,))

How to reduce semantically similar words?

I have a large corpus of words extracted from documents. In the corpus there are words which might mean the same thing.
For example: "command" and "order" mean the same thing, while "apple" and "apply" do not.
I would like to merge similar words, say "command" and "order", into "command".
I have tried using word2vec, but it doesn't check for the semantic similarity of words (it outputs a good similarity for "apple" and "apply" since four characters in the words are the same). And when I try using wup (Wu-Palmer) similarity, it gives a good similarity score if the words have matching synonyms, but those results are not that impressive.
What could be the best approach to reduce semantically similar words to get rid of redundant data and merge similar data?
I believe one of the options here is using WordNet. It gives you a list of synonyms for a word, so you may merge them together, given you know its part of speech.
However, I'd like to point out that "order" and "command" are not the same, e.g. you do not command food in restaurants, and such homonymy is true for many, many words.
Also, I'd like to point out that for word2vec spelling is irrelevant and is not taken into consideration at all; the algorithm considers only co-occurrence (how words are used in context). I suppose you might be mixing it up with FastText.
However, there may be some problem with your model, because in a standard set of embeddings the distance between these concepts should be large. The MUSE FastText similarity between "apple" and "apply" is only 0.15, which is quite low.
I used Gensim's function:
model.similarity("apply", "apple")
So you might need to fix learning parameters or just use a pretrained model.
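As a rough sketch of the WordNet idea (a hypothetical helper using NLTK, not part of the original answer; it requires nltk and its 'wordnet' corpus):
from nltk.corpus import wordnet as wn

def wordnet_synonyms(word, pos=None):
    # collect the lemma names of every synset WordNet lists for `word`
    syns = set()
    for synset in wn.synsets(word, pos=pos):
        syns.update(lemma.name() for lemma in synset.lemmas())
    return syns

# Two words could then be merged if one appears among the other's synonyms,
# ideally after restricting by part of speech and reviewing the merges by hand.
overlap = wordnet_synonyms('command') & wordnet_synonyms('order')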

doc2vec: any way to fetch closest matching terms for a given vector?

The use-case I have is to have a collection of "upvoted" documents and "downvoted" documents and using those to re-order a set of results in a search.
I am using gensim doc2vec and am able to run most_similar queries for word(s) and fetch matching words. But how would I fetch the matching keywords given a vector obtained by summing the above doc vectors?
Ohh silly me, I found the answer staring right in my face, posting here in case anyone else has the issue:
similar_by_vector(vector, topn=10, restrict_vocab=None)
This is, however, found not in the Doc2Vec class, but in the KeyedVectors class.
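A hypothetical end-to-end sketch, assuming a trained gensim 4.x Doc2Vec model named model and two lists of document tags, upvoted and downvoted (names invented for illustration):
import numpy as np

# combined query vector: sum of upvoted doc-vectors minus sum of downvoted ones
query = (np.sum([model.dv[tag] for tag in upvoted], axis=0)
         - np.sum([model.dv[tag] for tag in downvoted], axis=0))

# nearest *words* to that vector, via the word KeyedVectors
print(model.wv.similar_by_vector(query, topn=10))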

Custom word weights for sentences when calling h2o transform and word2vec, instead of straight AVERAGE of words

I am using the H2O machine learning package to do natural language predictions, including the functions h2o.word2vec and h2o.transform. I need sentence-level aggregation, which is provided by the AVERAGE parameter value:
h2o.transform(word2vec, words, aggregate_method = c("NONE", "AVERAGE"))
However, in my case I strongly wish to avoid equal weighting of "the" and "platypus" for example.
Here's a scheme I concocted to achieve custom word weightings. If H2O's word2vec "AVERAGE" option uses all the words, including any duplicates that might appear, then I could effect a custom word weighting when calling h2o.transform by adding extra duplicates of certain words to my sentences when I want to weight them more heavily than other words.
Can any H2O experts confirm that the word2vec AVERAGE parameter uses all the words, rather than just the unique words, when computing the AVERAGE of the words in a sentence?
Alternatively, is there a better way? I tried, but I find myself unable to imagine any correct math to multiply the sentence average by some factor after it has already been computed.
Yes, h2o.transform will consider each occurrence of a word for the averaging, not just the unique words. Your trick will therefore work.
There is currently no direct way to provide user-defined weights. You could probably do an ugly hack and weight the word embeddings directly, but that wouldn't be a straightforward solution I could recommend.
We can add this feature to H2O. I would love to hear what API would work for you (i.e., how you would like to provide the weights).
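For concreteness, a minimal Python sketch of the duplication trick (the token list and weights are invented for illustration; the weighted lists would then be handed to h2o.transform as usual):
def apply_weights(tokens, weights, default=1):
    # repeat each token according to its integer weight, so that a plain
    # AVERAGE over all occurrences becomes an effectively weighted average
    out = []
    for tok in tokens:
        out.extend([tok] * weights.get(tok, default))
    return out

weights = {'platypus': 3, 'the': 1}  # assumed per-word integer weights
apply_weights(['the', 'platypus', 'swims'], weights)
# -> ['the', 'platypus', 'platypus', 'platypus', 'swims']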