Finding word similarities along the output or input vectors using gensim word2vec?

I know that you can use model.wv.most_similar(...) to get words by cosine similarity in gensim.
I also know gensim gives you input and output vectors in e.g. model.syn0 and model.syn1neg.
Is there a simple way to calculate cosine similarity and create a list of the most-similar words using only the input or output vectors, one or the other? E.g. I want to try doing it using just the output vectors.

There's no built-in facility, but I believe you could achieve it by creating a separate KeyedVectors instance where you replace the usual ('input projection'/'syn0') vector-array with the same-sized output array (as exists in negative-sampling models).
Roughly the following may work:
from gensim.models import KeyedVectors

full_w2v_model = ... # whatever training/loading is necessary
full_w2v_model.wv.save_word2vec_format(PATH) # saves just the word-vectors
out_vecs = KeyedVectors.load_word2vec_format(PATH) # reloads as a separate object
out_vecs.vectors = full_w2v_model.syn1neg # clobber raw vecs with the model's output layer
(Let me know if this works as-is or needs some further touch-up to work!)
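Alternatively, here's a rough sketch of computing the rankings directly from the output matrix, without the save/reload round-trip – it assumes gensim 4.x-style lookups (key_to_index / index_to_key) and that the output weights live in model.syn1neg, as in your question:

import numpy as np

def most_similar_by_output(model, word, topn=10):
    # rank vocabulary words by cosine similarity of their *output* (syn1neg) rows
    out = model.syn1neg                                   # shape: (vocab_size, vector_size)
    idx = model.wv.key_to_index[word]
    normed = out / np.linalg.norm(out, axis=1, keepdims=True)
    sims = normed @ normed[idx]                           # cosine similarity to the query word
    best = np.argsort(-sims)
    return [(model.wv.index_to_key[i], float(sims[i])) for i in best if i != idx][:topn]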

Related

Word2vec build vocab adds TM to words

I'm trying to convert my text data to vectors. I would like to transform the word ultraram into a vector. I added the word to the model using model.build_vocab, but only ultraram™ is added. What did I do wrong?
model.save("word2vec.model2")
model = Word2Vec.load("word2vec.model2")
model.build_vocab(data_tokenized, update=True)
# Store just the words + their trained embeddings.
word_vectors = model.wv
word_vectors.save("word2vec.wordvectors2")
# Load back with memory-mapping = read-only, shared across processes.
self.wv = KeyedVectors.load("word2vec.wordvectors2", mmap='r')
for i in self.wv.key_to_index:
    if "ultrar" in i:
        print(i)
ultraram™
manufactureultraram™
ultrarobust
ultrarare
ultrarealistic
ultrarelativistic
It shows some words with a ™. What does this mean? And how can I add the word "ultraram" without the ™?
If there's a ™ at the end of some tokens, then those tokens, with the ™, were exactly what had been passed into the model when its vocabulary was first established.
If you don't want them, you'd have to strip them during your tokenization. (Your current question/code doesn't show how you might have tokenized your data.)
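For example, a minimal sketch of stripping the ™ character during tokenization (the tokenizer and the raw_documents name here are hypothetical, since the question doesn't show how data_tokenized was built):

def tokenize(text):
    # drop any trademark signs before the tokens ever reach build_vocab()
    return [token.replace("\u2122", "") for token in text.lower().split()]

data_tokenized = [tokenize(doc) for doc in raw_documents]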
Separately:
Directly using .load() to replace a Word2Vec model's existing KeyedVectors won't in general be reliable: a Word2Vec model isn't expecting its KeyedVectors to change separately from its own initialization/training.
It may work in this limited case – reloading exactly the same word-vectors as were just saved – but in such a case, it's unclear why you'd want to do it. From the comment it seems the motivation here might be to save some memory. However, if you're only looking up word-vectors, you don't need the full Word2Vec model at all. You can just use the set-of-KeyedVectors alone (for even more memory savings).
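For instance, a minimal sketch of the KeyedVectors-only route, reusing the file name from the question's code:

from gensim.models import KeyedVectors

wv = KeyedVectors.load("word2vec.wordvectors2", mmap='r')  # no full Word2Vec model needed
vec = wv["ultraram™"]                                      # plain word-vector lookup
print(wv.most_similar("ultraram™"))                        # similarity queries also work directly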

How can I get a vector from the output matrix in FastText?

In this study, the authors found that Word2Vec generates two kinds of embeddings (IN & OUT).
https://arxiv.org/abs/1602.01137
Well, you can easily get that using the syn1 attribute in gensim's Word2Vec. In gensim's FastText the syn1 attribute also exists, but because FastText is sub-word based, it's not possible to get a vector for a word from the output matrix just by matching indexes. Do you know any other way to calculate a word's vector using the output matrix?
In FastText, the vector for a word is the combination of:
the full-word vector, if it exists; and
all the subword vectors
You can view the gensim method that returns a vector, composed from subwords if necessary, at:
https://github.com/RaRe-Technologies/gensim/blob/2ccc82bf50bcfbee44932c160db076a873cf893e/gensim/models/keyedvectors.py#L1970
(I think this method might have a bug, in comparison to the original FastText approach, in that this gensim method perhaps should also add the subword vectors to the whole-word-vector, even when a whole-word-vector is available.)
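To make the combination rule above concrete, here is a rough illustrative sketch in plain numpy – note that real FastText (and gensim) looks subword vectors up via hashed n-gram buckets rather than a simple dict, so this is only a conceptual model:

import numpy as np

def char_ngrams(word, minn=3, maxn=6):
    # the character n-grams of '<word>', as FastText forms them
    padded = "<" + word + ">"
    return [padded[i:i + n] for n in range(minn, maxn + 1) for i in range(len(padded) - n + 1)]

def compose_word_vector(word, word_vecs, ngram_vecs):
    # average the whole-word vector (if present) together with all its subword vectors
    parts = [ngram_vecs[g] for g in char_ngrams(word) if g in ngram_vecs]
    if word in word_vecs:
        parts.append(word_vecs[word])
    return np.mean(parts, axis=0) if parts else None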

NETLOGO: storing lists for later use

Hello, I am building a model in NetLogo which is supposed to run for 525614 ticks and then stop. The result of this model is a list of values. I would like to compare the lists of values given by the model in different runs. Unfortunately, every time the model starts running everything is cleared, so there is no way of keeping track of the list produced by a previous run.
I tried to write a csv file to store the elements of the list like this:
file-open "list.csv"
file-write list_element
The problem is that when I try to retrieve the list as follows:
show csv:from-file "list.csv"
I get:
[[" list_element1 list_element2....."]]
instead of:
[list_element1 list_element2 ....]
The presence of the double square bracket at the beginning and at the end as well as the presence of the quotation marks make it impossible to access a single element of the list to compare it with those of other lists.
How should I solve this? Should I use different primitives to write my file, or should I operate on the badly formatted list I get?
The list should be composed of only numbers.
Use file-print instead of file-write.
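For example, a minimal sketch of that change (assuming list_element holds one number per call):

file-open "list.csv"
file-print list_element   ; file-print writes the plain value, without the quoting/extra space added by file-write
file-close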

Building Speech Dataset for LSTM binary classification

I'm trying to do binary LSTM classification using theano.
I have gone through the example code however I want to build my own.
I have a small set of "Hello" & "Goodbye" recordings that I am using. I preprocess these by extracting the MFCC features and saving them in text files. I have 20 speech files (10 each) and I am generating a text file for each word, so 20 text files that contain the MFCC features. Each file is a 13x56 matrix.
My problem now is: How do I use this text file to train the LSTM?
I am relatively new to this. I have gone through some literature on it as well, but haven't found a really good explanation of the concept.
Any simpler way using LSTM's would also be welcome.
There are many existing implementations, for example a TensorFlow implementation and a Kaldi-focused implementation with all the scripts; it is better to check them first.
Theano is too low-level; you might try Keras instead, as described in its tutorial. You can run the tutorial "as is" to understand how things go.
Then you need to prepare a dataset. You need to turn your data into sequences of data frames, and for every data frame in a sequence you need to assign an output label.
Keras supports two types of RNN layers – layers returning full sequences and layers returning single values. You can experiment with both; in code you just set return_sequences=True or return_sequences=False.
To train with sequences, you can assign a dummy label to all frames except the last one, where you assign the label of the word you want to recognize. You need to place the inputs and output labels into arrays. So it will be:
X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]
Y = [[0,0,...,1], [0,0,....,2]]
In X every element is a vector of 13 floats. In Y every element is just a number - 0 for intermediate frames and word ID for final frame.
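For example, a small sketch of assembling one word's frames and labels – it assumes each saved text file is the 13x56 matrix described in the question (13 MFCC features over 56 frames) in a plain whitespace-separated format, and the file name is hypothetical:

import numpy as np

mfcc = np.loadtxt("hello_01.txt")          # shape (13, 56): features x frames
frames = mfcc.T                            # shape (56, 13): one 13-float vector per frame
word_id = 1                                # e.g. 1 = "hello", 2 = "goodbye"
labels = np.zeros(len(frames), dtype=int)  # dummy label 0 for intermediate frames
labels[-1] = word_id                       # the word's label only on the final frame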
To train with just labels, you also place the inputs and output labels into arrays, but the output array is simpler. So the data will be:
X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]
Y = [[0,0,1], [0,1,0]]
Note that the output is vectorized (np_utils.to_categorical) to turn it into one-hot vectors instead of plain numbers.
Then you create the network architecture. You can have 13 floats as input and a vector as output. In the middle you might have one fully connected layer followed by one LSTM layer. Do not use layers that are too big; start with small ones.
Then you feed this dataset into model.fit and it trains the model. You can estimate model quality on a held-out set after training.
You will have a problem with convergence since you have just 20 examples. You need far more examples, preferably thousands, to train an LSTM; with so little data you will only be able to use very small models.
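Putting those pieces together, a minimal sketch of the simple fixed-label variant with the older Keras API referenced above (the layer sizes, number of classes and placeholder data are illustrative assumptions):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM, TimeDistributed
from keras.utils import np_utils

n_frames, n_features, n_classes = 56, 13, 2     # e.g. class 0 = "hello", class 1 = "goodbye"

model = Sequential()
model.add(TimeDistributed(Dense(32, activation='relu'), input_shape=(n_frames, n_features)))
model.add(LSTM(32))                             # return_sequences=False: one output per sequence
model.add(Dense(n_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

X = np.random.rand(20, n_frames, n_features)    # stand-in for the real (n_examples, n_frames, n_features) MFCC data
y = np_utils.to_categorical(np.random.randint(n_classes, size=20), n_classes)
model.fit(X, y, epochs=10, batch_size=4)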

Different types of features to train Naive Bayes in Python Pandas

I would like to use a number of features to train a Naive Bayes classifier to classify 'A' or 'non-A'.
I have three features of different value types:
1) total_length - a positive integer
2) vowel-ratio - a decimal/fraction
3) twoLetters_lastName - an array containing multiple two-letter strings
# coding=utf-8
from nltk.corpus import names
import nltk
import random
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from sklearn.naive_bayes import GaussianNB
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# Import data into pandas
data = pd.read_csv('XYZ.csv', header=0, encoding='utf-8',
                   low_memory=False)
df = DataFrame(data)
# Randomize records
df = df.reindex(np.random.permutation(df.index))
# Assign column into label Y
df_Y = df[df.AScan.notnull()][['AScan']].values # Labels are 'A' or 'non-A'
#print df_Y
# Assign column vector into attribute X
df_X = df[df.AScan.notnull()][['total_length', 'vowel_ratio', 'twoLetters_lastName']].values
#print df_X[0:10]
# Incorporate X and Y into ML algorithms
clf = GaussianNB()
clf.fit(df_X, df_Y)
df_Y is as follow:
[[u'non-A']
[u'A']
[u'non-A']
...,
[u'A']
[u'non-A']
[u'non-A']]
df_X is below:
[[9L 0.222222222 u"[u'ke', u'el', u'll', u'ly']"]
[17L 0.41176470600000004
u"[u'ma', u'ar', u'rg', u'ga', u'ar', u'ri', u'is']"]
[11L 0.454545455 u"[u'du', u'ub', u'bu', u'uc']"]
[11L 0.454545455 u"[u'ma', u'ah', u'he', u'er']"]
[15L 0.333333333 u"[u'ma', u'ag', u'ge', u'ee']"]
[13L 0.307692308 u"[u'jo', u'on', u'ne', u'es']"]
[12L 0.41666666700000005
u"[u'le', u'ef', u'f\\xe8', u'\\xe8v', u'vr', u're']"]
[15L 0.26666666699999997 u"[u'ni', u'ib', u'bl', u'le', u'et', u'tt']"]
[15L 0.333333333 u"[u'ki', u'in', u'ns', u'sa', u'al', u'll', u'la']"]
[11L 0.363636364 u"[u'mc', u'cn', u'ne', u'ei', u'il']"]]
I am getting this error:
E:\Program Files Extra\Python27\lib\site-packages\sklearn\naive_bayes.py:150: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Traceback (most recent call last):
File "C:werwer\wer\wer.py", line 32, in <module>
clf.fit(df_X, df_Y)
File "E:\Program Files Extra\Python27\lib\site-packages\sklearn\naive_bayes.py", line 163, in fit
self.theta_[i, :] = np.mean(Xi, axis=0)
File "E:\Program Files Extra\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 2727, in mean
out=out, keepdims=keepdims)
File "E:\Program Files Extra\Python27\lib\site-packages\numpy\core\_methods.py", line 69, in _mean
ret, rcount, out=ret, casting='unsafe', subok=False)
TypeError: unsupported operand type(s) for /: 'unicode' and 'long'
My understanding is that I need to convert the features into one numpy array as a feature vector, but I don't think I am preparing this X vector right since it contains very different value types.
Related questions: Choosing a Classification Algorithm to Classify Mix of Nominal and Numeric Data -- Mixing Categorial and Continuous Data in Naive Bayes Classifier Using Scikit-learn
Okay so there are a few things going on. As DalekSec pointed out, it's best practice to keep all your features as one type as you input them into a model like GaussianNB. The traceback indicates that while fitting the model, it tries to divide a string (presumably one of your unicode strings like u"[u'ke', u'el', u'll', u'ly']") by an integer. So what we need to do is convert the training data into a form that sklearn can use. We can do this a few ways, two of which ogrisel eloquently describes in this answer here.
We can convert all the continuous variables to categorical variables. In our case, this means converting total_length (in some cases you could probably treat this as a categorical variable, but let's not get ahead of ourselves) and vowel-ratio. For instance, you can basically bin the values you see in each feature to one of 5 values based on percentile: 'very small', 'small', 'medium', 'high', 'very high'. There's no real easy way in sk-learn as far as I know, but it should be pretty straightforward to do it yourself. The only thing that you would want to change is that you would want to use MultinomialNB instead of GaussianNB because you'll be dealing with features that would be better described by multinomial distributions rather than gaussian ones.
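A rough sketch of that binning idea with pandas (the five quintile bins and their labels are arbitrary choices, and the binned columns would still need to be one-hot encoded before going into MultinomialNB):

import pandas as pd

bins = ['very small', 'small', 'medium', 'high', 'very high']
df['total_length_bin'] = pd.qcut(df['total_length'], 5, labels=bins)
df['vowel_ratio_bin'] = pd.qcut(df['vowel_ratio'], 5, labels=bins)
X_binned = pd.get_dummies(df[['total_length_bin', 'vowel_ratio_bin']])   # one 0/1 column per bin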
We can convert the categorical features to numeric ones for use with GaussianNB. Personally I find this to be the more intuitive approach. Basically, when dealing with text, you need to figure out what information you want to take from the text and pass to the classifier. It looks like to me that you want to extract the incidence of different two letter last names.
Normally I would ask you whether or not you have all the last names in your dataset, but since each one is only two letters each we can just store all the possible two letter names (including the unicode characters involving accent marks) with a minimal impact on performance. This is where something like sklearn's CountVectorizer might be useful. Assuming that you have every possible combination of two letter last names in your data, you can just directly use this to turn a row in your twoLetter_lastname column into a N-dimensional vector that records the number of occurrences of each unique last name in your row. Then just combine this new vector with your other two features into a numpy array.
In the case you do not have every possible combination of two letters (including accented ones), you should consider generating that list and passing it in as the 'vocabulary' for the CountVectorizer. This is so that your classifier knows how to handle all possible last names. It's not the end of the world if you don't handle all cases, but any new unseen two-letter pairs will be ignored in this scheme.
Before you use these tools, you should make sure that you pass your last name column in as a list, and not as a string, as this can result in unintended behavior.
You can read more about general sklearn preprocessing here, and more about CountVectorizer and other text feature extraction tools provided by sklearn here. I use a lot of these tools daily, and recommend them for basic text extraction tasks. There are also plenty of tutorials and demos available online. You might also look for other types of methods of representation, like binarizing and one-hot encoding. There are many ways to solve this problem, it mostly depends on your specific problem/needs.
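A rough sketch of that route, reusing the column names from your code – it assumes the twoLetters_lastName column already holds real Python lists of pairs rather than their string representation (see the note above about passing lists, not strings):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

labeled = df[df.AScan.notnull()]                                 # same filter as in your code
pairs_as_text = labeled['twoLetters_lastName'].apply(' '.join)   # e.g. "ke el ll ly"
vectorizer = CountVectorizer(token_pattern=r'\S+')               # count each two-letter pair
pair_counts = vectorizer.fit_transform(pairs_as_text).toarray()

numeric = labeled[['total_length', 'vowel_ratio']].values.astype(float)
X = np.hstack([numeric, pair_counts])                            # one combined all-numeric matrix

clf = GaussianNB()
clf.fit(X, labeled['AScan'].values)                              # a flat 1-D label array, as the warning asks for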
After you're able to turn all your data into one form or the other, you should be able to make use of either the Gaussian or Multinomial NB classifier. As for your error regarding the 1D vector, you printed df_Y and it looked like
[[u'non-A']
[u'A']
[u'non-A']
...,
[u'A']
[u'non-A']
[u'non-A']]
Basically, it's expecting this to be in a flat list, rather than as a column vector (a list of one-dimensional lists). Just reshape it accordingly by making use of commands like numpy.reshape() or numpy.ravel() (numpy.ravel() would probably be more appropriate, considering that you're dealing with just one column, as the error mentioned).
I'm not 100% sure, but I think scikit-learn.naive_bayes requires a purely numeric feature vector instead of a mixture of text and numbers. It looks like it crashes when trying to "divide" a unicode string by a long integer.
I can't be much help with finding numeric representations for text, but this scikit-learn tutorial might be a good start.