Cosine similarity between any two sentences is giving 0.99 always - word2vec

I downloaded the stackoverflow dump (which is a 10GB file) and ran word2vec on the dump in order to get vector representations for programming terms (I require it for a project that I'm doing). Following is the code:
from gensim.models import Word2Vec
from xml.dom.minidom import parse, parseString
titles, bodies = [], []
xmldoc = parse('test.xml')  # this is the dump
reflist = xmldoc.getElementsByTagName('row')
for i in range(len(reflist)):
    bitref = reflist[i]
    if 'Title' in bitref.attributes.keys():
        title = bitref.attributes['Title'].value
        titles.append(title.split())
    if 'Body' in bitref.attributes.keys():
        body = bitref.attributes['Body'].value
        bodies.append(body.split())
dimension = 8
sentences = titles + bodies
model = Word2Vec(sentences, size=dimension, iter=100)
model.save('snippet_1.model')
Now, in order to calculate the cosine similarity between a pair of sentences, I do the following:
from gensim.models import Word2Vec
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
model = Word2Vec.load('snippet_1.model')
dimension = 8
snippet = 'some text'
snippet_vector = np.zeros((1, dimension))
for word in snippet:
    if word in model.wv.vocab:
        vecvalue = model[word].reshape(1, dimension)
        snippet_vector = np.add(snippet_vector, vecvalue)
link_text = 'some other text'
link_vector = np.zeros((1, dimension))
for word in link_text:
    if word in model.wv.vocab:
        vecvalue = model[word].reshape(1, dimension)
        link_vector = np.add(link_vector, vecvalue)
print(cosine_similarity(snippet_vector, link_vector))
I am summing the word embeddings of every word in a sentence to get a representation for the sentence as a whole. I do this for both sentences and then calculate the cosine similarity between them.
Now, the problem is that I get a cosine similarity of around 0.99 for any pair of sentences I give. Is there anything I'm doing wrong? Any suggestions for a better approach?

Are you checking that your snippet_vector and link_vector are meaningful vectors before calculating their cosine-similarity?
I suspect they're just zero vectors, or similarly non-diverse, since your for word in snippet: and for word in link_text: loops aren't tokenizing the text. They will just loop over the characters in each string, which either won't be present in your model as words, or the few that are available will match exactly between your texts. (Even with tokenization, the two summed vectors would only differ by the vector for the one differing word, 'other'.)
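A minimal sketch of that fix, reusing the snippet_1.model and 8-dimensional setup from the question (the sentence_vector helper is mine, and a real pipeline would want proper tokenization and lowercasing rather than a bare split()):
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = Word2Vec.load('snippet_1.model')
dimension = 8

def sentence_vector(text, model, dimension):
    # sum vectors over word tokens, not over characters
    vec = np.zeros((1, dimension))
    for word in text.split():
        if word in model.wv.vocab:
            vec = np.add(vec, model.wv[word].reshape(1, dimension))
    return vec

snippet_vector = sentence_vector('some text', model, dimension)
link_vector = sentence_vector('some other text', model, dimension)
print(cosine_similarity(snippet_vector, link_vector))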

Related

Time variable units "day as %Y%m%d.%f" in python iris

I am hoping someone can help. I am running a few climate models (NetCDF files) in python using iris. All was working well until I added my last model, which is formatted differently. The unit used for the time variable in the new model is day as %Y%m%d.%f, but in the other models it is days since …. This means that when I try to constrain the time variable I get the following error: AttributeError: 'numpy.float64' object has no attribute 'year'.
I tried adding a year variable using iriscc.add_year(EARTH3, 'time') but that just brings up the error 'Unit has undefined calendar'.
I'm wondering if you know how I might fix this? Do I need to convert the calendar type? Or is there a way around that? Not sure how to do that anyway!
Thank you!
Erika
EDIT: here is the full code for my file. The model CanESM2 is working, but the model EARTH3 is not; it is the one with the funny time units.
import matplotlib.pyplot as plt
import iris
import iris.coord_categorisation as iriscc
import iris.plot as iplt
import iris.quickplot as qplt
import iris.analysis.cartography
import cf_units
from cf_units import Unit
import datetime
import numpy as np
def main():
    #-------------------------------------------------------------------------
    #bring in all the GCM models we need and give them a name
    CanESM2 = '/exports/csce/datastore/geos/users/s0XXXX/Climate_Modelling/GCM_data/tasmin_Amon_CanESM2_historical_r1i1p1_185001-200512.nc'
    EARTH3 = '/exports/csce/datastore/geos/users/s0XXXX/Climate_Modelling/GCM_data/tas_Amon_EC-EARTH_historical_r3i1p1_1850-2009.nc'

    #Load exactly one cube from given file
    CanESM2 = iris.load_cube(CanESM2)
    EARTH3 = iris.load_cube(EARTH3)

    print "CanESM2 time"
    print (CanESM2.coord('time'))
    print "EARTH3 time"
    print (EARTH3.coord('time'))

    #fix EARTH3 time units as they differ from all other models
    t_coord = EARTH3.coord('time')
    t_unit = t_coord.attributes['invalid_units']
    timestep, _, t_fmt_str = t_unit.split(' ')
    new_t_unit_str = '{} since 1850-01-01 00:00:00'.format(timestep)
    new_t_unit = cf_units.Unit(new_t_unit_str, calendar=cf_units.CALENDAR_STANDARD)
    new_datetimes = [datetime.datetime.strptime(str(dt), t_fmt_str) for dt in t_coord.points]
    new_dt_points = [new_t_unit.date2num(new_dt) for new_dt in new_datetimes]
    new_t_coord = iris.coords.DimCoord(new_dt_points, standard_name='time', units=new_t_unit)

    print "EARTH3 new time"
    print (EARTH3.coord('time'))

    #regrid all models to have same latitude and longitude system, all regridded to model with lowest resolution
    CanESM2 = CanESM2.regrid(CanESM2, iris.analysis.Linear())
    EARTH3 = EARTH3.regrid(CanESM2, iris.analysis.Linear())

    #we are only interested in the latitude and longitude relevant to Malawi (has to be slightly larger than country boundary to take into account resolution of GCMs)
    Malawi = iris.Constraint(longitude=lambda v: 32.0 <= v <= 36., latitude=lambda v: -17. <= v <= -8.)
    CanESM2 = CanESM2.extract(Malawi)
    EARTH3 = EARTH3.extract(Malawi)

    #time constraint to make all series the same, for ERAINT this is 1990-2008 and for RCMs and GCMs this is 1961-2005
    iris.FUTURE.cell_datetime_objects = True
    t_constraint = iris.Constraint(time=lambda cell: 1961 <= cell.point.year <= 2005)
    CanESM2 = CanESM2.extract(t_constraint)
    EARTH3 = EARTH3.extract(t_constraint)

    #Convert units to match, CORDEX data is in Kelvin but Observed data in Celsius, we would like to show all data in Celsius
    CanESM2.convert_units('Celsius')
    EARTH3.units = Unit('Celsius') #this fixes EARTH3 which has no units defined
    EARTH3 = EARTH3 - 273 #this converts the data manually from Kelvin to Celsius

    #add year data to files
    iriscc.add_year(CanESM2, 'time')
    iriscc.add_year(EARTH3, 'time')

    #We are interested in plotting the data by year, so we need to take a mean of all the data by year
    CanESM2YR = CanESM2.aggregated_by('year', iris.analysis.MEAN)
    EARTH3YR = EARTH3.aggregated_by('year', iris.analysis.MEAN)

    #Returns an array of area weights, with the same dimensions as the cube
    CanESM2YR_grid_areas = iris.analysis.cartography.area_weights(CanESM2YR)
    EARTH3YR_grid_areas = iris.analysis.cartography.area_weights(EARTH3YR)

    #We want to plot the mean for the whole region so we need a mean of all the lats and lons
    CanESM2YR_mean = CanESM2YR.collapsed(['latitude', 'longitude'], iris.analysis.MEAN, weights=CanESM2YR_grid_areas)
    EARTH3YR_mean = EARTH3YR.collapsed(['latitude', 'longitude'], iris.analysis.MEAN, weights=EARTH3YR_grid_areas)

    #-------------------------------------------------------------------------
    #PART 4: PLOT LINE GRAPH
    #limit x axis
    plt.xlim((1961, 2005))

    #assign the line colours and set x axis to 'year' rather than 'time'
    qplt.plot(CanESM2YR_mean.coord('year'), CanESM2YR_mean, label='CanESM2', lw=1.5, color='blue')
    qplt.plot(EARTH3YR_mean.coord('year'), EARTH3YR_mean, label='EC-EARTH (r3i1p1)', lw=1.5, color='magenta')

    #set a title for the y axis
    plt.ylabel('Near-Surface Temperature (degrees Celsius)')

    #create a legend and set its location to under the graph
    plt.legend(loc="upper center", bbox_to_anchor=(0.5, -0.05), fancybox=True, shadow=True, ncol=2)

    #create a title
    plt.title('Tas for Malawi 1961-2005', fontsize=11)

    #add grid lines
    plt.grid()

    #show the graph in the console
    iplt.show()

if __name__ == '__main__':
    main()
In Iris, unit strings for time coordinates must be specified in the format <time-period> since <epoch>, where <time-period> is a unit of measure of time, such as 'days', or 'years'. This format is specified by udunits2, the library Iris uses to supply valid units and perform unit conversions.
The time coordinate in this case does not have a unit that follows this format, meaning the time coordinate will not have full time coordinate functionality (this partly explains the Attribute Error in the question). To fix this we will need to construct a new time coordinate based on the values and metadata of the existing time coordinate and then replace the cube's existing time coordinate with the new one.
To do this we'll need to:
1. construct a new time unit based on the metadata contained in the existing time unit
2. take the existing time coordinate's point values and format them as datetime objects, using the format string specified in the existing time unit
3. convert the datetime objects from (2.) to an array of floating-point numbers using the new time unit constructed in (1.)
4. create a new time coordinate from the array constructed in (3.) and the new time unit produced in (1.)
5. remove the old time coordinate from the cube and add the new one.
Here's the code to do this...
import datetime
import cf_units
import iris
import numpy as np
# (1) construct a new time unit from the metadata in the existing (invalid) unit
t_coord = EARTH3.coord('time')
t_unit = t_coord.attributes['invalid_units']
timestep, _, t_fmt_str = t_unit.split(' ')
new_t_unit_str = '{} since 1850-01-01 00:00:00'.format(timestep)
new_t_unit = cf_units.Unit(new_t_unit_str, calendar=cf_units.CALENDAR_STANDARD)

# (2) format the existing point values as datetime objects
new_datetimes = [datetime.datetime.strptime(str(dt), t_fmt_str) for dt in t_coord.points]

# (3) convert the datetimes to floating-point numbers in the new unit
new_dt_points = [new_t_unit.date2num(new_dt) for new_dt in new_datetimes]

# (4) build the new time coordinate
new_t_coord = iris.coords.DimCoord(new_dt_points, standard_name='time', units=new_t_unit)

# (5) swap the old time coordinate on the cube for the new one
t_coord_dim = EARTH3.coord_dims('time')
EARTH3.remove_coord('time')
EARTH3.add_dim_coord(new_t_coord, t_coord_dim)
I've made an assumption about the best epoch for your time data. I've also made an assumption about the calendar that best describes your data, but you should be able to replace (when constructing new_t_unit) the standard calendar I've chosen with any other valid cf_units calendar without difficulty.
As a final note, it is effectively impossible to change calendar types. This is because different calendar types include and exclude different days. For example, a 360-day calendar has a Feb 30 but no May 31 (as it assumes 12 idealised 30-day months). If you try to convert from a 360-day calendar to a standard calendar, the problems you hit include what to do with the data from 29 and 30 Feb, and how to fill the five missing days that don't exist in a 360-day calendar. For such reasons it's generally impossible to convert calendars (and Iris doesn't allow such operations).
Hope this helps!
This answer may be less useful, but here is a function I wrote to convert data from the %Y%m%d.%f format into a datetime array.
The function creates a complete datetime array with no missing values. It could be modified to take missing times into account, although a climate model should not have missing data.
import numpy as np
import pandas as pd

def fromEARTHtime2Datetime(dt, timeVecEARTH):
    """
    This function returns a complete datetime array from the EARTH
    %Y%m%d.%f time format, converting it to a more useful time such as
    Python's datetime, WITHOUT any missing data!

    Parameters
    ----------
    dt : string
        The time discretization; it can be 1h or 6h, but it always
        needs to be in hours, for example dt = '6h'.
    timeVecEARTH : array of float
        Vector of times to be converted, in the EARTH day as %Y%m%d.%f
        format. Only this format can be converted to datetime, for example:
        20490128.0, 20490128.25, 20490128.5, 20490128.75 will be converted
        to datetime: '2049-01-28 00:00:00', '2049-01-28 06:00:00',
        '2049-01-28 12:00:00', '2049-01-28 18:00:00'

    Returns
    -------
    timeArrNew : datetime
        The complete datetime array, WITHOUT any missing data,
        for example: DatetimeIndex(['2049-01-28 00:00:00', '2049-01-28 06:00:00',
                       ...
                       '2049-02-28 18:00:00', '2049-03-01 00:00:00'],
                      dtype='datetime64[ns]', length=129, freq='6H')
    """
    # map each fraction of a day (0.0, 0.25, ...) to its hour string ('00:00:00', '06:00:00', ...)
    dtDay = 24/float(dt[:-1])
    partOfDay = np.arange(0, 1, 1/dtDay)
    hDay = []
    for ip in partOfDay:
        hDay.append('%02.f:00:00' %(24*ip))
    dictHours = dict(zip(partOfDay, hDay))

    # first timestamp: split the %Y%m%d.%f value into date and fraction of day;
    # prepend '0.' so the fractional part matches the dictHours keys (0.0, 0.25, ...)
    t0Str = str(timeVecEARTH[0])
    timeAux0 = t0Str.split('.')
    timeAux0 = timeAux0[0][0:4] + '-' + timeAux0[0][4:6] + '-' + timeAux0[0][6:] + ' ' + dictHours[float('0.' + timeAux0[1])]

    # last timestamp, handled the same way
    tendStr = str(timeVecEARTH[-1])
    timeAuxEnd = tendStr.split('.')
    timeAuxEnd = timeAuxEnd[0][0:4] + '-' + timeAuxEnd[0][4:6] + '-' + timeAuxEnd[0][6:] + ' ' + dictHours[float('0.' + timeAuxEnd[1])]

    # build the regular, gap-free datetime index between the two endpoints
    timeArrNew = pd.date_range(timeAux0, timeAuxEnd, freq=dt)
    return timeArrNew
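For example, with a hypothetical 6-hourly EARTH time vector (the values below are made up), the function can be called like this:
import numpy as np

# hypothetical EARTH time values in day as %Y%m%d.%f format, 6-hourly
timeVecEARTH = np.array([20490128.0, 20490128.25, 20490128.5, 20490128.75, 20490129.0])
new_times = fromEARTHtime2Datetime('6h', timeVecEARTH)
# new_times is a pandas DatetimeIndex running from 2049-01-28 00:00:00
# to 2049-01-29 00:00:00 in regular 6-hour steps, with no gaps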

How do I include a list type feature in sklearn.svm.libsvm.fit() classifier?

I'm trying to loop through a number of text documents and create a feature set by recording :
position list in text
Part of speech of keyphrase
Length of each keyphrase (number of words in it)
Frequency of each keyphrase
Code snippet of extraxting features :
#Take list of Keywords
keyword_list = [line.split(':')[1].lower().strip() for line in keywords.splitlines() if ':' in line ]
#Position
position_list = [ [m.start()/float(len(document)) for m in re.finditer(re.escape(kw),document,flags=re.IGNORECASE)] for kw in keyword_list]
#Part of Speech
pos_list = []
for key in keyword_list:
    pos_list.append([pos for w, pos in nltk.pos_tag(nltk.word_tokenize(key))])
#Length of each keyword
len_list = [len(k.split(' ')) for k in keyword_list]
#Text Frequency
freq_list = [len(pos)/float(len(document)) for pos in position_list]
target.extend(keyword_list)
for i in range(0, len(keyword_list)):
    data.append([position_list[i], pos_list[i], len_list[i], freq_list[i]])
Where
target : list of keywords
data : list of features
I passed this through a classifier :
from sklearn.cross_validation import train_test_split
X_train,X_test,y_train,y_test = train_test_split(data,target,test_size=0.25,random_state = 42)
import numpy as np
X_train = np.array(X_train)
y_train = np.array(y_train)
from sklearn import svm
cls = svm.SVC(gamma=0.001,C=100) # Parameter values Matter!
cls.fit(X_train,y_train)
predictions = cls.predict(X_test)
But I get an error :
Traceback (most recent call last):
File "supervised_3.py", line 113, in <module>
cls.fit(X_train,y_train)
File "/Library/Python/2.7/site-packages/sklearn/svm/base.py", line 150, in fit
X = check_array(X, accept_sparse='csr', dtype=np.float64, order='C')
File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 373, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence
So, I removed all the list items by changing
data.append([position_list[i],pos_list[i],len_list[i],freq_list[i]])
to
data.append([len_list[i],freq_list[i]])
It worked.
But I need to include position_list and pos_list
I thought it wasn't working because these 2 are lists. So, I tried converting them to arrays :
data.append([np.array(position_list[i]),np.array(pos_list[i]),len_list[i],freq_list[i]])
but I still get the same error.
In the last for loop of the feature extraction code you are trying to append to data a list of four elements, namely position_list[i], pos_list[i], len_list[i], freq_list[i]. The problem is that the first two elements are lists themselves, but individual features have to be scalars (which is why the issue is not solved by converting the sublists to numpy arrays). Each of them requires a different workaround:
position_list[i]
This is a list of float numbers. You could replace this list by some statistics computed from it, for example the mean and the standard deviation.
pos_list[i]
This is a list of tags extracted from the list of tuples of the form (token, tag)* yielded by nltk.pos_tag. The tags (which are strings) can be converted into numbers in a straightforward way by counting their number of occurrences. To keep things simple, I will just add the frequency of 'NN' and 'NNS' tags**.
To get your code working you just need to change the last for loop to:
for i in range(0, len(keyword_list)):
    positions_i = position_list[i]
    tags_i = pos_list[i]
    len_tags_i = float(len(tags_i))
    m = np.mean(positions_i)
    s = np.std(positions_i)
    nn = tags_i.count('NN')/len_tags_i
    nns = tags_i.count('NNS')/len_tags_i
    data.append([m, s, nn, nns, len_list[i], freq_list[i]])
By doing so the resulting feature vector becomes 6-dimensional. Needless to say, you could use a higher or lower number of statistics and/or tag frequencies, or even a different tagset.
* The identifiers w,pos you use in the for loop that creates pos_list are a bit misleading.
** You could utilize a collections.Counter to count the number of occurrences of each tag more efficiently.
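For instance, a small hypothetical helper along those lines (tag_frequencies is not part of the original code):
from collections import Counter

def tag_frequencies(tags, wanted=('NN', 'NNS')):
    # count every tag in a single pass instead of calling list.count() once per tag
    counts = Counter(tags)
    total = float(len(tags))
    return [counts[t] / total for t in wanted]

# e.g. tag_frequencies(['NN', 'JJ', 'NN', 'NNS']) returns [0.5, 0.25]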

How is TF calculated in Sklearn

I have been experimenting with sklearn's TfidfVectorizer.
I am only concerned with TF, not IDF, so my settings have use_idf=False.
Complete settings are:
vectorizer = TfidfVectorizer(max_df=0.5, max_features=n_features,
                             ngram_range=(1,3), use_idf=False)
I have been trying to replicate the output of .fit_transform but haven't managed to do it so far and was hoping someone could explain the calculations for me.
My toy example is:
document = ["one two three one four five",
            "two six eight ten two"]
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
n_features = 5
vectorizer = TfidfVectorizer(max_df=0.5, max_features=n_features,
                             ngram_range=(1,3), use_idf=False)
X = vectorizer.fit_transform(document)
count = CountVectorizer(max_df=0.5, max_features=n_features,
                        ngram_range=(1,3))
countMat = count.fit_transform(document)
I have assumed the counts from the CountVectorizer will be the same as the counts used in the TfidfVectorizer, so I am trying to transform the countMat object to match X.
I had missed a line from the documentation which says
Each row is normalized to have unit euclidean norm
So to answer my own question - the answer is:
import numpy as np
countArr = countMat.toarray()
for i in xrange(len(countArr)):
    row = countArr[i]
    print row / np.sqrt(np.sum(row**2))  # each normalised row matches the corresponding row of X
Although I am sure there is a more elegant way to code the result.
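If it helps, here is a more compact sketch using sklearn's normalize (reusing countMat and X from the snippet above), which applies the same unit Euclidean norm row by row:
import numpy as np
from sklearn.preprocessing import normalize

row_normalized = normalize(countMat.toarray(), norm='l2')
# each row is divided by np.sqrt(np.sum(row**2)), so this should
# match X.toarray() up to floating-point error
print(np.allclose(row_normalized, X.toarray()))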

NLTK package to estimate the (unigram) perplexity

I am trying to calculate the perplexity for the data I have. The code I am using is:
import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")
from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm
But I am receiving the error,
File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'
I have already performed Latent Dirichlet Allocation for the data I have and I have generated the unigrams and their respective probabilities (they are normalized as the sum of total probabilities of the data is 1).
My unigrams and their probability looks like:
Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781
This is just a fragment of the unigrams file I have; the same format continues for thousands of lines. The total probabilities (second column) sum to 1.
I am a budding programmer. This ngram.py belongs to the nltk package and I am confused as to how to rectify this. The sample code I have here is from the nltk documentation and I don't know what to do now. Please advise on what I can do. Thanks in advance!
Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:
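Written out for a test set W = w_1 w_2 \dots w_N, that is

PP(W) = P(w_1 w_2 \dots w_N)^{-1/N} = \left(\prod_{i=1}^{N} \frac{1}{P(w_i)}\right)^{1/N}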
Now you say you have already constructed the unigram model, meaning, for each word you have the relevant probability. Then you only need to apply the formula. I assume you have a big dictionary unigram[word] that would provide the probability of each word in the corpus. You also need to have a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you have used, so I could adapt it to my solution accordingly.
perplexity = 1
N = 0
for word in testset:
    if word in unigram:
        N += 1
        perplexity = perplexity * (1/unigram[word])
perplexity = pow(perplexity, 1/float(N))
UPDATE:
As you asked for a complete working example, here's a very simple one.
Suppose this is our corpus:
corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""
Here's how we construct the unigram model first:
import collections, nltk
# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)
#here you construct the unigram language model
def unigram(tokens):
    model = collections.defaultdict(lambda: 0.01)
    for f in tokens:
        try:
            model[f] += 1
        except KeyError:
            model[f] = 1
            continue
    N = float(sum(model.values()))
    for word in model:
        model[word] = model[word]/N
    return model
Our model here is smoothed. For words outside the scope of its knowledge, it assigns a low probability of 0.01. I already told you how to compute perplexity:
#computes perplexity of the unigram model on a testset
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        perplexity = perplexity * (1/model[word])
    perplexity = pow(perplexity, 1/float(N))
    return perplexity
Now we can test this on two different test sets:
testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"
model = unigram(tokens)
print perplexity(testset1, model)
print perplexity(testset2, model)
for which you get the following result:
>>>
49.09452736318415
99.99999999999997
Note that when dealing with perplexity, we try to minimise it. A language model with lower perplexity on a given test set is more desirable than one with higher perplexity. In the first test set, the word Monty was included in the unigram model, so the respective perplexity is also smaller.
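As a quick check of the second number: none of the three words in testset2 is in the model, so each falls back to the smoothed probability of 0.01 and the computation reduces to

PP = \left(\left(\frac{1}{0.01}\right)^{3}\right)^{1/3} = 100

which matches the 99.99999999999997 printed above up to floating-point error.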
Thanks for the code snippet! Shouldn't:
for word in model:
    model[word] = model[word]/float(sum(model.values()))
be rather:
v = float(sum(model.values()))
for word in model:
    model[word] = model[word]/v
Oh ... I see it was already answered ...

How to use word2vec to calculate the similarity distance by giving 2 words?

Word2vec is an open-source tool provided by Google to calculate distances between words. You input a word and it outputs a list of words ranked by their similarity to it. E.g.
Input:
france
Output:
Word Cosine distance
spain 0.678515
belgium 0.665923
netherlands 0.652428
italy 0.633130
switzerland 0.622323
luxembourg 0.610033
portugal 0.577154
russia 0.571507
germany 0.563291
catalonia 0.534176
However, what I need to do is calculate the similarity distance between 2 given words. If I give 'france' and 'spain', how can I get the score 0.678515 without reading the whole word list produced by giving just 'france'?
gensim has a Python implementation of Word2Vec which provides an in-built utility for finding similarity between two words given as input by the user. You can refer to the following:
Intro: http://radimrehurek.com/gensim/models/word2vec.html
Tutorial: http://radimrehurek.com/2014/02/word2vec-tutorial/
UPDATED: Gensim 4.0.0 and above
The syntax in Python for finding similarity between two words goes like this:
>>> from gensim.models import Word2Vec
>>> model = Word2Vec.load('path/to/your/model')
>>> model.wv.similarity('france', 'spain')
As you know, word2vec can represent a word as a mathematical vector. So once you train the model, you can obtain the vectors of the words spain and france and compute the cosine similarity between them (the dot product of the normalised vectors).
An easy way to do this is to use this Python wrapper of word2vec. You can obtain the vector using this:
>>> model['computer'] # raw numpy vector of a word
array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
To compute the distances between two words, you can do the following:
>>> import numpy
>>> cosine_similarity = numpy.dot(model['spain'], model['france'])/(numpy.linalg.norm(model['spain'])* numpy.linalg.norm(model['france']))
I just stumbled on this while looking for how to do this by modifying the original distance.c version, not by using another library like gensim.
I didn't find an answer so I did some research, and am sharing it here for others who also want to know how to do it in the original implementation.
After looking through the C source, you will find that 'bi' is an array of indexes. If you provide two words, the index for word1 will be in bi[0] and the index of word2 will be in bi[1].
The model 'M' is an array of vectors. Each word is represented as a vector with dimension 'size'.
Using these two indexes and the model of vectors, look them up and calculate the cosine distance (which here is the same as the dot product, because distance.c normalises every word vector to unit length when it loads the model) like this:
dist = 0;
for (a = 0; a < size; a++) {
    dist += M[a + bi[0] * size] * M[a + bi[1] * size];
}
After this completes, the value 'dist' is the cosine similarity between the two words.
I have developed some code to help with calculating cosine similarity for 2 sentences / SKUs using gensim. The code can be found here:
https://github.com/aviralmathur/Word2Vec
The code uses data from the Kaggle Crowdflower competition.
It was developed using the code for the Kaggle tutorial on Word2Vec, available here:
https://www.kaggle.com/c/word2vec-nlp-tutorial
I hope this helps.
If you look at the source code of Gensim's native method for calculating word similarities, you will find that it calculates them as follows:
import numpy as np
from gensim import matutils # utility fnc for pickling, common scipy operations etc
def similarity_cosine(vec1, vec2):
    cosine_similarity = np.dot(matutils.unitvec(vec1), matutils.unitvec(vec2))
    return cosine_similarity

similarity_cosine(model.wv['space'], model.wv['france'])