How to use word2vec to calculate the similarity distance by giving 2 words? - word2vec

Word2vec is an open-source tool from Google for calculating distances between words. Given an input word, it outputs a list of words ranked by similarity. E.g.
Input:
france
Output:
Word Cosine distance
spain 0.678515
belgium 0.665923
netherlands 0.652428
italy 0.633130
switzerland 0.622323
luxembourg 0.610033
portugal 0.577154
russia 0.571507
germany 0.563291
catalonia 0.534176
However, what I need is to calculate the similarity between two given words. If I give 'france' and 'spain', how can I get the score 0.678515 directly, without reading the whole ranked list for 'france'?

gensim has a Python implementation of Word2Vec which provides an in-built utility for finding similarity between two words given as input by the user. You can refer to the following:
Intro: http://radimrehurek.com/gensim/models/word2vec.html
Tutorial: http://radimrehurek.com/2014/02/word2vec-tutorial/
UPDATED: Gensim 4.0.0 and above
The syntax in Python for finding similarity between two words goes like this:
>>> from gensim.models import Word2Vec
>>> model = Word2Vec.load('path/to/your/model')  # path to a saved gensim model
>>> model.wv.similarity('france', 'spain')

As you know, word2vec represents a word as a mathematical vector. So once you train the model, you can obtain the vectors of the words spain and france and compute the cosine similarity (the dot product divided by the product of the two vector norms).
An easy way to do this is to use this Python wrapper of word2vec. You can obtain the vector using this:
>>> model['computer'] # raw numpy vector of a word
array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
To compute the cosine similarity between two words, you can do the following:
>>> import numpy
>>> cosine_similarity = numpy.dot(model['spain'], model['france'])/(numpy.linalg.norm(model['spain'])* numpy.linalg.norm(model['france']))

I just stumbled on this while looking for how to do this by modifying the original distance.c version, not by using another library like gensim.
I didn't find an answer so I did some research, and am sharing it here for others who also want to know how to do it in the original implementation.
After looking through the C source, you will find that 'bi' is an array of indexes. If you provide two words, the index for word1 will be in bi[0] and the index of word2 will be in bi[1].
The model 'M' is an array of vectors. Each word is represented as a vector with dimension 'size'.
Using these two indexes and the model of vectors, look them up and calculate the cosine similarity (which here is the same as the dot product, because distance.c normalizes every vector to unit length when it loads the model) like this:
dist = 0;
for (a = 0; a < size; a++) {
    dist += M[a + bi[0] * size] * M[a + bi[1] * size];
}
After this completes, the value 'dist' is the cosine similarity between the two words.
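A small sketch of why the plain dot product works in that loop: a dot product equals the cosine similarity only when both vectors have unit length, which is the case once the vectors have been normalized (as distance.c appears to do on load). The three-dimensional vectors below are made up for illustration and are not real word2vec output.

```python
import math

def dot(u, v):
    # Plain dot product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    # Euclidean length of a vector.
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    # Dot product divided by the product of the two lengths.
    return dot(u, v) / (norm(u) * norm(v))

def unit(u):
    # Scale a vector to length 1.
    n = norm(u)
    return [a / n for a in u]

# Hypothetical "word vectors", for illustration only.
vec_a = [1.0, 2.0, 2.0]
vec_b = [2.0, 0.0, 1.0]

raw_dot = dot(vec_a, vec_b)                # dot product of the raw vectors
cos = cosine_similarity(vec_a, vec_b)      # cosine similarity of the raw vectors
unit_dot = dot(unit(vec_a), unit(vec_b))   # dot product after normalization
```

Here raw_dot differs from cos, while unit_dot agrees with cos up to floating-point rounding. So if your vectors are not normalized (for example raw vectors taken from another library), divide by the norms as in the numpy answer above.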

I have written some code that calculates cosine similarity for 2 sentences / SKUs using gensim. The code can be found here:
https://github.com/aviralmathur/Word2Vec
The code uses data from the Kaggle competition on Crowdflower.
It is based on the code from the Kaggle tutorial on Word2Vec, available here:
https://www.kaggle.com/c/word2vec-nlp-tutorial
I hope this helps

If you look at the source code of Gensim's native method for calculating word similarities, you will find that it uses the following approach:
import numpy as np
from gensim import matutils  # utility fnc for pickling, common scipy operations etc
def similarity_cosine(vec1, vec2):
    cosine_similarity = np.dot(matutils.unitvec(vec1), matutils.unitvec(vec2))
    return cosine_similarity
similarity_cosine(model.wv['space'], model.wv['france'])

PVLIB - DC Power From Irradiation - Simple Calculation

Dear pvlib users and devs,
I'm a researcher in computer science, not particularly expert in the simulation or modelling of solar panels. I'm interested in using pvlib since
we are trying to simulate the operation of a small solar panel used for IoT
applications. The panel specs are the following:
12.8% max efficiency, Vmp = 5.82V, size = 225 × 155 × 17 mm.
Before using pvlib, one of my collaborators wrote code that computes the
irradiation directly from average monthly values calculated with PVWatts.
I was not really satisfied, so we are starting to use pvlib.
In the old code, we have the power and current of the panel calculated as:
W = Irradiation * PanelSize(m^2) * Efficiency
A = W / Vmp
The irradiation, in Madrid, has been obtained with PVWatts, and this is
what my collaborator used:
DIrradiance = (2030.0,2960.0,4290.0,5110.0,5950.0,7090.0,7200.0,6340.0,4870.0,3130.0,2130.0,1700.0)
I'm trying to understand whether pvlib computes values similar to the ones above, as averages over a day for each month, and what the production curve over a day looks like.
I wrote this to compare pvlib with our old model:
import math
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd
import pvlib
from pvlib.location import Location
def irradiance(day,m):
    DIrradiance =(2030.0,2960.0,4290.0,5110.0,5950.0,
                  7090.0,7200.0,6340.0,4870.0,3130.0,2130.0,1700.0)
    madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
    times = pd.date_range(start=dt.datetime(2015,m,day,00,00),
                          end=dt.datetime(2015,m,day,23,59),
                          freq='60min')
    spaout = pvlib.solarposition.spa_python(times, madrid.latitude, madrid.longitude)
    spaout = spaout.assign(cosz=pd.Series(np.cos(np.deg2rad(spaout['zenith']))))
    z = np.array(spaout['cosz'])
    return z.clip(0)*(DIrradiance[m-1])
madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
times = pd.date_range(start = dt.datetime(2015,8,15,00,00),
                      end = dt.datetime(2015,8,15,23,59),
                      freq='60min')
old = irradiance(15,8) # old model
new = madrid.get_clearsky(times) # pvlib irradiance
plt.plot(old,'r-') # compare them.
plt.plot(old/6.0,'y-') # old seems 6 times more..I do not know why
plt.plot(new['ghi'].values,'b-')
plt.show()
The code above computes the old irradiance using the zenith angle, and computes the GHI values using the clear-sky model. I do not understand whether the values in ghi must also be multiplied by the cosine of the zenith angle or not. Anyway,
they are smaller by a factor of 6. What I'd like to have in the end is the
power and current output from the panel (DC) without any inverter. We
are not really interested in modelling it exactly, but we would at least like to
have a reasonable curve. We are able to measure the current (in amperes)
produced by the panel, and we want to compare the measurements taken with
the panel on the rooftop against the values calculated by pvlib.
Any help on this would be really appreciated. Thanks
Sorry Will, I do not care much about my previous model since I'd like to move all the code to pvlib. I followed your suggestion and I'm using irradiance.total_irrad; the code now looks like this:
from pvlib import atmosphere, irradiance  # modules used by the calls below
madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
times = pd.date_range(start=dt.datetime(2015,1,1,00,00),
                      end=dt.datetime(2015,1,1,23,59),
                      freq='60min')
ephem_data = pvlib.solarposition.spa_python(times, madrid.latitude,
                                            madrid.longitude)
irrad_data = madrid.get_clearsky(times)
AM = atmosphere.relativeairmass(ephem_data['apparent_zenith'])
total = irradiance.total_irrad(40, 180,
                               ephem_data['apparent_zenith'], ephem_data['azimuth'],
                               dni=irrad_data['dni'], ghi=irrad_data['ghi'],
                               dhi=irrad_data['dhi'], airmass=AM,
                               surface_type='urban')
poa = total['poa_global'].values
Now I know the irradiance on the POA, and I want to compute the output in amperes. Is it just
(poa*PANEL_EFFICIENCY*AREA) / VOLT_OUTPUT ?
It's not clear to me how you arrived at your values for DIrradiance or what the units are, so I can't comment much on the discrepancies between the values. I'm guessing that it's some kind of monthly data since there are 12 values. If so, you'd need to calculate ~hourly pvlib irradiance data and then integrate it to check for consistency.
If your module will be tilted, you'll need to convert your ~hourly irradiance GHI, DNI, DHI values to plane of array (POA) irradiance using a transposition model. The irradiance.total_irrad function is the easiest way to do that.
The next steps depend on the IV characteristics of your module, the rest of the circuit, and how accurate you need the model to be.
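For a first rough curve, the constant-efficiency formula from the question can be sketched as below. This is only an approximation under stated assumptions: the panel is taken to run at its maximum efficiency and at Vmp for every irradiance level, which a real module will not do (that is where the IV characteristics come in). The function names and the sample irradiance values are made up for illustration; the panel figures are the ones quoted in the question.

```python
# Panel figures quoted in the question.
EFFICIENCY = 0.128        # 12.8 % maximum efficiency
AREA_M2 = 0.225 * 0.155   # 225 mm x 155 mm face, in square metres
V_MP = 5.82               # voltage at maximum power, in volts

def dc_power_watts(poa_w_per_m2):
    # Power = irradiance on the panel plane * area * efficiency.
    return poa_w_per_m2 * AREA_M2 * EFFICIENCY

def dc_current_amps(poa_w_per_m2):
    # Assumption: the panel always operates at V_MP.
    return dc_power_watts(poa_w_per_m2) / V_MP

# Hypothetical plane-of-array irradiance samples in W/m^2,
# standing in for the poa array computed with pvlib.
poa_samples = [0.0, 250.0, 600.0, 1000.0]
currents = [dc_current_amps(g) for g in poa_samples]
```

Note that the input must be an irradiance in W/m^2 (as in poa_global), not a daily or monthly energy total, otherwise the units of the result will not be watts and amperes.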

difference between predicted and learned tfidf weights tfidfVectorizer sklearn

Can someone explain the difference between TFIDF (sklearn) weights obtained by two methods, given an unseen document:
To get the weights from the idf_ and vocabulary_ attributes of the fitted model:
>>> [vectorizer_.idf_[vectorizer_.vocabulary_[word]] for word in 'pancho villa saved mexico'.split()]
[10.453599686138697, 7.0510239445064196, 5.9265028838483307, 5.5037398873086669]
To get the weights by using the transform() method of the fitted vectorizer:
>>> new=vectorizer_.transform(['pancho villa saved mexico.']).toarray()
>>> new[new>0]
array([ 0.3673986 , 0.6978233 , 0.39561987, 0.47068655])
I can also see that the magnitudes are very different, and they probably show completely different patterns, giving completely different information about the unseen document. Thanks for any feedback on this as well.
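One way to see how the two outputs relate: idf_ holds the raw IDF weight of each vocabulary term, while transform() returns tf * idf for the document and then, with the default norm='l2', scales each row to unit Euclidean length. The sketch below uses only the numbers from the two listings above and assumes each of the four words occurs once in the document (tf = 1). The transform() values are compared after sorting, because transform() orders columns by vocabulary index rather than by word order in the sentence.

```python
import math

# IDF values copied from the first listing (pancho, villa, saved, mexico).
idf = [10.453599686138697, 7.0510239445064196,
       5.9265028838483307, 5.5037398873086669]

# Values copied from the transform() output in the second listing.
transformed = [0.3673986, 0.6978233, 0.39561987, 0.47068655]

# Assumption: every word occurs exactly once, so tf * idf == idf.
tfidf = [1 * w for w in idf]

# L2-normalize the row.
l2_norm = math.sqrt(sum(w * w for w in tfidf))
normalized = [w / l2_norm for w in tfidf]

# Compare as sorted lists because the column order differs.
matches = all(math.isclose(a, b, abs_tol=1e-6)
              for a, b in zip(sorted(normalized), sorted(transformed)))
```

With these numbers matches comes out True, so the two listings carry the same information: the second is just the first rescaled so that the document vector has length 1.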

Cosine similarity between any two sentences is giving 0.99 always

I downloaded the stackoverflow dump (which is a 10GB file) and ran word2vec on the dump in order to get vector representations for programming terms (I require it for a project that I'm doing). Following is the code:
from gensim.models import Word2Vec
from xml.dom.minidom import parse, parseString
titles, bodies = [], []
xmldoc = parse('test.xml')  # this is the dump
reflist = xmldoc.getElementsByTagName('row')
for i in range(len(reflist)):
    bitref = reflist[i]
    if 'Title' in bitref.attributes.keys():
        title = bitref.attributes['Title'].value
        titles.append([i for i in title.split()])
    if 'Body' in bitref.attributes.keys():
        body = bitref.attributes['Body'].value
        bodies.append([i for i in body.split()])
dimension = 8
sentences = titles + bodies
model = Word2Vec(sentences, size=dimension, iter=100)
model.save('snippet_1.model')
Now, in order to calculate the cosine similarity between a pair of sentences, I do the following:
from gensim.models import Word2Vec
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
model = Word2Vec.load('snippet_1.model')
dimension = 8
snippet = 'some text'
snippet_vector = np.zeros((1, dimension))
for word in snippet:
    if word in model.wv.vocab:
        vecvalue = model[word].reshape(1, dimension)
        snippet_vector = np.add(snippet_vector, vecvalue)
link_text = 'some other text'
link_vector = np.zeros((1, dimension))
for word in link_text:
    if word in model.wv.vocab:
        vecvalue = model[word].reshape(1, dimension)
        link_vector = np.add(link_vector, vecvalue)
print(cosine_similarity(snippet_vector, link_vector))
I am calculating the sum of word embedding for each word of a sentence to get some representation for the sentence as a whole. I do this for both sentences and then calculate the cosine similarity between them.
Now, the problem is I'm getting cosine similarity around 0.99 for any pair of sentences that I give. Is there anything that I'm doing wrong? Any suggestions for a better approach?
Are you checking that your snippet_vector and link_vector are meaningful vectors before calculating their cosine-similarity?
I suspect they're just zero-vectors, or similarly non-diverse, since your for word in snippet: and for word in link_text: loops aren't tokenizing the text. So they'll just loop over the characters in each string, which either won't be present in your model as words, or the few available may match exactly between your texts. (Even with tokenization, the texts' summed vectors would only differ by the value of a vector for the one different word, 'other'.)
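To make the tokenization point concrete, here is a small sketch that does not need a trained model. The dictionary toy_vectors and the helper sentence_vector are hypothetical stand-ins for the Word2Vec lookup, with made-up two-dimensional values.

```python
# Hypothetical 2-dimensional embeddings standing in for a trained model.
toy_vectors = {
    'some': [1.0, 0.0],
    'other': [0.0, 1.0],
    'text': [0.5, 0.5],
}

def sentence_vector(tokens, vectors, dimension=2):
    # Sum the vectors of all tokens that the "model" knows about.
    total = [0.0] * dimension
    for token in tokens:
        if token in vectors:
            total = [t + v for t, v in zip(total, vectors[token])]
    return total

snippet = 'some text'

chars = list(snippet)     # what `for word in snippet` actually iterates over
words = snippet.split()   # simple whitespace tokenization

char_sum = sentence_vector(chars, toy_vectors)   # no single character is a key
word_sum = sentence_vector(words, toy_vectors)   # sums 'some' and 'text'
```

Iterating over the string directly yields single characters, so char_sum stays a zero vector here, while word_sum is the sum of the two word vectors. In the original code the fix would be to loop over snippet.split() and link_text.split(), or over the output of whatever tokenizer was used at training time.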

How is TF calculated in Sklearn

I have been experimenting with sklearn's TfidfVectorizer.
I am only concerned with TF, and not IDF, so my settings have use_idf=False.
Complete settings are:
vectorizer = TfidfVectorizer(max_df=0.5, max_features=n_features,
                             ngram_range=(1,3), use_idf=False)
I have been trying to replicate the output of .fit_transform but haven't managed to do it so far and was hoping someone could explain the calculations for me.
My toy example is:
document = ["one two three one four five",
            "two six eight ten two"]
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
n_features = 5
vectorizer = TfidfVectorizer(max_df=0.5, max_features=n_features,
                             ngram_range=(1,3), use_idf=False)
X = vectorizer.fit_transform(document)
count = CountVectorizer(max_df=0.5, max_features=n_features,
                        ngram_range=(1,3))
countMat = count.fit_transform(document)
I have assumed the counts from the CountVectorizer will be the same as the counts used in the TfidfVectorizer. So I am trying to transform the countMat object to match X.
I had missed a line from the documentation which says
Each row is normalized to have unit euclidean norm
So to answer my own question, the answer is:
import numpy as np
counts = countMat.toarray()
for row in counts:
    # divide each row by its Euclidean (L2) norm; compare with X.toarray()
    print(row / np.sqrt(np.sum(row**2)))
Although I am sure there is a more elegant way to code the result.

NLTK package to estimate the (unigram) perplexity

I am trying to calculate the perplexity for the data I have. The code I am using is:
import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")
from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm
But I am receiving the error,
File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'
I have already performed Latent Dirichlet Allocation for the data I have and I have generated the unigrams and their respective probabilities (they are normalized as the sum of total probabilities of the data is 1).
My unigrams and their probability looks like:
Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781
This is just a fragment of the unigrams file I have. The same format is followed for thousands of lines. The probabilities (second column) sum to 1.
I am a budding programmer. This ngram.py belongs to the nltk package and I am confused as to how to rectify this. The sample code I have here is from the nltk documentation and I don't know what to do now. Please help on what I can do. Thanks in advance!
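Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:
$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i)}}$$
where $w_1, \dots, w_N$ are the words of the test set and $P(w_i)$ is the unigram probability of word $w_i$.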
Now you say you have already constructed the unigram model, meaning, for each word you have the relevant probability. Then you only need to apply the formula. I assume you have a big dictionary unigram[word] that would provide the probability of each word in the corpus. You also need to have a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you have used, so I could adapt it to my solution accordingly.
perplexity = 1
N = 0
for word in testset:
    if word in unigram:
        N += 1
        perplexity = perplexity * (1/unigram[word])
perplexity = pow(perplexity, 1/float(N))
UPDATE:
As you asked for a complete working example, here's a very simple one.
Suppose this is our corpus:
corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""
Here's how we construct the unigram model first:
import collections, nltk
# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)
# here you construct the unigram language model
def unigram(tokens):
    model = collections.defaultdict(lambda: 0.01)
    for f in tokens:
        try:
            model[f] += 1
        except KeyError:
            model[f] = 1
            continue
    N = float(sum(model.values()))
    for word in model:
        model[word] = model[word]/N
    return model
Our model here is smoothed. For words outside the scope of its knowledge, it assigns a low probability of 0.01. I already told you how to compute perplexity:
# computes perplexity of the unigram model on a testset
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        perplexity = perplexity * (1/model[word])
    perplexity = pow(perplexity, 1/float(N))
    return perplexity
Now we can test this on two different test sets:
testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"
model = unigram(tokens)
print(perplexity(testset1, model))
print(perplexity(testset2, model))
for which you get the following result:
>>>
49.09452736318415
99.99999999999997
Note that when dealing with perplexity, we try to reduce it. A language model that has lower perplexity with regard to a certain test set is more desirable than one with a higher perplexity. In the first test set, the word Monty was included in the unigram model, so the respective number for perplexity was also smaller.
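One practical caveat with the product form: on a long test set the running product of 1/P(w) grows past the range of a float and becomes inf. A common workaround is to sum logarithms instead and exponentiate the average at the end. The sketch below is one possible way to write that, not part of the code above; the function name, the toy_model dictionary and the unknown_prob parameter (mirroring the 0.01 default used for unseen words) are assumptions for illustration.

```python
import math

def perplexity_log_space(test_tokens, model, unknown_prob=0.01):
    # Same quantity as the N-th root of the product of 1/P(w),
    # computed as exp(mean(-log P(w))) to avoid float overflow.
    log_sum = 0.0
    n = 0
    for word in test_tokens:
        p = model.get(word, unknown_prob)  # fall back for unseen words
        log_sum += -math.log(p)
        n += 1
    return math.exp(log_sum / n)

# Hypothetical unigram probabilities, for illustration only.
toy_model = {'monty': 0.5, 'python': 0.25}

pp_known = perplexity_log_space(['monty', 'python'], toy_model)
pp_unknown = perplexity_log_space(['abracadabra'], toy_model)
pp_long = perplexity_log_space(['abracadabra'] * 10000, toy_model)
```

For the two-word test set this gives the square root of (1/0.5) * (1/0.25), the same value the product form would give, and the long test set of unknown words still yields a finite perplexity close to 1/0.01.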
Thanks for the code snippet! Shouldn't:
for word in model:
    model[word] = model[word]/float(sum(model.values()))
be rather:
v = float(sum(model.values()))
for word in model:
    model[word] = model[word]/v
Oh ... I see was already answered ...