Computation of Kullback-Leibler (KL) distance between text-documents using numpy - python-2.7

My goal is to compute the KL distance between the following text documents:
1)The boy is having a lad relationship
2)The boy is having a boy relationship
3)It is a lovely day in NY
I first of all vectorised the documents in order to easily apply numpy
1)[1,1,1,1,1,1,1]
2)[1,2,1,1,1,2,1]
3)[1,1,1,1,1,1,1]
I then applied the following code for computing KL distance between the texts:
import numpy as np
import math
from math import log
v=[[1,1,1,1,1,1,1],[1,2,1,1,1,2,1],[1,1,1,1,1,1,1]]
c=v[0]
def kl(p, q):
p = np.asarray(p, dtype=np.float)
q = np.asarray(q, dtype=np.float)
return np.sum(np.where(p != 0,(p-q) * np.log10(p / q), 0))
for x in v:
KL=kl(x,c)
print KL
Here is the result of the above code: [0.0, 0.602059991328, 0.0].
Texts 1 and 3 are completely different, but the distance between them is 0, while texts 1 and 2, which are highly related has a distance of 0.602059991328. This isn't accurate.
Does anyone has an idea of what I'm not doing right with regards to KL? Many thanks for your suggestions.

Though I hate to add another answer, there are two points here. First, as Jaime pointed out in the comments, KL divergence (or distance - they are, according to the following documentation, the same) is designed to measure the difference between probability distributions. This means basically that what you pass to the function should be two array-likes, the elements of each of which sum to 1.
Second, scipy apparently does implement this, with a naming scheme more related to the field of information theory. The function is "entropy":
scipy.stats.entropy(pk, qk=None, base=None)
http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.entropy.html
From the docs:
If qk is not None, then compute a relative entropy (also known as
Kullback-Leibler divergence or Kullback-Leibler distance) S = sum(pk *
log(pk / qk), axis=0).
The bonus of this function as well is that it will normalize the vectors you pass it if they do not sum to 1 (though this means you have to be careful with the arrays you pass - ie, how they are constructed from data).
Hope this helps, and at least a library provides it so don't have to code your own.

After a bit of googling to undersand the KL concept, I think that your problem is due to the vectorization : you're comparing the number of appearance of different words. You should either link your column indice to one word, or use a dictionnary:
# The boy is having a lad relationship It lovely day in NY
1)[1 1 1 1 1 1 1 0 0 0 0 0]
2)[1 2 1 1 1 0 1 0 0 0 0 0]
3)[0 0 1 0 1 0 0 1 1 1 1 1]
Then you can use your kl function.
To automatically vectorize to a dictionnary, see How to count the frequency of the elements in a list? (collections.Counter is exactly what you need). Then you can loop over the union of the keys of the dictionaries to compute the KL distance.

A potential issue might be in your NP definition of KL. Read the wikipedia page for formula: http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
Note that you multiply (p-q) by the log result. In accordance with the KL formula, this should only be p:
return np.sum(np.where(p != 0,(p) * np.log10(p / q), 0))
That may help...

Related

Jaccard similarity in python

I am trying to find the jaccard similarity between two documents. However, i am having hard time to understand how the function sklearn.metrics.jaccard_similarity_score() works behind the scene.As per my understanding the Jaccard's sim = intersection of the terms in docs/ union of the terms in docs.
Consider below example:
My DTM for the two documents is:
array([[1, 1, 1, 1, 2, 0, 1, 0],
[2, 1, 1, 0, 1, 1, 0, 1]], dtype=int64)
above func. give me the jaccard sim score
print(sklearn.metrics.jaccard_similarity_score(tf_matrix[0,:],tf_matrix[1,:]))
0.25
I am trying to find the score on my own as :
intersection of terms in both the docs = 4
total terms in doc 1 = 6
total terms in doc 2 = 6
Jaccard = 4/(6+6-4)= .5
Can someone please help me understand if there is something obvious i am missing here.
As stated here:
In binary and multiclass classification, the Jaccard similarity coefficient score is equal to the classification accuracy.
Therefore in your example it is calculating the proportion of matching elements. That's why you're getting 0.25 as the result.
According to me
intersection of terms in both the docs = 2.
peek to peek intersection according to their respective index. As we need to predict correct value for our model.
Normal Intersection = 4. Leaving the order of index.
# so,
jaccard_score = 2/(6+6-4) = 0.25

Query on plotting Lorenz curves on Stata

I am trying to plot a lorenz curve, using the following command:
glcurve drugs, sortvar(death) pvar(rank) glvar(yord) lorenz nograph
generate rank1=rank
label variable rank "Cum share of mortality"
label variable rank1 "Equality Line"
twoway (line rank1 rank, sort clwidth(medthin) clpat(longdash))(line yord rank , sort clwidth(medthin) clpat(red)), ///
ytitle(Cumulative share of drug activity, size(medsmall)) yscale(titlegap(2)) xtitle(Cumulative share of mortality (2012), size(medsmall)) ///
legend(rows(5)) xscale(titlegap(5)) legend(region(lwidth(none))) plotregion(margin(zero)) ysize(6.75) xsize(6) plotregion(lcolor(none))
However, in the resultant curves, the Line of equality does not start from 0, is there a way to fix this?
Is it recommended to use the following in order to get the perfect 45 degree line of equality:
(function y=x, range(0 1)
Also, how many minimum observations are required to plot the above graph? Does it work well with 2 observations as well?
The reason your Line of Perfect Equality does not pass through (0,0) is because the values for your variable do not contain 0.
The smallest value you will have for rank will be 1/_N. Although this value will asymptotically approach 0, it will never actually reach 0.
To see this, try:
quietly sum rank
di r(min)
di 1/_N
Further, by applying the program code to your data (beginning around line 152 in the ado file and removing unnecessary bits), one can easily see that yord cannot take on a value of 0 without values of 0 for drugs:
glcurve drugs, sortvar(death) pvar(rank) glvar(yord) lorenz nograph
sort death drugs , stable
gen double rank1 = _n / _N
qui sum drugs
gen yord1= (sum(drugs) / _N) / r(mean)
The best way to plot your Equality would be the method from your edit, namely:
twoway(function y = x, ra(0 1))
One quick yet (very) crude fix to force the lorenz curve to start at the origin (if it doesn't already) is to add an observation to the data after obtaining rank and yord, and then deleting it after you have your curve:
glcurve drugs, sortvar(death) pvar(rank) glvar(yord) lorenz nograph
expand 2 in 1
replace yord = 0 in 1
replace rank = 0 in 1
twoway (function y = x, ra(0 1)) ///
(line yord rank)
drop in 1
Like I said, this is admittedly crude and even somewhat ill advised, but I can't see a much better alternative at the moment, and with this method you will not be altering any of the other values of yord by running glcurve on the extrapolated data.

Speedy test on R data frame to see if row values in one column are inside another column in the data frame

I have a data frame of marketing data with 22k records and 6 columns, 2 of which are of interest.
Variable
FO.variable
Here's a link with the dput output of a sample of the dataframe: http://dpaste.com/2SJ6DPX
Please let me know if there's a better way of sharing this data.
All I want to do is create an additional binary keep column which should be:
1 if FO.variable is inside Variable
0 if FO.Variable is not inside Variable
Seems like a simple thing...in Excel I would just add another column with an "if" formula and then paste the formula down. I've spent the past hours trying to get this and R and failing.
Here's what I've tried:
Using grepl for pattern matching. I've used grepl before but this time I'm trying to pass a column instead of a string. My early attempts failed because I tried to force grepl and ifelse resulting in grepl using the first value in the column instead of the entire thing.
My next attempt was to use transform and grep based off another post on SO. I didn't think this would give me my exact answer but I figured it would get me close enough for me to figure it out from there...the code ran for a while than errored because invalid subscript.
transform(dd, Keep = FO.variable[sapply(variable, grep, FO.variable)])
My next attempt was to use str_detect, but I don't think this is the right approach because I want the row level value and I think 'any' will literally use any value in the vector?
kk <- sapply(dd$variable, function(x) any(sapply(dd$FO.variable, str_detect, string = x)))
EDIT: Just tried a for loop. I would prefer a vectorized approach but I'm pretty desperate at this point. I haven't used for-loops before as I've avoided them and stuck to other solutions. It doesn't seem to be working quite right not sure if I screwed up the syntax:
for(i in 1:nrow(dd)){
if(dd[i,4] %in% dd[i,2])
dd$test[i] <- 1
}
As I mentioned, my ideal output is an additional column with 1 or 0 if FO.variable was inside variable. For example, the first three records in the sample data would be 1 and the 4th record would be zero since "Direct/Unknown" is not within "Organic Search, System Email".
A bonus would be if a solution could run fast. The apply options were taking a long, long time perhaps because they were looping over every iteration across both columns?
This turned out to not nearly be as simple as I would of thought. Or maybe it is and I'm just a dunce. Either way, I appreciate any help on how to best approach this.
I read the data
df = dget("http://dpaste.com/2SJ6DPX.txt")
then split the 'variable' column into its parts and figured out the lengths of each entry
v = strsplit(as.character(df$variable), ",", fixed=TRUE)
len = lengths(v) ## sapply(v, length) in R-3.1.3
Then I unlisted v and created an index that maps the unlisted v to the row from which it came from
uv = unlist(v)
idx = rep(seq_along(v), len)
Finally, I found the indexes for which uv was equal to its corresponding entry in FO.variable
test = (uv == as.character(df$FO.variable)[idx])
df$Keep = FALSE
df$Keep[ idx[test] ] = TRUE
Or combined (it seems more useful to return the logical vector than the modified data.frame, which one could obtain with dd$Keep = f0(dd))
f0 = function(dd) {
v = strsplit(as.character(dd$variable), ",", fixed=TRUE)
len = lengths(v)
uv = unlist(v)
idx = rep(seq_along(v), len)
keep = logical(nrow(dd))
keep[ idx[uv == as.character(dd$FO.variable)[idx]] ] = TRUE
keep
}
(This could be made faster using the fact that the columns are factors, but maybe that's not intentional?) Compared with (the admittedly simpler and easier to understand)
f1 = function(dd)
mapply(grepl, dd$FO.variable, dd$variable, fixed=TRUE)
f1a = function(dd)
mapply(grepl, as.character(dd$FO.variable),
as.character(dd$variable), fixed=TRUE)
f2 = function(dd)
apply(dd, 1, function(x) grepl(x[4], x[2], fixed=TRUE))
with
> library(microbenchmark)
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), unname(f2(df)))
[1] TRUE
> microbenchmark(f0(df), f1(df), f1a(df), f2(df))
Unit: microseconds
expr min lq mean median uq max neval
f0(df) 57.559 64.6940 70.26804 69.4455 74.1035 98.322 100
f1(df) 573.302 603.4635 625.32744 624.8670 637.1810 766.183 100
f1a(df) 138.527 148.5280 156.47055 153.7455 160.3925 246.115 100
f2(df) 494.447 518.7110 543.41201 539.1655 561.4490 677.704 100
Two subtle but important additions during the development of the timings were to use fixed=TRUE in the regular expression, and to coerce the factors to character.
I would go with a simple mapply in your case, as you correctly said, by row operations will be very slow. Also, (as suggested by Martin) setting fixed = TRUE and apriori converting to character will significantly improve performance.
transform(dd, Keep = mapply(grepl,
as.character(FO.variable),
as.character(variable),
fixed = TRUE))
# VisitorIDTrue variable value FO.variable FO.value Keep
# 22 44888657 Direct / Unknown,Organic Search 1 Direct / Unknown 1 TRUE
# 2 44888657 Direct / Unknown,System Email 1 Direct / Unknown 1 TRUE
# 6 44888657 Direct / Unknown,TV 1 Direct / Unknown 1 TRUE
# 10 44888657 Organic Search,System Email 1 Direct / Unknown 1 FALSE
# 18 44888657 Organic Search,TV 1 Direct / Unknown 1 FALSE
# 14 44888657 System Email,TV 1 Direct / Unknown 1 FALSE
# 24 44888657 Direct / Unknown,Organic Search 1 Organic Search 1 TRUE
# 4 44888657 Direct / Unknown,System Email 1 Organic Search 1 FALSE
...
Here is a data.table approach that I think is very similar in spirit to Martin's:
require(data.table)
dt <- data.table(df)
dt[,`:=`(
fch = as.character(FO.variable),
rn = 1:.N
)]
dt[,keep:=FALSE]
dtvars <- dt[,strsplit(as.character(variable),',',fixed=TRUE),by=rn]
setkey(dt,rn,fch)
dt[dtvars,keep:=TRUE]
dt[,c("fch","rn"):=NULL]
The idea is to
identify all pairs of rn & variable (saved in dtvars) and
see which of these pairs match with rn & F0.variable pairs (in the original table, dt).

Fitting a Gaussian, getting a straight line. Python 2.7

As my title suggests, I'm trying to fit a Gaussian to some data and I'm just getting a straight line. I've been looking at these other discussion Gaussian fit for Python and Fitting a gaussian to a curve in Python which seem to suggest basically the same thing. I can make the code in those discussions work fine for the data they provide, but it won't do it for my data.
My code looks like this:
import pylab as plb
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy import asarray as ar,exp
y = y - y[0] # to make it go to zero on both sides
x = range(len(y))
max_y = max(y)
n = len(y)
mean = sum(x*y)/n
sigma = np.sqrt(sum(y*(x-mean)**2)/n)
# Someone on a previous post seemed to think this needed to have the sqrt.
# Tried it without as well, made no difference.
def gaus(x,a,x0,sigma):
return a*exp(-(x-x0)**2/(2*sigma**2))
popt,pcov = curve_fit(gaus,x,y,p0=[max_y,mean,sigma])
# It was suggested in one of the other posts I looked at to make the
# first element of p0 be the maximum value of y.
# I also tried it as 1, but that did not work either
plt.plot(x,y,'b:',label='data')
plt.plot(x,gaus(x,*popt),'r:',label='fit')
plt.legend()
plt.title('Fig. 3 - Fit for Time Constant')
plt.xlabel('Time (s)')
plt.ylabel('Voltage (V)')
plt.show()
The data I am trying to fit is as follows:
y = array([ 6.95301373e+12, 9.62971320e+12, 1.32501876e+13,
1.81150568e+13, 2.46111132e+13, 3.32321345e+13,
4.45978682e+13, 5.94819771e+13, 7.88394616e+13,
1.03837779e+14, 1.35888594e+14, 1.76677210e+14,
2.28196006e+14, 2.92781632e+14, 3.73133045e+14,
4.72340762e+14, 5.93892782e+14, 7.41632194e+14,
9.19750269e+14, 1.13278296e+15, 1.38551838e+15,
1.68291212e+15, 2.02996957e+15, 2.43161742e+15,
2.89259207e+15, 3.41725793e+15, 4.00937676e+15,
4.67187762e+15, 5.40667931e+15, 6.21440313e+15,
7.09421973e+15, 8.04366842e+15, 9.05855930e+15,
1.01328502e+16, 1.12585509e+16, 1.24257598e+16,
1.36226443e+16, 1.48356404e+16, 1.60496345e+16,
1.72482199e+16, 1.84140400e+16, 1.95291969e+16,
2.05757166e+16, 2.15360187e+16, 2.23933053e+16,
2.31320228e+16, 2.37385276e+16, 2.42009864e+16,
2.45114362e+16, 2.46427484e+16, 2.45114362e+16,
2.42009864e+16, 2.37385276e+16, 2.31320228e+16,
2.23933053e+16, 2.15360187e+16, 2.05757166e+16,
1.95291969e+16, 1.84140400e+16, 1.72482199e+16,
1.60496345e+16, 1.48356404e+16, 1.36226443e+16,
1.24257598e+16, 1.12585509e+16, 1.01328502e+16,
9.05855930e+15, 8.04366842e+15, 7.09421973e+15,
6.21440313e+15, 5.40667931e+15, 4.67187762e+15,
4.00937676e+15, 3.41725793e+15, 2.89259207e+15,
2.43161742e+15, 2.02996957e+15, 1.68291212e+15,
1.38551838e+15, 1.13278296e+15, 9.19750269e+14,
7.41632194e+14, 5.93892782e+14, 4.72340762e+14,
3.73133045e+14, 2.92781632e+14, 2.28196006e+14,
1.76677210e+14, 1.35888594e+14, 1.03837779e+14,
7.88394616e+13, 5.94819771e+13, 4.45978682e+13,
3.32321345e+13, 2.46111132e+13, 1.81150568e+13,
1.32501876e+13, 9.62971320e+12, 6.95301373e+12,
4.98705540e+12])
I would show you what it looks like, but apparently I don't have enough reputation points...
Anyone got any idea why it's not fitting properly?
Thanks for your help :)
The importance of the initial guess, p0 in curve_fit's default argument list, cannot be stressed enough.
Notice that the docstring mentions that
[p0] If None, then the initial values will all be 1
So if you do not supply it, it will use an initial guess of 1 for all parameters you're trying to optimize for.
The choice of p0 affects the speed at which the underlying algorithm changes the guess vector p0 (ref. the documentation of least_squares).
When you look at the data that you have, you'll notice that the maximum and the mean, mu_0, of the Gaussian-like dataset y, are
2.4e16 and 49 respectively. With the peak value so large, the algorithm, would need to make drastic changes to its initial guess to reach that large value.
When you supply a good initial guess to the curve fitting algorithm, convergence is more likely to occur.
Using your data, you can supply a good initial guess for the peak_value, the mean and sigma, by writing them like this:
y = np.array([...]) # starting from the original dataset
x = np.arange(len(y))
peak_value = y.max()
mean = x[y.argmax()] # observation of the data shows that the peak is close to the center of the interval of the x-data
sigma = mean - np.where(y > peak_value * np.exp(-.5))[0][0] # when x is sigma in the gaussian model, the function evaluates to a*exp(-.5)
popt,pcov = curve_fit(gaus, x, y, p0=[peak_value, mean, sigma])
print(popt) # prints: [ 2.44402560e+16 4.90000000e+01 1.20588976e+01]
Note that in your code, for the mean you take sum(x*y)/n , which is strange, because this would modulate the gaussian by a polynome of degree 1 (it multiplies a gaussian with a monotonously increasing line of constant slope) before taking the mean. That will offset the mean value of y (in this case to the right). A similar remark can be made for your calculation of sigma.
Final remark: the histogram of y will not resemble a Gaussian, as y is already a Gaussian. The histogram will merely bin (count) values into different categories (answering the question "how many datapoints in y reach a value between [a, b]?").

Calculating the distance between characters

Problem: I have a large number of scanned documents that are linked to the wrong records in a database. Each image has the correct ID on it somewhere that says where it belongs in the db.
I.E. A DB row could be:
| user_id | img_id | img_loc |
| 1 | 1 | /img.jpg|
img.jpg would have the user_id (1) on the image somewhere.
Method/Solution: Loop through the database. Pull the image text in to a variable with OCR and check if user_id is found anywhere in the variable. If not, flag the record/image in a log, if so do nothing and move on.
My example is simple, in the real world I have a guarantee that user_id wouldn't accidentally show up on the wrong form (it is of a specific format that has its own significance)
Right now it is working. However, it is incredibly strict. If you've worked with OCR you understand how fickle it can be. Sometimes a 7 = 1 or a 9 = 7, etc. The result is a large number of false positives. Especially among images with low quality scans.
I've addressed some of the image quality issues with some processing on my side - increase image size, adjust the black/white threshold and had satisfying results. I'd like to add the ability for the prog to recognize, for example, that "81*7*23103" is not very far from "81*9*23103"
The only way I know how to do that is to check for strings >= to the length of what I'm looking for. Calculate the distance between each character, calc an average and give it a limit on what is a good average.
Some examples:
Ex 1
81723103 - Looking for this
81923103 - Found this
--------
00200000 - distances between characters
0 + 0 + 2 + 0 + 0 + 0 + 0 + 0 = 2
2/8 = .25 (pretty good match. 0 = perfect)
Ex 2
81723103 - Looking
81158988 - Found
--------
00635885 - distances
0 + 0 + 6 + 3 + 5 + 8 + 8 + 5 = 35
35/8 = 4.375 (Not a very good match. 9 = worst)
This way I can tell it "Flag the bottom 30% only" and dump anything with an average distance > 6.
I figure I'm reinventing the wheel and wanted to share this for feedback. I see a huge increase in run time and a performance hit doing all these string operations over what I'm currently doing.