Getting same value for Precision and Recall (K-NN) using sklearn - python-2.7

Updated question:
I did this, but I am getting the same result for both precision and recall. Is it because I am using average='binary'?
When I use average='macro' I get this warning instead:
Test a custom review message
C:\Python27\lib\site-packages\sklearn\metrics\classification.py:976:
DeprecationWarning: From version 0.18, binary input will not be
handled specially when using averaged precision/recall/F-score. Please
use average='binary' to report only the positive class performance.
'positive class performance.', DeprecationWarning)
Here is my updated code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation on versions < 0.18
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import precision_score, recall_score

path = 'opinions.tsv'
data = pd.read_table(path, header=None, skiprows=1, names=['Sentiment', 'Review'])
X = data.Review
y = data.Sentiment

#Using CountVectorizer to convert text into tokens/features
vect = CountVectorizer(stop_words='english', ngram_range=(1, 1), max_df=.80, min_df=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2)

#Using training data to transform text into counts of features for each message
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_test_dtm = vect.transform(X_test)

#Accuracy using KNN model
KNN = KNeighborsClassifier(n_neighbors=3)
KNN.fit(X_train_dtm, y_train)
y_pred = KNN.predict(X_test_dtm)
print('\nK Nearest Neighbors (NN = 3)')

#Feature names learned by the vectorizer
tokens_words = vect.get_feature_names()
print '\nAnalysis'
print 'Accuracy Score: %f%%' % (metrics.accuracy_score(y_test, y_pred) * 100)
print "Precision Score: %f%%" % (precision_score(y_test, y_pred, average='binary') * 100)
print "Recall Score: %f%%" % (recall_score(y_test, y_pred, average='binary') * 100)
Using the code above I get the same value for precision and recall.
Thank you for answering my question, much appreciated.

To calculate precision and recall metrics, you should import the corresponding functions from sklearn.metrics.
As stated in the documentation, their parameters are 1-d arrays of true and predicted labels:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
print('Calculating the metrics...')
precision_score(y_true, y_pred, average='macro')
>>> 0.22
recall_score(y_true, y_pred, average='macro')
>>> 0.33

Related

PyMC3 Bayesian Inference with NUTS initialization

I'm trying to implement a simple Bayesian inference using an ODE model. I want to use the NUTS algorithm to sample, but it gives me an initialization error. I do not know much about PyMC3 as I'm new to this. Please take a look and tell me what is wrong.
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
import seaborn
import pymc3 as pm
import theano.tensor as T
from theano.compile.ops import as_op

#Actual solution of the differential equation (used to generate data)
def actual(a, b, x):
    Y = np.exp(-b*x)*(a*np.exp(b*x)*(b*x-1)+a+b**2)/b**2
    return Y

#Method for solving the ODE numerically
def lv(xdata, a=5.0, b=0.2):
    def dy_dx(y, x):
        return a*x - b*y
    y0 = 1.0
    Y, info = odeint(dy_dx, y0, xdata, full_output=True)
    return Y

#Generating data for Bayesian inference
a0, b0 = 5, 0.2
xdata = np.linspace(0, 21, 100)
ydata = actual(a0, b0, xdata)
# Adding some error to the ydata points
yerror = 10*np.random.rand(len(xdata))
ydata += np.random.normal(0.0, np.sqrt(yerror))
ydata = np.ravel(ydata)

#Wrap the ODE solver so Theano can treat it as a black-box op
@as_op(itypes=[T.dscalar, T.dscalar], otypes=[T.dvector])
def func(al, be):
    Q = lv(xdata, a=al, b=be)
    return np.ravel(Q)

# Number of samples and initial conditions
nsample = 5000
y0 = 1.0

# Model for Bayesian inference
model = pm.Model()
with model:
    # Priors for unknown model parameters
    alpha = pm.Uniform('alpha', lower=a0/2, upper=a0+a0/2)
    beta = pm.Uniform('beta', lower=b0/2, upper=b0+b0/2)
    # Expected value of outcome
    mu = func(alpha, beta)
    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=yerror, observed=ydata)
    trace = pm.sample(nsample, nchains=1)

pm.traceplot(trace)
plt.show()
The error that I get is
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Initializing NUTS failed. Falling back to elementwise auto-assignment.
Any help would be really appreciated
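For what it's worth: a function wrapped with @as_op exposes no gradient to Theano, so NUTS, which relies on gradients, cannot be initialized for alpha and beta, and PyMC3 falls back to gradient-free step methods. A minimal sketch of requesting such a step explicitly, assuming the model above has already been built:
# Assumes `model` and `nsample` are defined exactly as in the question
with model:
    step = pm.Metropolis()            # Metropolis does not need gradients
    trace = pm.sample(nsample, step=step)

pm.traceplot(trace)
plt.show()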

Custom multivariate Dirichlet priors in pymc3

Is it possible to create custom multivariate distributions in pymc3? In the following, I have tried to create a linear transformation of a Dirichlet distribution. All variants on this have returned numerous errors, perhaps to do with theano data types? Any help would be gratefully appreciated.
import numpy as np
import pymc3 as pymc
import theano.tensor as tt

# data
n = 5
prior_params = np.ones(n - 1) / (n - 1)
mx = np.array([[0.25, 0.5,   0.75, 1.],
               [0.25, 0.333, 0.25, 0.],
               [0.25, 0.167, 0.,   0.],
               [0.25, 0.,    0.,   0.]])
# Note that the matrix mx takes the unit simplex into the unit simplex.

# custom log-likelihood
def generate_function(mx, prior_params):
    def log_trunc_dir(x):
        return pymc.Dirichlet.dist(a=prior_params).logp(mx.dot(x.T)).eval()
    return log_trunc_dir

# model
with pymc.Model() as simple_model:
    x = pymc.Dirichlet('x', a=np.ones(n - 1))
    q = pymc.DensityDist('q', generate_function(mx, prior_params), observed={'x': x})
Thanks to significant help from the PyMC3 development community, I can post the
following working example of a customised Dirichlet prior in PyMC3.
import pymc3 as pm
import numpy as np
import scipy.special as special
import theano.tensor as tt
import matplotlib.pyplot as plt

n = 4
with pm.Model() as model:
    prior = np.ones(n) / n

    def dirich_logpdf(value=prior):
        return -n * special.gammaln(1/n) + (-1 + 1/n) * tt.log(value).sum()

    stick = pm.distributions.transforms.StickBreaking()
    probs = pm.DensityDist('probs', dirich_logpdf, shape=n,
                           testval=np.array(prior), transform=stick)
    data = np.array([5, 7, 1, 0])
    sfs_obs = pm.Multinomial('sfs_obs', n=np.sum(data), p=probs, observed=data)

with model:
    step = pm.Metropolis()
    trace = pm.sample(100000, tune=10000, step=step)

print('MLE = ', data / np.sum(data))
print(pm.summary(trace))
pm.traceplot(trace, [probs])
plt.show()
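As a quick sanity check (my own sketch, not part of the original answer), the hand-written dirich_logpdf above can be compared against SciPy's Dirichlet log-density, since for a symmetric Dirichlet with all concentration parameters equal to 1/n the normalising constant reduces to -n*gammaln(1/n):
import numpy as np
import scipy.special as special
from scipy.stats import dirichlet

n = 4
x = np.array([0.4, 0.3, 0.2, 0.1])     # a point on the simplex
alpha = np.ones(n) / n                 # symmetric Dirichlet(1/n, ..., 1/n)

# Same expression as dirich_logpdf above, evaluated with plain NumPy
manual = -n * special.gammaln(1.0/n) + (-1 + 1.0/n) * np.log(x).sum()

print(manual, dirichlet.logpdf(x, alpha))  # the two numbers should agree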

Python: Data fitting with scipy.optimize.curve_fit with sigma = 0

I'm trying to fit a curve with scipy.optimize.curve_fit and it works pretty well so far, except in the case that a value in my sigma array is zero. I understand that the algorithm can't handle this, as I divide by zero in that case. From the scipy documentation:
sigma : None or M-length sequence, optional
If not None, the uncertainties in the ydata array. These are used as weights in the least-squares problem i.e. minimising np.sum( ((f(xdata, *popt) - ydata) / sigma)**2 ) If None, the uncertainties are assumed to be 1.
Here's what my code looks like:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = [0.125, 0.375, 0.625, 0.875, 1.125, 1.375, 1.625, 1.875, 2.125, 2.375, 2.625, 2.875, 3.125, 3.375, 3.625, 3.875, 4.125, 4.375]
y_para = [0, 0, 0.0414, 0.2164, 0.2616, 0.4254, 0.5698, 0.5921, 0.6286, 0.6452, 0.5879, 0.6032, 0.6667, 0.6325, 0.7629, 0.7164, 0.7091, 0.7887]
err = [0, 0, 0.0391, 0.0331, 0.0943, 0.0631, 0.1219, 0.1063, 0.0912, 0.0516, 0.0365, 0.0327, 0.0227, 0.103, 0.1344, 0.0697, 0.0114, 0.0465]
def logistic_growth(x, A1, A2, x_0, p):
    return A2 + (A1-A2)/(1+(x/x_0)**p)
x_plot = np.linspace(0, 4.5, 100)
bounds_para = ([0.,0,-np.inf,-np.inf],[0.0000000001, 1,np.inf,np.inf])
paras, paras_cov = curve_fit(logistic_growth, x, y_para, bounds = bounds_para, sigma = err, absolute_sigma=True)
para_curve = logistic_growth(x_plot, *paras)
plt.figure()
plt.errorbar(x,y_para, err, color = 'b', fmt = 'o', label = "Data")
plt.plot(x_plot, para_curve, color = 'b', label = "Fit")
plt.show()
Executing this without the sigma option in curve_fit works fine, but including it raises:
ValueError: Residuals are not finite in the initial point.
which results from the zeros in the err array.
Does anyone know a way to work around this?
Why not just drop the variable? If it has zero variance it cannot contribute in any meaningful way to your analysis.
This is what the scipy doc says about the curve_fit sigma parameter: 'These are used as weights in the least-squares problem ...'. So, in my opinion, they should be the inverse of the errors. Here's what I suggest.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = [0.125, 0.375, 0.625, 0.875, 1.125, 1.375, 1.625, 1.875, 2.125, 2.375, 2.625, 2.875, 3.125, 3.375, 3.625, 3.875, 4.125, 4.375]
y_para = [0, 0, 0.0414, 0.2164, 0.2616, 0.4254, 0.5698, 0.5921, 0.6286, 0.6452, 0.5879, 0.6032, 0.6667, 0.6325, 0.7629, 0.7164, 0.7091, 0.7887]
err = [0, 0, 0.0391, 0.0331, 0.0943, 0.0631, 0.1219, 0.1063, 0.0912, 0.0516, 0.0365, 0.0327, 0.0227, 0.103, 0.1344, 0.0697, 0.0114, 0.0465]
weights = [1/max(_,0.001) for _ in err]
print (weights)
def logistic_growth(x, A1, A2, x_0, p):
    return A2 + (A1-A2)/(1+(x/x_0)**p)
x_plot = np.linspace(0, 4.5, 100)
bounds_para = ([0.,0,-np.inf,-np.inf],[0.0000000001, 1,np.inf,np.inf])
paras, paras_cov = curve_fit(logistic_growth, x, y_para, bounds=bounds_para,
                             absolute_sigma=True,
                             sigma=weights)
para_curve = logistic_growth(x_plot, *paras)
plt.figure()
plt.errorbar(x,y_para, err, color = 'b', fmt = 'o', label = "Data")
plt.plot(x_plot, para_curve, color = 'b', label = "Fit")
plt.show()
This results in the following plot, where those initial data points are made to lie very close to the fitted line.
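An alternative sketch (a suggestion of mine, not from the answer above): keep sigma as the measurement errors, which is what curve_fit expects, and simply replace the zero entries with a small floor so the residual division stays finite:
import numpy as np
from scipy.optimize import curve_fit

# Assumes x, y_para, err, logistic_growth and bounds_para are defined as above
err_floored = np.clip(err, 1e-3, None)   # avoid division by zero in the residuals

paras, paras_cov = curve_fit(logistic_growth, x, y_para, bounds=bounds_para,
                             sigma=err_floored, absolute_sigma=True)
print(paras)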

python, xlrd: Manipulate spreadsheet data with xlrd, then graph the manipulated data

I am trying to extract data from an Excel spreadsheet, then find the percent change between adjacent rows. The columns I would like to do this manipulation on are columns 1 and 4. I would then like to graph these percent changes in two different bar charts using subplots, with column 0 as the x axis.
I am able to do everything except extract the data and compute the percent change between adjacent rows. The formula for the percent change is current/previous - 1, i.e. (r,0)/(r-1,0) - 1. Below is my current script:
import xlrd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as tkr
import matplotlib.dates as mdates
import datetime
from matplotlib import rc
rc('mathtext', default='regular')
file_location = "/Users/adampatel/Desktop/psw01.xls"
workbook = xlrd.open_workbook(file_location, on_demand = False)
worksheet = workbook.sheet_by_name('Data 1')
x = [worksheet.cell_value(i+1699, 0) for i in range(worksheet.nrows-1699)]
y1 = [worksheet.cell_value(i+1699, 1) for i in range(worksheet.nrows-1699)]
y2 = [worksheet.cell_value(i+1699, 4) for i in range(worksheet.nrows-1699)]
fig = plt.figure()
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212, sharex = ax1)
start_date = datetime.date(1899, 12, 30)
dates=[start_date + datetime.timedelta(xval) for xval in x]
ax1.xaxis.set_major_locator(mdates.MonthLocator((), bymonthday=1, interval=2))
ax1.xaxis.set_minor_locator(mdates.MonthLocator((), bymonthday=1, interval=1))
ax1.xaxis.set_major_formatter(mdates.DateFormatter("%b'%y"))
ly1 = ax1.bar(dates, y1, 0.9)
ly2 = ax2.bar(dates, y2, 0.9)
ax1.grid()
ax2.grid()
ax1.set_ylim(-3,3)
ax2.set_ylim(-3,3)
fig.text(0.5, 0.04, 'Inventory Weekly Percent Change', ha='center', va='center', size = '14')
fig.text(0.06, 0.5, 'Weekly Percent Change', ha='center', va='center', size = '14', rotation='vertical')
ax1.set_title('Oil', size = '12')
ax2.set_title('Gasoline', size = '12')
plt.savefig('Gasoline Inventories Weekly Percent Change.png', bbox_inches='tight', dpi=300)
plt.show()
Given list of values:
y1 = [1000,1010,950,1050,1100,1030]
Pure python solution:
Use the zip function to create tuples of the numerator and denominator. Then use list comprehension to get a list of the percent changes.
pct_chg = [1.0*num / den - 1 for num, den in zip(y1[1:], y1)]
Numpy solution:
Convert list to numpy array, then perform computation using array slices.
a1 = np.array(y1)
pct_chg = np.divide(a1[1:],a1[:-1])-1
Pandas package solution:
Convert the list to a pandas Series and use the built-in percent-change method:
import pandas as pd
s1 = pd.Series(y1)
pct_chg = s1.pct_change()
Now, pct_chg is a series too. You can get its values in a numpy array via pct_chg.values. Matplotlib should accept numpy arrays as containers in most cases.
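For a quick check (a small sketch; note that the pandas version has a leading NaN because the first row has no previous value), the three approaches give the same numbers on the sample list:
import numpy as np
import pandas as pd

y1 = [1000, 1010, 950, 1050, 1100, 1030]

pct_py = [1.0*num/den - 1 for num, den in zip(y1[1:], y1)]
a1 = np.array(y1, dtype=float)
pct_np = a1[1:]/a1[:-1] - 1
pct_pd = pd.Series(y1).pct_change()

print(pct_py)         # [0.01, -0.0594..., 0.1052..., 0.0476..., -0.0636...]
print(pct_np)         # same values as a numpy array
print(pct_pd.values)  # [nan, 0.01, -0.0594..., ...]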

scikit-learn RandomForestClassifier - How to interpret tree output?

I have the code below, but I just don't understand how to interpret the tree output data from the RandomForestClassifier, like how the gini was calculated given the samples, and how the totals in the 'value' lists can be higher than the initial samples of 3.
I am comparing this output to a DecisionTreeClassifier, which I can understand and interpret.
Any help is appreciated, thanks!
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
import numpy as np
from sklearn.externals.six import StringIO
import pydot

# Data
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])
Y = np.array([0, 1, 1, 0])

# Create object classifiers
clf = RandomForestClassifier()
clf_tree = tree.DecisionTreeClassifier()

# Fit data
clf_tree.fit(X, Y)
clf.fit(X, Y)

# Save data
dot_data = StringIO()
tree.export_graphviz(clf_tree, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("orig_tree.pdf")

i_tree = 0
for tree_in_forest in clf.estimators_:
    dot_data = StringIO()
    tree.export_graphviz(tree_in_forest, out_file=dot_data)
    graph = pydot.graph_from_dot_data(dot_data.getvalue())
    f_name = 'tree_' + str(i_tree) + '.pdf'
    graph.write_pdf(f_name)
    i_tree += 1
The decision tree:
http://i.stack.imgur.com/XZ7vU.png
A tree from the RandomForestClassifier:
http://i.stack.imgur.com/Bb5t9.png
How was the gini calculated given the samples?
The gini is computed in exactly the same way for a random forest as for a single decision tree: the Gini value (or, for regression, the variance) measures the impurity of the node.
How can the totals in the 'value' lists be higher than the initial samples of 3?
In the case of classification, the value attribute corresponds to the number of samples reaching the leaf.
In the case of a random forest, the samples are bootstrapped: each tree is fit on a sample drawn with replacement, so on average only about 2/3 of the original samples appear as unique points in a given tree, but the overall number of samples has not changed.
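For illustration (a small sketch of my own, not part of the original answer): the Gini impurity of a node is 1 - sum_k p_k**2 over the class proportions in its value list, and the "about 2/3" figure is the expected fraction of unique points in a bootstrap sample (1 - 1/e, roughly 0.632):
import numpy as np

def gini(value):
    # Gini impurity of a node given its per-class sample counts
    p = np.asarray(value, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([2, 2]))   # 0.5  -- a perfectly mixed two-class node
print(gini([3, 0]))   # 0.0  -- a pure node

# Expected fraction of unique samples in a bootstrap draw is 1 - 1/e (about 0.632)
n = 100000
draw = np.random.randint(0, n, size=n)      # sample n indices with replacement
print(len(np.unique(draw)) / float(n))      # roughly 0.632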