scikit-learn RandomForestClassifier - How to interpret tree output? - python-2.7

I have the code below, but I just don't understand how to interpret the tree output from the RandomForestClassifier: how the gini is calculated given the samples, and how the totals in the 'value' lists can be higher than the initial 3 samples.
I am comparing this output to a DecisionTreeClassifier, which I can understand and interpret.
Any help is appreciated, thanks!
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
import numpy as np
from sklearn.externals.six import StringIO
import pydot
# Data
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])
Y = np.array([0, 1, 1, 0])
# Create object classifiers
clf = RandomForestClassifier()
clf_tree = tree.DecisionTreeClassifier()
# Fit data
clf_tree.fit(X,Y)
clf.fit(X, Y)
# Save data
dot_data = StringIO()
tree.export_graphviz(clf_tree, out_file = dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("orig_tree.pdf")
i_tree = 0
for tree_in_forest in clf.estimators_:
    dot_data = StringIO()
    tree.export_graphviz(tree_in_forest, out_file=dot_data)
    graph = pydot.graph_from_dot_data(dot_data.getvalue())
    f_name = 'tree_' + str(i_tree) + '.pdf'
    graph.write_pdf(f_name)
    i_tree += 1
The decision tree:
http://i.stack.imgur.com/XZ7vU.png
A tree from the RandomForestClassifier:
http://i.stack.imgur.com/Bb5t9.png

How is the gini calculated, given the samples?
The gini is computed in exactly the same way for the random forest and for the decision tree: the Gini value measures the impurity of a node (variance plays the analogous role in regression).
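For a node with class counts c_k, the Gini impurity is 1 - sum(p_k**2), where p_k are the class proportions. A minimal sketch, using the root node of the decision tree above (two samples of each class):
import numpy as np

def gini(counts):
    # Gini impurity: 1 minus the sum of squared class proportions
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

print(gini([2, 2]))  # root of the XOR data: 1 - (0.5**2 + 0.5**2) = 0.5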
How can the totals in the 'value' lists be higher than the initial 3 samples?
In the case of classification, the value attribute holds the number of samples of each class reaching that node.
In the case of a random forest, the samples are bootstrapped (drawn with replacement), so on average each tree sees only about 2/3 of the distinct original samples, but some of them appear more than once; the overall number of samples drawn hasn't changed. That is why a node's 'value' total can exceed the number of distinct samples it contains.
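To see why, here is a minimal sketch of a bootstrap draw over the 4 training rows (just an illustration of the idea, not sklearn's internal sampling code):
import numpy as np

rng = np.random.RandomState(0)
# draw 4 row indices from the 4 samples, with replacement
idx = rng.randint(0, 4, size=4)
print(idx)                  # e.g. some row appears twice
print(np.unique(idx).size)  # number of distinct rows, often < 4
Because some rows are drawn several times, the per-class counts in 'value' include the duplicates.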

Related

ValueError: shapes (1,4) and (5,4) not aligned: 4 (dim 1) != 5 (dim 0), when adding my variables to my prediction machine

I am creating a prediction machine with four variables. When I add the variables it all messes up and gives me:
ValueError: shapes (1,4) and (5,4) not aligned: 4 (dim 1) != 5 (dim 0)
Code:
import pandas as pd
from pandas import DataFrame
from sklearn import linear_model
import tkinter as tk
import statsmodels.api as sm
# Approach 1: Import the data into Python
Stock_Market = pd.read_csv(r'Training_Nis_New2.csv')
df = DataFrame(Stock_Market,
               columns=['Month 1','Month 2','Month 3','Month 4','Month 5',
                        'Month 6','Month 7','Month 8','Month 9','Month 10',
                        'Month 11','Month 12','FSUTX','MMUKX','FUFRX','RYUIX',
                        'Interest R','Housing Sale','Unemployement Rate',
                        'Conus Average Temperature Rank',
                        '30FSUTX','30MMUKX','30FUFRX','30RYUIX'])
X = df[['Month 1','Interest R','Housing Sale','Unemployement Rate','Conus Average Temperature Rank']]
# Multiple variables are used here for multiple regression. For simple linear
# regression with a single variable, use e.g. X = df[['Interest R']] instead;
# alternatively, add further variables within the brackets.
Y = df[['30FSUTX','30MMUKX','30FUFRX','30RYUIX']]
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)
# prediction with sklearn
HS=5.5
UR=6.7
CATR=8.9
New_Interest_R = 4.6
print('Predicted Stock Index Price: \n',
      regr.predict([[UR, HS, CATR, New_Interest_R]]))
# with statsmodel
X = df[['Month 1','Interest R','Housing Sale','Unemployement Rate','Conus Average Temperature Rank']]
Y = df['30FSUTX']
print('\n\n*** Fund = FSUTX')
X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
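The error message itself points at the likely cause: regr was fit on five feature columns, but regr.predict is handed only four values, so the (1,4) input cannot be multiplied with the coefficient matrix of a five-feature model. A sketch of the fix, passing all five features in training-column order (M1 below is a hypothetical value for the 'Month 1' feature, which the snippet never defines):
M1 = 4.0  # hypothetical value for 'Month 1'
# the feature order must match the columns the model was fit on:
# ['Month 1', 'Interest R', 'Housing Sale', 'Unemployement Rate',
#  'Conus Average Temperature Rank']
print('Predicted Stock Index Price: \n',
      regr.predict([[M1, New_Interest_R, HS, UR, CATR]]))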

want to create an empty matrix of unknown dimension and append feature vectors

I want to append HoG feature vectors to an empty matrix of unknown dimension. Is it required to specify the dimensions of the matrix in advance? I have tried the code below in Python, but it raises an error saying all the input arrays must have the same dimension.
import matplotlib.pyplot as plt
from skimage.feature import hog
from skimage import data, exposure, img_as_float
from skimage import data
import numpy as np
from scipy import linalg
import cv2
import glob
shape = (16576, 1)
X = np.empty(shape)
print X.shape
hog_image = np.empty(shape)
hog_image_rescaled = np.empty(shape)
for img in glob.glob("/home/madhuri/pythoncode/faces/*.jpg"):
    n = cv2.imread(img)
    gray = cv2.cvtColor(n, cv2.COLOR_RGB2GRAY)
    hog_image = hog(gray, orientations=9, pixels_per_cell=(16, 16),
                    cells_per_block=(3, 3), visualise=False)
    hog_image_rescaled = exposure.rescale_intensity(hog_image,
                                                    in_range=(0, 10))
    X = np.append(X, hog_image_rescaled, axis=1)
print 'X is'
print np.shape(X)
X = []  # use an 'empty' list
# hog_image = np.empty(shape)           # no point initializing these variables;
# hog_image_rescaled = np.empty(shape)  # you just reassign them in the loop
for img in glob.glob("/home/madhuri/pythoncode/faces/*.jpg"):
    n = cv2.imread(img)
    gray = cv2.cvtColor(n, cv2.COLOR_RGB2GRAY)
    hog_image = hog(gray, orientations=9, pixels_per_cell=(16, 16),
                    cells_per_block=(3, 3), visualise=False)
    hog_image_rescaled = exposure.rescale_intensity(hog_image,
                                                    in_range=(0, 10))
    X.append(hog_image_rescaled)
Now X will be a list of rescaled images. Those elements can be concatenated along whichever dimension is appropriate:
np.concatenate(X, axis=1)
np.stack(X)
# etc
The list idiom
alist = []
for ...:
    alist.append(...)
does not translate well to arrays. np.append is just a cover for np.concatenate: it makes a whole new array on every call, which is far more expensive than appending to a list. Defining a good starting 'empty' array for such a loop is also tricky; np.empty is not appropriate, since it returns uninitialized memory:
In [977]: np.empty((2,3))
Out[977]:
array([[1.48e-323, 1.24e-322, 1.33e-322],
       [1.33e-322, 1.38e-322, 1.38e-322]])
In [978]: np.append(_, np.zeros((2,1)), axis=1)
Out[978]:
array([[1.48e-323, 1.24e-322, 1.33e-322, 0.00e+000],
       [1.33e-322, 1.38e-322, 1.38e-322, 0.00e+000]])
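As a minimal sketch (with made-up shapes rather than real HoG output), the list-then-stack pattern builds the final 2-D array in one step:
import numpy as np

feats = [np.zeros(16576) for _ in range(5)]  # five fake 1-D feature vectors
X = np.stack(feats)                          # one concatenation at the end
print(np.shape(X))                           # (5, 16576)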

Getting same value for Precision and Recall (K-NN) using sklearn

Updated question:
I did this, but I am getting the same result for both precision and recall. Is it because I am using average='binary'?
When I use average='macro' I get this deprecation warning instead:
Test a custom review message
C:\Python27\lib\site-packages\sklearn\metrics\classification.py:976:
DeprecationWarning: From version 0.18, binary input will not be
handled specially when using averaged precision/recall/F-score. Please
use average='binary' to report only the positive class performance.
'positive class performance.', DeprecationWarning)
Here is my updated code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation on older versions
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import precision_score, recall_score

path = 'opinions.tsv'
data = pd.read_table(path, header=None, skiprows=1, names=['Sentiment','Review'])
X = data.Review
y = data.Sentiment
# Using CountVectorizer to convert text into tokens/features
vect = CountVectorizer(stop_words='english', ngram_range=(1,1), max_df=.80, min_df=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2)
# Using training data to transform text into counts of features for each message
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_test_dtm = vect.transform(X_test)
# Accuracy using KNN model
KNN = KNeighborsClassifier(n_neighbors=3)
KNN.fit(X_train_dtm, y_train)
y_pred = KNN.predict(X_test_dtm)
print('\nK Nearest Neighbors (NN = 3)')
# Features learned by the vectorizer
tokens_words = vect.get_feature_names()
print '\nAnalysis'
print 'Accuracy Score: %f %%' % (metrics.accuracy_score(y_test, y_pred) * 100)
print "Precision Score: %f%%" % precision_score(y_test, y_pred, average='binary')
print "Recall Score: %f%%" % recall_score(y_test, y_pred, average='binary')
By using the code above I get same value for precision and recall.
Thank you for answering my question, much appreciated.
To calculate precision and recall metrics, you should import the corresponding functions from sklearn.metrics.
As stated in the documentation, their parameters are 1-d arrays of true and predicted labels:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
print('Calculating the metrics...')
precision_score(y_true, y_pred, average='macro')
>>> 0.22
recall_score(y_true, y_pred, average='macro')
>>> 0.33
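As an aside on the original question: one common reason for seeing identical precision and recall is micro averaging, since with average='micro' both scores pool all classes and reduce to overall accuracy for single-label data. A small sketch using the same toy labels:
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
# micro-averaged precision == micro-averaged recall == accuracy
print(precision_score(y_true, y_pred, average='micro'))  # 0.333...
print(recall_score(y_true, y_pred, average='micro'))     # 0.333...
print(accuracy_score(y_true, y_pred))                    # 0.333...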

Plotting graph using pylab

I am trying to plot a graph. I have a list containing action names (text) and another list containing each action's frequency (int).
I want to plot a connected graph. This is the code I've written:
xTicks=np.array(action)
x=np.array(count)
y=np.array(freq)
pl.xticks(x,xTicks)
pl.xticks(rotation=90)
pl.plot(x,y)
pl.show()
In the list xTicks I have the actions, and in the list y I have their frequencies.
With the above code, I am getting this graph:
Why am I getting extra spaces on the x axis? The ticks should be evenly spaced, and the lists are 130-135 items long, so how can I scroll the plot?
You need to set x to an evenly spaced list in order to get your x ticks to be evenly spaced. The following is an example with some made up data:
import matplotlib.pyplot as plt
import numpy as np
action = ["test1", "test2", "test3", "test4", "test5", "test6", "test7", "test8", "test9"]
freq = [5,3,7,4,8,3,5,1,12]
y=np.array(freq)
xTicks=np.array(action)
x = np.arange(0,len(action),1) # evenly spaced list with the same length as "freq"
plt.plot(x,y)
plt.xticks(x, xTicks, rotation=90)
plt.show()
This produces the following plot:
Update:
A simple example of a slider is shown below. You will have to make changes to this in order to get it exactly how you want but it will be a start:
from matplotlib.widgets import Slider
freq = [5,3,7,4,8,3,5,1,12,5,3,7,4,8,3,5,1,12,5,3,7,4,8,3,5,1,12,4,9,1]
y=np.array(freq)
x = np.arange(0,len(freq),1) # evenly spaced list with the same length as "freq"
fig, ax = plt.subplots()
plt.subplots_adjust(left=0.25, bottom=0.25)
l, = plt.plot(x, y, lw=2, color='red')
axfreq = plt.axes([0.25, 0.1, 0.65, 0.03], facecolor="lightblue")
sfreq = Slider(axfreq, 'Slider', 0.1, 10, valinit=3)
def update(val):
    l.set_xdata(val * x)
    fig.canvas.draw_idle()
sfreq.on_changed(update)
plt.show()
This produces the following graph which has a slider:

number of parameters in Caffe LENET or Imagenet models

How can I calculate the number of parameters in a model, e.g. LeNet for MNIST, or a ConvNet for an ImageNet model, etc.? Is there any specific function in Caffe that returns or saves the number of parameters in a model?
Regards
Here is a python snippet to compute the number of parameters in a Caffe model:
import caffe
caffe.set_mode_cpu()
import numpy as np
from numpy import prod, sum
from pprint import pprint
def print_net_parameters(deploy_file):
    print "Net: " + deploy_file
    net = caffe.Net(deploy_file, caffe.TEST)
    print "Layer-wise parameters: "
    pprint([(k, v[0].data.shape) for k, v in net.params.items()])
    print "Total number of parameters: " + str(sum([prod(v[0].data.shape) for k, v in net.params.items()]))
deploy_file = "/home/ubuntu/deploy.prototxt"
print_net_parameters(deploy_file)
# Sample output:
# Net: /home/ubuntu/deploy.prototxt
# Layer-wise parameters:
#[('conv1', (96, 3, 11, 11)),
# ('conv2', (256, 48, 5, 5)),
# ('conv3', (384, 256, 3, 3)),
# ('conv4', (384, 192, 3, 3)),
# ('conv5', (256, 192, 3, 3)),
# ('fc6', (4096, 9216)),
# ('fc7', (4096, 4096)),
# ('fc8', (819, 4096))]
# Total number of parameters: 60213280
https://gist.github.com/kaushikpavani/a6a32bd87fdfe5529f0e908ed743f779
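Note that the snippet above counts only v[0] of each layer. In pycaffe, net.params[k] is a list of blobs, conventionally the weights at index 0 and the bias (when the layer has one) at index 1, so under that convention a sketch that includes biases is a one-line change:
n_params = sum(prod(blob.data.shape)
               for blobs in net.params.values()
               for blob in blobs)
print "Total number of parameters (incl. biases): " + str(n_params)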
I can offer an explicit way to do this via the Matlab interface (make sure matcaffe is installed first).
Basically, you extract the set of parameters from each network layer and count them.
In Matlab:
% load the network
net_model = <path to your *deploy.prototxt file>
net_weights = <path to your *.caffemodel file>
phase = 'test';
test_net = caffe.Net(net_model, net_weights, phase);
% get the list of layers
layers_list = test_net.layer_names;
% for those layers which have parameters, count them
counter = 0;
for j = 1:length(layers_list)
    if ~isempty(test_net.layers(layers_list{j}).params)
        feat = test_net.layers(layers_list{j}).params(1).get_data();
        counter = counter + numel(feat)
    end
end
In the end, 'counter' contains the number of parameters. Note that this counts only params(1), i.e. the weight blob of each layer; layers with a bias store it in params(2), so include that as well if you need an exact total.