How to train a classifier with an array of arrays? - python-2.7

I want to use a decision tree classifier in order to predict something.
As you can see here:
from sklearn import tree
sample1 = [120,1]
sample2 = [123,3]
features = [sample1,sample2]
labels = [0,1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
I have two samples:
Sample one: [120,1] which I labelled as 0
Sample two: [123,3] which I labelled as 1
So far so good.
But now, instead of these samples, I want to train using an array, something like:
features = [[120,120.2][1, 1.2]]
and the respective label for this sample is:
label = [1]
So my code should be:
from sklearn import tree
features = [[120,120.2][1, 1.2]]
labels = [1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
I'm getting the following error:
TypeError: list indices must be integers, not tuple
I understand that the classifier wants a list of integers, and not tuples.
And a solution may be:
features = [[120, 120.2, 1, 1.2]]
labels = [1]
But I don't want to mix up the data, since I have it separately into arrays.
Is there any way I can train my classifier with arrays of arrays of data?
Thanks

No, you can't use this format for your data; you need to aggregate each sample's values into one array.
The expected shape is (n_samples, n_features).
This is also more logical: each example is described by its features, and the expected format represents your data better.
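For example, a minimal sketch of that aggregation (the two group names are illustrative, not from the question):
import numpy as np
from sklearn import tree

group_a = [120, 120.2]   # first array of measurements for the sample
group_b = [1, 1.2]       # second array of measurements for the sample

# One row per sample, one column per feature: shape (1, 4)
features = np.concatenate([group_a, group_b]).reshape(1, -1)
labels = [1]

clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)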

Related

TensorFlow: shuffle a tensor for batch gradient

To whom it may concern,
I am pretty new to TensorFlow. I am trying to solve the famous MNIST problem with a CNN, but I have encountered difficulty when I have to reshuffle the x_train_final training data (which is a tensor of shape [40000, 28, 28, 1]).
my code is as below:
x_train_final = tf.reshape(x_train_final, [-1, image_width, image_width, 1])
x_train_final = tf.cast(x_train_final, dtype=tf.float32)
perm = np.arange(num_training_example).astype(np.int32)
np.random.shuffle(perm)
x_train_final = x_train_final[perm]
The following error happened:
ValueError: Shape must be rank 1 but is rank 2 for 'strided_slice_1371' (op: 'StridedSlice') with input shapes: [40000,28,28,1], [1,40000], [1,40000], [1].
Can anyone advise how I can work around this? Thanks.
I would suggest you make use of scikit-learn's shuffle function.
from sklearn.utils import shuffle
x_train_final = shuffle(x_train_final)
Also, you can pass in multiple arrays, and the shuffle function will reorganize (shuffle) the data in all of them while maintaining the same shuffling order across the arrays. So you can pass in your label dataset as well.
Ex:
X_train, y_train = shuffle(X_train, y_train)
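If you would rather stay entirely in TensorFlow, the error above comes from fancy-indexing a tensor with a NumPy permutation; a hedged alternative sketch is to apply the permutation with tf.gather (the zeros tensor below just stands in for the question's data):
import numpy as np
import tensorflow as tf

num_training_example = 40000
x_train_final = tf.zeros([num_training_example, 28, 28, 1], dtype=tf.float32)  # stand-in data

perm = np.arange(num_training_example).astype(np.int32)
np.random.shuffle(perm)
x_train_final = tf.gather(x_train_final, perm)  # reorders rows along axis 0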

What parameters to pass to svm function of scikit learn library for document classification

I am new to Machine Learning. I am working on document classification.
For that, I am trying to train an SVM on a subset of the "20 newsgroups" dataset.
I am using scikit-learn for this.
link: SVM - Scikit Learn
As a training set I have taken 3 categories of news and 40 documents in each category.
For each document, I have done the following so far:
tokenization
removing stop words (i.e. 'the','on','in' etc.)
lemmatization (stemming words) (i.e. 'runs','running','ran' = 'run')
calculating tf-idf score for the remaining words
(labels[]: list containing the category label for each document)
(final_list[]: list containing, for each document, a list of words and their tf-idf scores, i.e.
final_list = [
    [['run', 0.16544], ['ground', 0.1224], ...],
    [['disk', 0.9677], ['pc', 0.8888], ...],
    ...
    ...
])
As other classifiers, SVC, NuSVC, and LinearSVC take as input two arrays:
an array X of size [n_samples, n_features] holding the training samples,
and an array y of class labels (strings or integers), size [n_samples]:
Sample code from the scikit-learn webpage (for numeric data):
>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]
>>> y = [0, 1]
>>> clf = svm.SVC()
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
After being fitted, the model can then be used to predict new values:
>>> clf.predict([[2., 2.]])
array([1])
Now when I am using the following lines:
>>> from sklearn import svm
>>> clf = svm.SVC()
>>> clf.fit(final_list, labels)
it gives:
ValueError: setting an array element with a sequence.
I am not sure what parameters to pass to svm function for my problem statement.
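For what it's worth, one way (not from the original thread) to get final_list into the fixed-width (n_samples, n_features) shape that SVC expects is to turn each document's word/score pairs into a dict and vectorize them; a minimal sketch with tiny stand-in data:
from sklearn import svm
from sklearn.feature_extraction import DictVectorizer

# Stand-in for the question's final_list and labels
final_list = [
    [['run', 0.16544], ['ground', 0.1224]],
    [['disk', 0.9677], ['pc', 0.8888]],
]
labels = ['sport', 'hardware']

# One dict per document: word -> tf-idf score
dicts = [dict(pairs) for pairs in final_list]
vec = DictVectorizer()
X = vec.fit_transform(dicts)   # sparse matrix, shape (n_samples, vocabulary size)

clf = svm.SVC()
clf.fit(X, labels)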

Getting indices while using train test split in scikit

In order to split my data into train and test sets, I'm using the
sklearn.cross_validation.train_test_split function.
When I supply my data and labels as list of lists to this function, it returns train and test data in two separate lists.
I want to get the indices of the train and test data elements from the original data list.
Can anyone help me out with this?
Thanks in advance
You can supply the index vector as an additional argument. Using the example from sklearn:
import numpy as np
from sklearn.cross_validation import train_test_split
X, y, indices = (0.1 * np.arange(10)).reshape((5, 2)), range(10, 15), range(5)
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(X, y, indices, test_size=0.33, random_state=42)
indices_train,indices_test
#([2, 0, 3], [1, 4])
Try the below solutions (depending on whether you have imbalance):
NUM_ROWS = train.shape[0]
TEST_SIZE = 0.3
indices = np.arange(NUM_ROWS)
# usual train-val split
train_idx, val_idx = train_test_split(indices, test_size=TEST_SIZE, train_size=None)
# stratified train-val split as per Response's proportion (if imbalance)
strat_train_idx, strat_val_idx = train_test_split(indices, test_size=TEST_SIZE, stratify=y)
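Either way, once you have the index split you can slice the original arrays (and anything else aligned with them) with those indices; a small illustrative sketch (using sklearn.model_selection, the newer home of train_test_split):
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # stand-in data
y = np.arange(10)                  # stand-in labels
indices = np.arange(len(X))

train_idx, test_idx = train_test_split(indices, test_size=0.3, random_state=42)
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]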

DataConversionWarning with GridSearchCV in sklearn

I'm getting the following warning repeatedly when using GridSearchCV in sklearn:
"DataConversionWarning: Copying input dataframe for slicing."
I tried running some of the models separately outside of Gridsearch and didn't get any warnings. It also didn't prevent Gridsearch from finding a model.
I have 2 Questions:
1) What does this error mean?
2) What are the implications for my output, if any?
The relevant parts of the code are below:
df = pd.read_csv(os.path.join(filepath, "Modeling_Set.csv")) #loads main data
keep_vars = pd.read_csv(os.path.join(filepath, "keep_vars.csv")) #loads a list of variables to keep from a CSV list
model_vars = keep_vars[keep_vars['keep']==1]['name'] #creates a list of vars to keep
modeling_df = df[model_vars] #creates the df with only keep vars
model_feature_vars = model_vars[:-1]
#Splits test and train data
X_train, X_test, y_train, y_test = train_test_split(modeling_df[model_feature_vars], modeling_df['Segment'], test_size=0.30, random_state=42)
#sets up models
#Range of parameters for gridsearch with decision trees
max_depth = range(2,20,2)
min_samples_split = range(2,10,2)
features = range(2, len(X_train.columns))
#set up for decision trees with gridsearch
parametersDT = {'feature_selection__k': features,
                'feature_selection__score_func': (chi2, f_classif),
                'classification__criterion': ('gini', 'entropy'),
                'classification__max_depth': max_depth,
                'classification__min_samples_split': min_samples_split}
DT_with_K_Best = Pipeline([
    ('feature_selection', SelectKBest()),
    ('classification', DecisionTreeClassifier())
])
clf_DT = GridSearchCV(DT_with_K_Best, parametersDT, cv=10, verbose=2, scoring='f1_weighted', n_jobs = -2)
clf_DT.fit(X_train,y_train)
As far as I can tell it only means that the DataFrame you're using is copied before being fed to the model.
This shouldn't affect the training results. It's only an efficiency problem, unrelated to the performance of the classifier.
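If you want to get rid of the warning, one common approach (an assumption on my part, not from the original answer) is to hand GridSearchCV plain NumPy arrays instead of the DataFrame slices, so there is no dataframe to copy when the CV folds are sliced:
# Same fit as above, but on NumPy arrays extracted from the pandas objects.
# .values is the standard pandas accessor; clf_DT, X_train and y_train are
# the objects defined in the question's code.
clf_DT.fit(X_train.values, y_train.values)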

PCA for dimensionality reduction before Random Forest

I am working on binary class random forest with approximately 4500 variables. Many of these variables are highly correlated and some of them are just quantiles of an original variable. I am not quite sure if it would be wise to apply PCA for dimensionality reduction. Would this increase the model performance?
I would like to be able to know which variables are more significant to my model, but if I use PCA, I would be only able to tell what PCs are more important.
Many thanks in advance.
My experience is that PCA before RF is not a great advantage, if any. Principal component regression (PCR) is a case where PCA helps to regularize the training features before OLS linear regression, and that is very much needed for sparse data sets. As RF itself already performs a good/fair regularization without assuming linearity, it is not necessarily an advantage. That said, I found myself writing a PCA-RF wrapper for R two weeks ago. The code includes a simulated data set of 100 features comprising only 5 true linear components. Under such circumstances it is in fact a small advantage to pre-filter with PCA.
The code is a seamless implementation, such that all RF parameters are simply passed on to RF. The loading vectors are saved in the model fit for use during prediction.
#I would like to be able to know which variables are more significant to my model, but if I use PCA, I would be only able to tell what PCs are more important.
The easy way is to run without PCA and obtain variable importances and expect to find something similar for PCA-RF.
The tedious way is to wrap the PCA-RF in a new bagging scheme with your own variable-importance code. It could be done in 50-100 lines or so.
The source-code suggestion for PCA-RF:
#wrap PCA around randomForest, forward any other arguments to randomForest
#define as new S3 model class
train_PCA_RF = function(x, y, ncomp = 5, ...) {
  f.args = as.list(match.call()[-1])
  pca_obj = princomp(x)
  rf_obj = do.call(randomForest, c(alist(x = pca_obj$scores[, 1:ncomp]), f.args[-1]))
  out = mget(ls())
  class(out) = "PCA_RF"
  return(out)
}
#print method
print.PCA_RF = function(object) print(object$rf_obj)
#predict method
predict.PCA_RF = function(object, Xtest = NULL, ...) {
  print("predicting PCA_RF")
  f.args = as.list(match.call()[-1])
  if (is.null(f.args$Xtest)) stop("cannot predict without newdata parameter")
  sXtest = predict(object$pca_obj, Xtest) #scale Xtest as Xtrain was scaled before
  return(do.call(predict, c(alist(object = object$rf_obj, #class(x)="randomForest" invokes method predict.randomForest
                                  newdata = sXtest),      #newdata input, see help(predict.randomForest)
                            f.args[-1:-2])))              #any other parameters are passed to predict.randomForest
}
#testTrain predict #
make.component.data = function(
  inter.component.variance = .9,
  n.real.components = 5,
  nVar.per.component = 20,
  nObs = 600,
  noise.factor = .2,
  hidden.function = function(x) apply(x, 1, mean),
  plot_PCA = T
) {
  Sigma = matrix(inter.component.variance,
                 ncol = nVar.per.component,
                 nrow = nVar.per.component)
  diag(Sigma) = 1
  x = do.call(cbind, replicate(n = n.real.components,
                               expr = {mvrnorm(n = nObs,
                                               mu = rep(0, nVar.per.component),
                                               Sigma = Sigma)},
                               simplify = FALSE))
  if (plot_PCA) plot(prcomp(x, center = T, scale. = T))
  y = hidden.function(x)
  ynoised = y + rnorm(nObs, sd = sd(y)) * noise.factor
  out = list(x = x, y = ynoised)
  pars = ls()[!ls() %in% c("x", "y", "Sigma")]
  attr(out, "pars") = mget(pars) #attach all pars as attributes
  return(out)
}
A run code example:
#start script------------------------------
#source above from separate script
#test
library(MASS)
library(randomForest)
Data = make.component.data(nObs=600)#plots PC variance
train = list(x=Data$x[ 1:300,],y=Data$y[1:300])
test = list(x=Data$x[301:600,],y=Data$y[301:600])
rf = randomForest (train$x, train$y,ntree =50) #regular RF
rf2 = train_PCA_RF(train$x, train$y,ntree= 50,ncomp=12)
rf
rf2
pred_rf = predict(rf ,test$x)
pred_rf2 = predict(rf2,test$x)
cat("rf, R^2:",cor(test$y,pred_rf )^2,"PCA_RF, R^2", cor(test$y,pred_rf2)^2)
cor(test$y,predict(rf ,test$x))^2
cor(test$y,predict(rf2,test$x))^2
pairs(list(trueY = test$y,
native_rf = pred_rf,
PCA_RF = pred_rf2)
)
You can have a look here to get a better idea. The link says to use PCA for smaller datasets! Some of my colleagues have used Random Forests for the same purpose when working with genomes. They had ~30000 variables and a large amount of RAM.
Another thing I found is that Random Forests use up a lot of memory, and you have 4500 variables. So maybe you could apply PCA to the individual trees.
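If you end up experimenting with this in scikit-learn rather than R, a minimal PCA-then-random-forest pipeline sketch (illustrative data and parameter values, not a tuned recommendation) would look like:
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Stand-in for a wide, correlated binary-class dataset
X, y = make_classification(n_samples=500, n_features=200, n_informative=10, random_state=0)

pca_rf = Pipeline([
    ('pca', PCA(n_components=20)),   # keep the first 20 components
    ('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
])
pca_rf.fit(X, y)
print(pca_rf.score(X, y))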