Understanding Deep Learning model accuracy - python-2.7

I need help understanding the accuracy and dataset output format for a deep learning model.
I did some deep learning training based on this site: https://machinelearningmastery.com/deep-learning-with-python2/
I worked through the examples for the Pima Indians Diabetes dataset and the Iris flower dataset. I trained on the Pima Indians Diabetes dataset using the script from this tutorial: http://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
Then I trained on the Iris flower dataset using the script below.
# import packages
import numpy
from pandas import read_csv
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from keras.callbacks import ModelCheckpoint

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load dataset
dataframe = read_csv("iris_2.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:4].astype(float)
Y = dataset[:,4]

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

# one-hot encode the integer labels
dummy_y = np_utils.to_categorical(encoded_Y)

# define base model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(4, input_dim=4, init='normal', activation='relu'))
    model.add(Dense(3, init='normal', activation='sigmoid'))
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model_json = model.to_json()
    with open("iris.json", "w") as json_file:
        json_file.write(model_json)
    model.save_weights('iris.h5')
    return model

estimator = KerasClassifier(build_fn=baseline_model, nb_epoch=1000, batch_size=6, verbose=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, dummy_y, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
Everything worked fine until I decided to try another dataset from this link: https://archive.ics.uci.edu/ml/datasets/Glass+Identification
At first I trained this new dataset using the pima-indian-diabetes script as an example, changing only the X and Y variables to this
dataset = numpy.loadtxt("glass.csv", delimiter=",")
X = dataset[:,0:10]
Y = dataset[:,10]
and also changing the layer definitions to this
model = Sequential()
model.add(Dense(10, input_dim=10, init='uniform', activation='relu'))
model.add(Dense(10, init='uniform', activation='relu'))
model.add(Dense(1, init='uniform', activation='sigmoid'))
The result produced an accuracy of 32.71%.
Then I changed the output column of this dataset, which is originally integers (1~7), to strings (a~g), and used the iris-flower example script with some modifications:
import numpy
from pandas import read_csv
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

seed = 7
numpy.random.seed(seed)

dataframe = read_csv("glass.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:10].astype(float)
Y = dataset[:,10]

encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

def create_baseline():
    model = Sequential()
    model.add(Dense(10, input_dim=10, init='normal', activation='relu'))
    model.add(Dense(1, init='normal', activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model_json = model.to_json()
    with open("glass.json", "w") as json_file:
        json_file.write(model_json)
    model.save_weights('glass.h5')
    return model

estimator = KerasClassifier(build_fn=create_baseline, nb_epoch=1000, batch_size=10, verbose=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, encoded_Y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
I did not use the 'dummy_y' variable, as I was following this tutorial: http://machinelearningmastery.com/binary-classification-tutorial-with-the-keras-deep-learning-library/
I saw that that tutorial's dataset uses letters as the output, and thought that maybe I could reuse its script to train the modified glass dataset.
This time the results became:
Baseline: 68.42% (3.03%)
From the article, 68% and 3% are the mean and standard deviation of the model accuracy.
My 1st question: when should I use integers versus letters as the output column? And is this kind of accuracy result common when we tamper with the dataset, such as changing the output from integers to strings/letters?
My 2nd question: how do I know how many neurons to put in each layer? Is it related to which backend I use when compiling the model (TensorFlow or Theano)?
Thank you in advance.

First question
It doesn't matter, as you can see here:
from sklearn.preprocessing import LabelEncoder

Y = range(10)
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
print encoded_Y

Y = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
print encoded_Y
results:
[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4 5 6 7 8 9]
Which means that your classifier sees exactly the same labels.
Second question
There is no absolutely correct answer to this question, but it certainly does not depend on your backend.
You should experiment with different numbers of neurons, numbers of layers, types of layers, and all the other network parameters in order to understand the best architecture for your problem.
With experience you will develop both a good intuition for which parameters work better for which types of problems and a good method for experimentation.
The best rule of thumb I've heard (assuming you have the dataset required to sustain such a strategy) is: "Make your network as large as you can until it overfits, add regularization until it no longer overfits, then repeat".
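As a rough illustration of that rule of thumb (a sketch only, not code from the question; the layer sizes, dropout rate and L2 factor are arbitrary, and it uses the old Keras 1 API to match the code above):
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.regularizers import l2

# deliberately over-sized network for the 4-feature iris data
model = Sequential()
model.add(Dense(64, input_dim=4, activation='relu', W_regularizer=l2(0.001)))
model.add(Dropout(0.5))   # regularization knob: increase until the model stops overfitting
model.add(Dense(64, activation='relu', W_regularizer=l2(0.001)))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
The point is not this specific architecture, but the loop: grow capacity, watch validation accuracy, then dial regularization up or down.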

Let's take it in parts. First, if your output takes values in [0, 5], it is impossible to obtain that with a sigmoid activation; the sigmoid function has a range of [0, 1]. You could use a linear activation (i.e., no activation), but I think that would be a bad approach, because your problem is not to estimate a continuous value.
Second, the question you should ask yourself is not so much about the type of data you are using (in the sense of how the information is stored). Is it a string? An int? A float? It does not matter. What you have to ask is what kind of problem you are trying to solve.
In this case, the problem should not be treated as regression (estimating a continuous value). Your outputs are numbers, but they are categorical: the type of glass (the class attribute).
When doing a classification problem, the following configuration is normally used:
- The class is one-hot encoded: nothing more than a vector of 0's with a single 1 in the position of the corresponding class. For instance, class 3 (counting from 0) with 6 classes -> [0, 0, 0, 1, 0, 0] (as many entries as you have classes); see the short sketch after this list.
- As you can see, we no longer have a single output; your model must have as many outputs as Y has classes (6 here). So the last layer should have as many neurons as classes: Dense(classes, ...).
- You also want the output to be the probability of belonging to each class, i.e. p(y = class_0), ..., p(y = class_n). For this the softmax activation is used, which ensures that the probabilities sum to 1.
- You have to change the loss to categorical_crossentropy so that it works together with the softmax, and use the categorical_accuracy metric.
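A quick sketch of what the one-hot encoding produces (the labels here are made up; this just mirrors the LabelEncoder + to_categorical combination used below):
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

labels = ['a', 'c', 'b', 'c']                  # four samples, three classes
encoder = LabelEncoder().fit(labels)
print(to_categorical(encoder.transform(labels)))
# prints roughly:
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
With that in mind, the full script for your glass dataset looks like this: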
import numpy
from pandas import read_csv
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

seed = 7
numpy.random.seed(seed)

dataframe = read_csv("glass.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:10].astype(float)
Y = dataset[:,10]

# integer-encode the labels, then one-hot encode them
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = to_categorical(encoder.transform(Y))

def create_baseline():
    model = Sequential()
    model.add(Dense(10, input_dim=10, init='normal', activation='relu'))
    # one output neuron per class, softmax so the outputs sum to 1
    model.add(Dense(encoded_Y.shape[1], init='normal', activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
    model_json = model.to_json()
    with open("glass.json", "w") as json_file:
        json_file.write(model_json)
    model.save_weights('glass.h5')
    return model

model = create_baseline()
model.fit(X, encoded_Y, epochs=1000, batch_size=100)
The number of neurons does not depend on the backend you use.
It is true, though, that you will never get exactly the same results twice. That's because there are several stochastic processes within a network: weight initialization, dropout (if you use it), batch order, etc.
What is known is that increasing the number of neurons per dense layer makes the model more complex, and therefore gives it more capacity to represent your problem, but also makes it harder to train and more expensive in both time and computation. So you always have to look for a balance.
At the moment there is no clear evidence about which is better:
- expanding the number of neurons per layer, or
- adding more layers.
Some models use one approach and some the other.
Using this architecture you get the following result:
Epoch 1000/1000
214/214 [==============================] - 0s 17us/step - loss: 0.0777 - categorical_accuracy: 0.9953
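To make those two options concrete, here are two minimal sketches (layer sizes are arbitrary, reusing the imports and encoded_Y from the script above); neither is claimed to be better for this dataset:
# wider: more neurons per layer, fewer layers
wide = Sequential()
wide.add(Dense(64, input_dim=10, init='normal', activation='relu'))
wide.add(Dense(encoded_Y.shape[1], init='normal', activation='softmax'))

# deeper: fewer neurons per layer, more layers
deep = Sequential()
deep.add(Dense(16, input_dim=10, init='normal', activation='relu'))
deep.add(Dense(16, init='normal', activation='relu'))
deep.add(Dense(16, init='normal', activation='relu'))
deep.add(Dense(encoded_Y.shape[1], init='normal', activation='softmax'))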

Related

XGboost Google-AI-Model expecting float values instead of using Categorical values and converting them

I'm trying to run a simple XGBoost prediction on Google Cloud based on this example: https://cloud.google.com/ml-engine/docs/scikit/getting-predictions-xgboost#get_online_predictions
The model builds fine, but when I try to run a prediction with a sample input JSON it fails with the error "Could not initialize DMatrix from inputs: could not convert string to float:" as shown in the screen below. I understand this is happening because the test input has strings; I was hoping the Google machine learning model would have the information needed to convert the categorical values to floats. I cannot expect my users to submit online prediction requests with float values.
Based on the tutorial it should work without converting the categorical values to floats. Please advise; I have attached a GIF with more details. Thanks.
import json
import numpy as np
import os
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

# these are the column labels from the census data files
COLUMNS = (
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income-level'
)

# categorical columns contain data that need to be turned into numerical
# values before being used by XGBoost
CATEGORICAL_COLUMNS = (
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country'
)

# load training set
with open('./census_data/adult.data', 'r') as train_data:
    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)
# remove the column we are trying to predict ('income-level') from the features list
train_features = raw_training_data.drop('income-level', axis=1)
# create training labels list
train_labels = (raw_training_data['income-level'] == ' >50K')

# load test set
with open('./census_data/adult.test', 'r') as test_data:
    raw_testing_data = pd.read_csv(test_data, names=COLUMNS, skiprows=1)
# remove the column we are trying to predict ('income-level') from the features list
test_features = raw_testing_data.drop('income-level', axis=1)
# create test labels list
test_labels = (raw_testing_data['income-level'] == ' >50K.')

# convert data in categorical columns to numerical values
encoders = {col: LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
    train_features[col] = encoders[col].fit_transform(train_features[col])
for col in CATEGORICAL_COLUMNS:
    test_features[col] = encoders[col].fit_transform(test_features[col])

# load data into DMatrix objects
dtrain = xgb.DMatrix(train_features, train_labels)
dtest = xgb.DMatrix(test_features)

# train XGBoost model
bst = xgb.train({}, dtrain, 20)
bst.save_model('./model.bst')
Here is a fix. Put the input shown in the Google documentation into a file called input.json, then run the script below. The output is input_numerical.json, and the prediction will succeed if you use that in place of input.json.
This code just preprocesses the categorical columns into numerical form, using the same procedure that was applied to the training and test data.
import json
import pandas as pd
from sklearn.preprocessing import LabelEncoder

COLUMNS = (
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "income-level",
)

# categorical columns contain data that need to be turned into numerical
# values before being used by XGBoost
CATEGORICAL_COLUMNS = (
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "native-country",
)

with open("./input.json", "r") as json_lines:
    rows = [json.loads(line) for line in json_lines]

prediction_features = pd.DataFrame(rows, columns=(COLUMNS[:-1]))

encoders = {col: LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
    prediction_features[col] = encoders[col].fit_transform(prediction_features[col])

with open("input_numerical.json", "w") as input_numerical:
    for index, row in prediction_features.iterrows():
        input_numerical.write(row.to_json(orient="values") + "\n")
I created this Google Issues Tracker ticket as the Google documentation is missing this important step.
You can use pandas to convert categorical strings into codes for model inputs. For prediction input you can define a dictionary for each categorical column mapping its values to their codes. For example, for workclass:
df['workclass_cat'] = df['workclass'].astype('category')
df['workclass_cat'] = df['workclass_cat'].cat.codes
workclass_dict = dict(zip(list(df['workclass'].values), list(df['workclass_cat'].values)))
If a prediction input is 'somestring' you can access its code as follows:
category_input = workclass_dict['somestring']
XGBoost models take floats as input. In your training script you converted the categorical variables into numbers. The same transformation needs to be done when submitting a prediction.
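As a sketch of that last point (reusing COLUMNS, CATEGORICAL_COLUMNS and the fitted encoders from the training script above; the sample values below are only illustrative):
import pandas as pd
import xgboost as xgb

# one raw prediction row, same feature order as training (label column excluded)
raw_row = {'age': 25, 'workclass': ' Private', 'fnlwgt': 226802, 'education': ' 11th',
           'education-num': 7, 'marital-status': ' Never-married',
           'occupation': ' Machine-op-inspct', 'relationship': ' Own-child',
           'race': ' Black', 'sex': ' Male', 'capital-gain': 0, 'capital-loss': 0,
           'hours-per-week': 40, 'native-country': ' United-States'}
row_df = pd.DataFrame([raw_row], columns=COLUMNS[:-1])

# apply the encoders fitted on the training data; do NOT refit them here
# (transform raises an error if a value was never seen during training)
for col in CATEGORICAL_COLUMNS:
    row_df[col] = encoders[col].transform(row_df[col])

bst = xgb.Booster()
bst.load_model('./model.bst')
print(bst.predict(xgb.DMatrix(row_df)))
The key design point is that the encoders are fitted once, on the training data, and only transform is called at prediction time.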

DataConversionWarning with GridSearchCV in sklearn

I'm getting the following warning repeatedly when using GridSearchCV in scikit-learn:
"DataConversionWarning: Copying input dataframe for slicing."
I tried running some of the models separately outside of GridSearchCV and didn't get any warnings. The warning also didn't prevent GridSearchCV from finding a model.
I have 2 questions:
1) What does this warning mean?
2) What are the implications for my output, if any?
The relevant parts of the code are below:
df = pd.read_csv(os.path.join(filepath, "Modeling_Set.csv")) #loads main data
keep_vars = pd.read_csv(os.path.join(filepath, "keep_vars.csv")) #loads a list of variables to keep from a CSV list

model_vars = keep_vars[keep_vars['keep']==1]['name'] #creates a list of vars to keep
modeling_df = df[model_vars] #creates the df with only keep vars
model_feature_vars = model_vars[:-1]

#Splits test and train data
X_train, X_test, y_train, y_test = train_test_split(modeling_df[model_feature_vars], modeling_df['Segment'], test_size=0.30, random_state=42)

#sets up models
#Range of parameters for gridsearch with decision trees
max_depth = range(2,20,2)
min_samples_split = range(2,10,2)
features = range(2, len(X_train.columns))

#set up for decision trees with gridsearch
parametersDT = {'feature_selection__k': features,
                'feature_selection__score_func': (chi2, f_classif),
                'classification__criterion': ('gini', 'entropy'),
                'classification__max_depth': max_depth,
                'classification__min_samples_split': min_samples_split}

DT_with_K_Best = Pipeline([
    ('feature_selection', SelectKBest()),
    ('classification', DecisionTreeClassifier())
])

clf_DT = GridSearchCV(DT_with_K_Best, parametersDT, cv=10, verbose=2, scoring='f1_weighted', n_jobs=-2)
clf_DT.fit(X_train, y_train)
As far as I can tell, it only means that the DataFrame you're passing is copied before being fed to the model.
This shouldn't affect the training results. It's only an efficiency concern, unrelated to the performance of the classifier.
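If you just want the warning gone, two options (a sketch; neither should change the fitted model):
import warnings

# 1) silence that specific warning by its message
warnings.filterwarnings("ignore", message="Copying input dataframe for slicing")

# 2) or avoid the DataFrame copy altogether by passing plain NumPy arrays
clf_DT.fit(X_train.values, y_train.values)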

Loading files to perform Kmean using sklearn

I have 100 files that contain system call traces. Each file looks like the example below:
setpgrp ioctl setpgrp ioctl ioctl ....
I am trying to load these files and run k-means clustering on them to group them by similarity. Based on a tutorial on the scikit-learn web page, I have written the following:
import sys
from optparse import OptionParser
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics
from sklearn.datasets import load_files
from sklearn.cluster import KMeans, MiniBatchKMeans
import numpy as np

# parse commandline arguments
op = OptionParser()
op.add_option("--lsa",
              dest="n_components", type="int",
              help="Preprocess documents with latent semantic analysis.")
op.add_option("--no-minibatch",
              action="store_false", dest="minibatch", default=True,
              help="Use ordinary k-means algorithm (in batch mode).")
op.add_option("--use-idf",
              action="store_false", dest="use_idf", default=True,
              help="Disable Inverse Document Frequency feature weighting.")
op.add_option("--n-features", type=int, default=10000,
              help="Maximum number of features (dimensions)"
                   " to extract from text.")
op.add_option("--verbose",
              action="store_true", dest="verbose", default=False,
              help="Print progress reports inside k-means algorithm.")

print(__doc__)
op.print_help()

(opts, args) = op.parse_args()
if len(args) > 0:
    op.error("this script takes no arguments.")
    sys.exit(1)

print("Loading training data:")
trainingdata = load_files('C:\data\Training data')
print("%d documents" % len(trainingdata.data))
print()

print("Extracting features from the training trainingdata using a sparse vectorizer")
if opts.use_idf:
    vectorizer = TfidfVectorizer(input="file", min_df=1)
X = vectorizer.fit_transform(trainingdata.data)
print("n_samples: %d, n_features: %d" % X.shape)
print()

if opts.n_components:
    print("Performing dimensionality reduction using LSA")
    # Vectorizer results are normalized, which makes KMeans behave as
    # spherical k-means for better results. Since LSA/SVD results are
    # not normalized, we have to redo the normalization.
    svd = TruncatedSVD(opts.n_components)
    lsa = make_pipeline(svd, Normalizer(copy=False))
    X = lsa.fit_transform(X)
    explained_variance = svd.explained_variance_ratio_.sum()
    print("Explained variance of the SVD step: {}%".format(
        int(explained_variance * 100)))
    print()
However, it seems that none of the files in the dataset directory get loaded into memory, even though all files are available. I get the following error when executing the program:
raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
Can anyone tell me why the dataset is not being loaded? What am I doing wrong?
I finally managed to load the files. The approach for using k-means in scikit-learn is to vectorize the training data (using TfidfVectorizer or CountVectorizer), then transform your test data with the vectorizer fitted on the training data. Once that is done you can set the k-means parameters, fit the clusters on the training-data vectors, and finally relate your test data to the training-data centroids.
The following code does what is explained above.
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

#Read the data in a directory:
def readfile(dataDir):
    data_set = []
    for file in os.listdir(dataDir):
        trainingfiles = os.path.join(dataDir, file)
        if os.path.isfile(trainingfiles):
            with open(trainingfiles, 'r') as data:
                data_set.append(data.read())
    return data_set

#fit the tf-idf transform on the training data
tfidf_vectorizer = TfidfVectorizer(min_df=1)
tfidf_vectorizer_trainingset = tfidf_vectorizer.fit_transform(readfile(trainingdataDir)).toarray()

#transform the test set based on the training set
tfidf_vectorizer_testset = tfidf_vectorizer.transform(readfile(testingdataDir)).toarray()

#k-means clustering parameters (number_of_clusters must be set beforehand)
kmean_parameters = KMeans(n_clusters=number_of_clusters, init='k-means++', max_iter=100, n_init=1)

#cluster the training data based on the parameters
KmeanAnalysis_training = kmean_parameters.fit(tfidf_vectorizer_trainingset)

#transform the test data based on the clustering of the training data
KmeanAnalysis_test = kmean_parameters.transform(tfidf_vectorizer_testset)
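If what you actually want is the cluster index each test trace is assigned to (rather than the per-centroid distances that transform returns), a small addition, reusing the variables above:
#cluster index assigned to each training and test document
training_clusters = kmean_parameters.predict(tfidf_vectorizer_trainingset)
test_clusters = kmean_parameters.predict(tfidf_vectorizer_testset)
print(test_clusters)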

Scikit-Learn One-hot-encode before or after train/test split

I am looking at two scenarios for building a model using scikit-learn, and I cannot figure out why one of them returns a result so fundamentally different from the other. The only difference between the two cases (that I know of) is that in one case I one-hot encode the categorical variables all at once (on the whole data) and then split into training and test. In the second case I split into training and test first and then one-hot encode both sets based on the training data.
The latter case is technically better for judging the generalization error of the process, but it returns a normalized Gini that is dramatically different (and bad: essentially no model) compared with the first case. I know the first case's Gini (~0.33) is in line with a model built on this data.
Why is the second case returning such a different Gini? FYI, the data set contains a mix of numeric and categorical variables.
Method 1 (one-hot encode the entire data and then split). This returns: Validation Sample Score: 0.3454355044 (normalized gini).
from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit, train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

def gini(solution, submission):
    df = zip(solution, submission, range(len(solution)))
    df = sorted(df, key=lambda x: (x[1], -x[2]), reverse=True)
    rand = [float(i+1)/float(len(df)) for i in range(len(df))]
    totalPos = float(sum([x[0] for x in df]))
    cumPosFound = [df[0][0]]
    for i in range(1, len(df)):
        cumPosFound.append(cumPosFound[len(cumPosFound)-1] + df[i][0])
    Lorentz = [float(x)/totalPos for x in cumPosFound]
    Gini = [Lorentz[i]-rand[i] for i in range(len(df))]
    return sum(Gini)

def normalized_gini(solution, submission):
    normalized_gini = gini(solution, submission)/gini(solution, solution)
    return normalized_gini

# Normalized Gini Scorer
gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better=True)

if __name__ == '__main__':
    dat = pd.read_table('/home/jma/Desktop/Data/Kaggle/liberty/train.csv', sep=",")
    y = dat[['Hazard']].values.ravel()
    dat = dat.drop(['Hazard', 'Id'], axis=1)
    folds = train_test_split(range(len(y)), test_size=0.30, random_state=15)  # 30% test

    # First one-hot encode and make a pandas df
    dat_dict = dat.T.to_dict().values()
    vectorizer = DV(sparse=False)
    vectorizer.fit(dat_dict)
    dat = vectorizer.transform(dat_dict)
    dat = pd.DataFrame(dat)

    train_X = dat.iloc[folds[0], :]
    train_y = y[folds[0]]
    test_X = dat.iloc[folds[1], :]
    test_y = y[folds[1]]

    rf = RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
    rf.fit(train_X, train_y)
    y_submission = rf.predict(test_X)
    print("Validation Sample Score: {:.10f} (normalized gini).".format(normalized_gini(test_y, y_submission)))
Method 2 (first split and then one-hot encode). This returns: Validation Sample Score: 0.0055124452 (normalized gini).
from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit, train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

def gini(solution, submission):
    df = zip(solution, submission, range(len(solution)))
    df = sorted(df, key=lambda x: (x[1], -x[2]), reverse=True)
    rand = [float(i+1)/float(len(df)) for i in range(len(df))]
    totalPos = float(sum([x[0] for x in df]))
    cumPosFound = [df[0][0]]
    for i in range(1, len(df)):
        cumPosFound.append(cumPosFound[len(cumPosFound)-1] + df[i][0])
    Lorentz = [float(x)/totalPos for x in cumPosFound]
    Gini = [Lorentz[i]-rand[i] for i in range(len(df))]
    return sum(Gini)

def normalized_gini(solution, submission):
    normalized_gini = gini(solution, submission)/gini(solution, solution)
    return normalized_gini

# Normalized Gini Scorer
gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better=True)

if __name__ == '__main__':
    dat = pd.read_table('/home/jma/Desktop/Data/Kaggle/liberty/train.csv', sep=",")
    y = dat[['Hazard']].values.ravel()
    dat = dat.drop(['Hazard', 'Id'], axis=1)
    folds = train_test_split(range(len(y)), test_size=0.3, random_state=15)  # 30% test

    # first split
    train_X = dat.iloc[folds[0], :]
    train_y = y[folds[0]]
    test_X = dat.iloc[folds[1], :]
    test_y = y[folds[1]]

    # One-hot encode the training X and transform the test X
    dat_dict = train_X.T.to_dict().values()
    vectorizer = DV(sparse=False)
    vectorizer.fit(dat_dict)
    train_X = vectorizer.transform(dat_dict)
    train_X = pd.DataFrame(train_X)

    dat_dict = test_X.T.to_dict().values()
    test_X = vectorizer.transform(dat_dict)
    test_X = pd.DataFrame(test_X)

    rf = RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
    rf.fit(train_X, train_y)
    y_submission = rf.predict(test_X)
    print("Validation Sample Score: {:.10f} (normalized gini).".format(normalized_gini(test_y, y_submission)))
While the previous comments correctly suggest that it is best to map over your entire feature space first, in your case both Train and Test contain all of the feature values in all of the columns.
If you compare vectorizer.vocabulary_ between the two versions, they are exactly the same, so there is no difference in mapping. Hence, that cannot be causing the problem.
The reason Method 2 fails is that your dat_dict gets re-sorted by the original index when you execute this command:
dat_dict=train_X.T.to_dict().values()
In other words, train_X has a shuffled index going into this line of code. When you turn it into a dict, the dict order re-sorts into the numerical order of the original index. This causes your Train and Test data to become completely de-correlated with y.
Method 1 doesn't suffer from this problem, because you shuffle the data after the mapping.
You can fix the issue by adding a .reset_index() both times you assign dat_dict in Method 2, e.g.,
dat_dict=train_X.reset_index(drop=True).T.to_dict().values()
This ensures the data order is preserved when converting to a dict.
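A minimal sketch of the re-ordering (Python 2 era behaviour, where plain dicts do not preserve insertion order; the column name 'feat' is just for illustration):
import pandas as pd

# three rows whose index is shuffled, like train_X after train_test_split
df = pd.DataFrame({'feat': ['row2', 'row0', 'row1']}, index=[2, 0, 1])

print(list(df.T.to_dict().values()))
# the dict keys (2, 0, 1) iterate in numeric order on Python 2,
# so the rows come back sorted by the original index, not in row order

print(list(df.reset_index(drop=True).T.to_dict().values()))
# reset_index relabels the rows 0, 1, 2 first, so the row order is preserved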
When I add that bit of code, I get the following results:
- Method 1: Validation Sample Score: 0.3454355044 (normalized gini)
- Method 2: Validation Sample Score: 0.3438430991 (normalized gini)
I can't get your code to run, but my guess is that in the test dataset you're not seeing all the levels of some of the categorical variables, so if you calculate your dummy variables on that data alone, you'll actually end up with different columns.
Otherwise, maybe you have the same columns but in a different order?

scikit-learn PCA doesn't have 'score' method

I am trying to identify the type of noise based on this article:
Model selection with Probabilistic PCA and Factor Analysis (FA)
I am using scikit-learn-0.14.1.win32-py2.7 on Windows 8 64-bit.
I know that the article refers to version 0.15; however, the version 0.14 documentation mentions that the score method is available for PCA, so I guess it should normally work:
sklearn.decomposition.ProbabilisticPCA
The problem is that no matter which PCA I use with cross_val_score, I always get a TypeError saying that the estimator PCA does not have a score method:
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator PCA(copy=True, n_components=None, whiten=False) does not.
Any ideas why that is happening?
Many thanks in advance,
Christos
X has 1000 samples of 40 features.
Here is a portion of the code:
import numpy as np
import csv
from scipy import linalg
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.cross_validation import cross_val_score
from sklearn.grid_search import GridSearchCV
from sklearn.covariance import ShrunkCovariance, LedoitWolf

#read in the training data
train_path = '<train data path>/train.csv'
reader = csv.reader(open(train_path, "rb"), delimiter=',')
train = list(reader)
X = np.array(train).astype('float')

n_samples = 1000
n_features = 40
n_components = np.arange(0, n_features, 4)

def compute_scores(X):
    pca = PCA()
    pca_scores = []
    for n in n_components:
        pca.n_components = n
        pca_scores.append(np.mean(cross_val_score(pca, X, n_jobs=1)))
    return pca_scores

pca_scores = compute_scores(X)
n_components_pca = n_components[np.argmax(pca_scores)]
OK, I think I found the problem. It is not working with PCA, but it does work with ProbabilisticPCA.
However, by not providing a cv number, cross_val_score automatically uses 3-fold cross-validation,
which created 3 folds with sizes 334, 333 and 333 (my initial training set contains 1000 samples).
Since numpy.mean cannot make a comparison between sets of different sizes (334 vs 333), Python raises an exception.
Thanks.
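A minimal sketch of that workaround (assuming the scikit-learn 0.14 era API, where ProbabilisticPCA provides the score method that plain PCA lacks; cv=5 gives five equal folds of 200 samples for a 1000-sample set):
import numpy as np
from sklearn.decomposition import ProbabilisticPCA
from sklearn.cross_validation import cross_val_score

def compute_ppca_scores(X, n_components_list, cv=5):
    # cross-validated log-likelihood score for each candidate number of components
    scores = []
    for n in n_components_list:
        ppca = ProbabilisticPCA(n_components=n)
        scores.append(np.mean(cross_val_score(ppca, X, cv=cv)))
    return scores

# X and n_components as defined in the question above
ppca_scores = compute_ppca_scores(X, n_components)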