Keras loss is nan when inputting data from csv with numpy - python-2.7

I'm trying to use TensorFlow's Boston housing price example to learn how to use TensorFlow/Keras for regression, but I keep running into a problem with my own data, even when I make changes as small as possible. After giving up on writing everything myself, I simply changed the two lines of the code that input the data:
boston_housing = keras.datasets.boston_housing
(train_data, train_labels), (test_data, test_labels) = boston_housing.load_data()
to something that, after looking online, should also create numpy arrays from my csv:
np_array = genfromtxt('trainingdata.csv', delimiter=',')
np_array = np.delete(np_array, (0), axis=0) # Remove header
test_np_array = np_array[:800,:]
tr_np_array = np_array[800:,:] # Separate out test and train data
train_labels = tr_np_array[:, 20] # Get the last column for the labels
test_labels = test_np_array[:, 20]
train_data = np.delete(tr_np_array, (20), axis=1)
test_data = np.delete(test_np_array, (20), axis=1) # Remove the last column so the data is only the features
Everything I can check seems right: the shapes of the arrays are all correct, the arrays do seem to be proper numpy arrays, the features do seem to get normalized, etc., and yet when I set verbose to 1 on model.fit(...), the very first lines of output show a problem with the loss:
Epoch 1/500
32/2560 [..............................] - ETA: 18s - loss: nan - mean_absolute_error: nan
2016/2560 [======================>.......] - ETA: 0s - loss: nan - mean_absolute_error: nan
2560/2560 [==============================] - 0s 133us/step - loss: nan - mean_absolute_error: nan - val_loss: nan - val_mean_absolute_error: nan
I'm especially confused because everywhere else on Stack Overflow where I've seen the "TensorFlow loss is NaN" error, it has generally (a) been with a custom loss function, and (b) appeared only once the model had trained for a while, not (as here) within the first 52 passes. Where that's not the case, it's because the data wasn't normalized, but I do normalize later in the code, and the normalization works for the housing price example and prints out numbers clustered around 0. At this point, my best guess is that it's a problem with the genfromtxt command, but if anyone can see what I'm doing wrong or where I might find my issue, I'd be incredibly appreciative.
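One way to check that guess is to look for nans directly in the loaded array, since genfromtxt silently fills any cell it cannot parse with nan, and a single nan in the features or labels is enough to make the loss nan from the very first batch (a minimal diagnostic sketch, assuming the same 'trainingdata.csv' and column layout as above):
import numpy as np
from numpy import genfromtxt
np_array = genfromtxt('trainingdata.csv', delimiter=',')
np_array = np.delete(np_array, (0), axis=0)  # drop the header row, as above
print(np.isnan(np_array).any())         # True means at least one cell could not be parsed
print(np.argwhere(np.isnan(np_array)))  # row/column indices of the offending cells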
Edit:
Here is the full code for the program. Commenting out lines 13 through 26 and uncommenting lines 10 and 11 makes the program work perfectly. Commenting out lines 13 and 14 and uncommenting 16 and 17 was my attempt at using pandas, but that led to the same errors.
import tensorflow as tf
from tensorflow import keras
import numpy as np
from numpy import genfromtxt
import pandas as pd
print(tf.__version__)
# boston_housing = keras.datasets.boston_housing # Line 10
# (train_data, train_labels), (test_data, test_labels) = boston_housing.load_data()
np_array = genfromtxt('trainingdata.csv', delimiter=',') # Line 13
np_array = np.delete(np_array, (0), axis=0)
# df = pd.read_csv('trainingdata.csv') # Line 16
# np_array = df.get_values()
test_np_array = np_array[:800,:]
tr_np_array = np_array[800:,:]
train_labels = tr_np_array[:, 20]
test_labels = test_np_array[:, 20]
train_data = np.delete(tr_np_array, (20), axis=1)
test_data = np.delete(test_np_array, (20), axis=1) # Line 26
order = np.argsort(np.random.random(train_labels.shape))
train_data = train_data[order]
train_labels = train_labels[order]
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)
train_data = (train_data - mean) / std
test_data = (test_data - mean) / std
labels_mean = train_labels.mean(axis=0)
labels_std = train_labels.std(axis=0)
train_labels = (train_labels - labels_mean) / labels_std
test_labels = (test_labels - labels_mean) / labels_std
def build_model():
    model = keras.Sequential([
        keras.layers.Dense(64, activation=tf.nn.relu,
                           input_shape=(train_data.shape[1],)),
        keras.layers.Dense(64, activation=tf.nn.relu),
        keras.layers.Dense(1)
    ])
    optimizer = tf.train.RMSPropOptimizer(0.001)
    model.compile(loss='mse',
                  optimizer=optimizer,
                  metrics=['mae'])
    return model
model = build_model()
model.summary()
EPOCHS = 500
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
history = model.fit(train_data, train_labels, epochs=EPOCHS,
                    validation_split=0.2, verbose=1,
                    callbacks=[early_stop])
[loss, mae] = model.evaluate(test_data, test_labels, verbose=0)
print("Testing set Mean Abs Error: ${:7.2f}".format(mae * 1000 * labels_std))

Related

Understanding Deep Learning model accuracy

I need help understanding the accuracy and dataset output format for a deep learning model.
I did some deep learning training based on this site: https://machinelearningmastery.com/deep-learning-with-python2/
I did the examples for the pima-indians-diabetes dataset and the iris flower dataset. I trained a model on the pima-indians-diabetes dataset using the script from here: http://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
Then I trained a model on the iris flower dataset using the script below.
# import package
import numpy
from pandas import read_csv
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from keras.callbacks import ModelCheckpoint
# fix random seed for reproductibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataframe = read_csv("iris_2.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:4].astype(float)
Y = dataset[:,4]
# encode class value as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
### one-hot encoder ###
dummy_y = np_utils.to_categorical(encoded_Y)
# define base model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(4, input_dim=4, init='normal', activation='relu'))
    model.add(Dense(3, init='normal', activation='sigmoid'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model_json = model.to_json()
    with open("iris.json", "w") as json_file:
        json_file.write(model_json)
    model.save_weights('iris.h5')
    return model
estimator = KerasClassifier(build_fn=baseline_model, nb_epoch=1000, batch_size=6, verbose=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, dummy_y, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
Everything worked fine until I decided to try another dataset, from this link: https://archive.ics.uci.edu/ml/datasets/Glass+Identification
At first I trained on this new dataset using the pima-indians-diabetes script as an example and changed the values of the X and Y variables to this
dataset = numpy.loadtxt("glass.csv", delimiter=",")
X = dataset[:,0:10]
Y = dataset[:,10]
and also changed the neuron layers to this
model = Sequential()
model.add(Dense(10, input_dim=10, init='uniform', activation='relu'))
model.add(Dense(10, init='uniform', activation='relu'))
model.add(Dense(1, init='uniform', activation='sigmoid'))
The result was an accuracy of 32.71%.
Then I changed the output column of this dataset, which is originally integers (1~7), to strings (a~g) and used the example script for the iris flower dataset with some modifications
import numpy
from pandas import read_csv
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
seed = 7
numpy.random.seed(seed)
dataframe = read_csv("glass.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:10].astype(float)
Y = dataset[:,10]
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
def create_baseline():
    model = Sequential()
    model.add(Dense(10, input_dim=10, init='normal', activation='relu'))
    model.add(Dense(1, init='normal', activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model_json = model.to_json()
    with open("glass.json", "w") as json_file:
        json_file.write(model_json)
    model.save_weights('glass.h5')
    return model
estimator = KerasClassifier(build_fn=create_baseline, nb_epoch=1000, batch_size=10, verbose=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, encoded_Y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
I did not use the 'dummy_y' variable, following this tutorial: http://machinelearningmastery.com/binary-classification-tutorial-with-the-keras-deep-learning-library/
I saw that that dataset uses letters as the output and thought that maybe I could reuse the script to train the new glass dataset that I modified.
This time the results became:
Baseline: 68.42% (3.03%)
According to the article, the 68% and 3% are the mean and standard deviation of the model accuracy.
My first question is: when do I use integers versus letters as the output column? And is this kind of accuracy result common when we tamper with the dataset, e.g. by changing the output from integers to strings/letters?
My second question is: how do I know how many neurons I have to put in each layer? Is it related to which backend I use when compiling the model (TensorFlow or Theano)?
Thank you in advance.
First question
It doesn't matter, as you can see here:
Y = range(10)
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
print encoded_Y
Y = ['a', 'b', 'c', 'd', 'e', 'f','g','h','i','j']
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
print encoded_Y
results:
[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4 5 6 7 8 9]
This means your classifier sees exactly the same labels.
Second question
There is no absolutely correct answer to this question, but it certainly does not depend on your backend.
You should experiment with different numbers of neurons, numbers of layers, types of layers, and all the other network parameters in order to find the best architecture for your problem.
With experience you will develop both a good intuition for which parameters work better for which types of problems and a good method for the experimentation.
The best rule of thumb I've heard (assuming you have the dataset required to sustain such a strategy) is: "Make your network as large as you can until it overfits, add regularization until it no longer overfits, repeat."
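A minimal Keras sketch of that rule of thumb (the layer sizes, dropout rate, and class count below are placeholders, not tuned values):
from keras.models import Sequential
from keras.layers import Dense, Dropout

num_features = 10   # matches the glass inputs above
num_classes = 6     # placeholder: however many classes your one-hot labels have

# Start deliberately large...
model = Sequential()
model.add(Dense(128, input_dim=num_features, activation='relu'))
model.add(Dropout(0.5))                     # ...then add regularization (dropout here)
model.add(Dense(128, activation='relu'))    #    once the validation loss starts rising
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])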
Taking it in parts. First, if your output includes values in [0, 5], it is impossible to obtain them with a sigmoid activation: the sigmoid function has a range of [0, 1]. You could use a linear activation (i.e. no activation), but I think that would be a bad approach, because your problem is not to estimate a continuous value.
Second, the question you should ask yourself is not so much about the type of data you are using (in the sense of how the information is stored). Is it a string? Is it an int? Is it a float? It does not matter; what you have to ask is what kind of problem you are trying to solve.
In this case, the problem should not be treated as a regression (estimating a continuous value). Your outputs are numbers, but they are categorical: really, you want to classify the type of glass (the class attribute).
When doing a classification problem, the following configuration is "normally" used:
The class is encoded with one-hot encoding. It is nothing more than a vector of 0's with a single 1 in the position of the corresponding class. For instance, class 3 (counting from 0) with 6 classes becomes [0, 0, 0, 1, 0, 0] (as many entries as classes you have).
As you see, you no longer have a single output; your model must have as many outputs as your Y has classes (6 here). So the last layer should have as many neurons as classes: Dense(classes, ...).
You are also interested in the output being the probability of belonging to each class, that is: p(y = class_0), ..., p(y = class_n). For this the softmax activation is used, which ensures that the sum of all the probabilities is 1.
You have to change the loss to categorical_crossentropy so that it works together with softmax, and use the categorical_accuracy metric.
seed = 7
numpy.random.seed(seed)
dataframe = read_csv("glass.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:10].astype(float)
Y = dataset[:,10]
encoder = LabelEncoder()
encoder.fit(Y)
from keras.utils import to_categorical
encoded_Y = to_categorical(encoder.transform(Y))
def create_baseline():
    model = Sequential()
    model.add(Dense(10, input_dim=10, init='normal', activation='relu'))
    model.add(Dense(encoded_Y.shape[1], init='normal', activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
    model_json = model.to_json()
    with open("glass.json", "w") as json_file:
        json_file.write(model_json)
    model.save_weights('glass.h5')
    return model
model = create_baseline()
model.fit(X, encoded_Y, epochs=1000, batch_size=100)
The number of neurons does not depend on the backend you use.
It is true, though, that you will never get exactly the same results; that is because there are several stochastic processes within a network: initialization, dropout (if you use it), batch order, etc.
What is known is that increasing the number of neurons per dense layer makes the model more complex, and therefore gives it more potential to represent your problem, but also makes it harder to learn and more expensive in both time and computation. So you always have to look for a balance.
At the moment there is no clear evidence on which is better:
expanding the number of neurons per layer, or
adding more layers.
There are models that use one architecture and others the other.
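To illustrate the two options, a sketch only (the sizes are arbitrary; input_dim=10 and the 6 output classes just mirror the glass example above):
from keras.models import Sequential
from keras.layers import Dense

# Option 1: fewer, wider layers
wide = Sequential()
wide.add(Dense(100, input_dim=10, activation='relu'))
wide.add(Dense(6, activation='softmax'))

# Option 2: more, narrower layers
deep = Sequential()
deep.add(Dense(32, input_dim=10, activation='relu'))
deep.add(Dense(32, activation='relu'))
deep.add(Dense(32, activation='relu'))
deep.add(Dense(6, activation='softmax'))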
Using the softmax architecture defined above, you get the following result:
Epoch 1000/1000
214/214 [==============================] - 0s 17us/step - loss: 0.0777 - categorical_accuracy: 0.9953

Reading Time Series from netCDF with python

I'm trying to create time series from a netCDF file (accessed via a Thredds server) with python. The code I use seems correct, but the values of the variable I am reading are 'masked'. I'm new to python and I'm not familiar with these formats. Any idea how I can read the data?
This is the code I use:
import netCDF4
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
from datetime import datetime, timedelta #
dayFile = datetime.now() - timedelta(days=1)
dayFile = dayFile.strftime("%Y%m%d")
url='http://nomads.ncep.noaa.gov:9090/dods/nam/nam%s/nam1hr_00z' %(dayFile)
# NetCDF4-Python can open OPeNDAP dataset just like a local NetCDF file
nc = netCDF4.Dataset(url)
varsInFile = nc.variables.keys()
lat = nc.variables['lat'][:]
lon = nc.variables['lon'][:]
time_var = nc.variables['time']
dtime = netCDF4.num2date(time_var[:],time_var.units)
first = netCDF4.num2date(time_var[0],time_var.units)
last = netCDF4.num2date(time_var[-1],time_var.units)
print first.strftime('%Y-%b-%d %H:%M')
print last.strftime('%Y-%b-%d %H:%M')
# determine what longitude convention is being used
print lon.min(),lon.max()
# Specify desired station time series location
# note we add 360 because of the lon convention in this dataset
#lati = 36.605; loni = -121.85899 + 360. # west of Pacific Grove, CA
lati = 41.4; loni = -100.8 +360.0 # Georges Bank
# Function to find index to nearest point
def near(array,value):
    idx = (abs(array-value)).argmin()
    return idx
# Find nearest point to desired location (no interpolation)
ix = near(lon, loni)
iy = near(lat, lati)
print ix,iy
# Extract desired times.
# 1. Select -+some days around the current time:
start = netCDF4.num2date(time_var[0],time_var.units)
stop = netCDF4.num2date(time_var[-1],time_var.units)
time_var = nc.variables['time']
datetime = netCDF4.num2date(time_var[:],time_var.units)
istart = netCDF4.date2index(start,time_var,select='nearest')
istop = netCDF4.date2index(stop,time_var,select='nearest')
print istart,istop
# Get all time records of variable [vname] at indices [iy,ix]
vname = 'dswrfsfc'
var = nc.variables[vname]
hs = var[istart:istop,iy,ix]
tim = dtime[istart:istop]
# Create Pandas time series object
ts = pd.Series(hs,index=tim,name=vname)
The var data is not read as I expected, apparently because the data is masked:
>>> hs
masked_array(data = [-- -- -- ..., -- -- --],
mask = [ True True True ..., True True True],
fill_value = 9.999e+20)
The var name and the time series are correct, as well as the rest of the script. The only thing that doesn't work is the var data retrieved. This is the time series I get:
>>> ts
2016-10-25 00:00:00.000000 NaN
2016-10-25 01:00:00.000000 NaN
2016-10-25 02:00:00.000006 NaN
2016-10-25 03:00:00.000000 NaN
2016-10-25 04:00:00.000000 NaN
... ... ... ... ...
2016-10-26 10:00:00.000000 NaN
2016-10-26 11:00:00.000006 NaN
Name: dswrfsfc, dtype: float32
Any help will be appreciated!
Hmm, this code looks familiar. ;-)
You are getting NaNs because the NAM model you are trying to access now uses longitude in the range [-180, 180] instead of the range [0, 360]. So if you request loni = -100.8 instead of loni = -100.8 +360.0, I believe your code will return non-NaN values.
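In terms of the original script, the change would look something like the sketch below (not tested against the current server; the values printed by lon.min(), lon.max() are what tell you which convention the file uses):
# Sketch: request the longitude in whatever convention the file actually uses
lati = 41.4
loni = -100.8                 # no +360 offset for a [-180, 180] grid
if lon.max() > 180.0:         # but keep it if the grid is still [0, 360]
    loni = loni % 360.0
ix = near(lon, loni)
iy = near(lat, lati)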
It's worth noting, however, that the task of extracting time series from multidimensional gridded data is now much easier with xarray, because you can simply select a dataset closest to a lon,lat point and then plot any variable. The data only gets loaded when you need it, not when you extract the dataset object. So basically you now only need:
import xarray as xr
ds = xr.open_dataset(url) # NetCDF or OPeNDAP URL
lati = 41.4; loni = -100.8 # Georges Bank
# Extract a dataset closest to specified point
dsloc = ds.sel(lon=loni, lat=lati, method='nearest')
# select a variable to plot
dsloc['dswrfsfc'].plot()
Full notebook here: http://nbviewer.jupyter.org/gist/rsignell-usgs/d55b37c6253f27c53ef0731b610b81b4
I checked your approach with xarray. It works great for extracting solar radiation data! I can add that the first point is not defined (NaN) because the model starts calculating there, so there is no accumulated radiation data yet (needed to calculate hourly global radiation). That is why it is masked.
Something everyone overlooked is that the output is not correct. It does look OK (sunshine at noon, 0 and dark at midnight), but the day length is not correct! I checked it for 52 degrees north latitude and 5.6 degrees east longitude (in November) and the day length is at least 2 hours too long! (The NOAA Panoply viewer for netCDF databases gives similar results.)

Plotting error bars from 2 axes

I'm looking to plot the standard deviation of some array data I've been looking at in Python; however, the data is averaged over both longitude and latitude (axes 2 and 3 of my arrays).
What I have so far is a monthly plot that looks like this ("Monthly plot" image), but I can't get the standard deviations to work.
I was just wondering if anyone knew how to get around this problem. Here's the code I've used so far.
Any help is much appreciated!
# import things
import matplotlib.pyplot as plt
from matplotlib import rcParams
import numpy as np
import netCDF4
# f is the already-opened netCDF4.Dataset (not shown in the question)
# [ date, hour, 0, lon, lat ]
temp = (f.variables['TEMP2'][:, 14:24, 0, :, :]) # temp at 2m
temp2 = (f.variables['TEMP2'][:, 0:14, 0, :, :])
# concatenate back to 24 hour period
tercon = np.concatenate((temp, temp2), axis=1)
ter1 = tercon.mean(axis=(2, 3))
rtemp = np.reshape(ter1, 672)-273
# X axis dates instead of times
date = np.arange(rtemp.shape[0]) # assume that delta time between data is 1
date21 = (date/24.) # use days instead of hours
# change plot size for monthly
rcParams['figure.figsize'] = 15, 5
plt.plot(date21, rtemp , linestyle='-', linewidth=3.0, c='orange')
You should use errorbar instead of plot and pass the precalculated standard deviations. The following adapted example uses random data to emulate your temperature data at hourly resolution and computes the daily mean and standard deviation.
# import things
import matplotlib.pyplot as plt
import numpy as np
# x-axis: day-of-month
date21 = np.arange(1, 31)
# generate random "hourly" data
hourly_temp = np.random.random(30*24)*10 + 20
# mean "temperature"
dayly_mean_temp = hourly_temp.reshape(24,30).mean(axis=0)
# standard deviation per day
dayly_std_temp = hourly_temp.reshape(24,30).std(axis=0)
# create a figure
figure = plt.figure(figsize = (15, 5))
#add an axes to the figure
ax = figure.add_subplot(111)
ax.grid()
ax.errorbar(date21, dayly_mean_temp , yerr=dayly_std_temp, fmt="--o", capsize=15, capthick=3, linestyle='-', linewidth=3.0, c='orange')
plt.show()
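Applied to the arrays in the question, the idea would be roughly the following (a sketch, assuming tercon has the (days, 24, lon, lat) shape built above):
# Spread over the spatial dimensions (axes 2 and 3 of tercon),
# shown as error bars around the spatial mean at each time step
ter_mean = tercon.mean(axis=(2, 3)).reshape(672) - 273
ter_std = tercon.std(axis=(2, 3)).reshape(672)
plt.errorbar(date21, ter_mean, yerr=ter_std, linestyle='-', linewidth=3.0, c='orange')
plt.show()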

Evaluating the predictive accuracy of the NB model

What am I doing wrong when using scikit-learn together with nltk to check the accuracy of the naive Bayes classifier?
...readFile definition not needed
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
#divide the data into training and testing sets
data = readFile('Data_test/')
training_set = list_nltk[:2000000]
testing_set = list_nltk[2000000:]
#applied Bag of words as a way to select and extract feature
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(training_set.split('\n'))
#apply tfd
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
#Train the data
clf = MultinomialNB().fit(X_train_tf, training_set.split('\n'))
#now test the accuracy of the naive bayes classifier
test_data_features = count_vect.transform(testing_set)
X_new_tfidf = tf_transformer.transform(test_data_features)
predicted = clf.predict(X_new_tfidf)
print "%.3f" % nltk.classify.accuracy(clf, predicted)
The problem is that when I print the nltk.classify.accuracy, it takes forever, and I suspect this is because I have done something wrong; but since I get no error, I can't figure out what it is that is wrong.
Use the accuracy_score from sklearn.metrics instead.
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
I think you're mixing up some things about supervised learning. See this answer and try to understand the top of this page.
Your data should be in this form (before doing the Vectorization) :
X = [["The cat is sleeping"], ..., ["The man is dead"]]
Y = [1, ..., 0]
You have an error at least in this line:
clf = MultinomialNB().fit(X_train_tf, training_set.split('\n'))
You need to pass the vectorized data and your training labels there, but you are passing the vectorized data and the original text.
It should look like this:
clf = MultinomialNB().fit(X_train_tf, y_train)
But you don't even have the y_train label data anywhere in your code, as far as I can tell.
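A minimal sketch of what that would look like with hypothetical labels (the documents and y_train values below are made up for illustration; your labels have to come from however your data is annotated):
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical example: documents plus one label per document
docs_train = ["the cat is sleeping", "the man is dead", "the dog is barking"]
y_train = [1, 0, 1]                      # made-up labels, one per document

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(docs_train)
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

clf = MultinomialNB().fit(X_train_tf, y_train)   # labels, not the raw text

docs_test = ["the cat is dead"]
X_test_tf = tf_transformer.transform(count_vect.transform(docs_test))
print(clf.predict(X_test_tf))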

How to identify the ID / name / title of the misclassified text file with sci-kit learn

I am building my own classifier for text classification, but at the moment I am playing with scikit-learn in order to figure out a few things. I classified a few of my text files using the NB classifier. I am using 26 text files manually categorised into 2 categories, with each file numbered between 01 and 26 (i.e. '01.txt' and so forth).
My code and results:
import sklearn
from sklearn.datasets import load_files
import numpy as np
bunch = load_files('corpus')
split_pcnt = 0.75
split_size = int(len(bunch.data) * split_pcnt)
X_train = bunch.data[:split_size]
X_test = bunch.data[split_size:]
y_train = bunch.target[:split_size]
y_test = bunch.target[split_size:]
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer
clf_1 = Pipeline([('vect', CountVectorizer()),
                  ('clf', MultinomialNB()),
                  ])
from sklearn.cross_validation import cross_val_score, KFold
from scipy.stats import sem
def evaluate_cross_validation(clf, X, y, K):
    # create a k-fold cross validation iterator of k=5 folds
    cv = KFold(len(y), K, shuffle=True, random_state=0)
    # by default the score used is the one returned by the score method of the estimator (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print scores
    print ("Mean score: {0:.3f} (+/-{1:.3f})").format(np.mean(scores), sem(scores))
clfs = [clf_1]
for clf in clfs:
    evaluate_cross_validation(clf, bunch.data, bunch.target, 5)
[ 0.5 0.4 0.4 0.4 0.6]
Mean score: 0.460 (+/-0.040)
from sklearn import metrics
def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    clf.fit(X_train, y_train)
    print "Accuracy on training set:"
    print clf.score(X_train, y_train)
    print "Accuracy on testing set:"
    print clf.score(X_test, y_test)
    y_pred = clf.predict(X_test)
    print "Classification Report:"
    print metrics.classification_report(y_test, y_pred)
    print "Confusion Matrix:"
    print metrics.confusion_matrix(y_test, y_pred)
train_and_evaluate(clf_1, X_train, X_test, y_train, y_test)
Accuracy on training set:
1.0
Accuracy on testing set:
0.714285714286
Classification Report:
             precision    recall  f1-score   support
          0       0.67      0.67      0.67         3
          1       0.75      0.75      0.75         4
avg / total       0.71      0.71      0.71         7
Confusion Matrix:
[[2 1]
[1 3]]
What I cannot figure out is how to identify the IDs of the misclassified files, to see which exact files are misclassified (e.g. '05.txt' and '23.txt'). Is this possible with sci-kit learn at all?
best,
guzdeh
Yes, you have to use the attribute filenames of the load_files result.
However, you have two model training and evaluation cycles in your example code: one using cross-validation and another using a simple train-test split.
In the train-test split:
test_filenames = bunch.filenames[split_size:]
misclassified = (y_pred != y_test)
print test_filenames[misclassified]
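For the cross-validation cycle, a sketch of one option is cross_val_predict, so that every document gets an out-of-fold prediction that can be compared with its true label:
from sklearn.cross_validation import cross_val_predict  # sklearn.model_selection in newer versions

# Out-of-fold prediction for every document, then select the mismatching filenames
y_cv_pred = cross_val_predict(clf_1, bunch.data, bunch.target, cv=5)
print bunch.filenames[y_cv_pred != bunch.target]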
This answer does not assume that the text files are in alphabetical order or that all numbers are present.
Assuming load_files loads the text files in alphabetical order, all you need are the indices of the examples that were misclassified. This can be obtained via:
misclassified = np.where(y_pred != y_test)
print(misclassified)
at the end of your train_and_evaluate function. So if this prints, say [1, 3, 7], the files at positions 1, 3, and 7 of the test split were misclassified; to recover the actual names you would still map those positions back through the split, e.g. bunch.filenames[split_size:][misclassified].