Apply scaler before flattening or after flattening for PCA feature reduction - pca

I have an array of size (504, 20, 174). Where I have extracted 20 MFCC features from 504 files. The features matrix is therefore 20x174 for each of the 504 files.
I would like to apply PCA feature reduction to this array. In order to do PCA, I would like to do apply a StandardScaler() before going into the PCA analysis.
Do I have to flatten the array before carrying out the StandardScaler():
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
#Standardizaition data
scaler = StandardScaler()
# Flatten array of x_train original shape = 504 x 20 X 174
data_2d = np.array([features_2d.flatten() for features_2d in x_train]) # array reduced to 504x3480
# Fit scaler
X_train_scaled = scaler.fit_transform(data_2d)
# Apply PCA
pca_train = PCA().fit(X_train_scaled)
or do I scale on the original 3d array and then flatten after coming out of the standard scaler?
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
#Standardizaition data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(x_train.reshape(-1, x_train.shape[-1])).reshape(x_train.shape)
# Flatten array of x_train
X_train_flat = np.array([features_2d.flatten() for features_2d in X_train_scaled])
# Apply PCA
pca_train = PCA().fit(X_train_scaled)
What approach is the correct approach?
Any help will be greatly appreciated!

Related

Applying Multi Otsu Threshold for my image

I have this image shown below
And, here I am trying to define the threshold to distinguish bimodal class by using the Otsu technique based on intensity and then visualise those in the histogram. So far I have written following codes:
import matplotlib.pyplot as plt
import numpy as np
from skimage import data, io, img_as_ubyte
from skimage.filters import threshold_multiotsu
# Read an image
image = io.imread("Fig_1.png")
# Apply multi-Otsu threshold
thresholds = threshold_multiotsu(image,classes=5)
# Digitize (segment) original image into multiple classes.
#np.digitize assign values 0, 1, 2, 3, ... to pixels in each class.
regions = np.digitize(image, bins=thresholds)
output = img_as_ubyte(regions) #Convert 64 bit integer values to uint8
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(10, 3.5))
# Plotting the original image.
ax[0].imshow(image, cmap='gray')
ax[0].set_title('Original')
ax[0].axis('off')
# Plotting the histogram and the two thresholds obtained from
# multi-Otsu.
ax[1].hist(image.ravel(), bins=255)
ax[1].set_title('Histogram')
for thresh in thresholds:
ax[1].axvline(thresh, color='r')
# Plotting the Multi Otsu result.
ax[2].imshow(regions, cmap='gray')
ax[2].set_title('Multi-Otsu result')
ax[2].axis('off')
plt.subplots_adjust()
plt.show()
This gives me the following result. Here As you can see Multi-Otsu result is totally black and does not show the two class of object present in the figure.
I choose classes=5 but this is bimodal hence putting classes=3 also giving me the same result.
Any advice on how to correct this? Thanks in advance.

To extract Top feature names from list numpy

Here I have a compilation of feature names in feature_names, each name been taken from column if X_train, Y_train of a dataset.
neigh is an MultinomialNB classifier which has been fitted with X_train and Y_train of a dataset.
I can't figure out now How to extract top features from feature_names by using neigh MultinomialNB classifier.
So I wrote code below using numpy
max_ind_positv=np.argsort((neigh.feature_log_prob_)[1])[::-1][0:10]
top_pos=np.take(feature_names,max_ind_positv)
But it shows the following errors:
1) AttributeError: 'list' object has no attribute 'take'
2) IndexError: index 3997 is out of bounds for axis 0 with size 7
Please someone show me correction on how to get top 20 feature names.
You don't really need to use np.take(), you can just index into an array of strings. So to get the features in order of decreasing importance, you can do:
>>> import numpy as np
>>> features = np.array(['feat1', 'feat2', 'feat3'])
>>> coeffs = np.array([0.2, 0.02, 2.0])
>>> features[np.argsort(coeffs)[::-1]]
array(['feat3', 'feat1', 'feat2'], dtype='<U5')

Understanding Deep Learning model accuracy

I need help in understanding the accuracy and dataset output format for Deep Learning model.
I did some training for deep learning based on this site : https://machinelearningmastery.com/deep-learning-with-python2/
I did the example for pima-indian-diabetes dataset, and iris flower dataset. I train my computer for pima-indian-diabetes dataset using script from this : http://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
Then I train my computer for iris-flower dataset using below script.
# import package
import numpy
from pandas import read_csv
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from keras.callbacks import ModelCheckpoint
# fix random seed for reproductibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataframe = read_csv("iris_2.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:4].astype(float)
Y = dataset[:,4]
# encode class value as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
### one-hot encoder ###
dummy_y = np_utils.to_categorical(encoded_Y)
# define base model
def baseline_model():
# create model
model = Sequential()
model.add(Dense(4, input_dim=4, init='normal', activation='relu'))
model.add(Dense(3, init='normal', activation='sigmoid'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model_json = model.to_json()
with open("iris.json", "w") as json_file:
json_file.write(model_json)
model.save_weights('iris.h5')
return model
estimator = KerasClassifier(build_fn=baseline_model, nb_epoch=1000, batch_size=6, verbose=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, dummy_y, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
Everything works fine until I decided to try on other dataset from this link : https://archive.ics.uci.edu/ml/datasets/Glass+Identification
At first I train this new dataset using the pime-indian-diabetes dataset script's example and change the value for X and Y variable to this
dataset = numpy.loadtxt("glass.csv", delimiter=",")
X = dataset[:,0:10]
Y = dataset[:,10]
and also the value for the neuron layer to this
model = Sequential()
model.add(Dense(10, input_dim=10, init='uniform', activation='relu'))
model.add(Dense(10, init='uniform', activation='relu'))
model.add(Dense(1, init='uniform', activation='sigmoid'))
the result produce accuracy = 32.71%
Then I changed the output column of this dataset which is originally in integer (1~7) to string (a~g) and use the example's script for the iris-flower dataset by doing some modification to it
import numpy
from pandas import read_csv
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
seed = 7
numpy.random.seed(seed)
dataframe = read_csv("glass.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:10].astype(float)
Y = dataset[:,10]
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
def create_baseline():
model = Sequential()
model.add(Dense(10, input_dim=10, init='normal', activation='relu'))
model.add(Dense(1, init='normal', activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_json = model.to_json()
with open("glass.json", "w") as json_file:
json_file.write(model_json)
model.save_weights('glass.h5')
return model
estimator = KerasClassifier(build_fn=create_baseline, nb_epoch=1000, batch_size=10, verbose=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, encoded_Y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
I did not use 'dummy_y' variable as refer to this tutorial : http://machinelearningmastery.com/binary-classification-tutorial-with-the-keras-deep-learning-library/
I check that the dataset using alphabet as the output and thinking that maybe I can reuse that script to train the new glass dataset that I modified.
This time the results become like this
Baseline : 68.42% (3.03%)
From the article, that 68% and 3% means the mean and standard deviation of model accuracy.
My 1st question is when do I use integer or alphabet as the output column? and is this kind of accuracy result common when we tempered with the dataset like changing the output from integer to string/alphabet?
My 2nd question is how do I know how many neuron I have to put for each layer? Is it related to what backend I use when compiling the model(Tensorflow or Theano)?
Thank you in advance.
First question
It doesn't matter, as you can see here:
Y = range(10)
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
print encoded_Y
Y = ['a', 'b', 'c', 'd', 'e', 'f','g','h','i','j']
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
print encoded_Y
results:
[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4 5 6 7 8 9]
Which means that your classifier sees exactly the same labels.
Second question
There is no absolutely correct answer for this question, but for sure it does not depend on your backend.
You should try and experiment with different number of neurons, number of layers, types of layers and all other network parameters in order to understand what is the best architecture to your problem.
With experience you will develop both a good intuition as for what parameters will be better for which type of problems as well as a good method for the experimentation.
The best rule of thumb (assuming you have the dataset required to sustain such a strategy) I've heard is "Make your network as large as you can until it overfit, add regularization until it does not overfit - repeat".
Per parts. First, if your output includes values ​​of [0, 5] it is
impossible that using the sigmoid activation you can obtain that.
The sigmoid function has a range of [0, 1]. You could use an
activation = linear (without activation). But I think it's a bad approach because your problem is not to estimate a continuous value.
Second, the question you should ask yourself is not so much the type
of data you are using (in the sense of how you store the
information). Is it a string? Is it an int? Is it a float? It does
not matter, but you have to ask what kind of problem you are trying
to solve.
In this case, the problem should not be treated as a regression
(estimate a continuous value). Because your output are categorical,
numbers but categorical. Really you want to classifying between:
Type of glass: (class attribute).
When do a classification problem the following configuration is
"normally" used:
The class is encoded by one-hot encoding. It is nothing more than a vector of 0's and a single one in the corresponding class.
For instance: class 3 (0 count) and have 6 classes -> [0, 0, 0, 1, 0, 0] (as many zeros as classes you have).
As you see now, we dont have a single output, your model must be as outputs as your Y (6 classes). That way the last layer should
have as many neurons as classes. Dense (classes, ...).
You are also interested in the fact that the output is the probability of belonging to each class, that is: p (y = class_0),
... p (y_class_n). For this, the softmax activation layer is used,
which is to ensure that the sum of all the probabilities is 1.
You have to change the loss for the categorical_crossentropy so that it is able to work together with the softmax. And use the metric categorical_accuracy.
seed = 7
numpy.random.seed(seed)
dataframe = read_csv("glass.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:10].astype(float)
Y = dataset[:,10]
encoder = LabelEncoder()
encoder.fit(Y)
from keras.utils import to_categorical
encoded_Y = to_categorical(encoder.transform(Y))
def create_baseline():
model = Sequential()
model.add(Dense(10, input_dim=10, init='normal', activation='relu'))
model.add(Dense(encoded_Y.shape[1], init='normal', activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
model_json = model.to_json()
with open("glass.json", "w") as json_file:
json_file.write(model_json)
model.save_weights('glass.h5')
return model
model = create_baseline()
model.fit(X, encoded_Y, epochs=1000, batch_size=100)
The number of neurons does not depend on the backend you use.
But if it is true that you will never have the same results. That's
because there are enough stochastic processes within a network:
initialization, dropout (if you use), batch order, etc.
What is known is that expanding the number of neurons per dense
makes the model more complex and therefore has more potential to
represent your problem but is more difficult to learn and more
expensive both in time and in calculations. That way you always have
to look for a balance.
At the moment there is no clear evidence that it is better:
expand the number of neurons per layer.
add more layers.
There are models that use one architecture and others the other.
Using this architecture you get the following result:
Epoch 1000/1000
214/214 [==============================] - 0s 17us/step - loss: 0.0777 - categorical_accuracy: 0.9953
Using this architecture you get the following result:

Getting indices while using train test split in scikit

In order to split my data into train and test data separately, I'm using
sklearn.cross_validation.train_test_split function.
When I supply my data and labels as list of lists to this function, it returns train and test data in two separate lists.
I want to get the indices of the train and test data elements from the original data list.
Can anyone help me out with this?
Thanks in advance
You can supply the index vector as an additional argument. Using the example from sklearn:
import numpy as np
from sklearn.cross_validation import train_test_split
X, y,indices = (0.1*np.arange(10)).reshape((5, 2)),range(10,15),range(5)
X_train, X_test, y_train, y_test,indices_train,indices_test = train_test_split(X, y,indices, test_size=0.33, random_state=42)
indices_train,indices_test
#([2, 0, 3], [1, 4])
Try the below solutions (depending on whether you have imbalance):
NUM_ROWS = train.shape[0]
TEST_SIZE = 0.3
indices = np.arange(NUM_ROWS)
# usual train-val split
train_idx, val_idx = train_test_split(indices, test_size=TEST_SIZE, train_size=None)
# stratified train-val split as per Response's proportion (if imbalance)
strat_train_idx, strat_val_idx = train_test_split(indices, test_size=TEST_SIZE, stratify=y)

scikit-learn PCA doesn't have 'score' method

I am trying to identify the type of noise based on that article:
Model selection with Probabilistic (PCA) and Factor Analysis (FA)
I am using scikit-learn-0.14.1.win32-py2.7 on win8 64bit
I know that it refers on version 0.15, however at the version 0.14 documentation it mentions that the score method is available for PCA so I guess it should normally work:
sklearn.decomposition.ProbabilisticPCA
The problem is that no matter which PCA I will use for the *cross_val_score*, I always get a type error message saying that the estimator PCA does not have a score method:
*TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator PCA(copy=True, n_components=None, whiten=False) does not.*
Any ideas why is that happening?
Many thanks in advance
Christos
X has 1000 samples of 40 features
here is a portion of the code:
import numpy as np
import csv
from scipy import linalg
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.cross_validation import cross_val_score
from sklearn.grid_search import GridSearchCV
from sklearn.covariance import ShrunkCovariance, LedoitWolf
#read in the training data
train_path = '<train data path>/train.csv'
reader = csv.reader(open(train_path,"rb"),delimiter=',')
train = list(reader)
X = np.array(train).astype('float')
n_samples = 1000
n_features = 40
n_components = np.arange(0, n_features, 4)
def compute_scores(X):
pca = PCA()
pca_scores = []
for n in n_components:
pca.n_components = n
pca_scores.append(np.mean(cross_val_score(pca, X, n_jobs=1)))
return pca_scores
pca_scores = compute_scores(X)
n_components_pca = n_components[np.argmax(pca_scores)]
Ok, I think I found the problem. it is not working with PCA, but it does work with PPCA
However, by not providing a cv number the cross_val_score automatically sets 3-fold cross validation
that created 3 sets with sizes 334, 333 and 333 (my initial training set contains 1000 samples)
Since nympy.mean cannot make a comparison between sets with different sizes (334 vs 333), python rises an exception.
thx