I am doing some PCA using sklearn.decomposition.PCA. I found that if the input matrix X is big, the results of PCA.transform from two different PCA instances will not be the same. For example, with a 100x200 matrix there is no problem, but with a 1000x200 or a 100x2000 matrix the results of two different PCA instances differ. I am not sure what causes this: I assumed there were no random elements in sklearn's PCA solver? I am using sklearn version 0.18.1 with Python 2.7.
The script below illustrates the issue.
import numpy as np
from sklearn.decomposition import PCA

# 1000x200 is large enough to trigger the non-deterministic behaviour;
# with 100x200 both transforms agree
n_sample, n_feature = 1000, 200
X = np.random.rand(n_sample, n_feature)

pca_1 = PCA(n_components=10)
pca_1.fit(X)
X_transformed_1 = pca_1.transform(X)

pca_2 = PCA(n_components=10)
pca_2.fit(X)
X_transformed_2 = pca_2.transform(X)

print(np.sum(X_transformed_1 == X_transformed_2))
print(np.mean((X_transformed_1 - X_transformed_2) ** 2))
There's an svd_solver parameter in PCA whose default value is "auto": depending on the input data size, it chooses the most efficient solver. In your case, when the data is larger than 500x500, it chooses the randomized solver.
svd_solver : string {'auto', 'full', 'arpack', 'randomized'}
    auto :
        The solver is selected by a default policy based on X.shape and
        n_components: if the input data is larger than 500x500 and the
        number of components to extract is lower than 80% of the smallest
        dimension of the data, then the more efficient 'randomized' method
        is enabled. Otherwise the exact full SVD is computed and optionally
        truncated afterwards.
To control how the randomized solver behaves, you can set the random_state parameter in PCA, which controls the random number generator.
Try using
pca_1 = PCA(n_components=10, random_state=SOME_INT)
pca_2 = PCA(n_components=10, random_state=SOME_INT)
I had a similar problem: even with the same seed, I was getting different results on different machines. Setting svd_solver to 'arpack' solved the problem.
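For reference, a minimal sketch of both deterministic options discussed above (sizes are illustrative):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 200)

# Option 1: force an exact solver, so no randomness is involved
pca_full = PCA(n_components=10, svd_solver='full').fit(X)

# Option 2: keep the faster randomized solver but pin its RNG
pca_rand = PCA(n_components=10, svd_solver='randomized', random_state=0).fit(X)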
I do understand principal component analysis: I know how to do it and what it actually does. I have applied PCA, and my best result turned out to be two components. I understand that each of my inputs now contributes partially to each component. What I do not understand is how to feed the result of PCA (in my case, 2 components) into a machine learning model.
How do we input them?
For example, when I want to run a NN on my features, I can just navigate to where they are stored and import them, but my PCA analysis was run in SPSS, and all it shows me is the contribution of my features to each component.
What should I import to my NN model?
PCA is a method of feature extraction, which is used to avoid the problem of collinearity. For example, if several variables are highly correlated because "they measure the same thing", then PCA can extract a measure of "that thing" (technically: a component), which is called a score. Your data set of, say, 100 measured variables may reduce to, say, 10 significant components. You can then use the scores your test persons achieved on those 10 components to do, for example, a multi-dimensional regression, a cluster analysis or a discriminant analysis. This will give more valid results than performing the analysis directly on the 100 variables.
So the procedure is to sort the eigenvalues (and -vectors) by size, identify the number of significant components q (e.g., by scree plot), set up the projection matrix F (the eigenvectors corresponding to the largest q eigenvalues, in columns) and multiply it with the data matrix D. This will give you the score matrix C (dimension n times q, with n the number of test persons), which you can use as input for whatever method you want to use next.
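Not part of the original answer, but a minimal numpy sketch of that projection (all names and sizes are illustrative):

import numpy as np

n, m, q = 100, 10, 2               # test persons, variables, kept components
D = np.random.rand(n, m)           # illustrative data matrix

# Eigen-decomposition of the covariance matrix of the data
eigvals, eigvecs = np.linalg.eigh(np.cov(D, rowvar=False))

# Sort eigenpairs by decreasing eigenvalue and keep the largest q
order = np.argsort(eigvals)[::-1]
F = eigvecs[:, order[:q]]          # projection matrix, m x q

# Score matrix C (n x q): component scores, usable as model input
C = np.dot(D - D.mean(axis=0), F)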
I am building a penalized regression model with 2 parameters $\lambda$ and $\alpha$. I am trying to find optimal values for these parameters, so I consider a grid of different values. Let's say I consider n_lambda different $\lambda$ values and n_alpha different $\alpha$ values. In order to test the performance of the model, I consider $n$ data observations, for which I have the true value of my response variable, and I compute the predictions for these observations for each parameter pair.
I store my predictions into a 3D-array matrix with dimension (n_lambda, n_alpha, n_observations). This means that element [0, 0, :] of this matrix contains the predictions for n observations for the first value of $\lambda$ and the first value of $\alpha$.
Now I want to compute the mean squared error for each of my predictions. I know I can do this using nested for loops:
import numpy as np
from sklearn.metrics import mean_squared_error

error_matrix = np.zeros((n_lambda, n_alpha))
for i in range(n_lambda):
    for j in range(n_alpha):
        error_matrix[i, j] = mean_squared_error(true_value, prediction[i, j, :])
This works, but nested for loops are not really optimal; I guess there must be a better way to get what I want. I have tried using the map function, but it is not working, so I would appreciate any advice.
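One vectorized alternative is to let numpy broadcast over the observation axis; a minimal sketch, assuming true_value has shape (n_observations,):

import numpy as np

n_lambda, n_alpha, n_obs = 5, 4, 100          # illustrative sizes
prediction = np.random.rand(n_lambda, n_alpha, n_obs)
true_value = np.random.rand(n_obs)

# MSE over the observation axis for every (lambda, alpha) pair at once;
# true_value broadcasts against the last axis of prediction
error_matrix = np.mean((prediction - true_value) ** 2, axis=2)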
I'm no expert in Python, and I recently got in trouble with a modification I made to my code. My algorithm is basically multiple runs of a stochastic gradient algorithm and thus needs random variables.
I wanted my code to handle custom random variables and probability distributions. To do so, I modified my code so that it now uses scipy.stats to draw samples of custom random variables. Basically, I create a random variable with an imposed probability density or cumulative density, and then draw samples by applying the inverse of the cumulative distribution function to a uniform random variable on [0, 1].
To keep it simple: the algorithm runs multiple optimizations from different starting points using stochastic gradient descent, and can thus be parallelized, since the starting points are independent.
The problem is that a random variable created this way can't be pickled:
PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed
I don't get the subtleties of pickling problems yet, so I'd appreciate help with the following simple illustration of the problem:
import numpy as np
import scipy.stats
import multiprocessing
from functools import partial

RV = scipy.stats.norm()

def Draw(rv, N):
    return rv.ppf(np.random.random(N))   # inverse-transform sampling

pDraw = partial(Draw, RV)
PM = multiprocessing.Pool(processes=2)
L = PM.map(pDraw, range(1, 5))           # raises the PicklingError above
I've heard of the pathos library, which uses a different serialization library (dill), but I would like to avoid that solution (if it is one), as it is not included in the Python distribution at work, and getting it installed would take a lot of time.
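One workaround that avoids pickling the frozen rv altogether is to rebuild it inside the worker from picklable pieces (the distribution name and its parameters); a minimal sketch, with names of my own choosing:

import multiprocessing
import numpy as np
import scipy.stats

def draw(args):
    # Rebuild the frozen distribution in the worker instead of shipping
    # the rv object, which is what triggers the PicklingError
    dist_name, params, n = args
    rv = getattr(scipy.stats, dist_name)(*params)
    return rv.ppf(np.random.random(n))

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=2)
    jobs = [('norm', (), n) for n in range(1, 5)]
    samples = pool.map(draw, jobs)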
Use case: selecting the "optimal threshold" for a logistic model built with statsmodels' Logit, to predict, say, binary classes (or multinomial, but integer classes).
Is there something built into Python to select the threshold for a (say, logistic) model? For small data sets, I remember optimizing the threshold by picking the maximum buckets of correctly predicted labels (true "0" and true "1"), best seen in the graph here:
http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
I also know intuitively that if I set alpha values, it should give me a "threshold" that I can use below. How should I compute the threshold given a reduced model whose variables are all significant at 95% confidence? Obviously, setting threshold > 0.5 -> "1" would be too naive, and since I am looking at 95% confidence, this threshold should be "smaller", meaning p > 0.2 or something.
This would then mean something like a range of "critical values" where the label should be "1", and "0" otherwise.
What I want is something like this -:
import numpy as np
from scipy.stats import ks_2samp

test_scores = smf.Logit(y_train, x_train, missing='drop').fit()
threshold = 0.2
# test_scores.predict(x_train, transform=False) gives the continuous
# predicted probabilities, so to turn them into labels I compare them
# against a threshold (use x_test instead when testing the model)
y_predicted_train = np.array(test_scores.predict(x_train, transform=False) > threshold, dtype=float)
table = np.histogram2d(y_train, y_predicted_train, bins=2)[0]
# will do the same on the "test" data

# crude way of selecting an optimal threshold
ks_2samp(y_train, y_predicted_train)
# (0.39963996399639962, 0.958989)
# must get < 95% here, and keep modifying the threshold as above until I
# fail to reject the Null at 95%
# y_train holds the REAL values, y_predicted_train the predictions back on
# the TRAIN dataset; to get y_predicted_train as binary labels, I already
# applied the threshold as above
Questions:
1. How can I select the threshold in an objective way, i.e. reduce the percentage of misclassified labels? Say I care more about missing a "1" (a false negative) than about mispredicting a "0" as a "1" (a false positive), and I want to reduce this error. This I get from the ROC curve. The roc_curve function (in sklearn.metrics rather than statsmodels) assumes that I have already done the labelling for the y_predicted class, and that I am just revalidating this over the test set (point me if my understanding is incorrect). I also think using the confusion matrix alone will not solve the threshold-picking problem.
2. Which brings me to: how should I consume the output of these built-in functions (oob, confusion_matrix) when selecting the optimal threshold (first on the train sample, then fine-tuning it over the test and cross-validation samples)?
I also looked up the official documentation of the K-S test in scipy here:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
Related:
Statistics Tests (Kolmogorov and T-test) with Python and Rpy2
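For question 1, one objective rule (not from the thread above) is to maximize Youden's J statistic on the ROC curve; a minimal sketch using sklearn.metrics, where y_score stands for the continuous probabilities from test_scores.predict(...) and the function name optimal_threshold is my own:

import numpy as np
from sklearn.metrics import roc_curve

def optimal_threshold(y_true, y_score):
    # Youden's J = TPR - FPR; pick the threshold where it peaks
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]

# Illustrative usage with synthetic labels and scores
y_true = np.random.randint(0, 2, 500)
y_score = np.clip(y_true * 0.3 + np.random.rand(500) * 0.7, 0, 1)
print(optimal_threshold(y_true, y_score))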
I have a basic understanding of using numpy to create a matrix, but the context in which I have to create one confuses me. For example, I need to create a 2x1000 matrix with normally distributed values with mean 0 and standard deviation 1. I'm not sure what it means to make a matrix with these conditions.
Besides what was written above by CoDEmanX, from numpy.random.normal we can read about the generic normal distribution in numpy:
numpy.random.normal(loc=0.0, scale=1.0, size=None)
Where:
loc is the mean of the distribution and scale is the standard deviation (the square root of the variance).
import numpy as np
A = np.random.normal(loc=0, scale=1, size=(2, 1000))
print(A)
But if these examples are confusing, then consider that
np.random.normal()
simply gives you a random number, and you can create your own custom matrix:
import numpy as np
A = [[np.random.normal() for i in range(1000)] for j in range(2)]
A = np.array(A)
If you check the numpy docs for utility functions that help you reach your goal, you'll come across a normal distribution function:
numpy.random.standard_normal(size=None)
Returns samples from a standard normal distribution (mean=0, stdev=1).
That is, the mean is 0 and the standard deviation is 1.
import numpy as np

arr = np.random.standard_normal((2, 1000))
print(arr.mean())  # e.g. -0.027...
print(arr.std())   # e.g. 1.0272...
Note that it's not exactly 0 or 1.
I still recommend reading about the normal distribution and standard deviation/variance, even though numpy offers a simple solution.