I am doing some PCA using sklearn.decomposition.PCA. I found that if the input matrix X is big, the results of two different PCA instances for PCA.transform will not be the same. For example, when X is a 100x200 matrix, there will not be a problem. When X is a 1000x200 or a 100x2000 matrix, the results of two different PCA instances will be different. I am not sure what's the cause for this: I suppose there is no random elements in sklearn's PCA solver? I am using sklearn version 0.18.1. with python 2.7
The script below illustrates the issue.
import numpy as np
import sklearn.linear_model as sklin
from sklearn.decomposition import PCA
n_sample,n_feature = 100,200
X = np.random.rand(n_sample,n_feature)
pca_1 = PCA(n_components=10)
pca_1.fit(X)
X_transformed_1 = pca_1.transform(X)
pca_2 = PCA(n_components=10)
pca_2.fit(X)
X_transformed_2 = pca_2.transform(X)
print(np.sum(X_transformed_1 == X_transformed_2) )
print(np.mean((X_transformed_1 - X_transformed_2)**2) )
There's a svd_solver param in PCA and by default it has value "auto". Depending on the input data size, it chooses most efficient solver.
Now as for your case, when size is larger than 500, it will choose randomized.
svd_solver : string {‘auto’, ‘full’, ‘arpack’, ‘randomized’}
auto :
the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the
number of components to extract is lower than 80% of the smallest
dimension of the data, then the more efficient ‘randomized’ method is
enabled. Otherwise the exact full SVD is computed and optionally
truncated afterwards.
To control how the randomized solver behaves, you can set random_state param in PCA which will control the random number generator.
Try using
pca_1 = PCA(n_components=10, random_state=SOME_INT)
pca_2 = PCA(n_components=10, random_state=SOME_INT)
I had a similar problem even with the same trial number but on different machines I was getting different result setting the svd_solver to 'arpack' solved the problem
I'm no king in python, and recently got in trouble with a modification I made in my code. My algorithm is basically multiple uses of stochastic gradient algorithm and thus needs random variables.
I wanted my code to handle custom random variables and probability distribution. To do so, I modified my code and now I use scipy.stats to draw samples of custom random variables. Basically, I create a random variable with an imposed probability density or a cumulative density, and then draw samples thanks to the inverse function of the cumulative distribution function and some uniform random variable between [0,1].
To make it simple the algorithm runs multiple optimization from different starting point using stochastic gradient algorithm, and thus can be parallelized since the starting points are independent.
Problem is that the random variable created this way can't be pickled
PicklingError: Can't pickle : attribute lookup builtin.instancemethod failed
I don't get the subtility of pickling problems for now, so if you guys can help me solve this following simple illustration of the problem :
RV = scipy.stats.norm();
def Draw(rv,N):
return rv.ppf(np.random.random(N))
pDraw = partial(Draw,RV);
PM = multiprocessing.pool(Processes = 2);
L = PM.map(pDraw,range(1,5));
I've heard of pathos library that do not use the same serialization algorithm (dill), but I would like to avoid this solution (if it is a solution) as it is not included in my python distribution at work... making it install will take a lot of time.
I referred following blog post while doing following code blocks https://prateekvjoshi.com/2015/12/15/how-to-compute-confidence-measure-for-svm-classifiers/ and I obtained following results. My intention find out the distance of a point from 3 classes in SVC of SVM in Scikit-learn, but I confused with the meaning described are there any solutions.
import numpy as np
from sklearn.svm import SVC
x = np.array([[1,2],[2,3],[3,4],[1,4],[1,5],[2,4],[2,6]])
y = np.array([0,1,-1,-1,1,1,0])
classifier = SVC(kernel='linear')
classifier.fit(x,y)
classifier.decision_function([2,1])
last call give the following output of array of size 3
array([[ -8.88178420e-16, -1.40000000e+00, -1.00000000e+00]])
what does this array meant for, how can we use this array to find out which out three class (-1,1,0) the particular data point related for.
It is the distance of the point [2,1] from the separating hyper-plane of SVM Classifier. So the first value is the distance of [2,1] from hyperplane separating the first class, so on and so forth. You can see the function's implementation here and read the documentation here for more info.
EDIT : You can also check out this example as well.
I have a cyclical signal I would like to model. I would like to allow the signal to be able to stretch and compress in time, and I do not know the exact profile.
At the moment, I am modelling the phase progression as a random walk, and capturing the cyclical nature by defining the mean likelihood as a sum of sines and cosines on the phase, where the weights on the cosines are parameters to be fitted.
i.e.
y = N(f(phase),sigma) = N(sum_i(a_i*sin(phase) + b_i*cos(phase)),sigma)
(i.e. latex image of above)
This seems to work to some extent, but I would like to change the definition of f so that it does not rely on sums of sin and cos.
I was looking at Gaussian Processes, and thinking that there could be a solution to this there - but I can't figure out how (if it's possible) to define the y in terms of phase when using GP.
There is an example on the pymc github site:
y_obs = pm.gp.GP('y_obs', cov_func=f_cov, sigma=s2_n, observed={'X':X, 'Y':y})
The problem here is that X is defined as observed, while I need to model it as a random variable.
I tried this form:
y_obs = pm.gp.GP('y_obs', X = phase , cov_func=f_cov, sigma=s2_n, observed={ 'Y':y})
But that leads to an error:
File "/home/person/.conda/envs/mcmcx/lib/python3.6/site-packages/pymc3/distributions/distribution.py", line 56, in __init__
raise TypeError("Expected int elements in shape")
I am new to HB/GP/pymc3... and even stackoverflow. Apologies if the question is off.
I would like to use a number of features to train with Naive Bayes classifier to classify 'A' or 'non-A'.
I have three features of different value types:
1) total_length - in positive integer
2) vowel-ratio - in decimal/fraction
3) twoLetters_lastName - a array containing multiple two-letters strings
# coding=utf-8
from nltk.corpus import names
import nltk
import random
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from sklearn.naive_bayes import GaussianNB
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# Import data into pandas
data = pd.read_csv('XYZ.csv', header=0, encoding='utf-8',
low_memory=False)
df = DataFrame(data)
# Randomize records
df = df.reindex(np.random.permutation(df.index))
# Assign column into label Y
df_Y = df[df.AScan.notnull()][['AScan']].values # Labels are 'A' or 'non-A'
#print df_Y
# Assign column vector into attribute X
df_X = df[df.AScan.notnull()][['total_length', 'vowel_ratio', 'twoLetters_lastName']].values
#print df_X[0:10]
# Incorporate X and Y into ML algorithms
clf = GaussianNB()
clf.fit(df_X, df_Y)
df_Y is as follow:
[[u'non-A']
[u'A']
[u'non-A']
...,
[u'A']
[u'non-A']
[u'non-A']]
df_X is below:
[[9L 0.222222222 u"[u'ke', u'el', u'll', u'ly']"]
[17L 0.41176470600000004
u"[u'ma', u'ar', u'rg', u'ga', u'ar', u'ri', u'is']"]
[11L 0.454545455 u"[u'du', u'ub', u'bu', u'uc']"]
[11L 0.454545455 u"[u'ma', u'ah', u'he', u'er']"]
[15L 0.333333333 u"[u'ma', u'ag', u'ge', u'ee']"]
[13L 0.307692308 u"[u'jo', u'on', u'ne', u'es']"]
[12L 0.41666666700000005
u"[u'le', u'ef', u'f\\xe8', u'\\xe8v', u'vr', u're']"]
[15L 0.26666666699999997 u"[u'ni', u'ib', u'bl', u'le', u'et', u'tt']"]
[15L 0.333333333 u"[u'ki', u'in', u'ns', u'sa', u'al', u'll', u'la']"]
[11L 0.363636364 u"[u'mc', u'cn', u'ne', u'ei', u'il']"]]
I am getting this error:
E:\Program Files Extra\Python27\lib\site-packages\sklearn\naive_bayes.py:150: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Traceback (most recent call last):
File "C:werwer\wer\wer.py", line 32, in <module>
clf.fit(df_X, df_Y)
File "E:\Program Files Extra\Python27\lib\site-packages\sklearn\naive_bayes.py", line 163, in fit
self.theta_[i, :] = np.mean(Xi, axis=0)
File "E:\Program Files Extra\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 2727, in mean
out=out, keepdims=keepdims)
File "E:\Program Files Extra\Python27\lib\site-packages\numpy\core\_methods.py", line 69, in _mean
ret, rcount, out=ret, casting='unsafe', subok=False)
TypeError: unsupported operand type(s) for /: 'unicode' and 'long'
My understanding is I need to convert the features into one numpy array as a feature vector, but I don't think if I am preparing this X vector right since it contains very different value types.
Related questions: Choosing a Classification Algorithm to Classify Mix of Nominal and Numeric Data -- Mixing Categorial and Continuous Data in Naive Bayes Classifier Using Scikit-learn
Okay so there are a few things going on. As DalekSec pointed out, it's best practice to keep all your features as one type as you input them into a model like GaussianNB. The traceback indicates that while fitting the model, it tries to divide a string (presumably one of your unicode strings like u"[u'ke', u'el', u'll', u'ly']") by an integer. So what we need to do is convert the training data into a form that sklearn can use. We can do this a few ways, two of which ogrisel eloquently describes in this answer here.
We can convert all the continuous variables to categorical variables. In our case, this means converting total_length (in some cases you could probably treat this as a categorical variable, but let's not get ahead of ourselves) and vowel-ratio. For instance, you can basically bin the values you see in each feature to one of 5 values based on percentile: 'very small', 'small', 'medium', 'high', 'very high'. There's no real easy way in sk-learn as far as I know, but it should be pretty straightforward to do it yourself. The only thing that you would want to change is that you would want to use MultinomialNB instead of GaussianNB because you'll be dealing with features that would be better described by multinomial distributions rather than gaussian ones.
We can convert the categorical features to numeric ones for use with GaussianNB. Personally I find this to be the more intuitive approach. Basically, when dealing with text, you need to figure out what information you want to take from the text and pass to the classifier. It looks like to me that you want to extract the incidence of different two letter last names.
Normally I would ask you whether or not you have all the last names in your dataset, but since each one is only two letters each we can just store all the possible two letter names (including the unicode characters involving accent marks) with a minimal impact on performance. This is where something like sklearn's CountVectorizer might be useful. Assuming that you have every possible combination of two letter last names in your data, you can just directly use this to turn a row in your twoLetter_lastname column into a N-dimensional vector that records the number of occurrences of each unique last name in your row. Then just combine this new vector with your other two features into a numpy array.
In the case you do not have every possible combination of two letters (including accented ones), you should consider generating that list and pass it in as the 'vocabulary' for the CountVectorizer. This is so that your classifier knows how to handle all possible last names. It's not the end of the world if you don't handle all cases, but any new unseen two letter pairs will be ignored in this scheme.
Before you use these tools, you should make sure that you pass your last name column in as a list, and not as a string, as this can result in unintended behavior.
You can read more about general sklearn preprocessing here, and more about CountVectorizer and other text feature extraction tools provided by sklearn here. I use a lot of these tools daily, and recommend them for basic text extraction tasks. There are also plenty of tutorials and demos available online. You might also look for other types of methods of representation, like binarizing and one-hot encoding. There are many ways to solve this problem, it mostly depends on your specific problem/needs.
After you're able to turn all your data into one form or the other, you should be able to make use of either the Gaussian or Multinomial NB classifier. As for your error regarding the 1D vector, you printed df_Y and it looked like
[[u'non-A']
[u'A']
[u'non-A']
...,
[u'A']
[u'non-A']
[u'non-A']]
Basically, it's expecting this to be in a flat list, rather than as a column vector (a list of one-dimensional lists). Just reshape it accordingly by making use of commands like numpy.reshape() or numpy.ravel() (numpy.ravel() would probably be more appropriate, considering that you're dealing with just one column, as the error mentioned).
I'm not 100% sure, but I think scikit-learn.naive_bayes requires a purely numeric feature vector instead of a mixture of text and numbers. It looks like it crashes when trying to "divide" a unicode string by a long integer.
I can't be much help with finding numeric representations for text, but this scikit-learn tutorial might be a good start.