PCA for dimensionality reduction before Random Forest - pca

I am working on binary class random forest with approximately 4500 variables. Many of these variables are highly correlated and some of them are just quantiles of an original variable. I am not quite sure if it would be wise to apply PCA for dimensionality reduction. Would this increase the model performance?
I would like to be able to know which variables are more significant to my model, but if I use PCA, I would be only able to tell what PCs are more important.
Many thanks in advance.

My experience is that PCA before RF is not an great advantage if any. Principal component regression(PCR) is e.g. when, PCA assists to regularize training features before OLS linear regression and that is very needed for sparse data-sets. As RF itself already performs a good/fair regularization without assuming linearity, it is not necessarily an advantage. That said, I found my self writing a PCA-RF wrapper for R two weeks ago. The code includes a simulated data set of a data set of 100 features comprising only 5 true linear components. Under such cercumstances it is infact a small advantage to pre-filter with PCA
The code is a seamless implementation, such that every RF parameters are simply passed on to RF. Loading vector are saved in model_fit to use during prediction.
#I would like to be able to know which variables are more significant to my model, but if I use PCA, I would be only able to tell what PCs are more important.
The easy way is to run without PCA and obtain variable importances and expect to find something similar for PCA-RF.
The tedious way, wrap the PCA-RF in a new bagging scheme with your own variable importance code. Could be done in 50-100 lines or so.
The souce-code suggestion for PCA-RF:
#wrap PCA around randomForest, forward any other arguments to randomForest
#define as new S3 model class
train_PCA_RF = function(x,y,ncomp=5,...) {
f.args=as.list(match.call()[-1])
pca_obj = princomp(x)
rf_obj = do.call(randomForest,c(alist(x=pca_obj$scores[,1:ncomp]),f.args[-1]))
out=mget(ls())
class(out) = "PCA_RF"
return(out)
}
#print method
print.PCA_RF = function(object) print(object$rf_obj)
#predict method
predict.PCA_RF = function(object,Xtest=NULL,...) {
print("predicting PCA_RF")
f.args=as.list(match.call()[-1])
if(is.null(f.args$Xtest)) stop("cannot predict without newdata parameter")
sXtest = predict(object$pca_obj,Xtest) #scale Xtest as Xtrain was scaled before
return(do.call(predict,c(alist(object = object$rf_obj, #class(x)="randomForest" invokes method predict.randomForest
newdata = sXtest), #newdata input, see help(predict.randomForest)
f.args[-1:-2]))) #any other parameters are passed to predict.randomForest
}
#testTrain predict #
make.component.data = function(
inter.component.variance = .9,
n.real.components = 5,
nVar.per.component = 20,
nObs=600,
noise.factor=.2,
hidden.function = function(x) apply(x,1,mean),
plot_PCA =T
){
Sigma=matrix(inter.component.variance,
ncol=nVar.per.component,
nrow=nVar.per.component)
diag(Sigma) = 1
x = do.call(cbind,replicate(n = n.real.components,
expr = {mvrnorm(n=nObs,
mu=rep(0,nVar.per.component),
Sigma=Sigma)},
simplify = FALSE)
)
if(plot_PCA) plot(prcomp(x,center=T,.scale=T))
y = hidden.function(x)
ynoised = y + rnorm(nObs,sd=sd(y)) * noise.factor
out = list(x=x,y=ynoised)
pars = ls()[!ls() %in% c("x","y","Sigma")]
attr(out,"pars") = mget(pars) #attach all pars as attributes
return(out)
}
A run code example:
#start script------------------------------
#source above from separate script
#test
library(MASS)
library(randomForest)
Data = make.component.data(nObs=600)#plots PC variance
train = list(x=Data$x[ 1:300,],y=Data$y[1:300])
test = list(x=Data$x[301:600,],y=Data$y[301:600])
rf = randomForest (train$x, train$y,ntree =50) #regular RF
rf2 = train_PCA_RF(train$x, train$y,ntree= 50,ncomp=12)
rf
rf2
pred_rf = predict(rf ,test$x)
pred_rf2 = predict(rf2,test$x)
cat("rf, R^2:",cor(test$y,pred_rf )^2,"PCA_RF, R^2", cor(test$y,pred_rf2)^2)
cor(test$y,predict(rf ,test$x))^2
cor(test$y,predict(rf2,test$x))^2
pairs(list(trueY = test$y,
native_rf = pred_rf,
PCA_RF = pred_rf2)
)

You can have a look here to get a better idea. The link says use PCA for smaller datasets!! Some of my colleagues have used Random Forests for the same purpose when working with Genomes. They had ~30000 variables and large amount of RAM.
Another thing I found is that Random Forests use up a lot of Memory and you have 4500 variables. So, may be you could apply PCA to the individual Trees.

Related

Get intermediate output of layers on MCU using tflite micro?

Sorry for the unclarity of my first question, I have edited it to be more specific.
Because the output from the middle layers in some neural network is very interesting, I would like to get the output of certain layer during the inference on a micro-controller(MCU) running tf-lite micro c++ library.
The normal way to do this in tensorflow:
# The model we train
model = tf.keras.models.Sequential([
...
])
model.compile(...)
model.fit(...)
# Creat a aux-model which includes the layers until the one we want
layer_output_model = Model(model.inputs, model.layers[theIndexYouWant].outputs)
When we put the model into MCU, we will first quantize/prune the model, convert it into a C array and flash the model to MCU, like this:
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_model = converter.convert()
open(tflite_mnist_model, "wb").write(tflite_model)
And the inference will be called in c++ like this:
# Initialization
const tflite::Model* model = ::tflite::GetModel(model);
TfLiteTensor* input = interpreter.input(0);
TfLiteTensor* output = interpreter.output(0);
# Give input, run inference and get output
input->data.f[0] = 0.;
TfLiteStatus invoke_status = interpreter.Invoke();
float value = output->data.f[0];
If I want to extract the output of certain middle layer during inference in the MCU, how could I do it?
The only method I can come up with now is convert the above aux-model layer_output_model into c array and upload this as an additional model to MCU.
converter = tf.lite.TFLiteConverter.from_keras_model(layer_output_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
aux_tflite_model = converter.convert()
Is this the right way to do? I'm not sure the aux_model I converted here is the same representation of the wanted model layer output, especially after quantization using representative_dataset
Thanks.

Parseval's Theorem does not hold for FFT of a sinusoid + noise?

Thanks in advance for any help on this subject. I've recently been trying to work out Parseval's theorem for discrete fourier transforms when noise is included. I based my code from this code.
What I expected to see is that (as when no noise is included) the total power in the frequency domain is half that of the total power in the time-domain, as I have cut off the negative frequencies.
However, as more noise is added to the time-domain signal, the total power of the fourier transform of the signal+noise becomes much less than half of the total power of the signal+noise.
My code is as follows:
import numpy as np
import numpy.fft as nf
import matplotlib.pyplot as plt
def findingdifference(randomvalues):
n = int(1e7) #number of points
tmax = 40e-3 #measurement time
f1 = 30e6 #beat frequency
t = np.linspace(-tmax,tmax,num=n) #define time axis
dt = t[1]-t[0] #time spacing
gt = np.sin(2*np.pi*f1*t)+randomvalues #make a sin + noise
fftfreq = nf.fftfreq(n,dt) #defining frequency (x) axis
hkk = nf.fft(gt) # fourier transform of sinusoid + noise
hkn = nf.fft(randomvalues) #fourier transform of just noise
fftfreq = fftfreq[fftfreq>0] #only taking positive frequencies
hkk = hkk[fftfreq>0]
hkn = hkn[fftfreq>0]
timedomain_p = sum(abs(gt)**2.0)*dt #parseval's theorem for time
freqdomain_p = sum(abs(hkk)**2.0)*dt/n # parseval's therom for frequency
difference = (timedomain_p-freqdomain_p)/timedomain_p*100 #percentage diff
tdomain_pn = sum(abs(randomvalues)**2.0)*dt #parseval's for time
fdomain_pn = sum(abs(hkn)**2.0)*dt/n # parseval's for frequency
difference_n = (tdomain_pn-fdomain_pn)/tdomain_pn*100 #percent diff
return difference,difference_n
def definingvalues(max_amp,length):
noise_amplitude = np.linspace(0,max_amp,length) #defining noise amplitude
difference = np.zeros((2,len(noise_amplitude)))
randomvals = np.random.random(int(1e7)) #defining noise
for i in range(len(noise_amplitude)):
difference[:,i] = (findingdifference(noise_amplitude[i]*randomvals))
return noise_amplitude,difference
def figure(max_amp,length):
noise_amplitude,difference = definingvalues(max_amp,length)
plt.figure()
plt.plot(noise_amplitude,difference[0,:],color='red')
plt.plot(noise_amplitude,difference[1,:],color='blue')
plt.xlabel('Noise_Variable')
plt.ylabel(r'Difference in $\%$')
plt.show()
return
figure(max_amp=3,length=21)
My final graph looks like this figure. Am I doing something wrong when working this out? Is there an physical reason that this trend occurs with added noise? Is it to do with doing a fourier transform on a not perfectly sinusoidal signal? The reason I am doing this is to understand a very noisy sinusoidal signal that I have real data for.
Parseval's theorem holds in general if you use the whole spectrum (positive and negative) frequencies to compute the power.
The reason for the discrepancy is the DC (f=0) component, which is treated somewhat special.
First, where does the DC component come from? You use np.random.random to generate random values between 0 and 1. So on average you raise the signal by 0.5*noise_amplitude, which entails a lot of power. This power is correctly computed in the time domain.
However, in the frequency domain, there is only a single FFT bin that corresponds to f=0. The power of all other frequencies is distributed over two bins, only the DC power is contained in a single bin.
By scaling the noise you add DC power. By removing the negative frequencies you remove half the signal power, but most of the noise power is located in the DC component which is used fully.
You have several options:
Use all frequencies to compute the power.
Use noise without a DC component: randomvals = np.random.random(int(1e7)) - 0.5
"Fix" the power calculation by removing half of the DC power: hkk[fftfreq==0] /= np.sqrt(2)
I'd go with option 1. The second might be OK and I don't really recommend 3.
Finally, there is a minor problem with the code:
fftfreq = fftfreq[fftfreq>0] #only taking positive frequencies
hkk = hkk[fftfreq>0]
hkn = hkn[fftfreq>0]
This does not really make sense. Better change it to
hkk = hkk[fftfreq>=0]
hkn = hkn[fftfreq>=0]
or completely remove it for option 1.

How to generate Scikit-Learn Gaussian Process regression with 2D input, 1D output

I have been looking for the answer to my question for quite a while. No Luck so far :(. I will my question as simple as possible. For the simplicity I only have a 2D input(it will eventually grow). Lets say I am using two variables (feature : Vehicle Odometer measurement, New Car Price) to predict a value of the car (target : Old car price) How can I train sklearn.gaussian_process.GaussianProcessRegressor to predict what I am looking for.
from sklearn import gaussian_process
X_train = np.array(X).reshape((-1, 2)).astype(int)
y_train = np.array(y).reshape(-1,1).astype(int)
GPR = gaussian_process.GaussianProcessRegressor(normalize_y = False,n_restarts_optimizer = 3)
GPR.fit(X_train,y_train)
#creating random points for testing the data
X_test_Odometer = np.linspace(0, 268000, 1000)[:, None]
X_test_Price = random.sample(range(5000, 13000), 1000)
X_test = np.column_stack((X_test_Odometer,X_test_Price)).astype(int)
GPR.predict(X_test)
This prediction doesnot work at all. I do not know whether I need to customize a kernel. If yes, I do not know how to. I am new to scikit and any help would be appreciated :)

How to create a container in pymc3

I am trying to build a model for the likelihood function of a particular outcome of a Langevin equation (Brownian particle in a harmonic potential):
Here is my model in pymc2 that seems to work:
https://github.com/hstrey/BayesianAnalysis/blob/master/Langevin%20simulation.ipynb
#define the model/function to be fitted.
def model(x):
t = pm.Uniform('t', 0.1, 20, value=2.0)
A = pm.Uniform('A', 0.1, 10, value=1.0)
#pm.deterministic(plot=False)
def S(t=t):
return 1-np.exp(-4*delta_t/t)
#pm.deterministic(plot=False)
def s(t=t):
return np.exp(-2*delta_t/t)
path = np.empty(N, dtype=object)
path[0]=pm.Normal('path_0',mu=0, tau=1/A, value=x[0], observed=True)
for i in range(1,N):
path[i] = pm.Normal('path_%i' % i,
mu=path[i-1]*s,
tau=1/A/S,
value=x[i],
observed=True)
return locals()
mcmc = pm.MCMC( model(x) )
mcmc.sample( 20000, 2000, 10 )
The basic idea is that each point depends on the previous point in the chain (Markov chain). Btw, x is an array of data, N is its length, delta_t is the time step =0.01. Any idea how to implement this in pymc3? I tried:
# define the model/function for diffusion in a harmonic potential
DHP_model = pm.Model()
with DHP_model:
t = pm.Uniform('t', 0.1, 20)
A = pm.Uniform('A', 0.1, 10)
S=1-pm.exp(-4*delta_t/t)
s=pm.exp(-2*delta_t/t)
path = np.empty(N, dtype=object)
path[0]=pm.Normal('path_0',mu=0, tau=1/A, observed=x[0])
for i in range(1,N):
path[i] = pm.Normal('path_%i' % i,
mu=path[i-1]*s,
tau=1/A/S,
observed=x[i])
Unfortunately the model crashes as soon as I try to run it. I tried some pymc3 examples (tutorial) on my machine and this is working.
Thanks in advance. I am really hoping that the new samplers in pymc3 will help me with this model. I am trying to apply Bayesian methods to single-molecule experiments.
Rather than creating many individual normally-distributed 1-D variables in a loop, you can make a custom distribution (by extending Continuous) that knows the formula for computing the log likelihood of your entire path. You can bootstrap this likelihood formula off of the Normal likelihood formula that pymc3 already knows. See the built-in AR1 class for an example.
Since your particle follows the Markov property, your likelihood looks like
import theano.tensor as T
def logp(path):
now = path[1:]
prev = path[:-1]
loglik_first = pm.Normal.dist(mu=0., tau=1./A).logp(path[0])
loglik_rest = T.sum(pm.Normal.dist(mu=prev*ss, tau=1./A/S).logp(now))
loglik_final = loglik_first + loglik_rest
return loglik_final
I'm guessing that you want to draw a value for ss at every time step, in which case you should make sure to specify ss = pm.exp(..., shape=len(x)-1), so that prev*ss in the block above gets interpreted as element-wise multiplication.
Then you can just specify your observations with
path = MyLangevin('path', ..., observed=x)
This should run much faster.
Since I did not see an answer to my question, let me answer it myself. I came up with the following solution:
# now lets model this data using pymc
# define the model/function for diffusion in a harmonic potential
DHP_model = pm.Model()
with DHP_model:
D = pm.Gamma('D',mu=mu_D,sd=sd_D)
A = pm.Gamma('A',mu=mu_A,sd=sd_A)
S=1.0-pm.exp(-2.0*delta_t*D/A)
ss=pm.exp(-delta_t*D/A)
path=pm.Normal('path_0',mu=0.0, tau=1/A, observed=x[0])
for i in range(1,N):
path = pm.Normal('path_%i' % i,
mu=path*ss,
tau=1.0/A/S,
observed=x[i])
start = pm.find_MAP()
print(start)
trace = pm.sample(100000, start=start)
unfortunately, this code takes at N=50 anywhere between 6hours to 2 days to compile. I am running on a pretty fast PC (24Gb RAM) running Ubuntu. I tried to using the GPU but that runs slightly slower. I suspect memory problems since it uses 99.8% of the memory when running. I tried the same calculation with Stan and it only takes 2min to run.

Scikit-learn 0.15.2 - OneVsRestClassifier not works due to predict_proba not available

I am trying to do onevsrest classification like below:
classifier = Pipeline([('vectorizer', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', OneVsRestClassifier(SVC(kernel='rbf')))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
And I get the error 'predict_proba is not available when probability = false'. I saw that there was a bug reported, the one below:
https://github.com/scikit-learn/scikit-learn/issues/1946
And it was closed as fixed, so I killed scikit-learn from my Windows PC and completely re-downloaded scikit-learn to have version 0.15.2. But I still get this error. Any suggestions? Or I understood this wrong, and I still can't use SVC with OneVSRestClassifier unless I specify probability=true?
UPDATE: just to clarify, I am trying to actually achieve multi-label classification, here is data source:
df = pd.read_csv(fileIn, header = 0, encoding='utf-8-sig')
rows = random.sample(df.index, int(len(df) * 0.9))
work = df.ix[rows]
work_test = df.drop(rows)
X_train = []
y_train = []
X_test = []
y_test = []
for i in work[[i for i in list(work.columns.values) if i.startswith('Change')]].values:
X_train.append(','.join(i.T.tolist()))
X_train = np.array(X_train)
for i in work[[i for i in list(work.columns.values) if i.startswith('Corax')]].values:
y_train.append(list(i))
for i in work_test[[i for i in list(work_test.columns.values) if i.startswith('Change')]].values:
X_test.append(','.join(i.T.tolist()))
X_test = np.array(X_test)
for i in work_test[[i for i in list(work_test.columns.values) if i.startswith('Corax')]].values:
y_test.append(list(i))
lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y_train)
And after that I send it to pipeline mentioned earlier
Ok, I did some investigation in code. OneVsRestClassifier tries to call decision_function first and if it fails - it goes for predict_proba function of base classifier (svm.svc in our case).
As far as I see, my X_test is numpy.array of lists of strings. After it undergoes a sequence of transformations specified in pipeline CountVectorizer -> TfidfTransformer it becomes a sparse matrix (by design of these things). As I see currently decision_function is not available for sparse matrices, and there is even an open suggestion on github: https://github.com/scikit-learn/scikit-learn/issues/73
So, to summarize, looks like you can't make a multilabel classification using svm.svc unless you specify probability=True. If you do this you introduce some overhead to the classifier.fit process but it will work.