How to create a container in pymc3

I am trying to build a model for the likelihood function of a particular outcome of a Langevin equation (a Brownian particle in a harmonic potential). Here is my model in pymc2, which seems to work:
https://github.com/hstrey/BayesianAnalysis/blob/master/Langevin%20simulation.ipynb
# define the model/function to be fitted.
def model(x):
    t = pm.Uniform('t', 0.1, 20, value=2.0)
    A = pm.Uniform('A', 0.1, 10, value=1.0)

    @pm.deterministic(plot=False)
    def S(t=t):
        return 1 - np.exp(-4 * delta_t / t)

    @pm.deterministic(plot=False)
    def s(t=t):
        return np.exp(-2 * delta_t / t)

    path = np.empty(N, dtype=object)
    path[0] = pm.Normal('path_0', mu=0, tau=1/A, value=x[0], observed=True)
    for i in range(1, N):
        path[i] = pm.Normal('path_%i' % i,
                            mu=path[i-1]*s,
                            tau=1/A/S,
                            value=x[i],
                            observed=True)
    return locals()

mcmc = pm.MCMC(model(x))
mcmc.sample(20000, 2000, 10)
The basic idea is that each point depends on the previous point in the chain (a Markov chain). By the way, x is an array of data, N is its length, and delta_t is the time step (0.01). Any idea how to implement this in pymc3? I tried:
# define the model/function for diffusion in a harmonic potential
DHP_model = pm.Model()
with DHP_model:
    t = pm.Uniform('t', 0.1, 20)
    A = pm.Uniform('A', 0.1, 10)
    S = 1 - pm.exp(-4 * delta_t / t)
    s = pm.exp(-2 * delta_t / t)
    path = np.empty(N, dtype=object)
    path[0] = pm.Normal('path_0', mu=0, tau=1/A, observed=x[0])
    for i in range(1, N):
        path[i] = pm.Normal('path_%i' % i,
                            mu=path[i-1]*s,
                            tau=1/A/S,
                            observed=x[i])
Unfortunately, the model crashes as soon as I try to run it. I tried some of the pymc3 examples (from the tutorial) on my machine and those work fine.
Thanks in advance. I am really hoping that the new samplers in pymc3 will help me with this model. I am trying to apply Bayesian methods to single-molecule experiments.

Rather than creating many individual normally-distributed 1-D variables in a loop, you can make a custom distribution (by extending Continuous) that knows the formula for computing the log likelihood of your entire path. You can bootstrap this likelihood formula off of the Normal likelihood formula that pymc3 already knows. See the built-in AR1 class for an example.
Since your particle follows the Markov property, your likelihood looks like
import theano.tensor as T

def logp(path):
    now = path[1:]
    prev = path[:-1]

    loglik_first = pm.Normal.dist(mu=0., tau=1./A).logp(path[0])
    loglik_rest = T.sum(pm.Normal.dist(mu=prev*ss, tau=1./A/S).logp(now))
    loglik_final = loglik_first + loglik_rest

    return loglik_final
I'm guessing that you want to draw a value for ss at every time step, in which case you should make sure to specify ss = pm.exp(..., shape=len(x)-1), so that prev*ss in the block above gets interpreted as element-wise multiplication.
Then you can just specify your observations with
path = MyLangevin('path', ..., observed=x)
This should run much faster.
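For concreteness, here is a minimal sketch of what such a custom distribution could look like, assuming the Theano-era pymc3 API and using MyLangevin as a placeholder name (modeled on the built-in AR1 class; the parameters A, S, ss follow the question's notation, and shape/usage details may need adjusting):

import theano.tensor as T
import pymc3 as pm
from pymc3.distributions.continuous import Continuous

class MyLangevin(Continuous):
    """Joint log likelihood of the whole observed path (hypothetical sketch)."""
    def __init__(self, A, S, ss, *args, **kwargs):
        super(MyLangevin, self).__init__(*args, **kwargs)
        self.A = A    # amplitude parameter, as in the question
        self.S = S    # 1 - exp(-4*delta_t/t)
        self.ss = ss  # exp(-2*delta_t/t)

    def logp(self, path):
        A, S, ss = self.A, self.S, self.ss
        now, prev = path[1:], path[:-1]
        loglik_first = pm.Normal.dist(mu=0., tau=1./A).logp(path[0])
        loglik_rest = T.sum(pm.Normal.dist(mu=prev*ss, tau=1./A/S).logp(now))
        return loglik_first + loglik_rest

# usage, assuming t, A, S, ss are defined inside a model context as in the question:
# path = MyLangevin('path', A=A, S=S, ss=ss, shape=N, observed=x)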

Since I did not see an answer to my question, let me answer it myself. I came up with the following solution:
# now let's model this data using pymc
# define the model/function for diffusion in a harmonic potential
DHP_model = pm.Model()
with DHP_model:
    D = pm.Gamma('D', mu=mu_D, sd=sd_D)
    A = pm.Gamma('A', mu=mu_A, sd=sd_A)
    S = 1.0 - pm.exp(-2.0 * delta_t * D / A)
    ss = pm.exp(-delta_t * D / A)
    path = pm.Normal('path_0', mu=0.0, tau=1/A, observed=x[0])
    for i in range(1, N):
        path = pm.Normal('path_%i' % i,
                         mu=path*ss,
                         tau=1.0/A/S,
                         observed=x[i])
    start = pm.find_MAP()
    print(start)
    trace = pm.sample(100000, start=start)
Unfortunately, at N=50 this code takes anywhere between 6 hours and 2 days to compile. I am running it on a fairly fast PC (24 GB RAM) running Ubuntu. I tried using the GPU, but that runs slightly slower. I suspect memory problems, since the run uses 99.8% of the memory. I tried the same calculation with Stan and it only takes 2 minutes to run.
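For what it is worth, here is a hedged sketch of the same model with the Python loop replaced by a single vectorized Normal over the observed path. Since the path is fully observed, each point can be conditioned directly on the previous data value; this is only an untested rewrite of the model above, not the original code:

import numpy as np
import pymc3 as pm

with pm.Model() as DHP_vectorized:
    D = pm.Gamma('D', mu=mu_D, sd=sd_D)
    A = pm.Gamma('A', mu=mu_A, sd=sd_A)
    S = 1.0 - pm.math.exp(-2.0 * delta_t * D / A)
    ss = pm.math.exp(-delta_t * D / A)
    # first point of the path
    pm.Normal('path_0', mu=0.0, tau=1.0 / A, observed=x[0])
    # all remaining points at once, each conditioned on the previous observed value
    pm.Normal('path', mu=x[:-1] * ss, tau=1.0 / A / S, observed=x[1:])
    trace = pm.sample(2000)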

Related

Update weights by different models

I have the following problem:
I must identify whether a data point is an outlier or not (we don't have labels). I have several unsupervised models for identifying outliers. I normalize the outlier scores and combine them via a weighted average. Since I don't have any information about the models' accuracy, I use the same weight for each model.
Now, suppose that I also have labels for a small fraction of the dataset.
How can I update the weights according to this new information?
If you have any resources on this, please share them, because I didn't find anything.
Thank you in advance.
I looked at some resources about Bayesian model averaging, but I don't know if it is the correct approach. I have also implemented an idea, but I'm not sure it is correct.
import numpy as np

def bayesian_update(anomaly, weight, prob):
    # posterior = prob(anomaly | model)
    posterior = np.zeros(len(anomaly))
    for i in range(len(anomaly)):
        if anomaly[i] == 1:
            posterior[i] = prob[i] * weight
        else:
            posterior[i] = (1 - prob[i]) * weight
    return posterior

np.random.seed(0)
n_observations = 100
n_models = 4

models_probs = np.random.rand(n_observations, n_models)
anomaly = np.where(models_probs[:, 0] > 0.5, 1, 0)
posterior_sum = np.zeros(n_models)
for i in range(n_models):
    posterior_sum[i] = np.sum(bayesian_update(anomaly, 0.25, models_probs[:, i]))
new_weight = posterior_sum / np.sum(posterior_sum)
print(new_weight)
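For reference, the equal-weight combination of normalized scores described at the top of the question might look like this (a minimal sketch; the array names are illustrative, not from the post):

import numpy as np

np.random.seed(0)
scores = np.random.rand(100, 4)                            # normalized outlier scores, one column per model
weights = np.full(scores.shape[1], 1.0 / scores.shape[1])  # equal weights, no accuracy information
combined = scores @ weights                                # weighted-average outlier score per data point
print(combined[:5])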

Specifying the Mean and Variance in a Scipy Distribution Python 2.7

I need to randomly sample from some distribution eventually, so I need one that allows me to readily change the mean and variance. I'm looking at using distributions from the scipy.stats library; however, I'm having difficulty seeing how the parameters "loc" and "scale" relate to the quantities I'm interested in. I'd like to be able to do something like:
x = numpy.linspace(0,5,1000)
y = scipy.stats.maxwell(x, mean, variance)
But loc and scale seem to be the only other arguments that function takes.
Can anyone specify the relationship those quantities must have to mean and variance, or suggest a better library to use?
Well, I don't have Python 2.7, so this answer is for Python 3.6, but it should work; it is SciPy after all.
Basically, you have to extract the scale and loc parameters from the given μ and σ. Here are two simple functions to do that, plus some sampling to prove we're getting the right values. The first printed line is what you want, and the third line is the result of sampling, which should be roughly the same. The second line is scale and loc computed from μ and σ. Play with the numbers and see how it goes.
import numpy as np
from scipy.stats import maxwell

def get_scale_from_sigma(sigma):
    """Compute scale from sigma based upon http://mathworld.wolfram.com/MaxwellDistribution.html"""
    a2 = np.pi * sigma**2 / (3.0*np.pi - 8.0)  # variance of Maxwell = a^2 * (3*pi - 8) / pi
    return np.sqrt(a2)

def get_loc_from_mu_sigma(mu, sigma):
    """Compute loc from mu/sigma based upon http://mathworld.wolfram.com/MaxwellDistribution.html"""
    scale = get_scale_from_sigma(sigma)
    loc = mu - 2.0 * scale * np.sqrt(2.0 / np.pi)  # mean of Maxwell = loc + 2*scale*sqrt(2/pi)
    return loc

sigma = 1.0
mu = 2.0 * get_scale_from_sigma(sigma) * np.sqrt(2.0 / np.pi)  # + 3.0 as shift, for example
print(mu, sigma)

scale = get_scale_from_sigma(sigma)
loc = get_loc_from_mu_sigma(mu, sigma)
print(scale, loc)

q = maxwell.rvs(size=10000, scale=scale, loc=loc)
print(np.mean(q), np.std(q))

How to sample independently with pymc3

I am working with a simple bivariate normal model with a somewhat unconventional prior. The main issue I have is that my posteriors are inconsistent from one run to the next, which I'm guessing is related to an issue of high dependence between consecutive samples. Here are my specific questions.
What is the best way to get N independent samples? At the moment, I've been calling sample() to get a big chain (e.g. length 10,000) and then taking every 100th sample starting at 1,000. But looking now at an autocorrelation profile of one of the parameters, it looks like I need to take at least every 500th sample! (I could also use mutual information to get a better idea of dependence between lags.)
I've been following the fitting procedure described in the stochastic volatility example in the pymc3 tutorial. In particular I first find the MAP, then use it to generate a NUTS() object, then take a short sample, then use that to generate another NUTS() object, using gamma=0.25 (???), then finally get my big sample. I have no idea whether this is appropriate or whether I need the gamma=0.25.
Also, in that same example, there are testvals for the Exponential distribution. I don't know if I need these. (What is wrong with the default use of the mean?)
Here is the actual model I'm using.
import pymc3 as pymc
import numpy as np
import theano.tensor as th
from pymc3.distributions.continuous import Gamma, Uniform, Normal, Bounded
from pymc3.distributions.multivariate import MvNormal
from pymc3.model import Deterministic

data = np.random.randn(3000, 2) / 300  # I have actual data!

with pymc.Model():
    tau = Gamma('tau', alpha=2, beta=1 / 20000)
    sigma = Deterministic('sigma', 1 / th.sqrt(tau))
    corr = Uniform('corr', lower=0, upper=1)
    alpha_sig = Deterministic('alpha_sig', sigma / 50)
    alpha_post = Normal('alpha_post', mu=0, sd=alpha_sig)
    alpha_pre = Bounded(
        'alpha_pre', Normal, alpha_post, np.Inf, mu=0, sd=alpha_sig)
    corr_inv = th.stack([th.stack([1, -corr]),
                         th.stack([-corr, 1])]) / (1 - th.sqr(corr))
    MvNormal(
        'data', mu=th.stack([alpha_post, alpha_pre]),
        tau=tau * corr_inv, observed=data)
    map_ = pymc.find_MAP()
    step1 = pymc.NUTS(scaling=map_)
    trace1 = pymc.sample(1000, step=step1)
    step2 = pymc.NUTS(scaling=trace1[-1], gamma=0.25)
    trace2 = pymc.sample(10000, step=step2, start=trace1[-1])
I'm not sure what you're doing with the complex prior structure you have set up, but I think there is something wrong there.
I simplified the model to:
import pymc3 as pymc
import numpy as np
import theano.tensor as th
from pymc3.distributions.continuous import Gamma, Uniform, Normal, Bounded
from pymc3.distributions.multivariate import MvNormal
from pymc3.model import Deterministic

data = np.random.randn(3000, 2)  # I have actual data!

with pymc.Model():
    corr = Uniform('corr', lower=0, upper=1)
    corr_inv = th.stack([th.stack([1, -corr]),
                         th.stack([-corr, 1])]) / (1 - th.sqr(corr))
    mu = Normal('mu', mu=0, sd=1, shape=2)
    MvNormal('data',
             mu=mu,
             tau=corr_inv,
             observed=data)
    map_ = pymc.find_MAP()
    step1 = pymc.NUTS(scaling=map_)
    trace1 = pymc.sample(1000, step=step1)
    step2 = pymc.NUTS(scaling=trace1[-1])
    trace2 = pymc.sample(10000, step=step2, start=trace1[-1])
This has great convergence. I think you can also just drop the gamma parameter.
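Regarding the question about getting (approximately) independent samples, here is a small sketch, assuming the same pymc3/Theano-era API used above, of how one might inspect autocorrelation and thin the trace after sampling:

import pymc3 as pymc
import matplotlib.pyplot as plt

# visualize how correlated consecutive draws of 'corr' are
pymc.autocorrplot(trace2, varnames=['corr'])
plt.show()

# discard a burn-in and keep every 100th draw as roughly independent samples
thinned = trace2[1000::100]
print(len(thinned))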

PCA for dimensionality reduction before Random Forest

I am working on a binary-class random forest with approximately 4500 variables. Many of these variables are highly correlated, and some of them are just quantiles of an original variable. I am not quite sure whether it would be wise to apply PCA for dimensionality reduction. Would this increase model performance?
I would like to be able to know which variables are more significant to my model, but if I use PCA, I would be only able to tell what PCs are more important.
Many thanks in advance.
My experience is that PCA before RF is not a great advantage, if any. Principal component regression (PCR) is when PCA is used to regularize training features before OLS linear regression, and that is very much needed for sparse data sets. Since RF itself already performs good/fair regularization without assuming linearity, it is not necessarily an advantage. That said, I found myself writing a PCA-RF wrapper for R two weeks ago. The code includes a simulated data set of 100 features comprising only 5 true linear components. Under such circumstances it is in fact a small advantage to pre-filter with PCA.
The code is a seamless implementation, such that all RF parameters are simply passed on to randomForest. The loading vectors are saved in the model fit, to be used during prediction.
"I would like to be able to know which variables are more significant to my model, but if I use PCA, I would be only able to tell what PCs are more important."
The easy way is to run without PCA, obtain variable importances, and expect to find something similar for PCA-RF.
The tedious way is to wrap PCA-RF in a new bagging scheme with your own variable-importance code. It could be done in 50-100 lines or so.
The source-code suggestion for PCA-RF:
#wrap PCA around randomForest, forward any other arguments to randomForest
#define as new S3 model class
train_PCA_RF = function(x, y, ncomp=5, ...) {
  f.args = as.list(match.call()[-1])
  pca_obj = princomp(x)
  rf_obj = do.call(randomForest, c(alist(x=pca_obj$scores[,1:ncomp]), f.args[-1]))
  out = mget(ls())
  class(out) = "PCA_RF"
  return(out)
}

#print method
print.PCA_RF = function(object) print(object$rf_obj)

#predict method
predict.PCA_RF = function(object, Xtest=NULL, ...) {
  print("predicting PCA_RF")
  f.args = as.list(match.call()[-1])
  if(is.null(f.args$Xtest)) stop("cannot predict without newdata parameter")
  sXtest = predict(object$pca_obj, Xtest) #scale Xtest as Xtrain was scaled before
  return(do.call(predict, c(alist(object = object$rf_obj, #class(x)="randomForest" invokes method predict.randomForest
                                  newdata = sXtest),      #newdata input, see help(predict.randomForest)
                            f.args[-1:-2])))              #any other parameters are passed to predict.randomForest
}

#testTrain predict #
make.component.data = function(
  inter.component.variance = .9,
  n.real.components = 5,
  nVar.per.component = 20,
  nObs = 600,
  noise.factor = .2,
  hidden.function = function(x) apply(x, 1, mean),
  plot_PCA = T
){
  Sigma = matrix(inter.component.variance,
                 ncol=nVar.per.component,
                 nrow=nVar.per.component)
  diag(Sigma) = 1
  x = do.call(cbind, replicate(n = n.real.components,
                               expr = {mvrnorm(n=nObs,
                                               mu=rep(0,nVar.per.component),
                                               Sigma=Sigma)},
                               simplify = FALSE)
  )
  if(plot_PCA) plot(prcomp(x, center=T, scale.=T))
  y = hidden.function(x)
  ynoised = y + rnorm(nObs, sd=sd(y)) * noise.factor
  out = list(x=x, y=ynoised)
  pars = ls()[!ls() %in% c("x","y","Sigma")]
  attr(out,"pars") = mget(pars) #attach all pars as attributes
  return(out)
}
An example run:
#start script------------------------------
#source above from separate script
#test
library(MASS)
library(randomForest)

Data = make.component.data(nObs=600) #plots PC variance
train = list(x=Data$x[  1:300,], y=Data$y[  1:300])
test  = list(x=Data$x[301:600,], y=Data$y[301:600])

rf  = randomForest(train$x, train$y, ntree=50) #regular RF
rf2 = train_PCA_RF(train$x, train$y, ntree=50, ncomp=12)

rf
rf2

pred_rf  = predict(rf , test$x)
pred_rf2 = predict(rf2, test$x)

cat("rf, R^2:", cor(test$y, pred_rf)^2, "PCA_RF, R^2:", cor(test$y, pred_rf2)^2)

cor(test$y, predict(rf , test$x))^2
cor(test$y, predict(rf2, test$x))^2

pairs(list(trueY     = test$y,
           native_rf = pred_rf,
           PCA_RF    = pred_rf2)
)
You can have a look here to get a better idea. The link says to use PCA for smaller datasets! Some of my colleagues have used random forests for the same purpose when working with genomes. They had ~30000 variables and a large amount of RAM.
Another thing I found is that random forests use a lot of memory, and you have 4500 variables. So maybe you could apply PCA to the individual trees.
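As a rough Python counterpart to the R wrapper above (not the original author's code), scikit-learn's Pipeline chains PCA with a random forest in a few lines; the shapes and component count below are made up for illustration:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

X = np.random.randn(600, 100)                 # stand-in for the real feature matrix
y = (X[:, :5].mean(axis=1) > 0).astype(int)   # toy binary target

pca_rf = Pipeline([
    ("pca", PCA(n_components=12)),            # keep the first 12 principal components
    ("rf", RandomForestClassifier(n_estimators=50)),
])
pca_rf.fit(X, y)
print(pca_rf.score(X, y))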

Fitting a Gaussian, getting a straight line. Python 2.7

As my title suggests, I'm trying to fit a Gaussian to some data and I'm just getting a straight line. I've been looking at these other discussions, Gaussian fit for Python and Fitting a gaussian to a curve in Python, which seem to suggest basically the same thing. I can make the code in those discussions work fine for the data they provide, but it won't work for my data.
My code looks like this:
import pylab as plb
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy import asarray as ar, exp

y = y - y[0]  # to make it go to zero on both sides
x = range(len(y))
max_y = max(y)
n = len(y)
mean = sum(x*y)/n
sigma = np.sqrt(sum(y*(x-mean)**2)/n)
# Someone on a previous post seemed to think this needed to have the sqrt.
# Tried it without as well, made no difference.

def gaus(x, a, x0, sigma):
    return a*exp(-(x-x0)**2/(2*sigma**2))

popt, pcov = curve_fit(gaus, x, y, p0=[max_y, mean, sigma])
# It was suggested in one of the other posts I looked at to make the
# first element of p0 be the maximum value of y.
# I also tried it as 1, but that did not work either

plt.plot(x, y, 'b:', label='data')
plt.plot(x, gaus(x, *popt), 'r:', label='fit')
plt.legend()
plt.title('Fig. 3 - Fit for Time Constant')
plt.xlabel('Time (s)')
plt.ylabel('Voltage (V)')
plt.show()
The data I am trying to fit is as follows:
y = array([ 6.95301373e+12, 9.62971320e+12, 1.32501876e+13,
1.81150568e+13, 2.46111132e+13, 3.32321345e+13,
4.45978682e+13, 5.94819771e+13, 7.88394616e+13,
1.03837779e+14, 1.35888594e+14, 1.76677210e+14,
2.28196006e+14, 2.92781632e+14, 3.73133045e+14,
4.72340762e+14, 5.93892782e+14, 7.41632194e+14,
9.19750269e+14, 1.13278296e+15, 1.38551838e+15,
1.68291212e+15, 2.02996957e+15, 2.43161742e+15,
2.89259207e+15, 3.41725793e+15, 4.00937676e+15,
4.67187762e+15, 5.40667931e+15, 6.21440313e+15,
7.09421973e+15, 8.04366842e+15, 9.05855930e+15,
1.01328502e+16, 1.12585509e+16, 1.24257598e+16,
1.36226443e+16, 1.48356404e+16, 1.60496345e+16,
1.72482199e+16, 1.84140400e+16, 1.95291969e+16,
2.05757166e+16, 2.15360187e+16, 2.23933053e+16,
2.31320228e+16, 2.37385276e+16, 2.42009864e+16,
2.45114362e+16, 2.46427484e+16, 2.45114362e+16,
2.42009864e+16, 2.37385276e+16, 2.31320228e+16,
2.23933053e+16, 2.15360187e+16, 2.05757166e+16,
1.95291969e+16, 1.84140400e+16, 1.72482199e+16,
1.60496345e+16, 1.48356404e+16, 1.36226443e+16,
1.24257598e+16, 1.12585509e+16, 1.01328502e+16,
9.05855930e+15, 8.04366842e+15, 7.09421973e+15,
6.21440313e+15, 5.40667931e+15, 4.67187762e+15,
4.00937676e+15, 3.41725793e+15, 2.89259207e+15,
2.43161742e+15, 2.02996957e+15, 1.68291212e+15,
1.38551838e+15, 1.13278296e+15, 9.19750269e+14,
7.41632194e+14, 5.93892782e+14, 4.72340762e+14,
3.73133045e+14, 2.92781632e+14, 2.28196006e+14,
1.76677210e+14, 1.35888594e+14, 1.03837779e+14,
7.88394616e+13, 5.94819771e+13, 4.45978682e+13,
3.32321345e+13, 2.46111132e+13, 1.81150568e+13,
1.32501876e+13, 9.62971320e+12, 6.95301373e+12,
4.98705540e+12])
I would show you what it looks like, but apparently I don't have enough reputation points...
Anyone got any idea why it's not fitting properly?
Thanks for your help :)
The importance of the initial guess, p0 in curve_fit's default argument list, cannot be stressed enough.
Notice that the docstring mentions that
[p0] If None, then the initial values will all be 1
So if you do not supply it, it will use an initial guess of 1 for all parameters you're trying to optimize for.
The choice of p0 affects the speed at which the underlying algorithm changes the guess vector p0 (ref. the documentation of least_squares).
When you look at the data, you'll notice that the maximum and the mean, mu_0, of the Gaussian-like dataset y are about 2.4e16 and 49, respectively. With the peak value so large, the algorithm would need to make drastic changes to its initial guess to reach that value.
When you supply a good initial guess to the curve fitting algorithm, convergence is more likely to occur.
Using your data, you can supply a good initial guess for the peak_value, the mean and sigma, by writing them like this:
y = np.array([...]) # starting from the original dataset
x = np.arange(len(y))
peak_value = y.max()
mean = x[y.argmax()] # observation of the data shows that the peak is close to the center of the interval of the x-data
sigma = mean - np.where(y > peak_value * np.exp(-.5))[0][0] # when x is sigma in the gaussian model, the function evaluates to a*exp(-.5)
popt,pcov = curve_fit(gaus, x, y, p0=[peak_value, mean, sigma])
print(popt) # prints: [ 2.44402560e+16 4.90000000e+01 1.20588976e+01]
Note that in your code you take sum(x*y)/n for the mean, which is strange, because this modulates the Gaussian with a polynomial of degree 1 (it multiplies the Gaussian by a monotonically increasing line of constant slope) before taking the mean. That will offset the mean value of y (in this case to the right). A similar remark can be made about your calculation of sigma.
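To make that remark concrete, here is a small sketch of moment-based initial guesses that weight x by y (dividing by sum(y) rather than n), as an alternative to the argmax-based guesses above; it assumes x, y, and gaus are defined as in the preceding code:

import numpy as np
from scipy.optimize import curve_fit

# weighted mean and standard deviation, treating y as (unnormalized) weights over x
mean_w = np.sum(x * y) / np.sum(y)
sigma_w = np.sqrt(np.sum(y * (x - mean_w)**2) / np.sum(y))

popt, pcov = curve_fit(gaus, x, y, p0=[y.max(), mean_w, sigma_w])
print(popt)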
Final remark: the histogram of y will not resemble a Gaussian, as y is already a Gaussian. The histogram will merely bin (count) values into different categories (answering the question "how many datapoints in y reach a value between [a, b]?").