In pymc3, drawing the binomial n parameter from a Categorical distribution throws an error

To fix ideas, consider the following simple model:
import numpy as np
import pymc3 as pm
# (a few other imports)

obsv = np.array([12., 45., 64., 34., 57.])
M = 400

with pm.Model() as model:
    p_cat = (1/M)*np.ones(M)
    theta = pm.Beta('rate', 1, 1)
    count = pm.Categorical('count', p_cat)
    # likelihood
    resp = pm.Binomial('resp', n=count, p=theta, observed=obsv)
Now, if I try to find the MAP, it throws the following error:
ValueError: Optimization error: max, logp or dlogp at max have non-finite values. Some values may be outside of distribution support. max: {'count': array(0), 'rate_logodds_': array(0.0)}.. etc. etc.
At first glance it seems there is something wrong with sampling from the count variable.
And sure enough, when I run:
count.tag.test_value # -> 0
(My libraries are up to date. I'm using Python 3.)
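For what it's worth, a hedged workaround (not from the original post): the observed counts go up to 64, while the Categorical's default test value is 0, so the starting point n=0 puts the data outside the Binomial's support and makes the initial logp non-finite. A minimal sketch of a fix, assuming the standard testval keyword:

with pm.Model() as model:
    p_cat = (1/M)*np.ones(M)
    theta = pm.Beta('rate', 1, 1)
    # hypothetical fix: start the discrete count at a value large enough
    # to have generated every observation (max(obsv) = 64)
    count = pm.Categorical('count', p_cat, testval=M-1)
    resp = pm.Binomial('resp', n=count, p=theta, observed=obsv)

Even then, gradient-based MAP optimization cannot move the discrete count, so sampling it with a Metropolis step may be the more reliable route.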

Related

Gekko - if3() statement for choosing between two expressions in ODE

I tried solving a differential equation that has a logical condition in Gekko. I know that Gekko does not like these things, but I assumed that a simple if3() function to switch between the two given expressions (R1, R2) would produce a solution. Here is a simple example code that fails with Solution not found.
import numpy as np
from gekko import GEKKO
import matplotlib.pyplot as plt
A=1
B=1e-5
C=2
D=0.01
G=1
y0=120
m = GEKKO() # create GEKKO model
nt = 101
m.time = np.linspace(0,100,nt) # time points
y = m.Var(y0) # create GEKKO variable
R1 = m.Intermediate(-(y+A-A/(y/C+1)**(B/D))/G*(D*y+D*C))
R2 = m.Intermediate(-0.1*y)
z = m.Var()             # this way: Solution not found
z = m.if3(y-60,R1,R2)
m.Equation(y.dt() == z)
#m.Equation(y.dt()== R1)
#m.Equation(y.dt()== R2)
# solve ODE
m.options.IMODE = 4
m.options.NODES = 5
m.solve(disp=False)
# plot results
plt.plot(m.time,y)
plt.xlabel('time')
plt.ylabel('y(t)')
Unfortunately, I have not yet figured out how to overcome these problems, even after reading the article about logical conditions in optimization.
Best Regards,
Radovan
The posted script gives the error:
Exception: #error: Solution Not Found
To see more detail about the error, turn on the display with disp=True. The additional output indicates that the problem is infeasible:
Number of state variables: 2800
Number of total equations: - 2000
Number of slack variables: - 800
---------------------------------------
Degrees of freedom : 0
----------------------------------------------
Dynamic Simulation with APOPT Solver
----------------------------------------------
Iter: 1 I: -1 Tm: 13.07 NLPi: 9 Dpth: 0 Lvs: 0 Obj: 0.00E+00 Gap: NaN
Warning: no more possible trial points and no integer solution
Maximum iterations
---------------------------------------------------
Solver : APOPT (v1.0)
Solution time : 13.116200000000001 sec
Objective : 0.
Unsuccessful with error code 0
---------------------------------------------------
Creating file: infeasibilities.txt
Use command apm_get(server,app,'infeasibilities.txt') to retrieve file
#error: Solution Not Found
A few issues with the model:
z = m.Var() is not needed, because z = m.if3() already defines the variable.
IMODE=6 is needed because the problem contains slack variables, and an optimization mode is required to solve it.
Try if2() instead of if3(); it seems to work better with the MPCC form than with the Mixed Integer form.
The APOPT solver performs better for this problem; try m.options.SOLVER=1.
import numpy as np
from gekko import GEKKO
import matplotlib.pyplot as plt
A=1; B=1e-5; C=2; D=0.01; G=1
y0=120
m = GEKKO(remote=False) # create GEKKO model
nt = 101
m.time = np.linspace(0,50,nt) # time points
y = m.Var(y0) # create GEKKO variable
R1 = m.Intermediate(-(y+A-A/(y/C+1)**(B/D))/G*(D*y+D*C))
R2 = m.Intermediate(-0.1*y)
z = m.if2(y-60,0,1)
m.Equation(y.dt()== (1-z)*R1+z*R2)
# solve ODE
m.options.IMODE = 6
m.options.SOLVER = 1
m.options.NODES = 5
m.solve(disp=True)
# plot results
plt.subplot(2,1,1)
plt.plot(m.time,y)
plt.plot([0,50],[60,60],'r--')
plt.ylabel('y(t)'); plt.grid()
plt.subplot(2,1,2)
plt.plot(m.time,z)
plt.ylabel('z(t)'); plt.grid()
plt.xlabel('time')
plt.show()
The alternative to using an if statement in the model is to integrate in a loop and check the condition y>60 to switch to the other model.
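A minimal sketch of that loop alternative (not from the original answer), using plain explicit-Euler integration with the constants from the question and the same branch logic as if3 (R1 applies when y < 60, R2 otherwise):

import numpy as np
import matplotlib.pyplot as plt

# constants from the question
A=1; B=1e-5; C=2; D=0.01; G=1

def dydt(y):
    # same switch as if3(y-60, R1, R2): R1 below 60, R2 at or above
    if y < 60:
        return -(y+A-A/(y/C+1)**(B/D))/G*(D*y+D*C)  # R1
    return -0.1*y                                   # R2

dt = 0.01
n = int(50/dt)
t = np.linspace(0, 50, n+1)
y = np.empty(n+1)
y[0] = 120
for i in range(n):
    y[i+1] = y[i] + dt*dydt(y[i])  # explicit Euler step

plt.plot(t, y)
plt.xlabel('time'); plt.ylabel('y(t)')
plt.show()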

How to create a container in pymc3

I am trying to build a model for the likelihood function of a particular outcome of a Langevin equation (a Brownian particle in a harmonic potential).
Here is my model in pymc2 that seems to work:
https://github.com/hstrey/BayesianAnalysis/blob/master/Langevin%20simulation.ipynb
# define the model/function to be fitted
def model(x):
    t = pm.Uniform('t', 0.1, 20, value=2.0)
    A = pm.Uniform('A', 0.1, 10, value=1.0)

    @pm.deterministic(plot=False)
    def S(t=t):
        return 1 - np.exp(-4*delta_t/t)

    @pm.deterministic(plot=False)
    def s(t=t):
        return np.exp(-2*delta_t/t)

    path = np.empty(N, dtype=object)
    path[0] = pm.Normal('path_0', mu=0, tau=1/A, value=x[0], observed=True)
    for i in range(1, N):
        path[i] = pm.Normal('path_%i' % i,
                            mu=path[i-1]*s,
                            tau=1/A/S,
                            value=x[i],
                            observed=True)
    return locals()

mcmc = pm.MCMC(model(x))
mcmc.sample(20000, 2000, 10)
The basic idea is that each point depends on the previous point in the chain (a Markov chain). By the way, x is an array of data, N is its length, and delta_t is the time step (= 0.01). Any idea how to implement this in pymc3? I tried:
# define the model/function for diffusion in a harmonic potential
DHP_model = pm.Model()
with DHP_model:
    t = pm.Uniform('t', 0.1, 20)
    A = pm.Uniform('A', 0.1, 10)
    S = 1 - pm.exp(-4*delta_t/t)
    s = pm.exp(-2*delta_t/t)
    path = np.empty(N, dtype=object)
    path[0] = pm.Normal('path_0', mu=0, tau=1/A, observed=x[0])
    for i in range(1, N):
        path[i] = pm.Normal('path_%i' % i,
                            mu=path[i-1]*s,
                            tau=1/A/S,
                            observed=x[i])
Unfortunately, the model crashes as soon as I try to run it. I tried some of the pymc3 examples (from the tutorial) on my machine, and those work.
Thanks in advance. I am really hoping that the new samplers in pymc3 will help me with this model. I am trying to apply Bayesian methods to single-molecule experiments.
Rather than creating many individual normally-distributed 1-D variables in a loop, you can make a custom distribution (by extending Continuous) that knows the formula for computing the log likelihood of your entire path. You can bootstrap this likelihood formula off of the Normal likelihood formula that pymc3 already knows. See the built-in AR1 class for an example.
Since your particle has the Markov property, your likelihood looks like
import theano.tensor as T

def logp(path):
    now = path[1:]
    prev = path[:-1]
    loglik_first = pm.Normal.dist(mu=0., tau=1./A).logp(path[0])
    loglik_rest = T.sum(pm.Normal.dist(mu=prev*ss, tau=1./A/S).logp(now))
    loglik_final = loglik_first + loglik_rest
    return loglik_final
I'm guessing that you want to draw a value for ss at every time step, in which case you should make sure to specify ss = pm.exp(..., shape=len(x)-1), so that prev*ss in the block above gets interpreted as element-wise multiplication.
Then you can just specify your observations with
path = MyLangevin('path', ..., observed=x)
This should run much faster.
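Putting the pieces together, here is a minimal sketch of such a custom distribution (the class name MyLangevin and its constructor arguments are illustrative, not part of pymc3), modeled on the built-in AR1 class:

import pymc3 as pm
import theano.tensor as T
from pymc3.distributions import Continuous

class MyLangevin(Continuous):
    # hypothetical custom distribution wrapping the logp above
    def __init__(self, A=None, S=None, ss=None, *args, **kwargs):
        super(MyLangevin, self).__init__(*args, **kwargs)
        self.A = A
        self.S = S
        self.ss = ss

    def logp(self, path):
        A, S, ss = self.A, self.S, self.ss
        # first point follows the stationary distribution
        loglik_first = pm.Normal.dist(mu=0., tau=1./A).logp(path[0])
        # each later point is conditioned on its predecessor
        loglik_rest = T.sum(pm.Normal.dist(mu=path[:-1]*ss,
                                           tau=1./A/S).logp(path[1:]))
        return loglik_first + loglik_rest

Inside the model block it could then be attached to the data with path = MyLangevin('path', A=A, S=S, ss=ss, observed=x).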
Since I did not see an answer to my question, let me answer it myself. I came up with the following solution:
# now let's model this data using pymc3
# define the model/function for diffusion in a harmonic potential
DHP_model = pm.Model()
with DHP_model:
    D = pm.Gamma('D', mu=mu_D, sd=sd_D)
    A = pm.Gamma('A', mu=mu_A, sd=sd_A)
    S = 1.0 - pm.exp(-2.0*delta_t*D/A)
    ss = pm.exp(-delta_t*D/A)
    path = pm.Normal('path_0', mu=0.0, tau=1/A, observed=x[0])
    for i in range(1, N):
        path = pm.Normal('path_%i' % i,
                         mu=path*ss,
                         tau=1.0/A/S,
                         observed=x[i])
    start = pm.find_MAP()
    print(start)
    trace = pm.sample(100000, start=start)
Unfortunately, at N=50 this code takes anywhere between 6 hours and 2 days to compile. I am running on a fairly fast PC (24 GB RAM) under Ubuntu. I tried using the GPU, but that runs slightly slower. I suspect memory problems, since it uses 99.8% of the memory when running. I tried the same calculation with Stan and it only takes 2 minutes to run.

How to sample independently with pymc3

I am working with a simple bivariate normal model with a somewhat unconventional prior. The main issue I have is that my posteriors are inconsistent from one run to the next, which I'm guessing is related to an issue of high dependence between consecutive samples. Here are my specific questions.
What is the best way to get N independent samples? At the moment, I've been calling sample() to get a big chain (e.g. length 10,000) and then taking every 100th sample starting at 1,000. But looking now at an autocorrelation profile of one of the parameters, it looks like I need to take at least every 500th sample! (I could also use mutual information to get a better idea of dependence between lags.)
I've been following the fitting procedure described in the stochastic volatility example in the pymc3 tutorial. In particular I first find the MAP, then use it to generate a NUTS() object, then take a short sample, then use that to generate another NUTS() object, using gamma=0.25 (???), then finally get my big sample. I have no idea whether this is appropriate or whether I need the gamma=0.25.
Also, in that same example, there are testvals for the Exponential distribution. I don't know if I need these. (What is wrong with the default use of the mean?)
Here is the actual model I'm using.
import pymc3 as pymc
import numpy as np
import theano.tensor as th
from pymc3.distributions.continuous import Gamma, Uniform, Normal, Bounded
from pymc3.distributions.multivariate import MvNormal
from pymc3.model import Deterministic

data = np.random.randn(3000, 2) / 300  # I have actual data!

with pymc.Model():
    tau = Gamma('tau', alpha=2, beta=1 / 20000)
    sigma = Deterministic('sigma', 1 / th.sqrt(tau))
    corr = Uniform('corr', lower=0, upper=1)
    alpha_sig = Deterministic('alpha_sig', sigma / 50)
    alpha_post = Normal('alpha_post', mu=0, sd=alpha_sig)
    alpha_pre = Bounded(
        'alpha_pre', Normal, alpha_post, np.Inf, mu=0, sd=alpha_sig)
    corr_inv = th.stack([th.stack([1, -corr]),
                         th.stack([-corr, 1])]) / (1 - th.sqr(corr))
    MvNormal(
        'data', mu=th.stack([alpha_post, alpha_pre]),
        tau=tau * corr_inv, observed=data)
    map_ = pymc.find_MAP()
    step1 = pymc.NUTS(scaling=map_)
    trace1 = pymc.sample(1000, step=step1)
    step2 = pymc.NUTS(scaling=trace1[-1], gamma=0.25)
    trace2 = pymc.sample(10000, step=step2, start=trace1[-1])
I'm not sure what you're doing with the complex prior structure you have set up, but I think there is something wrong there.
I simplified the model to:
import pymc3 as pymc
import numpy as np
import theano.tensor as th
from pymc3.distributions.continuous import Gamma, Uniform, Normal, Bounded
from pymc3.distributions.multivariate import MvNormal
from pymc3.model import Deterministic

data = np.random.randn(3000, 2)  # I have actual data!

with pymc.Model():
    corr = Uniform('corr', lower=0, upper=1)
    corr_inv = th.stack([th.stack([1, -corr]),
                         th.stack([-corr, 1])]) / (1 - th.sqr(corr))
    mu = Normal('mu', mu=0, sd=1, shape=2)
    MvNormal('data',
             mu=mu,
             tau=corr_inv,
             observed=data)
    map_ = pymc.find_MAP()
    step1 = pymc.NUTS(scaling=map_)
    trace1 = pymc.sample(1000, step=step1)
    step2 = pymc.NUTS(scaling=trace1[-1])
    trace2 = pymc.sample(10000, step=step2, start=trace1[-1])
This has great convergence. I think you can also just drop the gamma parameter.
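On the first question (approximately independent samples), a minimal sketch of burn-in plus thinning, assuming a pymc3 version whose trace object supports slicing and using the interval suggested by the autocorrelation profile:

burned_thinned = trace2[1000::500]  # drop burn-in, keep every 500th draw
pymc.autocorrplot(burned_thinned)   # check the remaining autocorrelation is small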

Fitting The Theoretical Equation To My Data

I am very, very new to Python, so please bear with me, and pardon my naivety. I am using Spyder with Python 2.7 on my Windows laptop. As the title suggests, I have some data and a theoretical equation, and I am attempting to fit my data with what I believe is a chi-squared fit. The theoretical equation I am using is y(t) = (m/(gamma*D^2)) * ln(cosh(sqrt(gamma/(m*g))*D*g*t)).
import math
import numpy as np
import scipy.optimize as optimize
import matplotlib.pylab as plt
import csv
#with open('1.csv', 'r') as datafile:
# datareader = csv.reader(datafile)
# for row in datareader:
# print ', '.join(row)
t_y_data = np.loadtxt('exerciseball.csv', dtype=float, delimiter=',', usecols=(1,4), skiprows = 1)
print(t_y_data)
t = t_y_data[:,0]
y = t_y_data[:,1]
gamma0 = [.1]
sigma = [(0.345366)/2]*(len(t))
#len(sigma)
#print(sigma)
#print(len(sigma))
#sigma is the error in our measurements, which is the radius of the object
# Dragfunction is the theoretical equation of the position as a function of time when the thing falling experiences a drag force
# This is the function we are trying to fit to our data
# t is the independent variable time, m is the mass, and D is the Diameter
#Gamma is the value of which python will vary, until chi-squared is a minimum
def Dragfunction(x, gamma):
    print x
    g = 9.8
    D = 0.345366
    m = 0.715
#    num = math.sqrt(gamma)*D*g*x
#    den = math.sqrt(m*g)
#    frac = num/den
#    print "frac", frac
    return ((m)/(gamma*D**2))*math.log(math.cosh(math.sqrt(gamma/m*g)*D*g*t))
optimize.curve_fit(Dragfunction, t, y, gamma0, sigma)
This is the error message I am getting:
return ((m)/(gamma*D**2))*math.log(math.cosh(math.sqrt(gamma/m*g)*D*g*t))
TypeError: only length-1 arrays can be converted to Python scalars
My professor and I have spent about three or four hours trying to fix this. He helped me work out a lot of the problems, but this we can't seem to resolve.
Could someone please help? If there is any other information you need, please let me know.
Your error message comes from the fact that the math functions only accept scalars; to call these functions on an array, use the numpy versions:
In [82]: a = np.array([1,2,3])
In [83]: np.sqrt(a)
Out[83]: array([ 1. , 1.41421356, 1.73205081])
In [84]: math.sqrt(a)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
----> 1 math.sqrt(a)
TypeError: only length-1 arrays can be converted to Python scalars
In the process, I happened to spot a mathematical error in your code. Your equation at the top says that g belongs in the denominator of the square root inside log(cosh()), but you've got it in the numerator, because a/b*c == a*c/b in Python, not a/(b*c). So
log(cosh(sqrt(gamma/m*g)*D*g*t))
should instead be any one of these:
log(cosh(sqrt(gamma/m/g)*D*g*t))
log(cosh(sqrt(gamma/(m*g))*D*g*t))
log(cosh(sqrt(gamma*g/m)*D*t)) # the simplest, by canceling with the g from outside sqrt
A second error is that your function definition names its parameter x but never uses it; instead it uses t, which at that point is a global variable (from your data), so you won't see an error. This has no visible effect with curve_fit, since it passes your t data to the function anyway, but if you called Dragfunction on a different data set, it would still compute its result from the global t values. You probably meant this:
def Dragfunction(t, gamma):
    print t
    ...
    return ... D*g*t ...
A couple other notes as unsolicited advice, since you said you were new to python:
You can load and "unpack" the t and y variables at once with:
t, y = np.loadtxt('exerciseball.csv', dtype=float, delimiter=',', usecols=(1,4), skiprows = 1, unpack=True)
If your error is constant, then sigma has no effect on curve_fit, as it only affects the relative weighting for the fit, so you really don't need it at all.
Below is my version of your code, with all of the above changes in place.
import numpy as np
from scipy import optimize          # simplified syntax
import matplotlib.pyplot as plt     # pylab != pyplot

# `unpack` lets you split the columns immediately:
t, y = np.loadtxt('exerciseball.csv', dtype=float, delimiter=',',
                  usecols=(1, 4), skiprows=1, unpack=True)

gamma0 = .1  # does not need to be a list

def Dragfunction(t, gamma):
    g = 9.8
    D = 0.345366
    m = 0.715
    gammaD_m = gamma*D*D/m  # this combination is used twice; calculate it once
    return np.log(np.cosh(np.sqrt(gammaD_m*g)*t)) / gammaD_m

gamma_best, gamma_var = optimize.curve_fit(Dragfunction, t, y, gamma0)
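A short usage sketch (not in the original answer) to check the fit visually; curve_fit returns gamma_best as a length-1 array, hence the [0]:

plt.plot(t, y, 'b.', label='data')
plt.plot(t, Dragfunction(t, gamma_best[0]), 'r-', label='fit')
plt.xlabel('t'); plt.ylabel('y')
plt.legend()
plt.show()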

scikit-learn PCA doesn't have 'score' method

I am trying to identify the type of noise based on this article:
Model selection with Probabilistic (PCA) and Factor Analysis (FA)
I am using scikit-learn-0.14.1.win32-py2.7 on win8 64bit
I know that it refers to version 0.15; however, the version 0.14 documentation mentions that the score method is available for PCA, so I assume it should work:
sklearn.decomposition.ProbabilisticPCA
The problem is that no matter which PCA I use with cross_val_score, I always get a TypeError saying that the estimator PCA does not have a score method:
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator PCA(copy=True, n_components=None, whiten=False) does not.
Any ideas why is that happening?
Many thanks in advance
Christos
X has 1000 samples of 40 features
Here is a portion of the code:
import numpy as np
import csv
from scipy import linalg
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.cross_validation import cross_val_score
from sklearn.grid_search import GridSearchCV
from sklearn.covariance import ShrunkCovariance, LedoitWolf
#read in the training data
train_path = '<train data path>/train.csv'
reader = csv.reader(open(train_path,"rb"),delimiter=',')
train = list(reader)
X = np.array(train).astype('float')
n_samples = 1000
n_features = 40
n_components = np.arange(0, n_features, 4)
def compute_scores(X):
    pca = PCA()
    pca_scores = []
    for n in n_components:
        pca.n_components = n
        pca_scores.append(np.mean(cross_val_score(pca, X, n_jobs=1)))
    return pca_scores
pca_scores = compute_scores(X)
n_components_pca = n_components[np.argmax(pca_scores)]
OK, I think I found the problem: it does not work with PCA, but it does work with ProbabilisticPCA.
However, by not providing a cv number, cross_val_score automatically uses 3-fold cross-validation,
which created 3 folds of sizes 334, 333 and 333 (my initial training set contains 1000 samples).
Since numpy.mean cannot combine results of different sizes (334 vs 333), Python raises an exception.
Thanks
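For reference, a minimal sketch of the working combination described above (assuming, per the 0.14 docs, that ProbabilisticPCA provides score, and averaging within each fold first so the unequal fold sizes no longer matter):

import numpy as np
from sklearn.decomposition import ProbabilisticPCA
from sklearn.cross_validation import cross_val_score

def compute_scores(X, cv=3):
    pca = ProbabilisticPCA()
    pca_scores = []
    for n in n_components:
        pca.n_components = n
        fold_scores = cross_val_score(pca, X, cv=cv, n_jobs=1)
        # each fold may return per-sample log-likelihoods; reduce each fold
        # to a scalar before taking the overall mean
        pca_scores.append(np.mean([np.mean(s) for s in fold_scores]))
    return pca_scores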