For comparison purposes, I want to utilize the posterior density function outside of PyMC3.
For my research project, I want to find out how well PyMC3 performs compared to our own custom-made code, so I need to compare it against our in-house samplers and likelihood functions.
I think I have figured out how to call the internal PyMC3 posterior, but it feels very awkward, and I want to know whether there is a better way. Right now I am hand-transforming variables, whereas I should just be able to pass PyMC3 a parameter dictionary and get the posterior density back. Is this possible in a straightforward manner?
Thanks a lot!
Demo code:
import numpy as np
import pymc3 as pm
import scipy.stats as st
# Simple data, with sigma = 4. We want to estimate sigma
sigma_inject = 4.0
data = np.random.randn(10) * sigma_inject
# Prior interval for sigma
a, b = 0.0, 20.0
# Build PyMC model
with pm.Model() as model:
    sigma = pm.Uniform('sigma', a, b)  # Uniform prior between 0.0 and 20.0
    likelihood = pm.Normal('data', 0.0, sd=sigma, observed=data)
# Write my own likelihood
def logpost_self(sig, data):
    loglik = np.sum(st.norm(loc=0.0, scale=sig).logpdf(data))  # Gaussian likelihood
    logpr = np.log(1.0 / (b - a))  # Uniform prior
    return loglik + logpr
# Utilize PyMC likelihood (Have to hand-transform parameters)
def logpost_pymc(sig, model):
    sigma_interval = np.log((sig - a) / (b - sig))  # interval transform of sigma
    ldrdx = np.log(1.0 / (sig - a) + 1.0 / (b - sig))  # log-Jacobian of the transform
    return model.logp({'sigma_interval': sigma_interval}) + ldrdx
print("Own posterior: {0}".format(logpost_self(1.0, data)))
print("PyMC3 posterior: {0}".format(logpost_pymc(1.0, model)))
It's been over 5 years, but I figured this deserves an answer.
Firstly, regarding the transformations, you need to decide within the PyMC3 model definition whether you want these parameters transformed. Here, sigma was being transformed with an interval transform to avoid hard boundaries. If you want to access the posterior as a function of sigma itself, set transform=None. If you do transform, then the 'sigma' variable becomes accessible as one of the deterministic parameters of the model.
Regarding accessing the posterior, there is a great description here. With the example given above, the code becomes:
import numpy as np
import pymc3 as pm
import theano as th
import scipy.stats as st
# Simple data, with sigma = 4. We want to estimate sigma
sigma_inject = 4.0
data = np.random.randn(10) * sigma_inject
# Prior interval for sigma
a, b = 0.1, 20.0
# Build PyMC model
with pm.Model() as model:
    sigma = pm.Uniform('sigma', a, b, transform=None)  # Uniform prior between a and b
    likelihood = pm.Normal('data', mu=0.0, sigma=sigma, observed=data)
# Write my own likelihood
def logpost_self(sig, data):
    loglik = np.sum(st.norm(loc=0.0, scale=sig).logpdf(data))  # Gaussian likelihood
    logpr = np.log(1.0 / (b - a))  # Uniform prior
    return loglik + logpr
with model:
    # Compile the model posterior into a theano function
    f = th.function(model.vars, [model.logpt] + model.deterministics)

def logpost_pymc3(params):
    dct = model.bijection.rmap(params)
    args = (dct[k.name] for k in model.vars)
    results = f(*args)
    return tuple(results)
print("Own posterior: {0}".format(logpost_self(1.0, data)))
print("PyMC3 posterior: {0}".format(logpost_pymc3([1.0])))
Note that if you remove the 'transform=None' part from the sigma prior, then the actual value of sigma becomes part of the tuple that is returned by the logpost_pymc3 function. It's now a deterministic of the model.
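As a quick way to see where the variables land, here is a minimal sketch (assuming a recent PyMC3 3.x; the exact name of the transformed variable varies by version):
import pymc3 as pm

a, b = 0.1, 20.0
with pm.Model() as model_transformed:
    sigma = pm.Uniform('sigma', a, b)  # default interval transform

# The free variables hold the transformed parameter, e.g. 'sigma_interval__'
print([v.name for v in model_transformed.vars])
# 'sigma' itself shows up among the deterministics
print([d.name for d in model_transformed.deterministics])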
In the PyMC3 examples, priors and the likelihood are defined inside a with statement, but nothing explicitly marks which is a prior and which is the likelihood. How do I define them?
In the following example code, alpha and beta are priors and y_obs is the likelihood (as the PyMC3 examples state).
My question is: how does PyMC3's internal code figure out whether a distribution is a prior or a likelihood? There should be some explicit parameter telling the PyMC3 internals which kind of distribution it is (prior/likelihood).
I know y_obs is the likelihood, but I could define more variables like y_obs1 and y_obs2. How is PyMC3 going to identify which ones are likelihoods and which are priors?
import numpy as np
from pymc3 import Model, Normal, HalfNormal

# X (predictors) and y (observations) come from the data being modeled;
# random placeholders here so the snippet runs on its own
X = np.random.randn(100, 2)
y = np.random.randn(100)

regression_model = Model()
with regression_model:
    alpha = Normal('alpha', mu=0, sd=10)
    beta = Normal('beta', mu=0, sd=10, shape=2)
    sigma = HalfNormal('sigma', sd=1)
    mu = alpha + beta[0] * X[:, 0] + beta[1] * X[:, 1]
    y_obs = Normal('y_obs', mu=mu, sd=sigma, observed=y)
Passing an observed argument makes a variable a likelihood term (in your example, P[y | mu, sigma]). The other random variables (alpha, beta, and sigma), which lack an observed argument, are sampled as priors.
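You can check this distinction on the model object itself; a small sketch (assuming PyMC3 3.x, reusing regression_model from above):
# Variables given an observed argument are collected in observed_RVs;
# everything else (the priors) ends up in free_RVs (transformed names
# such as 'sigma_log__' depend on the PyMC3 version)
print([v.name for v in regression_model.observed_RVs])
print([v.name for v in regression_model.free_RVs])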
I need to randomly sample from some distribution eventually, so I need one that allows me to readily change the mean and variance. I'm looking at using distributions from the scipy.stats library; however, I'm having difficulty seeing how the parameters "loc" and "scale" relate to the quantities I'm interested in. I'd like to be able to do something like:
x = numpy.linspace(0,5,1000)
y = scipy.stats.maxwell(x, mean, variance)
But loc and scale seem to be the only other arguments that function takes.
Can anyone specify the relationship those quantities must have to mean and variance, or suggest a better library to use?
Well, I don't have Python 2.7, so this answer is for Python 3.6, but it should work either way; it is SciPy, after all.
Basically, you have to extract the scale and loc parameters from the given μ and σ. Here are two simple functions to do that, plus some sampling to show we're getting the right values. The first printed line is what you want, and the third line is the result of sampling, which should be roughly the same. The second line is the scale and loc computed from μ and σ. Play with the numbers and see how it goes. Note that per the MathWorld page, the Maxwell variance is scale**2 * (3*pi - 8) / pi, so the conversion uses sigma squared.
import numpy as np
from scipy.stats import maxwell
def get_scale_from_sigma(sigma):
    """Compute scale from sigma, based on http://mathworld.wolfram.com/MaxwellDistribution.html"""
    a2 = np.pi * sigma**2 / (3.0 * np.pi - 8.0)  # variance = scale**2 * (3*pi - 8) / pi
    return np.sqrt(a2)

def get_loc_from_mu_sigma(mu, sigma):
    """Compute loc from mu/sigma, based on http://mathworld.wolfram.com/MaxwellDistribution.html"""
    scale = get_scale_from_sigma(sigma)
    loc = mu - 2.0 * scale * np.sqrt(2.0 / np.pi)  # mean = loc + 2*scale*sqrt(2/pi)
    return loc

sigma = 1.0
mu = 2.0 * get_scale_from_sigma(sigma) * np.sqrt(2.0 / np.pi)  # + 3.0 as a shift, for example
print(mu, sigma)
scale = get_scale_from_sigma(sigma)
loc = get_loc_from_mu_sigma(mu, sigma)
print(scale, loc)
q = maxwell.rvs(size=10000, scale=scale, loc=loc)
print(np.mean(q), np.std(q))
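As a cross-check that avoids sampling noise, SciPy can also report the exact moments of the distribution (a small sketch reusing loc and scale from above):
# maxwell.stats returns the exact mean and variance for given loc/scale
exact_mean, exact_var = maxwell.stats(loc=loc, scale=scale, moments='mv')
print(exact_mean, np.sqrt(exact_var))  # should match mu and sigma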
I am working with a simple bivariate normal model with a somewhat unconventional prior. The main issue I have is that my posteriors are inconsistent from one run to the next, which I'm guessing is related to an issue of high dependence between consecutive samples. Here are my specific questions.
What is the best way to get N independent samples? At the moment, I've been calling sample() to get a big chain (e.g. length 10,000) and then taking every 100th sample starting at 1,000. But looking now at an autocorrelation profile of one of the parameters, it looks like I need to take at least every 500th sample! (I could also use mutual information to get a better idea of dependence between lags.)
I've been following the fitting procedure described in the stochastic volatility example in the PyMC3 tutorial. In particular, I first find the MAP, use it to generate a NUTS() object, take a short sample, and use that to generate another NUTS() object with gamma=0.25 (???), before finally drawing my big sample. I have no idea whether this is appropriate or whether I need the gamma=0.25.
Also, in that same example, there are testvals for the Exponential distribution. I don't know if I need these. (What is wrong with the default use of the mean?)
Here is the actual model I'm using.
import pymc3 as pymc
import numpy as np
import theano.tensor as th
from pymc3.distributions.continuous import Gamma, Uniform, Normal, Bounded
from pymc3.distributions.multivariate import MvNormal
from pymc3.model import Deterministic
data = np.random.randn(3000, 2) / 300 # I have actual data!
with pymc.Model():
    tau = Gamma('tau', alpha=2, beta=1 / 20000)
    sigma = Deterministic('sigma', 1 / th.sqrt(tau))
    corr = Uniform('corr', lower=0, upper=1)
    alpha_sig = Deterministic('alpha_sig', sigma / 50)
    alpha_post = Normal('alpha_post', mu=0, sd=alpha_sig)
    alpha_pre = Bounded(
        'alpha_pre', Normal, alpha_post, np.Inf, mu=0, sd=alpha_sig)
    corr_inv = th.stack([th.stack([1, -corr]),
                         th.stack([-corr, 1])]) / (1 - th.sqr(corr))
    MvNormal(
        'data', mu=th.stack([alpha_post, alpha_pre]),
        tau=tau * corr_inv, observed=data)
    map_ = pymc.find_MAP()
    step1 = pymc.NUTS(scaling=map_)
    trace1 = pymc.sample(1000, step=step1)
    step2 = pymc.NUTS(scaling=trace1[-1], gamma=0.25)
    trace2 = pymc.sample(10000, step=step2, start=trace1[-1])
I'm not sure what you're doing with the complex prior structure you have set up, but I think something is wrong there.
I simplified the model to:
import pymc3 as pymc
import numpy as np
import theano.tensor as th
from pymc3.distributions.continuous import Uniform, Normal
from pymc3.distributions.multivariate import MvNormal

data = np.random.randn(3000, 2)  # I have actual data!

with pymc.Model():
    corr = Uniform('corr', lower=0, upper=1)
    corr_inv = th.stack([th.stack([1, -corr]),
                         th.stack([-corr, 1])]) / (1 - th.sqr(corr))
    mu = Normal('mu', mu=0, sd=1, shape=2)
    MvNormal('data',
             mu=mu,
             tau=corr_inv,
             observed=data)
    map_ = pymc.find_MAP()
    step1 = pymc.NUTS(scaling=map_)
    trace1 = pymc.sample(1000, step=step1)
    step2 = pymc.NUTS(scaling=trace1[-1])
    trace2 = pymc.sample(10000, step=step2, start=trace1[-1])
This converges well. I think you can also just drop the gamma parameter.
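On the thinning question: rather than guessing a thinning interval, you can inspect the autocorrelation and effective sample size directly. A hedged sketch (these helpers exist in PyMC3 3.x, though argument names and chain requirements vary across versions):
import matplotlib.pyplot as plt

pymc.autocorrplot(trace2)  # visual check of sample dependence per parameter
plt.show()
print(pymc.effective_n(trace2))  # may require a multi-chain trace in some versions
thinned = trace2[1000::500]  # if you still want to thin, slice the trace directly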
I have an equation, as follows:
R - ((1.0 - np.exp(-tau))/(1.0 - np.exp(-a*tau))) = 0.
I want to solve for tau in this equation using a numerical solver available within numpy. What is the best way to go about this?
The values for R and a in this equation vary for different implementations of this formula, but are fixed at particular values when it is to be solved for tau.
In conventional mathematical notation, your equation is
R - (1 - exp(-tau)) / (1 - exp(-a*tau)) = 0
The SciPy fsolve function searches for a point at which a given expression equals zero (a "zero" or "root" of the expression). You'll need to provide fsolve with an initial guess that's "near" your desired solution. A good way to find such an initial guess is to just plot the expression and look for the zero crossing.
#!/usr/bin/python
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import fsolve
# Define the expression whose roots we want to find
a = 0.5
R = 1.6
func = lambda tau : R - ((1.0 - np.exp(-tau))/(1.0 - np.exp(-a*tau)))
# Plot it
tau = np.linspace(-0.5, 1.5, 201)
plt.plot(tau, func(tau))
plt.xlabel("tau")
plt.ylabel("expression value")
plt.grid()
plt.show()
# Use the numerical solver to find the roots
tau_initial_guess = 0.5
tau_solution = fsolve(func, tau_initial_guess)
print "The solution is tau = %f" % tau_solution
print "at which the value of the expression is %f" % func(tau_solution)
You can rewrite the equation as a polynomial: substituting x = exp(-tau) turns it into
R*x**a - x + 1 - R = 0
Keep in mind:
- For integer a and non-zero R you will get solutions in the complex plane;
- There are analytical solutions for a = 0, 1, ..., 4 (see here).
So in general you may have one, multiple, or no solutions, and some or all of them may be complex. You can easily throw scipy.optimize.root at this equation, but no numerical method will guarantee finding all the solutions.
To solve in the complex space:
import numpy as np
from scipy.optimize import root
def poly(xs, R, a):
    # treat the two real inputs as one complex number x = exp(-tau)
    x = complex(*xs)
    err = R * x**a - x + 1 - R  # the polynomial above
    return [err.real, err.imag]
root(poly, x0=[0, 0], args=(1.2, 6))
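Since the substitution was x = exp(-tau), a root x still has to be mapped back to tau; a small sketch (the starting point x0 is an arbitrary choice, and note that x = 1, i.e. tau = 0, is the degenerate point where the original expression is 0/0):
sol = root(poly, x0=[0.5, 0.1], args=(1.2, 6))
x = complex(*sol.x)
tau = -np.log(x)  # complex log; tau is real when x is real and positive
print(x, tau)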
I am very, very new to Python, so please bear with me, and pardon my naivety. I am using Spyder with Python 2.7 on my Windows laptop. As the title suggests, I have some data and a theoretical equation, and I am attempting to fit my data with what I believe is a chi-squared fit. The theoretical equation I am using is
y(t) = (m / (gamma*D**2)) * log(cosh(sqrt(gamma/(m*g)) * D*g*t))
where m is the mass, D the diameter, g the gravitational acceleration, and gamma the drag parameter to be fitted.
import math
import numpy as np
import scipy.optimize as optimize
import matplotlib.pylab as plt
import csv
#with open('1.csv', 'r') as datafile:
# datareader = csv.reader(datafile)
# for row in datareader:
# print ', '.join(row)
t_y_data = np.loadtxt('exerciseball.csv', dtype=float, delimiter=',', usecols=(1,4), skiprows = 1)
print(t_y_data)
t = t_y_data[:,0]
y = t_y_data[:,1]
gamma0 = [.1]
sigma = [(0.345366)/2]*(len(t))
#len(sigma)
#print(sigma)
#print(len(sigma))
#sigma is the error in our measurements, which is the radius of the object
# Dragfunction is the theoretical equation of the position as a function of time when the thing falling experiences a drag force
# This is the function we are trying to fit to our data
# t is the independent variable time, m is the mass, and D is the Diameter
#Gamma is the value of which python will vary, until chi-squared is a minimum
def Dragfunction(x, gamma):
    print x
    g = 9.8
    D = 0.345366
    m = 0.715
#    num = math.sqrt(gamma)*D*g*x
#    den = math.sqrt(m*g)
#    frac = num/den
#    print "frac", frac
    return ((m)/(gamma*D**2))*math.log(math.cosh(math.sqrt(gamma/m*g)*D*g*t))
optimize.curve_fit(Dragfunction, t, y, gamma0, sigma)
This is the error message I am getting:
return ((m)/(gamma*D**2))*math.log(math.cosh(math.sqrt(gamma/m*g)*D*g*t))
TypeError: only length-1 arrays can be converted to Python scalars
My professor and I have spent about three or four hours trying to fix this. He helped me work out a lot of the problems, but this one we can't seem to resolve.
Could someone please help? If there is any other information you need, please let me know.
Your error message comes from the fact that those math functions only accept a scalar, so to call functions on an array, use the numpy versions:
In [82]: a = np.array([1,2,3])
In [83]: np.sqrt(a)
Out[83]: array([ 1. , 1.41421356, 1.73205081])
In [84]: math.sqrt(a)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
----> 1 math.sqrt(a)
TypeError: only length-1 arrays can be converted to Python scalars
In the process, I happened to spot a mathematical error in your code. Your equation at the top says that g belongs in the bottom of the square root inside the log(cosh()), but you've got it on top, because a/b*c == a*c/b in Python, not a/(b*c):
log(cosh(sqrt(gamma/m*g)*D*g*t))
should instead be any one of these:
log(cosh(sqrt(gamma/m/g)*D*g*t))
log(cosh(sqrt(gamma/(m*g))*D*g*t))
log(cosh(sqrt(gamma*g/m)*D*t)) # the simplest, by canceling with the g from outside sqrt
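To make the precedence point concrete, a two-line check:
print(10.0 / 2 * 5)    # 25.0 -- parsed as (10.0/2)*5
print(10.0 / (2 * 5))  # 1.0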
A second error is that in your function definition you have a parameter named x which you never use; instead you're using t, which at this point is a global variable (from your data), so you won't see an error. You won't notice an effect with curve_fit, since it passes your t data to the function anyway, but if you tried to call Dragfunction on a different data set, it would still compute with the original t values. You probably meant this:
def Dragfunction(t, gamma):
    print t
    ...
    return ... D*g*t ...
A couple of other notes, as unsolicited advice, since you said you are new to Python:
You can load and "unpack" the t and y variables at once with:
t, y = np.loadtxt('exerciseball.csv', dtype=float, delimiter=',', usecols=(1,4), skiprows = 1, unpack=True)
If your error is constant, then sigma has no effect on curve_fit, as it only affects the relative weighting for the fit, so you really don't need it at all.
Below is my version of your code, with all of the above changes in place.
import numpy as np
from scipy import optimize # simplified syntax
import matplotlib.pyplot as plt # pylab != pyplot
# `unpack` lets you split the columns immediately:
t, y = np.loadtxt('exerciseball.csv', dtype=float, delimiter=',',
                  usecols=(1, 4), skiprows=1, unpack=True)

gamma0 = .1  # does not need to be a list

def Dragfunction(t, gamma):  # parameter renamed from x to t, per the note above
    g = 9.8
    D = 0.345366
    m = 0.715
    gammaD_m = gamma*D*D/m  # this combination is used twice; compute it once for a (small) speedup
    return np.log(np.cosh(np.sqrt(gammaD_m*g)*t)) / gammaD_m

gamma_best, gamma_var = optimize.curve_fit(Dragfunction, t, y, gamma0)
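As a possible follow-up (not part of the original fix), you could overlay the fitted curve on the data to eyeball the result; gamma_best is the one-element array returned by curve_fit:
plt.plot(t, y, '.', label='data')
plt.plot(t, Dragfunction(t, gamma_best[0]), '-', label='fit')
plt.xlabel('t')
plt.ylabel('position')
plt.legend()
plt.show()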