How can I specify a non-theano based likelihood? - pymc3

I saw a post from a few days ago by someone else: pymc3 likelihood math with non-theano function. Even though I think the problem at its core is the same, I thought I would ask with a simpler example:
Inside logp_wrap, I put some made up definition of a likelihood function. It depends on the rv and an observation. In this case I could do this with theano operations, but let's say that I want this function to be more complex and so I cannot use theano.
The problem comes when I try to define the likelihood both in terms of an RV and observations. From what I have seen, this format would work if I was specifying everything in 'logp_wrap' as theano operations.
I have searched around for a solution to this, but haven't found anything where this problem is fully addressed.
The actual problem with my attempt is that the logp_ function is correctly decorated, but the logp_wrap function is only correctly decorated for its input and not for its output (it returns a function, not a scalar), so I get the error
TypeError: 'TensorVariable' object is not callable.
Would be great if someone had a solution - don't think I am the only one with this problem.
The theano version of this that works (and uses the same function within a function definition) without the #as_op code is here: https://pymc-devs.github.io/pymc3/notebooks/lda-advi-aevb.html?highlight=densitydist (Specifically the sections: "Log-likelihood of documents for LDA" and "LDA model section")
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
"""
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymc3 as pm
from theano import as_op
import theano.tensor as T
from scipy.stats import norm
#Some data that we observed
g_observed = [0.0, 1.0, 2.0, 3.0]
#Define a function to calculate the logp without using theano
#This as_op is where the problem is - the input is an rv but the output is a
#function.
#as_op(itypes=[T.dscalar],otypes=[T.dscalar])
def logp_wrap(rv):
    #We are not using theano so we wrap the function.
    #as_op(itypes=[T.dvector],otypes=[T.dscalar])
    def logp_(ob):
        #Some made up likelihood -
        #The key here is that lp depends on the rv input and the observations
        lp = np.log(norm.pdf(rv + ob))
        return lp
    return logp_
hb1_model = pm.Model()
with hb1_model:
    I_mean = pm.Normal('I_mean', mu=0.1, sd=0.05)
    xs = pm.DensityDist('x', logp_wrap(I_mean),observed = g_observed)
with hb1_model:
    step = pm.Metropolis()
    trace = pm.sample(1000, step)
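As a side note, the function-within-a-function pattern above can be exercised without PyMC3 or theano to confirm that logp_wrap returns a callable that closes over rv. The sketch below is only an illustration under assumptions: it uses plain Python floats in place of the random variable, the standard library in place of scipy, and it sums the per-observation terms into one scalar (suggested by the scalar otypes in the commented-out decorator, but not stated in the question). It does not solve the as_op problem itself.

```python
import math

def logp_wrap(rv):
    """Return a log-likelihood function that closes over rv (a plain float here)."""
    def logp_(ob):
        # Standard normal log-density evaluated at rv + o for each observation o.
        # Summing to a single scalar is an assumption made for this sketch.
        return sum(-0.5 * (rv + o) ** 2 - 0.5 * math.log(2.0 * math.pi) for o in ob)
    return logp_

g_observed = [0.0, 1.0, 2.0, 3.0]
logp = logp_wrap(0.1)      # a callable, not a number
total = logp(g_observed)   # scalar log-likelihood for rv = 0.1
```

The point of the sketch is that the outer call produces a function, which is why decorating the outer function with scalar output types does not fit.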

Related

Saving data from traceplot in PyMC3

Below is the code for a simple Bayesian Linear regression. After I obtain the trace and the plots for the parameters, is there any way in which I can save the data that created the plots in a file so that if I need to plot it again I can simply plot it from the data in the file rather than running the whole simulation again?
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0,9,5)
y = 2*x + 5
yerr=np.random.rand(len(x))
def soln(x, p1, p2):
    return p1+p2*x
with pm.Model() as model:
    # Define priors
    intercept = pm.Normal('Intercept', 15, sd=5)
    slope = pm.Normal('Slope', 20, sd=5)
    # Model solution
    sol = soln(x, intercept, slope)
    # Define likelihood
    likelihood = pm.Normal('Y', mu=sol,
                           sd=yerr, observed=y)
    # Sampling
    trace = pm.sample(1000, nchains = 1)
pm.traceplot(trace)
print pm.summary(trace, ['Slope'])
print pm.summary(trace, ['Intercept'])
plt.show()
There are two easy ways of doing this:
Use a version after 3.4.1 (currently this means installing from master, with pip install git+https://github.com/pymc-devs/pymc3). There is a new feature that allows saving and loading traces efficiently. Note that you need access to the model that created the trace:
...
pm.save_trace(trace, 'linreg.trace')
# later
with model:
    trace = pm.load_trace('linreg.trace')
Use cPickle (or pickle in python 3). Note that pickle is at least a little insecure, so don't unpickle data from untrusted sources:
import cPickle as pickle # just `import pickle` on python 3
...
with open('trace.pkl', 'wb') as buff:
    pickle.dump(trace, buff)
#later
with open('trace.pkl', 'rb') as buff:
    trace = pickle.load(buff)
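The pickle round trip can be tried without running a sampler. The following is a minimal sketch for Python 3 that uses a plain dictionary as a hypothetical stand-in for the trace (the names fake_trace and save_and_reload are made up for illustration); a real trace object would be dumped and loaded the same way.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trace: parameter name -> sampled values.
fake_trace = {"Slope": [1.9, 2.1, 2.0], "Intercept": [5.2, 4.8, 5.0]}

def save_and_reload(obj, path):
    """Pickle obj to path, then read it back and return the restored copy."""
    with open(path, "wb") as buff:
        pickle.dump(obj, buff)
    with open(path, "rb") as buff:
        return pickle.load(buff)

# Use a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmpdir:
    restored = save_and_reload(fake_trace, os.path.join(tmpdir, "trace.pkl"))
```

The restored object compares equal to the original, so it can be plotted again without rerunning the simulation.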
Update for anyone like me who still comes across this question:
The load_trace and save_trace functions have been removed. As of version 4.0, even the deprecation warning for these functions is gone.
The way to do it now is to use arviz:
with model:
    trace = pymc.sample(return_inferencedata=True)
trace.to_netcdf("filename.nc")
And it can be loaded with:
trace = arviz.from_netcdf("filename.nc")
This way works for me:
# saving trace
pm.save_trace(trace=trace_nb, directory=r"c:\Users\xxx\Documents\xxx\traces\trace_nb")
# loading saved traces
with model_nb:
    t_nb = pm.load_trace(directory=r"c:\Users\xxx\Documents\xxx\traces\trace_nb")

cannot import name NonlinearConstraint

I'm trying to run the optimization example with non-linear constraints shown here
https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
>>> def cons_f(x):
...     return [x[0]**2 + x[1], x[0]**2 - x[1]]
>>> def cons_J(x):
...     return [[2*x[0], 1], [2*x[0], -1]]
>>> def cons_H(x, v):
...     return v[0]*np.array([[2, 0], [0, 0]]) + v[1]*np.array([[2, 0], [0, 0]])
>>> from scipy.optimize import NonlinearConstraint
>>> nonlinear_constraint = NonlinearConstraint(cons_f, -np.inf, 1, jac=cons_J, hess=cons_H)
But when I try to import NonlinearConstraint this is what I get
ImportError: cannot import name NonlinearConstraint
I'm running scipy v.1.0.0
>>> import scipy
>>> print scipy.__version__
1.0.0
Any suggestions? Thanks in advance for your help
You will need scipy >= 1.1 or an install based on the master branch!
As 1.1 was released recently (05.05.18), there is a good chance that binary builds are already available (this depends a bit on how you use scipy).
Compare 1.1's optimize/__init__.py:
...
from ._lsq import least_squares, lsq_linear
from ._constraints import (NonlinearConstraint,
LinearConstraint,
Bounds)
from ._hessian_update_strategy import HessianUpdateStrategy, BFGS, SR1
__all__ = [s for s in dir() if not s.startswith('_')]
...
with 1.0.1's optimize/__init__.py:
...
from ._lsq import least_squares, lsq_linear
__all__ = [s for s in dir() if not s.startswith('_')]
...
More indications are available in the 1.1 release-text:
scipy.optimize improvements
The method trust-constr has been added to scipy.optimize.minimize. The
method switches between two implementations depending on the problem
definition. For equality constrained problems it is an implementation of
a trust-region sequential quadratic programming solver and, when
inequality constraints are imposed, it switches to a trust-region
interior point method. Both methods are appropriate for large scale
problems. Quasi-Newton options BFGS and SR1 were implemented and can be
used to approximate second order derivatives for this new method. Also,
finite-differences can be used to approximate either first-order or
second-order derivatives.
trust-constr is the solver that actually introduced those abstractions.
Additionally, optimize/_constraints.py does not exist in 1.0.1.
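The version requirement above can be checked in code before relying on the import. This is only a sketch: version_at_least is a hypothetical helper written for this example (not a scipy function), and the try/except falls back to None when the class cannot be imported, whether scipy is too old or not installed at all.

```python
import re

def version_at_least(version, minimum):
    """Return True if a dotted version string is >= the minimum tuple, e.g. (1, 1)."""
    parts = []
    for piece in version.split(".")[:len(minimum)]:
        match = re.match(r"\d+", piece)  # keep leading digits, so "0rc1" counts as 0
        parts.append(int(match.group()) if match else 0)
    while len(parts) < len(minimum):     # pad short versions such as "1"
        parts.append(0)
    return tuple(parts) >= tuple(minimum)

try:
    from scipy.optimize import NonlinearConstraint
except ImportError:
    # Raised by scipy < 1.1 (name missing) and also when scipy is absent.
    NonlinearConstraint = None

HAVE_NONLINEAR_CONSTRAINT = NonlinearConstraint is not None
```

With the version from the question, version_at_least("1.0.0", (1, 1)) is False, which matches the ImportError.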

Potential drawing defect in Seaborn's 'heatmap' (?)

Consider the following code:
import seaborn as sb
import matplotlib.pyplot as plt
import numpy as np
def main():
    dst = np.ones((100,100),dtype=np.float32)
    ax = plt.subplots(figsize=(17, 17))
    sb.heatmap(dst, linewidths=.5, vmax=np.max(dst), vmin=np.min(dst), square=True, cmap="RdYlBu_r", cbar=False).get_figure().savefig("sb_save.png")
    plt.show()
if __name__ == "__main__": main()
The saved plot (sb_save.png) is clearly irregular. The output of plt.show(), on the other hand, is acceptable overall, although on a closer look you can still see that its square cells are unevenly sized.
The behavior might have been triggered because of the particulars of invoking savefig but I don't know of an alternative to try here. Any help would be much appreciated.

Using cython to speed up thousands of set operations

I have been trying to get over my fear of Cython (fear because I literally know NOTHING about c, or c++)
I have a function which takes 2 arguments, a set (we'll call it testSet), and a list of sets (we'll call that targetSets). The function then iterates through targetSets, and computes the length of the intersection with testSet, adding that value to a list, which is then returned.
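For reference, the function described above can be sketched in pure Python as follows. The name compute_overlap_py is made up for this illustration, and treating overlaps of size 0 or 1 as 0 mirrors the Cython code posted further down rather than the description itself.

```python
def compute_overlap_py(test_set, target_sets):
    """Return the intersection size of test_set with each set in target_sets.

    Overlaps of size 0 or 1 are recorded as 0, as in the Cython version.
    """
    overlaps = []
    for target in target_sets:
        size = len(test_set & target)
        overlaps.append(size if size > 1 else 0)
    return overlaps

example = compute_overlap_py({1, 2, 3}, [{1, 2}, {3}, {4}, {1, 2, 3, 5}])
```

Here example is [2, 0, 0, 3]: the second and third targets share at most one element with the test set and are therefore counted as 0.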
Now, this isn't by itself that slow, but the problem is I need to do simulations of the testSet (and a large number at that, ~ 10,000), and the targetSet is about 10,000 sets long.
So for a small number of simulations to test, the pure python implementation was taking ~50 secs.
I tried making a cython function, and it worked and it's now running at ~16 secs.
If there is anything else that I could do to the cython function that anyone could think of that would be great (python 2.7 btw)
Here is my Cython implementation in overlapFunc.pyx
def computeOverlap(set testSet, list targetSets):
    cdef list obsOverlaps = []
    cdef int i, N
    cdef set overlap
    N = len(targetSets)
    for i in range(N):
        overlap = testSet & targetSets[i]
        if len(overlap) <= 1:
            obsOverlaps.append(0)
        else:
            obsOverlaps.append(len(overlap))
    return obsOverlaps
and the setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
ext_modules = [Extension("overlapFunc",
                         ["overlapFunc.pyx"])]
setup(
    name = 'computeOverlap function',
    cmdclass = {'build_ext': build_ext},
    ext_modules = ext_modules
)
and some code to build some random sets for testing and to time the function. test.py
import numpy as np
from overlapFunc import computeOverlap
import time
def simRandomSet(n):
    for i in range(n):
        simSet= set(np.random.randint(low=1, high=100, size=50))
        yield simSet
if __name__ == '__main__':
    np.random.seed(23032014)
    targetSet = [set(np.random.randint(low=1, high=100, size=50)) for i in range(10000)]
    simulatedTestSets = simRandomSet(200)
    start = time.time()
    for i in simulatedTestSets:
        obsOverlaps = computeOverlap(i, targetSet)
    print time.time()-start
I tried changing the def at the start of the computeOverlap function to cdef, as in:
cdef list computeOverlap(set testSet, list targetSets):
but I get the following warning message when I run the setup.py script:
'__pyx_f_11overlapFunc_computeOverlap' defined but not used [-Wunused-function]
and then when I run something that tries to use the function I get an ImportError:
from overlapFunc import computeOverlap
ImportError: cannot import name computeOverlap
Thanks in advance for your help,
Cheers,
Davy
In the following line, the extension module name and the source filename do not match the actual filename.
ext_modules = [Extension("computeOverlapWithGeneList",
["computeOverlapWithGeneList.pyx"])]
Replace it with:
ext_modules = [Extension("overlapFunc",
["overlapFunc.pyx"])]

Simple matplotlib Annotating example not working in Python 2.7

Code
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
t = np.arange(0.0, 5.0, 0.01)
s = np.cos(2*np.pi*t)
line, = ax.plot(t, s, lw=2)
ax.annotate('local max', xy=(2, 1), xytext=(3, 1.5),
arrowprops=dict(facecolor='black', shrink=0.05),
)
ax.set_ylim(-2,2)
plt.show()
from http://matplotlib.org/1.2.0/users/annotations_intro.html
returns
TypeError: 'dict' object is not callable
I managed to fix it with
xxx={'facecolor':'black', 'shrink':0.05}
ax.annotate('local max', xy=(2, 1), xytext=(3, 1.5),
arrowprops=xxx,
)
Is this the best way?
Also, what caused this problem? (I know that this started with Python 2.7.)
So if somebody knows more, please share.
Since the code looks fine and runs ok on my machine, it seems that you may have a variable named "dict" (see this answer for reference). A couple of ideas on how to check:
use Pylint.
if you suspect one specific builtin, try checking its type (type(dict)), or look at the properties/functions it has (dir(dict)).
open a fresh notebook and try again, if you only observe the problem in interactive session.
try alternate syntax to initialise the dictionary
ax.annotate('local max', xy=(2, 1), xytext=(3, 1.5),
arrowprops={'facecolor':'black', 'shrink':0.05})
try explicitly instancing a variable of this type, using the alternate syntax (as you did already).
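The suspected cause can be reproduced in isolation. The following is a small sketch for Python 3 (on Python 2.7 the builtins module is called __builtin__); the function name and the shadowing assignment are hypothetical, standing in for whatever earlier code rebound the name dict in the session.

```python
import builtins

def call_shadowed_dict():
    """Rebind the name dict locally, then try to call it like the builtin."""
    dict = {'facecolor': 'black'}  # hypothetical earlier assignment shadowing the builtin
    try:
        dict(facecolor='black', shrink=0.05)
    except TypeError as exc:
        return str(exc)
    return None

message = call_shadowed_dict()          # "'dict' object is not callable"
builtin_intact = dict is builtins.dict  # True when the name has not been rebound here
```

Running the last check in the affected session shows whether dict still refers to the builtin; if it does not, del dict (or a fresh interpreter) restores it, and the literal syntax {...} keeps working either way because it does not look up the name.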