Saving data from traceplot in PyMC3 - python-2.7

Below is the code for a simple Bayesian Linear regression. After I obtain the trace and the plots for the parameters, is there any way in which I can save the data that created the plots in a file so that if I need to plot it again I can simply plot it from the data in the file rather than running the whole simulation again?
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0,9,5)
y = 2*x + 5
yerr=np.random.rand(len(x))
def soln(x, p1, p2):
return p1+p2*x
with pm.Model() as model:
# Define priors
intercept = pm.Normal('Intercept', 15, sd=5)
slope = pm.Normal('Slope', 20, sd=5)
# Model solution
sol = soln(x, intercept, slope)
# Define likelihood
likelihood = pm.Normal('Y', mu=sol,
sd=yerr, observed=y)
# Sampling
trace = pm.sample(1000, nchains = 1)
pm.traceplot(trace)
print pm.summary(trace, ['Slope'])
print pm.summary(trace, ['Intercept'])
plt.show()

There are two easy ways of doing this:
Use a version after 3.4.1 (currently this means installing from master, with pip install git+https://github.com/pymc-devs/pymc3). There is a new feature that allows saving and loading traces efficiently. Note that you need access to the model that created the trace:
...
pm.save_trace(trace, 'linreg.trace')
# later
with model:
trace = pm.load_trace('linreg.trace')
Use cPickle (or pickle in python 3). Note that pickle is at least a little insecure, don't unpickle data from untrusted sources:
import cPickle as pickle # just `import pickle` on python 3
...
with open('trace.pkl', 'wb') as buff:
pickle.dump(trace, buff)
#later
with open('trace.pkl', 'rb') as buff:
trace = pickle.load(buff)

Update for someone like me who is still coming over to this question:
load_trace and save_trace functions were removed. Since version 4.0 even the deprecation waring for these functions were removed.
The way to do it is now to use arviz:
with model:
trace = pymc.sample(return_inferencedata=True)
trace.to_netcdf("filename.nc")
And it can be loaded with:
trace = arviz.from_netcdf("filename.nc")

This way works for me :
# saving trace
pm.save_trace(trace=trace_nb, directory=r"c:\Users\xxx\Documents\xxx\traces\trace_nb")
# loading saved traces
with model_nb:
t_nb = pm.load_trace(directory=r"c:\Users\xxx\Documents\xxx\traces\trace_nb")

Related

(imageio or celery) Error closing: 'Image' object has no attribute 'fp'

I am using imageio to write png images to file.
import numpy as np
import matplotlib.cm as cm
import imageio # for saving the image
import matplotlib as mpl
hm_colors = ['blue', 'white','red']
cmap = mpl.colors.LinearSegmentedColormap.from_list('bwr', hm_colors)
data = np.array([[1,2,3],[5,6,7]])
norm = mpl.colors.Normalize(vmin=-3, vmax=3)
colormap = cm.ScalarMappable(norm=norm, cmap=cmap)
im = colormap.to_rgba(data)
# scale the data to a width of w pixels
im = np.repeat(im, w, axis=1)
im = np.repeat(im, h, axis=0)
# save the picture
imageio.imwrite("my_img.png", im)
This process is performed automatically and I noticed some Error messages saying:
Error closing: 'Image' object has no attribute 'fp'.
Before this message I get warning:
/usr/local/lib/python2.7/dist-packages/imageio/core/util.py:78: UserWarning: Lossy conversion from float64 to uint8, range [0, 1] dtype_str, out_type.__name__))
However, the images seem to be generated and saved just fine.
I can't find data to recreate this message.
Any idea why I get this error and why it doesn't noticeably affect the results? I don't use PIL.
One possible reason could come from using this in Celery.
Thanks!
L.
I encountered the same issue using imageio.imwrite in Python 3.5. It's a fairly harmless except for the fact that that it's stopping garbage collection and leading to excessive memory usage when writing thousands of images. The solution was to use the PIL module, which is a dependency of imageio. The last line of your code should read:
from PIL import Image
image = Image.fromarray(im)
image.save('my_img.png')

memory leak unpickle dict of pandas DataFrames

When I pickle a dictionary of dataframes and then unpickle them again, I experience a kind of memory leak. After the unpickled variable is dereferenced, the memory is only released partially. Calling gc.collect() does not help. I have created the following minimal exmaple:
import pickle
import numpy as np
import pandas as pd
new = np.zeros((1000, 100))
new = pd.DataFrame(new)
cc = {ix: new.copy() for ix in range(500)}
pickle.dump(cc, open('/tmp/test21', 'wb'))
Now I open a clean python session and do
import pickle
# memory consumption is around 40MB
data = pickle.load(open('/tmp/test21'))
# memory consumption goes to 991MB
data = None
# memory consumption goes to 776MB
This is pandas 0.19.2 and python 2.7.13. The problem seems to be the interaction between pickle, dictionary and pandas. If I remove the line new = pd.DataFrame(new), the problem does not occur. If I simply make a large df without a dictionary, the problem does not occur. If I don't pickle the dictionary and set cc = None, the problem does not occur. I have also tested the problem with pandas 0.14.1 and python 2.7.13. Finally the problem appears with both pickle and cPickle.
What could be the reason or a strategy to analyze this further? Any help is much appreciated!

How can I specify a non-theano based likelihood?

I saw a post from a few days ago by someone else: pymc3 likelihood math with non-theano function. Even though I think the problem at its core is the same, I thought I would ask with a simpler example:
Inside logp_wrap, I put some made up definition of a likelihood function. It depends on the rv and an observation. In this case I could do this with theano operations, but let's say that I want this function to be more complex and so I cannot use theano.
The problem comes when I try to define the likelihood both in terms of an RV and observations. From what I have seen, this format would work if I was specifying everything in 'logp_wrap' as theano operations.
I have searched around for a solution to this, but haven't found anything where this problem is fully addressed.
The problem in my attempt to do this is actually that the logp_ function is correctly decorated, but the logp_wrap function is only correctly decorated for its input, and not for its output, so I get the error
TypeError: 'TensorVariable' object is not callable.
Would be great if someone had a solution - don't think I am the only one with this problem.
The theano version of this that works (and uses the same function within a function definition) without the #as_op code is here: https://pymc-devs.github.io/pymc3/notebooks/lda-advi-aevb.html?highlight=densitydist (Specifically the sections: "Log-likelihood of documents for LDA" and "LDA model section")
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
"""
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymc3 as pm
from theano import as_op
import theano.tensor as T
from scipy.stats import norm
#Some data that we observed
g_observed = [0.0, 1.0, 2.0, 3.0]
#Define a function to calculate the logp without using theano
#This as_op is where the problem is - the input is an rv but the output is a
#function.
#as_op(itypes=[T.dscalar],otypes=[T.dscalar])
def logp_wrap(rv):
#We are not using theano so we wrap the function.
#as_op(itypes=[T.dvector],otypes=[T.dscalar])
def logp_(ob):
#Some made up likelihood -
#The key here is that lp depends on the rv input and the observations
lp = np.log(norm.pdf(rv + ob))
return lp
return logp_
hb1_model = pm.Model()
with hb1_model:
I_mean = pm.Normal('I_mean', mu=0.1, sd=0.05)
xs = pm.DensityDist('x', logp_wrap(I_mean),observed = g_observed)
with hb1_model:
step = pm.Metropolis()
trace = pm.sample(1000, step)

Tensorflow high-level Estimator with input_fn from external file reader

[short summary: how to use TF high-level Estimator on Python with an external file reader? or with feed_dict?]
Been struggling with this for few days, couldn't find any solution on-line...
I'm using TF high-level modules (tf.contrib.learn.Estimator on tf1.0, or tf.estimator.Estimator on tf1.1),
features and targets (x/y) inputted through an input_fn, and the graph built on the model_fn.
Already trained a nn on 'small' data sets, in which the whole input is the part of the graph, using slice_input_producer etc. (I can push an example to github if it serves ppl here).
I try to train a larger nn on 'heavier' data-sets (10s-100s GB).
I have an external Python reader that does some nasty binary file reading, which I really don't want to get into.
This reader has its own queue.Queue with m1 samples. When I use it to extract the m1 {features} & {targets}, the net simply saves all these samples as const. in the first layer of the graph... completely undesired.
I try to either -
feed the output of the external file reader as input to my graph.
define a proper tf queue object that will keep updating the queue (each time a sample is dequeued, i want a completely other sample to be enqueued).
Reminding that I use the "high level", e.g.
self.Estimator = tf.contrib.learn.Estimator(
model_fn=self.model_fn,
model_dir=self.config['model_dir'],
config=tf.contrib.learn.RunConfig( ... ) )
def input_fn(self, mode):
batch_data = self.data[mode].next() # pops out a batch of samples, as numpy 4D matrices
... # some processing of batch data
features_dict = dict(data=batch_data.pop('data'))
targets_dict = batch_data
return features_dict, targets_dict
self.Estimator.fit(input_fn=lambda: self.input_fn(modekeys.TRAIN))
Attached is a final solution for integrating an external reader into the high-level TF api (tf.contrib.learn.Estimator / tf.estimator.Estimator).
Please note:
the architecture and "logic" is not important. it's a stupid simple net.
the external reader outputs a dictionary of numpy matrices.
the input_fn is using this reader.
In order to verify that the reader "pulls new values", I both
save the recent value to self.status (should be > 1.0)
save a summary, to be viewed in tensorboard.
Code example is in gist, and below.
import tensorflow as tf
import numpy as np
modekeys = tf.contrib.learn.ModeKeys
tf.logging.set_verbosity(tf.logging.DEBUG)
# Tested on python 2.7.9, tf 1.1.0
class inputExample:
def __init__(self):
self.status = 0.0 # tracing which value was recently 'pushed' to the net
self.model_dir = 'temp_dir'
self.get_estimator()
def input_fn(self):
# returns features and labels dictionaries as expected by tf Estimator's model_fn
data, labels = tf.py_func(func=self.input_fn_np, inp=[], Tout=[tf.float32, tf.float32], stateful=True)
data.set_shape([1,3,3,1]) # shapes are unknown and need to be set for integrating into the network
labels.set_shape([1,1,1,1])
return dict(data=data), dict(labels=labels)
def input_fn_np(self):
# returns a dictionary of numpy matrices
batch_data = self.reader()
return batch_data['data'], batch_data['labels']
def model_fn(self, features, labels, mode):
# using tf 2017 convention of dictionaries of features/labels as inputs
features_in = features['data']
labels_in = labels['labels']
pred_layer = tf.layers.conv2d(name='pred', inputs=features_in, filters=1, kernel_size=3)
tf.summary.scalar(name='label', tensor=tf.squeeze(labels_in))
tf.summary.scalar(name='pred', tensor=tf.squeeze(pred_layer))
loss = None
if mode != modekeys.INFER:
loss = tf.losses.mean_squared_error(labels=labels_in, predictions=pred_layer)
train_op = None
if mode == modekeys.TRAIN:
train_op = tf.contrib.layers.optimize_loss(
loss=loss,
learning_rate = 0.01,
optimizer = 'SGD',
global_step = tf.contrib.framework.get_global_step()
)
predictions = {'estim_exp': pred_layer}
return tf.contrib.learn.ModelFnOps(mode=mode, predictions=predictions, loss=loss, train_op=train_op)
def reader(self):
self.status += 1
if self.status > 1000.0:
self.status = 1.0
return dict(
data = np.random.randn(1,3,3,1).astype(dtype=np.float32),
labels = np.sin(np.ones([1,1,1,1], dtype=np.float32)*self.status)
)
def get_estimator(self):
self.Estimator = tf.contrib.learn.Estimator(
model_fn = self.model_fn,
model_dir = self.model_dir,
config = tf.contrib.learn.RunConfig(
save_checkpoints_steps = 10,
save_summary_steps = 10,
save_checkpoints_secs = None
)
)
if __name__ == '__main__':
ex = inputExample()
ex.Estimator.fit(input_fn=ex.input_fn)
You can use tf.constant if you have the training data already in python memory as shown in the abalone TF example: https://github.com/tensorflow/tensorflow/blob/r1.1/tensorflow/examples/tutorials/estimators/abalone.py#L138-L141
Note: copying the data from disk to Python to TensorFlow is often less efficient than constructing an input pipeline in TensorFlow (i.e. loading data from disk directly into TensorFlow Tensors), such as using tf.contrib.learn.datasets.base.load_csv_without_header.

Displaying PNG in matplotlib.pyplot framework in python 2.7

I am pulling PNG images from Jupyter Notebooks and manage to display with IPython.display.Image but not with matplotib.pyplot.plt. What am I missing? I use python 2.7.
I am using the following algorithm:
To open the notebook JSON content I do:
import nbformat
notebook_ = nbformat.read(file_notebook, 4)
After retrieving the relevant cell information I pull the png information from it using:
def cell_to_image(cell, out_value_item_number=1):
if "execution_count" in cell.keys(): # i.e version >=4
return cell["outputs"][out_value_item_number]['data']['image/png']
elif "prompt_number" in cell.keys(): # i.e version < 4
return cell["outputs"][out_value_item_number]['png']
return None
cell_image = cell_to_image(cell)
The first few characters of cell_image (which is unicode) looks like:
iVBORw0KGgoAAAANSUhEUgAAA64AAAFMCAYAAADLFeHSAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\n
AAALEgAACxIB0t1+/AAAIABJREFUeJzs3Xd8jef/x/HXyTjZiYQkCGrU3ruR0tr9oq2qGtGo0dbe
\nm5pVlJpFUSMoVb6UoEZ/lCpatWuPUiNEEiMDmef3R75OexonJKUO3s/HI4/mXPd1X/d1f+LRR965
\n7/u6DSaTyYSIiIiIiIiIjbJ70hMQERERERERyYiCq4iIiIiIiNg0BVcRERERERGxaQquIiIiIiIi
\nYtMUXEVERERERMSmKbiKiIiIiIiITVNwFRGRxyIkJIRixYqxfv36+24/e/YsxYoVo3jx4v/yzGxb
\naGgoderUIS4uDoBdu3bRsmVLKlasyCuvvMKgQYOIjo622CcsLIyGDRtSunRp6tSpw8KFC62OW7p0
\naRo2bJju53Lnzh1GjRrFyy+/TNmyZWnRogW//fbbQ835q6++olGjRpQvX5769eszc+ZMkpOTzdtT
\nU1OZNGkSNWrUoHTp0jRp0oTdu3enGyc2NpZOn
I can easily plot in my Jupityer notebook using
from IPython.display import Image
Image(cell_image)
And now to my question:
How can I manipulate cell_image to be plt.subplot friendly?
(Assuming import matplotlib.pyplot as plt).
I realise that plt.imshow wouldn't work because this would require an array, which is not my case (which is a string, as far as I understand).
If you have your image string representation in a variable string_rep, the following code should work.
from io import BytesIO
import matplotlib.image as mpimage
import matplotlib.pyplot as plt
with BytesIO(string_rep.decode('base64')) as byte_rep:
image = mpimage.imread(byte_rep)
plt.imshow(image)