Fitting multiple data sets using lmfit without writing an objective function - python-2.7

This topic describes how to fit multiple data-sets using lmfit:
Python and lmfit: How to fit multiple datasets with shared parameters?
However, it uses a fitting/objective function written by the user.
I was wondering if it's possible to fit multiple data sets with lmfit without writing an objective function, using the model.fit() method of the Model class instead.
As an example: let's say we have multiple data sets of (x, y) coordinates that we want to fit using the same model function in order to find the set of parameters that on average fits all the data best.
import numpy as np
from lmfit import Model, Parameters
from lmfit.models import GaussianModel

def gauss(x, amp, cen, sigma):
    return amp * np.exp(-(x - cen)**2 / (2. * sigma**2))

x1 = np.arange(0., 100., 0.1)
x2 = np.arange(0., 100., 0.09)
y1 = gauss(x1, 1., 50., 5.) + np.random.normal(size=len(x1), scale=0.1)
y2 = gauss(x2, 0.8, 48.4, 4.5) + np.random.normal(size=len(x2), scale=0.1)

mod = GaussianModel()
params = mod.make_params()
mod.fit([y1, y2], params, x=[x1, x2])
I guess that if this is possible, the data has to be passed to mod.fit with the right type. The documentation only says that mod.fit takes an array-like data input.
I tried to give it lists and arrays. If I pass the different data sets as a list, I get a ValueError: setting an array element with a sequence.
If I pass an array, I get an AttributeError: 'numpy.ndarray' has no attribute 'exp'.
So am I just trying to do something that isn't possible or am I doing something wrong?

Well, I think the answer is "sort of". The lmfit.Model class is meant to represent a model for an array of data. So, if you can map your multiple datasets into a numpy ndarray (say, with np.concatenate), you can probably write a Model function to represent this by building sub-models for the different datasets and concatenating them in the same way.
I don't think you could do that with any of the built-in models. I also think that once you start down the road of writing complex model functions, it isn't a very big jump to writing objective functions. That is, what would be
def model_function(x, a, b, c):
    ### do some calculation with x, a, b, c values
    result = a + x*b + x*x*c
    return result
might become
def objective_function(params, x, data):
    vals = params.valuesdict()
    return data - model_function(x, vals['a'], vals['b'], vals['c'])
If that model function is doing anything complex, the additional burden of unpacking the parameters and subtracting the data is pretty small. And, especially if some parameters are used for multiple datasets and some only for particular datasets, you'll have to manage that in either the model function or the objective function. In the example you link to, my answer included a loop over datasets, picking out parameters by name for each dataset; you'll probably want to do something like that (see the sketch below). You could probably do that in a model function by thinking of it as modeling the concatenated datasets, but I'm not sure you'd really gain a lot by doing that.
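For what it's worth, a minimal sketch of that loop-over-datasets approach might look like the following. The per-dataset parameter names (amp_1, cen_1, ...) and the shared sigma are just illustrative choices on my part, not anything prescribed by lmfit:

import numpy as np
from lmfit import Parameters, minimize

def gauss(x, amp, cen, sigma):
    return amp * np.exp(-(x - cen)**2 / (2. * sigma**2))

def objective(params, x_sets, y_sets):
    # build one residual per dataset, then concatenate them into a 1-D array
    vals = params.valuesdict()
    residuals = []
    for i, (x, y) in enumerate(zip(x_sets, y_sets), start=1):
        model = gauss(x, vals['amp_%d' % i], vals['cen_%d' % i], vals['sigma'])
        residuals.append(y - model)
    return np.concatenate(residuals)

# synthetic data, as in the question (different lengths are fine here)
x1 = np.arange(0., 100., 0.1)
x2 = np.arange(0., 100., 0.09)
y1 = gauss(x1, 1.0, 50.0, 5.0) + np.random.normal(size=len(x1), scale=0.1)
y2 = gauss(x2, 0.8, 48.4, 4.5) + np.random.normal(size=len(x2), scale=0.1)
x_sets, y_sets = [x1, x2], [y1, y2]

params = Parameters()
params.add('sigma', value=3.0, min=0.1)          # shared between datasets
for i, (amp0, cen0) in enumerate([(1.0, 50.0), (0.8, 48.0)], start=1):
    params.add('amp_%d' % i, value=amp0, min=0.0)
    params.add('cen_%d' % i, value=cen0)

result = minimize(objective, params, args=(x_sets, y_sets))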

I found the problem. Actually model.fit() will handle arrays of multiple data sets just fine and perform a proper fit. The correct call of model.fit() with multiple data sets would be:
import numpy as np
from lmfit import Model, Parameters
from lmfit.models import GaussianModel
import matplotlib.pyplot as plt

def gauss(x, amp, cen, sigma):
    "basic gaussian"
    return amp * np.exp(-(x - cen)**2 / (2. * sigma**2))

x1 = np.arange(0., 100., 0.1)
x2 = np.arange(0., 100., 0.1)
y1 = gauss(x1, 1., 50., 5.) + np.random.normal(size=len(x1), scale=0.01)
y2 = gauss(x2, 0.8, 48.4, 4.5) + np.random.normal(size=len(x2), scale=0.01)

mod = GaussianModel()
params = mod.make_params()
params['amplitude'].set(1., min=0.01, max=100.)
params['center'].set(1., min=0.01, max=100.)
params['sigma'].set(1., min=0.01, max=100.)

result = mod.fit(np.array([y1, y2]), params, method='basinhopping',
                 x=np.array([x1, x2]))
print(result.fit_report(min_correl=0.5))

fig, ax = plt.subplots()
plt.plot(x1, y1, lw=2, color='red')
plt.plot(x2, y2, lw=2, color='orange')
plt.plot(x1, result.eval(x=x1), lw=2, color='black')
plt.show()
The problem in the original code actually lies in the fact that my data sets don't have the same length. However, I'm still not sure what the most elegant way to handle data sets of different lengths would be.
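Following the concatenation idea from the first answer, one possible (untested, illustrative) way to deal with different lengths is to model the concatenated data directly; the function name two_gaussians, its parameter names, and the split index n1 are my own choices here, not part of lmfit:

import numpy as np
from lmfit import Model

def gauss(x, amp, cen, sigma):
    return amp * np.exp(-(x - cen)**2 / (2. * sigma**2))

x1 = np.arange(0., 100., 0.1)
x2 = np.arange(0., 100., 0.09)              # a different length is fine here
y1 = gauss(x1, 1.0, 50.0, 5.0) + np.random.normal(size=len(x1), scale=0.01)
y2 = gauss(x2, 0.8, 48.4, 4.5) + np.random.normal(size=len(x2), scale=0.01)
n1 = len(x1)                                # where the concatenated data splits

def two_gaussians(x, amp1, cen1, amp2, cen2, sigma):
    # x is the concatenation of both independent variables; split it back
    # apart, evaluate each dataset's model, then concatenate the results
    return np.concatenate([gauss(x[:n1], amp1, cen1, sigma),
                           gauss(x[n1:], amp2, cen2, sigma)])

mod = Model(two_gaussians)
params = mod.make_params(amp1=1.0, cen1=50.0, amp2=1.0, cen2=50.0, sigma=3.0)
result = mod.fit(np.concatenate([y1, y2]), params, x=np.concatenate([x1, x2]))
print(result.fit_report())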

Related

How to Use MCMC with a Custom Log-Probability and Solve for a Matrix

The code is in PyMC3, but this is a general problem. I want to find which matrix (combination of variables) gives me the highest probability. Taking the mean of the trace of each element is meaningless because they depend on each other.
Here is a simple case; the code uses a vector rather than a matrix for simplicity. The goal is to find a vector of length 2, where each value is between 0 and 1 and the sum is 1.
import numpy as np
import theano
import theano.tensor as tt
import pymc3 as mc

# define a theano Op for our likelihood function
class LogLike_Matrix(tt.Op):
    itypes = [tt.dvector]  # expects a vector of parameter values when called
    otypes = [tt.dscalar]  # outputs a single scalar value (the log likelihood)

    def __init__(self, loglike):
        self.likelihood = loglike  # the log-p function

    def perform(self, node, inputs, outputs):
        # the method that is used when calling the Op
        theta, = inputs  # this will contain my variables
        # call the log-likelihood function
        logl = self.likelihood(theta)
        outputs[0][0] = np.array(logl)  # output the log-likelihood

def logLikelihood_Matrix(data):
    """
    We want sum(data) = 1
    """
    p = 1 - np.abs(np.sum(data) - 1)
    return np.log(p)

logl_matrix = LogLike_Matrix(logLikelihood_Matrix)

# use PyMC3 to sample from the log-likelihood
with mc.Model():
    """
    Data will be sampled randomly with a uniform distribution
    because the log-p doesn't work on it
    """
    data_matrix = mc.Uniform('data_matrix', shape=(2), lower=0.0, upper=1.0)

    # convert m and c to a tensor vector
    theta = tt.as_tensor_variable(data_matrix)

    # use a DensityDist (use a lambda function to "call" the Op)
    mc.DensityDist('likelihood_matrix', lambda v: logl_matrix(v), observed={'v': theta})
    trace_matrix = mc.sample(5000, tune=100, discard_tuned_samples=True)
If you only want the highest likelihood parameter values, then you want the Maximum A Posteriori (MAP) estimate, which can be obtained using pymc3.find_MAP() (see starting.py for method details). If you expect a multimodal posterior, you will likely need to run this repeatedly with different initializations and select the result with the largest logp value; that still only increases the chances of finding the global optimum, though, and cannot guarantee it.
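As a minimal, untested illustration of the find_MAP() suggestion, reusing the model (and the logl_matrix Op) from the question; Powell is chosen here only because the custom Op defines no gradient:

import pymc3 as mc
import theano.tensor as tt

# logl_matrix is the LogLike_Matrix instance from the question's code
with mc.Model():
    data_matrix = mc.Uniform('data_matrix', shape=(2), lower=0.0, upper=1.0)
    theta = tt.as_tensor_variable(data_matrix)
    mc.DensityDist('likelihood_matrix', lambda v: logl_matrix(v), observed={'v': theta})

    # gradient-free optimizer, since the black-box Op has no grad implementation
    map_estimate = mc.find_MAP(method='Powell')

print(map_estimate['data_matrix'])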
It should be noted that at high parameter dimensions, the MAP estimate is usually not part of the typical set, i.e., it is not representative of typical parameter values that would lead to the observed data. Michael Betancourt discusses this in A Conceptual Introduction to Hamiltonian Monte Carlo. The fully Bayesian approach is to use posterior predictive distributions, which effectively averages over all the high-likelihood parameter configurations rather than using a single point estimate for parameters.

How to send namedtuples between transformations?

How can I send out a row as a namedtuple from one Bonobo transformation, so that the receiving transformation has field-level access to the row data?
I'm now using dicts to send data between transformations. But they have a disadvantage: they're mutable (bad things can happen if you forget to create a fresh one at the output of a transformation).
I thought that simply replacing a dict with a namedtuple would do the trick, but apparently Bonobo doesn't support sending out a namedtuple. I read something about context.set_output_fields([list of keys]), but can't figure out how to use it. A small example would be great!
Using a namedtuple is very straightforward: you can yield a namedtuple instance and receive it, expanded into fields, as the next transformation's input:
import bonobo
import collections

Hero = collections.namedtuple("Hero", ["name", "power"])

def produce():
    yield Hero(name="Road Runner", power="speed")
    yield Hero(name="Wile E. Coyote", power="traps")
    yield Hero(name="Guido", power="dutch")

def consume(name, power):
    print(name, "has", power, "power")

def get_graph():
    graph = bonobo.Graph()
    graph >> produce >> consume
    return graph

if __name__ == "__main__":
    with bonobo.parse_args() as options:
        bonobo.run(get_graph())
The "output fields" of produce() will be set from the namedtuple fields, and the "input fields" of consume(...) will be detected from the first input row.
The context.set_output_fields(...) method is only useful if, for whatever reason, you don't want to use named data structures (like namedtuples) but prefer plain tuples, yet still need to name the values in the tuple; a sketch of that variant is below.
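If you do want that tuple-based variant, something along these lines should work (untested sketch; it assumes the @use_context decorator from bonobo.config, which injects the node execution context as the first argument):

import bonobo
from bonobo.config import use_context

@use_context
def produce(context):
    # name the positional values this node sends downstream
    context.set_output_fields(["name", "power"])
    yield "Road Runner", "speed"
    yield "Wile E. Coyote", "traps"

def consume(name, power):
    print(name, "has", power, "power")

def get_graph():
    graph = bonobo.Graph()
    graph >> produce >> consume
    return graph

if __name__ == "__main__":
    with bonobo.parse_args() as options:
        bonobo.run(get_graph())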
Hope that helps!

'DataFlowAnalysis' object has no attribute 'op_MAKE_FUNCTION' in Numba

I haven't seen this specific scenario in my research for this error in Numba. This is my first time using the package so it might be something obvious.
I have a function that calculates engineered features in a data set by adding, multiplying and/or dividing each column in a dataframe called data, and I wanted to test whether Numba would speed it up:
from itertools import combinations

import numpy as np
import pandas as pd
from numba import jit

@jit
def engineer_features(engineer_type, features, joined):
    # choose which features to engineer (must be > 1)
    engineered = features
    if len(engineered) > 1:
        if 'Square' in engineer_type:
            sq = data[features].apply(np.square)
            sq.columns = map(lambda s: s + '_^2', features)
        for c1, c2 in combinations(engineered, 2):
            if 'Add' in engineer_type:
                data['{0}+{1}'.format(c1, c2)] = data[c1] + data[c2]
            if 'Multiply' in engineer_type:
                data['{0}*{1}'.format(c1, c2)] = data[c1] * data[c2]
            if 'Divide' in engineer_type:
                data['{0}/{1}'.format(c1, c2)] = data[c1] / data[c2]
        if 'Square' in engineer_type and len(sq) > 0:
            data = pd.merge(data, sq, left_index=True, right_index=True)
    return data
When I call it with lists of features, engineer_type and the dataset:
engineer_type = ['Square','Add','Multiply','Divide']
df = engineer_features(engineer_type,features,joined)
I get the error: Failed at object (analyzing bytecode)
'DataFlowAnalysis' object has no attribute 'op_MAKE_FUNCTION'
Same question here. I think the problem might be the lambda function, since Numba does not support function creation.
I had this same error. Numba doesn't support pandas. I converted the important columns from my pandas DataFrame into a bunch of NumPy arrays and it worked successfully under @jit.
Also, arrays are much faster than a pandas DataFrame, in case you need them for processing large data.
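As a rough sketch of that array-based approach (the column names and the helper add_multiply_divide are just illustrative, not from the original post), the idea is to pull the raw NumPy arrays out of the DataFrame and let Numba compile a purely numeric function:

import numpy as np
import pandas as pd
from numba import njit

@njit
def add_multiply_divide(a, b):
    # purely numeric work on plain arrays, which Numba can compile
    return a + b, a * b, a / b

df = pd.DataFrame({'f1': np.random.rand(1000), 'f2': np.random.rand(1000)})

# pull the raw arrays out of the DataFrame before calling the jitted function
s, p, q = add_multiply_divide(df['f1'].values, df['f2'].values)

# put the engineered features back into the DataFrame afterwards
df['f1+f2'], df['f1*f2'], df['f1/f2'] = s, p, q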

Datetime in python - speed of calculations - big data

I want to find the difference (in days) between two columns in a dataframe (more specifically in the graphlab SFrame datastructure).
I have tried to write a couple of functions to do this but I cannot seem to create a function that is fast enough. Speed is my issue right now as I have ~80 million rows to process.
I have tried two different functions but both are too slow:
The t2_colname_str and t1_colname_str arguments are the names of the columns I want to use, and both columns contain datetime.datetime objects.
For Loop
def diff_days(sframe_obj, t2_colname_str, t1_colname_str):
    import graphlab as gl
    import datetime as datetime
    # creating the new column name to be used later
    new_colname = str(t2_colname_str[:-9] + "_DiffDays_" + t1_colname_str[:-9])
    diff_days_list = []
    for i in range(len(sframe_obj[t2_colname_str])):
        t2 = sframe_obj[t2_colname_str][i]
        t1 = sframe_obj[t1_colname_str][i]
        try:
            diff = t2 - t1
            diff_days = diff.days
            diff_days_list.append(diff_days)
        except TypeError:
            diff_days_list.append(None)
    sframe_obj[new_colname] = gl.SArray(diff_days_list)
List Comprehension
I know this is not the intended purpose of list comprehensions, but I just tried it to see if it was faster.
def diff_days(sframe_obj, t2_colname_str, t1_colname_str):
    import graphlab as gl
    import datetime as datetime
    # creating the new column name to be used later
    new_colname = str(t2_colname_str[:-9] + "_DiffDays_" + t1_colname_str[:-9])
    diff_days_list = [(sframe_obj[t2_colname_str][i] - sframe_obj[t1_colname_str][i]).days
                      if sframe_obj[t2_colname_str][i] is not None and sframe_obj[t1_colname_str][i] is not None
                      else None
                      for i in range(len(sframe_obj[t2_colname_str]))]
    sframe_obj[new_colname] = gl.SArray(diff_days_list)
Additional Notes
I have been using GraphLab Create by Dato and their SFrame data structure mainly because it parallelizes all the computation, which makes my analysis super fast, and it has a great library for machine-learning applications. It's a great product if you haven't checked it out already.
GraphLab User Guide can be found here: https://dato.com/learn/userguide/index.html
I'm glad you found an approach that works for you; however, SArrays allow vectorized operations, so you don't need to loop through every element of the column. SArrays will iterate, but they're REALLY slow at that.
Unfortunately, SArrays don't support vector operations on datetime types because they don't support a "timedelta" type. You can do this though:
diff = sframe_obj[t2_colname].astype(int) - sframe_obj[t1_colname].astype(int)
That will convert the columns to a UNIX timestamp and then do a vectorized difference operation, which should be plenty fast...at least faster than a conversion to NumPy.
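Since astype(int) gives you UNIX timestamps in seconds (per the answer above), getting the difference in days is then just a division by 86400. A small, untested sketch, assuming an SFrame with datetime columns named 't1' and 't2':

import datetime
import graphlab as gl

sf = gl.SFrame({'t1': [datetime.datetime(2016, 1, 1)],
                't2': [datetime.datetime(2016, 3, 1)]})

# astype(int) converts the datetime columns to UNIX timestamps (seconds)
diff_seconds = sf['t2'].astype(int) - sf['t1'].astype(int)

# 86400 seconds per day; truncate back to an integer number of days
sf['diff_days'] = (diff_seconds / 86400).astype(int)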

Theano get unique values in a tensor

I have a tensor which I convert into a vector by flattening, and now I want to remove the duplicate values in this vector. How can I do this? What is the equivalent of numpy.unique() in Theano?
x1 = T.itensor3('x1')
y1 = T.flatten(x1)
#z1 = T.unique() How do I do this?
For example, my tensor may be: [1,1,2,3,3,4,4,5,1,3,4]
and I want : [1,2,3,4,5]
EDIT: this is now available in Theano: http://deeplearning.net/software/theano/library/tensor/extra_ops.html#theano.tensor.extra_ops.Unique
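Assuming a reasonably recent Theano, a minimal (untested) sketch using that extra_ops op on the flattened tensor from the question might look like:

import theano
import theano.tensor as T
from theano.tensor.extra_ops import Unique

x1 = T.itensor3('x1')
y1 = T.flatten(x1)
z1 = Unique()(y1)        # behaves like numpy.unique on the flattened tensor

f = theano.function([x1], z1)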
This question was also asked on the theano-user mailing list. The conclusion is that this is one of the NumPy functions that isn't wrapped in Theano. Since he doesn't need the gradient, it can be wrapped quickly with as_op; you just have to declare the input and output types up front. Here is an example (note that numpy.unique returns a flattened 1-D array, so the output is declared as a vector).
import numpy
import theano
from theano.compile.ops import as_op

@as_op(itypes=[theano.tensor.imatrix],
       otypes=[theano.tensor.ivector])   # numpy.unique returns a flattened 1-D array
def numpy_unique(a):
    return numpy.unique(a)
More doc about as_op is available here: http://deeplearning.net/software/theano/tutorial/extending_theano.html#as-op-example
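For completeness, a small, untested usage sketch of the as_op-wrapped function above (the example input values are just illustrative):

import numpy as np
import theano
import theano.tensor as T

# numpy_unique is the as_op-wrapped function defined above
a = T.imatrix('a')
u = numpy_unique(a)

f = theano.function([a], u)
print(f(np.array([[1, 1, 2], [3, 3, 4]], dtype='int32')))   # -> [1 2 3 4]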