Theano get unique values in a tensor - python-2.7

I have a tensor which I convert into a vector by flattening. Now I want to remove the duplicate values in this vector. How can I do this? What is the equivalent of numpy.unique() in Theano?
x1 = T.itensor3('x1')
y1 = T.flatten(x1)
#z1 = T.unique() How do I do this?
For example, my tensor may be: [1,1,2,3,3,4,4,5,1,3,4]
and I want: [1,2,3,4,5]

EDIT: this is now available in Theano: http://deeplearning.net/software/theano/library/tensor/extra_ops.html#theano.tensor.extra_ops.Unique
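With a recent enough Theano, a minimal sketch of using that Unique op on the flattened tensor from the question could look like this (the example values are simply taken from the question):

import numpy as np
import theano
import theano.tensor as T
from theano.tensor.extra_ops import Unique

x1 = T.itensor3('x1')
y1 = T.flatten(x1)
z1 = Unique()(y1)  # Unique is an Op class: instantiate it, then apply it to the vector
f = theano.function([x1], z1)

# data = np.array([[[1, 1, 2, 3], [3, 4, 4, 5], [1, 3, 4, 4]]], dtype='int32')
# f(data) -> array([1, 2, 3, 4, 5], dtype=int32)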
This question was also asked on the theano-users mailing list. The conclusion is that this is one of the NumPy functions that isn't wrapped in Theano. As the gradient isn't needed, it can be wrapped quickly. Here is an example that expects the output to have the same type as the input.
import numpy
import theano
from theano.compile.ops import as_op

@as_op(itypes=[theano.tensor.imatrix],
       otypes=[theano.tensor.imatrix])
def numpy_unique(a):
    return numpy.unique(a)
More doc about as_op is available here: http://deeplearning.net/software/theano/tutorial/extending_theano.html#as-op-example
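For completeness, a small usage sketch of the wrapped op. Note that numpy.unique actually returns a 1-D array, so in practice the output type would be an ivector rather than an imatrix; that tweak is an assumption on my part, not part of the original answer:

import numpy
import theano
import theano.tensor as T
from theano.compile.ops import as_op

@as_op(itypes=[T.imatrix], otypes=[T.ivector])
def numpy_unique(a):
    return numpy.unique(a)

x = T.imatrix('x')
f = theano.function([x], numpy_unique(x))

# data = numpy.array([[1, 1, 2, 3], [3, 4, 4, 5]], dtype='int32')
# f(data) -> array([1, 2, 3, 4, 5], dtype=int32)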

Related

SymPy integration of Matrix with multivariable entries

I am using Sympy to integrate a Sympy Matrix whose components depend on variables (x,y). Integrating with respect to a single variable x (or y) works, and returns the expected Matrix whose components are the integrals of the components of the original vector.
import sympy as sp
from sympy.abc import x,y
V = sp.Matrix(4,1,[1,x,y,x*y])
display(V)
# This works
I = sp.integrate(V,(x,0,1))
display(I)
Ultimately, I would like a double integral. I can accomplish this with the following
Ix = sp.integrate(V,(x,0,1))
I = sp.integrate(Ix,(y,0,1))
display(I)
My question is why the following does not seem to work.
I = sp.integrate(V,(x,0,1),(y,0,1))
The error I get is :
ValueError: Invalid limits given: (((x, 0, 1), (y, 0, 1)),)
Is this a bug? Or am I using the wrong syntax for the double integral with a Matrix type? This syntax works on components of the Matrix, i.e.
# This works
I3 = sp.integrate(V[3,0],(x,0,1),(y,0,1))
Thanks for confirming that this was a bug in SymPy. It has now been fixed; see https://github.com/sympy/sympy/pull/23277.
The other suggestion - using
I = V.integrate((x,0,1),(y,0,1))
may even be a nicer solution.
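On older SymPy versions that still have the bug, a minimal element-wise workaround is also possible; using Matrix.applyfunc here is my own suggestion, not from the answers above:

import sympy as sp
from sympy.abc import x, y

V = sp.Matrix(4, 1, [1, x, y, x*y])

# integrate each entry separately, with both limits at once
I = V.applyfunc(lambda e: sp.integrate(e, (x, 0, 1), (y, 0, 1)))
# Matrix([[1], [1/2], [1/2], [1/4]])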

How to Use MCMC with a Custom Log-Probability and Solve for a Matrix

The code is in PyMC3, but this is a general problem. I want to find which matrix (combination of variables) gives me the highest probability. Taking the mean of the trace of each element is meaningless because they depend on each other.
Here is a simple case; the code uses a vector rather than a matrix for simplicity. The goal is to find a vector of length 2, where each value is between 0 and 1 and the sum is 1.
import numpy as np
import theano
import theano.tensor as tt
import pymc3 as mc
# define a theano Op for our likelihood function
class LogLike_Matrix(tt.Op):
    itypes = [tt.dvector]  # expects a vector of parameter values when called
    otypes = [tt.dscalar]  # outputs a single scalar value (the log likelihood)

    def __init__(self, loglike):
        self.likelihood = loglike  # the log-p function

    def perform(self, node, inputs, outputs):
        # the method that is used when calling the Op
        theta, = inputs  # this will contain my variables
        # call the log-likelihood function
        logl = self.likelihood(theta)
        outputs[0][0] = np.array(logl)  # output the log-likelihood
def logLikelihood_Matrix(data):
    """
    We want sum(data) = 1
    """
    p = 1 - np.abs(np.sum(data) - 1)
    return np.log(p)
logl_matrix = LogLike_Matrix(logLikelihood_Matrix)

# use PyMC3 to sample from the log-likelihood
with mc.Model():
    """
    Data will be sampled randomly with a uniform distribution
    because the log-p doesn't work on it
    """
    data_matrix = mc.Uniform('data_matrix', shape=(2), lower=0.0, upper=1.0)

    # convert m and c to a tensor vector
    theta = tt.as_tensor_variable(data_matrix)

    # use a DensityDist (use a lambda function to "call" the Op)
    mc.DensityDist('likelihood_matrix', lambda v: logl_matrix(v), observed={'v': theta})

    trace_matrix = mc.sample(5000, tune=100, discard_tuned_samples=True)
If you only want the highest likelihood parameter values, then you want the Maximum A Posteriori (MAP) estimate, which can be obtained using pymc3.find_MAP() (see starting.py for method details). If you expect a multimodal posterior, then you will likely need to run this repeatedly with different initializations and select the one that obtains the largest logp value; that still only increases the chances of finding the global optimum, but cannot guarantee it.
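As a rough sketch for the model above (reusing logl_matrix and assuming the same model block as in the question):

with mc.Model():
    data_matrix = mc.Uniform('data_matrix', shape=(2), lower=0.0, upper=1.0)
    theta = tt.as_tensor_variable(data_matrix)
    mc.DensityDist('likelihood_matrix', lambda v: logl_matrix(v), observed={'v': theta})

    # MAP estimate: the single parameter vector with the highest posterior probability
    map_estimate = mc.find_MAP()

print(map_estimate['data_matrix'])  # a vector whose entries sum to (approximately) 1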
It should be noted that at high parameter dimensions, the MAP estimate is usually not part of the typical set, i.e., it is not representative of typical parameter values that would lead to the observed data. Michael Betancourt discusses this in A Conceptual Introduction to Hamiltonian Monte Carlo. The fully Bayesian approach is to use posterior predictive distributions, which effectively averages over all the high-likelihood parameter configurations rather than using a single point estimate for parameters.

Fitting multiple data sets using lmfit without writing an objective function

This topic describes how to fit multiple data-sets using lmfit:
Python and lmfit: How to fit multiple datasets with shared parameters?
However, it uses a fitting/objective function written by the user.
I was wondering if it's possible to fit multiple data-sets using lmfit without writing an objective function, using the model.fit() method of the Model class instead.
As an example: let's say we have multiple data sets of (x,y) coordinates that we want to fit using the same model function, in order to find the set of parameters that on average fits all the data best.
import numpy as np
from lmfit import Model, Parameters
from lmfit.models import GaussianModel
def gauss(x, amp, cen, sigma):
    return amp*np.exp(-(x-cen)**2/(2.*sigma**2))

x1 = np.arange(0., 100., 0.1)
x2 = np.arange(0., 100., 0.09)
y1 = gauss(x1, 1., 50., 5.) + np.random.normal(size=len(x1), scale=0.1)
y2 = gauss(x2, 0.8, 48.4, 4.5) + np.random.normal(size=len(x2), scale=0.1)
mod = GaussianModel()
params = mod.make_params()
mod.fit([y1, y2], params, x=[x1, x2])
I guess if this is possible, the data has to be passed to mod.fit in the right type. The documentation only says that mod.fit takes an array-like data input.
I tried to give it lists and arrays. If I pass the different data sets as a list I get a ValueError: setting an array element with a sequence.
If I pass an array I get an AttributeError: 'numpy.ndarray' object has no attribute 'exp'.
So am I just trying to do something that isn't possible or am I doing something wrong?
Well, I think the answer is "sort of". The lmfit.Model class is meant to represent a model for an array of data. So, if you can map your multiple datasets into a numpy ndarray (say, with np.concatenate), you can probably write a Model function to represent this by building sub-models for the different datasets and concatenating them in the same way.
I don't think you could do that with any of the built-in models. I also think that once you start down the road of writing complex model functions, it isn't a very big jump to writing objective functions. That is, what would be
def model_function(x, a, b, c):
    ### do some calculation with x, a, b, c values
    result = a + x*b + x*x*c
    return result

might become

def objective_function(params, x, data):
    vals = params.valuesdict()
    return data - model_function(x, vals['a'], vals['b'], vals['c'])
If that model calculation is doing anything complex, the additional burden of unpacking the parameters and subtracting the data is pretty small. And, especially if some parameters would be used for multiple datasets and some only for particular datasets, you'll have to manage that in either the model function or the objective function. In the example you link to, my answer included a loop over datasets, picking out parameters by name for each dataset. You'll probably want to do something like that. You could probably do that in a model function by thinking of it as modeling the concatenated datasets, but I'm not sure you'd really gain a lot by doing that.
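For concreteness, a minimal sketch of that kind of objective function for the two Gaussian datasets in the question. The per-dataset parameter names (amp_0, cen_0, ...) and the choice of sharing only sigma are illustrative assumptions:

import numpy as np
from lmfit import Parameters, minimize

def gauss(x, amp, cen, sigma):
    return amp*np.exp(-(x-cen)**2/(2.*sigma**2))

def objective(params, x_list, data_list):
    v = params.valuesdict()
    residuals = []
    for i, (x, data) in enumerate(zip(x_list, data_list)):
        # per-dataset amplitude and center, shared width
        model = gauss(x, v['amp_%d' % i], v['cen_%d' % i], v['sigma'])
        residuals.append(data - model)
    # datasets may have different lengths; concatenating gives one flat residual vector
    return np.concatenate(residuals)

params = Parameters()
params.add('sigma', value=5.0, min=0.01)
for i in range(2):
    params.add('amp_%d' % i, value=1.0, min=0.0)
    params.add('cen_%d' % i, value=50.0)

# result = minimize(objective, params, args=([x1, x2], [y1, y2]))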
I found the problem. Actually model.fit() will handle arrays of multiple data sets just fine and perform a proper fit. The correct call of model.fit() with multiple data sets would be:
import numpy as np
from lmfit import Model, Parameters
from lmfit.models import GaussianModel
import matplotlib.pyplot as plt
def gauss(x, amp, cen, sigma):
    "basic gaussian"
    return amp*np.exp(-(x-cen)**2/(2.*sigma**2))
x1= np.arange(0.,100.,0.1)
x2= np.arange(0.,100.,0.1)
y1= gauss(x1, 1.,50.,5.)+ np.random.normal(size=len(x1), scale=0.01)
y2= gauss(x2, 0.8,48.4,4.5)+ np.random.normal(size=len(x2), scale=0.01)
mod= GaussianModel()
params= mod.make_params()
params['amplitude'].set(1.,min=0.01,max=100.)
params['center'].set(1.,min=0.01,max=100.)
params['sigma'].set(1.,min=0.01,max=100.)
result = mod.fit(np.array([y1, y2]), params, method='basinhopping',
                 x=np.array([x1, x2]))
print(result.fit_report(min_correl=0.5))
fig, ax = plt.subplots()
plt.plot(x1,y1, lw=2, color='red')
plt.plot(x2,y2, lw=2, color='orange')
plt.plot(x1,result.eval(x=x1), lw=2, color='black')
plt.show()
The problem with the original code actually lies in the fact that my data sets don't have the same length. However, I'm not sure how to handle that in the most elegant way.
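Following the np.concatenate suggestion in the first answer, one way to handle datasets of different lengths with a single shared model is simply to flatten them into one long array before fitting; a sketch:

x_all = np.concatenate([x1, x2])
y_all = np.concatenate([y1, y2])
result = mod.fit(y_all, params, x=x_all)
print(result.fit_report())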

'DataFlowAnalysis' object has no attribute 'op_MAKE_FUNCTION' in Numba

I haven't seen this specific scenario in my research on this error in Numba. This is my first time using the package, so it might be something obvious.
I have a function that calculates engineered features in a data set by adding, multiplying and/or dividing each column in a dataframe called data, and I wanted to test whether Numba would speed it up:
from itertools import combinations
import numpy as np
import pandas as pd
from numba import jit

@jit
def engineer_features(engineer_type, features, joined):
    # choose which features to engineer (must be > 1)
    engineered = features
    if len(engineered) > 1:
        if 'Square' in engineer_type:
            sq = data[features].apply(np.square)
            sq.columns = map(lambda s: s + '_^2', features)
        for c1, c2 in combinations(engineered, 2):
            if 'Add' in engineer_type:
                data['{0}+{1}'.format(c1, c2)] = data[c1] + data[c2]
            if 'Multiply' in engineer_type:
                data['{0}*{1}'.format(c1, c2)] = data[c1] * data[c2]
            if 'Divide' in engineer_type:
                data['{0}/{1}'.format(c1, c2)] = data[c1] / data[c2]
        if 'Square' in engineer_type and len(sq) > 0:
            data = pd.merge(data, sq, left_index=True, right_index=True)
    return data
When I call it with lists of features, engineer_type and the dataset:
engineer_type = ['Square','Add','Multiply','Divide']
df = engineer_features(engineer_type,features,joined)
I get the error: Failed at object (analyzing bytecode)
'DataFlowAnalysis' object has no attribute 'op_MAKE_FUNCTION'
Same question here. I think the problem might be the lambda function since numba does not support function creation.
I had this same error. Numba doesn't support pandas. I converted the important columns from my pandas df into a bunch of arrays and it worked successfully under @jit.
Also, arrays are much faster than a pandas df, in case you need it for processing large data.
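As a rough sketch of that approach for the Add case (the column names and the kernel itself are illustrative, not the original code):

import numpy as np
from numba import jit

@jit(nopython=True)
def add_columns(a, b):
    # plain NumPy arrays, so Numba can compile this loop in nopython mode
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        out[i] = a[i] + b[i]
    return out

# pull the columns out of the dataframe first, then put the result back
# a = data['col1'].values
# b = data['col2'].values
# data['col1+col2'] = add_columns(a, b)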

Adding data to a Pandas dataframe

I have a dataframe that contains Physician_Profile_City, Physician_Profile_State and Physician_Profile_Zip_Code. I ultimately want to stratify an analysis based on state, but unfortunately not all of the Physician_Profile_States are filled in. I started looking around to try and figure out how to fill in the missing states. I came across the pyzipcode module, which can take a zip code as input and return the state, as follows:
In [39]: from pyzipcode import ZipCodeDatabase
zcdb = ZipCodeDatabase()
zipcode = zcdb[54115]
zipcode.state
Out[39]: u'WI'
What I'm struggling with is how I would iterate through the dataframe and add the appropriate "Physician_Profile_State" when that variable is missing. Any suggestions would be most appreciated.
No need to iterate. If the data is in the form of a dict, then you should be able to do the following:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].map(zcdb)
Otherwise you can call apply like so:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].apply(lambda x: zcdb[x].state)
In the case where the above won't work because it can't generate a Series to align with your df, you can apply row-wise, passing axis=1 to the df:
df['Physician_Profile_State'] = df[['Physician_Profile_Zip_Code']].apply(lambda x: zcdb[x].state, axis=1)
By using double square brackets we return a df, allowing you to pass the axis param.
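Building on those answers, a sketch that only fills in the rows where the state is actually missing (df is assumed to be the question's dataframe, and the try/except is an assumption about how pyzipcode reports unknown zip codes):

import numpy as np
from pyzipcode import ZipCodeDatabase

zcdb = ZipCodeDatabase()

def state_from_zip(zip_code):
    try:
        return zcdb[zip_code].state
    except (KeyError, IndexError):  # zip code not in the database
        return np.nan

missing = df['Physician_Profile_State'].isnull()
df.loc[missing, 'Physician_Profile_State'] = (
    df.loc[missing, 'Physician_Profile_Zip_Code'].apply(state_from_zip)
)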