tuple index error while doing regression fit - python-2.7

I'm writing code to do single-variable linear regression analysis of data using numpy and scikit-learn. I know that LinearRegression.fit() accepts numpy arrays, but the program is throwing a tuple index error and I'm at my wit's end. Here is my code:
from sklearn import linear_model
import numpy as np

def linear_model_main(X_parameter, Y_parameter, prediction_value):
    regression = linear_model.LinearRegression()
    regression.fit(X_parameter, Y_parameter, prediction_value)
    prediction_outcome = regression.predict(prediction_value)
    predictions = {}
    predictions['intercept'] = regression.intercept_
    predictions['coefficient'] = regression.coef_
    predictions['predicted_value'] = prediction_outcome
    return predictions

X, Y = get_data(filename)
Xarr = np.array(X)
Yarr = np.array(Y)
predictionvalue = 70
result = linear_model_main(Xarr, Yarr, predictionvalue)
Xarr and Yarr are np.arrays built from separate columns of a CSV file; they are the X and Y coordinate values for the regression. When printed, they look like this:
[ 7. 73. 49. ..., 56. 56. 56.]
[ 5863. 5860. 5860. ..., 5860. 5860. 5860.]
It is a huge dataset (about 130,000 rows and 35 columns).
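A likely cause (a guess, since no answer was posted with the question): scikit-learn's LinearRegression.fit() expects X as a 2-D array of shape (n_samples, n_features), while Xarr here is 1-D, and accessing the missing second dimension raises exactly this kind of tuple index error; the third positional argument to fit() is also sample_weight, not a prediction value. A minimal sketch of the usual fix, under those assumptions:

# Reshape X to a column vector and keep predict() separate from fit()
Xarr2d = Xarr.reshape(-1, 1)            # shape (n_samples, 1)

regression = linear_model.LinearRegression()
regression.fit(Xarr2d, Yarr)            # fit takes only X and y here
outcome = regression.predict(np.array([[70]]))  # predict also wants 2-D input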

How do I fit a pymc3 model when each person has multiple data points?

I'm trying to practice using pymc3 on the kinds of data I come across in my research, but I'm having trouble thinking through how to fit the model when each person gives me multiple data points and each person comes from a different group (so I'm trying a hierarchical model).
Here's the practice scenario I'm using: suppose we have 2 groups of people, N = 30 in each group. All 60 people go through a 10-question survey, where each person can respond ("1") or not respond ("0") to each question. So, for each person, I have an array of length 10 with 1's and 0's.
To model these data, I assume each person has some latent trait "theta", and each item has a "discrimination" a and a "difficulty" b (this is just a basic item response model), and the probability of responding ("1") is given by: (1 + exp(-a(theta - b)))^(-1) (the logistic function applied to a(theta - b)).
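For concreteness, this item function looks like the following in plain Python (my sketch, matching the "item" function the post refers to later):

import numpy as np

def item(a, b, theta):
    # 2PL item response probability: P(respond "1" | theta, a, b)
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))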
Here is how I tried to fit it using pymc3:
import numpy as np
import pymc3 as pm

traces = {}
for grp in range(2):
    group = prac_data["Group ID"] == grp
    data = prac_data[group]["Response"]
    with pm.Model() as irt:
        # Priors
        a_tmp = pm.Normal('a_tmp', mu=0, sd=1, shape=10)
        a = pm.Deterministic('a', np.exp(a_tmp))
        # We do this transformation since we must have a >= 0
        b = pm.Normal('b', mu=0, sd=1, shape=10)
        # Now for the hyperpriors on the groups:
        theta_mu = pm.Normal('theta_mu', mu=0, sd=1)
        theta_sigma = pm.Uniform('theta_sigma', upper=2, lower=0)
        theta = pm.Normal('theta', mu=theta_mu,
                          sd=theta_sigma, shape=N)
        p = getProbs(Disc, Diff, theta, N)
        y = pm.Bernoulli('y', p=p, observed=data)
        traces[grp] = pm.sample(1000)
The function "getProbs" is supposed to give me an array of probabilities for the Bernoulli random variable, as the probability of responding 1 changes across trials/survey questions for each person. But this method gives me an error because it says to "specify one of p or logit_p", but I thought I did with the function?
Here's the code for "getProbs" in case it's helpful:
def getProbs(Disc, Diff, THETA, Nprt):
    # Get a large array of probabilities for the Bernoulli random variable
    n = len(Disc)
    m = Nprt
    probs = np.array([])
    for th in range(m):
        for t in range(n):
            p = item(Disc[t], Diff[t], THETA[th])
            probs = np.append(probs, p)
    return probs
I added the Nprt parameter because if I tried to get the length of THETA, it would give me an error since it is a FreeRV object. I know I can try and vectorize the "item" function, which is just the logistic function I put above, instead of doing it this way, but that also got me an error when I tried to run it.
I think I can do something with pm.Data to fix this, but the documentation isn't exactly clear to me.
Basically, I'm used to building models in JAGS, where you loop through each data point, but pymc3 doesn't seem to work like that. I'm confused about how to build/index my random variables in the model to make sure that the probabilities change how I'd like them to from trial-to-trial, and to make sure that the parameters I'm estimating correspond to the right person in the right group.
Thanks in advance for any help. I'm pretty new to pymc3 and trying to get the hang of it, and wanted to try something different from JAGS.
EDIT: I was able to solve this by first building the array I needed by looping through the trials, then transforming the array using:
p = theano.tensor.stack(p, axis = 0)
I then put this new variable in as the logit_p argument of the Bernoulli instance and it worked! Here's the updated full model (below, I imported theano.tensor as T):
import numpy as np
import pymc3 as pm
import theano.tensor as T
import arviz as az

group = group.astype('int')
data = prac_data["Response"]
with pm.Model() as irt:
    # Priors
    # Item parameters:
    a = pm.Gamma('a', alpha=1, beta=1, shape=10)  # Discrimination
    b = pm.Normal('b', mu=0, sd=1, shape=10)      # Difficulty
    # Now for the hyperpriors on the groups: shape = 2 as there are 2 groups
    theta_mu = pm.Normal('theta_mu', mu=0, sd=1, shape=2)
    theta_sigma = pm.Uniform('theta_sigma', upper=2, lower=0, shape=2)
    # Individual-level person parameters:
    # group is a 2*N array that lets the model know which
    # theta_mu to use for each theta to estimate
    theta = pm.Normal('theta', mu=theta_mu[group],
                      sd=theta_sigma[group], shape=2*N)
    # Here, we build an array of the probabilities we need for each trial:
    p = np.array([])
    for n in range(2*N):
        for t in range(10):
            x = -a[t]*(theta[n] - b[t])
            p = np.append(p, x)
    # Turn p into a tensor object to pass as an argument to the
    # Bernoulli random variable
    p = T.stack(p, axis=0)
    y = pm.Bernoulli('y', logit_p=p, observed=data)
    # On my computer, this took about 5 minutes to run.
    traces = pm.sample(1000, cores=1)

print(az.summary(traces))  # Summary of parameter distributions
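As an aside, the same double loop can usually be replaced by fancy indexing, which avoids growing the graph node by node. A sketch (my addition, not part of the original post; person_idx and item_idx are hypothetical index arrays mapping each observation to its person and item):

import numpy as np
import pymc3 as pm

# One entry per observation: 2*N people x 10 items, in trial order
person_idx = np.repeat(np.arange(2 * N), 10)  # [0,0,...,0,1,1,...]
item_idx = np.tile(np.arange(10), 2 * N)      # [0,1,...,9,0,1,...]

with pm.Model() as irt_vectorized:
    a = pm.Gamma('a', alpha=1, beta=1, shape=10)
    b = pm.Normal('b', mu=0, sd=1, shape=10)
    theta_mu = pm.Normal('theta_mu', mu=0, sd=1, shape=2)
    theta_sigma = pm.Uniform('theta_sigma', upper=2, lower=0, shape=2)
    theta = pm.Normal('theta', mu=theta_mu[group],
                      sd=theta_sigma[group], shape=2 * N)
    # Same expression as the loop above, built in one vectorized step
    logit_p = -a[item_idx] * (theta[person_idx] - b[item_idx])
    y = pm.Bernoulli('y', logit_p=logit_p, observed=data)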

Joining of curve fitting models

I have 7 quasi-Lorentzian curves fitted to my data, and I would like to join them to make one connected curve. Do you have any ideas how to do this? I've read about CompositeModel in the lmfit documentation, but it's not clear to me how to apply it here.
Here is a sample of my code for two of the fitted curves.
for dataset in [Bxfft]:
    dataset = np.asarray(dataset)
    freqs, psd = signal.welch(dataset, fs=266336/300, window='hamming',
                              nperseg=16192, scaling='spectrum')
    plt.semilogy(freqs[0:-7000], psd[0:-7000]/dataset.size**0, color='r', label='Bx')
    x = freqs[100:-7900]
    y = psd[100:-7900]
    # 8 Hz
    model = Model(lorentzian)
    params = model.make_params(amp=6, cen=5, sig=1, e=0)
    result = model.fit(y, params, x=x)
    final_fit = result.best_fit
    print "8 Hz mode"
    print(result.fit_report(min_correl=0.25))
    plt.plot(x, final_fit, 'k-', linewidth=2)
    # 14 Hz
    x2 = freqs[220:-7780]
    y2 = psd[220:-7780]
    model2 = Model(lorentzian)
    pars2 = model2.make_params(amp=6, cen=10, sig=3, e=0)
    pars2['amp'].value = 6
    result2 = model2.fit(y2, pars2, x=x2)
    final_fit2 = result2.best_fit
    print "14 Hz mode"
    print(result2.fit_report(min_correl=0.25))
    plt.plot(x2, final_fit2, 'k-', linewidth=2)
UPDATE!!!
I've used some hints from user @MNewville, who posted an answer below, and using his code I got a working fit (plot not shown).
My code is similar to his, but extended to cover each peak. What I'm struggling with now is replacing the ready-made LorentzianModel with my own.
The problem is that when I do this, the code gives me a warning like this:
C:\Python27\lib\site-packages\lmfit\printfuncs.py:153: RuntimeWarning: invalid value encountered in double_scalars
  spercent = '({0:.2%})'.format(abs(par.stderr/par.value))
About my own model:
def lorentzian(x, amp, cen, sig, e):
    return (amp*(1-e)) / ((pow((1.0 * x - cen), 2)) + (pow(sig, 2)))

peak1 = Model(lorentzian, prefix='p1_')
peak2 = Model(lorentzian, prefix='p2_')
peak3 = Model(lorentzian, prefix='p3_')

# make composite by adding (or multiplying, etc) components
model = peak1 + peak2 + peak3

# make parameters for the full model, setting initial values
# using the prefixes
params = model.make_params(p1_amp=6, p1_cen=8, p1_sig=1, p1_e=0,
                           p2_amp=16, p2_cen=14, p2_sig=3, p2_e=0,
                           p3_amp=16, p3_cen=21, p3_sig=3, p3_e=0)
The rest of the code is similar to @MNewville's answer below.
A composite model for 3 Lorentzians would look like this:
from lmfit.models import LorentzianModel

peak1 = LorentzianModel(prefix='p1_')
peak2 = LorentzianModel(prefix='p2_')
peak3 = LorentzianModel(prefix='p3_')

# make composite by adding (or multiplying, etc) components
model = peak1 + peak2 + peak3

# make parameters for the full model, setting initial values
# using the prefixes
params = model.make_params(p1_amplitude=10, p1_center=8, p1_sigma=3,
                           p2_amplitude=10, p2_center=15, p2_sigma=3,
                           p3_amplitude=10, p3_center=20, p3_sigma=3)

# perhaps set bounds to prevent peaks from swapping or crazy values
params['p1_amplitude'].min = 0
params['p2_amplitude'].min = 0
params['p3_amplitude'].min = 0
params['p1_sigma'].min = 0
params['p2_sigma'].min = 0
params['p3_sigma'].min = 0
params['p1_center'].min = 2
params['p1_center'].max = 11
params['p2_center'].min = 10
params['p2_center'].max = 18
params['p3_center'].min = 17
params['p3_center'].max = 25

# then do a fit over the full data range
result = model.fit(y, params, x=x)
I think the key parts you were missing were: a) just add models together, and b) use prefix to avoid name collisions of parameters.
I hope that is enough to get you started...
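To actually draw the joined curve after such a fit, something like the following should work (a sketch of my own, assuming the x, y, and result from the answer above): result.best_fit is the sum of the peaks, and ModelResult.eval_components() returns each prefixed peak separately.

import matplotlib.pyplot as plt

plt.semilogy(x, y, 'b.', label='data')
plt.plot(x, result.best_fit, 'k-', linewidth=2, label='composite fit')

# individual peaks, keyed by prefix ('p1_', 'p2_', 'p3_')
for name, comp in result.eval_components(x=x).items():
    plt.plot(x, comp, '--', label=name)

plt.legend()
plt.show()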

How can I implement a joint hyperprior?

I'm trying to recreate results from Bayesian Data Analysis, Third Edition.
Chapter 5, Section 3 concerns tumors in rats: a hierarchical model is fit, and the hyperprior used is not one of the densities included in pymc3.
The hyperprior is a*b*(a+b)^(-2.5). Here is my attempt using pymc3:
import pymc3 as pm

with pm.Model() as model:
    def ab_dist(x):
        # Should be the log density, from what I have read
        a = x[0]
        b = x[1]
        return a + b - 5/2*(a + b)

    ab = pm.DensityDist('ab', ab_dist, shape=2, testval=[2, 2])
    a = ab[0]
    b = ab[1]
    theta = pm.Beta('theta', alpha=a, beta=b)
    Y = pm.Binomial('y', n=n, p=theta, observed=y)
At this stage, I get the following error:
ValueError: Input dimension mis-match. (input[0].shape[0] = 71, input[1].shape[0] = 20000)
What have I done wrong? Have I correctly implemented the density?
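A hedged sketch of what the log density could look like (a guess at a fix, not a posted answer): taking the stated hyperprior a*b*(a+b)^(-2.5) at face value, its log is log(a) + log(b) - 2.5*log(a+b). Note also that in Python 2 the expression 5/2 evaluates to 2 because of integer division.

import theano.tensor as tt

def ab_logp(x):
    # log of a*b*(a+b)^(-2.5); tt keeps it differentiable for pymc3
    a = x[0]
    b = x[1]
    return tt.log(a) + tt.log(b) - 2.5 * tt.log(a + b)

# The 71-vs-20000 shape mismatch also hints that theta may need one entry
# per experiment, e.g. pm.Beta('theta', alpha=a, beta=b, shape=len(y)).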

Splitting dataset

I am new to OpenCV and Python.
I am trying to create a sudoku solver in OpenCV and want to use this image as my dataset for recognizing the digits in the sudoku.
I want the entire image to be used as dataset.
(Image: sudoku digits dataset, dimensions 468x108.)
This image has 39 numbers in a row and 9 such rows [ 1 .. 9 ]
import cv2
import numpy as np

image = cv2.imread('images/digits_sudoku4.png')
image = cv2.resize(image, None, fx=2, fy=2)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

cells = [np.hsplit(row, 39) for row in np.vsplit(gray, 12)]

# Convert the list to a numpy array
x = np.array(cells)
print("The shape of our cells array: " + str(x.shape))

train = x.astype(np.float32)
# Create labels for train and test data
k = [ 1, 2, 3, 4, 5, 6, 7, 8, 9]
train_labels = np.repeat( k, 468)[:, np.newaxis]
# Initiate kNN, train the data, then test it with test data for k=3
knn = cv2.KNearest()
knn.train(train, train_labels)
#ret, result, neighbors, distance = knn.find_nearest(test, k=3)
# Now we check the accuracy of classification
# For that, compare the result with test_labels and check which are wrong
'''
matches = result == test_labels
correct = np.count_nonzero(matches)
accuracy = correct * (100.0 / result.size)
print("Accuracy is = %.2f" % accuracy + "%")
'''
cv2.imshow( 'Sudoku', gray)
cv2.waitKey(0)
cv2.destroyAllWindows()
I am facing this error on the line knn.train(train, train_labels):
cv2.error: /build/opencv-SviWsf/opencv-2.4.9.1+dfsg/modules/ml/src/inner_functions.cpp:857: error: (-5) train data must be floating-point matrix in function cvCheckTrainData
Please help me out.
Thank you.
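A hedged guess at a fix (my addition, not from the original post): cv2.KNearest in OpenCV 2.4 expects a 2-D float32 matrix with one flattened sample per row, but np.array(cells) here is 4-D, which is what cvCheckTrainData rejects. A sketch, assuming the cells split evenly into nine equal digit blocks so the label count matches the sample count:

# Flatten each cell image into one row: (n_samples, cell_h * cell_w)
train = x.reshape(-1, x.shape[2] * x.shape[3]).astype(np.float32)

# One float32 label per sample; assumes samples split evenly across 9 digits
k = np.arange(1, 10)
train_labels = np.repeat(k, train.shape[0] // 9)[:, np.newaxis].astype(np.float32)

knn = cv2.KNearest()
knn.train(train, train_labels)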

High-resolution Reanalysis data

When I extract data from a Reanalysis netCDF file (variable: pressure (SLP), 01/01/2014), the data are very high resolution (9 km grid), which makes the resulting image quite noisy. I would like to put the data onto a lower-resolution grid (e.g. 1 degree). I'm trying to use the meshgrid and griddata functions, but due to inexperience I am unable to make it work. Does anyone know how to solve this? Thank you.
from netCDF4 import Dataset
import numpy as np
from scipy.interpolate import griddata

file = Dataset('slp_2014_01_01.nc', 'r')

# Printing variables
print ' '
print ' '
print '----------------------------------------------------------'
for i, variable in enumerate(file.variables):
    print ' '+str(i), variable
    if i == 2:
        current_variable = variable
print ' '
print 'Variable: ', current_variable.upper()
print 'File name: ', file_name

lat = file.variables['lat'][:]
lon = file.variables['lon'][:]
slp = file.variables['slp'][:]

lon_i = np.linspace(lon[0], lon[len(lon)-1], num=len(lon)*2, endpoint=True, retstep=False)
lat_i = np.linspace(lat[0], lat[len(lat)-1], num=len(lat)*2, endpoint=True, retstep=False)
lon_grid, lat_grid = np.meshgrid(lon_i, lat_i)

temp_slp = np.asarray(slp).squeeze()
new_slp = temp_slp.reshape(temp_slp.size)

slp_grid = griddata((lon, lat), new_slp, (lon_grid, lat_grid), method='cubic')
As I mentioned, I tried to use the meshgrid and griddata functions, but got the following error:
Traceback (most recent call last):
File "REANALYSIS_LOCAL.py", line 346, in
lon,lat,time,var,variavel_atual=netCDF_builder_local(caminho_netcdf_local,nome_arquivo,dt)
File "REANALYSIS_LOCAL.py", line 143, in netCDF_builder_local
slp_grid = griddata((lon, lat), new_slp, (lon_grid, lat_grid),method='cubic')
File "/home/carlos/anaconda/lib/python2.7/site-packages/scipy/interpolate/ndgriddata.py", line 182, in griddata
points = _ndim_coords_from_arrays(points)
File "interpnd.pyx", line 176, in scipy.interpolate.interpnd._ndim_coords_from_arrays (scipy/interpolate/interpnd.c:4064)
File "/home/carlos/anaconda/lib/python2.7/site-packages/numpy/lib/stride_tricks.py", line 101, in broadcast_arrays
"incompatible dimensions on axis %r." % (axis,))
ValueError: shape mismatch: two or more arrays have incompatible dimensions on axis 0.
The dimensions of variables are:
lon: (144,)
lat: (73,)
lon_i: (288,)
lat_i: (146,)
lon_grid: (146, 288)
lat_grid: (146, 288)
new_slp: (10512,)
The values in new_slp are:
new_slp: [ 102485. 102485. 102485. ..., 100710. 100710. 100710.]
The purpose is to increase the number of values in the variables (lon, lat and slp), so that the grid is finer and the result more detailed (more points).
For example, the variable lat has these points:
Original dimension of variable lat: (73,)
lat: [ 90. 87.5 85. 82.5 80. 77.5 75. 72.5 70. 67.5 65. 62.5
60. 57.5 55. 52.5 50. 47.5 45. 42.5 40. 37.5 35. 32.5
30. 27.5 25. 22.5 20. 17.5 15. 12.5 10. 7.5 5. 2.5
0. -2.5 -5. -7.5 -10. -12.5 -15. -17.5 -20. -22.5 -25. -27.5
-30. -32.5 -35. -37.5 -40. -42.5 -45. -47.5 -50. -52.5 -55. -57.5
-60. -62.5 -65. -67.5 -70. -72.5 -75. -77.5 -80. -82.5 -85. -87.5
-90. ]
With the line lat_i = np.linspace(lat[0], lat[len(lat)-1], num=len(lat)*2, endpoint=True, retstep=False) I doubled the number of points in the lat variable, giving lat_i with shape (146,):
lat _i: [ 90. 88.75862069 87.51724138 86.27586207 85.03448276 83.79310345 82.55172414 81.31034483 80.06896552 78.82758621 77.5862069
...
-78.82758621 -80.06896552 -81.31034483 -82.55172414 -83.79310345 -85.03448276 -86.27586207 -87.51724138 -88.75862069 -90. ]
The idea I need is the same as in this code, where x plays the role of lon, y of lat, and z of slp:
from scipy.interpolate import griddata
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(1., 10., 20)
y = np.linspace(1., 10., 20)
z = np.random.random(20)
xi = np.linspace(1., 10., 40)
yi = np.linspace(1., 10., 40)
X, Y = np.meshgrid(xi, yi)
Z = griddata((x, y), z, (X, Y), method='nearest')
plt.contourf(X, Y, Z)
Depending on your final purpose, you can use cdo to regrid the whole file:
cdo remapbil,r360x180 infile outfile
or just plot every second or third value from the original file, like this:
plt.pcolormesh(lon[::2,::2], lat[::2,::2], var1[::2,::2])
The error message you show just says that the dimensions do not match; print the shape of your variables before the error appears and try to get it working.
Why doesn't your code work?
Your chosen method requires the input coordinates as lon, lat pairs for the data points, not mesh coordinates: if you have data points with shape (10000,), your coordinates must have shape (10000, 2), not (100, 100).
But as griddata is meant for unstructured data, it will not be efficient for your purpose; I suggest using something like scipy.interpolate.RegularGridInterpolator.
In any case, if you need the interpolated data more than once, I suggest creating new netCDF files with cdo and processing those, instead of interpolating the data each time you run your script.
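A minimal sketch of the RegularGridInterpolator route (my addition, assuming lat and lon are the 1-D axes and temp_slp is the 2-D (lat, lon) field from the question):

import numpy as np
from scipy.interpolate import RegularGridInterpolator

# The interpolator wants ascending axes; reanalysis lat often runs 90 -> -90
if lat[0] > lat[-1]:
    lat = lat[::-1]
    temp_slp = temp_slp[::-1, :]

interp = RegularGridInterpolator((lat, lon), temp_slp, method='linear')

# Target grid with twice the resolution
lat_i = np.linspace(lat[0], lat[-1], 2 * len(lat))
lon_i = np.linspace(lon[0], lon[-1], 2 * len(lon))
lat_g, lon_g = np.meshgrid(lat_i, lon_i, indexing='ij')

# Query points as (npoints, 2) pairs in the same (lat, lon) axis order
pts = np.column_stack([lat_g.ravel(), lon_g.ravel()])
slp_i = interp(pts).reshape(lat_g.shape)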
Thanks for your help. Really, my problem was about dimensions. I'm learning to work with oceanographic data, and I solved the problem with this code:
lonbounds = [25, 59]
latbounds = [-10, -33]

# longitude lower and upper index
lonli = np.argmin(np.abs(lon - lonbounds[0]))
lonui = np.argmin(np.abs(lon - lonbounds[1]))
# latitude lower and upper index
latli = np.argmin(np.abs(lat - latbounds[0]))
latui = np.argmin(np.abs(lat - latbounds[1]))

# restrict to the region of interest
lon_f = file.variables['lon'][lonli:lonui]
lat_f = file.variables['lat'][latli:latui]
slp_f = file.variables['slp'][0, latli:latui, lonli:lonui]

# build a mesh of the filtered coordinates for use with griddata
lon_f_grid, lat_f_grid = np.meshgrid(lon_f, lat_f)

# flatten everything to 1-D, as griddata expects point data
lon_f1 = lon_f_grid.reshape(lon_f_grid.size)
lat_f1 = lat_f_grid.reshape(lat_f_grid.size)
slp_f1 = slp_f.reshape(slp_f.size)

# increase the resolution of longitude and latitude (1000 points each)
lon_r = np.linspace(lon_f[0], lon_f[len(lon_f)-1], num=1000, endpoint=True, retstep=False)
lat_r = np.linspace(lat_f[0], lat_f[len(lat_f)-1], num=1000, endpoint=True, retstep=False)

# mesh of the higher-resolution target grid
lon_r_grid, lat_r_grid = np.meshgrid(lon_r, lat_r)

# apply griddata to obtain pressure (SLP) at the higher resolution
slp_r = griddata((lon_f1, lat_f1), slp_f1, (lon_r_grid, lat_r_grid), method='cubic')
Hugs,
Carlos.