Why am I getting a dimension mismatch in my PyMC3 hierarchical model?

This is essentially the "Multiple Coins from Multiple Mints / Baseball Players" example from Doing Bayesian Data Analysis, Second Edition (DBDA2). I believe I have two functionally equivalent PyMC3 formulations, but one works and the other does not. This is with PyMC version 3.5. In more detail:
Let's say I have the following data. Each row is an observation:
observations_dict = {
    'mint': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'coin': [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7],
    'outcome': [1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1]
}
observations = pd.DataFrame(observations_dict)
observations
One Mint, Several Coins
The below, which implements DBDA2 Figure 9.7, runs just fine:
num_coins = observations['coin'].nunique()
coin_idx = observations['coin']
with pm.Model() as hierarchical_model:
    # mint is characterized by omega and kappa
    omega = pm.Beta('omega', 1., 1.)
    kappa_minus2 = pm.Gamma('kappa_minus2', 0.01, 0.01)
    kappa = pm.Deterministic('kappa', kappa_minus2 + 2)
    # each coin is described by a theta
    theta = pm.Beta('theta', alpha=omega*(kappa-2)+1, beta=(1-omega)*(kappa-2)+1, shape=num_coins)
    # define the likelihood
    y = pm.Bernoulli('y', theta[coin_idx], observed=observations['outcome'])
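For what it's worth, sampling from this model behaves as expected; a typical call would be (a sketch, draw and tune counts arbitrary):
with hierarchical_model:
    trace = pm.sample(2000, tune=1000)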
Many Mints, Many Coins
However, once this is turned into a hierarchical model (as seen in DBDA2 Figure 9.13):
num_mints = observations['mint'].nunique()
mint_idx = observations['mint']
num_coins = observations['coin'].nunique()
coin_idx = observations['coin']
with pm.Model() as hierarchical_model2:
    # Hyperparameters
    omega = pm.Beta('omega', 1, 1)
    kappa_minus2 = pm.Gamma('kappa_minus2', 0.01, 0.01)
    kappa = pm.Deterministic('kappa', kappa_minus2 + 2)
    # Parameters for mints
    omega_c = pm.Beta('omega_c',
                      omega*(kappa-2)+1, (1-omega)*(kappa-2)+1,
                      shape=num_mints)
    kappa_c_minus2 = pm.Gamma('kappa_c_minus2',
                              0.01, 0.01,
                              shape=num_mints)
    kappa_c = pm.Deterministic('kappa_c', kappa_c_minus2 + 2)
    # Parameters for coins
    theta = pm.Beta('theta',
                    omega_c[mint_idx]*(kappa_c[mint_idx]-2)+1,
                    (1-omega_c[mint_idx])*(kappa_c[mint_idx]-2)+1,
                    shape=num_coins)
    y2 = pm.Bernoulli('y2', p=theta[coin_idx], observed=observations['outcome'])
The error is:
ValueError: operands could not be broadcast together with shapes (8,) (20,)
as the model has 8 thetas for 8 coins but sees 20 rows of data.
However, if the data is grouped such that each line represents the final statistics of an individual coin, as with the following
grouped = observations.groupby(['mint', 'coin']).agg({'outcome': [np.sum, np.size]}).reset_index()
grouped.columns = ['mint', 'coin', 'heads', 'total']
And the final likelihood variable is changed to a Binomial, as follows
num_mints = grouped['mint'].nunique()
mint_idx = grouped['mint']
num_coins = grouped['coin'].nunique()
coin_idx = grouped['coin']
with pm.Model() as hierarchical_model2:
    # Hyperparameters
    omega = pm.Beta('omega', 1, 1)
    kappa_minus2 = pm.Gamma('kappa_minus2', 0.01, 0.01)
    kappa = pm.Deterministic('kappa', kappa_minus2 + 2)
    # Parameters for mints
    omega_c = pm.Beta('omega_c',
                      omega*(kappa-2)+1, (1-omega)*(kappa-2)+1,
                      shape=num_mints)
    kappa_c_minus2 = pm.Gamma('kappa_c_minus2',
                              0.01, 0.01,
                              shape=num_mints)
    kappa_c = pm.Deterministic('kappa_c', kappa_c_minus2 + 2)
    # Parameters for coins
    theta = pm.Beta('theta',
                    omega_c[mint_idx]*(kappa_c[mint_idx]-2)+1,
                    (1-omega_c[mint_idx])*(kappa_c[mint_idx]-2)+1,
                    shape=num_coins)
    y2 = pm.Binomial('y2', n=grouped['total'], p=theta, observed=grouped['heads'])
Everything works. Now, the latter form is more efficient and generally preferred, but I believe the former should work as well. So I believe this is primarily a PyMC3 issue (or even more likely, a user error).
To quote DBDA Edition 1,
"The BUGS model uses a binomial likelihood distribution for total
correct, instead of using the Bernoulli distribution for individual
trials. This use of the binomial is just a convenience for shortening
the program. If the data were specified as trial-by-trial outcomes
instead of as total correct, then the model could include a
trial-by-trial loop and use a Bernoulli likelihood function"
What bothers me is that in the very first example (One Mint, Several Coins), it looks like PyMC3 can handle individual observations instead of aggregated observations just fine. So I believe the first form should work, but doesn't.
Code
http://nbviewer.jupyter.org/github/JWarmenhoven/DBDA-python/blob/master/Notebooks/Chapter%209.ipynb
References
PyMC3 - Differences in ways observations are passed to model -> difference in results?
https://discourse.pymc.io/t/pymc3-differences-in-ways-observations-are-passed-to-model-difference-in-results/501
http://www.databozo.com/deep-in-the-weeds-complex-hierarchical-models-in-pymc3
https://stats.stackexchange.com/questions/157521/is-this-correct-hierarchical-bernoulli-model

The length of mint_idx was 20 (one for each observation), but it should have been 8 (one for each coin).
Working answer, notice the mint_idx recalculation (rest remains the same):
grouped = observations.groupby(['mint', 'coin']).agg({'outcome': [np.sum, np.size]}).reset_index()
grouped.columns = ['mint', 'coin', 'heads', 'total']
num_mints = grouped['mint'].nunique()
mint_idx = grouped['mint']
num_coins = observations['coin'].nunique()
coin_idx = observations['coin']
with pm.Model() as hierarchical_model2:
    # Hyperparameters
    omega = pm.Beta('omega', 1, 1)
    kappa_minus2 = pm.Gamma('kappa_minus2', 0.01, 0.01)
    kappa = pm.Deterministic('kappa', kappa_minus2 + 2)
    # Parameters for mints
    omega_c = pm.Beta('omega_c',
                      omega*(kappa-2)+1, (1-omega)*(kappa-2)+1,
                      shape=num_mints)
    kappa_c_minus2 = pm.Gamma('kappa_c_minus2',
                              0.01, 0.01,
                              shape=num_mints)
    kappa_c = pm.Deterministic('kappa_c', kappa_c_minus2 + 2)
    # Parameters for coins
    theta = pm.Beta('theta',
                    omega_c[mint_idx]*(kappa_c[mint_idx]-2)+1,
                    (1-omega_c[mint_idx])*(kappa_c[mint_idx]-2)+1,
                    shape=num_coins)
    y2 = pm.Bernoulli('y2', p=theta[coin_idx], observed=observations['outcome'])
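As an aside, the per-coin mint index can also be built straight from the raw observations, without the full aggregation (a sketch, assuming each coin belongs to exactly one mint):
mint_idx = observations.groupby('coin')['mint'].first().values  # length 8: the mint of each coin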
Many thanks to @junpenglao!
https://discourse.pymc.io/t/why-cant-i-use-a-bernoulli-as-a-likelihood-variable-in-a-hierarchical-model-in-pymc3/2022/2

Related

Extract Optimal Features from Recursive Feature Elimination (RFE)

I have a dataset consisting of categorical and numerical data with 124 features. In order to reduce its dimensionality I want to remove irrelevant features. However, to run the dataset against a feature selection algorithm I one hot encoded it with get_dummies, which increased the number of features to 391.
In[16]:
X_train.columns
Out[16]:
Index([u'port_7', u'port_9', u'port_13', u'port_17', u'port_19', u'port_21',
       ...
       u'os_cpes.1_2', u'os_cpes.1_1'], dtype='object', length=391)
With the resulting data I can run recursive feature elimination with cross validation, as per the Scikit Learn example:
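The setup followed the linked example roughly like this (a sketch from memory; the SVC estimator, scoring, and fold count mirror what I use later, and y_train stands in for my label vector):
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(3), scoring='accuracy')
rfecv.fit(X_train, y_train)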
Which produces:
[Figure: cross-validated score vs. number of features selected]
Given that the optimal number of features identified was 8, how do I identify the feature names? I am assuming that I can extract them into a new DataFrame for use in a classification algorithm?
[EDIT]
I have achieved this as follows, with help from this post:
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]

feature_index = []
features = []
column_index(X_dev_train, X_dev_train.columns.values)
for num, i in enumerate(rfecv.get_support(), start=0):
    if i:
        feature_index.append(str(num))
for num, i in enumerate(X_dev_train.columns.values, start=0):
    if str(num) in feature_index:
        features.append(X_dev_train.columns.values[num])
print("Features Selected: {}\n".format(len(feature_index)))
print("Features Indexes: \n{}\n".format(feature_index))
print("Feature Names: \n{}".format(features))
which produces:
Features Selected: 8
Features Indexes:
['5', '6', '20', '26', '27', '28', '67', '98']
Feature Names:
['port_21', 'port_22', 'port_199', 'port_512', 'port_513', 'port_514', 'port_3306', 'port_32768']
Given that one-hot encoding introduces multicollinearity, I don't think the target column selection is ideal, because the features it has chosen are non-encoded continuous data features. I have tried re-adding the target column unencoded, but RFE throws the following error because the data is categorical:
ValueError: could not convert string to float: Wireless Access Point
Do I need to group multiple one hot encoded feature columns to act as the target?
[EDIT 2]
If I simply LabelEncode the target column, I can use this target as 'y' (see the example again). However, the output determines only a single feature (the target column itself) as optimal. I think this might be because of the one-hot encoding; should I be looking at producing a dense array, and if so, can it be run against RFE?
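For reference, that encoding step is just (a sketch; 'target' is a placeholder for my actual target column name):
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(df['target'])  # 'target' is a hypothetical column name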
Thanks,
Adam
You can do this:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
rfe = RFE(model, 5)
rfe = rfe.fit(X, y)
print(rfe.support_)
print(rfe.ranking_)
f = rfe.get_support(1)  # indices of the most important features
X = df[df.columns[f]]   # final features
Then you can use X as input to your neural network or any other algorithm.
Answering my own question, I figured out the issue was related to the way I had one-hot encoded the data. Initially, I ran one-hot encoding against all categorical columns as follows:
ohe_df = pd.get_dummies(df[df.columns]) # One-hot encode all columns
This introduced a large number of additional features. Taking a different approach, with some help from here, I have modified the encoding to encode multiple columns on a per-column/feature basis as follows:
cf_df = df.select_dtypes(include=[object])  # Get categorical features
nf_df = df.select_dtypes(exclude=[object])  # Get numerical features
ohe_df = nf_df.copy()
for feature in cf_df:
    # the dummies must come from the categorical frame, not the numerical copy
    ohe_df[feature] = cf_df.loc[:, feature].str.get_dummies().values.tolist()
Producing:
ohe_df.head(2) # Only showing a subset of the data
+---+---------------------------------------------------+-----------------+-----------------+-----------------------------------+---------------------------------------------------+
| | os_name | os_family | os_type | os_vendor | os_cpes.0 |
+---+---------------------------------------------------+-----------------+-----------------+-----------------------------------+---------------------------------------------------+
| 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 1, 0, 0, 0] | [1, 0, 0, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ... |
| 1 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 0, 0, 1, 0] | [0, 0, 0, 1, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
+---+---------------------------------------------------+-----------------+-----------------+-----------------------------------+---------------------------------------------------+
Unfortunately, although this was what I was searching for, it didn't execute against RFECV. Next I thought perhaps I could take a slice of all the new features and pass them in as the target, but this resulted in an error. Finally, I realised I would have to iterate through all target values and take the top outputs from each. The code ended up looking something like this:
for num, feature in enumerate(features, start=0):
    X = X_dev_train
    y = X_dev_train[feature]
    # Create the RFE object and compute a cross-validated score.
    svc = SVC(kernel="linear")
    # The "accuracy" scoring is proportional to the number of correct classifications
    # step is the number of features to remove at each iteration
    rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(kfold), scoring='accuracy')
    try:
        rfecv.fit(X, y)
        print("Number of observations in each fold: {}".format(len(X)/kfold))
        print("Optimal number of features : {}".format(rfecv.n_features_))
        g_scores = rfecv.grid_scores_
        indices = np.argsort(g_scores)[::-1]
        print('Printing RFECV results:')
        for num2, f in enumerate(range(X.shape[1]), start=0):
            if g_scores[indices[f]] > 0.80:
                if num2 < 10:
                    print("{}. Number of features: {} Grid_Score: {:0.3f}".format(f + 1, indices[f]+1, g_scores[indices[f]]))
        print("\nTop features sorted by rank:")
        results = sorted(zip(map(lambda x: round(x, 4), rfecv.ranking_), X.columns.values))
        for num3, i in enumerate(results, start=0):
            if num3 < 10:
                print(i)
        # Plot number of features VS. cross-validation scores
        plt.rc("figure", figsize=(8, 5))
        plt.figure()
        plt.xlabel("Number of features selected")
        plt.ylabel("CV score (of correct classifications)")
        plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
        plt.show()
    except ValueError:
        pass
I'm sure this could be cleaner, maybe even plotted in one graph, but it works for me.
Cheers,

CSR Matrix - Matrix multiplication

I have two square matrices, A and B. I must convert B to CSR format and determine the product C:
A * B_csr = C
I have found a lot of information online regarding CSR Matrix - Vector multiplication. The algorithm is:
for (k = 0; k < N; k = k + 1)
    result[k] = 0;

for (i = 0; i < N; i = i + 1)
{
    for (k = RowPtr[i]; k < RowPtr[i+1]; k = k + 1)
    {
        result[i] = result[i] + Val[k] * d[Col[k]];
    }
}
However, I require Matrix - Matrix multiplication.
Further, it seems that most algorithms apply A_csr - vector multiplication, whereas I require A * B_csr. My solution is to transpose the two matrices before converting, then transpose the final product.
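That transpose trick relies on the identity A·B = (Bᵀ·Aᵀ)ᵀ; a quick numpy check of the identity on small dense matrices (a sketch, purely to illustrate, separate from the CSR code):
import numpy as np
A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
assert np.allclose(A.dot(B), (B.T.dot(A.T)).T)  # product recovered via the transpose trick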
Can someone explain how to compute a Matrix - CSR Matrix product and/or a CSR Matrix - Matrix product?
Here is a simple solution in Python for the Dense Matrix X CSR Matrix. It should be self-explanatory.
def main():
    # 4 x 4 CSR matrix
    # [1, 0, 0, 0],
    # [2, 0, 3, 0],
    # [0, 0, 0, 0],
    # [0, 4, 0, 0],
    csr_values = [1, 2, 3, 4]
    col_idx = [0, 0, 2, 1]
    row_ptr = [0, 1, 3, 3, 4]
    csr_matrix = [
        csr_values,
        col_idx,
        row_ptr
    ]
    dense_matrix = [
        [1, 3, 3, 4],
        [1, 2, 3, 4],
        [1, 4, 3, 4],
        [1, 2, 3, 5],
    ]
    res = [
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    # matrix order, assumes both matrices are square
    n = len(dense_matrix)
    # res = dense X csr
    csr_row = 0  # Current row in CSR matrix
    for i in range(n):
        start, end = row_ptr[i], row_ptr[i + 1]
        for j in range(start, end):
            col, csr_value = col_idx[j], csr_values[j]
            for k in range(n):
                dense_value = dense_matrix[k][csr_row]
                res[k][col] += csr_value * dense_value
        csr_row += 1
    print(res)

if __name__ == '__main__':
    main()
CSR Matrix X Dense Matrix is really just a sequence of CSR Matrix X Vector products, one per column of the dense matrix, right? So it should be really easy to extend the code above to do this.
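For instance, a minimal sketch of CSR × dense, reusing the same csr_values / col_idx / row_ptr layout from above (an illustration of the idea, with assumed square shapes):
def csr_times_dense(csr_values, col_idx, row_ptr, dense, n):
    # res[i][j] = sum over the nonzeros (i, col) of the CSR matrix: value * dense[col][j]
    res = [[0] * n for _ in range(n)]
    for i in range(n):  # each CSR row
        for k in range(row_ptr[i], row_ptr[i + 1]):
            col, value = col_idx[k], csr_values[k]
            for j in range(n):  # each column of the dense matrix
                res[i][j] += value * dense[col][j]
    return res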
Moving forward, I suggest you don't code these routines yourself. If you are using C++ (based on the tag), then you could have a look at Boost uBLAS, for example, or Eigen. The APIs may seem a bit cryptic at first, but it's really worth it in the long term. First, you gain access to a lot more functionality, which you will probably require in the future. Second, these implementations will be better optimised.
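And if you end up staying in Python, scipy.sparse already implements these products (a sketch):
import numpy as np
from scipy import sparse

A = np.random.rand(4, 4)
B_csr = sparse.csr_matrix(np.random.rand(4, 4))
C = A @ B_csr  # dense x CSR product; scipy handles the sparse layout internally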

Find minimum N elements in theano

I've got a theano function which computes euclidean distances for 2 matrices—X (n vectors x k features) and Y (m vectors x k features). The result is an n x m matrix of pairwise distances of each vector (or row) in X from each vector (or row) in Y.
import theano
from theano import tensor as T
X, Y = T.dmatrices('X', 'Y')
X_squared_sum = T.sum(X ** 2, axis=1, keepdims=True)
Y_squared_sum = T.sum(Y.T ** 2, axis=0, keepdims=True)
squared_distances = X_squared_sum + Y_squared_sum - 2 * T.dot(X, Y.T)
f_distance = theano.function([X, Y], T.sqrt(squared_distances))
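For example (a sketch of the expected shapes):
import numpy as np
print(f_distance(np.random.rand(3, 4), np.random.rand(5, 4)).shape)  # (3, 5) pairwise distances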
Let's say I change the above function to accept a single vector, an array of vectors, and the number of smallest distances. What I want is a theano function that will find the N smallest distances, similar to below:
import numpy as np
import theano
from theano import tensor as T
X = T.dvector('X')
Y = T.dmatrix('Y')
N = T.iscalar('N')
X_squared_sum = T.dot(X, X)
Y_squared_sum = T.sum(Y.T ** 2, axis=0)
squared_distances = X_squared_sum + Y_squared_sum - 2 * T.dot(X, Y.T)
dist_sorted = T.FIND_N_SMALLEST(T.sqrt(squared_distances), N)
n_closest = theano.function([X, Y, N], dist_sorted)
U = np.array([1, 1, 1, 1])  # X is a dvector, so this must be one-dimensional
V = np.array([
[ 4, 4, 4, 4],
[ 2, 2, 2, 2],
[ 3, 3, 3, 3],
[ 1, 1, 1, 1]])
n_closest(U, V, 2) # [0.0, 2.0]
I'd like to avoid explicitly sorting all the distances, since the number that I want will generally be much much smaller than the total number of distances.
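For reference, the baseline I'm trying to beat would just sort everything (a sketch; T.sort exists and Theano accepts a symbolic scalar as a slice bound, but this does the full sort I want to avoid):
dist_sorted = T.sort(T.sqrt(squared_distances))[:N]
n_closest = theano.function([X, Y, N], dist_sorted)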

Normalizing data and applying colormap results in rotated image using matplotlib?

So I wanted to see if I could make fractal flames using matplotlib, and figured a good test would be the Sierpinski triangle. I modified a working version I had that simply performed the chaos game, normalizing the x range from (-2, 2) to (0, 400) and the y range from (0, 2) to (0, 200). I also truncated the x and y coordinates to 2 decimal places and multiplied by 100, so that the coordinates could be put into a matrix that I could apply a colormap to. Here's the code I'm working on right now (please forgive the messiness):
import numpy as np
import matplotlib.pyplot as plt
import math
import random
def f(x, y, n):
    N = np.array([[x, y]])
    M = np.array([[1/2.0, 0], [0, 1/2.0]])
    b = np.array([[.5], [0]])
    b2 = np.array([[0], [.5]])
    if n == 0:
        return np.dot(M, N.T)
    elif n == 1:
        return np.dot(M, N.T) + 2*b
    elif n == 2:
        return np.dot(M, N.T) + 2*b2
    elif n == 3:
        return np.dot(M, N.T) - 2*b

def norm_x(n, minX_1, maxX_1, minX_2, maxX_2):
    rng = maxX_1 - minX_1
    n = (n - minX_1) / rng
    rng_2 = maxX_2 - minX_2
    n = (n * rng_2) + minX_2
    return n

def norm_y(n, minY_1, maxY_1, minY_2, maxY_2):
    rng = maxY_1 - minY_1
    n = (n - minY_1) / rng
    rng_2 = maxY_2 - minY_2
    n = (n * rng_2) + minY_2
    return n

# Plot ranges
x_min, x_max = -2.0, 2.0
y_min, y_max = 0, 2.0
# Even intervals for points to compute orbits of
x_range = np.arange(x_min, x_max, (x_max - x_min) / 400.0)
y_range = np.arange(y_min, y_max, (y_max - y_min) / 200.0)
mat = np.zeros((len(x_range) + 1, len(y_range) + 1))
random.seed()
x = 1
y = 1
for i in range(0, 100000):
    n = random.randint(0, 3)
    V = f(x, y, n)
    x = V.item(0)
    y = V.item(1)
    # matrix indices must be integers, hence the int() casts
    mat[int(norm_x(x, -2, 2, 0, 400)), int(norm_y(y, 0, 2, 0, 200))] += 50
plt.xlabel('x0')
plt.ylabel('y')
fig = plt.figure(figsize=(10, 10))
plt.imshow(mat, cmap="spectral", extent=[-2, 2, 0, 2])
plt.show()
The mathematics seem solid here so I suspect something weird is going on with how I'm handling where things should go into the 'mat' matrix and how the values in there correspond to the colormap.
If I understood your problem correctly, you need to transpose your matrix using the method .T. So just replace
fig = plt.figure(figsize=(10,10))
plt.imshow(mat, cmap="spectral", extent=[-2,2, 0, 2])
plt.show()
by
fig = plt.figure(figsize=(10, 10))
ax = plt.gca()
ax.imshow(mat.T, cmap="spectral", extent=[-2, 2, 0, 2], origin="lower")
plt.show()
The argument origin="lower" tells imshow to place the origin of your matrix at the bottom of the figure.
Hope it helps.

Elegant way the find the Vertices of a Cube

Nearly every OpenGL tutorial has you implement drawing a cube, so the vertices of the cube are needed. In the example code I saw a long list defining every vertex. But I would like to compute the vertices of a cube rather than using an overlong list of precomputed coordinates.
A cube is made of eight vertices and twelve triangles. Vertices are defined by x, y, and z. Triangles are each defined by the indexes of three vertices.
Is there an elegant way to compute the vertices and the element indexes of a cube?
When I was "porting" the csg.js project to Java, I found some cute code which generates a cube with a chosen center point and radius. (I know it's JS, but anyway.)
// Construct an axis-aligned solid cuboid. Optional parameters are `center` and
// `radius`, which default to `[0, 0, 0]` and `[1, 1, 1]`. The radius can be
// specified using a single number or a list of three numbers, one for each axis.
//
// Example code:
//
//     var cube = CSG.cube({
//       center: [0, 0, 0],
//       radius: 1
//     });
CSG.cube = function(options) {
  options = options || {};
  var c = new CSG.Vector(options.center || [0, 0, 0]);
  var r = !options.radius ? [1, 1, 1] : options.radius.length ?
          options.radius : [options.radius, options.radius, options.radius];
  return CSG.fromPolygons([
    [[0, 4, 6, 2], [-1, 0, 0]],
    [[1, 3, 7, 5], [+1, 0, 0]],
    [[0, 1, 5, 4], [0, -1, 0]],
    [[2, 6, 7, 3], [0, +1, 0]],
    [[0, 2, 3, 1], [0, 0, -1]],
    [[4, 5, 7, 6], [0, 0, +1]]
  ].map(function(info) {
    return new CSG.Polygon(info[0].map(function(i) {
      var pos = new CSG.Vector(
        c.x + r[0] * (2 * !!(i & 1) - 1),
        c.y + r[1] * (2 * !!(i & 2) - 1),
        c.z + r[2] * (2 * !!(i & 4) - 1)
      );
      return new CSG.Vertex(pos, new CSG.Vector(info[1]));
    }));
  }));
};
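The key line is the bit trick: corner i (0 through 7) takes its sign on each axis from one bit of i. Restated in Python for clarity (a sketch of just the corner positions, unit radius, centered at the origin):
# each corner index 0..7 encodes its sign pattern in its three bits
corners = [tuple(2 * ((i >> axis) & 1) - 1 for axis in range(3)) for i in range(8)]
# e.g. corners[0] == (-1, -1, -1), corners[5] == (1, -1, 1), corners[7] == (1, 1, 1)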
I solved this problem with this piece of code (C#):
public CubeShape(Coord3 startPos, int size) {
    int l = size / 2;
    verts = new Coord3[8];
    for (int i = 0; i < 8; i++) {
        verts[i] = new Coord3(
            (i & 4) != 0 ? l : -l,
            (i & 2) != 0 ? l : -l,
            (i & 1) != 0 ? l : -l) + startPos;
    }
    tris = new Tris[12];
    int vertCount = 0;
    void AddVert(int one, int two, int three) =>
        tris[vertCount++] = new Tris(verts[one], verts[two], verts[three]);
    for (int i = 0; i < 3; i++) {
        int v1 = 1 << i;
        int v2 = v1 == 4 ? 1 : v1 << 1;
        AddVert(0, v1, v2);
        AddVert(v1 + v2, v2, v1);
        AddVert(7, 7 - v2, 7 - v1);
        AddVert(7 - (v1 + v2), 7 - v1, 7 - v2);
    }
}
If you want to understand more of what is going on, you can check out the github page I wrote that explains it.