This is essentially the "Multiple Coins from Multiple Mints / Baseball Players" example from Doing Bayesian Data Analysis, Second Edition (DBDA2). I believe I have PyMC3 code that is functionally equivalent, but one version works and the other does not. This is with PyMC3 version 3.5. In more detail:
Let's say I have the following data. Each row is an observation:
import numpy as np
import pandas as pd
import pymc3 as pm

observations_dict = {
'mint': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
'coin': [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7],
'outcome': [1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1]
}
observations = pd.DataFrame(observations_dict)
observations
One Mint, Several Coins
The model below, which implements DBDA2 Figure 9.7, runs just fine:
num_coins = observations['coin'].nunique()
coin_idx = observations['coin']
with pm.Model() as hierarchical_model:
# mint is characterized by omega and kappa
omega = pm.Beta('omega', 1., 1.)
kappa_minus2 = pm.Gamma('kappa_minus2', 0.01, 0.01)
kappa = pm.Deterministic('kappa', kappa_minus2 + 2)
# each coin is described by a theta
theta = pm.Beta('theta', alpha=omega*(kappa-2)+1, beta=(1-omega)*(kappa-2)+1, shape=num_coins)
# define the likelihood
y = pm.Bernoulli('y', theta[coin_idx], observed=observations['outcome'])
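For completeness, sampling from this model would look something like the following (a sketch; the draw and tune counts are arbitrary):
with hierarchical_model:
    trace = pm.sample(2000, tune=1000)  # NUTS sampling; counts chosen arbitrarily
pm.summary(trace)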
Many Mints, Many Coins
However, once another level of hierarchy is added (as seen in DBDA2 Figure 9.13):
num_mints = observations['mint'].nunique()
mint_idx = observations['mint']
num_coins = observations['coin'].nunique()
coin_idx = observations['coin']
with pm.Model() as hierarchical_model2:
# Hyper parameters
omega = pm.Beta('omega', 1, 1)
kappa_minus2 = pm.Gamma('kappa_minus2', 0.01, 0.01)
kappa = pm.Deterministic('kappa', kappa_minus2 + 2)
# Parameters for mints
omega_c = pm.Beta('omega_c',
omega*(kappa-2)+1, (1-omega)*(kappa-2)+1,
shape = num_mints)
kappa_c_minus2 = pm.Gamma('kappa_c_minus2',
0.01, 0.01,
shape = num_mints)
kappa_c = pm.Deterministic('kappa_c', kappa_c_minus2 + 2)
# Parameters for coins
theta = pm.Beta('theta',
omega_c[mint_idx]*(kappa_c[mint_idx]-2)+1,
(1-omega_c[mint_idx])*(kappa_c[mint_idx]-2)+1,
shape = num_coins)
y2 = pm.Bernoulli('y2', p=theta[coin_idx], observed=observations['outcome'])
The error is:
ValueError: operands could not be broadcast together with shapes (8,) (20,)
as the model has 8 thetas for 8 coins but sees 20 rows of data.
However, if the data is grouped so that each row represents the final statistics of an individual coin, as with the following:
grouped = observations.groupby(['mint', 'coin']).agg({'outcome': [np.sum, np.size]}).reset_index()
grouped.columns = ['mint', 'coin', 'heads', 'total']
and the final likelihood variable is changed to a Binomial, as follows:
num_mints = grouped['mint'].nunique()
mint_idx = grouped['mint']
num_coins = grouped['coin'].nunique()
coin_idx = grouped['coin']
with pm.Model() as hierarchical_model2:
# Hyper parameters
omega = pm.Beta('omega', 1, 1)
kappa_minus2 = pm.Gamma('kappa_minus2', 0.01, 0.01)
kappa = pm.Deterministic('kappa', kappa_minus2 + 2)
# Parameters for mints
omega_c = pm.Beta('omega_c',
omega*(kappa-2)+1, (1-omega)*(kappa-2)+1,
shape = num_mints)
kappa_c_minus2 = pm.Gamma('kappa_c_minus2',
0.01, 0.01,
shape = num_mints)
kappa_c = pm.Deterministic('kappa_c', kappa_c_minus2 + 2)
# Parameter for coins
theta = pm.Beta('theta',
omega_c[mint_idx]*(kappa_c[mint_idx]-2)+1,
(1-omega_c[mint_idx])*(kappa_c[mint_idx]-2)+1,
shape = num_coins)
y2 = pm.Binomial('y2', n=grouped['total'], p=theta, observed=grouped['heads'])
Everything works. The latter form is more efficient and generally preferred, but I believe the former should work as well, so I suspect this is primarily a PyMC3 issue (or, even more likely, a user error).
To quote DBDA Edition 1,
"The BUGS model uses a binomial likelihood distribution for total
correct, instead of using the Bernoulli distribution for individual
trials. This use of the binomial is just a convenience for shortening
the program. If the data were specified as trial-by-trial outcomes
instead of as total correct, then the model could include a
trial-by-trial loop and use a Bernoulli likelihood function"
What bothers me is that in the very first example (One Mint, Several Coins), PyMC3 appears to handle individual observations instead of aggregated observations just fine. So I believe the first form should work, but it doesn't.
Code
http://nbviewer.jupyter.org/github/JWarmenhoven/DBDA-python/blob/master/Notebooks/Chapter%209.ipynb
References
PyMC3 - Differences in ways observations are passed to model -> difference in results?
https://discourse.pymc.io/t/pymc3-differences-in-ways-observations-are-passed-to-model-difference-in-results/501
http://www.databozo.com/deep-in-the-weeds-complex-hierarchical-models-in-pymc3
https://stats.stackexchange.com/questions/157521/is-this-correct-hierarchical-bernoulli-model
The length of mint_idx was 20 (one for each observation), but it should have been 8 (one for each coin).
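For reference, one minimal way to build a per-coin mint index directly from the raw observations (a sketch, assuming each coin belongs to exactly one mint; mint_of_coin is a made-up name) would be:
mint_of_coin = observations.groupby('coin')['mint'].first().values  # one entry per coin
print(len(mint_of_coin))  # 8, not 20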
Working answer; notice the mint_idx recalculation (the rest remains the same):
grouped = observations.groupby(['mint', 'coin']).agg({'outcome': [np.sum, np.size]}).reset_index()
grouped.columns = ['mint', 'coin', 'heads', 'total']
num_mints = grouped['mint'].nunique()
mint_idx = grouped['mint']
num_coins = observations['coin'].nunique()
coin_idx = observations['coin']
with pm.Model() as hierarchical_model2:
# Hyper parameters
omega = pm.Beta('omega', 1, 1)
kappa_minus2 = pm.Gamma('kappa_minus2', 0.01, 0.01)
kappa = pm.Deterministic('kappa', kappa_minus2 + 2)
# Parameters for mints
omega_c = pm.Beta('omega_c',
omega*(kappa-2)+1, (1-omega)*(kappa-2)+1,
shape = num_mints)
kappa_c_minus2 = pm.Gamma('kappa_c_minus2',
0.01, 0.01,
shape = num_mints)
kappa_c = pm.Deterministic('kappa_c', kappa_c_minus2 + 2)
# Parameters for coins
theta = pm.Beta('theta',
omega_c[mint_idx]*(kappa_c[mint_idx]-2)+1,
(1-omega_c[mint_idx])*(kappa_c[mint_idx]-2)+1,
shape = num_coins)
y2 = pm.Bernoulli('y2', p=theta[coin_idx], observed=observations['outcome'])
Many thanks to @junpenglao!
https://discourse.pymc.io/t/why-cant-i-use-a-bernoulli-as-a-likelihood-variable-in-a-hierarchical-model-in-pymc3/2022/2
I'd like to compute the mean (and quantiles) over the years on an xarray DataArray.
If the time sampling is a multiple of days, I can easily do something like this:
arr.groupby('time.dayofyear').mean('time')
But I can't find an easy way to do the same if I also have hours. (At the moment I'm using a horrible trick.)
For example in this case:
import numpy as np
import pandas as pd
import xarray as xr
time = pd.date_range('2000-01-01', '2010-01-01', freq='6h')
arr = xr.DataArray(
np.ones(len(time)),
dims='time',
coords={'time' : ('time', time)}
)
I'm probably missing something; I'm not very experienced with pandas and xarray. Do you have any tips?
Thank you very much.
If you want daily averages, resample is the best tool for the job:
daily = arr.resample(time='D').mean('time')
Then, you can use groupby to calculate quantiles for each day of year:
quantiles_by_dayofyear = daily.groupby('time.dayofyear').apply(
xr.DataArray.quantile, q=[0.25, 0.5, 0.75])
print(quantiles_by_dayofyear)
Yields:
<xarray.DataArray (dayofyear: 366, quantile: 3)>
array([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.],
...,
[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]])
Coordinates:
* quantile (quantile) float64 0.25 0.5 0.75
* dayofyear (dayofyear) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...
We should probably add the quantile method to xarray's list of groupby reduce methods but this should work for now.
For the daily average I would suggest using the resample function. If I understood the question correctly, this should give you daily averages, which you can then use for your groupby dayofyear operation.
import numpy as np
import pandas as pd
import xarray as xr
time = pd.date_range('2000-01-01', '2010-01-01', freq='6h')
arr = xr.DataArray(
np.ones(len(time)),
dims='time',
coords={'time' : ('time', time)}
)
daily = arr.resample(time='D').mean('time')
Sorry, my question probably wasn't clear. Consider only the quantiles.
My expected output is something like this:
<xarray.DataArray (hours: 1464, quantile: 3)>
array([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.],
...,
[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]])
Coordinates:
* quantile (quantile) float64 0.25 0.5 0.75
* hours (hours) int64 6 12 18 24 30 36 42 48 54 60 66 72 ...
Here hours are the hours from the beginning of the year. Instead of hours, something like a MultiIndex with dayofyear and hour (of day) would also be fine. I have a tricky way to do it (performing some reindexing with a MultiIndex and unstacking the time dimension), but it's really horrible. I think there is an easier and more elegant way to do it.
Thank you very much.
My understanding of the question is that you either want to be able to do a groupby operation over two variables simultaneously, or to group by something that is not a method of the xarray DateTimeAccessor.
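For the latter case, one direct route is to group by a coordinate you build yourself. A sketch, assuming the 6-hourly example from the question (the hourofyear name is made up here):
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range('2000-01-01', '2010-01-01', freq='6h')
arr = xr.DataArray(np.ones(len(time)), dims='time', coords={'time': time})

# hours elapsed since the start of the year, as a named coordinate to group by
hourofyear = ((arr.time.dt.dayofyear - 1) * 24 + arr.time.dt.hour).rename('hourofyear')
quantiles = arr.groupby(hourofyear).apply(
    xr.DataArray.quantile, q=[0.25, 0.5, 0.75])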
Something you might look at is using xarray.apply_ufunc. Below is some code that I used for getting grouped means by year and by month.
def _grouped_mean(
data: np.ndarray,
months: np.ndarray,
years: np.ndarray) -> np.ndarray:
"""similar to grouping year_month MultiIndex, but faster.
Should be used wrapped by _wrapped_grouped_mean"""
unique_months = np.sort(np.unique(months))
unique_years = np.sort(np.unique(years))
old_shape = list(data.shape)
new_shape = old_shape[:-1]
new_shape.append(unique_months.shape[0])
new_shape.append(unique_years.shape[0])
output = np.zeros(new_shape)
for i_month, j_year in np.ndindex(output.shape[2:]):
indices = np.intersect1d(
(months == unique_months[i_month]).nonzero(),
(years == unique_years[j_year]).nonzero()
)
output[:, :, i_month, j_year] =\
np.mean(data[:, :, indices], axis=-1)
return output
def _wrapped_grouped_mean(da: xr.DataArray) -> xr.DataArray:
"""similar to grouping by a year_month MultiIndex, but faster.
Wraps a numpy-style function with xr.apply_ufunc
"""
Y = xr.apply_ufunc(
_grouped_mean,
da,
da.time.dt.month,
da.time.dt.year,
input_core_dims=[['lat', 'lon', 'time'], ['time'], ['time']],
output_core_dims=[['lat', 'lon', 'month', 'year']],
)
Y = Y.assign_coords(
{'month': np.sort(np.unique(da.time.dt.month)),
'year': np.sort(np.unique(da.time.dt.year))})
return Y
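Hypothetical usage, assuming a DataArray with exactly the ('lat', 'lon', 'time') dims that the core dims above expect:
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range('2000-01-01', '2002-12-31', freq='D')
da = xr.DataArray(
    np.random.rand(4, 5, len(time)),   # fake data on a 4x5 grid
    dims=('lat', 'lon', 'time'),
    coords={'time': time},
)
monthly_yearly_means = _wrapped_grouped_mean(da)  # dims: ('lat', 'lon', 'month', 'year')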
I have two 1D arrays A and B, containing NaN values in some random places. I want to add these arrays element-wise (C[i] = A[i] + B[i]) and take the mean of the element-wise sums. This works well and efficiently with the code below:
import numpy as np
# Create some fake matrices
A = np.arange(0,10,0.5)
B = 10.0*np.arange(0,10,0.5)
# Replace some random elements in A and B with NaN
A[15] = np.nan
A[16] = np.nan
A[17] = np.nan
A[18] = np.nan
B[1] = np.nan
B[2] = np.nan
B[17] = np.nan
B[18] = np.nan
# Sum over A and B, element wise, and take the mean of the sums
C = 0.5 * ( np.where(np.isnan(A), B, A + np.nan_to_num(B)) )
But if one of A[i] and B[i] is NaN and the other one isn't, I don't want to take the mean of the sum; instead I want to keep the value that is not NaN. This I have not been able to solve.
In other words, given A and B above, I eventually want C to be:
A
array([ 0., 0.5, 1., 1.5, 2., 2.5, 3., 3.5, 4., 4.5,
5., 5.5, 6., 6.5, 7., nan, nan, nan, nan, 9.5])
B
array([ 0., nan, nan, 15., 20., 25., 30., 35., 40., 45.,
50., 55., 60., 65., 70., 75., 80., nan, nan, 95.])
# What I eventually want C to be:
C
array([ 0., 0.5, 1. , 8.25, 11., 13.75, 16.5, 19.25, 22., 24.75,
27.5, 30.25, 33., 35.75, 38.5, 75., 80., nan, nan, 52.25])
Does anyone have any (efficient) suggestions for how I can do this? (For example, I would like to avoid time-consuming loops if possible.)
NumPy's nanmean generates warnings when both numbers are np.nan, but it gives the result you want:
C = np.nanmean([A, B], axis=0)
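If the "Mean of empty slice" RuntimeWarnings (raised wherever both entries are NaN) are a nuisance, one option is to silence them locally; a minimal sketch:
import warnings

with warnings.catch_warnings():
    warnings.simplefilter('ignore', category=RuntimeWarning)  # ignore all-NaN slice warnings
    C = np.nanmean([A, B], axis=0)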
Why is this happening? Do I have to use a copy of the numpy array? But it seems to work with the first piece of code; I cannot figure out why.
import numpy as np
n=3
h_all=[]
h=np.zeros((n,n))
for i in range(0, n):
h = h + 1.
h_all.append(h)
print h_all
it gives
[array([[ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.]]), array([[ 2., 2., 2.],
[ 2., 2., 2.],
[ 2., 2., 2.]]), array([[ 3., 3., 3.],
[ 3., 3., 3.],
[ 3., 3., 3.]])]
which is good. But if I code it as:
n=3
h_all=[]
h=np.zeros((n,n))
maxnum=3
for k in range(0, n):
for i in range(0, n):
for j in range(0, n):
h[i,j] = h[i,j] + 1.
h_all.append(h[:])
print h_all
It becomes:
[array([[ 3., 3., 3.],
[ 3., 3., 3.],
[ 3., 3., 3.]]), array([[ 3., 3., 3.],
[ 3., 3., 3.],
[ 3., 3., 3.]]), array([[ 3., 3., 3.],
[ 3., 3., 3.],
[ 3., 3., 3.]])]
for k in range(0, n):
....
h_all.append(h[:])
puts a view of h into the h_all list. Since you are modifying h in place at each step, every slot of h_all ends up sharing the same data, so at the end h_all displays the current value of h in each slot.
This is a common issue when dealing with Python lists and dictionaries. You have to append a copy at each step of the iteration, not a reference to the same mutable object.
To clarify this, check that h_all[0].base is h (and likewise for h_all[1]); each entry is a view of the same array. Or try h += 1 after the loop, and watch the values in h_all change.
I should add that h[:] creates a copy if h is a list, but not if it is an array.
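Concretely, a minimal fix for the second loop is to append an explicit copy of the array at each iteration:
h_all = []
h = np.zeros((n, n))
for k in range(0, n):
    for i in range(0, n):
        for j in range(0, n):
            h[i, j] = h[i, j] + 1.
    h_all.append(h.copy())  # snapshot the current state instead of storing a view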
What is a Pythonic way of making a list of arbitrary length containing evenly spaced numbers (not just whole integers) between given bounds? For instance:
my_func(0,5,10) # ( lower_bound , upper_bound , length )
# [ 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5 ]
Note that the range() function only deals with integers. And this:
def my_func(low,up,leng):
list = []
step = (up - low) / float(leng)
for i in range(leng):
list.append(low)
low = low + step
return list
seems too complicated. Any ideas?
Given numpy, you could use linspace:
Including the right endpoint (5):
In [46]: import numpy as np
In [47]: np.linspace(0,5,10)
Out[47]:
array([ 0. , 0.55555556, 1.11111111, 1.66666667, 2.22222222,
2.77777778, 3.33333333, 3.88888889, 4.44444444, 5. ])
Excluding the right endpoint:
In [48]: np.linspace(0,5,10,endpoint=False)
Out[48]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
You can use the following approach:
[lower + x*(upper-lower)/length for x in range(length)]
lower and/or upper must be floats for this approach to work (otherwise, under Python 2, the division is integer division).
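Wrapped in a function, that caveat can be handled by forcing float division; a small sketch reusing the my_func name from the question:
def my_func(lower, upper, length):
    # float() guards against integer division under Python 2
    return [lower + x * (upper - lower) / float(length) for x in range(length)]

my_func(0, 5, 10)
# [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]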
Similar to unutbu's answer, you can use numpy's arange function, which is analogous to Python's built-in range function. Notice that the end point is not included, just as with range:
>>> import numpy as np
>>> a = np.arange(0, 5, 0.5)  # returns a numpy array
>>> a
array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
>>> a.tolist() # if you prefer it as a list
[0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
f = 0.5   # step size
a = 0
b = 10    # number of points (10 points gives 0 through 4.5)
d = [x * f for x in range(a, b)]
would be a way to do it.
Numpy's r_ convenience object can also create evenly spaced arrays with the syntax np.r_[start:stop:steps]. If steps is an imaginary number (ending in j), the end point is included, equivalent to np.linspace(start, stop, steps, endpoint=True); otherwise it is not.
>>> np.r_[-1:1:6j]
array([-1. , -0.6, -0.2,  0.2,  0.6,  1. ])
You can also directly concatenate other arrays as well as scalars:
>>> np.r_[-1:1:6j, [0]*3, 5, 6]
array([-1. , -0.6, -0.2, 0.2, 0.6, 1. , 0. , 0. , 0. , 5. , 6. ])
You can use the following code:
def float_range(initVal, itemCount, step):
for x in xrange(itemCount):
yield initVal
initVal += step
[x for x in float_range(1, 3, 0.1)]
Similar to Howard's answer but a bit more efficient:
def my_func(low, up, leng):
step = ((up-low) * 1.0 / leng)
return [low+i*step for i in xrange(leng)]