Operation along year with xarray - grouping

I'd like to compute the mean (and quantiles) across years on an xarray DataArray.
If the time sampling is a multiple of days, I can easily do something like this:
arr.groupby('time.dayofyear').mean('time')
But I can't find an easy way to do the same when the data also has an hourly component. (Right now I'm using a horrible trick.)
For example, in this case:
import numpy as np
import pandas as pd
import xarray as xr
time = pd.date_range('2000-01-01', '2010-01-01', freq='6h')
arr = xr.DataArray(
    np.ones(len(time)),
    dims='time',
    coords={'time': ('time', time)}
)
Probably I'm missing something; I'm not very experienced with pandas and xarray. Do you have any tips?
Thank you very much.

If you want daily averages, resample is the best tool for the job:
daily = arr.resample(time='D').mean('time')
Then, you can use groupby to calculate quantiles for each day of year:
quantiles_by_dayofyear = daily.groupby('time.dayofyear').apply(
    xr.DataArray.quantile, q=[0.25, 0.5, 0.75])
print(quantiles_by_dayofyear)
Yields:
<xarray.DataArray (dayofyear: 366, quantile: 3)>
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       ...,
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
Coordinates:
  * quantile   (quantile) float64 0.25 0.5 0.75
  * dayofyear  (dayofyear) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...
We should probably add the quantile method to xarray's list of groupby reduce methods, but this should work for now.
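In more recent xarray releases, groupby objects do grow a quantile method, so something like the following may work directly (version-dependent; the apply() form above works regardless):
quantiles_by_dayofyear = daily.groupby('time.dayofyear').quantile(
    [0.25, 0.5, 0.75])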

For the daily average I would suggest using the resample function. If I understood the question correctly, this should give you daily averages; you can then use those daily averages for your groupby('time.dayofyear') operation.
import numpy as np
import pandas as pd
import xarray as xr
time = pd.date_range('2000-01-01', '2010-01-01', freq='6h')
arr = xr.DataArray(
    np.ones(len(time)),
    dims='time',
    coords={'time': ('time', time)}
)

daily = arr.resample(time='D').mean('time')

Sorry, my question probably wasn't clear. Consider only the quantiles.
My expected output is something like this:
<xarray.DataArray (hours: 1464, quantile: 3)>
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       ...,
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
Coordinates:
  * quantile  (quantile) float64 0.25 0.5 0.75
  * hours     (hours) int64 6 12 18 24 30 36 42 48 54 60 66 72 ...
Here hours are the hours elapsed since the beginning of the year. Instead of hours, something like a MultiIndex over dayofyear and hour (of day) would also be fine. I have a tricky way to do it (some reindexing with a MultiIndex and unstacking the time dimension), but it's really horrible; I think there must be an easier, more elegant way.
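To make the target concrete, here is a rough sketch of the kind of computation I mean (building an "hours since the start of the year" coordinate by hand; the coordinate name hours is my own choice):
# (dayofyear - 1) * 24 + hour gives 0, 6, 12, ... for 6-hourly data
hours = (arr['time'].dt.dayofyear - 1) * 24 + arr['time'].dt.hour
arr = arr.assign_coords(hours=('time', hours.values))
quantiles = arr.groupby('hours').apply(
    xr.DataArray.quantile, q=[0.25, 0.5, 0.75])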
Thank you very much.

My understanding of the question is that you want either to do a groupby operation over two variables simultaneously, or to group by something that is not a method of the xarray DateTimeAccessor.
Something you might look at is using xarray.apply_ufunc. Below is some code that I used for getting grouped means by year and by month.
def _grouped_mean(
        data: np.ndarray,
        months: np.ndarray,
        years: np.ndarray) -> np.ndarray:
    """Similar to grouping by a year_month MultiIndex, but faster.

    Should be used wrapped by _wrapped_grouped_mean."""
    unique_months = np.sort(np.unique(months))
    unique_years = np.sort(np.unique(years))
    old_shape = list(data.shape)
    new_shape = old_shape[:-1]
    new_shape.append(unique_months.shape[0])
    new_shape.append(unique_years.shape[0])
    output = np.zeros(new_shape)
    for i_month, j_year in np.ndindex(output.shape[2:]):
        indices = np.intersect1d(
            (months == unique_months[i_month]).nonzero(),
            (years == unique_years[j_year]).nonzero()
        )
        output[:, :, i_month, j_year] = \
            np.mean(data[:, :, indices], axis=-1)
    return output
def _wrapped_grouped_mean(da: xr.DataArray) -> xr.DataArray:
    """Similar to grouping by a year_month MultiIndex, but faster.

    Wraps a numpy-style function with xr.apply_ufunc.
    """
    Y = xr.apply_ufunc(
        _grouped_mean,
        da,
        da.time.dt.month,
        da.time.dt.year,
        input_core_dims=[['lat', 'lon', 'time'], ['time'], ['time']],
        output_core_dims=[['lat', 'lon', 'month', 'year']],
    )
    Y = Y.assign_coords(
        {'month': np.sort(np.unique(da.time.dt.month)),
         'year': np.sort(np.unique(da.time.dt.year))})
    return Y
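A hypothetical usage sketch, reusing the imports from the question (the input core dims require a DataArray with lat, lon and time dimensions; the example data here is my own):
time = pd.date_range('2000-01-01', '2002-12-31', freq='D')
da = xr.DataArray(
    np.random.rand(4, 5, len(time)),   # hypothetical (lat, lon, time) data
    dims=['lat', 'lon', 'time'],
    coords={'time': ('time', time)},
)
means = _wrapped_grouped_mean(da)      # dims: (lat, lon, month, year)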

Related

Change matrix elements in one matrix, given statements in two other matrices, in python

I have two 1D arrays A and B, containing NaN values in some random places. I want to add these arrays element-wise (C[i] = A[i] + B[i]) and take the mean of each element pair. This works well and efficiently in the code below:
import numpy as np
# Create some fake matrices
A = np.arange(0,10,0.5)
B = 10.0*np.arange(0,10,0.5)
# Replace some random elements in A and B with NaN
A[15] = np.nan
A[16] = np.nan
A[17] = np.nan
A[18] = np.nan
B[1] = np.nan
B[2] = np.nan
B[17] = np.nan
B[18] = np.nan
# Sum over A and B, element wise, and take the mean of the sums
C = 0.5 * ( np.where(np.isnan(A), B, A + np.nan_to_num(B)) )
But if one of A[i] and B[i] is NaN and the other one isn't, I don't want to take the mean; I want to keep the non-NaN value instead. This I have not been able to solve.
In other words (given A and B above), I eventually want C to be:
A
array([ 0., 0.5, 1., 1.5, 2., 2.5, 3., 3.5, 4., 4.5,
5., 5.5, 6., 6.5, 7., nan, nan, nan, nan, 9.5])
B
array([ 0., nan, nan, 15., 20., 25., 30., 35., 40., 45.,
50., 55., 60., 65., 70., 75., 80., nan, nan, 95.])
# What I eventually want C to be:
C
array([ 0., 0.5, 1. , 8.25, 11., 13.75, 16.5, 19.25, 22., 24.75,
27.5, 30.25, 33., 35.75, 38.5, 75., 80., nan, nan, 52.25])
Does anyone have any (efficient) suggestions how I can do this? (For example, I would like to avoid time consuming loops if possible).
NumPy's nanmean generates warnings when both numbers are np.nan, but it gives the result you want:
C = np.nanmean([A, B], axis=0)
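If the warning is a nuisance, it can be silenced locally (a sketch; np.nanmean emits a RuntimeWarning for positions where both values are NaN, and the result there is still NaN, so nothing is lost):
import warnings
import numpy as np

with warnings.catch_warnings():
    warnings.simplefilter('ignore', category=RuntimeWarning)
    C = np.nanmean([A, B], axis=0)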

What to use for the arguments of skimage.measure.moments_central( )?

I am trying to use the measure module of scikit-image to find the Hu moment invariants of an image, but I am stuck on what is going on in one of the prerequisite functions.
The function is defined here as:
skimage.measure.moments_central(image, cr, cc, order=3)
where cr and cc are defined as the center row coordinate and the center column coordinate, respectively. Initially I thought this just meant the midpoint of the image, but after reading more about central moments I'm starting to think it should be the centroid (or geometric center) of the object in the image that I am focusing on. Does anyone have any intuition as to what values should be given to cr and cc? The documentation online is very limited.
You must pass the centroid of the object of interest. I submitted a fix to improve the documentation with examples, which should also be helpful to you. See: https://github.com/scikit-image/scikit-image/pull/1636/files.
Usage:
>>> import numpy as np
>>> from skimage.measure import moments, moments_central
>>> image = np.zeros((20, 20), dtype=np.double)
>>> image[13:17, 13:17] = 1
>>> m = moments(image)
>>> cr = m[0, 1] / m[0, 0]
>>> cc = m[1, 0] / m[0, 0]
>>> moments_central(image, cr, cc)
array([[ 16.,   0.,  20.,   0.],
       [  0.,   0.,   0.,   0.],
       [ 20.,   0.,  25.,   0.],
       [  0.,   0.,   0.,   0.]])

Django 1.6 Query Math Incorrect

Not sure why, but context['user_activity_percentage'] is showing 0 when it should be showing 25: context['user_activity'] is 1, CommunityProfile.objects.count() is 4, and int(1/4 * 100) = 25. I verified this in manage.py shell_plus. Why is it showing 0 instead of 25?
context['user_activity'] = CommunityProfile.list_all_users.date_search(
    date1, date2, column="last_activity").count()
context['user_activity_percentage'] = int(context['user_activity'] /
    CommunityProfile.objects.count() * 100)
If you are using Python 2.x, 1/4 is 0, not 0.25:
>>> 1 / 4
0
If you want to get 0.25, convert one of the values to float:
>>> float(1) / 4
0.25
This behavior is different from Python 3.x (PEP 238: true division). If you want / to work like it does in Python 3.x, do the following:
>>> from __future__ import division
>>> 1 / 4
0.25
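Applied to the snippet from the question, the fix might look like this (a sketch using the names from the question, untested against the actual models):
context['user_activity_percentage'] = int(
    float(context['user_activity']) / CommunityProfile.objects.count() * 100)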

Selecting data from an HDFStore by floating-point data_column

I have a table in an HDFStore with a column of floats f stored as a data_column. I would like to select a subset of rows where, e.g., f == 0.6.
I'm running into trouble that I assume is related to a floating-point precision mismatch somewhere. Here is an example:
In [1]: f = np.arange(0, 1, 0.1)
In [2]: s = f.astype('S')
In [3]: df = pd.DataFrame({'f': f, 's': s})
In [4]: df
Out[4]:
     f    s
0  0.0  0.0
1  0.1  0.1
2  0.2  0.2
3  0.3  0.3
4  0.4  0.4
5  0.5  0.5
6  0.6  0.6
7  0.7  0.7
8  0.8  0.8
9  0.9  0.9

[10 rows x 2 columns]
In [5]: with pd.get_store('test.h5', mode='w') as store:
   ...:     store.append('df', df, data_columns=True)
   ...:

In [6]: with pd.get_store('test.h5', mode='r') as store:
   ...:     selection = store.select('df', 'f=f')
   ...:
In [7]: selection
Out[7]:
     f    s
0  0.0  0.0
1  0.1  0.1
2  0.2  0.2
4  0.4  0.4
5  0.5  0.5
8  0.8  0.8
9  0.9  0.9

[7 rows x 2 columns]
I would like the query to return all of the rows, but instead several are missing. A query with where='f=0.3' returns an empty table:
In [8]: with pd.get_store('test.h5', mode='r') as store:
   ...:     selection = store.select('df', 'f=0.3')
   ...:

In [9]: selection
Out[9]:
Empty DataFrame
Columns: [f, s]
Index: []

[0 rows x 2 columns]
I'm wondering whether this is the intended behavior, and if so, whether there is a simple workaround, such as setting a precision limit for floating-point queries in pandas. I'm using version 0.13.1:
In [10]: pd.__version__
Out[10]: '0.13.1-55-g7d3e41c'
I don't think so, no. Pandas is built around numpy, and I have never seen any tools for approximate float equality except testing utilities like assert_allclose, and that won't help here.
The best you can do is something like:
In [17]: with pd.get_store('test.h5', mode='r') as store:
   ....:     selection = store.select('df', '(f > 0.2) & (f < 0.4)')
   ....:

In [18]: selection
Out[18]:
     f    s
3  0.3  0.3
If this is a common idiom for you, make a function for it. You can even get fancy by incorporating numpy float precision.
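For example, a small helper along those lines (a sketch, not a pandas API; the name, signature, and tolerance choice are my own):
import numpy as np

def select_float(store, key, column, value, tol=None):
    """Select rows where `column` lies within `tol` of `value`."""
    # Default window: a generous multiple of float64 machine epsilon.
    if tol is None:
        tol = np.finfo('float64').eps * max(abs(value), 1.0) * 1e4
    where = '({col} > {lo:.17g}) & ({col} < {hi:.17g})'.format(
        col=column, lo=value - tol, hi=value + tol)
    return store.select(key, where)

with pd.get_store('test.h5', mode='r') as store:
    selection = select_float(store, 'df', 'f', 0.3)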

Making a list of evenly spaced numbers in a certain range in python

What is a pythonic way of making a list of arbitrary length containing evenly spaced numbers (not just whole integers) between given bounds? For instance:
my_func(0,5,10) # ( lower_bound , upper_bound , length )
# [ 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5 ]
Note that the range() function only deals with integers. And this:
def my_func(low, up, leng):
    lst = []
    step = (up - low) / float(leng)
    for i in range(leng):
        lst.append(low)
        low = low + step
    return lst
seems too complicated. Any ideas?
Given numpy, you could use linspace:
Including the right endpoint (5):
In [46]: import numpy as np
In [47]: np.linspace(0,5,10)
Out[47]:
array([ 0.        ,  0.55555556,  1.11111111,  1.66666667,  2.22222222,
        2.77777778,  3.33333333,  3.88888889,  4.44444444,  5.        ])
Excluding the right endpoint:
In [48]: np.linspace(0,5,10,endpoint=False)
Out[48]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
You can use the following approach:
[lower + x*(upper-lower)/length for x in range(length)]
lower and/or upper must be assigned as floats for this approach to work.
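For instance, wrapped in a function with the float conversion made explicit (a sketch):
def my_func(lower, upper, length):
    # float() guards against integer truncation under Python 2 division
    lower, upper = float(lower), float(upper)
    return [lower + x * (upper - lower) / length for x in range(length)]

print(my_func(0, 5, 10))  # [0.0, 0.5, 1.0, ..., 4.5]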
Similar to unutbu's answer, you can use numpy's arange function, which is analogous to Python's built-in range. Note that the end point is not included, just as with range:
>>> import numpy as np
>>> a = np.arange(0, 5, 0.5)  # returns a numpy array
>>> a
array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5])
>>> a.tolist() # if you prefer it as a list
[0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
f = 0.5
a = 0
b = 10  # exclusive upper index: 10 points, 0 through 4.5
d = [x * f for x in range(a, b)]
would be a way to do it.
Numpy's r_ convenience function can also create evenly spaced lists with the syntax np.r_[start:stop:steps]. If steps is an imaginary number (ending in j), it is interpreted as the number of points and the end point is included, equivalent to np.linspace(start, stop, steps, endpoint=True); otherwise it is treated as a step size and the end point is excluded.
>>> np.r_[-1:1:6j]
array([-1. , -0.6, -0.2,  0.2,  0.6,  1. ])
You can also directly concatenate other arrays as well as scalars:
>>> np.r_[-1:1:6j, [0]*3, 5, 6]
array([-1. , -0.6, -0.2, 0.2, 0.6, 1. , 0. , 0. , 0. , 5. , 6. ])
You can use the following code:
def float_range(initVal, itemCount, step):
    for x in xrange(itemCount):
        yield initVal
        initVal += step

[x for x in float_range(1, 3, 0.1)]
Similar to Howard's answer but a bit more efficient:
def my_func(low, up, leng):
    step = (up - low) * 1.0 / leng
    return [low + i * step for i in xrange(leng)]