Interpolation between two xarray datasets with Basemap - python-2.7

I have two different xarray datasets that have different latitude/longitude grid resolutions. I want to regrid the one xarray with lower resolution to the same resolution as the one xarray with higher resolution. I found some examples (e.g., http://earthpy.org/interpolation_between_grids_with_basemap.html), but it does not work for me. Here is one example that I made for testing:
import numpy as np
import xarray as xray
import mpl_toolkits.basemap
var1=xray.DataArray(np.random.randn(len(np.linspace(40.5,49.5,10)),len(np.linspace(-39.5,-20.5,20))),coords=[np.linspace(40.5,49.5,10), np.linspace(-39.5,-20.5,20)],dims=['lat','lon'])
(xlon, xlat)=np.meshgrid(np.linspace(-39.875,-20.125,80),np.linspace(40.125,49.875,40))
var2=xray.DataArray(-xlon**2+xlat**2,coords=[np.linspace(40.125,49.875,40),np.linspace(-39.875,-20.125,80)],dims=['lat','lon'])
mpl_toolkits.basemap.interp(var1,var1.lon,var1.lat,var2.lon,var2.lat,checkbounds=False,masked=False,order=0)
I get following error:
ValueError: xout and yout must have same shape!
Screenshot:
Does basemap.interp() require xout and yout to be the same shape? So var2 needs to be a square? This is almost never the case with any of my datasets! How can I regrid var1 to be the same resolution as var2?
Note: After regridding, I want to subsample var1 given some condition related to var2. For example:
var1_subset = var1.where(var2>1000)
So I want to minimize any loss of grid points during the interpolation.

basemap.interp will work only when xout and yout are same in number or number of output nlons and nlats are same,
why not generate same length output nlats and nlons and subset it later.
For example:
import numpy as np
import xarray as xray
import mpl_toolkits.basemap
var1=xray.DataArray(np.random.randn(len(np.linspace(40.5,49.5,10)),len(np.linspace(-39.5,-20.5,20))),coords=[np.linspace(40.5,49.5,10), np.linspace(-39.5,-20.5,20)],dims=['lat','lon'])
(xlon,xlat)=np.meshgrid(np.linspace(-39.875,20.125,80),np.linspace(40.125,49.875,80))
var2=xray.DataArray(-xlon**2+xlat**2,coords[np.linspace(40.125,49.875,80),np.linspace(-39.875,-20.125,80)],dims=['lat','lon'])
mpl_toolkits.basemap.interp(var1,var1.lon,var1.lat,var2.lon,var2.lat,checkbounds=False,masked=False,order=0)
Here is another cool trick with xarray.
lonreg=var1.groupby_bins('lon',np.linspace(-39.875,20.125,80)).mean(dim='lon')
regridded=lonreg.groupby_bins('lat',np.linspace(-39.5,20.5,20)).mean(dim='lat')
if you want weighted averaged regridding, it is easy to extend this for area averaged regridding by using weights and sum function on groupby object.

Related

PyMC3 autocorrplot() for any given array

The autocorrplot() function gives the autocorrelation plot for the sampled data from the trace.
If I already have a sample of data in the form of an array or list, can I use autocorrplot() to do the same?
Is there any alternative to generate autocorrelation plots given a sequence of data?
Please help.
autocorrplot is a wrapper around matplotlib's acorr. To get a similar look to pymc3's version, you can use something like
import numpy as np
import matplotlib.pyplot as plt
my_array = np.random.normal(size=1000)
plt.acorr(my_array, detrend=plt.mlab.detrend_mean, maxlags=100)
plt.xlim(0, 100)
Note the call to xlim at the end, since by default PyMC3 does not show negative correlations.

sklearn PCA.transform gives different results for different trials

I am doing some PCA using sklearn.decomposition.PCA. I found that if the input matrix X is big, the results of two different PCA instances for PCA.transform will not be the same. For example, when X is a 100x200 matrix, there will not be a problem. When X is a 1000x200 or a 100x2000 matrix, the results of two different PCA instances will be different. I am not sure what's the cause for this: I suppose there is no random elements in sklearn's PCA solver? I am using sklearn version 0.18.1. with python 2.7
The script below illustrates the issue.
import numpy as np
import sklearn.linear_model as sklin
from sklearn.decomposition import PCA
n_sample,n_feature = 100,200
X = np.random.rand(n_sample,n_feature)
pca_1 = PCA(n_components=10)
pca_1.fit(X)
X_transformed_1 = pca_1.transform(X)
pca_2 = PCA(n_components=10)
pca_2.fit(X)
X_transformed_2 = pca_2.transform(X)
print(np.sum(X_transformed_1 == X_transformed_2) )
print(np.mean((X_transformed_1 - X_transformed_2)**2) )
There's a svd_solver param in PCA and by default it has value "auto". Depending on the input data size, it chooses most efficient solver.
Now as for your case, when size is larger than 500, it will choose randomized.
svd_solver : string {‘auto’, ‘full’, ‘arpack’, ‘randomized’}
auto :
the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the
number of components to extract is lower than 80% of the smallest
dimension of the data, then the more efficient ‘randomized’ method is
enabled. Otherwise the exact full SVD is computed and optionally
truncated afterwards.
To control how the randomized solver behaves, you can set random_state param in PCA which will control the random number generator.
Try using
pca_1 = PCA(n_components=10, random_state=SOME_INT)
pca_2 = PCA(n_components=10, random_state=SOME_INT)
I had a similar problem even with the same trial number but on different machines I was getting different result setting the svd_solver to 'arpack' solved the problem

How to add interrupted time series in matplotlib

Since I am just a python beginner I hope, that I bother you not too much with my question. I'd like to plot only parts of a time series from a dataset such as this:
Code:
import numpy as np
import matplotlib.pyplot as plt
a=np.arange(10)
b=np.arange(100, 105)
c=np.arange(30, 40)
d=np.arange(55, 60)
e=np.append(a,b)
f=np.append(c,d)
plt.plot(e,f)
Plot:
Then I get a plot with a long diagonal line between the data series. What to do to get rid of that, in other words I want that the x-axis only shows 0,1,2,3,4,5,6,7,8,9,100,101,102,103,104 and the corresponding y-values and nothing in between. Moreover if I have a very long time series (e.g. from an oscilloscope measurement), is it possible to show up that some patterns are repeated with e.g. big dots in the middle of the plot? I've tried to convert the x data into a string and several other things, but it does not seem to work.
Focusing on this part of your question:
Then I get a plot with a long diagonal line between the data series. What to do to get rid of that [...]
It sounds that a scatteplot should meet your needs, so try this:
Add this to your code:
fig, ax = plt.subplots()
ax.scatter(e, f)
for i, txt in enumerate(e):
ax.annotate(txt, (e[i],f[i]))
fig.show()
To get this plot:
Unfortunately, I'm not quite sure what you're trying to accomplish with this part:
Moreover if I have a very long time series (e.g. from an oscilloscope
measurement)
Anyway, I hope my suggestion at least fixes the main part of your problem.

Different types of features to train Naive Bayes in Python Pandas

I would like to use a number of features to train with Naive Bayes classifier to classify 'A' or 'non-A'.
I have three features of different value types:
1) total_length - in positive integer
2) vowel-ratio - in decimal/fraction
3) twoLetters_lastName - a array containing multiple two-letters strings
# coding=utf-8
from nltk.corpus import names
import nltk
import random
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from sklearn.naive_bayes import GaussianNB
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# Import data into pandas
data = pd.read_csv('XYZ.csv', header=0, encoding='utf-8',
low_memory=False)
df = DataFrame(data)
# Randomize records
df = df.reindex(np.random.permutation(df.index))
# Assign column into label Y
df_Y = df[df.AScan.notnull()][['AScan']].values # Labels are 'A' or 'non-A'
#print df_Y
# Assign column vector into attribute X
df_X = df[df.AScan.notnull()][['total_length', 'vowel_ratio', 'twoLetters_lastName']].values
#print df_X[0:10]
# Incorporate X and Y into ML algorithms
clf = GaussianNB()
clf.fit(df_X, df_Y)
df_Y is as follow:
[[u'non-A']
[u'A']
[u'non-A']
...,
[u'A']
[u'non-A']
[u'non-A']]
df_X is below:
[[9L 0.222222222 u"[u'ke', u'el', u'll', u'ly']"]
[17L 0.41176470600000004
u"[u'ma', u'ar', u'rg', u'ga', u'ar', u'ri', u'is']"]
[11L 0.454545455 u"[u'du', u'ub', u'bu', u'uc']"]
[11L 0.454545455 u"[u'ma', u'ah', u'he', u'er']"]
[15L 0.333333333 u"[u'ma', u'ag', u'ge', u'ee']"]
[13L 0.307692308 u"[u'jo', u'on', u'ne', u'es']"]
[12L 0.41666666700000005
u"[u'le', u'ef', u'f\\xe8', u'\\xe8v', u'vr', u're']"]
[15L 0.26666666699999997 u"[u'ni', u'ib', u'bl', u'le', u'et', u'tt']"]
[15L 0.333333333 u"[u'ki', u'in', u'ns', u'sa', u'al', u'll', u'la']"]
[11L 0.363636364 u"[u'mc', u'cn', u'ne', u'ei', u'il']"]]
I am getting this error:
E:\Program Files Extra\Python27\lib\site-packages\sklearn\naive_bayes.py:150: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Traceback (most recent call last):
File "C:werwer\wer\wer.py", line 32, in <module>
clf.fit(df_X, df_Y)
File "E:\Program Files Extra\Python27\lib\site-packages\sklearn\naive_bayes.py", line 163, in fit
self.theta_[i, :] = np.mean(Xi, axis=0)
File "E:\Program Files Extra\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 2727, in mean
out=out, keepdims=keepdims)
File "E:\Program Files Extra\Python27\lib\site-packages\numpy\core\_methods.py", line 69, in _mean
ret, rcount, out=ret, casting='unsafe', subok=False)
TypeError: unsupported operand type(s) for /: 'unicode' and 'long'
My understanding is I need to convert the features into one numpy array as a feature vector, but I don't think if I am preparing this X vector right since it contains very different value types.
Related questions: Choosing a Classification Algorithm to Classify Mix of Nominal and Numeric Data -- Mixing Categorial and Continuous Data in Naive Bayes Classifier Using Scikit-learn
Okay so there are a few things going on. As DalekSec pointed out, it's best practice to keep all your features as one type as you input them into a model like GaussianNB. The traceback indicates that while fitting the model, it tries to divide a string (presumably one of your unicode strings like u"[u'ke', u'el', u'll', u'ly']") by an integer. So what we need to do is convert the training data into a form that sklearn can use. We can do this a few ways, two of which ogrisel eloquently describes in this answer here.
We can convert all the continuous variables to categorical variables. In our case, this means converting total_length (in some cases you could probably treat this as a categorical variable, but let's not get ahead of ourselves) and vowel-ratio. For instance, you can basically bin the values you see in each feature to one of 5 values based on percentile: 'very small', 'small', 'medium', 'high', 'very high'. There's no real easy way in sk-learn as far as I know, but it should be pretty straightforward to do it yourself. The only thing that you would want to change is that you would want to use MultinomialNB instead of GaussianNB because you'll be dealing with features that would be better described by multinomial distributions rather than gaussian ones.
We can convert the categorical features to numeric ones for use with GaussianNB. Personally I find this to be the more intuitive approach. Basically, when dealing with text, you need to figure out what information you want to take from the text and pass to the classifier. It looks like to me that you want to extract the incidence of different two letter last names.
Normally I would ask you whether or not you have all the last names in your dataset, but since each one is only two letters each we can just store all the possible two letter names (including the unicode characters involving accent marks) with a minimal impact on performance. This is where something like sklearn's CountVectorizer might be useful. Assuming that you have every possible combination of two letter last names in your data, you can just directly use this to turn a row in your twoLetter_lastname column into a N-dimensional vector that records the number of occurrences of each unique last name in your row. Then just combine this new vector with your other two features into a numpy array.
In the case you do not have every possible combination of two letters (including accented ones), you should consider generating that list and pass it in as the 'vocabulary' for the CountVectorizer. This is so that your classifier knows how to handle all possible last names. It's not the end of the world if you don't handle all cases, but any new unseen two letter pairs will be ignored in this scheme.
Before you use these tools, you should make sure that you pass your last name column in as a list, and not as a string, as this can result in unintended behavior.
You can read more about general sklearn preprocessing here, and more about CountVectorizer and other text feature extraction tools provided by sklearn here. I use a lot of these tools daily, and recommend them for basic text extraction tasks. There are also plenty of tutorials and demos available online. You might also look for other types of methods of representation, like binarizing and one-hot encoding. There are many ways to solve this problem, it mostly depends on your specific problem/needs.
After you're able to turn all your data into one form or the other, you should be able to make use of either the Gaussian or Multinomial NB classifier. As for your error regarding the 1D vector, you printed df_Y and it looked like
[[u'non-A']
[u'A']
[u'non-A']
...,
[u'A']
[u'non-A']
[u'non-A']]
Basically, it's expecting this to be in a flat list, rather than as a column vector (a list of one-dimensional lists). Just reshape it accordingly by making use of commands like numpy.reshape() or numpy.ravel() (numpy.ravel() would probably be more appropriate, considering that you're dealing with just one column, as the error mentioned).
I'm not 100% sure, but I think scikit-learn.naive_bayes requires a purely numeric feature vector instead of a mixture of text and numbers. It looks like it crashes when trying to "divide" a unicode string by a long integer.
I can't be much help with finding numeric representations for text, but this scikit-learn tutorial might be a good start.

Using python/numpy to create a matrix?

I have a basic understanding using numpy to create a matrix, but the context in which I have to create one confuses me. For example, I need to create a 2X1000 matrix with normally distributed values with mean 0 and standard deviation of 1. I'm not sure what it means to make a matrix with these conditions.
Besides what was writeen above by CoDEmanX, From numpy.random.normal we can read about the generic Normal Distribution in numpy:
numpy.random.normal(loc=0.0, scale=1.0, size=None)
Where:
loc is the Mean of the distribution and scale is the standard deviation (square root of variance).
import numpy as np
A = np.random.normal(loc =0, scale =1, size=(2, 1000))
print(A)
But if these examples are confusing, then consider that
np.random.normal()
simply gives you a random number, and you can create your own custom matrix:
import numpy as np
A = [ [np.random.normal() for i in range(1000)] for j in range(2) ]
A = np.array(A)
If you refer to the numpy docs, whether there are utility functions that facilitate you reaching your goal, you'll come across a normal distribution function:
numpy.random.standard_normal(size=None)
standard_normal(size=None)
Returns samples from a Standard Normal distribution (mean=0, stdev=1).
The simple average is 0, the standard deviation 1.
arr = numpy.random.standard_normal((2, 1000))
print(arr.mean()) # -0.027...
print(arr.std()) # 1.0272...
Note that it's not exactly 0 or 1.
I still recommend to read about normal distributation and standard deviation / variance, eventhough numpy offers a simple solution.