I need to handle missing data in the scipy pearsonr function. The problem is that if I read the data from a file with e.g. 'NA' entries, I cannot convert them to float so that numpy will accept them. If I read it with numpy instead, the resulting flexible-dtype array containing the 'NA' values cannot be used with 'dropna' or similar. How can I get scipy to accept missing data? I have read about data masks, but I don't understand how to use them in the code.
Thanks,
#!/usr/bin/env python
import sys
import scipy.stats as sp
import numpy as np

f1 = open(sys.argv[1], 'r')
f2 = open(sys.argv[2], 'r')
g = open(sys.argv[3], 'w')
f1.readline()
otus = []
metanames = []
result = {}
for i in f1:
    k1 = i.split("\t")
    k1[-1] = k1[-1].rstrip("\n")
    otu = k1[0]
    f2.seek(0)
    result[otu] = []
    f2.readline()
    for j in f2:
        k2 = j.split("\t")
        k2[-1] = k2[-1].rstrip("\n")
        if k2[0] not in metanames:
            metanames.append(k2[0])
        x = np.asarray(k1[1:])
        y = np.asarray(k2[1:])
        corr = sp.pearsonr(x, y)
        result[otu].append(str(corr))
g.write("\t" + "\t".join(str(p) for p in metanames) + "\n")
for i in result.keys():
    g.write(i + "\t" + "\t".join(str(p) for p in result[i][0]) + "\n")
This fails with:
TypeError: cannot perform reduce with flexible type
I'm not sure I understand your problem. A minimal example
would have been helpful. But I guess you get something like the following:
>>> x = np.array([[ 1., 2., 3.], [ 4., 5., 6.], [ 7., 8., 9.]]).astype('object')
>>> x[0,1] = 'NA'
>>> x[2,2] = 'NA'
>>> x
array([[1.0, 'NA', 3.0],
       [4.0, 5.0, 6.0],
       [7.0, 8.0, 'NA']], dtype=object)
So you have a numpy array of dtype object because of the 'NA's.
>>> x.astype('float')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: 'NA'
And that's why you can't convert it to float. So all you have to do is replace the 'NA' entries with NaN:
>>> x[x=='NA'] = np.nan
>>> x
array([[1.0, nan, 3.0],
       [4.0, 5.0, 6.0],
       [7.0, 8.0, nan]], dtype=object)
>>> x.astype('float')
array([[  1.,  nan,   3.],
       [  4.,   5.,   6.],
       [  7.,   8.,  nan]])
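From there, note that pearsonr does not ignore NaN by itself, so you still need to drop the missing positions before computing the correlation. A minimal sketch with made-up 1-D arrays (stand-ins for your file columns):

import numpy as np
import scipy.stats as sp

x = np.array([1.0, np.nan, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.5, np.nan, 4.5, 5.5])

keep = ~np.isnan(x) & ~np.isnan(y)   # True where both values are present
corr = sp.pearsonr(x[keep], y[keep])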
OK, so I avoided using numpy at all and just manually removed 'NaN' and the corresponding value from both arrays. By far easier and quicker, should have thought of it before.
S.
f1 = open(sys.argv[1], 'r')
f2 = open(sys.argv[2], 'r')
g = open(sys.argv[3], 'w')
f1.readline()
otus = []
metanames = []
result = {}
for i in f1:
    k1 = i.split("\t")
    k1[-1] = k1[-1].rstrip("\n")
    otu = k1[0]
    if otu not in otus:
        otus.append(otu)
    f2.seek(0)
    result[otu] = []
    f2.readline()
    for j in f2:
        k2 = j.split("\t")
        k2[-1] = k2[-1].rstrip("\n")
        if k2[0] not in metanames:
            metanames.append(k2[0])
        x = k1[1:]
        y = k2[1:]
        # drop pairs where either value is missing; only advance the index
        # when nothing was deleted, since a deletion shifts elements left
        c = 0
        while c < len(x):
            if x[c] == 'NaN' or y[c] == 'NaN':
                del x[c]
                del y[c]
            else:
                x[c] = float(x[c])
                y[c] = float(y[c])
                c = c + 1
        corr = sp.pearsonr(x, y)
        result[otu].append(corr[0])
g.write("\t" + "\t".join(str(p) for p in metanames) + "\n")
for i in result.keys():
    g.write(i)
    for z in result[i]:
        g.write("\t" + str(z))
    g.write("\n")
I have the following Sympy code that works as expected:
import numpy as np
from sympy.utilities.lambdify import lambdify
from sympy.core import sympify
from sympy import factorial
ex = sympify('-x**2 / cos(x)')
flam = lambdify(['x'], ex, "numpy")
flam(np.array(range(5)))
This returns:
array([ 0. , -1.85081572, 9.61199185, 9.09097799, 24.4781705 ])
Now, what I need to know is how to do the same for factorials, that is, using factorial(x) instead of cos(x). The code:
ex = sympify('-x**2 / factorial(x)')
flam = lambdify(['x'], ex, "numpy")
flam(np.array(range(5)))
raises a NameError
NameError: global name 'factorial' is not defined
What string should I use so that it gets converted to a factorial that can be evaluated after lambdify?
Thanks in advance for any help!
By tinkering I arrived at the following code. It seems that the factorial function lambdify picks up by default does not work with ndarrays, so you have to supply one that does (here scipy.special.factorial):
import numpy as np
from sympy.utilities.lambdify import lambdify
flam = lambdify(['x'], '-x**2 / cos(x)', "numpy")
flam(np.array(range(5)))
# >>> array([ 0. , -1.85081572, 9.61199185, 9.09097799, 24.4781705 ])
import scipy.special
flam = lambdify('x', 'factorial(x)', ['numpy', {'factorial':scipy.special.factorial}])
flam(np.array(range(5)))
# >>> array([ 1., 1., 2., 6., 24.])
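An alternative sketch along the same lines, in case scipy is not available: wrap math.factorial with np.vectorize (this assumes integer inputs, since math.factorial rejects floats):

import math
import numpy as np
from sympy.utilities.lambdify import lambdify

# map sympy's factorial to a vectorized math.factorial
flam = lambdify('x', '-x**2 / factorial(x)',
                ['numpy', {'factorial': np.vectorize(math.factorial)}])
flam(np.array(range(5)))
# >>> array([-0.        , -1.        , -2.        , -1.5       , -0.66666667])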
I'm trying to plot a figure in 3D given an arbitrary number of points
import numpy as np
p = [
np.array([ 0.0, 0.0, 0.0]),
np.array([10.0, 0.0,10.0]),
np.array([10.0,21.0,10.0]),
np.array([14.5,25.5,14.5]),
np.array([ 0.0,40.0, 0.0]),
np.array([36.0,40.0, 0.0])]
... up to p[14]
section1 = [4, 0,1,2,3,4]
section2 = [8,14,1,2,8]
I need to combine p[4], p[0], p[1], p[2], p[3], p[4] and zip them to get the X, Y, Z lists I need to plot the lines.
I've been reduced to:
X = []
Y = []
Z = []
for i in range(len(section1)):
    X.append(p[section1[i]][0])
    Y.append(p[section1[i]][1])
    Z.append(p[section1[i]][2])
Whenever I put the points in a list and zip it, I get a strange list of the original points.
What is the right way to do it?
Your p is a list of arrays:
In [566]: p = [
...: np.array([ 0.0, 0.0, 0.0]),
...: np.array([10.0, 0.0,10.0]),
...: np.array([10.0,21.0,10.0]),
...: np.array([14.5,25.5,14.5]),
...: np.array([ 0.0,40.0, 0.0]),
...: np.array([36.0,40.0, 0.0])]
In [567]: len(p)
Out[567]: 6
In [568]: section1 = [4, 0,1,2,3,4]
I can convert that into a 2d array with np.stack:
In [569]: arr = np.stack(p)
In [570]: arr.shape
Out[570]: (6, 3)
Then it's easy to select rows with the section1 list:
In [571]: arr[section1,:]
Out[571]:
array([[  0. ,  40. ,   0. ],
       [  0. ,   0. ,   0. ],
       [ 10. ,   0. ,  10. ],
       [ 10. ,  21. ,  10. ],
       [ 14.5,  25.5,  14.5],
       [  0. ,  40. ,   0. ]])
X = arr[section1,0] and so on. Though for plotting you might not need to separate out the columns.
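To get all three at once, a small sketch using p and section1 exactly as defined in the question: transposing the selected rows turns them into one 1-D array per axis.

import numpy as np

p = [np.array([ 0.0,  0.0,  0.0]), np.array([10.0,  0.0, 10.0]),
     np.array([10.0, 21.0, 10.0]), np.array([14.5, 25.5, 14.5]),
     np.array([ 0.0, 40.0,  0.0]), np.array([36.0, 40.0,  0.0])]
section1 = [4, 0, 1, 2, 3, 4]
X, Y, Z = np.stack(p)[section1].T   # each a 1-D array of length len(section1)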
Here is a way to do that:
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt
p=np.random.rand(50,3)
section=[[21, 13, 2, 36, 20, 15,21],[7, 14, 19, 32,7]]
fig = plt.figure()
ax = fig.gca(projection='3d')
color=['red','blue']
for i in range(2):
    x, y, z = p[section[i]].T
    ax.plot(x, y, z, color[i])
plt.show()
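One caveat, assuming a recent matplotlib: fig.gca(projection='3d') has since been removed, so the 3-D axes would be created like this instead:

ax = fig.add_subplot(projection='3d')   # replaces fig.gca(projection='3d')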
I am using DictVectorizer to convert my features similar to example code:
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
X = v.fit_transform(D)
X
array([[ 2.,  0.,  1.],
       [ 0.,  1.,  3.]])
My question is: if I run this code repeatedly, is the order guaranteed? I.e. will 'bar' always occur in the first column, 'baz' in the second, and 'foo' in the third?
If the order is not guaranteed, do you know of an option to force this? This is important, as new unseen data passed into a model trained on this format will obviously need the features to occur in the same columns. Perhaps something could be done with the 'vocabulary_' attribute of DictVectorizer.
Cheers,
Steven
There is no problem if you use the fit and transform methods correctly. First you fit the DictVectorizer to your data, and then you transform the dataset into a feature matrix; both steps are done by the fit_transform() method you have called. If you have new, unseen data, you can just transform it using the transform() method, which projects the new data into the same structure as before.
This is illustrated by the example code you have linked to:
>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> X
array([[ 2.,  0.,  1.],
       [ 0.,  1.,  3.]])
>>> v.inverse_transform(X) == [{'bar': 2.0, 'foo': 1.0}, {'baz': 1.0, 'foo': 3.0}]
True
>>> v.transform({'foo': 4, 'unseen_feature': 3})
array([[ 0., 0., 4.]])
The final transform() call takes new, unseen data with two features. One of them is known to the DictVectorizer (because it was previously fitted on data that also had this feature); the other is not. As the output shows, the value for the known feature foo ends up in the correct column of the matrix, whereas the unknown feature is simply ignored.
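To answer the question about 'vocabulary_' directly: the fitted column order can be inspected (and persisted alongside the model) through the vectorizer's attributes. A small sketch:

from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer(sparse=False)
v.fit([{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}])
print(v.feature_names_)   # column order: ['bar', 'baz', 'foo']
print(v.vocabulary_)      # feature -> column index, e.g. {'bar': 0, 'baz': 1, 'foo': 2}

With the default sort=True the feature names are sorted at fit time, so refitting on the same data reproduces the same column order; for new data, always reuse the fitted vectorizer's transform().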
Here's my code:
from scipy.io import wavfile
fName = 'file.wav'
fs, signal = wavfile.read(fName)
signal = signal / max(abs(signal)) # scale signal
assert min(signal) >= -1 and max(signal) <= 1
And the error is:
Traceback (most recent call last):
  File "vad.py", line 10, in <module>
    signal = signal / max(abs(signal))
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Can anyone please help me solve this error?
Thanks in advance.
The line that produces the error wouldn't fail if your signal were 1-D (i.e. a mono audio file), so you probably have a stereo wav file, and your signal has the shape (nsamples, 2). Here's a short example for a stereo signal:
In [109]: x = (np.arange(10, dtype=float)-5).reshape(5,2)
In [110]: x
Out[110]:
array([[-5., -4.],
       [-3., -2.],
       [-1.,  0.],
       [ 1.,  2.],
       [ 3.,  4.]])
In [111]: x /= abs(x).max(axis=0) # normalize each channel independently
In [112]: x
Out[112]:
array([[-1. , -1. ],
       [-0.6, -0.5],
       [-0.2,  0. ],
       [ 0.2,  0.5],
       [ 0.6,  1. ]])
Your next line (the assert) will also give you trouble with a 2-D array, so there try:
(x>=-1).all() and (x<=1).all()
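Putting it together, a sketch of the corrected script, assuming 'file.wav' is a stereo file (per-channel normalization is one choice; normalizing by the global maximum is another):

import numpy as np
from scipy.io import wavfile

fs, signal = wavfile.read('file.wav')
signal = signal.astype(float)            # integer PCM -> float before scaling
signal /= np.abs(signal).max(axis=0)     # scale each channel to [-1, 1]
assert (signal >= -1).all() and (signal <= 1).all()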
I would like to use the sklearn.learning_curve module available in scikit-learn 0.15. After I cloned this version, several functions no longer work because check_arrays() limits the dimension of the arrays to 2.
>>> from sklearn import metrics
>>> from sklearn.cross_validation import train_test_split
>>> import numpy as np
>>> X = np.random.random((10,2,2,2))
>>> y = np.random.random((10,2,2,2))
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=3)
ValueError: Found array with dim 4. Expected <= 2
Using the same X and y I get the same error.
>>> mse = metrics.mean_squared_error
>>> mse(X,y)
ValueError: Found array with dim 4. Expected <= 2
If I go to sklearn/utils/validation.py and comment out lines 272, 273, and 274 as shown below, everything works just fine.
# if array.ndim >= 3:
# raise ValueError("Found array with dim %d. Expected <= 2" %
# array.ndim)
Why are the dimensions of the arrays being limited to 2?
Because scikit-learn uses a 2-d convention (n_samples × n_features) for all feature data. If any function or method lets a higher-d array through, that's usually just oversight and you can't really rely on it.
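If you want to keep the extra dimensions, the usual workaround is to flatten everything after the first axis into features before handing the data to scikit-learn. A sketch with the shapes from the question (using the old sklearn.cross_validation module to match its imports):

import numpy as np
from sklearn.cross_validation import train_test_split

X = np.random.random((10, 2, 2, 2))
y = np.random.random((10, 2, 2, 2))

# reshape to (n_samples, n_features): 10 samples of 2*2*2 = 8 features each
X2d = X.reshape(len(X), -1)
y2d = y.reshape(len(y), -1)
X_train, X_test, y_train, y_test = train_test_split(
    X2d, y2d, test_size=0.5, random_state=3)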