check_arrays() limiting array dimensions in scikit-learn?

I would like to use sklearn.learning_curves.py, which is available in scikit-learn 0.15. After I cloned this version, several functions no longer work because check_arrays() limits arrays to at most 2 dimensions.
>>> from sklearn import metrics
>>> from sklearn.cross_validation import train_test_split
>>> import numpy as np
>>> X = np.random.random((10,2,2,2))
>>> y = np.random.random((10,2,2,2))
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=3)
ValueError: Found array with dim 4. Expected <= 2
Calling mean_squared_error with the same X and y gives the same error:
>>> mse = metrics.mean_squared_error
>>> mse(X,y)
ValueError: Found array with dim 4. Expected <= 2
If I go to sklearn.utils.validation.py and comment out lines 272, 273, and 274 as shown below everything works just fine.
# if array.ndim >= 3:
#     raise ValueError("Found array with dim %d. Expected <= 2" %
#                      array.ndim)
Why are the dimensions of the arrays being limited to 2?

Because scikit-learn uses a 2-d convention (n_samples × n_features) for all feature data. If a function or method lets a higher-dimensional array through, that is usually just an oversight, and you should not rely on it.
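One way to stay within the 2-d convention is to flatten every axis except the sample axis before calling scikit-learn, and to restore the original layout afterwards. The following is only a minimal sketch using NumPy alone; the mean squared error is computed by hand here rather than with metrics.mean_squared_error, and the random data merely mimics the shapes from the question.

```python
import numpy as np

# Same shapes as in the question: 10 samples, each a 2x2x2 block.
rng = np.random.RandomState(3)
X = rng.random_sample((10, 2, 2, 2))
y = rng.random_sample((10, 2, 2, 2))

# Keep axis 0 (samples) and flatten the remaining axes into one feature axis.
X2d = X.reshape(len(X), -1)
y2d = y.reshape(len(y), -1)
print(X2d.shape, y2d.shape)

# A hand-written mean squared error gives the same value for both layouts,
# because flattening only rearranges the elements.
mse_4d = np.mean((X - y) ** 2)
mse_2d = np.mean((X2d - y2d) ** 2)
print(np.isclose(mse_4d, mse_2d))

# The original layout can be restored after splitting or scoring.
X_back = X2d.reshape(X.shape)
```

The 2-d arrays X2d and y2d are what would be passed to train_test_split or a metric; whether flattening is meaningful for a given model depends on the data.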

Related

ValueError: shapes (1,4) and (5,4) not aligned: 4 (dim 1) != 5 (dim 0), when adding my variables to my prediction machine

I am building a prediction model with four variables. When I add the variables, it fails with:
ValueError: shapes (1,4) and (5,4) not aligned: 4 (dim 1) != 5 (dim 0)
Code:
import pandas as pd
from pandas import DataFrame
from sklearn import linear_model
import tkinter as tk
import statsmodels.api as sm
# Approach 1: Import the data into Python
Stock_Market = pd.read_csv(r'Training_Nis_New2.csv')
df = DataFrame(Stock_Market, columns=[
    'Month 1', 'Month 2', 'Month 3', 'Month 4', 'Month 5', 'Month 6',
    'Month 7', 'Month 8', 'Month 9', 'Month 10', 'Month 11', 'Month 12',
    'FSUTX', 'MMUKX', 'FUFRX', 'RYUIX',
    'Interest R', 'Housing Sale', 'Unemployement Rate',
    'Conus Average Temperature Rank',
    '30FSUTX', '30MMUKX', '30FUFRX', '30RYUIX'])
X = df[['Month 1','Interest R','Housing Sale','Unemployement Rate','Conus Average Temperature Rank']]
# here we have 2 variables for multiple regression. If you just want to use one variable for simple linear regression, then use X = df['Interest_Rate'] for example.Alternatively, you may add additional variables within the brackets
Y = df[['30FSUTX','30MMUKX','30FUFRX','30RYUIX']]
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)
# prediction with sklearn
HS = 5.5
UR = 6.7
CATR = 8.9
New_Interest_R = 4.6
print('Predicted Stock Index Price: \n', regr.predict([[UR, HS, CATR, New_Interest_R]]))
# with statsmodel
X = df[['Month 1','Interest R','Housing Sale','Unemployement Rate','Conus Average Temperature Rank']]
Y = df['30FSUTX']
print('\n\n*** Fund = FSUTX')
X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
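The shapes in the error message match the code above: X has five columns and Y has four, so the fitted coefficients have shape (4, 5), and predict multiplies the input row by their transpose of shape (5, 4). The row passed to predict has only four values, and they are not in the column order of X. The sketch below reproduces the mismatch with NumPy alone; the array of ones stands in for regr.coef_, and the Month 1 value of 1.0 is a made-up placeholder.

```python
import numpy as np

n_features, n_targets = 5, 4                   # 5 columns in X, 4 columns in Y
coef = np.ones((n_targets, n_features))        # stand-in for regr.coef_

# What the question passes: UR, HS, CATR, New_Interest_R (only 4 values).
row_4 = np.array([[6.7, 5.5, 8.9, 4.6]])
try:
    np.dot(row_4, coef.T)
    failed = False
except ValueError as exc:
    failed = True
    print(exc)

# One value per column of X, in the same order as the columns:
# Month 1 (placeholder), Interest R, Housing Sale, Unemployement Rate, CATR.
row_5 = np.array([[1.0, 4.6, 5.5, 6.7, 8.9]])
pred = np.dot(row_5, coef.T)
print(pred.shape)
```

With a row of five values the product has one prediction per target column of Y.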

tf.py_func , custom tensorflow function getting applied to only the first element in the tensor

I am new to TensorFlow and was playing around with a deep learning network. I wanted to apply custom rounding to all the weights after each iteration, because the round function in the TensorFlow library does not offer an option to round to a given number of decimal places.
So I wrote this:
import numpy as np
import tensorflow as tf
from tensorflow.python.framework import ops
np_prec = lambda x: np.round(x,3).astype(np.float32)
def tf_prec(x, name=None):
    with ops.name_scope("d_spiky", name, [x]) as name:
        y = tf.py_func(np_prec,
                       [x],
                       [tf.float32],
                       name=name,
                       stateful=False)
        return y[0]

with tf.Session() as sess:
    x = tf.constant([0.234567, 0.712, 1.2, 1.7])
    y = tf_prec(x)
    tf.global_variables_initializer
    print(x.eval(), y.eval())
The output I got was this
[ 0.234567 0.71200001 1.20000005 1.70000005] [ 0.235 0.71200001 1.20000005 1.70000005]
So the custom rounding off worked only on the first item in the tensor and I am not sure about what I am doing wrong. Thanks in advance.

The cause is the following line:
np_prec = lambda x: np.round(x,3).astype(np.float32)
Here you cast the output to np.float32. The rounding is applied to every element, but values such as 0.712 cannot be represented exactly in float32, so they print with trailing digits. You can verify this with the following code:
print(np.round([0.234567,0.712,1.2,1.7], 3).astype(np.float32)) #prints [ 0.235 0.71200001 1.20000005 1.70000005]
For a list of Python floats, the output of np.round is float64. Besides removing the cast, you also have to change the Tout argument in tf.py_func to float64.
The code below applies this fix, with comments on the changed lines.
import numpy as np
import tensorflow as tf
from tensorflow.python.framework import ops
np_prec = lambda x: np.round(x,3)
def tf_prec(x, name=None):
    with ops.name_scope("d_spiky", name, [x]) as name:
        y = tf.py_func(np_prec,
                       [x],
                       [tf.float64],  # changed this line to tf.float64
                       name=name,
                       stateful=False)
        return y[0]

with tf.Session() as sess:
    x = tf.constant([0.234567, 0.712, 1.2, 1.7], dtype=np.float64)  # specify the input data type np.float64
    y = tf_prec(x)
    tf.global_variables_initializer
    print(x.eval(), y.eval())
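The float32 effect can also be checked without TensorFlow. The sketch below is only an illustration in plain NumPy: it rounds in float64, applies the cast from the question, and then widens the result back to float64 to compare the stored values with the rounded ones.

```python
import numpy as np

values = [0.234567, 0.712, 1.2, 1.7]

rounded64 = np.round(values, 3)             # float64 result
rounded32 = rounded64.astype(np.float32)    # the cast used in the question

# Widening back to float64 exposes the value float32 actually stores,
# which is close to, but not equal to, the rounded decimal.
widened = rounded32.astype(np.float64)
print(rounded64.dtype, rounded32.dtype)
print(widened - rounded64)

# Every element was rounded (the differences are tiny), yet none of the
# float32 values is bit-for-bit the same number as its float64 counterpart.
all_rounded = np.allclose(widened, rounded64, atol=1e-6)
exact = np.array_equal(widened, rounded64)
print(all_rounded, exact)
```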
Hope this helps.

vectorizer.fit_transform gives NotImplementedError : adding a nonzero scalar to a sparse matrix is not supported

I am trying to create a term-document matrix, using my custom analyser to extract features from the documents. The code is as follows:
import re
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    ngram_range=(1, 2),
)
analyzer = vectorizer.build_analyzer()

def customAnalyzer(text):
    grams = analyzer(text)
    # drop grams that consist only of digits and whitespace
    tgrams = [gram for gram in grams if not re.match(r"^[0-9\s]+$", gram)]
    return tgrams
This function is called to create the custom analyser, which is used by the countVectorizer to extract the features.
for i in xrange(0, num_rows):
    clean_query.append(review_to_words(inp["keyword"][i], units))
vectorizer = CountVectorizer(analyzer=customAnalyzer,
                             tokenizer=None,
                             ngram_range=(1, 2),
                             preprocessor=None,
                             stop_words=None,
                             max_features=n,
                             )
features = vectorizer.fit_transform(clean_query)
z = vectorizer.get_feature_names()
This call throws the following error:
(<type 'exceptions.NotImplementedError'>, 'python.py', 128,NotImplementedError('adding a nonzero scalar to a sparse matrix is not supported',))
The error occurs when fit_transform is called on the vectorizer.
However, the value of the variable clean_query is not a scalar. I am using sklearn 0.17.1.
np.isscalar(clean_query)
False

I ran a small test to try to reproduce the error, but it did not raise the same error for me. (The example is taken from the scikit-learn feature extraction documentation.)
scikit-learn version : 0.19.dev0
In [1]: corpus = [
   ...:     'This is the first document.',
   ...:     'This is the second second document.',
   ...:     'And the third one.',
   ...:     'Is this the first document?',
   ...: ]
In [2]: from sklearn.feature_extraction.text import TfidfVectorizer
In [3]: vectorizer = TfidfVectorizer(min_df=1)
In [4]: vectorizer.fit_transform(corpus)
Out[4]:
<4x9 sparse matrix of type '<type 'numpy.float64'>'
with 19 stored elements in Compressed Sparse Row format>
In [5]: import numpy as np
In [6]: np.isscalar(corpus)
Out[6]: False
In [7]: type(corpus)
Out[7]: list
As the code above shows, corpus is not a scalar; it is a list.
I think the solution lies in how clean_query is built: it should have the form that vectorizer.fit_transform expects, which is an iterable of strings such as the list above.
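A quick sanity check on clean_query may help narrow this down. The sketch below is not a diagnosis of the original error; it only shows, in plain Python 3, one way to confirm that every entry is a string and what the regular expression in customAnalyzer filters out. The helper names and the sample data are hypothetical.

```python
import re

def looks_like_corpus(docs):
    """Return True if docs is a list or tuple whose items are all strings."""
    return isinstance(docs, (list, tuple)) and all(isinstance(d, str) for d in docs)

def drop_numeric_grams(grams):
    """Keep only grams that are not made up purely of digits and whitespace."""
    return [g for g in grams if not re.match(r"^[0-9\s]+$", g)]

good = ["first document", "second document 42"]
bad = ["first document", None, 3.5]    # hypothetical output of a faulty cleaning step

print(looks_like_corpus(good))
print(looks_like_corpus(bad))
print(drop_numeric_grams(["price", "42", "4 2", "42 usd"]))
```

On Python 2.7, as used in the question, the string check would need to accept unicode as well.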

SciPy curve_fit not working when one of the parameters to fit is a power

I'm trying to fit my data to a user-defined function using SciPy curve_fit. This works when fitting to a function with a fixed power (func1), but curve_fit does not work when the function contains a power as a parameter to fit (func2).
curve_fit still does not work if I provide an initial guess for the parameters using the keyword p0. I cannot use the bounds keyword, as the version of SciPy I have does not support it.
This script illustrates the point:
import numpy as np
import scipy
from scipy.optimize import curve_fit
import sys
print 'scipy version: ', scipy.__version__
print 'np.version: ', np.__version__
print sys.version_info
def func1(x, a):
    return (x-a)**3.0
def func2(x, a, b):
    return (x-a)**b
x_train = np.linspace(0, 12, 50)
y = func2(x_train, 0.5, 3.0)
y_train = y + np.random.normal(size=len(x_train))
print 'dtype of x_train: ', x_train.dtype
print 'dtype of y_train: ', y_train.dtype
popt1, pcov1 = curve_fit( func1, x_train, y_train, p0=[0.6] )
popt2, pcov2 = curve_fit( func2, x_train, y_train, p0=[0.6, 4.0] )
print 'Function 1: ', popt1, pcov1
print 'Function 2: ', popt2, pcov2
Which outputs the following:
scipy version: 0.14.0
np.version: 1.8.2
sys.version_info(major=2, minor=7, micro=6, releaselevel='final', serial=0)
dtype of x_train: float64
dtype of y_train: float64
stack_overflow.py:14: RuntimeWarning: invalid value encountered in power
return (x-a)**b
Function 1: [ 0.50138759] [[ 3.90044196e-07]]
Function 2: [ nan nan] [[ inf inf]
[ inf inf]]

(As @xnx first commented,) the problem with the second formulation, where the exponent b is unknown and treated as real-valued, is that while testing candidate values for a and b, quantities of the form z**p need to be evaluated, where z is a negative real number and p is a non-integer. Such a quantity is complex in general; on real-valued NumPy arrays it evaluates to nan (the RuntimeWarning in the output above), hence the procedure fails. For example, for x=0 and trial values a=0.5, b=4.1, we get (x-a)**b = (-0.5)**4.1 = 0.0555+0.018j.
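The example value can be checked directly. The sketch below uses plain Python 3; real_power is a hypothetical helper that only imitates how a real-valued power behaves for a negative base, and is not part of SciPy or NumPy.

```python
import math

def real_power(z, p):
    """Imitate a real-valued power: nan when z < 0 and p is not an integer."""
    if z < 0 and p != int(p):
        return float("nan")
    return z ** p

x, a, b = 0.0, 0.5, 4.1

as_complex = complex(x - a) ** b    # complex arithmetic keeps the value
as_real = real_power(x - a, b)      # real arithmetic has no answer here

print(as_complex)
print(math.isnan(as_real))

# With the fixed integer exponent of func1 the result stays real,
# whatever the sign of the base.
print(real_power(x - a, 3.0))
```

This is why func1 fits without trouble while func2 runs into nan as soon as a trial value of a exceeds some x in the data.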

Why am I getting this error: TypeError: Input must be a 2D array

I am working on a python code to plot Eddy Kinetic Energy. I am fairly new to python and I'm confused about an error I have been getting. I'm not worried about plotting my data on a map just yet, I just want to see if I can get it to plot. Here is my code and error:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from pylab import *
from netCDF4 import Dataset
from mpl_toolkits.basemap import Basemap
import matplotlib.cm as cm
from mpl_toolkits.basemap import shiftgrid
test = Dataset('p.34331101.atmos_daily.nc', 'r')
lat = test.variables['lat'][:]
lon = test.variables['lon'][:]
level = test.variables['level'][5]
time = test.variables['time'][:]
u = test.variables['ucomp'][:]
v = test.variables['vcomp'][:]
temp = test.variables['temp'][:]
print(lat.shape)
print(u.shape)
#uz = np.reshape(u, (30, 26, 90))
uzm = np.nanmean(u, axis=3)
#vz = np.reshape(v, (30, 26, 90))
vzm = np.nanmean(v, axis=3)
print(uzm.shape)
ustar = u-uzm[:,:,:,np.newaxis]
vstar = v-vzm[:,:,:,np.newaxis]
EKE = np.nanmean(.5*(ustar**2 + vstar**2), axis=3)
EKE1 = np.asarray(EKE)
%matplotlib inline
print(EKE.shape)
levels=[-10, -5, 0, 5, 10]
plt.contour(EKE[1,1,:])
#EKE is time, level, lat and the shape is (30, 26, 90)
TypeError: Input must be a 2D array.

Bret, you would probably get more help if you included a bit more information with your error. Did you not get a line number to look at?
My guess is that the problem is passing a 1D array to contour(). It can seem counter-intuitive, but NumPy drops a dimension automatically whenever you index an axis with a single value.
For example, try:
print(EKE.shape)
print(EKE[1,1,:].shape)
print(EKE[1:2,1:2,:].shape)
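The effect of each indexing style can be seen on a placeholder array with the shape given in the question. This is only a sketch: the zeros stand in for the real EKE data, and which 2D slice is physically meaningful to plot is up to you.

```python
import numpy as np

EKE = np.zeros((30, 26, 90))      # placeholder: time, level, lat

print(EKE[1, 1, :].shape)         # two single indices leave a 1D array
print(EKE[1:2, 1:2, :].shape)     # slices keep every axis, so this is 3D
print(EKE[1, :, :].shape)         # one single index leaves a 2D array

# A level-by-lat slice for one time step is 2D, which is what contour() needs.
level_lat = EKE[1, :, :]
```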
