Bad array shape in python - python-2.7

I am trying to train an SVM on a dataset I found online. features_train, features_test, labels_train and labels_test are Python lists of tuples, which I convert to NumPy arrays with the code shown further down. However, clf.fit raises the following error:
File "ebola.py", line 47, in <module>
clf.fit(features_train_numpy,labels_train_numpy)
File "/usr/lib64/python2.7/site-packages/sklearn/svm/base.py", line 151, in fit
y = self._validate_targets(y)
File "/usr/lib64/python2.7/site-packages/sklearn/svm/base.py", line 514, in _validate_targets
y_ = column_or_1d(y, warn=True)
File "/usr/lib64/python2.7/site-packages/sklearn/utils/validation.py", line 551, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (2923, 9)
The code is as follows:
features_train_numpy = np.asarray(features_train)
labels_train_numpy= np.asarray(labels_train)
features_test_numpy = np.asarray(features_test)
labels_test_numpy= np.asarray(labels_test)
from sklearn.svm import SVC
temp = 100
clf=SVC(C=temp,kernel="rbf")
clf.fit(features_train_numpy,labels_train_numpy)

The error message itself shows the problem: your labels array is two-dimensional, with shape (2923, 9), while it should be a 1D vector of length 2923 whose i-th entry is the label of the i-th example. In your case each sample appears to have 9 labels, which sklearn's SVC does not support.
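A quick way to confirm this is to inspect the shape of the label array before calling fit. The sketch below uses small made-up label lists (not the asker's data) and shows two possible fixes. Which one applies to the real dataset depends on what the extra columns mean, so both the data and the column index here are assumptions.

```python
import numpy as np

# Hypothetical labels: one single-element tuple per sample.
labels_train = [(0,), (1,), (1,), (0,)]
labels = np.asarray(labels_train)
print(labels.shape)  # (4, 1): two-dimensional, one column

# Case 1: a single label column, so flatten it to a 1D vector.
labels_1d = labels.ravel()
print(labels_1d.shape)  # (4,)

# Case 2 (assumption): several columns per sample, of which only one
# is the class label. Pick that column explicitly.
wide_labels = np.asarray([(3, 0, 7), (5, 1, 2), (4, 1, 9)])
label_column = 1  # hypothetical index of the real label
labels_from_column = wide_labels[:, label_column]
print(labels_from_column.shape)  # (3,)
```

Either 1D array can then be passed as the second argument to clf.fit.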

Adding outputs of two layers in keras

I have an issue that seems to have no straightforward solution in Keras.
My server runs Ubuntu 14.04 and Keras with the TensorFlow backend.
Here's the issue:
I have two input tensors of shape Input(shape=(30,125,1)); each of them is fed to the cascade of three layers below:
CNN1 = Conv2D(filters = 8, kernel_size = (1,64) , padding = "same" , activation = "relu" )
CNN2 = Conv2D(filters = 8, kernel_size = (8,1) , padding = "same" , activation = "relu" )
pool = MaxPooling2D((2, 2))
For each input, the resulting output tensor has shape (None, 15, 62, 8). Now I want to add the two outputs element-wise, so that for each filter the two (15, 62) matrices are summed and the result again has shape (None, 15, 62, 8).
I tried the following code using a Lambda layer, but it throws an error.
from keras import backend as K
from keras.layers import Lambda
def myadd(x):
    increment = x[1]
    result = K.update_add(x[0], increment)
    return result
in_1 = Input(shape=(30,125,1))
in_1CNN1 = CNN1(in_1)
in_1CNN2 = CNN2(in_1CNN1)
in_1pool = pool(in_1CNN2)
in_2 = Input(shape=(30,125,1))
in_2CNN1 = CNN1(in_2)
in_2CNN2 = CNN2(in_2CNN1)
in_2pool = pool(in_2CNN2)
y1 =y1.astype(np.float32) # an input regression label array of shape (numsamples,1) loaded from a mat file
out1 = Lambda(myadd, output_shape=(None, 15, 62, 8))([in_1pool,in_2pool])
a= keras.layers.Flatten()(out1)
pre1 = Dense(1000, activation='sigmoid')(a)
pre2 =Dropout(0.2)(pre1)
predictions = Dense(1, activation='sigmoid')(pre2)
model = Model(inputs=[in_1,in_2], outputs=predictions)
model.compile(optimizer='sgd',loss='mean_squared_error')
model.fit([inputdata1,inputdata2], y1, epochs=20, validation_split=0.5)
#inputdata1, inputdata2 are arrays loaded from a mat file and are each of shape (5169, 30, 125, 1)
The error is highlighted below:
Traceback (most recent call last):
File "keras_workshop/keras_multipleinputs_multiple CNN.py", line 225, in <module>
out1 = Lambda(myadd, output_shape=(None, 15, 62, 8))([in_1pool,in_2pool])
File "/home/tharun/anaconda2/lib/python2.7/site-packages/keras/engine/topology.py", line 603, in __call__
output = self.call(inputs, **kwargs)
File "/home/tharun/anaconda2/lib/python2.7/site-packages/keras/layers/core.py", line 651, in call
return self.function(inputs, **arguments)
File "keras_workshop/keras_multipleinputs_multiple CNN.py", line 75, in myadd
result = K.update_add(x[0], increment)
File "/home/tharun/anaconda2/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 958, in update_add
return tf.assign_add(x, increment)
File "/home/tharun/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/state_ops.py", line 245, in assign_add
return ref.assign_add(value)
AttributeError: 'Tensor' object has no attribute 'assign_add'
Try the Add() layer or the add() function that Keras provides.
Add
keras.layers.Add()
Layer that adds a list of inputs.
It takes as input a list of tensors, all of the same shape, and returns a single tensor (also of the same shape).
add
keras.layers.add(inputs)
Functional interface to the Add layer.
Arguments
inputs: A list of input tensors (at least 2).
**kwargs: Standard layer keyword arguments.
Returns
A tensor, the sum of the inputs.
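As the traceback shows, K.update_add ends up calling assign_add, which the plain tensors produced by a layer do not have. A minimal sketch of the suggested replacement follows, reusing the layer definitions from the question. It assumes a Keras installation where these imports resolve, and it stops at the summed tensor rather than rebuilding the Dense layers.

```python
from keras.layers import Input, Conv2D, MaxPooling2D, Add
from keras.models import Model

# Shared layers, as defined in the question.
CNN1 = Conv2D(filters=8, kernel_size=(1, 64), padding="same", activation="relu")
CNN2 = Conv2D(filters=8, kernel_size=(8, 1), padding="same", activation="relu")
pool = MaxPooling2D((2, 2))

in_1 = Input(shape=(30, 125, 1))
in_2 = Input(shape=(30, 125, 1))
in_1pool = pool(CNN2(CNN1(in_1)))
in_2pool = pool(CNN2(CNN1(in_2)))

# Element-wise sum of the two branches; this replaces Lambda(myadd, ...).
out1 = Add()([in_1pool, in_2pool])

model = Model(inputs=[in_1, in_2], outputs=out1)
print(model.output_shape)  # (None, 15, 62, 8)
```

The Flatten, Dense and Dropout layers from the question can then be applied to out1 unchanged.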

How to save a list to a text file?

I want to save all x and y coordinates (the center of each pixel in a raster layer) as a list in a text file. As a first test I wrote the code below, which works correctly:
import os
import pickle
mylist = [(12, 25), (65, 96), (10, 15)]
path = r"data/listfile"
file = 'file.txt'
if not os.path.exists(path):
    os.makedirs(path)
with open(os.path.join(path, file), 'wb') as handle:
    pickle.dump(mylist, handle)
with open(os.path.join(path, file), 'rb') as handle:
    aa = pickle.loads(handle.read())
print aa
Next, I used this approach on my real raster layer. An MCVE of that code is:
from qgis.core import *
from PyQt4 import *
import os
import pickle
ds = QgsRasterLayer("/LData/Pop/lorst.tif", "Raster")
pixelWidth = ds.rasterUnitsPerPixelX()
pixelHeight = ds.rasterUnitsPerPixelY()
originX, originY = (ext.xMinimum(), ext.yMinimum())
src_cols = ds.width()
src_rows = ds.height()
path = r"LData/Pop"
file = 'List.txt'
if not os.path.exists(path):
    os.makedirs(path)
def pixel2coord(x, y):
    xp = (pixelWidth * x) + originX + (pixelWidth / 2)
    yp = (pixelHeight * y) + originY + (pixelHeight / 2)
    return QgsPoint(xp, yp)
list =[]
for i in range(0, src_cols):
    for j in range(0, src_rows):
        rspnt = pixel2coord(i, j)
        list.append(rspnt)
with open(os.path.join(path, file), 'wb') as handle:
    pickle.dump(list, handle)
with open(os.path.join(path, file), 'rb') as handle:
    lst = pickle.loads(handle.read())
But I received this error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/tmp/tmp4rPKQ_.py", line 70, in <module>
pickle.dump(pntRstList, handle)
File "/usr/lib/python2.7/pickle.py", line 1376, in dump
Pickler(file, protocol).dump(obj)
File "/usr/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 606, in save_list
self._batch_appends(iter(obj))
File "/usr/lib/python2.7/pickle.py", line 621, in _batch_appends
save(x)
File "/usr/lib/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
File "/usr/lib/python2.7/copy_reg.py", line 71, in _reduce_ex
state = base(self)
TypeError: the sip.wrapper type cannot be instantiated or sub-classed
Is there any way to write the list of xy coordinates to a text file and read it back as numbers rather than strings?
The easiest would be to forgo the use of QgsPoint(xp, yp) and use tuples instead, i.e. just (xp, yp). QgsPoint appears to be a SIP wrapper for a C++ class, and SIP wrappers do not know how to be pickled.
Notice also that pyqgis documentation says this:
Note
The tuples (x,y) are not real tuples, they are QgsPoint objects, the values are accessible with x() and y() methods.
They just look like tuples but they're nothing like tuples; you cannot even access the individual coordinates with t[0].
That said, you can convert a list of such points into a list of tuples easily with
lst = [(p.x(), p.y()) for p in lst]
pickle.dump(lst, handle)
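The conversion above can be tried without QGIS. The sketch below is only an illustration: FakePoint is a hypothetical stand-in for QgsPoint that exposes x() and y() the same way, and the file is written to a temporary directory rather than the asker's path.

```python
import os
import pickle
import tempfile


class FakePoint(object):
    """Hypothetical stand-in for QgsPoint: coordinates via x() and y()."""

    def __init__(self, x, y):
        self._x = x
        self._y = y

    def x(self):
        return self._x

    def y(self):
        return self._y


points = [FakePoint(12.5, 25.0), FakePoint(65.0, 96.5)]

# Convert the point objects to plain tuples of floats before pickling.
coords = [(p.x(), p.y()) for p in points]

path = os.path.join(tempfile.mkdtemp(), "List.txt")
with open(path, "wb") as handle:
    pickle.dump(coords, handle)

# Reading back gives numbers, not strings.
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored)  # [(12.5, 25.0), (65.0, 96.5)]
```

Note that a pickle file is a binary format despite the .txt name. If the file has to be readable in a text editor, the json or csv modules from the standard library are alternatives.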

Invalid literal for float in k nearest neighbor

I am having the hardest time figuring out why I am getting this error. I have searched a lot but have been unable to find a solution.
import numpy as np
import warnings
from collections import Counter
import pandas as pd
def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result
df = pd.read_csv("data.txt")
df.replace('?',-99999, inplace=True)
df.drop(['id'], 1, inplace=True)
full_data = df.astype(float).values.tolist()
print(full_data)
After running it, I get this error:
Traceback (most recent call last):
File "E:\Jazab\Machine Learning\Lec18(Testing K Neatest Nerighbors Classifier)\Lec18(Testing K Neatest Nerighbors Classifier)\Lec18_Testing_K_Neatest_Nerighbors_Classifier_.py", line 25, in <module>
full_data = df.astype(float).values.tolist()
File "C:\Python27\lib\site-packages\pandas\util\_decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 3299, in astype
**kwargs)
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 3224, in astype
return self.apply('astype', dtype=dtype, **kwargs)
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 3091, in apply
applied = getattr(b, f)(**kwargs)
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 471, in astype
**kwargs)
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 521, in _astype
values = astype_nansafe(values.ravel(), dtype, copy=True)
File "C:\Python27\lib\site-packages\pandas\core\dtypes\cast.py", line 636, in astype_nansafe
return arr.astype(dtype)
ValueError: invalid literal for float(): 3) <-----Reappears in Group 8 as:
Press any key to continue . . .
If I remove astype(float), the program runs fine.
What do I need to do?
There is bad data in the file (the value 3)), so you need to_numeric, applied with apply because it has to process all columns.
Non-numeric values are converted to NaN, which fillna then replaces with some scalar, e.g. 0:
full_data = df.apply(pd.to_numeric, errors='coerce').fillna(0).values.tolist()
Sample:
df = pd.DataFrame({'A':[1,2,7], 'B':['3)',4,5]})
print (df)
A B
0 1 3)
1 2 4
2 7 5
full_data = df.apply(pd.to_numeric, errors='coerce').fillna(0).values.tolist()
print (full_data)
[[1.0, 0.0], [2.0, 4.0], [7.0, 5.0]]
It looks like you have 3) as an entry in your CSV file, and Pandas is complaining because it can't cast it to a float because of the ).
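Before coercing bad entries to NaN, it can help to see exactly which cells are the problem. The following is a small sketch on made-up data (the column names and values are assumptions, not the asker's file) that lists every cell which is present but does not parse as a number.

```python
import pandas as pd

# Hypothetical data with two non-numeric entries in column B.
df = pd.DataFrame({"A": ["1", "2", "7"], "B": ["3)", "4", "?"]})

# Values that cannot be parsed become NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")

# A cell is "bad" if it had a value originally but is NaN after coercion.
bad_mask = numeric.isna() & df.notna()

bad_cells = [
    (row, col, df.at[row, col])
    for col in df.columns
    for row in df.index[bad_mask[col].to_numpy()]
]
print(bad_cells)  # [(0, 'B', '3)'), (2, 'B', '?')]
```

Once the offending rows are known, they can be corrected in the source file or dropped instead of being silently replaced by 0.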

chi squared selectKbest bad input shape error

I'm a little new to scikit and ML. I'm trying to train an Adaboost classifier for one vs Rest classification. I'm using the following code
# To Read Training data set
test = pd.read_csv("train.csv", header=0, delimiter=",", \
quoting=1, error_bad_lines=False)
num_reviews = len(test["text"])
clean_train_reviews = []
catlist=[]
for i in xrange(0,num_reviews):
    data=processText(test["text"][i])
    data1=test["category"][i]
    clean_train_reviews.append(data)
    catlist.append(data1.split('.'))
# To read test dataset
test = pd.read_csv("test.csv", header=0, delimiter=",", \
quoting=1, error_bad_lines=False)
num_reviews = len(test["text"])
clean_test_reviews = []
for i in xrange(0,num_reviews):
    data=processText(test["text"][i])
    clean_test_reviews.append(data)
X_test=np.array(clean_test_reviews)
lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(catlist)
classifier = Pipeline([
('vectorizer', CountVectorizer(ngram_range=(1,2), max_features=1500,min_df=4)),
('tfidf', TfidfTransformer()),
('chi2', SelectKBest(chi2, k=200)),
('clf', OneVsRestClassifier(AdaBoostClassifier()))])
classifier.fit(clean_train_reviews, Y)
predicted = classifier.predict(X_test)
I use a pipeline in which the text is passed in as clean_train_reviews and Y holds the classes (multi-label, N = 10). Textual features are extracted in the pipeline with CountVectorizer and TfidfTransformer and selected using the chi-squared feature selection method. Fitting the pipeline gives: ValueError: bad input shape (1000, 10)
File "<ipython-input-10-9dbc8b18e6b8>", line 1, in <module>
runfile('C:/Users/Administrator/Desktop/nincymiss/adaboost.py', wdir='C:/Users/Administrator/Desktop/nincymiss')
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 601, in runfile
execfile(filename, namespace)
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 66, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/Administrator/Desktop/nincymiss/adaboost.py", line 179, in <module>
classifier.fit(clean_train_reviews, Y)
File "C:\Python27\lib\site-packages\sklearn\pipeline.py", line 164, in fit
Xt, fit_params = self._pre_transform(X, y, **fit_params)
File "C:\Python27\lib\site-packages\sklearn\pipeline.py", line 145, in _pre_transform
Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
File "C:\Python27\lib\site-packages\sklearn\base.py", line 458, in fit_transform
return self.fit(X, y, **fit_params).transform(X)
File "C:\Python27\lib\site-packages\sklearn\feature_selection\univariate_selection.py", line 322, in fit
X, y = check_X_y(X, y, ['csr', 'csc'])
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 515, in check_X_y
y = column_or_1d(y, warn=True)
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 551, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (1000, 10)
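This is because feature selection does not work as you'd expect for multilabel problems: as the traceback shows, SelectKBest expects a 1D target but receives the full (1000, 10) label matrix. You can try the following, which wraps the whole pipeline in OneVsRestClassifier so that the 'best' features are selected for each label separately.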
classifier = Pipeline([
('vectorizer', CountVectorizer(ngram_range=(1,2), max_features=1500, min_df=4)),
('tfidf', TfidfTransformer()),
('chi2', SelectKBest(chi2, k=200)),
('clf', AdaBoostClassifier())])
clf = OneVsRestClassifier(classifier)
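To illustrate what "for each label separately" means, the sketch below runs SelectKBest with chi2 once per label column on a tiny made-up count matrix. X, Y and k=2 are assumptions chosen so the effect is visible; this is not the asker's data or pipeline.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical non-negative count features: 4 samples, 4 features.
X = np.array([
    [5, 0, 3, 0],
    [4, 0, 0, 3],
    [0, 6, 3, 0],
    [0, 5, 0, 3],
])

# Hypothetical multilabel indicator matrix: 4 samples, 2 labels.
Y = np.array([
    [1, 1],
    [1, 0],
    [0, 1],
    [0, 0],
])

selected = {}
for label in range(Y.shape[1]):
    # Each fit sees a 1D target: one column of Y.
    selector = SelectKBest(chi2, k=2).fit(X, Y[:, label])
    selected[label] = selector.get_support(indices=True).tolist()

print(selected)  # {0: [0, 1], 1: [2, 3]}
```

Each label ends up with its own set of selected features, which is what happens inside OneVsRestClassifier when the feature selection step is part of the wrapped pipeline.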

Reshape a 2D matrix into a 3D one? (x, y) -> (x/72, 72, y)

I have a text file from which I load the original matrix.
The text file has comments with # and it basically has multiple matrices of 77*44.
I would like to read this file and store each of these matrices separately.
import os
import sys
import numpy as np
from numpy import zeros, newaxis
import io
#read the txt file and store all values into a np.array
f = open('/path/to/akiyo_cif.txt','r')
x = np.loadtxt(f,dtype=np.uint8,comments='#',delimiter='\t')
nof = x.shape[0]/72
print ("shape after reading the file is "+ str(x.shape) )
#example program that works
newmat =np.zeros((nof+1,72,44))
for i in range(0,nof+1):
    newmat[i,:,:]= x[i*72 : (i*72)+72 , :]
print ("Shape after resizing the file is "+ str(newmat.shape) )
Output:
shape after reading the file is (21240, 44)
Shape after resizing the file is (274, 72, 44)
If I run this
newmat=x.reshape((nof,72,44))
newmat = x.reshape((nof,72,44))
ValueError: total size of new array must be unchanged
I would like to resize this matrix to (21240/72,72,44).
Where the first 77 lines corresponds to newmat[0,:,:] and the next 77 lines to newmat[1,:,:].
Use x.reshape(-1, 72, 44):
In [146]: x = np.loadtxt('data' ,dtype=np.uint8, comments='#', delimiter='\t')
In [147]: x = x.reshape(-1, 72, 44)
In [148]: x.shape
Out[148]: (34, 72, 44)
When you specify one of the dimensions as -1, np.reshape replaces the -1 with a value inferred from the length of the array and the remaining dimensions.
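Note that reshape(-1, 72, 44) still requires the number of rows to be an exact multiple of 72; otherwise NumPy raises a ValueError because the total size would change. The sketch below uses a made-up array with a few stray rows at the end and trims them before reshaping. Whether dropping the incomplete block is appropriate for the real file is an assumption.

```python
import numpy as np

rows_per_block = 72
n_cols = 44

# Hypothetical stand-in for the loaded file: 3 full blocks plus 10 stray rows.
x = np.arange((3 * rows_per_block + 10) * n_cols).reshape(-1, n_cols)

# Keep only the rows that form complete blocks of 72.
n_blocks = x.shape[0] // rows_per_block
trimmed = x[: n_blocks * rows_per_block]

# -1 lets NumPy infer the number of blocks from the remaining dimensions.
blocks = trimmed.reshape(-1, rows_per_block, n_cols)
print(blocks.shape)  # (3, 72, 44)

# Block i holds rows i*72 .. i*72+71 of the original array.
print(np.array_equal(blocks[1], x[72:144]))  # True
```

If the row count is already a multiple of 72, the trimming step leaves the array unchanged.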