python - Error with Mariana/Theano neural network - python-2.7

I am facing a problem when I start my trainer and I can't figure out the cause.
My input data is of dimension 42 and my output should be one value out of 4.
This is the shape of my training and test set:
Training set: input = (1152, 42) target = (1152,)
Test set: input = (384, 42) target = (384,)
This is the construction of my network:
ls = MS.GradientDescent(lr=0.01)
cost = MC.CrossEntropy()
i = ML.Input(42, name='inp')
h = ML.Hidden(23, activation=MA.Sigmoid(), initializations=[MI.GlorotTanhInit()], name="hid")
o = ML.SoftmaxClassifier(4, learningScenario=ls, costObject=cost, name="out")
mlp = i > h > o
And this is the construction of the datasets, trainers and recorders:
trainData = MDM.RandomSeries(distances = train_set[0], next_state = train_set[1])
trainMaps = MDM.DatasetMapper()
trainMaps.mapInput(i, trainData.distances)
trainMaps.mapOutput(o, trainData.next_state)
testData = MDM.RandomSeries(distances = test_set[0], next_state = test_set[1])
testMaps = MDM.DatasetMapper()
testMaps.mapInput(i, testData.distances)
testMaps.mapOutput(o, testData.next_state)
earlyStop = MSTOP.GeometricEarlyStopping(testMaps, patience=100, patienceIncreaseFactor=1.1, significantImprovement=0.00001, outputFunction="score", outputLayer=o)
epochWall = MSTOP.EpochWall(1000)
trainer = MT.DefaultTrainer(
    trainMaps=trainMaps,
    testMaps=testMaps,
    validationMaps=None,
    stopCriteria=[earlyStop, epochWall],
    testFunctionName="testAndAccuracy",
    trainMiniBatchSize=MT.DefaultTrainer.ALL_SET,
    saveIfMurdered=False
)
recorder = MREC.GGPlot2("MLP", whenToSave = [MREC.SaveMin("test", o.name, "score")], printRate=1, writeRate=1)
trainer.start("MLP", mlp, recorder = recorder)
But the following error is being produced:
Traceback (most recent call last):
File "nn-mariana.py", line 82, in <module>
trainer.start("MLP", mlp, recorder = recorder)
File "SUPRESSED/Mariana/Mariana/training/trainers.py", line 226, in start
Trainer_ABC.start( self, runName, model, recorder, trainingOrder, moreHyperParameters )
File "SUPRESSED/Mariana/Mariana/training/trainers.py", line 110, in start
return self.run(runName, model, recorder, *args, **kwargs)
File "SUPRESSED/Mariana/Mariana/training/trainers.py", line 410, in run
outputLayers
File "SUPRESSED/Mariana/Mariana/training/trainers.py", line 269, in _trainTest
res = modelFct(output, **kwargs)
File "SUPRESSED/Mariana/Mariana/network.py", line 47, in __call__
return self.callTheanoFct(outputLayer, **kwargs)
File "SUPRESSED/Mariana/Mariana/network.py", line 44, in callTheanoFct
return self.outputFcts[ol](**kwargs)
File "SUPRESSED/Mariana/Mariana/wrappers.py", line 110, in __call__
return self.run(**kwargs)
File "SUPRESSED/Mariana/Mariana/wrappers.py", line 102, in run
fres = iter(self.theano_fct(*self.fctInputs.values()))
File "SUPRESSED/Theano/theano/compile/function_module.py", line 871, in __call__
storage_map=getattr(self.fn, 'storage_map', None))
File "SUPRESSED/Theano/theano/gof/link.py", line 314, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "SUPRESSED/Theano/theano/compile/function_module.py", line 859, in __call__
outputs = self.fn()
ValueError: Input dimension mis-match. (input[0].shape[1] = 1152, input[1].shape[1] = 4)
Apply node that caused the error: Elemwise{Composite{((i0 * i1) + (i2 * log(i3)))}}[(0, 1)](InplaceDimShuffle{x,0}.0, LogSoftmax.0, Elemwise{sub,no_inplace}.0, Elemwise{sub,no_inplace}.0)
Toposort index: 18
Inputs types: [TensorType(int32, row), TensorType(float64, matrix), TensorType(int32, row), TensorType(float64, matrix)]
Inputs shapes: [(1, 1152), (1152, 4), (1, 1152), (1152, 4)]
Inputs strides: [(4608, 4), (32, 8), (4608, 4), (32, 8)]
Inputs values: ['not shown', 'not shown', 'not shown', 'not shown']
Outputs clients: [[Sum{axis=[1], acc_dtype=float64}(Elemwise{Composite{((i0 * i1) + (i2 * log(i3)))}}[(0, 1)].0)]]
Versions:
Mariana (1.0.1rc1, /media/guilhermevrs/Data/Documentos/Academico/TCC-code/Mariana)
Theano (0.8.0.dev0, SUPRESSED/Theano)
This code was produced having as base the tutorial code from the mnist example.
Could you please help me to figure out what's going on?
Thank you in advance

I talked directly to the authors of Mariana, and the cause and solution are explained in this issue
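The linked issue is not reproduced here, but the shapes in the traceback already show where the mismatch comes from: the failing node has the form t * log(p) + (1 - t) * log(1 - p), with the integer targets shaped (1, 1152) and the softmax output shaped (1152, 4). The following is a minimal NumPy sketch of that situation, not Mariana code; the data is random and the names are illustrative only. It shows the shapes failing to broadcast, and agreeing once the targets are one-hot encoded. Whether one-hot targets or a different cost object is what the authors recommended is an assumption.

```python
import numpy as np

n_samples, n_classes = 1152, 4
rng = np.random.RandomState(0)

# Integer class labels, shaped like the target array in the question: (1152,)
targets = rng.randint(0, n_classes, size=n_samples)

# Stand-in for the softmax output of the network: (1152, 4)
logits = rng.randn(n_samples, n_classes)
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

def cross_entropy(t, p):
    # Same element-wise form as the failing Theano node
    return -(t * np.log(p) + (1 - t) * np.log(1 - p)).sum(axis=1)

# Targets as a (1, 1152) row cannot be combined with a (1152, 4) matrix
try:
    cross_entropy(targets.reshape(1, -1), probs)
    mismatch = False
except ValueError:
    mismatch = True

# One-hot targets have the same shape as the output, so the expression works
one_hot = np.eye(n_classes)[targets]
losses = cross_entropy(one_hot, probs)
```

Here mismatch ends up True, while losses has one value per sample.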


Adding outputs of two layers in keras

I have an issue that seems to have no straight forward solution in Keras.
My server runs on ubuntu 14.04, keras with backend tensorflow.
Here's the issue:
I have two input tensors of the shape: Input(shape=(30,125,1)), each of them is fed to a cascade of three layers below:
CNN1 = Conv2D(filters = 8, kernel_size = (1,64) , padding = "same" , activation = "relu" )
CNN2 = Conv2D(filters = 8, kernel_size = (8,1) , padding = "same" , activation = "relu" )
pool = MaxPooling2D((2, 2))
Each of the resulting output tensors, one per input, has shape (None, 15, 62, 8). Now I want to add the two (15,62) matrices element-wise for each filter, so that the output again has shape (None, 15, 62, 8).
I tried with the following lines of code using Lambda layer but it throws an error.
from keras import backend as K
from keras.layers import Lambda
def myadd(x):
    increment = x[1]
    result = K.update_add(x[0], increment)
    return result
in_1 = Input(shape=(30,125,1))
in_1CNN1 = CNN1(in_1)
in_1CNN2 = CNN2(in_1CNN1)
in_1pool = pool(in_1CNN2)
in_2 = Input(shape=(30,125,1))
in_2CNN1 = CNN1(in_2)
in_2CNN2 = CNN2(in_2CNN1)
in_2pool = pool(in_2CNN2)
y1 =y1.astype(np.float32) # an input regression label array of shape (numsamples,1) loaded from a mat file
out1 = Lambda(myadd, output_shape=(None, 15, 62, 8))([in_1pool,in_2pool])
a= keras.layers.Flatten()(out1)
pre1 = Dense(1000, activation='sigmoid')(a)
pre2 =Dropout(0.2)(pre1)
predictions = Dense(1, activation='sigmoid')(pre2)
model = Model(inputs=[in_1,in_2], outputs=predictions)
model.compile(optimizer='sgd',loss='mean_squared_error')
model.fit([inputdata1,inputdata2], y1, epochs=20, validation_split=0.5)
#inputdata1, inputdata2 are arrays loaded from a mat file and are each of shape (5169, 30, 125, 1)
The error is highlighted below:
Traceback (most recent call last):
File "keras_workshop/keras_multipleinputs_multiple CNN.py", line 225, in <module>
out1 = Lambda(myadd, output_shape=(None, 15, 62, 8))([in_1pool,in_2pool])
File "/home/tharun/anaconda2/lib/python2.7/site-packages/keras/engine/topology.py", line 603, in __call__
output = self.call(inputs, **kwargs)
File "/home/tharun/anaconda2/lib/python2.7/site-packages/keras/layers/core.py", line 651, in call
return self.function(inputs, **arguments)
File "keras_workshop/keras_multipleinputs_multiple CNN.py", line 75, in myadd
result = K.update_add(x[0], increment)
File "/home/tharun/anaconda2/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 958, in update_add
return tf.assign_add(x, increment)
File "/home/tharun/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/state_ops.py", line 245, in assign_add
return ref.assign_add(value)
AttributeError: 'Tensor' object has no attribute 'assign_add'
Try the Add() layer or the add() function that Keras provides.
Add
keras.layers.Add()
Layer that adds a list of inputs.
It takes as input a list of tensors, all of the same shape, and returns a single tensor (also of the same shape).
add
keras.layers.add(inputs)
Functional interface to the Add layer.
Arguments
inputs: A list of input tensors (at least 2).
**kwargs: Standard layer keyword arguments.
Returns
A tensor, the sum of the inputs.
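The traceback explains why the Lambda approach fails: K.update_add maps to tf.assign_add, which updates a variable in place, and the pooled layer outputs are plain tensors, not variables. What the question asks for is an ordinary element-wise sum. In the model above that would presumably mean replacing the Lambda line with out1 = keras.layers.add([in_1pool, in_2pool]). The NumPy sketch below only illustrates the arithmetic that such an addition performs on two arrays of the shape from the question; the random data and the sample count are made up for the illustration.

```python
import numpy as np

n_samples = 3  # arbitrary batch size for the illustration

# Stand-ins for the two pooled branch outputs, each (n_samples, 15, 62, 8)
branch_1 = np.random.RandomState(0).rand(n_samples, 15, 62, 8)
branch_2 = np.random.RandomState(1).rand(n_samples, 15, 62, 8)

# Element-wise sum: every (15, 62) map of filter k in branch_1 is added
# to the (15, 62) map of filter k in branch_2
merged = branch_1 + branch_2
```

The result keeps the shape (n_samples, 15, 62, 8), so it can be flattened and passed to the Dense layers as before.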

Python-pptx line chart error

Newb here just getting into Python and ran into an issue that's beating me down. I have the following excerpt of Python code to create a PPT slide from an existing template. The layout and placeholders are correct but I can't get it to run with my data listed below (x, y_in, & y_out). Any help is greatly appreciated.
x = [datetime.datetime(2017, 8, 4, 15, 5, tzinfo=<FixedOffset u'+00:00' datetime.timedelta(0)>), datetime.datetime(2017, 8, 4, 15, 10, tzinfo=<FixedOffset u'+00:00' datetime.timedelta(0)>), datetime.datetime(2017, 8, 4, 15, 15, tzinfo=<FixedOffset u'+00:00' datetime.timedelta(0)>), datetime.datetime(2017, 8, 4, 15, 20, tzinfo=<FixedOffset u'+00:00' datetime.timedelta(0)>)]
y_in = [780993, 538962, 730180, 1135936]
y_out = [5631489, 6774738, 6485944, 6611580]
prs = Presentation('Network_Utilization_template_master.pptx')
slide = prs.slides.add_slide(prs.slide_layouts[2])
placeholder = slide.placeholders[17]
chart_data = CategoryChartData()
chart_data.categories = x
chart_data.add_series(y_in)
chart_data.add_series(y_out)
graphic_frame = placeholder.insert_chart(XL_CHART_TYPE.LINE, chart_data)
chart = graphic_frame.chart
chart.has_legend = True
chart.legend.include_in_layout = True
chart.series[0-2].smooth = True
prs.save("Network_Utilization_" + today_s + ".pptx")
The interpreter spits out the following:
Traceback (most recent call last):
File "/Users/jemorey/Documents/pptx-2.py", line 81, in <module>
graphic_frame = placeholder.insert_chart(XL_CHART_TYPE.LINE, chart_data)
File "/Users/jemorey/Library/Python/2.7/lib/python/site-packages/pptx/shapes/placeholder.py", line 291, in insert_chart
rId = self.part.add_chart_part(chart_type, chart_data)
File "/Users/jemorey/Library/Python/2.7/lib/python/site-packages/pptx/parts/slide.py", line 174, in add_chart_part
chart_part = ChartPart.new(chart_type, chart_data, self.package)
File "/Users/jemorey/Library/Python/2.7/lib/python/site-packages/pptx/parts/chart.py", line 29, in new
chart_blob = chart_data.xml_bytes(chart_type)
File "/Users/jemorey/Library/Python/2.7/lib/python/site-packages/pptx/chart/data.py", line 104, in xml_bytes
return self._xml(chart_type).encode('utf-8')
File "/Users/jemorey/Library/Python/2.7/lib/python/site-packages/pptx/chart/data.py", line 128, in _xml
return ChartXmlWriter(chart_type, self).xml
File "/Users/jemorey/Library/Python/2.7/lib/python/site-packages/pptx/chart/xmlwriter.py", line 803, in xml
'ser_xml': self._ser_xml,
File "/Users/jemorey/Library/Python/2.7/lib/python/site-packages/pptx/chart/xmlwriter.py", line 902, in _ser_xml
'tx_xml': xml_writer.tx_xml,
File "/Users/jemorey/Library/Python/2.7/lib/python/site-packages/pptx/chart/xmlwriter.py", line 191, in tx_xml
'series_name': self.name,
File "/Users/jemorey/Library/Python/2.7/lib/python/site-packages/pptx/chart/xmlwriter.py", line 121, in name
return escape(self._series.name)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/saxutils.py", line 32, in escape
data = data.replace("&", "&amp;")
AttributeError: 'list' object has no attribute 'replace'
David Zemens is quite right in his comment. A series has a name, which appears as the first argument to ChartData.add_series(). The name appears in the legend next to the line color for that series and also appears as the column heading for the data for that series. Adding that in should get you to your next step.
Something like:
chart_data.add_series('MB in', y_in)
chart_data.add_series('MB out', y_out)
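The traceback ends inside xml.sax.saxutils.escape(), which python-pptx calls on the series name. Because add_series(y_in) passes the list of values in the position where the name belongs, escape() receives a list. A minimal standard-library sketch of that failure mode (the helper describe_name is hypothetical and exists only for this illustration):

```python
from xml.sax.saxutils import escape

def describe_name(series_name):
    """Return the escaped name, or the error text if it cannot be escaped."""
    try:
        return escape(series_name)
    except AttributeError as exc:
        return "AttributeError: %s" % exc

# A string name is escaped as expected
good = describe_name("MB in & out")

# A list of values in the name position reproduces the error from the traceback
bad = describe_name([780993, 538962, 730180, 1135936])
```

With a string name the call succeeds; with the list it reports that a list has no attribute 'replace', which is the same error the question shows.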

Unable to load custom initializer from the saved model, passing custom_objects is not working

I saved a model and its weights in Keras and then tried to load them, but loading fails with "Invalid initialization: my_init". How can I fix the problem?
model = Sequential()
def my_init(shape, name=None):
    return initializations.normal(shape, scale=0.1, name=name)
def m6_1():
    model.add(Convolution2D(32, 3, 3, init=my_init))
    model.add(Activation('relu'))
    model.add(Convolution2D(32, 3, 3, init=my_init))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.5))
    model.add(Flatten())
    model.add(Dense(256, init=my_init))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(nb_classes))
    model.add(Activation('softmax'))
Save model and weights:
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
model.save_weights("model.h5")
Load model and weights:
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json,custom_objects={'my_init':my_init})
loaded_model.load_weights("model.h5")
Error message:
Traceback (most recent call last):
File "revised_learn_ETL6_load_model.py", line 73, in <module>
loaded_model = model_from_json(loaded_model_json,custom_objects={"my_init": my_init})
File "/home/ubuntu/.env/local/lib/python2.7/site-packages/keras/models.py", line 197, in model_from_json
return layer_from_config(config, custom_objects=custom_objects)
File "/home/ubuntu/.env/local/lib/python2.7/site-packages/keras/utils/layer_utils.py", line 36, in layer_from_config
return layer_class.from_config(config['config'])
File "/home/ubuntu/.env/local/lib/python2.7/site-packages/keras/models.py", line 1019, in from_config
layer = get_or_create_layer(first_layer)
File "/home/ubuntu/.env/local/lib/python2.7/site-packages/keras/models.py", line 1003, in get_or_create_layer
layer = layer_from_config(layer_data)
File "/home/ubuntu/.env/local/lib/python2.7/site-packages/keras/utils/layer_utils.py", line 36, in layer_from_config
return layer_class.from_config(config['config'])
File "/home/ubuntu/.env/local/lib/python2.7/site-packages/keras/engine/topology.py", line 929, in from_config
return cls(**config)
File "/home/ubuntu/.env/local/lib/python2.7/site-packages/keras/layers/convolutional.py", line 381, in __init__
self.init = initializations.get(init, dim_ordering=dim_ordering)
File "/home/ubuntu/.env/local/lib/python2.7/site-packages/keras/initializations.py", line 107, in get
'initialization', kwargs=kwargs)
File "/home/ubuntu/.env/local/lib/python2.7/site-packages/keras/utils/generic_utils.py", line 16, in get_from_module
str(identifier))
Exception: Invalid initialization: my_init
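One detail in the traceback is worth noting: model_from_json forwards custom_objects to the outer layer_from_config call (models.py, line 197), but the inner call in get_or_create_layer is layer_from_config(layer_data), with no custom_objects argument. By the time initializations.get runs, only the string 'my_init' is available. The sketch below is a hypothetical, pure-Python model of that kind of name lookup. It is not Keras code, and every name in it is invented for the illustration.

```python
# Hypothetical registry of built-in initializers, keyed by name
BUILTIN_INITS = {"normal": lambda shape: ("normal", shape)}

def resolve_init(identifier, custom_objects=None):
    """Look the name up in custom_objects first, then in the built-ins."""
    custom_objects = custom_objects or {}
    if identifier in custom_objects:
        return custom_objects[identifier]
    if identifier in BUILTIN_INITS:
        return BUILTIN_INITS[identifier]
    raise ValueError("Invalid initialization: " + str(identifier))

def layer_from_config(config, custom_objects=None):
    return resolve_init(config["init"], custom_objects)

def model_from_config(layer_configs, custom_objects=None, forward=True):
    # forward=False mimics a nested call that drops custom_objects
    passed = custom_objects if forward else None
    return [layer_from_config(c, passed) for c in layer_configs]

def my_init(shape):
    return ("my_init", shape)

configs = [{"init": "my_init"}]

# Lookup succeeds when custom_objects reaches the resolver
resolved = model_from_config(configs, {"my_init": my_init}, forward=True)

# Lookup fails with the same message when it is dropped on the way
try:
    model_from_config(configs, {"my_init": my_init}, forward=False)
    failed = False
except ValueError as exc:
    failed = True
    message = str(exc)
```

If that reading of the traceback is right, one possible workaround (not confirmed by this thread) is to rebuild the architecture in code, as in m6_1(), and call load_weights("model.h5") on it, which avoids resolving my_init by name.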

RNN regression using Tensorflow?

I am currently trying to implement an RNN for regression.
I need to create a neural network capable of converting audio samples into a vector of MFCC features. I already know what the features for each audio sample are, so the task itself is to create a neural network that is capable of converting a list of audio samples into the desired MFCC features.
The second problem I am facing is that the audio files I am sampling have different lengths, so the lists of audio samples also have different lengths, which causes a problem with the number of inputs I need to feed into the neural network. I found this post on how to handle variable sequence length and tried to incorporate it into my implementation of an RNN, but I keep getting a lot of errors for reasons I can't explain.
Could anyone see what is going wrong with my implementation?
Here is the code:
def length(sequence): ##Zero padding to fit the max length... Question whether that is a good idea.
    used = tf.sign(tf.reduce_max(tf.abs(sequence), reduction_indices=2))
    length = tf.reduce_sum(used, reduction_indices=1)
    length = tf.cast(length, tf.int32)
    return length
def cost(output, target):
    # Compute cross entropy for each frame.
    cross_entropy = target * tf.log(output)
    cross_entropy = -tf.reduce_sum(cross_entropy, reduction_indices=2)
    mask = tf.sign(tf.reduce_max(tf.abs(target), reduction_indices=2))
    cross_entropy *= mask
    # Average over actual sequence lengths.
    cross_entropy = tf.reduce_sum(cross_entropy, reduction_indices=1)
    cross_entropy /= tf.reduce_sum(mask, reduction_indices=1)
    return tf.reduce_mean(cross_entropy)
def last_relevant(output):
    max_length = int(output.get_shape()[1])
    relevant = tf.reduce_sum(tf.mul(output, tf.expand_dims(tf.one_hot(length, max_length), -1)), 1)
    return relevant
files_train_path = [dnn_train+f for f in listdir(dnn_train) if isfile(join(dnn_train, f))]
files_test_path = [dnn_test+f for f in listdir(dnn_test) if isfile(join(dnn_test, f))]
files_train_name = [f for f in listdir(dnn_train) if isfile(join(dnn_train, f))]
files_test_name = [f for f in listdir(dnn_test) if isfile(join(dnn_test, f))]
os.chdir(dnn_train)
train_name,train_data = generate_list_of_names_data(files_train_path)
train_data, train_names, train_output_data, train_class_output = load_sound_files(files_train_path,train_name,train_data)
max_length = 0 ## Used for variable sequence input
for element in train_data:
    if element.size > max_length:
        max_length = element.size
NUM_EXAMPLES = len(train_data)/2
test_data = train_data[NUM_EXAMPLES:]
test_output = train_output_data[NUM_EXAMPLES:]
train_data = train_data[:NUM_EXAMPLES]
train_output = train_output_data[:NUM_EXAMPLES]
print("--- %s seconds ---" % (time.time() - start_time))
#----------------------------------------------------------------------#
#----------------------------Main--------------------------------------#
### Tensorflow neural network setup
batch_size = None
sequence_length_max = max_length
input_dimension=1
data = tf.placeholder(tf.float32,[batch_size,sequence_length_max,input_dimension])
target = tf.placeholder(tf.float32,[None,14])
num_hidden = 24 ## Hidden layer
cell = tf.nn.rnn_cell.LSTMCell(num_hidden,state_is_tuple=True) ## Long short term memory
output, state = tf.nn.dynamic_rnn(cell, data, dtype=tf.float32,sequence_length = length(data)) ## Creates the Rnn skeleton
last = last_relevant(output)#tf.gather(val, int(val.get_shape()[0]) - 1) ## Appedning as last
weight = tf.Variable(tf.truncated_normal([num_hidden, int(target.get_shape()[1])]))
bias = tf.Variable(tf.constant(0.1, shape=[target.get_shape()[1]]))
prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)
cross_entropy = cost(output,target)# How far am I from correct value?
optimizer = tf.train.AdamOptimizer() ## TensorflowOptimizer
minimize = optimizer.minimize(cross_entropy)
mistakes = tf.not_equal(tf.argmax(target, 1), tf.argmax(prediction, 1))
error = tf.reduce_mean(tf.cast(mistakes, tf.float32))
## Training ##
init_op = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init_op)
batch_size = 1000
no_of_batches = int(len(train_data)/batch_size)
epoch = 5000
for i in range(epoch):
    ptr = 0
    for j in range(no_of_batches):
        inp, out = train_data[ptr:ptr+batch_size], train_output[ptr:ptr+batch_size]
        ptr+=batch_size
        sess.run(minimize,{data: inp, target: out})
    print "Epoch - ",str(i)
incorrect = sess.run(error,{data: test_data, target: test_output})
print('Epoch {:2d} error {:3.1f}%'.format(i + 1, 100 * incorrect))
sess.close()
Error message:
Traceback (most recent call last):
File "tensorflow_test.py", line 177, in <module>
last = last_relevant(output)#tf.gather(val, int(val.get_shape()[0]) - 1) ## Appedning as last
File "tensorflow_test.py", line 132, in last_relevant
relevant = tf.reduce_sum(tf.mul(output, tf.expand_dims(tf.one_hot(length, max_length), -1)), 1)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 2778, in one_hot
name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 1413, in _one_hot
axis=axis, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 454, in apply_op
as_ref=input_arg.is_ref)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 621, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 180, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 163, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_util.py", line 421, in make_tensor_proto
tensor_proto.string_val.extend([compat.as_bytes(x) for x in proto_values])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/compat.py", line 45, in as_bytes
(bytes_or_text,))
TypeError: Expected binary or unicode string, got <function length at 0x7f51a7a3ede8>
Edit:
Changing it to tf.one_hot(length(output), max_length) gives me this error message:
Traceback (most recent call last):
File "tensorflow_test.py", line 184, in <module>
cross_entropy = cost(output,target)# How far am I from correct value?
File "tensorflow_test.py", line 121, in cost
cross_entropy = target * tf.log(output)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 754, in binary_op_wrapper
return func(x, y, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 903, in _mul_dispatch
return gen_math_ops.mul(x, y, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1427, in mul
result = _op_def_lib.apply_op("Mul", x=x, y=y, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2312, in create_op
set_shapes_for_outputs(ret)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1704, in set_shapes_for_outputs
shapes = shape_func(op)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1801, in _BroadcastShape
% (shape_x, shape_y))
ValueError: Incompatible shapes for broadcasting: (?, 14) and (?, 138915, 24)
In tf.one_hot(length, ...), length is a function, not a tensor. Try length(something) instead.
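To make the intent of length() and last_relevant() concrete, here is a small NumPy sketch of the same idea: compute the number of non-padded frames per sequence, then use that tensor (not the function object) to build the one-hot mask. It is an illustration under assumptions, not a drop-in fix for the TensorFlow code. In particular, the sketch indexes with lengths - 1 because positions are zero-based; that offset is an assumption about the intended behaviour and is not stated in the answer.

```python
import numpy as np

def sequence_lengths(batch):
    """Count non-zero frames per sequence in a zero-padded batch
    shaped (batch, max_time, features)."""
    used = np.sign(np.abs(batch).max(axis=2))
    return used.sum(axis=1).astype(np.int32)

def last_relevant(output, lengths):
    """Pick output[b, lengths[b] - 1] for every sequence b."""
    max_time = output.shape[1]
    # One-hot mask over the time axis, shape (batch, max_time)
    mask = np.eye(max_time)[lengths - 1]
    return (output * mask[:, :, None]).sum(axis=1)

# Two zero-padded sequences with 3 and 1 real frames
batch = np.zeros((2, 4, 1), dtype=np.float32)
batch[0, :3, 0] = [0.5, -0.2, 0.1]
batch[1, :1, 0] = [0.9]

lengths = sequence_lengths(batch)

# Stand-in for the RNN output, shape (batch, max_time, num_hidden)
output = np.arange(2 * 4 * 2, dtype=np.float32).reshape(2, 4, 2)
picked = last_relevant(output, lengths)
```

The important point is that the mask is built from the result of calling the length computation on the data, which is what the answer suggests.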

chi squared selectKbest bad input shape error

I'm a little new to scikit and ML. I'm trying to train an Adaboost classifier for one vs Rest classification. I'm using the following code
# To Read Training data set
test = pd.read_csv("train.csv", header=0, delimiter=",", \
                   quoting=1, error_bad_lines=False)
num_reviews = len(test["text"])
clean_train_reviews = []
catlist=[]
for i in xrange(0,num_reviews):
    data=processText(test["text"][i])
    data1=test["category"][i]
    clean_train_reviews.append(data)
    catlist.append(data1.split('.'))
# To read test dataset
test = pd.read_csv("test.csv", header=0, delimiter=",", \
                   quoting=1, error_bad_lines=False)
num_reviews = len(test["text"])
clean_test_reviews = []
for i in xrange(0,num_reviews):
    data=processText(test["text"][i])
    clean_test_reviews.append(data)
X_test=np.array(clean_test_reviews)
lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(catlist)
classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1,2), max_features=1500,min_df=4)),
    ('tfidf', TfidfTransformer()),
    ('chi2', SelectKBest(chi2, k=200)),
    ('clf', OneVsRestClassifier(AdaBoostClassifier()))])
classifier.fit(clean_train_reviews, Y)
predicted = classifier.predict(X_test)
I use a pipeline, where the text is passed in as clean_train_reviews and Y holds the classes (multi-label, N = 10). Textual features are extracted in the pipeline using CountVectorizer() and TfidfTransformer(), then selected using the chi-squared feature selection method. Fitting the AdaBoost classifiers gives:
ValueError: bad input shape (1000, 10)
File "<ipython-input-10-9dbc8b18e6b8>", line 1, in <module>
runfile('C:/Users/Administrator/Desktop/nincymiss/adaboost.py', wdir='C:/Users/Administrator/Desktop/nincymiss')
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 601, in runfile
execfile(filename, namespace)
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 66, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/Administrator/Desktop/nincymiss/adaboost.py", line 179, in <module>
classifier.fit(clean_train_reviews, Y)
File "C:\Python27\lib\site-packages\sklearn\pipeline.py", line 164, in fit
Xt, fit_params = self._pre_transform(X, y, **fit_params)
File "C:\Python27\lib\site-packages\sklearn\pipeline.py", line 145, in _pre_transform
Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
File "C:\Python27\lib\site-packages\sklearn\base.py", line 458, in fit_transform
return self.fit(X, y, **fit_params).transform(X)
File "C:\Python27\lib\site-packages\sklearn\feature_selection\univariate_selection.py", line 322, in fit
X, y = check_X_y(X, y, ['csr', 'csc'])
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 515, in check_X_y
y = column_or_1d(y, warn=True)
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 551, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (1000, 10)
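This is because feature selection does not work as you'd expect for multilabel problems: as the traceback shows, SelectKBest.fit passes y through column_or_1d, which requires a 1-D target, but Y here has shape (1000, 10). You can try the following, which wraps the whole pipeline in OneVsRestClassifier and so selects the 'best' features for each label separately.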
classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1,2), max_features=1500, min_df=4)),
    ('tfidf', TfidfTransformer()),
    ('chi2', SelectKBest(chi2, k=200)),
    ('clf', AdaBoostClassifier())])
clf = OneVsRestClassifier(classifier)
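With this arrangement, clf.fit(clean_train_reviews, Y) fits one copy of the pipeline per label column, so each SelectKBest step only ever sees a 1-D target. The sketch below illustrates the shape check behind the error using NumPy only. The column_or_1d function here is a simplified stand-in written for this illustration, not the scikit-learn implementation, and the all-zero Y is placeholder data with the shape from the question.

```python
import numpy as np

def column_or_1d(y):
    """Simplified stand-in for the validation step named in the traceback:
    accept 1-D arrays and single-column 2-D arrays, reject everything else."""
    y = np.asarray(y)
    if y.ndim == 1:
        return y
    if y.ndim == 2 and y.shape[1] == 1:
        return y.ravel()
    raise ValueError("bad input shape {0}".format(y.shape))

# Multilabel indicator matrix with the shape from the question
Y = np.zeros((1000, 10), dtype=int)

# Passing the full matrix reproduces the error message
try:
    column_or_1d(Y)
    rejected = False
except ValueError as exc:
    rejected = True
    message = str(exc)

# Passing one label column at a time, as a one-vs-rest wrapper does, is accepted
per_label_shapes = [column_or_1d(Y[:, j]).shape for j in range(Y.shape[1])]
```

The full matrix is rejected with "bad input shape (1000, 10)", while each of the 10 label columns passes the check on its own.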