I'm trying to classify some web posts using weka and naive bayes classifier.
First I manually classified many posts (about 100 negative and 100 positive) and I created an .arff file with this form:
#relation classtest
#attribute 'post' string
#attribute 'class' {positive,negative}
#data
'RT #burnreporter: Google has now indexed over 30 trillion URLs. Wow. #LeWeb',positive
'A special one for me Soundcloud at #LeWeb ',positive
'RT #dianaurban: Lost Internet for 1/2 hour at a conference called #LeWeb. Ironic, yes?',negative
.
.
.
Then I open Weka Explorer loading that file and applying the StringToWordVector filter to split the posts in single word attributes.
Then, after doing the same with my dataset, selecting (in classify tab of weka) naive bayes classifier and choosing select test set, it returns Train and test set are not compatible. What can I do? Thanks!
Probably the ordering of the attributes is different in train and test sets.
You can use batch filtering as described in http://weka.wikispaces.com/Batch+filtering
I used batch filter but still have problem. Here is what I did:
java -cp /usr/share/java/weka.jar weka.filters.unsupervised.attribute.NumericToNominal -R last -b -i trainData.arff -o trainDataProcessed.csv.arff -r testData.arff -s testDataProcessed.csv.arff
I then get the error below:
Input file formats differ.
Later.I figured out two ways to make the trained model working on supplied test set.
Method 1.
Use knowledge flow. For example something like below: CSVLoader(for train set) -> classAssigner -> TrainingSetMaker -->(classifier of your choice) -> ClassfierPerformanceEvaluator - TextViewer. CSVLoader(for test set) -> classAssigner -> TestgSetMaker -->(the same classifier instance above) -> PredictionAppender -> CSVSaver. Then load the data from the CSVLoader or arffLoder for the training set. The model will be trained. After that load data from the loader for the test set. It will evaluate the model(classifier, for example) on the supplied test set and you can see the result from the textviewer (connected to the ClassifierPerformanceEvaluator) and get the saved result from the CSVSaver or arffSaver connected to the PredictionAppender.An additional column, the "classfied as" will be added to the output file. In my case, I used "?" for the class column in the supplied test set if the class labels are not available.
Method 2.
Combine the Training and Test set into one file. Then the exact same filter can be applied to both training and test set. Then you can separate training set and test set by applying instance filter. Since I use "?" as class label in the test set. It is not visible in the instance filter indices. Hence just select those indices that you can see in the attribute values to be removed when apply the instance filter. You will get the test data left only. Save it and load it in supply test set at the classifier page.This time it will work. I guess it is the class attribute that causes the NOT compatible train and test set issue. As many classfier requires nominal class attribute. The value of which is converted to the index to available values of the class attribute according to http://weka.wikispaces.com/Why+do+I+get+the+error+message+%27training+and+test+set+are+not+compatible%27%3F
Related
I have two datasets regarding whether a sentence contains a mention of a drug adverse event or not, both the training and test set have only two fields the text and the labels{Adverse Event, No Adverse Event} I have used weka with the stringtoWordVector filter to build a model using Random Forest on the training set.
I want to test the model built with removing the class labels from the test data set, applying the StringToWordVector filter on it and testing the model with it. When I try to do that it gives me the error saying training and test set not compatible probably because the filter identifies a different set of attributes for the test dataset. How do I fix this and output the predictions for the test set.
The easiest way to do this for a one off test is not to pre-filter the training set, but to use Weka's FilteredClassifier and configure it with the StringToWordVector filter, and your chosen classifier to do the classification. This is explained well in this video from the More Data Mining with Weka online course.
For a more general solution, if you want to build the model once then evaluate it on different test sets in future, you need to use InputMappedClassifier:
Wrapper classifier that addresses incompatible training and test data
by building a mapping between the training data that a classifier has
been built with and the incoming test instances' structure. Model
attributes that are not found in the incoming instances receive
missing values, so do incoming nominal attribute values that the
classifier has not seen before. A new classifier can be trained or an
existing one loaded from a file.
Weka requires a label even for the test data. It uses the labels or „ground truth“ of the test data to compare the result of the model against it and measure the model performance. How would you tell whether a model is performing well, if you don‘t know whether its predictions are right or wrong. Thus, the test data needs to have the very same structure as the training data in WEKA, including the labels. No worries, the labels are not used to help the model with its predictions.
The best way to go is to select cross validation (e.g. 10 fold cross validation) which automatically will split your data into 10 parts, using 9 for training and the remaining 1 for testing. This procedure is repeated 10 times so that each of the 10 parts has once been used as test data. The final performance verdict will be an average of all 10 rounds. Cross validation gives you a quite realistic estimate of the model performance on new, unseen data.
What you were trying to do, namely using the exact same data for training and testing is a bad idea, because the measured performance you end up with is way too optimistic. This means, you‘ll get very impressive figures like 98% accuracy during testing - but as soon as you use the model against new unseen data your accuracy might drop to a much worse level.
I am classifying iris data using DECISION TREE (C4.5), RANDOM FOREST and NAIVE BAYES. I am using the dataset downloaded from iris-train and iris-test. When I train the all networks everything is fine with proper results with 'classifier output', 'Detailed accuracy with class' and 'confusion matrix'. But, when I select the iris-test data in the Weka-explorer-classify-test options and select the iris-test file and in 'more options' select 'output prediction' as 'csv' and click start, I am getting the result as shown in the figure below. The 'classifier output' is showing the classified samples correctly, but, 'Detailed accuracy with class' and 'confusion matrix' is with all values zeros. Any suggestion where I am going wrong in selecting any option. Thank you.
The confusion matrix shows you how well your trained classifier performs by comparing the actual class of the instances in the test set with the class that was predicted by the classifier. But you are supplying a test set with no class information, so there's nothing to compare against. This is why you see
Total Number of Instances 0
Ignored Class Unknown Instances 120
in the output in your screenshot.
Typically you would first evaluate the performance of your classifier using cross-validation, or a test set that has class information. Then you can use the trained classifier to classify unknown data, for example using the Re-evaluate model on current test set right-click option as described in the help.
I use WEKA for Text classification , I have trained data set , and I apply StringToWOrdVector and NumericToNominal filters , and have test data set and applied the same filters on it .
When I try to apply my model on test data ,it gave me the following error
Train and test set are not compatible
I searched for a solution , the error occurred because number of attributes different between two sets, and it always be different because texts in two sets are different
How I can solve this error please ?
The best thing you can do is combine your training and test set into one file and then apply the filter to it all in one go, then split them up again and copy the #attribute values from the combined file into both the training and test files. This way the attributes will be consistent across both files.
I am a bit worried when using Weka's ReplaceMissingValues to input the missing values only for the test arff dataset but not for the training dataset. Below is the commandline:
java -classpath weka.jar weka.filters.unsupervised.attribute.ReplaceMissingValues -c last -i "test_file_with_missing_values.arff" -o "test_file_with_filled_missing_values.arff"
From a previous post (Replace missing values with mean (Weka)), I came to know that Weka's ReplaceMissingValues simply replace each missing value with the mean of the corresponding attribute. This implies that the mean needs to be computed for each attribute. While computation of this mean is perfectly fine for the training file, it is not okay for the test file.
This is because in the typical test scenario, we should not assume that we know the mean of the test attribute for the input missing values. We only have one test record with multiple attributes for classification instead of having the entire set of test records in a test file. Therefore, instead, we shall input the missing value based on the mean computed using the training data. Then above command would become incorrect as we would need to have another input (the means of the train attributes).
Has anybody thought about this before? How do you work around this by using weka?
Easy, see Batch Filtering
Instances train = ... // from somewhere
Instances test = ... // from somewhere
Standardize filter = new Standardize();
filter.setInputFormat(train); // initializing the filter once with training set
Instances newTrain = Filter.useFilter(train, filter); // configures the Filter based on train instances and returns filtered instances
Instances newTest = Filter.useFilter(test, filter); // create new test set
The filter is initialized using the training data and then applied on both training and test data.
The problem is when you apply the ReplaceMissingValue filter outside any processing pipeline, because after writing the filtered data, you can't distinguish between "real" values and "imputed" values anymore. This is why you should do everything that needs to be done in a single pipeline, e.g., using the FilteredClassifier:
java -classpath weka.jar weka.classifiers.meta.FilteredClassifier
-t "training_file_with_missing_values.arff"
-T "test_file_with_missing_values.arff"
-F weka.filters.unsupervised.attribute.ReplaceMissingValues
-W weka.classifiers.functions.MultilayerPerceptron -- -L 0.3 -M 0.2 -H a
This example will initialize the ReplaceMissingValues filter using the "training_file_with_missing_values.arff" data set, then apply the filter on "test_file_with_missing_values.arff" (with the means learning from the training set), then train a multilayer perceptron on the filtered training data and predict the class of the filtered test data.
Each row of my training and test datasets has intensity values for pixels in an image with the last column having the label which tells what digit is represented in the image; the label can be any number from 0 to 9 in training set and is always ? on test set. I loaded the training dataset on Weka Explorer, passed the data through NumericalToNominal filter and used RemovePercentage filter to split the data in 70-30 ratio, the 30% file being used as cross validation set. I built a classifer and saved the model.
Then, I loaded the test data which has ? against label for each row and applied the NumericToNominal filter and saved it as arff file.Now, when i load the test data and try to user the model against it, I always get the error message saying "training and test set are not compatible". Both datasets have undergone the same processing. What possibly could have gone wrong?
As you can read from ARFF manual (http://www.cs.waikato.ac.nz/ml/weka/arff.html):
Nominal values are defined by providing an
listing the possible values: {, ,
, ...}
For example, the class value of the Iris dataset can be defined as
follows:
#ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
So when you apply NumericToNominal to your test file you can possibly have different number of possible values for one or more attributes within train and test arff - it really can happen, it bothered me many times - so one solution is to check your arff's manually (if it is not to big, or just copy and paste invocation of arff file with
e.g.
#attribute 'My first binary attribute' {0,1}
(...)
#attribute 'My last binary attribute' {0,1}
from train to test file - should work
you can use batch filtering, here you can read how to batch filtering in weka