I am a bit worried about using Weka's ReplaceMissingValues to impute the missing values only for the test arff dataset, without reference to the training dataset. Below is the command line:
java -classpath weka.jar weka.filters.unsupervised.attribute.ReplaceMissingValues -c last -i "test_file_with_missing_values.arff" -o "test_file_with_filled_missing_values.arff"
From a previous post (Replace missing values with mean (Weka)), I came to know that Weka's ReplaceMissingValues simply replaces each missing value with the mean of the corresponding attribute. This implies that the mean needs to be computed for each attribute. While computing this mean is perfectly fine for the training file, it is not okay for the test file.
This is because in the typical test scenario, we should not assume that we know the mean of a test attribute when imputing its missing values. We may only have a single test record with multiple attributes to classify, rather than an entire set of test records in a test file. Therefore, we should instead impute the missing values based on the means computed from the training data. The above command would then be incorrect, as we would need another input (the means of the training attributes).
Has anybody thought about this before? How do you work around this in Weka?
Easy, see Batch Filtering
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Standardize;

Instances train = ...   // from somewhere
Instances test = ...    // from somewhere
Standardize filter = new Standardize();
filter.setInputFormat(train);                         // initialize the filter once, with the training set
Instances newTrain = Filter.useFilter(train, filter); // configures the filter on the train instances and returns the filtered data
Instances newTest = Filter.useFilter(test, filter);   // create the new test set with the same configuration
The filter is initialized using the training data and then applied to both the training and test data.
The problem arises when you apply the ReplaceMissingValues filter outside any processing pipeline: after writing out the filtered data, you can no longer distinguish between "real" values and "imputed" values. This is why you should do everything that needs to be done in a single pipeline, e.g., using the FilteredClassifier:
java -classpath weka.jar weka.classifiers.meta.FilteredClassifier
-t "training_file_with_missing_values.arff"
-T "test_file_with_missing_values.arff"
-F weka.filters.unsupervised.attribute.ReplaceMissingValues
-W weka.classifiers.functions.MultilayerPerceptron -- -L 0.3 -M 0.2 -H a
This example will initialize the ReplaceMissingValues filter using the "training_file_with_missing_values.arff" data set, then apply the filter to "test_file_with_missing_values.arff" (with the means learned from the training set), then train a multilayer perceptron on the filtered training data and predict the class of the filtered test data.
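For completeness, here is a rough Java equivalent of that command line. This is only a sketch, assuming `train` and `test` are Instances loaded elsewhere (as in the batch-filtering example above):

import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.meta.FilteredClassifier;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(new ReplaceMissingValues());   // -F

MultilayerPerceptron mlp = new MultilayerPerceptron();
mlp.setLearningRate(0.3);                   // -L 0.3
mlp.setMomentum(0.2);                       // -M 0.2
mlp.setHiddenLayers("a");                   // -H a
fc.setClassifier(mlp);                      // -W

fc.buildClassifier(train);                  // filter means are computed from the training data only
Evaluation eval = new Evaluation(train);
eval.evaluateModel(fc, test);               // test data is imputed with those training means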
Related
If I am currently using a Weka decision tree (or other) classifier as follows in my Java code:
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;

// Get training and testing data.
Instances train = new Instances(new BufferedReader(new FileReader("training file")));
train.setClassIndex(train.numAttributes() - 1);
Instances test = new Instances(new BufferedReader(new FileReader("testing file")));
test.setClassIndex(test.numAttributes() - 1);

// Set classifier.
Object obj = Class.forName("weka.classifiers.trees.J48").newInstance();
Classifier cls = (Classifier) obj;

// Set parameters for classifier.
String options = "-C 0.05 -M 2";
String[] optionsArray = options.split(" ");
cls.setOptions(optionsArray);

// Train classifier.
cls.buildClassifier(train);
Evaluation eval = new Evaluation(train);

// Test trained classifier.
eval.evaluateModel(cls, test);
What happens if I want to use a meta classifier, e.g. bagging, to try to boost results? In Weka's Explorer, if I use bagging with my training and testing data, the parameter string for the classifier is:
weka.classifiers.meta.Bagging -P 100 -S 1 -num-slots 1 -I 10 -W weka.classifiers.trees.J48 -- -C 0.25 -M 2
Does anyone know what a code representation of this might be?
Ideally, I want to store the classes of the classifier and meta classifier in a database table, i.e. so the line:
Object obj = Class.forName("weka.classifiers.trees.J48").newInstance();
becomes:
Object obj = Class.forName(classifier.getWekaClass()).newInstance();
And the parameters could be listed in a database table as well, to make them easy to change if I swap from J48 to NB.
I believe that this is what I'm looking for but...
http://weka.wikispaces.com/Use+WEKA+in+your+Java+code#Attribute selection-Meta-Classifier
The javadoc suggests that there is a setClassifier() method for setting the base classifier. Beyond that, it's simply a matter of instantiating the class and setting the options accordingly.
You can of course store the class names in a database and use them as in your example. Storing parameters would be a bit trickier, as the number and type would vary with each classifier -- you would have to provide a wrapper that can serialise and deserialise them properly.
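For example, a minimal sketch of the Bagging setup above, mirroring the reflective instantiation from your snippet (the class name and option strings could just as well come from your database; -num-slots is omitted here since it only exists in newer Weka versions):

import weka.classifiers.Classifier;
import weka.classifiers.meta.Bagging;
import weka.core.Utils;

// Base classifier, instantiated reflectively as in your example;
// "weka.classifiers.trees.J48" could be read from a database column.
Classifier base = (Classifier) Class.forName("weka.classifiers.trees.J48").newInstance();
base.setOptions(Utils.splitOptions("-C 0.25 -M 2"));        // the options after "--"

// Meta classifier wrapping the base classifier.
Bagging bagger = new Bagging();
bagger.setOptions(Utils.splitOptions("-P 100 -S 1 -I 10")); // Bagging's own options
bagger.setClassifier(base);                                 // equivalent to the -W flag
bagger.buildClassifier(train);                              // then evaluate as before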
I am trying to use weka to classify text. What I do is this:
I create one big ARFF file with all of the data: all_of_it.arff.
I split that data into training and test: train.arff and test.arff.
I do feature selection on the training set and output a new training file: train_fs.arff.
I build a classifier with only those selected features.
And the problem is.....
I don't quite know how to standardize the test set so that it uses only the features I selected from the training set -- something like creating a new test file from test.arff according to train_fs.arff.
I tried using:
java -cp weka.jar weka.filters.unsupervised.attribute.Standardize -b -i train_fs.arff -o train2.arff -r test.arff -s test2.arff
but I got the infamous "Src and Dest differ in # of attributes" error.
Is there any way to normalize/standardize the sets according to an arff file (namely, my new training data with fewer features)? I don't see how to do this with the Standardize or StringToWordVector filter.
Batch filtering is one solution to your problem.
Pros:
It will apply the same filter to your test dataset as to your training dataset, so when you perform feature selection, the two datasets will be compatible.
Cons:
It is only available from the command-line interface or Weka's Java API.
The two datasets must be filtered at the same time.
You can read more about Batch filtering here.
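For instance, the feature selection itself can be run as a batch filter, so the test set is reduced to exactly the attributes chosen on the training set. A sketch (the evaluator and search method are placeholders; substitute whatever you used for feature selection):

java -cp weka.jar weka.filters.supervised.attribute.AttributeSelection -E "weka.attributeSelection.CfsSubsetEval" -S "weka.attributeSelection.BestFirst" -c last -b -i train.arff -o train_fs.arff -r test.arff -s test_fs.arff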
You may also want to look into InputMappedClassifier. It is a wrapper classifier that addresses incompatible training and testing data.
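A rough sketch of how that wrapper is used (assuming `train` and `test` are Instances loaded elsewhere):

import weka.classifiers.misc.InputMappedClassifier;
import weka.classifiers.trees.J48;

InputMappedClassifier imc = new InputMappedClassifier();
imc.setClassifier(new J48());   // any base classifier
imc.buildClassifier(train);     // the training header is stored with the model
// At prediction time, incoming instances are mapped onto the training header
// by attribute name; values unseen during training are treated as missing.
double label = imc.classifyInstance(test.instance(0));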
I'm trying to classify some web posts using weka and naive bayes classifier.
First I manually classified many posts (about 100 negative and 100 positive) and I created an .arff file with this form:
@relation classtest
@attribute 'post' string
@attribute 'class' {positive,negative}
@data
'RT #burnreporter: Google has now indexed over 30 trillion URLs. Wow. #LeWeb',positive
'A special one for me Soundcloud at #LeWeb ',positive
'RT #dianaurban: Lost Internet for 1/2 hour at a conference called #LeWeb. Ironic, yes?',negative
.
.
.
Then I open the Weka Explorer, load that file, and apply the StringToWordVector filter to split the posts into single-word attributes.
Then, after doing the same with my test dataset, I select (in the Classify tab of Weka) the Naive Bayes classifier and choose "Supplied test set", but it returns "Train and test set are not compatible". What can I do? Thanks!
Probably the ordering of the attributes is different in train and test sets.
You can use batch filtering as described in http://weka.wikispaces.com/Batch+filtering
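For example (file names are placeholders), batch mode builds the word dictionary from the training file and applies the identical attribute set and ordering to the test file:

java -cp weka.jar weka.filters.unsupervised.attribute.StringToWordVector -b -i train.arff -o train_vec.arff -r test.arff -s test_vec.arff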
I used the batch filter but still have a problem. Here is what I did:
java -cp /usr/share/java/weka.jar weka.filters.unsupervised.attribute.NumericToNominal -R last -b -i trainData.arff -o trainDataProcessed.csv.arff -r testData.arff -s testDataProcessed.csv.arff
I then get the error below:
Input file formats differ.
Later, I figured out two ways to make the trained model work on a supplied test set.
Method 1.
Use the Knowledge Flow. For example, something like this: CSVLoader (for the train set) -> ClassAssigner -> TrainingSetMaker -> (classifier of your choice) -> ClassifierPerformanceEvaluator -> TextViewer, and CSVLoader (for the test set) -> ClassAssigner -> TestSetMaker -> (the same classifier instance as above) -> PredictionAppender -> CSVSaver. Then load the data from the CSVLoader (or ArffLoader) for the training set, and the model will be trained. After that, load the data from the loader for the test set; it will evaluate the model (the classifier, for example) on the supplied test set, and you can see the result in the TextViewer (connected to the ClassifierPerformanceEvaluator) and get the saved result from the CSVSaver (or ArffSaver) connected to the PredictionAppender. An additional column, "classified as", will be added to the output file. In my case, I used "?" for the class column in the supplied test set, since the class labels were not available.
Method 2.
Combine the training and test sets into one file. Then the exact same filter can be applied to both. Afterwards, you can separate the training and test sets again by applying an instance filter. Since I use "?" as the class label in the test set, it is not visible among the instance filter's label indices; hence, just select the indices that you can see in the attribute values to be removed when applying the instance filter, and only the test data will be left. Save it and load it under "Supplied test set" on the classifier page; this time it will work. I guess it is the class attribute that causes the "train and test set are not compatible" issue, as many classifiers require a nominal class attribute whose value is converted to an index into the available class values, according to http://weka.wikispaces.com/Why+do+I+get+the+error+message+%27training+and+test+set+are+not+compatible%27%3F
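The Method 2 split can also be scripted instead of done by hand. A hedged sketch (file names are placeholders; it assumes the class is the last attribute and that only the test rows have "?" as the class): RemoveWithValues removes every instance whose class matches one of the listed labels, while instances with a missing class are kept unless -M is given, so only the test rows survive:

java -cp weka.jar weka.filters.unsupervised.instance.RemoveWithValues -C last -L first-last -i combined_filtered.arff -o test_only.arff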
I have a largish dataset that I am using Weka to explore. It goes like this: today I will analyze as much data as I can, and create a trained classifier. I'll save this model as a file. Then tomorrow I will acquire a new batch of data, and want to use the saved model to predict the class for the new data. This repeats every day. Eventually I will update the saved model, but for now assume that it is static.
Due to the size and frequency of this task, I want to run this automatically, which means the command line or similar. However, my problem exists in the Explorer, as well.
My question has to do with the fact that, as my dataset grows, the list of possible labels for attributes also grows. Weka says such attribute lists cannot change, or the training set and test set are said to be incompatible (see: http://weka.wikispaces.com/Why+do+I+get+the+error+message+%27training+and+test+set+are+not+compatible%27%3F). But in my world there is no way that I could possibly know today all the attribute labels that I will stumble across next week.
To rectify the situation, it is suggested that I run batch filtering (http://weka.wikispaces.com/How+do+I+generate+compatible+train+and+test+sets+that+get+processed+with+a+filter%3F). Okay, that appears to mean that I need to re-build my model with the refiltered training data each day.
At this point the whole thing seems difficult enough that I fear I am making a horrible, simple newbie mistake, and so I ask for help.
DETAILS:
The model was created by
java -Xmx1280M weka.classifiers.meta.FilteredClassifier ^
-t .\training.arff -d .\my.model -c 15 ^
-F "weka.filters.supervised.attribute.Discretize -R first-last" ^
-W weka.classifiers.trees.J48 -- -C 0.25 -M 2
Naively, to predict I would try:
java -Xmx1280M weka.core.converters.DatabaseLoader ^
-url jdbc:odbc:(database) ^
-user (user) ^
-password (password) ^
-Q "exec (my_stored_procedure) '1/1/2012', '1/2/2012' " ^
> .\NextDay.arff
And then:
java -Xmx1280M weka.classifiers.trees.J48 ^
-T .\NextDay.arff ^
-l .\my.model ^
-c 15 ^
-p 0 ^
> .\MyPredictions.txt
this yields:
java.lang.Exception: training and test set are not compatible
at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:1035)
at weka.classifiers.Classifier.runClassifier(Classifier.java:312)
at weka.classifiers.trees.J48.main(J48.java:948)
A related question is asked at kdkeys.net/training-and-test-set-are-not-compatible-weka/
An associated problem is that the command-line version of the database extraction requires generating a temporary .arff file, and it appears JDBC-generated arff files do not handle "date" data correctly. My database produces dates in the ISO-8601 format "yyyy-MM-dd'T'HH:mm:ss", but both the Explorer and the .arff files generated from JDBC data represent these as type NOMINAL. As a result, the list of labels for date attributes in the header is very, very long and never the same from dataset to dataset.
I'm not a java or python programmer, but if that's what it takes, I'll go buy some books! Thanks in advance.
I think you can use incremental (updateable) classifiers, but only a few classifiers support this option; SMO and J48 do not, so you would have to classify with a different one.
To learn more, visit:
http://weka.wikispaces.com/Classifying+large+datasets
http://wiki.pentaho.com/display/DATAMINING/Handling+Large+Data+Sets+with+Weka
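As a sketch of what incremental training looks like in the Java API (NaiveBayesUpdateable is one of the updateable classifiers; the file name is a placeholder):

import java.io.File;
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

// Read the header, then stream the instances one at a time.
ArffLoader loader = new ArffLoader();
loader.setFile(new File("training.arff"));
Instances structure = loader.getStructure();
structure.setClassIndex(structure.numAttributes() - 1);

// Build on the empty structure, then update per instance.
NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
nb.buildClassifier(structure);

Instance current;
while ((current = loader.getNextInstance(structure)) != null) {
    nb.updateClassifier(current);
}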
There is a bigger problem with your plan, too, it seems. If you have data from day 1 and use it to build a model, and then apply that model to data from day n containing new, never-before-seen class labels, it will be impossible to predict the new labels because there is no training data for them. Similarly, if you have new attributes, it will be impossible to use them for classification because none of your training data has them to associate with the class labels.
Thus, if you want to use a model trained on data with only a subset of the new data's attributes/classes, you might as well filter the new data to remove the new classes/attributes, since they wouldn't be used even if you could run Weka without errors on two dissimilar datasets.
If it's not in your training set, exclude it from your test set. Then everything should work. If you need to be able to test/predict on it, then you need to retrain a new model that has examples of the new classes/attributes.
Doing this in your environment might require manually querying data out of the database into arff files, so as to pull out only the attributes/classes that were in the training set. Look into SQL and any major scripting language (e.g. Perl, Python) to do this without much fuss.
The university that maintains Weka also created MOA (Massive Online Analysis) to analyse and solve your kind of problem. All of its classifiers are updateable, and you can compare classifier performance over time on your data stream. It also allows you to detect change of models (concept drift/shift) and to optimize (i.e. limit) your data window over time (a forget-old-data mechanism).
Once you're done with testing and tuning in MOA, you can then use the MOA classifiers from Weka (there is an extension to enable this) and batch your whole process.
So I have training and test sets, and they contain multi-valued nominal attributes. Since I need to train and test a NaiveBayesMultinomial classifier, which doesn't support multi-valued nominal attributes, I do the following:
java weka.filters.supervised.attribute.NominalToBinary -i train.arff -o train_bin.arff -c last
java weka.filters.supervised.attribute.NominalToBinary -i test.arff -o test_bin.arff -c last
Then I run this:
java weka.classifiers.bayes.NaiveBayesMultinomial -t train_bin.arff -T test_bin.arff
And the following error arises:
Weka exception: Train and test files not compatible!
As far as I understood after examining both .arff files, they became incompatible after I ran NominalToBinary: since the train and test sets are different, different binary attributes are generated.
Is it possible to perform the NominalToBinary conversion in a way that keeps the sets compatible?
Concatenate the two sets into one file, perform the NominalToBinary conversion, then split them again. This way, they should be converted the same way.
But are you sure the files were compatible before? Or does your test and/or training set perhaps contain attribute values that the other doesn't have?
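As an alternative to concatenating and splitting by hand, the batch filtering described earlier in this thread should achieve the same thing: -b initializes the filter on the training file and applies the identical binarization to the test file. A sketch with your file names:

java weka.filters.supervised.attribute.NominalToBinary -c last -b -i train.arff -o train_bin.arff -r test.arff -s test_bin.arff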