Building compatible datasets for Weka for large, evolving data - weka

I have a largish dataset that I am using Weka to explore. It goes like this: today I will analyze as much data as I can, and create a trained classifier. I'll save this model as a file. Then tomorrow I will acquire a new batch of data, and want to use the saved model to predict the class for the new data. This repeats every day. Eventually I will update the saved model, but for now assume that it is static.
Due to the size and frequency of this task, I want to run this automatically, which means the command line or similar. However, my problem exists in the Explorer, as well.
My question has to do with the fact that, as my dataset grows, the list of possible labels for attributes also grows. Weka says such attribute lists cannot change, or the training set and test set are said to be incompatible (see: http://weka.wikispaces.com/Why+do+I+get+the+error+message+%27training+and+test+set+are+not+compatible%27%3F). But in my world there is no way that I could possibly know today all the attribute labels that I will stumble across next week.
To rectify the situation, it is suggested that I run batch filtering (http://weka.wikispaces.com/How+do+I+generate+compatible+train+and+test+sets+that+get+processed+with+a+filter%3F). Okay, that appears to mean that I need to re-build my model with the refiltered training data each day.
At this point the whole thing seems difficult enough that I fear I am making a horrible, simple newbie mistake, and so I ask for help.
DETAILS:
The model was created by
java -Xmx1280M weka.classifiers.meta.FilteredClassifier ^
-t .\training.arff -d .\my.model -c 15 ^
-F "weka.filters.supervised.attribute.Discretize -R first-last" ^
-W weka.classifiers.trees.J48 -- -C 0.25 -M 2
Naively, to predict I would try:
java -Xmx1280M weka.core.converters.DatabaseLoader ^
-url jdbc:odbc:(database) ^
-user (user) ^
-password (password) ^
-Q "exec (my_stored_procedure) '1/1/2012', '1/2/2012' " ^
\> .\NextDay.arff
And then:
java -Xmx1280M weka.classifiers.trees.J48 ^
-T .\NextDay.arff ^
-l .\my.model ^
-c 15 ^
-p 0 ^
\> .\MyPredictions.txt
this yields:
java.lang.Exception: training and test set are not compatible
at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:1035)
at weka.classifiers.Classifier.runClassifier(Classifier.java:312)
at weka.classifiers.trees.J48.main(J48.java:948)
A related question is asked at kdkeys.net/training-and-test-set-are-not-compatible-weka/
An associated problem is that the command-line version of the database extraction requires generation of a temporary .arff file, and it appears JDBC-generated arff files do not handle "date" data correctly. My database generates dates of the ISO-8601 format "yyyy-MM-dd'T'HH:mm:ss" but both Explorer and generated .arff files from JDBC data represent these as type NOMINAL. And so the list of labels for date attributes in the header is very, very long and never the same from dataset to dataset.
I'm not a java or python programmer, but if that's what it takes, I'll go buy some books! Thanks in advance.

I think you can use Incremental classifiers. But only few classifier can support for this option. Like SMO, J48 classifiers wont support this. So you will use some other classifier to classify.
To know more visit
http://weka.wikispaces.com/Classifying+large+datasets
http://wiki.pentaho.com/display/DATAMINING/Handling+Large+Data+Sets+with+Weka

There is a bigger problem with your plan too, it seems. If you have data from day 1 and you use it to build a model, then you use it on data from day n that has new and never before seen class labels, it will be impossible to predict the new labels because there is no training data for them. Similarly, if you have new attributes, it will be impossible to use those for classification because none of your training data has them to associate with the class labels.
Thus, if you want to use a model trained on data with only a subset of the new data's attributes/classes, then you might as well filter the new data to remove the new classes/attributes since they wouldn't be used even if you could execute weka without errors on two dissimilar datasets.
If it's not in your training set, exclude it from your test set. Then everything should work. If you need to be able to test/predict on it, then you need to retrain a new model that has examples of the new classes/attributes.
Doing this in your environment might require manually querying data out of the database into arff files, so as to query out only the attributes/classes that were in the training set. Look into sql and any major scripting language (e.g. perl, python) to do this without much fuss.

The university who maintains Weka also created MOA (Massive Online Analysis) to analyse and solve your kind of problem. All of their classifiers are updatable and you can compare classifiers performance over the time for your data flow. It also allows you to detect change of models (concept drift/shift) and optimize (ie. limit) your data window over the time (forget old data mechanism...).
Once you're done with testing and tuning with MOA, you can then use MOA classifiers from Weka (there is an extension to enable it) and batch all your process.

Related

Training and Test Set in Weka InCompatible in Text Classification

I have two datasets regarding whether a sentence contains a mention of a drug adverse event or not, both the training and test set have only two fields the text and the labels{Adverse Event, No Adverse Event} I have used weka with the stringtoWordVector filter to build a model using Random Forest on the training set.
I want to test the model built with removing the class labels from the test data set, applying the StringToWordVector filter on it and testing the model with it. When I try to do that it gives me the error saying training and test set not compatible probably because the filter identifies a different set of attributes for the test dataset. How do I fix this and output the predictions for the test set.
The easiest way to do this for a one off test is not to pre-filter the training set, but to use Weka's FilteredClassifier and configure it with the StringToWordVector filter, and your chosen classifier to do the classification. This is explained well in this video from the More Data Mining with Weka online course.
For a more general solution, if you want to build the model once then evaluate it on different test sets in future, you need to use InputMappedClassifier:
Wrapper classifier that addresses incompatible training and test data
by building a mapping between the training data that a classifier has
been built with and the incoming test instances' structure. Model
attributes that are not found in the incoming instances receive
missing values, so do incoming nominal attribute values that the
classifier has not seen before. A new classifier can be trained or an
existing one loaded from a file.
Weka requires a label even for the test data. It uses the labels or „ground truth“ of the test data to compare the result of the model against it and measure the model performance. How would you tell whether a model is performing well, if you don‘t know whether its predictions are right or wrong. Thus, the test data needs to have the very same structure as the training data in WEKA, including the labels. No worries, the labels are not used to help the model with its predictions.
The best way to go is to select cross validation (e.g. 10 fold cross validation) which automatically will split your data into 10 parts, using 9 for training and the remaining 1 for testing. This procedure is repeated 10 times so that each of the 10 parts has once been used as test data. The final performance verdict will be an average of all 10 rounds. Cross validation gives you a quite realistic estimate of the model performance on new, unseen data.
What you were trying to do, namely using the exact same data for training and testing is a bad idea, because the measured performance you end up with is way too optimistic. This means, you‘ll get very impressive figures like 98% accuracy during testing - but as soon as you use the model against new unseen data your accuracy might drop to a much worse level.

How to do prediction with weka

i'm using weka to do some text mining, i'm a little bit confused so i'm here to ask how can i ( with a set of comments that are in a some way classified as: notes, status of work, not conformity, warning) predict if a new comment belong to a specific class, with all the comment (9551) i've done a preprocess obtaining with the filter "stringtowordvector" a vector of tokens, and then i've used the simple kmeans to obtain a number of cluster.
So the question is: if a user post a new comment can i predict with those data if it belong to a category of comment?
sorry if my question is a little bit confused but so am i.
thank you
Trivial Training-validation-test
Create two datasets from your labelled instances. One will be training set and the other will be validation set. The training set will contain about 60% of the labelled data and the validation will contain 40% of the labelled data. There is no hard and fast rule for this split, but a 60-40 split is a good choice.
Use K-means (or any other clustering algorithm) on your training data. Develop a model. Record the model's error on training set. If the error is low and acceptable, you are fine. Save the model.
For now, your validation set will be your test dataset. Apply the model you saved on your validation set. Record the error. What is the difference between training error and validation error? If they both are low, the model's generalization is "seemingly" good.
Prepare a test dataset where you have all the features of your training and test dataset but the class/cluster is unknown.
Apply the model on the test data.
10-fold cross validation
Use all of your labelled data instances for this task.
Apply K-means (or any other algorithm of your choice) with a 10-fold CV setup.
Record the training error and CV error. Are they low? Is the difference between the errors is low? If yes, then save the model and apply it on the test data whose class/cluster is unknown.
NB: The training/test/validation errors and their differences will give you an "very initial" idea of overfitting/underfitting of your model. They are sanity tests. You need to perform other tests like learning curves to see if your model overfits or underfits or perfect. If there appears to be an overfitting and underfitting problem, you need to try many different techniques to overcome them.

Missing the obvious with inconsistently delimited data?

I have built something in SAS to pull down Yahoo! finance .csv data. The code I have built now works fine and I have built some robust error handling into the code. The problem I have had with the data though is that the .csv feed is unsupported and not clean.
The data is comma delimited, but some of the data also has commas in it. Some of the fields are in quotes and some are not. Also the length of the fields varies wildly as as well. A field like Market Capitlisation for example could run form a few million to hundreds of billions.
As a result, if you pass multiple stock metrics for multiple stocks through to the Yahoo! API at the same time, you will get rows of .csv data where each field is in a different place, is a different length and is inconsistently delimited.
I have tried multiple infile options that could handle some of these errors in isolation, but not all of them together. My only solution that works is to download single stock metrics by multiple stocks at the same time.
This gives me what I want, but it takes over an hour to run the data for the NASDAQ and the NYSE. Have I overlooked another method for handling this type of problem?
Thanks
This is the outline of a way to do what you are looking for. The whole of the code to do this would be too long to post here and out of scope of what this site looks to do.
Create a SAS program that takes a stock ticker from the SYSPARM automatic macro, and downloads the data to a data set named the same as the ticker into a permanent library.
The SYSPARM macro is set by the value you set on the commandline to call SAS
sas.exe myprog.sas -sysparm XYZ
This would set &SYSPARM to resolve XYZ
Write a SAS program that merges all the ticker data sets together for further processing.
Create a program in a language like Perl or Python, (or shell script, etc.) that loops over a range of tickers and calls your SAS program, passing the ticker through SYSPARM.
Use a threading, forking, etc. package from that language to have multiple of these running at the same time. You can probably go to some multiple of the CPU cores on your machine as this processing will not be CPU intensive. Test values to you find one that works.
From that same language call your SAS program to merge the datasets.

Weka: Src and Dest differ in # of attributes after I do feature selection on the training set

I am trying to use weka to classify text. What I do is this:
I create on big ARFF file with all of the data: all_of_it.arff.
I split that data into training and test:train.arff and test.arff
I do feature selection on the training set and output a new training file:train_fs.arff
I build a classifier with only those selected features.
And the problem is.....
I don't quite know how to standardize the test set to only use the features I selected from the training set. Something like create new test file from test.arff according to train_fs.arff
*I tried using
java -cp weka.jar weka.filters.unsupervised.attribute.Standardize -b -i train_fs.arff -o train2.arff -r test.arff -s test2.arff
but I got the infamous Src and Dest differ in # of attributes.
Is there any way to normalize/standardize the sets according to an arff file (namely my new training data with few features) I don't see how to do this with the Standardize or StringToWordVector filter.
Batch filtering is one solution to your problem.
Pros:
It will apply the same filter to your test dataset as you apply to your training dataset. When you perform feature selection, the two datasets will be compatible
Cons:
It is only availabe from the command line interface or Weka's Java API
The two datasets must be filtered at the same time
You can read more about Batch filtering here.
You may also want to look into InputMappedClassifier. It is a wrapper classifier that addresses incompatible training and testing data.

Weka: ReplaceMissingValues for a test file

I am a bit worried when using Weka's ReplaceMissingValues to input the missing values only for the test arff dataset but not for the training dataset. Below is the commandline:
java -classpath weka.jar weka.filters.unsupervised.attribute.ReplaceMissingValues -c last -i "test_file_with_missing_values.arff" -o "test_file_with_filled_missing_values.arff"
From a previous post (Replace missing values with mean (Weka)), I came to know that Weka's ReplaceMissingValues simply replace each missing value with the mean of the corresponding attribute. This implies that the mean needs to be computed for each attribute. While computation of this mean is perfectly fine for the training file, it is not okay for the test file.
This is because in the typical test scenario, we should not assume that we know the mean of the test attribute for the input missing values. We only have one test record with multiple attributes for classification instead of having the entire set of test records in a test file. Therefore, instead, we shall input the missing value based on the mean computed using the training data. Then above command would become incorrect as we would need to have another input (the means of the train attributes).
Has anybody thought about this before? How do you work around this by using weka?
Easy, see Batch Filtering
Instances train = ... // from somewhere
Instances test = ... // from somewhere
Standardize filter = new Standardize();
filter.setInputFormat(train); // initializing the filter once with training set
Instances newTrain = Filter.useFilter(train, filter); // configures the Filter based on train instances and returns filtered instances
Instances newTest = Filter.useFilter(test, filter); // create new test set
The filter is initialized using the training data and then applied on both training and test data.
The problem is when you apply the ReplaceMissingValue filter outside any processing pipeline, because after writing the filtered data, you can't distinguish between "real" values and "imputed" values anymore. This is why you should do everything that needs to be done in a single pipeline, e.g., using the FilteredClassifier:
java -classpath weka.jar weka.classifiers.meta.FilteredClassifier
-t "training_file_with_missing_values.arff"
-T "test_file_with_missing_values.arff"
-F weka.filters.unsupervised.attribute.ReplaceMissingValues
-W weka.classifiers.functions.MultilayerPerceptron -- -L 0.3 -M 0.2 -H a
This example will initialize the ReplaceMissingValues filter using the "training_file_with_missing_values.arff" data set, then apply the filter on "test_file_with_missing_values.arff" (with the means learning from the training set), then train a multilayer perceptron on the filtered training data and predict the class of the filtered test data.