I was wondering if there's a way to train a model using Naive Bayes and then apply it to a single record. I'm new to Weka, so I don't know if this is possible. Also, is there a way to store the classifier output in a file?
The answer is yes: Naive Bayes is a simple probabilistic model based on Bayes' theorem that can be used for classification problems.
For classification with Naive Bayes, as with other classifiers, you first train the model on a sample dataset; once trained, the model can be applied to any record.
There will always be some probability of error with this approach, but it depends mostly on the quality of your sample and the properties of your data set.
I haven't used Weka directly, only as an extension for RapidMiner, but the same principles should apply. Once the model is trained, you should be able to see or print the model parameters.
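The train-then-apply workflow described above can be sketched in a few lines. This is a minimal illustration in plain Python, not Weka code: the function names `train_naive_bayes` and `classify`, the toy weather data, and the use of Laplace smoothing are all assumptions made for the example. It trains on categorical records, serializes the model, restores it, and classifies one new record.

```python
import math
import pickle
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return {
        "class_counts": dict(class_counts),
        "value_counts": {k: dict(v) for k, v in value_counts.items()},
        "values_seen": {k: sorted(v) for k, v in values_seen.items()},
    }


def classify(model, record):
    """Return the most probable class label for a single record."""
    total = sum(model["class_counts"].values())
    best_label, best_score = None, -math.inf
    for label, count in model["class_counts"].items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(record):
            seen = model["value_counts"].get((i, label), {}).get(value, 0)
            n_values = len(model["values_seen"][i])
            # Laplace smoothing keeps unseen values from zeroing the product
            score += math.log((seen + 1) / (count + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: (outlook, temperature) -> play
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot")]
labels = ["no", "no", "yes", "yes", "yes"]

model = train_naive_bayes(rows, labels)
blob = pickle.dumps(model)       # the same bytes could be written to a file
restored = pickle.loads(blob)
prediction = classify(restored, ("rainy", "cool"))
```

The trained model here is only a set of counts, which is why it can be stored and reloaded without the training data.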
I am currently searching for the same answer, using Java.
I created an ARFF file containing training data and used the program http://weka.wikispaces.com/file/view/WekaDemo.java as an example to train and evaluate the classifier.
I still need to figure out how to save and load a model in Java and (more importantly) how to test against a single record.
WekaDemo.java
...
public void execute() throws Exception {
    // run filter
    m_Filter.setInputFormat(m_Training);
    Instances filtered = Filter.useFilter(m_Training, m_Filter);
    // train classifier on complete file for tree
    m_Classifier.buildClassifier(filtered);
    // 10fold CV with seed=1
    m_Evaluation = new Evaluation(filtered);
    m_Evaluation.crossValidateModel(
        m_Classifier, filtered, 10, m_Training.getRandomNumberGenerator(1));
    //TODO Save model
    //TODO Load model
    //TODO Test against a single information
}
...
Edit 1:
Saving and loading a model is explained here: How to test existing model with new instance in weka, using java code?
In http://weka.wikispaces.com/Use+WEKA+in+your+Java+code#Classification-Classifying%20instances there is a quick how-to for classifying a single instance.
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.Classifier;
import weka.core.Instances;

// load model (saved from user interface)
Classifier tree = (Classifier) weka.core.SerializationHelper.read("/some/where/j48.model");
// load unlabeled data
Instances unlabeled = new Instances(new BufferedReader(new FileReader("/some/where/unlabeled.arff")));
// set class attribute
unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
// create copy
Instances labeled = new Instances(unlabeled);
// label instances
for (int i = 0; i < unlabeled.numInstances(); i++) {
    double clsLabel = tree.classifyInstance(unlabeled.instance(i));
    labeled.instance(i).setClassValue(clsLabel);
    System.out.println(clsLabel + " -> " + unlabeled.classAttribute().value((int) clsLabel));
    // print the predicted probability of each class value
    double[] dist = tree.distributionForInstance(unlabeled.instance(i));
    for (int j = 0; j < dist.length; j++) {
        System.out.println(unlabeled.classAttribute().value(j) + ": " + dist[j]);
    }
}
Edit 2: This method doesn't train, evaluate, and save a model; that is something I usually do using the Weka GUI (http://weka.wikispaces.com/Serialization).
The example uses a tree model with a nominal class, but it should be easy to convert to a Naive Bayes example.
Related
I have two datasets for detecting whether a sentence mentions an adverse drug event. Both the training and test sets have only two fields: the text and the label {Adverse Event, No Adverse Event}. I used Weka with the StringToWordVector filter to build a Random Forest model on the training set.
I want to test the model by removing the class labels from the test set, applying the StringToWordVector filter to it, and running the model on the result. When I try this, I get an error saying the training and test sets are not compatible, probably because the filter identifies a different set of attributes for the test set. How do I fix this and output the predictions for the test set?
The easiest way to do this for a one-off test is not to pre-filter the training set, but to use Weka's FilteredClassifier, configured with the StringToWordVector filter and your chosen classifier. This is explained well in this video from the More Data Mining with Weka online course.
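The underlying issue is that the word-to-attribute mapping has to be learned from the training set once and then reused unchanged on the test set. The sketch below illustrates that idea in plain Python; it is not Weka's StringToWordVector, and the helper names `build_vocabulary` and `vectorize` and the sample sentences are invented for the example.

```python
from collections import Counter


def build_vocabulary(texts):
    """Collect the sorted set of words seen in the training texts."""
    return sorted({word for text in texts for word in text.lower().split()})


def vectorize(text, vocabulary):
    """Count vocabulary words in one text; words outside the vocabulary are ignored."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical sentences, for illustration only
train_texts = ["drug caused severe headache", "no side effects reported"]
test_texts = ["patient reported nausea and headache"]

# The vocabulary is learned from the training data only ...
vocabulary = build_vocabulary(train_texts)
train_vectors = [vectorize(t, vocabulary) for t in train_texts]
# ... and reused for the test data, so both have the same columns.
test_vectors = [vectorize(t, vocabulary) for t in test_texts]
```

Building a second vocabulary from the test texts would produce a different set of columns, which is the kind of mismatch the error message complains about.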
For a more general solution, if you want to build the model once then evaluate it on different test sets in future, you need to use InputMappedClassifier:
Wrapper classifier that addresses incompatible training and test data
by building a mapping between the training data that a classifier has
been built with and the incoming test instances' structure. Model
attributes that are not found in the incoming instances receive
missing values, so do incoming nominal attribute values that the
classifier has not seen before. A new classifier can be trained or an
existing one loaded from a file.
Weka requires a label even for the test data. It uses the labels, or "ground truth", of the test data to compare the model's results against and to measure its performance. How would you tell whether a model is performing well if you don't know whether its predictions are right or wrong? Thus, the test data needs to have the very same structure as the training data in Weka, including the labels. Don't worry: the labels are not used to help the model with its predictions.
The best way to go is to select cross-validation (e.g. 10-fold cross-validation), which will automatically split your data into 10 parts, using 9 for training and the remaining 1 for testing. This procedure is repeated 10 times so that each of the 10 parts is used once as test data. The final performance verdict is an average over all 10 rounds. Cross-validation gives you a fairly realistic estimate of the model's performance on new, unseen data.
What you were trying to do, namely using the exact same data for training and testing, is a bad idea because the measured performance you end up with is far too optimistic. You'll get very impressive figures, like 98% accuracy during testing, but as soon as you use the model on new, unseen data your accuracy might drop to a much worse level.
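To make the fold mechanics concrete, here is a small sketch of how k-fold cross-validation partitions the data. It is a plain Python illustration, not what Weka runs internally: the function name `k_fold_indices` is made up, and the split is sequential, whereas real tools usually shuffle the data first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 20 samples, 10 folds: each fold tests on 2 samples and trains on the other 18
folds = list(k_fold_indices(20, 10))
```

A model would be trained and scored once per fold, and the per-fold scores averaged to get the final estimate.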
I came across this snippet in the TensorFlow documentation, MNIST For ML Beginners.
eval_data = mnist.test.images # Returns np.array
eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)
Now I want to feed in my own test images, without labelling them, and have the model predict the labels. How do I achieve this?
Yes, you can, but it would not be deep learning; instead it would be clustering (e.g. k-means clustering).
The basic idea is as follows:
Create two placeholders, one for the input and one for the centroids
Decide on a distance metric
Create the graph
Feed only the dataset to run the graph
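The steps above can be sketched without TensorFlow. The following is a plain Python version of k-means (Lloyd's algorithm) with Euclidean distance as the metric; it swaps the placeholders and graph for ordinary lists, and the function name `k_means`, the naive initialization, and the sample points are assumptions made for the example.

```python
import math


def k_means(points, k, iterations=10):
    """Cluster points with Lloyd's algorithm using Euclidean distance."""
    centroids = [list(p) for p in points[:k]]  # naive init: the first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        for i, p in enumerate(points):
            assignments[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Hypothetical 2-D points forming two obvious groups
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
centroids, assignments = k_means(points, k=2)
```

Note that the cluster ids are arbitrary group numbers, not class labels; mapping a cluster to a meaningful label still requires some labelled examples or manual inspection.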
I'm using OpenCV decision trees to create a classifier. I would like to know whether it is possible to retrain that model (which can be saved to and loaded from a .yml file) by adding new data. I'm using OpenCV 2.4.
I was thinking of something like this:
CvDTree dtree;
dtree.load("existingTree.yml");
dtree.train(newValues, CV_ROW_SAMPLE, newResponses);
newValues contains only the new samples and newResponses contains the classes for those values. Would this generate a new decision tree trained on both the old values from the first training process and the new ones?
I didn't find any information about this in the OpenCV documentation.
Short answer: No
Long answer: During training, when a decision tree is passed a large training set, each split node in the tree learns a feature and a corresponding threshold. The branches of the tree terminate in leaf nodes, which store the prediction values. If you have already trained a decision tree, then it has already learned, from a training set, all of its features, thresholds, and prediction values. Training it again on additional data would render the previously learned parameters useless.
Another way to look at this is to think of a Random Forest, which is an ensemble of trees. Provided that your new dataset is not too different from the data the model has previously seen, you can train a new tree and add it to the group of previously trained trees. During prediction, you can average the predictions of all trees to get an overall prediction.
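The idea of adding a newly trained model to the existing ones and averaging their predictions can be sketched as follows. To keep the example short it uses one-feature threshold classifiers ("stumps") as a stand-in for full decision trees; this is plain Python, not the OpenCV API, and the function names and data are invented for the illustration.

```python
def train_stump(xs, ys):
    """Fit a one-feature threshold classifier: predict 1 when x > threshold."""
    best_threshold, best_errors = None, len(xs) + 1
    for threshold in sorted(set(xs)):
        errors = sum((x > threshold) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def ensemble_predict(thresholds, x):
    """Average the 0/1 votes of all stumps; a tie counts as class 1."""
    votes = [1 if x > t else 0 for t in thresholds]
    return 1 if sum(votes) / len(votes) >= 0.5 else 0


# Model trained earlier on the old data (hypothetical values)
old_model = train_stump([1, 2, 3, 6, 7, 8], [0, 0, 0, 1, 1, 1])
# New data arrives: train a separate model instead of retraining the old one
new_model = train_stump([2, 3, 4, 5, 9], [0, 0, 0, 1, 1])

ensemble = [old_model, new_model]
low = ensemble_predict(ensemble, 2)
high = ensemble_predict(ensemble, 6)
```

The old model is left untouched, so nothing learned from the first training set is lost; the cost is that every stored model has to be evaluated at prediction time.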
While solving a machine learning problem, I am fitting PCA on the training data and then applying .transform to the training data using sklearn. After observing the variances, I retain only those columns of the transformed data whose variance is large. Then I train the model using RandomForestClassifier. Now I am confused about how to apply the trained model to the test data, since the number of columns in the test data differs from the number in the retained transformed data (on which the random forest was trained). Any solution would be appreciated.
Thank you.
Here is a way of doing it, if this is what you seek. Ideally you should use the same number of principal components for the test set as for the training set; otherwise it defeats the purpose of a hold-out set.
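from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# fit PCA on the training data only
pca = PCA(n_components=20)
train_features = pca.fit_transform(train_data)
rfr = RandomForestClassifier(n_estimators=100, n_jobs=1,
                             random_state=2016, verbose=1,
                             class_weight='balanced', oob_score=True)
# train_labels: the training targets (not defined in the original snippet)
rfr.fit(train_features, train_labels)
# reuse the fitted PCA so the test data gets the same 20 columns
test_features = pca.transform(test_data)
rfr.predict(test_features)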
If I am currently using a Weka decision tree (or other) classifier as follows in my Java code:
// Get training and testing data.
Instances train = new Instances ("from training file");
train.setClassIndex(train.numAttributes() - 1);
Instances test = new Instances ("from testing file");
test.setClassIndex(test.numAttributes() - 1);
// Set classifier.
Object obj = Class.forName("weka.classifiers.trees.J48").newInstance();
Classifier cls = (Classifier) Class.forName("weka.classifiers.trees.J48").cast(obj);
// Set parameters for classifier.
String options = ("-C 0.05 -M 2");
String[] optionsArray = options.split(" ");
cls.setOptions(optionsArray);
// Train classifier.
cls.buildClassifier(train);
Evaluation eval = new Evaluation(train);
// Test trained classifier.
eval.evaluateModel(cls, test);
What happens if I want to use a meta classifier, e.g. bagging, to try to boost results? In Weka's Explorer, if I use bagging with my training and testing data, the parameter string for the classifier is:
weka.classifiers.meta.Bagging -P 100 -S 1 -num-slots 1 -I 10 -W weka.classifiers.trees.J48 -- -C 0.25 -M 2
Does anyone know what a code representation of this might be?
Ideally, I want to store the classes of the classifier and meta classifier in a database table, i.e. so that the line:
Object obj = Class.forName("weka.classifiers.trees.J48").newInstance();
becomes:
Object obj = Class.forName(classifier.getWekaClass()).newInstance();
The parameters could be listed in a database table as well, to make them easy to change if I swap classifiers from J48 to NB.
I believe that this is what I'm looking for but...
http://weka.wikispaces.com/Use+WEKA+in+your+Java+code#Attribute selection-Meta-Classifier
The javadoc suggests that there is a method setClassifier() that you would use to set the classifier you want to use. Beyond that, it's simply a matter of instantiating the class and setting the options accordingly.
You can of course store the class names in a database and use them as in your example. Storing parameters would be a bit trickier, as the number and type would vary with each classifier; you would have to provide a wrapper that can serialise and deserialise them properly.
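One possible way around the varying number and type of parameters is to store the whole classifier specification as a single string, in the same form the Explorer displays, and split it when loading. In Weka's option convention, everything after `--` belongs to the base classifier named by `-W`. The sketch below shows only the string handling, in Python rather than Java; the function name `parse_classifier_spec` is hypothetical.

```python
import shlex


def parse_classifier_spec(spec):
    """Split a stored spec into class name, its own options, and base-classifier options."""
    tokens = shlex.split(spec)
    class_name, rest = tokens[0], tokens[1:]
    if "--" in rest:
        split_at = rest.index("--")
        own_options, base_options = rest[:split_at], rest[split_at + 1:]
    else:
        own_options, base_options = rest, []
    return class_name, own_options, base_options


# The string as it might be stored in one database column
stored = ("weka.classifiers.meta.Bagging -P 100 -S 1 -num-slots 1 -I 10 "
          "-W weka.classifiers.trees.J48 -- -C 0.25 -M 2")
class_name, own_options, base_options = parse_classifier_spec(stored)
```

On the Java side, the class name would go to Class.forName(...) as in the question and the option tokens to setOptions(...); whether the options are passed split or as one array including `--` is left open here.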