Train an existing opencv model with new data - c++

I'm using the opencv Decision Trees for create a classifier. I would like to know if it is possible to retrain that model (that can be saved and loaded in a .yml file) adding new data. The version of Opencv that i'm using is 2.4.
I was thinking on something like this
CvDTree dtree;
dtree.load("existingTree.yml");
dtree.train(newValues, CV_ROW_SAMPLE, newResponses);
newValues contains only the new samples and newResponses contains the classes for that values. This would generate a new decision tree trained with the old values of the first training process and this new ones?
I didn't find any information on opencv documentation about this.

Short answer: No
Long answer: During training, when a decision tree is passed a large training set, each split node in the tree learns a feature set and a corresponding threshold. The branches of the tree terminate with leaf nodes that then stores the prediction values. If you have already trained a decision tree, then it has already learned, from a training set, all the features, threshold and prediction values. Training it again with a additional data would render the previously learned parameters useless.
Another way to look at this would be to think of Random Forest, which is formed by an ensemble of trees. Given that your new dataset is not too different from the data that the model has previously seen. If you want you can train a new tree and add it to a group of previously trained trees. During prediction, you can average the prediction of all trees to get an overall prediction.

Related

Correct approach to improve/retrain an offiline model

I have a recommendation system that was trained using Behavior Cloning (BC) with offline data generated using a supervised learning model converted to batch format using the approach described here. Currently, the model is exploring using an e-greedy strategy. I want to migrate from BC to MARWIL changing the beta.
There is a couple of ways to do that:
Convert the data employed to train the BC algorithm plus the agent’s new data and retrain from scratch using MARWIL.
Convert the new data generated by the agent and put it together with the previous converted data employed to train the BC algorithm, using the input parameter, doing something similar to what is described here, and retrain from scratch using MARWIL .
Convert the new data generated by the agent and put it together with the previous converted data employed to train the BC algorithm, using the input parameter, doing something similar to what is described here, and retrain using the restored BC agent using MARWIL .
Questions:
Following option 1.:
Given that the new data slice would be very small compared with the previous one, would the model learn something new?
When we stop using original data?
Following option 2.:
Given that the new data slice would be very small compared with the previous one, would the model learn something new?
When we stop using original data?
This approach works for trajectories associated with new episodes ids, but it will extend the trajectories of episodes already present in the original batch?
Following option 3.:
Given that the new data slice would be very small compared with the previous one, would the model learn something new?
When we stop using original data?
This approach works for trajectories associated with new episodes ids, but it will extend the trajectories of episodes already present in the original batch?
The retrain would update the networks’ weights using the new data points, but to do that how many iterations should we use?
How to prevent catastrophic forgetting?

Training and Test Set in Weka InCompatible in Text Classification

I have two datasets regarding whether a sentence contains a mention of a drug adverse event or not, both the training and test set have only two fields the text and the labels{Adverse Event, No Adverse Event} I have used weka with the stringtoWordVector filter to build a model using Random Forest on the training set.
I want to test the model built with removing the class labels from the test data set, applying the StringToWordVector filter on it and testing the model with it. When I try to do that it gives me the error saying training and test set not compatible probably because the filter identifies a different set of attributes for the test dataset. How do I fix this and output the predictions for the test set.
The easiest way to do this for a one off test is not to pre-filter the training set, but to use Weka's FilteredClassifier and configure it with the StringToWordVector filter, and your chosen classifier to do the classification. This is explained well in this video from the More Data Mining with Weka online course.
For a more general solution, if you want to build the model once then evaluate it on different test sets in future, you need to use InputMappedClassifier:
Wrapper classifier that addresses incompatible training and test data
by building a mapping between the training data that a classifier has
been built with and the incoming test instances' structure. Model
attributes that are not found in the incoming instances receive
missing values, so do incoming nominal attribute values that the
classifier has not seen before. A new classifier can be trained or an
existing one loaded from a file.
Weka requires a label even for the test data. It uses the labels or „ground truth“ of the test data to compare the result of the model against it and measure the model performance. How would you tell whether a model is performing well, if you don‘t know whether its predictions are right or wrong. Thus, the test data needs to have the very same structure as the training data in WEKA, including the labels. No worries, the labels are not used to help the model with its predictions.
The best way to go is to select cross validation (e.g. 10 fold cross validation) which automatically will split your data into 10 parts, using 9 for training and the remaining 1 for testing. This procedure is repeated 10 times so that each of the 10 parts has once been used as test data. The final performance verdict will be an average of all 10 rounds. Cross validation gives you a quite realistic estimate of the model performance on new, unseen data.
What you were trying to do, namely using the exact same data for training and testing is a bad idea, because the measured performance you end up with is way too optimistic. This means, you‘ll get very impressive figures like 98% accuracy during testing - but as soon as you use the model against new unseen data your accuracy might drop to a much worse level.

How does Weka calculate the output predictions in J48 and other classifier?

I have used the output predictions of J48 classifier in Weka and got the results with predictions (probability). As I need to use these predictions number in my research, I need to know how the weka calculates these numbers? What is the formula? Is it specified for each classifier?
In addition to Jan Eglinger answer.
The J48 classifier is Weka's implementation of the infamous C4.5 decision tree classifier, which is a classification algorithm based on ID3 that classifies using information entropy.
The training data is a set S = {s_1, s_2, ...} of already classified samples. Each sample s_i consists of a p-dimensional vector (x_{1,i}, x_{2,i}, ...,x_{p,i}) , where the x_j represent attribute values or features of the sample, as well as the class in which s_i falls.
At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurs on the smaller sublists.
This algorithm has a few base cases.
All the samples in the list belong to the same class. When this
happens, it simply creates a leaf node for the decision tree saying
to choose that class.
None of the features provide any information gain. In this case,
C4.5 creates a decision node higher up the tree using the expected
value of the class.
Instance of previously-unseen class encountered. Again, C4.5 creates
a decision node higher up the tree using the expected value.
You can find the information Gain and entropy in the Weka Api package. For that you need to start dubbing the java weka api and go through each step.
In general, if you don't worry about how algorithm works internally using high level mathematics. Try to calculate InformationGain and entropy and explain them in your research apart from decision trees, you have methods for both of these to calculate their value.
What is the formula?
Weka's J48 classifier is an implementation of the C4.5 algorithm.
I need to know how the weka calculates these numbers?
You can find implementation details in J48.java and in the weka.classifiers.trees.j48 package.

Do OpenCV's machine learning algorithms continuously update a model?

Most machine learning algorithms implemented in OpenCV 2.4 built upon a CvStatModel which comes with a CvStatModel::train method.
There it says:
By default, the input feature vectors are stored as train_data rows, that is, all the components (features) of a training vector are stored continuously.
and
Usually, the previous model state is cleared by CvStatModel::clear() before running the training procedure. However, some algorithms may optionally update the model state with the new training data, instead of resetting it.
How do I know which ml algorithm isn't resetting the current model state. Since I wanted to use CvGBTrees::train which has a update parameter declared as being only a dummy parameter, I guess the model is discarded after every training call. Can I take it that if there is no such update parameter the current model state will always be discarded?
I need a machine learning algorithm which continuously trains one model and doesn't start with an initial model every training call.
Is this doable with the current ml implementations in OpenCV? and if so with which ones? Furthermore, if not are there other c++ libraries that would do so?

Handle multi-label dataset in classification using j48 tree

I'm trying to use j48 tree to perform a text categorization task. I read a lot of papers and websites that explain how to use classification having datasets whose data are single labeled.
In my case I have only multi-labeled data in my training set, what can I have to treat these data in a single decision tree? Or the only solution is generating many trees as many as the number of the labels?
You can use a tree with adapted entropy formula. You must define beforehand if your dataset has hierachical labels:
papers and code