Handle multi-label dataset in classification using j48 tree - weka

I'm trying to use a J48 tree to perform a text categorization task. I have read a lot of papers and websites that explain how to perform classification on datasets whose data are single-labeled.
In my case the training set contains only multi-labeled data. How can I handle these data in a single decision tree? Or is the only solution to generate as many trees as there are labels?

You can use a tree with an adapted entropy formula. You must define beforehand whether your dataset has hierarchical labels:
papers and code
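For instance, one published adaptation (Clare and King's multi-label variant of the C4.5 entropy) sums, over all labels, the binary entropy of having versus not having each label. A minimal sketch of that formula, as my own illustration rather than code from the linked papers:

// Multi-label entropy in the style of Clare & King's ML-C4.5:
// each label contributes the binary entropy of "present" vs "absent".
public class MultiLabelEntropy {
    // labelCounts[j] = number of samples in the set that carry label j
    static double entropy(int[] labelCounts, int numSamples) {
        double h = 0.0;
        for (int count : labelCounts) {
            double p = (double) count / numSamples;        // P(label present)
            if (p > 0.0) h -= p * log2(p);                 // "has label" term
            if (p < 1.0) h -= (1.0 - p) * log2(1.0 - p);   // "lacks label" term
        }
        return h;
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }
}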


"number expected, read Token[2015-02-02 14:19:00]" - Weka project

I hope you are all doing well!
I have a project in my data mining class. The data consist of numerical values and many algorithms do not work on them. I have to do this: "You should compare the performance of the following categorization algorithms: RandomForest, C4.5, JRip, Bayesian Network. Where necessary, use Weka filters to replace values for some properties or to create new properties. For the comparison, adopt the Train/Test Percentage Split type with the percentage for training data equal to 80%. Describe your observations by giving tables with the results and presenting the performance of the algorithms. Repeat the experiment with the percentage for training data set to 70% and 50%, presenting the results."
So my first try was to transform the data inside Weka with the NumericToNominal preprocessing filter, but a friend of mine suggested that this is statistically wrong. So my second try was to use Excel to transform all the data, even the date, to numeric, remove the first row (id), and pass the file to Weka (I left double quotes only around the date). But I get the error that I mention in the title. The dataset is: https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+
Thank you for your time.
If you define date-like data as a DATE attribute in the ARFF file (using the right format for parsing the strings), then Weka will treat it internally as a numeric attribute (the Java epoch, i.e. milliseconds since 1970-01-01).
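As a minimal sketch (the file name and the date format string here are assumptions), loading such an ARFF and inspecting the internal numeric value:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DateAsNumericDemo {
    public static void main(String[] args) throws Exception {
        // Assumes an ARFF whose header declares the date column as e.g.:
        //   @attribute date date "yyyy-MM-dd HH:mm:ss"
        Instances data = DataSource.read("occupancy.arff");
        // Internally the parsed date is stored as a double:
        // milliseconds since 1970-01-01 (the Java epoch).
        System.out.println(data.instance(0).value(0));
    }
}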
Instead of using NumericToNominal, use either the supervised or unsupervised Discretize filter if the algorithm cannot handle numeric attributes.
Converting nominal attributes to numeric ones is not a recommended approach. Instead, try the supervised or unsupervised NominalToBinary filter.
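For example, applying the supervised Discretize filter programmatically might look like this (the file name is hypothetical; the same pattern works for the unsupervised variant and for NominalToBinary):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

public class DiscretizeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("occupancy.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1); // supervised filters need the class

        // Supervised discretization (uses the class attribute to pick cut points)
        Discretize discretize = new Discretize();
        discretize.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, discretize);
        System.out.println(discretized.toSummaryString());
    }
}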

Train an existing opencv model with new data

I'm using the OpenCV Decision Trees to create a classifier. I would like to know if it is possible to retrain that model (which can be saved to and loaded from a .yml file) by adding new data. The version of OpenCV that I'm using is 2.4.
I was thinking of something like this:
CvDTree dtree;
dtree.load("existingTree.yml"); // load the previously trained and saved model
dtree.train(newValues, CV_ROW_SAMPLE, newResponses); // train again, on the new samples only
newValues contains only the new samples and newResponses contains the classes for those values. Would this generate a new decision tree trained with the old values from the first training process as well as these new ones?
I didn't find any information in the OpenCV documentation about this.
Short answer: No
Long answer: During training, when a decision tree is given a large training set, each split node in the tree learns a split feature and a corresponding threshold. The branches of the tree terminate in leaf nodes that store the prediction values. If you have already trained a decision tree, then it has already learned, from a training set, all the features, thresholds and prediction values. Training it again with additional data would discard and replace the previously learned parameters rather than extend them.
Another way to look at this is to think of a Random Forest, which is an ensemble of trees. Provided your new dataset is not too different from the data that the model has previously seen, you can train a new tree on the new data and add it to the group of previously trained trees. During prediction, you can average the predictions of all the trees to get an overall prediction.
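A minimal sketch of that ensemble idea, with a hypothetical Tree interface standing in for whatever model class you use (e.g. OpenCV's CvDTree); only a predict method is assumed:

import java.util.List;

// Hypothetical stand-in for a trained tree model.
interface Tree {
    double predict(double[] sample);
}

class TreeEnsemble {
    private final List<Tree> trees; // old trees plus any newly trained ones

    TreeEnsemble(List<Tree> trees) {
        this.trees = trees;
    }

    // Average the individual predictions, as a random forest does for
    // regression (use a majority vote instead for classification labels).
    double predict(double[] sample) {
        double sum = 0.0;
        for (Tree t : trees) {
            sum += t.predict(sample);
        }
        return sum / trees.size();
    }
}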

Voted Perceptron on Weka

I am trying to run the "Voted Perceptron" on the iris data set using Weka. However, when I load the data, the Voted Perceptron refuses to run, while it runs on many other data sets like ionosphere.arff, diabetes.arff, etc.
Please help.
Because VotedPerceptron only works on datasets whose class attribute is binary. iris.arff has three different class values, while diabetes.arff and ionosphere.arff only have two.
If you want it to work, you'll have to entirely remove one of iris.arff's class values, both its instances and its entry in the class attribute declaration.
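If you would rather keep all three classes, another standard Weka route (not mentioned in the answer above) is to wrap the binary-only learner in the MultiClassClassifier meta-classifier, which decomposes the problem into binary subproblems. A minimal sketch, assuming iris.arff is in the working directory:

import weka.classifiers.functions.VotedPerceptron;
import weka.classifiers.meta.MultiClassClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class VotedPerceptronIris {
    public static void main(String[] args) throws Exception {
        Instances iris = DataSource.read("iris.arff");
        iris.setClassIndex(iris.numAttributes() - 1);

        // One-vs-rest wrapper: trains one VotedPerceptron per binary subproblem.
        MultiClassClassifier meta = new MultiClassClassifier();
        meta.setClassifier(new VotedPerceptron());
        meta.buildClassifier(iris);
        System.out.println(meta);
    }
}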

How does Weka calculate the output predictions in J48 and other classifiers?

I have used the output predictions of the J48 classifier in Weka and got results with prediction probabilities. As I need to use these numbers in my research, I need to know how Weka calculates them. What is the formula? Is it specific to each classifier?
In addition to Jan Eglinger's answer.
The J48 classifier is Weka's implementation of the well-known C4.5 decision tree learner, a classification algorithm based on ID3 that splits using information entropy.
The training data is a set S = {s_1, s_2, ...} of already classified samples. Each sample s_i consists of a p-dimensional vector (x_{1,i}, x_{2,i}, ..., x_{p,i}), where the x_j represent attribute values or features of the sample, as well as the class in which s_i falls.
At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurs on the smaller sublists.
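To make the criterion concrete, here is a small self-contained sketch (my own illustration, not Weka's code) of entropy and information gain computed from class counts:

public class InfoGain {
    // H(S) = -sum_i p_i * log2(p_i), computed from per-class counts
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2.0));
        }
        return h;
    }

    // Gain(S, A) = H(S) - sum_v (|S_v| / |S|) * H(S_v),
    // where childCounts[v] holds the class counts of subset S_v
    static double informationGain(int[] parentCounts, int[][] childCounts) {
        int total = 0;
        for (int c : parentCounts) total += c;
        double remainder = 0.0;
        for (int[] child : childCounts) {
            int size = 0;
            for (int c : child) size += c;
            remainder += ((double) size / total) * entropy(child);
        }
        return entropy(parentCounts) - remainder;
    }
}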
This algorithm has a few base cases.
- All the samples in the list belong to the same class. When this happens, the algorithm simply creates a leaf node for the decision tree saying to choose that class.
- None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
- An instance of a previously unseen class is encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.
You can find the information gain and entropy computations in the Weka API packages. For that you need to start debugging the Weka Java API and step through each stage.
In general, if you don't want to worry about how the algorithm works internally using high-level mathematics, try to calculate the information gain and entropy yourself and explain them in your research; apart from the decision tree classes, Weka has methods for calculating both of these values.
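For example, Weka exposes an information-gain computation directly through InfoGainAttributeEval, so you can report per-attribute gains without re-implementing the math (the file name here is hypothetical):

import weka.attributeSelection.InfoGainAttributeEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InfoGainDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // Weka's built-in information-gain attribute evaluator
        InfoGainAttributeEval eval = new InfoGainAttributeEval();
        eval.buildEvaluator(data);
        for (int i = 0; i < data.numAttributes() - 1; i++) {
            System.out.printf("%s: %.4f%n",
                data.attribute(i).name(), eval.evaluateAttribute(i));
        }
    }
}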
What is the formula?
Weka's J48 classifier is an implementation of the C4.5 algorithm.
I need to know how the weka calculates these numbers?
You can find implementation details in J48.java and in the weka.classifiers.trees.j48 package. For the probabilities specifically, J48 predicts the class distribution of the training instances in the leaf that a test instance reaches (Laplace smoothing of those counts can be enabled with the -A option).
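A minimal sketch of retrieving those per-class probabilities through the API (the file name is hypothetical; predicting on a training instance is just for illustration):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Probabilities {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(data);

        // Per-class probabilities for one instance: the normalized class
        // distribution of the leaf the instance falls into.
        double[] dist = tree.distributionForInstance(data.instance(0));
        for (int i = 0; i < dist.length; i++) {
            System.out.printf("%s: %.3f%n",
                data.classAttribute().value(i), dist[i]);
        }
    }
}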

Input arff file for Weka Apriori

I am trying to do association mining on version history. I have my transaction data in MySQL. The Weka Apriori algorithm requires an ARFF or CSV file in a certain format: it has to have a column for each item, and the values are specified as TRUE or FALSE for each item in a transaction. I am looking for a way to create this file using Weka's InstanceQuery. Also, what are the options if the transaction data is huge?
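For the InstanceQuery part, a minimal sketch (the connection details, table layout, and SQL here are assumptions, and the MySQL JDBC driver must be on the classpath); the query itself pivots transactions into one 0/1 column per item:

import weka.core.Instances;
import weka.experiment.InstanceQuery;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;

public class TransactionExport {
    public static void main(String[] args) throws Exception {
        InstanceQuery query = new InstanceQuery();
        query.setDatabaseURL("jdbc:mysql://localhost:3306/versiondb"); // hypothetical
        query.setUsername("user");      // hypothetical credentials
        query.setPassword("secret");
        // Hypothetical table: transactions(transaction_id, item);
        // MAX(item = 'A') yields 1 if the transaction contains item A, else 0.
        query.setQuery(
            "SELECT MAX(item = 'A') AS itemA, MAX(item = 'B') AS itemB "
            + "FROM transactions GROUP BY transaction_id");
        Instances data = query.retrieveInstances();

        // Apriori needs nominal attributes, so convert the 0/1 columns
        // (the labels become "0"/"1"; rename them in SQL if you need TRUE/FALSE).
        NumericToNominal toNominal = new NumericToNominal();
        toNominal.setInputFormat(data);
        Instances nominal = Filter.useFilter(data, toNominal);
        System.out.println(nominal.numInstances());
    }
}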
I can answer the second part: the options if the transaction data is huge. Weka is good software, but its Apriori implementation is horribly slow. I recommend the implementations at http://fimi.ua.ac.be/src/ (I used the first one in the list, from Ferenc Bodon).
Bodon's implementation uses a Trie data structure instead of the hashtables that Weka uses. Because of this, I found in my work that Weka would take 3 days to finish something that Bodon's implementation could do in less than an hour (yes, the difference is this huge!).
Plus, Bodon's implementation uses a simple input format: one line for each transaction, with items separated by spaces.
If you want a fast Java implementation of FPGrowth or Apriori, have a look at my project SPMF. The FPGrowth implementation in SPMF beats the Weka implementation by up to two orders of magnitude on some datasets. For example, you can see this performance comparison:
http://www.philippe-fournier-viger.com/spmf/performance/chess_fpgrowth_spmf_vs_weka.png
This is the main project webpage:
http://www.philippe-fournier-viger.com/spmf/index.php
Moreover, note that SPMF offers more than 50 algorithms for itemset mining, association rule mining, sequential pattern mining, etc. The GUI version of SPMF also supports the ARFF format used by Weka.