what is the correct way to apply a feature selection method to an imbalanced dataset? - weka

I am new to data science & machine learning, so I'll write my question in detail.
I have an imbalanced dataset (a binary classification dataset), and I want to apply these methods using the Weka platform:
10-Fold cross validation.
Oversampling to balance the data.
A Wrapper feature selection method.
6 classifiers and compare between their performance.
I want to apply them under these conditions:
Balancing the data before applying a feature selection method (reference).
Balancing the data during cross validation (reference).
What is the correct procedure?
I've written a post below with a suggested procedure.

Is this procedure correct?
Firstly, using a feature selection method to reduce the number of features:
From Preprocess tab: Balancing the entire dataset.
From Select attributes tab: Applying a feature selection method to the balanced dataset.
From Preprocess tab: Removing the unselected attributes (resulting from step #2) from the original imbalanced dataset and saving the new copy of the dataset in order to use it in the following steps.
Then, applying cross validation and balancing methods to the new copy of the dataset:
From Classify tab: Choosing the 10-fold cross validation.
Choosing FilteredClassifier and editing its properties:
classifier: selecting the classifier (one by one).
filter: Resample.
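The point of using FilteredClassifier with a resampling filter is that the balancing is re-applied inside each training fold only, never to the held-out fold. As a language-agnostic illustration of that logic (plain Python with simplified stand-ins for the fold splitting and for Weka's Resample filter, not Weka's actual code):

```python
import random

def kfold_indices(n, k, seed=42):
    """Split indices 0..n-1 into k folds after one up-front shuffle."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def oversample(X, y, seed=42):
    """Randomly duplicate minority-class rows until the classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        rows = rows + [rng.choice(rows) for _ in range(target - len(rows))]
        Xb.extend(rows)
        yb.extend([label] * len(rows))
    return Xb, yb

# Imbalanced toy data: 6 negatives, 4 positives.
X = [[i] for i in range(10)]
y = [0] * 6 + [1] * 4

for fold in kfold_indices(len(X), 5):
    test_set = set(fold)
    X_tr = [X[i] for i in range(len(X)) if i not in test_set]
    y_tr = [y[i] for i in range(len(X)) if i not in test_set]
    # Balance ONLY the training part; the test fold keeps its natural class ratio.
    X_bal, y_bal = oversample(X_tr, y_tr)
    assert y_bal.count(0) == y_bal.count(1)
```

This mirrors what the meta-classifier does during cross validation: the filter sees each fold's training data only, so the evaluation on the held-out fold stays honest.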

Related

How can I change the order of the attributes in Weka?

I was doing a machine learning task in Weka and the dataset has 486 attributes. So, I wanted to do attribute selection using chi-square, and it gives me ranked attributes like below:
Now, I also have a testing dataset, and I have to make it compatible. But how can I reorder the test attributes in the same manner so that they are compatible with the train set?
Changing the order of attributes (e.g., when using the Ranker in conjunction with an attribute evaluator) will probably not have much influence on the performance of your classifier model (since all the attributes will stay in the dataset). Removing attributes, on the other hand, will more likely have an impact (for that, use subset evaluators).
If you want the ordering to get applied to the test set as well, then simply define your attribute selection search and evaluation schemes in the AttributeSelectedClassifier meta-classifier, instead of using the Attribute selection panel (that panel is more for exploration).
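Conceptually, making the test file compatible is just a column permutation keyed on the train set's attribute names (Weka's unsupervised Reorder filter can do this for you). A minimal plain-Python sketch of the idea, with a hypothetical helper name:

```python
def reorder_columns(test_header, test_rows, train_header):
    """Reorder test-set columns so they match the train-set attribute order.

    Raises KeyError if a train-set attribute is missing from the test set.
    """
    pos = {name: i for i, name in enumerate(test_header)}
    order = [pos[name] for name in train_header]
    return train_header[:], [[row[i] for i in order] for row in test_rows]

train_header = ["age", "income", "class"]
test_header = ["income", "class", "age"]
test_rows = [[50000, "yes", 30], [62000, "no", 41]]

header, rows = reorder_columns(test_header, test_rows, train_header)
assert header == train_header
assert rows == [[30, 50000, "yes"], [41, 62000, "no"]]
```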

how to classify using j48 weka with information gain and random attribute selection?

I know that the J48 decision tree uses gain ratio to select the attribute for making the tree.
But I want to use information gain and random selection instead of gain ratio. In the Select attributes tab in the Weka Explorer, I choose InfoGainAttributeEval and press the Start button. After that I see the list of attributes sorted by information gain. But I don't know how to use this list to run J48 in Weka. Moreover, I don't know how to select attributes randomly in J48.
Please help me if you can.
If you want to perform feature selection on the data before running the algorithm you have two options:
In the Classify tab use AttributeSelectedClassifier (under the meta folder). There you can configure the feature selection algorithm you want. (The default is J48 with CfsSubsetEval).
In the Preprocess tab, find and apply the AttributeSelection filter (located in the supervised.attribute folder). The default here is also the CfsSubsetEval algorithm.
Notice that the first method will apply the algorithm only to the training set when you evaluate the classifier, while the second method will use the entire dataset and will remove the features that were not selected (you can use Undo to bring them back).
Notice that the way J48 selects features during the training process will remain the same. To change it you need to implement your own algorithm or change the current implementation.
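For reference, the score that InfoGainAttributeEval ranks by is plain information gain, H(class) − H(class | attribute). A small stdlib Python sketch of that computation (illustrative only, not Weka's implementation):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """H(class) - H(class | attribute) for a nominal attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data: 'outlook' perfectly predicts the class, 'windy' does not.
labels  = ["yes", "yes", "no", "no"]
outlook = ["sun", "sun", "rain", "rain"]
windy   = ["t", "f", "t", "f"]

assert info_gain(outlook, labels) == 1.0  # removes all uncertainty
assert info_gain(windy, labels) == 0.0    # leaves the entropy unchanged
```

Ranking attributes by this score is exactly what the sorted list in the Select attributes tab shows.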

improve weka classifier results

I have a dataset which consists of 27 attributes and 597 instances.
I want to classify it with the best results possible using Weka.
Which classifier is used is not important. The class attribute is nominal and the rest are numeric.
The best results so far were LWL (83.2215) and OneR (83.389). I used an attribute selection filter but the results did not improve, and no other classifier gives better results, even NN or SMO or the meta classifiers.
Any idea how to improve results on this dataset, knowing that there are no missing values and the data covers 597 patients gathered over three years?
Have you tried boosting or bagging? These generally can help improve results.
http://machinelearningmastery.com/improve-machine-learning-results-with-boosting-bagging-and-blending-ensemble-methods-in-weka/
Boosting
Boosting is an ensemble method that starts out with a base classifier
that is prepared on the training data. A second classifier is then
created behind it to focus on the instances in the training data that
the first classifier got wrong. The process continues to add
classifiers until a limit is reached in the number of models or
accuracy.
Boosting is provided in Weka in the AdaBoostM1 (adaptive boosting)
algorithm.
Click “Add new…” in the “Algorithms” section. Click the “Choose”
button. Click “AdaBoostM1” under the “meta” selection. Click the
“Choose” button for the “classifier” and select “J48” under the “tree”
section and click the “choose” button. Click the “OK” button on the
“AdaBoostM1” configuration.
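To see what the reweighting described above actually does, here is a deliberately minimal AdaBoost-style sketch in plain Python with one-dimensional decision stumps (an illustration of the idea, not Weka's AdaBoostM1 code):

```python
from math import exp, log

def stump_train(xs, ys, w):
    """Pick the threshold/direction minimizing weighted error on 1-D data."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if (sign if xi >= t else -sign) != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, rounds=5):
    n = len(xs)
    w = [1.0 / n] * n          # start with uniform instance weights
    ensemble = []
    for _ in range(rounds):
        err, t, sign = stump_train(xs, ys, w)
        if err >= 0.5 or err == 0:  # stop if the stump is useless or perfect
            if err == 0:
                ensemble.append((1.0, t, sign))
            break
        alpha = 0.5 * log((1 - err) / err)
        ensemble.append((alpha, t, sign))
        # Reweight: misclassified instances get MORE weight next round.
        w = [wi * exp(-alpha * yi * (sign if xi >= t else -sign))
             for xi, yi, wi in zip(xs, ys, w)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6]
ys = [-1, -1, -1, 1, 1, 1]
model = adaboost(xs, ys)
assert all(predict(model, x) == y for x, y in zip(xs, ys))
```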
Bagging
Bagging (Bootstrap Aggregating) is an ensemble method that creates
separate samples of the training dataset and creates a classifier for
each sample. The results of these multiple classifiers are then
combined (such as averaged or majority voting). The trick is that each
sample of the training dataset is different, giving each classifier
that is trained, a subtly different focus and perspective on the
problem.
Click “Add new…” in the “Algorithms” section. Click the “Choose”
button. Click “Bagging” under the “meta” selection. Click the “Choose”
button for the “classifier” and select “J48” under the “tree” section
and click the “choose” button. Click the “OK” button on the “Bagging”
configuration.
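The bagging recipe above can be sketched in a few lines of plain Python; the "classifier" here is a trivial majority-class predictor just to keep the example self-contained (an illustration of bootstrap sampling plus voting, not Weka's Bagging implementation):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample len(data) items WITH replacement (one bootstrap replicate)."""
    return [rng.choice(data) for _ in data]

def majority_stump(sample):
    """A deliberately weak 'classifier': predicts the sample's majority class."""
    labels = [y for _, y in sample]
    return Counter(labels).most_common(1)[0][0]

def bag(data, n_models=15, seed=7):
    rng = random.Random(seed)
    # Train one model per bootstrap replicate of the training data.
    models = [majority_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]
    # The final prediction is a majority vote over the individual models.
    return Counter(models).most_common(1)[0][0]

data = [(x, "a") for x in range(8)] + [(x, "b") for x in range(2)]
assert bag(data) == "a"
```

Each bootstrap replicate differs slightly, so each model gets a slightly different view of the problem; the vote smooths out their individual quirks.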
I tried Boosting and Bagging as #applecrusher has mentioned. It showed a little improvement in accuracy, but for the same data with scikit-learn I was getting much better accuracy. When I compared the code and the output at each step, I found that the train-test split function in scikit-learn was, by default, shuffling the data. When I shuffled the data for WEKA using Collections.shuffle(), I saw improved results. Give it a try.
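The shuffling point is easy to demonstrate: many exported datasets are sorted by class, so an unshuffled percentage split can leave the test portion with a single class. A stdlib Python sketch:

```python
import random

# Toy data sorted by class, as exported files often are.
data = [("x%d" % i, "neg") for i in range(50)] + \
       [("x%d" % i, "pos") for i in range(50)]

def split(data, train_frac=0.7):
    """Plain head/tail split with no shuffling."""
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]

# Without shuffling, the test portion contains only one class.
_, test_raw = split(data)
assert {y for _, y in test_raw} == {"pos"}

# After shuffling (what Collections.shuffle() did for the WEKA run),
# both classes appear in the test portion.
random.Random(0).shuffle(data)
_, test_shuf = split(data)
assert {y for _, y in test_shuf} == {"neg", "pos"}
```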

What is the effect of using filtered classifier over normal classifier in weka

I have used weka for text classification. First I used StringToWordVector filter and filtered data were used with SVM classifier (LibSVM) for cross validation. Later I have read a blog post here
It said that it is not suitable to apply the filter first and then perform cross validation; instead, it proposes using FilteredClassifier. The justification is:
Two weeks ago, I wrote a post on how to chain filters and classifiers in WEKA, in order to avoid misleading results when performing experiments with text collections. The issue was that, when using N Fold Cross Validation (CV) in your data, you should not apply the StringToWordVector (STWV) filter on the full data collection and then perform the CV evaluation on your data, because you would be using words that are present in your test subset (but not in your training subset) for each run.
I cannot understand the reason behind this. Does anyone know?
When you use the filter before N-fold cross validation, you are filtering every word that appears in every instance, regardless of whether it is a test instance or a train instance. At that point, the filter has no way to know whether an instance is a test instance or a train instance. So if you are using StringToWordVector with TFTransform or any similar operation, any word in the test instances may affect the transformed values. (Simply put, if you are implementing bag of words, then you would be taking the test instances into consideration too.) This is not acceptable, since the training parameters should not be affected by the testing data. So instead, you can do the filtering on the fly; that is what FilteredClassifier does.
To get an idea of how N-fold cross validation works, please refer to Rushdi Shams's answer to the following question. Please let me know if you understood it or not. Cheers..!!
Cross Validation in Weka
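The leakage described above is easy to reproduce outside Weka. In this plain-Python sketch (a toy stand-in for StringToWordVector, not its actual code), building the vocabulary from the full collection gives the model access to words that only occur in the test documents:

```python
def build_vocab(docs):
    """Collect the word set that a StringToWordVector-style filter would emit."""
    return {w for d in docs for w in d.split()}

def vectorize(doc, vocab):
    """Turn a document into a 0/1 presence vector over the vocabulary."""
    words = set(doc.split())
    return {w: int(w in words) for w in sorted(vocab)}

train_docs = ["cheap pills now", "meeting agenda attached"]
test_docs  = ["viagra discount offer"]

# WRONG: filter applied to the full collection before cross validation.
leaky_vocab = build_vocab(train_docs + test_docs)
# RIGHT: the vocabulary comes from the training fold only.
fair_vocab = build_vocab(train_docs)

# The leaky vocabulary contains words the model could only have
# learned by peeking at the test documents.
assert "viagra" in leaky_vocab and "viagra" not in fair_vocab
```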

what does the attribute selection in preprocess tab do in weka?

I can't seem to find out what the attribute selection filter in the Preprocess tab does. Could someone please explain it in simple language, as I'm new to Weka?
When I apply it to my dataset it seems to remove a couple of attributes, but I'm unsure why.
A real dataset may contain many attributes. Applying any data mining process to such a dataset (e.g. finding clusters, generating a classification model, ...) may take a very long time.
Instead, we can select a subset of attributes (dimensions), called the most discriminative attributes. These attributes describe the dataset almost as well with a smaller number of attributes, and this will speed up any process run on the data.
The attribute selection filter offers many different methods for selecting these attributes. One of them is CFS subset evaluation (CfsSubsetEval). This evaluator gives you the attributes that have a higher correlation with the class label, which makes them discriminative attributes.
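As a rough intuition for "correlation with the class label" (CFS itself uses symmetrical uncertainty and also penalizes redundancy between attributes, so this is a simplification): ranking attributes by the absolute Pearson correlation between each attribute and a 0/1-encoded class already separates informative from noisy attributes. A stdlib Python sketch:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Binary class encoded as 0/1; one informative and one noisy attribute.
label       = [0, 0, 0, 1, 1, 1]
informative = [1, 2, 1, 8, 9, 8]   # tracks the class closely
noisy       = [5, 1, 9, 5, 1, 9]   # same distribution in both classes

ranked = sorted(
    [("informative", abs(pearson(informative, label))),
     ("noisy", abs(pearson(noisy, label)))],
    key=lambda p: p[1], reverse=True)
assert ranked[0][0] == "informative"
assert ranked[1][1] < 0.1
```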