1) Applying MultinomialNaiveBayes (but no other classifier) in Weka raises the exception "problem evaluating classifier: Numeric attribute values must all be greater or equal to zero". How can I fix it?
2) Is dimensionality reduction (PCA, LSI, random projection) an alternative to feature selection (InformationGain, ChiSqr), or do we need to apply both? I have seen conflicting opinions about this on the internet.
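For intuition on the first question, here is a minimal scikit-learn sketch of the same constraint (not the original Weka setup; the data is made up): multinomial naive Bayes models event counts, so it rejects negative feature values, and rescaling every feature into a non-negative range removes the error.

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Toy features containing negative values, as standardized data often does.
X = np.array([[-1.2, 0.5], [0.3, -0.7], [1.1, 2.0], [-0.4, 0.9]])
y = np.array([0, 1, 0, 1])

# MultinomialNB().fit(X, y) would raise a ValueError about negative values.
# Rescaling every feature into [0, 1] avoids it:
X_nonneg = MinMaxScaler().fit_transform(X)
MultinomialNB().fit(X_nonneg, y)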
I want to know the meaning of the question mark in the === Detailed Accuracy By Class === section of the Classifier Output in WEKA. My dataset is the Fertility dataset. Does this question mark influence the tree?
Thank you.
A ? in the output means that the result (value) is mathematically undefined. This might arise from a division by zero, for instance.
Hope it helps.
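To make the division-by-zero case concrete, a small sketch (the counts are invented): precision is TP / (TP + FP), so for a class that the classifier never predicts, TP = FP = 0 and 0/0 is undefined, which Weka prints as ?.

import numpy as np

tp, fp = 0, 0  # class never predicted: no true or false positives
with np.errstate(invalid='ignore'):
    precision = np.float64(tp) / (tp + fp)  # 0/0 evaluates to nan
print(precision)  # nan -- rendered as ? in Weka's output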
As written by @MWiesner, ? is the Weka representation of NaN.
Just to add to his answer: this means that you probably have a division by zero when calculating some evaluation metrics.
This means that your dataset doesn't have enough samples to perform a reliable classification (at least for the O class). If there were enough data, you would have to be really unlucky (or your features would have to be really poor) to get a ? result.
So my suggestion here is to add more instances. If you are working with image classification, for example, try data augmentation, e.g. through rotation, as sketched below.
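A minimal sketch of that rotation-based augmentation using scipy (the image array is a stand-in for a real training image):

import numpy as np
from scipy.ndimage import rotate

image = np.random.rand(64, 64)  # stand-in for a real training image

# Create extra training instances by rotating the original image.
augmented = [rotate(image, angle, reshape=False) for angle in (90, 180, 270)]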
I'm new to data mining and Weka. Using the GUI, I built a J48 classifier (evaluated on the training set) for an attribute of interest with five levels. I have to evaluate the precision of the model, but I don't really know how to do it. Some information that may be of interest:
=== Detailed Accuracy By Class ===
Precision
0.80
?
0.67
0.56
?
?
First, I would like to know the meaning of the "?" in the precision column. When I tried an attribute of interest with two levels, I got no "?". The tree is also bigger now than when dividing into two levels. I wonder whether this means that an attribute of interest with five levels generates a less efficient tree in terms of classification and computation time. This seems fairly plausible, as the number of Correctly Classified Instances was up to 72% when the attribute had two levels.
Thank you in advance, all interesting answers will be rewarded!
"I would like to know the meaning of the "?" in the precision column"
Note that for these same classes the TP and FP rates are 0. It appears that J48 has not assigned any of your observations to these classes.
Are these classes relatively small? If so, you might want to consider using the ClassBalancer filter. This will use weights to make all classes look the same size.
Of course, after you get the model you need to "convert back" to the real situation. This is similar to correcting for physically oversampling or undersampling. See my answer here: https://stats.stackexchange.com/questions/211174/how-to-exact-prediction-from-over-sampled-dataundoing-oversampling/257507#257507
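For intuition, a small sketch of the reweighting idea behind ClassBalancer, assuming (as the filter's description states) that every class ends up with the same total instance weight; the labels below are invented:

import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # imbalanced toy labels
classes, counts = np.unique(y, return_counts=True)

# Give every class the same total weight (n_instances / n_classes),
# spread evenly over that class's instances.
total_per_class = len(y) / len(classes)
weights = np.array([total_per_class / counts[classes == c][0] for c in y])
print(weights)  # minority-class instances receive larger weights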
I am using KNIME to run the WEKA node AttributeSelectedClassifier.
But I keep getting an exception claiming that my attribute is nominal and has duplicate values.
But it is numeric, and duplicate values are entirely expected in the dataset!
AttributeSelectedClassifier - How to deal with error "A nominal attribute (likes) cannot have duplicate labels ('(0.045455-0.045455]')"
I found similar topics to this one, but none of them covers how to choose the scalar to scale the values with.
1st question: I would be happy if someone could explain this behavior. Why are duplicate values bad?
Anyway, one of the threads on a similar topic recommended scaling the values by a large enough number (a scalar).
Based on that, I multiplied the values by 10^6 and got an error about this value: 27027.027027-27027.027027
I multiplied by 10^7 and then got an error about this value: 270270.27027-270270.27027
When I multiplied by 10^8, it succeeded.
2nd question: what is the right way to deal with this, and how can I programmatically choose the scalar to scale with? (One possible heuristic is sketched after the error below.)
The full error:
ERROR AttributeSelectedClassifier - Execute failed: IllegalArgumentException in Weka during training. Please verify your settings. A nominal attribute (Meanlikes) cannot have duplicate labels ('(0.045455-0.045455]').
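The thread gives no answer, but the error message hints at what is going on: the duplicate bin labels appear when two discretization boundaries render identically at the roughly six decimal places Weka uses in the labels. One possible heuristic is therefore to scale until the smallest gap between distinct values is safely above that label precision. Everything below is an illustrative sketch, not a Weka API:

import numpy as np

def choose_scalar(values, label_precision=1e-6):
    # Pick the smallest power of 10 such that distinct values remain
    # distinct when bin labels are rounded to ~6 decimal places.
    distinct = np.unique(values)
    min_gap = np.min(np.diff(distinct))
    scalar = 1.0
    while min_gap * scalar <= label_precision:
        scalar *= 10.0
    return scalar

# Hypothetical usage: rescale the column before handing it to Weka.
values = np.array([0.045455, 0.0454551, 0.27027027, 0.27027030])
print(choose_scalar(values))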
I want to do classification in Weka. I am using some methods (Random Tree, Random Forest, Decision Table, RandomSubspace, ...), but they give results like the ones below.
=== Cross-validation ===
=== Summary ===
Correlation coefficient 0.1678
Mean absolute error 0.4832
Root mean squared error 0.4931
Relative absolute error 96.6501 %
Root relative squared error 98.6323 %
Total Number of Instances 100000
However, I want the results as accuracy and a confusion matrix. How can I get results like that?
Note: when I use a small dataset, it gives the results as a confusion matrix. Can this be related to the size of the dataset?
The output of training/testing in Weka depends on the type of the attribute that you are trying to predict. If the attribute is nominal, you will get a confusion matrix and an accuracy value. If it is numeric, you will get a correlation coefficient.
In the small and large datasets that you mention, what is the type of the attribute you are predicting?
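To illustrate with a made-up ARFF header (not taken from the question): the declared type of the class attribute decides which evaluation Weka prints.

@relation example

@attribute feature1 numeric
% Nominal class -> accuracy and a confusion matrix:
@attribute class {yes,no}
% A numeric class would instead yield regression statistics
% such as the correlation coefficient:
% @attribute class numeric

@data
1.5,yes
2.0,no

If the class arrived as numeric but actually encodes categories, Weka's NumericToNominal filter can convert it before training.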
I have run a 2-class problem using J48 and RandomForest with 100000 instances and the confusion matrix appeared correctly. I additionally increased the problem complexity to run 20 different classes and the confusion matrix appeared correctly as well.
Under 'More options...', please ensure that 'Output confusion matrix' is checked and see if this resolves the issue.
Use case: selecting the "optimal threshold" for a logistic model built with statsmodels' Logit to predict, say, binary classes (or multinomial, but integer classes).
To select a threshold for a (say, logistic) model in Python, is there something built in? For small datasets, I remember optimizing the "threshold" by picking the one with the maximum buckets of correctly predicted labels (true "0" and true "1"), best seen from the graph here:
http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
I also know intuitively that if I set alpha values, it should give me a "threshold" that I can use below. How should I compute the threshold given a reduced model whose variables are all significant at 95% confidence? Obviously, setting threshold > 0.5 -> "1" would be too naive, and since I am looking at 95% confidence this threshold should be "smaller", meaning p > 0.2 or something.
This would then give something like a range of "critical values", where the label should be "1" if the probability exceeds the threshold and "0" otherwise.
What I want is something like this:
import numpy as np
import statsmodels.api as smf  # Logit lives in statsmodels.api
from scipy.stats import ks_2samp

test_scores = smf.Logit(y_train, x_train, missing='drop').fit()
threshold = 0.2

# test_scores.predict(x_train, transform=False) gives the continuous
# probabilities, so to turn them into labels I compare them against a
# threshold (or use x_test when testing the model).
y_predicted_train = np.array(test_scores.predict(x_train, transform=False) > threshold, dtype=float)
table = np.histogram2d(y_train, y_predicted_train, bins=2)[0]
# will do the same on the "test" data

# crude way of selecting an optimal threshold
ks_2samp(y_train, y_predicted_train)
# (0.39963996399639962, 0.958989)
# must fail to reject the null at 95% here; keep modifying the threshold
# as above until that holds. y_train holds the REAL values and
# y_predicted_train the predictions on the TRAIN set (already thresholded above).
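Spelled out as a loop, the crude sweep described in the comments above might look like this (a sketch only; y_train, x_train, test_scores, and the imports are as above):

probs = test_scores.predict(x_train, transform=False)
# Sweep candidate thresholds and report those where the K-S test fails to
# reject, at the 5% level, that predictions and true labels agree in distribution.
for threshold in np.arange(0.05, 0.95, 0.05):
    labels = (probs > threshold).astype(float)
    stat, pvalue = ks_2samp(y_train, labels)
    if pvalue >= 0.05:
        print(threshold, stat, pvalue)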
Questions:
1. How can I select the threshold in an objective way, i.e. reduce the percentage of misclassified labels? Say I care more about missing a "1" (a false negative) than about mispredicting a "0" as "1" (a false positive), and want to reduce that error. This I get from the ROC curve. roc_curve (in scikit-learn, not statsmodels) assumes that I have already done the labelling for the y_predicted class, and I am just revalidating this over the test set (point me to it if my understanding is incorrect). I also think that using the confusion matrix alone will not solve the threshold-picking problem.
2. Which brings me to: how should I consume the output of these built-in functions (oob, confusion_matrix) to select the optimal threshold (first on the train sample, and then fine-tune it over the test and cross-validation samples)? (A sketch follows this list.)
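Not part of the original question, but one objective recipe, sketched with scikit-learn's roc_curve (which, for what it's worth, consumes the continuous probabilities rather than pre-thresholded labels) and Youden's J statistic, J = TPR - FPR; upweighting TPR encodes the asymmetric cost from question 1:

import numpy as np
from sklearn.metrics import roc_curve

# Continuous predicted probabilities on the training data, as above.
probs = test_scores.predict(x_train, transform=False)
fpr, tpr, thresholds = roc_curve(y_train, probs)

# Youden's J = TPR - FPR peaks at the threshold that best separates the
# classes; use e.g. 2*tpr - fpr to penalize missed "1"s more heavily.
j = tpr - fpr
best_threshold = thresholds[np.argmax(j)]
print(best_threshold)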
I also looked up the official documentation of the K-S test in scipy here:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
Related:
Statistics Tests (Kolmogorov and T-test) with Python and Rpy2