Weka: improve model TP Rate

[screenshot: J48 classifier output in Weka]
Hi,
I have a problem with my model in Weka (J48, cross-validation): many instances are classified incorrectly for the second class. Is there any way to improve it, or is that not realistic? I'm not an expert in Weka. Thank you in advance. My output is above.
With NaiveBayes the results look better, but the TP Rate is still < 0.5 for the second class.
[screenshot: NaiveBayes classifier output in Weka]

It is hard to reproduce your example with the given information. However, the solution is probably to turn your classifier into a cost-sensitive classifier:
https://weka.wikispaces.com/CostSensitiveClassifier
What it does is assign a higher cost to misclassifications of a certain class; in your case this would be the "True" class.
You can also simulate such an algorithm by oversampling your positive examples. That is, if you have n positive examples, you sample k*n positive examples while keeping your negative examples as they are. You could also simply duplicate the positive examples.
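If you want to prototype the oversampling idea outside Weka, here is a minimal sketch in Python/NumPy (the data, labels and the factor k are made-up placeholders, not taken from your setup):
import numpy as np
rng = np.random.default_rng(0)
# Toy data: few positive ("True") examples, many negatives
X = rng.normal(size=(100, 5))
y = np.array([1] * 10 + [0] * 90)
k = 5  # oversampling factor for the positive class
pos_idx = np.flatnonzero(y == 1)
extra = np.repeat(pos_idx, k - 1)          # duplicate each positive k-1 extra times
X_over = np.vstack([X, X[extra]])
y_over = np.concatenate([y, y[extra]])
print(y.mean(), y_over.mean())             # class balance before and after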

Related

Weka: can I train a model to minimize or maximize an input value?

Is it possible in Weka to train a model minimizing a cost factor?
I have a data set containing a cost factor in each sample. It defines what using this sample would cost. Now, I would like to select as many of the samples as possible while minimizing this cost factor.
E.g., with a multilayer perceptron, I want to train the network in such a way that it chooses as many samples as possible while minimizing the sum of the cost factors.
I've checked all the model options and also searched the package manager for something like that, but I was unable to find anything. Could someone tell me whether this can be done using Weka?
What you are describing sounds more like an optimization problem than a classification or regression problem (for which you would use a Weka classifier).
Weka does have some limited support for optimization through its abstract weka.core.Optimization class (e.g., used internally by weka.classifiers.functions.Logistic). But that requires implementing some methods.
To cast your net wider, you might want to take a look at the following article that describes various optimization techniques:
https://machinelearningmastery.com/tour-of-optimization-algorithms/
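To make the "optimization, not classification" framing concrete, here is a toy sketch in Python (all numbers invented): treat the task as "select as many samples as possible under a total cost budget" and solve it greedily.
import numpy as np
rng = np.random.default_rng(0)
costs = rng.uniform(1.0, 10.0, size=50)    # per-sample cost factor
budget = 100.0                             # assumed total budget
order = np.argsort(costs)                  # cheapest samples first
cumulative = np.cumsum(costs[order])
selected = order[cumulative <= budget]     # greedy choice maximizes the count under the budget
print(len(selected), costs[selected].sum())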

What is the meaning of the "question mark" in classifier output in WEKA?

I want to know the meaning of the "question mark" in === Detailed Accuracy By Class === when I use the Classifier Output in WEKA. My dataset is the Fertility Dataset. Does this "question mark" influence the tree?
Thank you.
A ? in the output means that the result (value) is mathematically undefined. This might arise from a division by zero, for instance.
Source: See here.
Hope it helps.
As written by @MWiesner, ? is the Weka representation of NaN.
Just to add to his answer: this means that you probably have a division by zero when calculating some evaluation metrics.
This means that your dataset doesn't have enough samples to perform a reliable classification (at least for the O class). If there were enough data, you would have to be really unlucky (or your features would have to be really poor) to get a ? result.
So my suggestion here is to add more instances. If you are working with image classification, for example, try data augmentation, e.g. through rotation.
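A tiny illustration of where that division by zero comes from (a sketch, not Weka's actual code): if the classifier never predicts class O, precision for O is TP / (TP + FP) = 0 / 0.
tp, fp = 0, 0   # no instances predicted as class O
precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
print(precision)  # nan, which Weka renders as "?"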

Interpreting results using J48 for a divided attribute of interest in x levels (WEKA)

I'm new to data mining and Weka. I built a classifier with J48 in Weka using the GUI, with J48 (training set) for an attribute of interest in five levels. I have to evaluate the precision of the model, but I don't know very well how to do it! Some information may be of interest:
=== Detailed Accuracy By Class ===
Precision: 0.80, ?, 0.67, 0.56, ?, ?
First, I would like to know the meaning of the "?" in the precision column. When I used an attribute of interest with two levels I got no "?". The tree is bigger now than when dividing into two levels. I wonder whether this means that using an attribute of interest with five levels could generate a less efficient tree in terms of classification and computation time. This seems quite obvious, as the proportion of Correctly Classified Instances was up to 72% when the attribute had two levels.
Thank you in advance, all interesting answers will be rewarded!
"I would like to know the meaning of the "?" in the precision column"
Note that for these same classes the TP and FP rates are 0. It appears that J48 has not assigned any of your observations to these classes.
Are these classes relatively small? If so, you might want to consider using the ClassBalancer filter. This will use weights to make all classes look the same size.
Of course, after you get the model you need to "convert back" to the real situation. This is similar to correcting for physical oversampling or undersampling. See my answer here: https://stats.stackexchange.com/questions/211174/how-to-exact-prediction-from-over-sampled-dataundoing-oversampling/257507#257507
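For intuition, this is roughly what such a balancing filter does, sketched in Python (the class labels are invented): each class is reweighted so it contributes the same total weight.
import numpy as np
y = np.array(["A"] * 80 + ["B"] * 20)
classes, counts = np.unique(y, return_counts=True)
per_class = len(y) / (len(classes) * counts)    # same formula sklearn's class_weight="balanced" uses
w = per_class[np.searchsorted(classes, y)]      # weight assigned to each instance
print(dict(zip(classes, per_class)))            # A: 0.625, B: 2.5 -> equal total weight per class
print(w.sum())                                  # total weight stays len(y)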

Random Forest with more features than data points

I am trying to predict whether a particular service ticket raised by client needs a code change.
I have training data.
I have around 17k data points with problem description and tag (Y for code change required and N for no code change)
I did TF-IDF and it gave me 27k features. So I tried to fit RandomForestClassifier (sklearn python) with this 17k x 27k matrix.
I am getting very low scores on test set while training accuracy is very high.
Precision on train set: 89%
Precision on test set: 21%
Can someone suggest any workarounds?
I am using this model now:
sklearn.ensemble.RandomForestClassifier(n_jobs=3, n_estimators=100, class_weight='balanced', max_features=None, oob_score=True)
Please help!
EDIT:
I have 11k training data points with 900 positives (skewed). I tried LinearSVC-based sparsification, but it didn't work as well as TruncatedSVD (latent semantic indexing). max_features=None performs better on the test set than the default.
I have also tried SVM, logistic regression (L2 and L1), and ExtraTrees. RandomForest is still working best.
Right now I am getting 92% precision on positives, but recall is only 3%.
Any other suggestions would be appreciated!
Update:
Feature engineering helped a lot. I pulled features out of the air (length in characters, length in words, their difference and ratio, day of week the problem was reported, day of month, etc.) and now I am at 19-20% recall with >95% accuracy.
Any thoughts on using averaged word2vec vectors as deep features for the free text instead of TF-IDF or bag of words?
[edited]
Random forest handles more features than data points quite well. RF is used, for example, in micro-array studies with roughly a 100:5000 data point/feature ratio, or in single-nucleotide polymorphism (SNP) studies with ratios around 5000:500,000.
I do disagree with the diagnosis provided by @ncfirth, but the suggested treatment of variable selection may help anyway.
Your default random forest is not badly overfitted. It is just not meaningful to pay attention to the non-cross-validated training set performance of an RF model, because any sample will end up in the terminal nodes/leaves it has itself defined. The overall ensemble model is still robust.
[edit] If you changed max_depth or min_samples_split, the training precision would probably drop, but that is not the point. The non-cross-validated training error/precision of a random forest model, or of many other ensemble models, simply does not estimate anything useful.
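A small demonstration of that point with scikit-learn (purely synthetic noise data, so the numbers are illustrative only): training accuracy looks excellent even when there is nothing to learn, while the out-of-bag estimate stays at chance level.
from sklearn.ensemble import RandomForestClassifier
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = rng.integers(0, 2, size=300)           # labels unrelated to X (pure noise)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print(rf.score(X, y))                      # near 1.0 on the training set
print(rf.oob_score_)                       # near 0.5, i.e. chance level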
[Before the edit I confused max_features with n_estimators; sorry, I mostly use R.]
Setting max_features=None is not random forest, but rather 'bagged trees'. You may benefit from a somewhat lower max_features, which improves regularization and speed, or maybe not. I would try lowering max_features to somewhere between 27000/3 and sqrt(27000), the typical optimal range.
You may achieve better test set prediction performance by feature selection. You can run one RF model, keep the top ~5-50% most important features, and then re-run the model with fewer features. "L1 lasso" variable selection, as ncfirth suggests, may also be a viable solution.
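A sketch of that importance-based selection in scikit-learn (the data and the 10% cutoff are placeholders, not a recommendation for your specific matrix):
from sklearn.ensemble import RandomForestClassifier
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000))
y = rng.integers(0, 2, size=500)
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0).fit(X, y)
keep = np.argsort(rf.feature_importances_)[-int(0.10 * X.shape[1]):]   # keep top 10% of features
rf_small = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0).fit(X[:, keep], y)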
Your metric of prediction performance, precision, may not be optimal in the case of unbalanced data or if the costs of false negatives and false positives are quite different.
If your test set is still predicted much worse than the out-of-bag cross-validated training set, you may have problems with the i.i.d. assumptions that any supervised ML model relies on, or you may need to wrap the entire data processing in an outer cross-validation loop to avoid over-optimistic estimation of prediction performance due to, e.g., the variable selection step.
It seems like you've overfit on your training set. Basically the model has learnt the noise in the data rather than the signal. There are a few ways to combat this, but it seems fairly obvious that your model has overfit because of the incredibly large number of features you're feeding it.
EDIT:
It seems I was perhaps too quick to jump to the conclusion of overfitting, however this may still be the case (left as an exercise to the reader!). However feature selection may still improve the generalisability and reliability of your model.
A good place to start for removing features in scikit-learn would be here. Using sparsity is a fairly common way to perform feature selection:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
import numpy as np
# Create some data
X = np.random.random((1800, 2700))
# Boolean labels as the y vector
y = np.random.random(1800)
y = y > 0.5
y = y.astype(bool)
lsvc = LinearSVC(C=0.05, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print(X_new.shape)
Which returns a new matrix of shape (1800, 640). You can tune the number of features selected by altering the C parameter (called the penalty parameter in scikit-learn but sometimes called the sparsity parameter).

Data mining with Weka

I am learning how to do data mining and I am using this data set from UCI's website.
http://archive.ics.uci.edu/ml/datasets/Forest+Fires
The problem I am encountering is how to deal with the area class. My understanding from the description is that I need to apply ln(x+1) to area using AddExpression.
Am I going in the correct direction with this? Or are there other filters I should investigate? Thank you.
I will try to answer your question based on the little information you provide. I haven't worked with the forest-fires data set, but by inspection I see that the target attribute "area" often has the value 0. You probably can't simply filter out the rows with area = 0; your dataset might become too small.
I think you are asked to perform regression of some attribute(s) against "log(area)" in order to linearize it. However, when you try to calculate the log of the area, values such as log(0) are a problem. Values between 0 and 1 might also be problematic.
So a common fix is to add 1 to the value of "area". This introduces a small systematic error, but it removes all zero values, and you can still derive useful models from your log(x+1)-transformed dataset.
And yes, in Weka you do this in the "Preprocess" tab with the AddExpression filter, using a log(x+1) expression on the area attribute. This creates a new attribute. Then you can remove the old area attribute.
Of course, in interpreting your model, you should be aware of the transformation. If you just want to find out what the significant independent attributes are in your linear regression model, I'd say the transformation does not matter. The data points are just shifted a little bit.
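If you ever want to do the same transform outside Weka, a minimal Python/pandas equivalent looks like this (the file name and the "area" column name are assumptions about the UCI CSV):
import numpy as np
import pandas as pd
df = pd.read_csv("forestfires.csv")        # assumed file name for the UCI forest-fires data
df["log_area"] = np.log1p(df["area"])      # ln(area + 1); area = 0 stays 0
df = df.drop(columns=["area"])             # optionally remove the original attribute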