I have a binary classification problem that I am solving with an SVM. The classes are unbalanced in the training data. I now need to get posterior probability outputs, not just a binary score. I tried to use Platt scaling with both Weka's SMO and LibSVM. For both of these implementations I get results which, in terms of F1-measure for the minority class, are worse than when I generated only binary results.
Do you know of a way to transform SVM binary results into probabilities which preserves the following rule:
"prob >= 0.5 if and only if decision value >= 0"?
Meaning that the label each sample gets is the same whether using binary classification or probabilities.
SVMs can be set up so that they output class membership probabilities. You should look at the documentation of your toolkit to learn how to enable this.
For example, scikit-learn:
When the constructor option probability is set to True, class
membership probability estimates (from the methods predict_proba and
predict_log_proba) are enabled.
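A minimal sketch of enabling this in scikit-learn, assuming a toy imbalanced dataset built with make_classification:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

clf = SVC(probability=True)  # fits Platt scaling via internal cross-validation
clf.fit(X, y)

print(clf.predict_proba(X[:5]))      # class membership probabilities
print(clf.decision_function(X[:5]))  # raw decision values, for comparison

One caveat relevant to your rule: because the sigmoid is fit with internal cross-validation, scikit-learn documents that predict_proba may be inconsistent with decision_function near the boundary, so "prob >= 0.5 if and only if decision value >= 0" is not guaranteed to hold.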
I have used the below hyperparameters to train the model:
rcf.set_hyperparameters(
    num_samples_per_tree=200,
    num_trees=250,
    feature_dim=1,
    eval_metrics=["accuracy", "precision_recall_fscore"])
Is there a best way to choose the num_samples_per_tree and num_trees parameters? What are the best values for both?
There are natural interpretations for these two hyper-parameters that can help you determine good starting approximations for HPO:
num_samples_per_tree -- the reciprocal of this value approximates the density of anomalies in your data set/stream. For example, if you set this to 200 then the assumption is that approximately 0.5% of the data is anomalous. Try exploring your dataset to make an educated estimate.
num_trees -- the more trees in your RCF model the less noise in scores. That is, if more trees are reporting that the input inference point is an anomaly then the point is much more likely to be an anomaly than if few trees suggest so.
The total number of points sampled from the input dataset is equal to num_samples_per_tree * num_trees. You should make sure that the input training set is at least this size.
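As a rough illustration (plain Python, not SageMaker API code; the 0.5% anomaly rate is an assumption you would replace with your own estimate):

estimated_anomaly_rate = 0.005                          # assume ~0.5% of points are anomalous
num_samples_per_tree = int(1 / estimated_anomaly_rate)  # -> 200
num_trees = 250                                         # more trees -> less noisy scores

# the training set should contain at least this many points:
min_training_size = num_samples_per_tree * num_trees    # 200 * 250 = 50,000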
(Disclosure - I helped create SageMaker Random Cut Forest)
I did research on multiple websites, but I couldn't find any solution.
Here's the problem:
I am implementing a pixel-wise classification using RTrees from OpenCV. I need the posterior probability for each class. I tried to get it via cv::ml::StatModel::predict(), but the output matrix only contains the predicted value. Is there another way to get the posterior probability from RTrees?
PS: I'm still quite new to Machine Learning, so please forgive my lack of knowledge ^^"
Instead of using cv::ml::StatModel::predict, you can use the cv::ml::RTrees::getVotes member function. This way, in the case of classification, you get the number of trees which voted for each class for a given sample. By dividing these vote counts by the forest size you get an approximation of the posterior probabilities.
getVotes should be called instead of predict, like this:
cv::Mat samples;  // one or multiple samples (their feature vectors), one row per sample
cv::Mat votes;
classifier.getVotes(samples, votes, 0);  // pass 0 unless you need to manipulate RTrees flags
// each row of votes holds per-class vote counts; dividing a row by the
// number of trees approximates the posterior probabilities
Be aware that the votes matrix will have one more row than the number of samples: the first row enumerates the class labels (in ascending order, if I remember the OpenCV source code correctly).
The answer is up to date as of the 3.4.1 version of OpenCV.
With respect to semantic segmentation, it seems to me that there are multiple ways to produce the final pixel-wise labeling, such as softmax, sigmoid, logistic regression, or other classical classification methods.
However, for the softmax approach, we need to ensure that the output map resulting from the network architecture has multiple channels, with the number of channels matching the number of classes. For instance, for a two-class problem, mask and non-mask, we would use two channels. Is this right?
Moreover, each channel in the output map can be treated as a probability map for a given class. Is this understanding right?
Yes to both questions. The goal of the softmax function is to transform the scores into probabilities so that you can maximize the probability of the true label.
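A small numpy sketch of that second point, assuming a (classes, height, width) score map, showing how softmax over the channel axis yields one probability map per class:

import numpy as np

scores = np.random.randn(2, 4, 4)           # 2 classes (mask / non-mask), toy 4x4 image
exp = np.exp(scores - scores.max(axis=0))   # subtract the per-pixel max for numerical stability
probs = exp / exp.sum(axis=0)               # one probability map per class/channel

assert np.allclose(probs.sum(axis=0), 1.0)  # per-pixel probabilities sum to 1
labels = probs.argmax(axis=0)               # final pixel-wise labeling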
I am a frequent user of scikit-learn, and I want some insights about the class_weight parameter with SGD.
I was able to trace the code down to the function call
plain_sgd(coef, intercept, est.loss_function,
          penalty_type, alpha, C, est.l1_ratio,
          dataset, n_iter, int(est.fit_intercept),
          int(est.verbose), int(est.shuffle), est.random_state,
          pos_weight, neg_weight,
          learning_rate_type, est.eta0,
          est.power_t, est.t_, intercept_decay)
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/stochastic_gradient.py
After this it goes into sgd_fast, and I am not very good with Cython. Can you give some clarity on these questions?
I have a class imbalance in the dev set, where the positive class has around 15k samples and the negative class 36k. Will class_weight resolve this problem, or would undersampling be a better idea? I am getting better numbers, but it's hard to explain why.
If yes, how does it actually do it? I mean, is it applied to the feature penalization, or is it a weight on the optimization function? How can I explain this to a layman?
class_weight can indeed help increase the ROC AUC or F1-score of a classification model trained on imbalanced data.
You can try class_weight="auto" to select weights that are inversely proportional to the class frequencies. You can also pass your own weights as a Python dictionary with class labels as keys and weights as values.
Tuning the weights can be achieved via grid search with cross-validation.
Internally this is done by deriving sample_weight from the class_weight (depending on the class label of each sample). The sample weights are then used to scale the contribution of individual samples to the loss function used to train the linear classification model with Stochastic Gradient Descent.
The feature penalization is controlled independently via the penalty and alpha hyperparameters. sample_weight / class_weight have no impact on it.
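A minimal sketch of both options, assuming a toy imbalanced dataset (note that in recent scikit-learn versions the string value is "balanced" rather than "auto"):

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)

# explicit per-class weights: upweight the positive class by roughly 36/15
clf = SGDClassifier(class_weight={0: 1.0, 1: 2.4}, random_state=0)
clf.fit(X, y)

# tuning the weights via grid search with cross-validation
grid = GridSearchCV(
    SGDClassifier(random_state=0),
    param_grid={"class_weight": [{0: 1.0, 1: w} for w in (1.0, 2.0, 2.4, 4.0)]},
    scoring="f1",
)
grid.fit(X, y)
print(grid.best_params_)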
I want to assign weights to the features of a data set before using the features in a classification algorithm like KNN or J48, but I don't know how to evaluate a weighted feature vector.
Does any of the classification algorithms accept weights as input, instead of just '0' and '1'?
In particular, are any of Weka's ready-made classifiers capable of working with weights (not 0 and 1 as filters)?
In most situations, you can just scale the data set according to your weights. This is trivial to prove for Minkowski distances such as Euclidean distance.
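A small sketch of the scaling trick, using scikit-learn's KNN for concreteness (the question is about Weka, but the trick is tool-agnostic); the weights here are hypothetical:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 10.0], [2.0, 20.0], [1.5, 15.0]])
y = np.array([0, 1, 0])
w = np.array([2.0, 0.5])  # hypothetical per-feature weights

# Euclidean distance on the scaled data X * w equals the weighted Euclidean
# distance sqrt(sum_j w_j^2 * (x_j - y_j)^2) on the original data
knn = KNeighborsClassifier(n_neighbors=1).fit(X * w, y)
print(knn.predict(np.array([[1.2, 12.0]]) * w))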
Not all of Weka's classification algorithms support weights, but some do.
You need to set the weight information after loading your dataset; see the example code in the Weka wiki. I remember that Weka's J48 decision tree supports weights in the developer version, but I cannot find a reference. There exists a patch, though.
This search for feature weights in the Weka wiki may help.
I suggest trying to add weights to the data set and training on your data.