ML.NET - Multiclass Classification score values

I currently have a project to take large bits of text and classify them as types. This is similar to the sentiment sample provided by Microsoft, except it's multiclass instead of binary.
I have the code working just fine, and it will likely become stronger as we add data to it. However, I have hit a snag: I am unable to determine when the prediction simply doesn't know what to choose. For my project it is much more valuable to not know the answer than to get it wrong. I am not sure if that is even a thing in ML.NET. I was looking through the documentation, and the only thing I could find was the score value produced by the prediction. The problem is that I don't know what any of the score values mean. I know they are broken out per class, but the numeric values differ between algorithms. Does anyone have any insight into these values, or any advice on the "don't know" vs "guessing" issue?
Appreciate your time, thanks.

The scores are largely learner-specific; the only requirement is that they are monotonic (a higher score means a higher likelihood of the example belonging to that class).
But in ML.NET multiclass learners they are always between 0 and 1 and sum up to 1. You can think of the scores as 'predicted probabilities to belong to that class'.
Now to the question of how to take confidence into account. For a binary classification problem, I would have a standard recommendation: plot a precision-recall curve, and then instead of choosing one threshold on the score, choose two: one that gives a high-precision (potentially low-recall) positive, and another one that gives a high-precision (potentially low-recall) negative.
So:
    if (score > threshold1)
        return "positive";
    else if (score < threshold2)
        return "negative";
    else
        return "don't know";
For the multiclass case, you can employ the same procedure independently for each class. This way, you will have a per-class 'yes-no-maybe' answer.
You will have to deal with a potential for multiple 'yes', or other kinds of conflicts with this approach, but at least it gives an idea.
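As a rough illustration of the per-class 'yes-no-maybe' idea, here is a minimal Python sketch (not ML.NET code). The function name, the class names and the threshold values are all hypothetical; in practice the thresholds would come from each class's precision-recall curve. The sketch resolves conflicts in the simplest possible way: it only returns a label when exactly one class says 'yes'.

```python
def classify_with_abstain(scores, accept, reject):
    """Return (label, per_class_verdicts).

    scores: dict mapping class name -> score in [0, 1]
    accept: dict mapping class name -> score above which the class is a confident "yes"
    reject: dict mapping class name -> score below which the class is a confident "no"
    All names and thresholds here are illustrative assumptions.
    """
    verdicts = {}
    for label, score in scores.items():
        if score > accept[label]:
            verdicts[label] = "yes"
        elif score < reject[label]:
            verdicts[label] = "no"
        else:
            verdicts[label] = "maybe"

    confident = [label for label, verdict in verdicts.items() if verdict == "yes"]
    # Only a single confident "yes" is treated as an answer; zero or several
    # confident classes are reported as "don't know".
    label = confident[0] if len(confident) == 1 else "don't know"
    return label, verdicts
```

Whether several simultaneous 'yes' answers should mean "don't know" or "pick the highest score" is a design choice that depends on how costly a wrong answer is.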

Related

how to use tf-idf with Naive Bayes?

My search on this question turned up many links that propose a solution but don't say exactly how it is to be done. I have explored, for example, the following links:
Link 1
Link 2
Link 3
Link 4
etc.
Therefore, I am presenting my understanding as to how the Naive Bayes formula with tf-idf can be used here and it is as follows:
Naive-Bayes formula :
P(word|class)=(word_count_in_class + 1)/(total_words_in_class+total_unique_words_in_all_classes(basically vocabulary of words in the entire training set))
tf-idf weighting can be employed in the above formula as:
word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.
total_words_in_class : sum of (tf-idf weights of all the words belonging to that class)
total_unique_words_in_all_classes : as is.
This question has been posted multiple times on Stack Overflow, but nothing substantial has been answered so far. I want to know whether the way I am thinking about the problem, i.e. the implementation shown above, is correct. I need to know this because I am implementing Naive Bayes myself, without the help of any Python library that comes with built-in functions for Naive Bayes and tf-idf. What I actually want is to improve the accuracy (currently 30%) of the model, which uses a trained Naive Bayes classifier. So, if there are better ways to achieve good accuracy, suggestions are welcome.
Please advise. I am new to this domain.
It would be better if you actually gave us the exact features and class you would like to use, or at least give an example. Since none of those have been concretely given, I'll just assume the following is your problem:
You have a number of documents, each of which has a number of words.
You would like to classify documents into categories.
Your feature vector consists of all possible words in all documents, and has values of number of counts in each document.
Your Solution
The tf idf you gave is the following:
word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.
total_words_in_class : sum of (tf-idf weights of all the words belonging to that class)
Your approach sounds reasonable. The sum of all probabilities would sum to 1 independent of the tf-idf function, and the features would reflect tf-idf values. I would say this looks like a solid way to incorporate tf-idf into NB.
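A minimal pure-Python sketch of this tf-idf weighted Naive Bayes is shown below. It follows the formulas in the question (summed tf-idf weights in place of counts, plus add-one smoothing over the vocabulary). The function names, the smoothed idf variant, and the choice to score a new document by its raw token occurrences are assumptions made for illustration, not something the question specifies.

```python
import math
from collections import Counter, defaultdict

def tfidf_docs(docs):
    """docs: list of token lists. Returns one {word: tf-idf weight} dict per document."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumption): stays positive even for words in every document.
        weighted.append({w: c * (math.log((1 + n) / (1 + df[w])) + 1) for w, c in tf.items()})
    return weighted

def train(docs, labels):
    """Returns {class: (log_prior, {word: log P(word|class)})}."""
    weights = tfidf_docs(docs)
    vocab = {w for doc in docs for w in doc}
    class_word = defaultdict(lambda: defaultdict(float))  # summed tf-idf per (class, word)
    class_total = defaultdict(float)                      # summed tf-idf per class
    class_docs = Counter(labels)
    for doc_weights, label in zip(weights, labels):
        for w, x in doc_weights.items():
            class_word[label][w] += x
            class_total[label] += x
    model = {}
    for c in class_docs:
        denom = class_total[c] + len(vocab)  # add-one smoothing over the whole vocabulary
        log_like = {w: math.log((class_word[c].get(w, 0.0) + 1) / denom) for w in vocab}
        model[c] = (math.log(class_docs[c] / len(labels)), log_like)
    return model

def predict(model, tokens):
    """Pick the class with the highest log posterior; unknown words are skipped."""
    def score(c):
        log_prior, log_like = model[c]
        return log_prior + sum(log_like[w] for w in tokens if w in log_like)
    return max(model, key=score)
```

Because the smoothed numerators are summed over the whole vocabulary, the per-class word probabilities still sum to 1 regardless of which tf-idf variant is used.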
Another potential Solution
It took me a while to wrap my head around this problem. The main reason for this was having to worry about maintaining probability normalization. Using a Gaussian Naive Bayes would help ignore this issue entirely.
If you wanted to use this method:
Compute the mean and variance of the tf-idf values of each feature for each class.
Compute the likelihood of each feature value using a Gaussian distribution with the above mean and variance.
Proceed as normal (multiply by the class prior) and predict values.
Hard coding this shouldn't be too hard, since the Gaussian density is straightforward to write with numpy. I just prefer this type of generic solution for these types of problems.
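The Gaussian variant can be sketched in a few lines of plain Python (numpy would work equally well). The function names and the `var_floor` parameter are hypothetical; the floor is an assumption added so that a feature with zero variance in a class does not cause a division by zero.

```python
import math

def fit_gaussian_nb(X, y, var_floor=1e-3):
    """X: list of equal-length feature vectors (e.g. tf-idf values), y: class labels.

    Returns {class: (log_prior, per-feature means, per-feature variances)}.
    """
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # var_floor (an assumption) keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(zip(*rows), means)]
        model[c] = (math.log(n / len(y)), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log of the normal density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict_gaussian_nb(model, x):
    """Class with the highest log prior plus summed per-feature log likelihoods."""
    def score(c):
        log_prior, means, variances = model[c]
        return log_prior + sum(log_gaussian(v, m, s) for v, m, s in zip(x, means, variances))
    return max(model, key=score)
```

Working in log space avoids underflow when there are many features, which is typical for tf-idf vectors.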
Additional methods to increase accuracy
Apart from the above, you could also use the following techniques to increase accuracy:
Preprocessing:
Feature reduction (usually NMF, PCA, or LDA)
Additional features
Algorithm:
Naive Bayes is fast, but it often performs worse than discriminative models. It may be better to perform feature reduction, and then switch to a discriminative model such as SVM or Logistic Regression.
Misc.
Bootstrapping, boosting, etc. Be careful not to overfit though...
Hopefully this was helpful. Leave a comment if anything was unclear
P(word|class) = (word_count_in_class + 1) / (total_words_in_class + total_unique_words_in_all_classes)
where total_unique_words_in_all_classes is the vocabulary of words in the entire training set.
How would this sum up to 1? If using the above conditional probabilities, I assume the SUM is
P(word1|class)+P(word2|class)+...+P(wordn|class) =
(total_words_in_class + total_unique_words_in_class)/(total_words_in_class+total_unique_words_in_all_classes)
To correct this, I think the P(word|class) should be like
(word_count_in_class + 1)/(total_words_in_class+total_unique_words_in_classes(vocabulary of words in class))
Please correct me if I am wrong.
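One way to check the sum in question numerically: with add-one smoothing, the sum is normally taken over every word in the vocabulary, including words that never occur in the class (each of those contributes 1/denominator). Under that reading, the original denominator already normalizes. The tiny vocabulary and counts below are made up purely for the check.

```python
from collections import Counter

# Made-up toy data: the whole training vocabulary and the counts for one class.
vocabulary = ["ball", "goal", "vote", "law"]
sports_counts = Counter({"ball": 3, "goal": 2})  # "vote" and "law" never occur in this class

total_words_in_class = sum(sports_counts.values())
denominator = total_words_in_class + len(vocabulary)

# Counter returns 0 for missing words, so unseen words get probability 1 / denominator.
probabilities = {w: (sports_counts[w] + 1) / denominator for w in vocabulary}
total = sum(probabilities.values())
print(probabilities, total)
```

If the sum is instead restricted to words seen in the class, it falls short of 1 by exactly the mass assigned to the unseen words.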
I think there are two ways to do it:
Round down tf-idf as integers, then use the multinomial distribution for the conditional probabilities. See this paper https://www.cs.waikato.ac.nz/ml/publications/2004/kibriya_et_al_cr.pdf.
Use Dirichlet distribution which is a continuous version of the multinomial distribution for the conditional probabilities.
I am not sure if Gaussian mixture will be better.

Assurance of ICP, internal Metrics

So I have an iterative closest point (ICP) algorithm that has been written and will fit a model to a point cloud. As a quick tutorial for those not in the know, ICP is a simple algorithm that fits points to a model, ultimately providing a homogeneous transform matrix between the model and the points.
Here is a quick picture tutorial.
Step 1: Find the closest point in the model set to your data set:
Step 2: Using a bunch of fun maths (sometimes based on gradient descent or SVD), pull the clouds closer together and repeat until a pose is formed:
Now that bit is simple and working. What I would like help with is:
How do I tell if the pose that I have is a good one?
So currently I have two ideas, but they are kind of hacky:
How many points are in the ICP algorithm, i.e. if I am fitting to almost no points, I assume that the pose will be bad:
But what if the pose is actually good? It could be, even with few points. I don't want to reject good poses:
So what we see here is that a small number of points can actually give a very good pose if they are in the right place.
So the other metric investigated was the ratio of the used points to the supplied points. Here's an example:
We exclude points that are too far away because they will be outliers. This means we need a good starting position for ICP to work, but I am OK with that. In the above example the assurance will say NO, this is a bad pose, and it would be right, because the ratio of points included to points supplied is:
2/11 < SOME_THRESHOLD
So that's good, but it will fail in the case shown above where the triangle is upside down. It will say that the upside-down triangle is good because all of the points are used by ICP.
You don't need to be an expert on ICP to answer this question; I am looking for good ideas. Using knowledge of the points, how can we classify whether it is a good pose solution or not?
Using both of these solutions in tandem is a good suggestion, but it's a pretty lame solution if you ask me; it seems very dumb to just threshold it.
What are some good ideas for how to do this?
PS. If you want to add some code, please go for it. I am working in C++.
PPS. Someone help me with tagging this question; I am not sure where it should fall.
One possible approach might be comparing poses by their shapes and their orientation.
Shapes comparison can be done with Hausdorff distance up to isometry, that is poses are of the same shape if
d(I(actual_pose), calculated_pose) < d_threshold
where d_threshold should be found from experiments. As isometries I would consider rotations by different angles, which seems to be sufficient in this case.
If the poses have the same shape, we should compare their orientation. To compare orientation we could use a somewhat simplified Freksa model. For each pose we should calculate the values
{x_y min, x_y max, x_z min, x_z max, y_z min, y_z max}
and then make sure that each difference between corresponding values for poses does not break another_threshold, derived from experiments as well.
Hopefully this makes some sense, or at least you can draw something useful for your purpose from this.
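The shape comparison could be sketched as follows in Python for 2D point sets. This is only an illustration under several assumptions: both point sets are centred on the origin, the isometries are limited to a fixed number of sampled rotations, and the function names are hypothetical.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

def rotate(points, angle):
    """Rotate 2D points about the origin by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def same_shape(actual, calculated, d_threshold, steps=360):
    """Minimise the Hausdorff distance over sampled rotations of `actual`.

    Returns (is_same_shape, best_distance). The number of steps is an assumption.
    """
    best = min(hausdorff(rotate(actual, 2 * math.pi * k / steps), calculated)
               for k in range(steps))
    return best < d_threshold, best
```

The brute-force search is quadratic in the number of points for each sampled angle, so for large clouds a spatial index or a library routine would be a better fit.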
ICP attempts to minimize the distance between your point-cloud and a model, yes? Wouldn't it make the most sense to evaluate it based on what that distance actually is after execution?
I'm assuming it tries to minimize the sum of squared distances between each point you try to fit and the closest model point. So if you want a metric for quality, why not just normalize that sum, dividing by the number of points it's fitting. Yes, outliers will disrupt it somewhat but they're also going to disrupt your fit somewhat.
It seems like any calculation you can come up with that provides more insight than whatever ICP is minimizing would be more useful incorporated into the algorithm itself, so it can minimize that too. =)
Update
I think I didn't quite understand the algorithm. It seems that it iteratively selects a subset of points, transforms them to minimize error, and then repeats those two steps? In that case your ideal solution selects as many points as possible while keeping error as small as possible.
You said combining the two terms seemed like a weak solution, but it sounds to me like an exact description of what you want, and it captures the two major features of the algorithm (yes?). Evaluating using something like error + B * (selected / total) seems spiritually similar to how regularization is used to address the overfitting problem with gradient descent (and similar) ML algorithms. Selecting a good value for B would take some experimentation.
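A combined score along these lines might look like the Python sketch below (the same logic ports directly to C++). It is a sketch under stated assumptions: `max_pair_distance` stands in for the outlier cutoff, the function name is hypothetical, and the penalty is written as B * (1 - used / total) so that lower is always better, which is one possible sign convention for the "error plus B times ratio" idea.

```python
import math

def pose_quality(points, model_points, max_pair_distance, B=1.0):
    """Lower is better: mean squared residual of the used points plus a
    penalty for the fraction of supplied points that were rejected."""
    residuals = []
    for p in points:
        d = min(math.dist(p, m) for m in model_points)  # nearest model point
        if d <= max_pair_distance:                      # points beyond the cutoff are outliers
            residuals.append(d * d)
    if not residuals:
        return math.inf  # nothing was matched, so the pose cannot be assessed
    mean_sq_error = sum(residuals) / len(residuals)
    used_ratio = len(residuals) / len(points)
    return mean_sq_error + B * (1.0 - used_ratio)
```

As with a regularization weight, B would have to be tuned on poses whose quality is already known.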
Looking at your examples, it seems that one of the things that determines whether the match is good or not, is the quality of the points. Could you use/calculate a weighting factor in calculating your metric?
For example, you could weight down points which are co-linear / co-planar, or spatially close, as they probably define the same feature. That would perhaps allow your upside-down triangle to be rejected (as the points are in a line, and that is not a great indicator of the overall pose) but the corner-case would be OK, as they roughly define the hull.
Alternatively, maybe the weighting should be on how distributed the points are around the pose, again trying to ensure you have good coverage, rather than matching small indistinct features.
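One simple way to quantify how well distributed the matched points are is the ratio of the smallest to the largest eigenvalue of their covariance matrix: it is 0 for points on a line and approaches 1 for an evenly spread cloud. The Python sketch below does this for 2D points; the function name is hypothetical, and using this ratio as a weight is just one possible reading of the suggestion above.

```python
import math

def spread_ratio(points):
    """Ratio of smallest to largest eigenvalue of the 2D covariance matrix.

    Returns a value in [0, 1]; values near 0 mean the points are nearly collinear.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed-form eigenvalues of the symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    largest, smallest = half_trace + root, half_trace - root
    return 0.0 if largest == 0 else max(smallest, 0.0) / largest
```

A pose fitted only to points with a very low ratio could then be down-weighted or flagged, even if every supplied point was used.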

Two way clustering in ordered logit model, restricting rstudent to mitigate outlier effects

I have an ordered dependent variable (1 through 21) and continuous independent variables. I need to run an ordered logit model, clustering by firm and time, and eliminating outliers with studentized residuals < -2.5 or > 2.5. I only know the ologit command and some of its options; I have no idea how to do two-way clustering or how to eliminate outliers using studentized residuals:
ologit rating3 securitized retained, cluster(firm)
As far as I know, two-way clustering has only been extended to a few estimation commands (like ivreg2 from SSC and tobit/logit/probit here). Eliminating outliers can easily be done on your own, and there's no automated way of doing it.
Use the logit2.ado from the link Dimitriy gave (Mitchell Petersen's website) and modify it to use the ologit command. It's simple enough to do with a little trial and error. Good luck!
If you have a variable with 21 ordinal categories, I would have no problems treating that as a continuous one. If you want to back that up somehow, I wrote a paper on welfare measurement with ordinal variables, see DOI:10.1111/j.1475-4991.2008.00309.x. Then you can use ivreg2. You should be aware of all the issues involved with that estimator, in particular, that it implicitly assumes that the correlations are fully modeled by this two-way structure, and that observations for firms i and j and times t and s are definitely uncorrelated for i!=j and t!=s. Sometimes, this is a strong assumption to make -- i.e., New York and New Jersey may be correlated in 2010, but New York 2010 is assumed to be uncorrelated with New Jersey 2009.
I have no idea of what you might mean by ordinal outliers. Somebody must have piled a bunch of dissertation advice (or worse analysis requests) without really trying to make sense of every bit.

How many samples are optimal in one class using k-nearest neighbor?

I have implemented the k-nearest neighbor algorithm in my system. It consists of 26 classes, each with 100 samples. In my case, K=7, and it was completely trial and error to get the best classification result.
I know that K should be chosen wisely to reduce the noise on the classification. But what about the number of samples? Is there any general rule such as "the more samples the better result"? Does it depend on something?
Thank you for all your responses.
You could try considering whatever underlying mechanism is generating your data, or whatever background knowledge you have on the problem, which might give you an idea of the relative size of noise and true underlying variation. E.g. predicting favourite sports team from location I would expect more change than predicting favourite sport, so would use smaller k. However I don't know of much general guidance, except to use cross-validation.
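As a concrete version of the cross-validation advice, here is a minimal leave-one-out sketch in Python for comparing candidate values of k. The function names and the plain Euclidean distance are assumptions; the same loop can be repeated on subsets of the data of increasing size to see whether more samples per class still improve accuracy.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train: list of (features, label) pairs. Majority vote among the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: classify each sample using all the others."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, k) == y
    return hits / len(data)

def choose_k(data, candidates):
    """Candidate k with the best leave-one-out accuracy (first one on ties)."""
    return max(candidates, key=lambda k: loo_accuracy(data, k))
```

Leave-one-out is quadratic in the number of samples, so with 2600 samples a 5- or 10-fold split may be a more practical variant.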

appropriate minimum support for itemset?

Please suggest any material about choosing an appropriate minimum support and confidence for itemsets.
I use the Apriori algorithm to search for frequent itemsets. I still don't know what support and confidence values are appropriate. I would like to know what considerations go into deciding how large the support should be.
The answer is that the appropriate values depend on the data.
For some datasets, the best value may be 0.5. But for some other datasets it may be 0.05. It depends on the data.
But if you set minsup = 0 and minconf = 0, some algorithms will run out of memory before terminating, or you may run out of disk space because there are too many patterns.
From my experience, the best way to choose minsup and minconf is to start with a high value and then to lower them down gradually until you find enough patterns.
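The "start high and lower gradually" procedure can be sketched in Python as below. The miner here is a brute-force counter limited to small itemsets, standing in for Apriori purely for illustration; the function names, the starting value, the step size and the floor are all assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup, max_size=2):
    """Brute-force support counting; fine for toy data, not a replacement for Apriori.

    transactions: list of sets. Returns {itemset tuple: support}.
    """
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            support = sum(set(candidate) <= t for t in transactions) / n
            if support >= minsup:
                result[candidate] = support
    return result

def lower_until_enough(transactions, wanted, start=0.9, step=0.1, floor=0.1):
    """Lower minsup step by step until at least `wanted` itemsets are found
    or the next step would go below `floor`."""
    minsup = start
    while True:
        found = frequent_itemsets(transactions, minsup)
        if len(found) >= wanted or minsup - step < floor:
            return minsup, found
        minsup = round(minsup - step, 10)  # rounding avoids float drift such as 0.7999999
```

The floor keeps the search from drifting toward minsup = 0, where the number of patterns explodes.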
Alternatively, if you don't want to have to set minsup, you can use a top-k algorithm where, instead of specifying minsup, you specify for example that you want the k most frequent rules. For example, k = 1000 rules.
If you are interested in top-k association rule mining, you can check my Java code here:
http://www.philippe-fournier-viger.com/spmf/
The algorithm is called TopKRules and the article describing it will be published next month.
Besides that, you need to know that there are many other interestingness measures besides support and confidence: lift, all-confidence, ... To learn more about this, you can read these articles: "On selecting interestingness measures for association rules" and "A Survey of Interestingness Measures for Association Rules". Basically, all measures have some problems in some cases... no measure is perfect.
Hope this helps!
In any association rule mining algorithm, including Apriori, it is up to the user to decide what support and confidence values they want to provide. Depending on your dataset and your objectives you decide the minSup and minConf.
Obviously, if you set these values lower, then your algorithm will take longer to execute and you will get a lot of results.
The minimum support and minimum confidence parameters are a user preference. If you want a larger quantity of results (with lower statistical confidence), choose the parameters accordingly. In theory you can set them to 0. The algorithm will run, but it will take a long time, and the result will not be particularly useful, as it contains just about anything.
So choose them so that the results suit your needs. Mathematically, any value is "correct".