Why does the C4.5 algorithm use pruning in order to reduce the decision tree and how does pruning affect the predicion accuracy? - weka

I have searched on google about this issue and I can't find something that explains this algorithm in a simple yet detailed way.
For instance, I know the id3 algorithm doesn't use pruning at all, so if you have a continuous characteristic, the prediction success rates will be very low.
So the C4.5 in order to support continuous characteristics it uses pruning, but is this the only reason?
Also I can't really understand in the WEKA application, how exactly the confidence factor affects the efficiency of the predictions. The smaller the confidence factor the more pruning the algorithm will do, however what is the correlation between pruning and the prediction's accuracy? The more you prune, the better the predictions or the worse?
Thanks

Pruning is a way of reducing the size of the decision tree. This will reduce the accuracy on the training data, but (in general) increase the accuracy on unseen data. It is used to mitigate overfitting, where you would achieve perfect accuracy on training data, but the model (i.e. the decision tree) you learn is so specific that it doesn't apply to anything but that training data.
In general, if you increase pruning, the accuracy on the training set will be lower. WEKA does however offer various things to estimate the accuracy better, namely training/test split or cross-validation. If you use cross-validation for example, you'll discover a "sweet spot" of the pruning confidence factor somewhere where it prunes enough to make the learned decision tree sufficiently accurate on test data, but doesn't sacrifice too much accuracy on the training data. Where this sweet spot lies however will depend on your actual problem and the only way to determine it reliably is to try.

Related

What is considered a good accuracy for trained Word2Vec on an analogy test?

After training Word2Vec, how high should the accuracy be during testing on analogies? What level of accuracy should be expected if it is trained well?
The analogy test is just a interesting automated way to evaluate models, or compare algorithms.
It might not be the best indicator of how well word-vectors will work for your own project-specific goals. (That is, a model which does better on word-analogies might be worse for whatever other info-retrieval, or classification, or other goal you're really pursuing.) So if at all possible, create an automated evaluation that's tuned to your own needs.
Note that the absolute analogy scores can also be quite sensitive to how you trim the vocabulary before training, or how you treat analogy-questions with out-of-vocabulary words, or whether you trim results at the end to just higher-frequency words. Certain choices for each of these may boost the supposed "correctness" of the simple analogy questions, but not improve the overall model for more realistic applications.
So there's no absolute accuracy rate on these simplistic questions that should be the target. Only relative rates are somewhat indicative - helping to show when more data, or tweaked training parameters, seem to improve the vectors. But even vectors with small apparent accuracies on generic analogies might be useful elsewhere.
All that said, you can review a demo notebook like the gensim "Comparison of FastText and Word2Vec" to see what sorts of accuracies on the Google word2vec.c `questions-words.txt' analogy set (40-60%) are achieved under some simple defaults and relatively small training sets (100MB-1GB).

Real reason for speed up in fasttext

What is the real reason for speed-up, even though the pipeline mentioned in the fasttext paper uses techniques - negative sampling and heirerchichal softmax; in earlier word2vec papers. I am not able to clearly understand the actual difference, which is making this speed up happen ?
Is there that much of a speed-up?
I don't think there are any algorithmic breakthroughs which make the word2vec-equivalent word-vector training in FastText significantly faster. (And if you're using the character-ngrams option in FastText, to allow post-training synthesis of vectors for unseen words based on substrings shared with training-words, I'd expect the training to be slower, because every word requires training of its substring vectors as well.)
Any speedups in FastText are likely just because the code is well-tuned, with the benefit of more implementation experience.
To be efficient on datasets with a very large number of categories, Fast text uses a hierarchical classifier instead of a flat structure, in which the different categories are organized in a tree (think binary tree instead of list). This reduces the time complexities of training and testing text classifiers from linear to logarithmic with respect to the number of classes. FastText also exploits the fact that classes are imbalanced (some classes appearing more often than other) by using the Huffman algorithm to build the tree used to represent categories. The depth in the tree of very frequent categories is, therefore, smaller than for infrequent ones, leading to further computational efficiency.
Reference link: https://research.fb.com/blog/2016/08/fasttext/

Which one is faster? Logistic regression or SVM with linear kernel?

I am doing machine learning with python (scikit-learn) using the same data but with different classifiers. When I use 500k of data, LR and SVM (linear kernel) take about the same time, SVM (with polynomial kernel) takes forever. But using 5 million data, it seems LR is faster than SVM (linear) by a lot, I wonder if this is what people normally find?
Faster is a bit of a weird question, in part because it is hard to compare apples to apples on this, and it depends on context. LR and SVM are very similar in the linear case. The TLDR for the linear case is that Logistic Regression and SVMs are both very fast and the speed difference shouldn't normally be too large, and both could be faster/slower in certain cases.
From a mathematical perspective, Logistic regression is strictly convex [its loss is also smoother] where SVMs are only convex, so that helps LR be "faster" from an optimization perspective, but that doesn't always translate to faster in terms of how long you wait.
Part of this is because, computationally, SVMs are simpler. Logistic Regression requires computing the exp function, which is a good bit more expensive than just the max function used in SVMs, but computing these doesn't make the majority of the work in most cases. SVMs also have hard zeros in the dual space, so a common optimization is to perform "shrinkage", where you assume (often correctly) that a data point's contribution to the solution won't change in the near future and stop visiting it / checking its optimality. The hard zero of the SVM loss and the C regularization term in the soft margin form allow for this, where LR has no hard zeros to exploit like that.
However, when you want something to be fast, you usually don't use an exact solver. In this case, the issues above mostly disappear, and both tend to learn just as quick as the other in this scenario.
In my own experience, I've found Dual Coordinate Descent based solvers to be the fastest for getting exact solutions to both, with Logistic Regression usually being faster in wall clock time than SVMs, but not always (and never by more than a 2x factor). However, if you try and compare different solver methods for LRs and SVMs you may get very different numbers on which is "faster", and those comparisons won't necessarily be fair. For example, the SMO solver for SVMs can be used in the linear case, but will be orders of magnitude slower because it is not exploiting the fact that you only care are Linear solutions.

choosing kernel for digit recognition in C

I'm trying to classify digits read on images at known positions in C++, using SVM.
for that, I sample over a rectangle at the known position of the digit, I train with a ground_truth.
I wonder how to choose the kernel of the SVM. I use the default linear kernel but my intuition tell me that it might not be the best choice.
How could I choose the kernel?
You will need to tune the kernel (if you use a nonlinear one). This guide may be useful for you: A practical guide to SVM classification
Unfortunately there is not a magic bullet for this, so experimentation is your best friend.
Probably I would start with RBF which tends to work decently in most cases, and I am agreed with your intuition that probably linear is not the best, although some times (especially when you have tons of data) it can give you good surprises :)
The problem I have found with RBF is that it tends to overfit the training set, this stop to be an issue if you have a lot of data but then a new problem raises because it tends to scale poorly and having slow training time for big data.

How many samples are optimal in one class using k-nearest neighbor?

I have implemented k-nearest algorithm in my system. It consists from 26 classes, each of 100 samples. In my case, K=7 and it was completely trial and error to get the best classification result.
I know that K should be chosen wisely to reduce the noise on the classification. But what about the number of samples? Is there any general rule such as "the more samples the better result"? Does it depend on something?
Thank you for all your responses.
You could try considering whatever underlying mechanism is generating your data, or whatever background knowledge you have on the problem, which might give you an idea of the relative size of noise and true underlying variation. E.g. predicting favourite sports team from location I would expect more change than predicting favourite sport, so would use smaller k. However I don't know of much general guidance, except to use cross-validation.