Difference between empirical naive bayes & parametric bayes classifiers - data-mining

I'm trying to understand the difference between each of these.
What is the difference between empirical naive bayes classifiers and parametric bayes classifiers?

The empirical part means that the distribution is estimated from the data, rather than being fixed before analysis begins.
Empirical Bayes methods are procedures for statistical inference in which the prior distribution is estimated from the data. This approach stands in contrast to standard Bayesian methods, for which the prior distribution is fixed before any data are observed. Despite this difference in perspective, empirical Bayes may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model wherein the parameters at the highest level of the hierarchy are set to their most likely values, instead of being integrated out. Empirical Bayes, also known as maximum marginal likelihood, represents one approach for setting hyperparameters.
http://en.wikipedia.org/wiki/Empirical_Bayes_method
Naive means that the features being analyzed are assumed to be independent of each other, given the class.
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
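To make the two terms above concrete, here is a minimal sketch. It assumes that "parametric" means modelling each feature within each class as a Gaussian, and that "empirical" means estimating the class priors and the Gaussian parameters from the training data. The question does not pin down either meaning, so treat this as one plausible reading. The function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-feature Gaussian parameters from the data."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Small constant keeps every variance strictly positive (an assumption of this sketch).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        prior = n / len(X)  # prior estimated from class frequencies
        model[label] = (prior, means, variances)
    return model

def predict(model, row):
    """Pick the class with the highest log posterior under the naive assumption."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Naive independence: per-feature log-likelihoods simply add up.
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: two features, two classes.
X = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(X, y)
print(predict(model, [1.1, 2.0]))
print(predict(model, [3.1, 0.5]))
```

Replacing the estimated `prior` with a value chosen before seeing the data would give the non-empirical variant, with the rest of the classifier unchanged.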

Related

Choosing the best subset of features

I want to choose the best available feature subset that distinguishes two classes, to be fed into a statistical framework that I have built, where features are not independent.
After looking at the feature selection methods in machine learning, it seems that they fall into three categories: filter, wrapper, and embedded methods. Filter methods can be either univariate or multivariate. It makes sense to use either multivariate filter methods or wrapper methods because both, as I understand it, look for the best subset. However, as I am not using a classifier, how can I use them?
Does it make sense to apply such methods (e.g. recursive feature elimination) to a decision tree or Random Forest classifier, where the features have rules, and then take the resulting best subset and feed it into my framework?
Also, as most of the algorithms provided by Scikit-learn are univariate algorithms, are there any other Python-based libraries that provide more subset feature selection algorithms?
I think the statement that "most of the algorithms provided by Scikit-learn are univariate algorithms" is false. Scikit-learn handles multi-dimensional data very nicely. The RandomForestClassifier that they provide will give you an estimate of feature importance.
Another way to estimate feature importance is to choose any classifier that you like, train it, and estimate its performance on a validation set. Record the accuracy; this will be your baseline. Then take that same train/validation split and randomly permute all values along one feature dimension. Then train and validate again. Record the difference between this accuracy and your baseline. Repeat this for all feature dimensions. The result is one number per feature dimension that indicates its importance: the larger the drop in accuracy, the more important the feature.
You can extend this to use pairs or triples of feature dimensions, but the computational cost will grow quickly. If your features are highly correlated, you may benefit from doing this for at least the pairwise case.
Here is the source document of where I learned that trick: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp
(It should work for classifiers other than Random Forests.)
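The permutation procedure described above can be sketched in plain Python. This is only an illustration: the nearest-centroid classifier stands in for "any classifier that you like", the synthetic data (one informative feature, one noise feature) is made up, and all function names are hypothetical.

```python
import random

def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def accuracy(centroids, X, y):
    def predict(row):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))
    return sum(predict(r) == t for r, t in zip(X, y)) / len(y)

def permuted(X, j, rng):
    """Copy of X with column j shuffled, which breaks its link to the labels."""
    col = [row[j] for row in X]
    rng.shuffle(col)
    return [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]

def permutation_importance(X_tr, y_tr, X_va, y_va, rng):
    baseline = accuracy(fit_centroids(X_tr, y_tr), X_va, y_va)
    drops = []
    for j in range(len(X_tr[0])):
        # Permute feature j in the same split, then train and validate again.
        Xp_tr, Xp_va = permuted(X_tr, j, rng), permuted(X_va, j, rng)
        drops.append(baseline - accuracy(fit_centroids(Xp_tr, y_tr), Xp_va, y_va))
    return baseline, drops

rng = random.Random(0)

def make(n):
    """Synthetic data: feature 0 separates the classes, feature 1 is pure noise."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([label * 4.0 + rng.gauss(0, 0.5), rng.gauss(0, 0.5)])
        y.append(label)
    return X, y

X_tr, y_tr = make(200)
X_va, y_va = make(200)
baseline, drops = permutation_importance(X_tr, y_tr, X_va, y_va, rng)
print(baseline, drops)
```

On data like this, the drop for feature 0 should be large and the drop for feature 1 close to zero. Averaging over several permutations per feature would make the estimates less noisy.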

Which one is faster? Logistic regression or SVM with linear kernel?

I am doing machine learning with Python (scikit-learn) using the same data but with different classifiers. When I use 500k data points, LR and SVM (linear kernel) take about the same time, while SVM (with polynomial kernel) takes forever. But with 5 million data points, LR seems to be much faster than SVM (linear). I wonder if this is what people normally find?
Faster is a bit of a weird question, in part because it is hard to compare apples to apples on this, and it depends on context. LR and SVM are very similar in the linear case. The TLDR for the linear case is that Logistic Regression and SVMs are both very fast and the speed difference shouldn't normally be too large, and both could be faster/slower in certain cases.
From a mathematical perspective, Logistic regression is strictly convex [its loss is also smoother] where SVMs are only convex, so that helps LR be "faster" from an optimization perspective, but that doesn't always translate to faster in terms of how long you wait.
Part of this is because, computationally, SVMs are simpler. Logistic Regression requires computing the exp function, which is a good bit more expensive than the max function used in SVMs, but computing these doesn't make up the majority of the work in most cases. SVMs also have hard zeros in the dual space, so a common optimization is to perform "shrinking", where you assume (often correctly) that a data point's contribution to the solution won't change in the near future and stop visiting it / checking its optimality. The hard zero of the SVM loss and the C regularization term in the soft margin form allow for this, whereas LR has no hard zeros to exploit like that.
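As a small illustration of the "hard zero" point, the sketch below evaluates both losses on a few hypothetical margin values, where the margin is the label (+1 or -1) times the model's score. Only the hinge loss reaches exactly zero; the logistic loss stays positive for every finite margin.

```python
import math

def hinge_loss(margin):
    """SVM loss: exactly zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic regression loss: smooth, and positive for every finite margin."""
    return math.log1p(math.exp(-margin))

margins = [-1.0, 0.0, 0.5, 1.0, 2.0, 5.0]  # hypothetical values of y * f(x)
for m in margins:
    print(f"margin={m:5.1f}  hinge={hinge_loss(m):.4f}  logistic={logistic_loss(m):.4f}")

# Points whose hinge loss is exactly zero are the ones a solver could stop visiting.
inactive = [m for m in margins if hinge_loss(m) == 0.0]
print(inactive)
```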
However, when you want something to be fast, you usually don't use an exact solver. In this case, the issues above mostly disappear, and both tend to learn about as quickly as each other.
In my own experience, I've found Dual Coordinate Descent based solvers to be the fastest for getting exact solutions to both, with Logistic Regression usually being faster in wall clock time than SVMs, but not always (and never by more than a 2x factor). However, if you try to compare different solver methods for LRs and SVMs, you may get very different numbers on which is "faster", and those comparisons won't necessarily be fair. For example, the SMO solver for SVMs can be used in the linear case, but will be orders of magnitude slower because it does not exploit the fact that you only care about linear solutions.

dimension reduction in spam filtering

I'm performing an experiment in which I need to compare the classification performance of several classification algorithms for spam filtering, viz. Naive Bayes, SVM, J48, k-NN, RandomForests, etc. I'm using the WEKA data mining tool. While going through the literature, I came to know about various dimension reduction methods, which can be broadly classified into two types:
Feature Reduction: Principal Component Analysis, Latent Semantic Analysis, etc.
Feature Selection: Chi-Square, InfoGain, GainRatio, etc.
I have also read this tutorial of WEKA by Jose Maria in his blog: http://jmgomezhidalgo.blogspot.com.es/2013/02/text-mining-in-weka-revisited-selecting.html
In this blog he writes, "A typical text classification problem in which dimensionality reduction can be a big mistake is spam filtering". So, now I'm confused whether dimensionality reduction is of any use in case of spam filtering or not?
Further, I have also read in the literature about Document Frequency and TF-IDF as feature reduction techniques. But I'm not sure how they work and come into play during classification.
I know how to use WEKA, chain filters and classifiers, etc. The problem I'm facing is that, since I don't know enough about feature selection/reduction (including TF-IDF), I am unable to decide how and which feature selection techniques and classification algorithms I should combine to make my study meaningful. I also have no idea about the optimal threshold value that I should use with chi-square, info gain, etc.
In the StringToWordVector class, I have an IDFTransform option, so does it make sense to set it to TRUE and also use a feature selection technique, say InfoGain?
Please guide me and if possible please provide links to resources where I can learn about dimension reduction in detail and can plan my experiment meaningfully!
Well, Naive Bayes seems to work best for spam filtering, and it doesn't play nicely with dimensionality reduction.
Many dimensionality reduction methods try to identify the features of the highest variance. This of course won't help a lot with spam detection; you want discriminative features.
Plus, there is not only one type of spam, but many. That is likely why naive Bayes works better than many other methods that assume there is only one type of spam.
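The variance-versus-discriminative point above can be illustrated with a toy sketch. The labels and the two term-presence features below are entirely made up: one hypothetical word appears in half of all messages regardless of class, while a hypothetical rare token appears only in some spam. The information gain computed here follows the standard entropy-based definition, not necessarily WEKA's exact InfoGain implementation.

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(feature, labels):
    """Label entropy minus the expected entropy after splitting on a discrete feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Hypothetical toy corpus: 3 spam and 5 ham messages.
labels      = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
common_word = [1, 0, 1, 0, 1, 0, 1, 0]  # high variance, nearly unrelated to the class
rare_token  = [1, 1, 0, 0, 0, 0, 0, 0]  # lower variance, appears only in spam

for name, feat in [("common_word", common_word), ("rare_token", rare_token)]:
    print(name, round(variance(feat), 4), round(information_gain(feat, labels), 4))
```

A variance-driven reduction would favour `common_word`, whereas a class-aware score such as information gain favours `rare_token`.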

Why does the C4.5 algorithm use pruning in order to reduce the decision tree and how does pruning affect the prediction accuracy?

I have searched on google about this issue and I can't find something that explains this algorithm in a simple yet detailed way.
For instance, I know the id3 algorithm doesn't use pruning at all, so if you have a continuous characteristic, the prediction success rates will be very low.
So C4.5 uses pruning in order to support continuous characteristics, but is this the only reason?
Also I can't really understand in the WEKA application, how exactly the confidence factor affects the efficiency of the predictions. The smaller the confidence factor the more pruning the algorithm will do, however what is the correlation between pruning and the prediction's accuracy? The more you prune, the better the predictions or the worse?
Thanks
Pruning is a way of reducing the size of the decision tree. This will reduce the accuracy on the training data, but (in general) increase the accuracy on unseen data. It is used to mitigate overfitting, where you would achieve perfect accuracy on training data, but the model (i.e. the decision tree) you learn is so specific that it doesn't apply to anything but that training data.
In general, if you increase pruning, the accuracy on the training set will be lower. WEKA does, however, offer ways to estimate the accuracy more reliably, namely a training/test split or cross-validation. If you use cross-validation, for example, you'll discover a "sweet spot" for the pruning confidence factor: a value where it prunes enough to make the learned decision tree sufficiently accurate on test data, but doesn't sacrifice too much accuracy on the training data. Where this sweet spot lies, however, will depend on your actual problem, and the only way to determine it reliably is to try.

Outlier detection in data mining [closed]

I have a few sets of questions regarding outlier detection:
Can we find outliers using k-means and is this a good approach?
Is there any clustering algorithm which does not accept any input from the user?
Can we use support vector machine or any other supervised learning algorithm for outlier detection?
What are the pros and cons of each approach?
I will limit myself to what I think is essential to give some clues about all of your questions, because this is the topic of a lot of textbooks and your questions would probably be better addressed separately.
I wouldn't use k-means for spotting outliers in a multivariate dataset, for the simple reason that the k-means algorithm is not built for that purpose: You will always end up with a solution that minimizes the total within-cluster sum of squares (and hence maximizes the between-cluster SS because the total variance is fixed), and the outlier(s) will not necessarily define their own cluster. Consider the following example in R:
set.seed(123)
# simulate n bivariate normal points with the given means and standard deviations
sim.xy <- function(n, mean, sd) cbind(rnorm(n, mean[1], sd[1]),
                                      rnorm(n, mean[2], sd[2]))
# generate three clouds of points, well separated in the 2D plane
xy <- rbind(sim.xy(100, c(0,0), c(.2,.2)),
            sim.xy(100, c(2.5,0), c(.4,.2)),
            sim.xy(100, c(1.25,.5), c(.3,.2)))
xy[1,] <- c(0,2)      # convert 1st obs. to an outlying value
km3 <- kmeans(xy, 3)  # ask for three clusters
km4 <- kmeans(xy, 4)  # ask for four clusters
As can be seen in the next figure, the outlying value is never recovered as such: It will always belong to one of the other clusters.
One possibility, however, would be to use a two-stage approach where one removes extreme points (here defined as vectors far away from their cluster centroids) in an iterative manner, as described in the following paper: Improving K-Means by Outlier Removal (Hautamäki, et al.).
This bears some resemblance to what is done in genetic studies to detect and remove individuals who exhibit genotyping errors, or individuals who are siblings/twins (or when we want to identify population substructure), while we only want to keep unrelated individuals; in this case, we use multidimensional scaling (which is equivalent to PCA, up to a constant for the first two axes) and remove observations above or below 6 SD on any one of, say, the top 10 or 20 axes (see for example, Population Structure and Eigenanalysis, Patterson et al., PLoS Genetics 2006 2(12)).
A common alternative is to use ordered robust Mahalanobis distances that can be plotted (in a QQ plot) against the expected quantiles of a Chi-squared distribution, as discussed in the following paper:
R.G. Garrett (1989). The chi-square plot: a tool for multivariate outlier recognition. Journal of Geochemical Exploration 32(1/3): 319-341.
(It is available in the mvoutlier R package.)
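The chi-square plot idea can be sketched for the two-dimensional case in plain Python. Note the simplifications: this uses the classical sample mean and covariance rather than the robust estimates mentioned above, and it relies on the closed-form quantile of the Chi-squared distribution with 2 degrees of freedom, so it only applies to 2-D data. The data, the planted outlier, and the 0.999 cutoff are all hypothetical choices.

```python
import math
import random

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    out = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form d^T S^{-1} d, using the closed-form inverse of a 2x2 matrix.
        out.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return out

def chi2_quantile_df2(p):
    """Quantile of the Chi-squared distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - p)

rng = random.Random(1)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
points[0] = (6.0, -6.0)  # planted outlier

d2 = mahalanobis_sq_2d(points)
n = len(d2)
order = sorted(range(n), key=lambda i: d2[i])
# (observed distance, expected quantile) pairs: these are the points of the QQ plot.
qq = [(d2[i], chi2_quantile_df2((rank + 0.5) / n)) for rank, i in enumerate(order)]

cutoff = chi2_quantile_df2(0.999)
flagged = [i for i in range(n) if d2[i] > cutoff]
print(flagged)
```

Points whose observed distance sits far above the expected quantile at the upper end of `qq` are the outlier candidates. With robust estimates the outlier would inflate the covariance less and stand out even more.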
It depends on what you call user input. I interpret your question as whether some algorithm can process automatically a distance matrix or raw data and stop on an optimal number of clusters. If this is the case, and for any distance-based partitioning algorithm, then you can use any of the available validity indices for cluster analysis; a good overview is given in
Handl, J., Knowles, J., and Kell, D.B. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics 21(15): 3201-3212.
that I discussed on Cross Validated. You can, for instance, run several instances of the algorithm on different random samples (using bootstrap) of the data, for a range of cluster numbers (say, k=1 to 20), and select k according to the optimized criterion that was considered (average silhouette width, cophenetic correlation, etc.); it can be fully automated, with no need for user input.
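As a rough sketch of selecting k automatically with one such criterion, the code below runs a small k-means for k = 2 to 6 and keeps the k with the largest average silhouette width. It is a simplification of what is described above: there is no bootstrap resampling, k = 1 is skipped because the silhouette is undefined for a single cluster, and the deterministic farthest-first initialisation is an assumption made here to keep the example reproducible. The data and all function names are hypothetical.

```python
import random

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def farthest_first(points, k):
    """Deterministic initialisation: repeatedly add the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=50):
    centers = farthest_first(points, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def mean_silhouette(points, labels):
    clusters = {}
    for i, l in enumerate(labels):
        clusters.setdefault(l, []).append(i)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # the silhouette of a singleton is taken to be 0
        a = sum(dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist(p, points[j]) for j in idx) / len(idx)
                for l, idx in clusters.items() if l != labels[i])
        total += (b - a) / max(a, b)
    return total / len(points)

rng = random.Random(42)
blobs = [(0.0, 0.0), (5.0, 0.0), (2.5, 5.0)]  # three well-separated clouds
points = [(cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
          for cx, cy in blobs for _ in range(30)]

scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

The range of k and the criterion are still choices made by whoever writes the script, so "no user input" really means "no input at run time".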
There exist other forms of clustering, based on density (clusters are seen as regions where objects are unusually common) or distribution (clusters are sets of objects that follow a given probability distribution). Model-based clustering, as implemented in Mclust, for example, makes it possible to identify clusters in a multivariate dataset by spanning a range of shapes for the variance-covariance matrix for a varying number of clusters, and to choose the best model according to the BIC criterion.
This is a hot topic in classification, and some studies have focused on SVMs to detect outliers, especially when they are misclassified. A simple Google query will return a lot of hits, e.g. Support Vector Machine for Outlier Detection in Breast Cancer Survivability Prediction by Thongkam et al. (Lecture Notes in Computer Science 2008 4977/2008 99-109; this article includes a comparison to ensemble methods). The very basic idea is to use a one-class SVM to capture the main structure of the data by fitting a multivariate (e.g., Gaussian) distribution to it; objects that lie on or just outside the boundary might be regarded as potential outliers. (In a certain sense, density-based clustering would perform equally well, since defining what an outlier really is becomes more straightforward given an expected distribution.)
Other approaches for unsupervised, semi-supervised, or supervised learning are readily found on Google, e.g.
Hodge, V.J. and Austin, J. A Survey of Outlier Detection Methodologies.
Vinueza, A. and Grudic, G.Z. Unsupervised Outlier Detection and Semi-Supervised Learning.
Escalante, H.J. A Comparison of Outlier Detection Algorithms for Machine Learning.
A related topic is anomaly detection, about which you will find a lot of papers.
That really deserves a new (and probably more focused) question :-)
1) Can we find outliers using k-means, is it a good approach?
Cluster-based approaches are designed to find clusters, and can be used to detect outliers as by-products. During the clustering process, outliers can affect the locations of the cluster centers, and can even aggregate into a micro-cluster. These characteristics make cluster-based approaches infeasible for complex databases.
2) Is there any clustering algorithm which does not accept any input from the user?
You may find some useful material on this topic here:
Dirichlet Process Clustering
A Dirichlet-based clustering algorithm can adaptively determine the number of clusters according to the distribution of the observed data.
3) Can we use support vector machine or any other supervised learning algorithm for outlier detection?
Any supervised learning algorithm needs enough labeled training data to construct classifiers. However, a balanced training dataset is not always available for real-world problems, such as intrusion detection or medical diagnostics. According to Hawkins's definition of an outlier ("Identification of Outliers". Chapman and Hall, London, 1980), the number of normal data points is much larger than the number of outliers. Most supervised learning algorithms can't produce an effective classifier on such an unbalanced dataset.
4) What are the pros and cons of each approach?
Over the past several decades, the research on outlier detection varies from the global computation to the local analysis, and the descriptions of outliers vary from the binary interpretations to probabilistic representations. According to hypotheses of outlier detection models, outlier detection algorithms can be divided into four kinds: Statistic-based algorithms, Cluster-based algorithms, Nearest Neighborhood based algorithms, and Classifier-based algorithms. There are several valuable surveys on outlier detection:
Hodge, V. and Austin, J. "A survey of outlier detection methodologies", Journal of Artificial Intelligence Review, 2004.
Chandola, V. and Banerjee, A. and Kumar, V. "Outlier detection: A survey", ACM Computing Surveys, 2007.
k-means is rather sensitive to noise in the data set. It works best when you remove the outliers beforehand.
No. Any cluster analysis algorithm that claims to be parameter-free usually is heavily restricted, and often has hidden parameters - a common parameter is the distance function, for example. Any flexible cluster analysis algorithm will at least accept a custom distance function.
One-class classifiers are a popular machine-learning approach to outlier detection. However, supervised approaches aren't always appropriate for detecting previously unseen objects. Plus, they can overfit when the data already contains outliers.
Every approach has its pros and cons, that is why they exist. In a real setting, you will have to try most of them to see what works for your data and setting. It's why outlier detection is called knowledge discovery - you have to explore if you want to discover something new ...
You may want to have a look at the ELKI data mining framework. It is supposedly the largest collection of outlier detection data mining algorithms. It's open source software, implemented in Java, and includes some 20+ outlier detection algorithms. See the list of available algorithms.
Note that most of these algorithms are not based on clustering. Many clustering algorithms (in particular k-means) will try to cluster instances "no matter what". Only a few clustering algorithms (e.g. DBSCAN) actually consider the case that maybe not all instances belong to clusters! So for some algorithms, outliers will actually prevent a good clustering!
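To illustrate how a DBSCAN-style algorithm leaves some instances unclustered, here is a compact sketch in plain Python. It is a simplified, quadratic-time version written for this example, not ELKI's implementation. The data and the `eps` and `min_pts` values are hypothetical, and the label -1 for noise is a convention chosen here.

```python
def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nb)
    return labels

# Two dense 4x4 grids of points plus one isolated point.
grid_a = [(x * 0.1, y * 0.1) for x in range(4) for y in range(4)]
grid_b = [(5 + x * 0.1, 5 + y * 0.1) for x in range(4) for y in range(4)]
points = grid_a + grid_b + [(2.5, 10.0)]

labels = dbscan(points, eps=0.15, min_pts=4)
print(labels)
```

The isolated point is never assigned to a cluster, whereas k-means with k=2 would have to place it in one of the two.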