Feature importance based on extremely randomize trees and feature redundancy - python-2.7

I am using the Scikit-learn Extremely Randomized Trees algorithm to get info about the relative feature importances and I have a question about how "redundant features" are ranked.
If I have two features that are identical (redundant) and important to the classification, the extremely randomized trees cannot detect the redundancy of the features. That is, both features get a high ranking. Is there any other way to detect that two features are actualy redundant?

Maybe you could extract the top n important features and then compute pairwise Spearman's or Pearson's correlations for those in order to detect redundancy only for the top informative features as it might not be feasible to compute all pairwise feature correlations (quadratic with the number of features).
There might be more clever ways to do the same by exploiting the statistics of the relative occurrences of the features as nodes in the decision trees though.

Related

Real reason for speed up in fasttext

What is the real reason for speed-up, even though the pipeline mentioned in the fasttext paper uses techniques - negative sampling and heirerchichal softmax; in earlier word2vec papers. I am not able to clearly understand the actual difference, which is making this speed up happen ?
Is there that much of a speed-up?
I don't think there are any algorithmic breakthroughs which make the word2vec-equivalent word-vector training in FastText significantly faster. (And if you're using the character-ngrams option in FastText, to allow post-training synthesis of vectors for unseen words based on substrings shared with training-words, I'd expect the training to be slower, because every word requires training of its substring vectors as well.)
Any speedups in FastText are likely just because the code is well-tuned, with the benefit of more implementation experience.
To be efficient on datasets with a very large number of categories, Fast text uses a hierarchical classifier instead of a flat structure, in which the different categories are organized in a tree (think binary tree instead of list). This reduces the time complexities of training and testing text classifiers from linear to logarithmic with respect to the number of classes. FastText also exploits the fact that classes are imbalanced (some classes appearing more often than other) by using the Huffman algorithm to build the tree used to represent categories. The depth in the tree of very frequent categories is, therefore, smaller than for infrequent ones, leading to further computational efficiency.
Reference link: https://research.fb.com/blog/2016/08/fasttext/

Choosing the best subset of features

I want to choose the best feature subset available that distinguish two classes to be fed into a statistical framework that I have built, where features are not independent.
After looking at the feature selection methods on machine learning it seems that it fall into three different categories: Filter,wrapper and Embedded methods. and the filter methods can be either: univariate or multivariate. It does make sense to use either Filter(multivariate) or wrapper methods because both -as I understood- looks for the best subset, however, as I am not using a classifier how can use it ?
Does it make sense to apply such methods (e.g. Recursive feature
elimination ) to DT or Random Forest classifier where the features
have rules there, and then take the resulted best subset and fed it
into my framework ?**
Also, as most of the algorithms provided by Scikit-learn are
univariate algorithms, Is there any other python-based libraries
that provide more subset feature selection algorithms ?
I think the statement that "most of the algorithms provided by Scikit-learn are univariate algorithms" is false. Scikit-learn handles multi-dimensional data very nicely. The RandomForestClassifier that they provide will give you an estimate of feature importance.
Another way to estimate feature importance is to choose any classifier that you like, train it and estimate performance on a validation set. Record the accuracy and this will be your baseline. Then take that same train/validation split and randomly permute all values along one feature dimension. Then train and validate again. Record the difference of this accuracy from your baseline. Repeat this for all feature dimensions. The results will be a list of numbers for each feature dimension that indicates its importance.
You can extend this to use pairs, or triples of feature dimensions, but the computational cost will grow quickly. If you're features are highly correlated you may benefit from doing this for at least the pairwise case.
Here is the source document of where I learned that trick: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp
(It should work for classifiers other than Random Forests.)

dimension reduction in spam filtering

I'm performing an experiment in which I need to compare classification performance of several classification algorithms for spam filtering, viz. Naive Bayes, SVM, J48, k-NN, RandomForests, etc. I'm using the WEKA data mining tool. While going through the literature I came to know about various dimension reduction methods which can be broadly classified into two types-
Feature Reduction: Principal Component Analysis, Latent Semantic Analysis, etc.
Feature Selection: Chi-Square, InfoGain, GainRatio, etc.
I have also read this tutorial of WEKA by Jose Maria in his blog: http://jmgomezhidalgo.blogspot.com.es/2013/02/text-mining-in-weka-revisited-selecting.html
In this blog he writes, "A typical text classification problem in which dimensionality reduction can be a big mistake is spam filtering". So, now I'm confused whether dimensionality reduction is of any use in case of spam filtering or not?
Further, I have also read in the literature about Document Frequency and TF-IDF as being one of feature reduction techniques. But I'm not sure how does it work and come into play during classification.
I know how to use weka, chain filters and classifiers, etc. The problem I'm facing is since I don't have enough idea about feature selection/reduction (including TF-IDF) I am unable to decide how and what feature selection techniques and classification algorithms I should combine to make my study meaningful. I also have no idea about optimal threshold value that I should use with chi-square, info gain, etc.
In StringToWordVector class, I have an option of IDFTransform, so does it makes sence to set it to TRUE and also use a feature selection technique, say InfoGain?
Please guide me and if possible please provide links to resources where I can learn about dimension reduction in detail and can plan my experiment meaningfully!
Well, Naive Bayes seems to work best for spam filtering, and it doesn't play nicely with dimensionality reduction.
Many dimensionality reduction methods try to identify the features of the highest variance. This of course won't help a lot with spam detection, you want discriminative features.
Plus, there is not only one type of spam, but many. Which is likely why naive Bayes works better than many other methods that assume there is only one type of spam.

Why does the C4.5 algorithm use pruning in order to reduce the decision tree and how does pruning affect the predicion accuracy?

I have searched on google about this issue and I can't find something that explains this algorithm in a simple yet detailed way.
For instance, I know the id3 algorithm doesn't use pruning at all, so if you have a continuous characteristic, the prediction success rates will be very low.
So the C4.5 in order to support continuous characteristics it uses pruning, but is this the only reason?
Also I can't really understand in the WEKA application, how exactly the confidence factor affects the efficiency of the predictions. The smaller the confidence factor the more pruning the algorithm will do, however what is the correlation between pruning and the prediction's accuracy? The more you prune, the better the predictions or the worse?
Thanks
Pruning is a way of reducing the size of the decision tree. This will reduce the accuracy on the training data, but (in general) increase the accuracy on unseen data. It is used to mitigate overfitting, where you would achieve perfect accuracy on training data, but the model (i.e. the decision tree) you learn is so specific that it doesn't apply to anything but that training data.
In general, if you increase pruning, the accuracy on the training set will be lower. WEKA does however offer various things to estimate the accuracy better, namely training/test split or cross-validation. If you use cross-validation for example, you'll discover a "sweet spot" of the pruning confidence factor somewhere where it prunes enough to make the learned decision tree sufficiently accurate on test data, but doesn't sacrifice too much accuracy on the training data. Where this sweet spot lies however will depend on your actual problem and the only way to determine it reliably is to try.

Outlier detection in data mining [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I have a few sets of questions regarding outlier detection:
Can we find outliers using k-means and is this a good approach?
Is there any clustering algorithm which does not accept any input from the user?
Can we use support vector machine or any other supervised learning algorithm for outlier detection?
What are the pros and cons of each approach?
I will limit myself to what I think is essential to give some clues about all of your questions, because this is the topic of a lot of textbooks and they might probably be better addressed in separate questions.
I wouldn't use k-means for spotting outliers in a multivariate dataset, for the simple reason that the k-means algorithm is not built for that purpose: You will always end up with a solution that minimizes the total within-cluster sum of squares (and hence maximizes the between-cluster SS because the total variance is fixed), and the outlier(s) will not necessarily define their own cluster. Consider the following example in R:
set.seed(123)
sim.xy <- function(n, mean, sd) cbind(rnorm(n, mean[1], sd[1]),
rnorm(n, mean[2],sd[2]))
# generate three clouds of points, well separated in the 2D plane
xy <- rbind(sim.xy(100, c(0,0), c(.2,.2)),
sim.xy(100, c(2.5,0), c(.4,.2)),
sim.xy(100, c(1.25,.5), c(.3,.2)))
xy[1,] <- c(0,2) # convert 1st obs. to an outlying value
km3 <- kmeans(xy, 3) # ask for three clusters
km4 <- kmeans(xy, 4) # ask for four clusters
As can be seen in the next figure, the outlying value is never recovered as such: It will always belong to one of the other clusters.
One possibility, however, would be to use a two-stage approach where one's removing extremal points (here defined as vector far away from their cluster centroids) in an iterative manner, as described in the following paper: Improving K-Means by Outlier Removal (Hautamäki, et al.).
This bears some resemblance with what is done in genetic studies to detect and remove individuals which exhibit genotyping error, or individuals that are siblings/twins (or when we want to identify population substructure), while we only want to keep unrelated individuals; in this case, we use multidimensional scaling (which is equivalent to PCA, up to a constant for the first two axes) and remove observations above or below 6 SD on any one of say the top 10 or 20 axes (see for example, Population Structure and Eigenanalysis, Patterson et al., PLoS Genetics 2006 2(12)).
A common alternative is to use ordered robust mahalanobis distances that can be plotted (in a QQ plot) against the expected quantiles of a Chi-squared distribution, as discussed in the following paper:
R.G. Garrett (1989). The chi-square plot: a tools for multivariate outlier recognition. Journal of Geochemical Exploration 32(1/3): 319-341.
(It is available in the mvoutlier R package.)
It depends on what you call user input. I interpret your question as whether some algorithm can process automatically a distance matrix or raw data and stop on an optimal number of clusters. If this is the case, and for any distance-based partitioning algorithm, then you can use any of the available validity indices for cluster analysis; a good overview is given in
Handl, J., Knowles, J., and Kell, D.B.
(2005). Computational cluster validation in post-genomic data analysis.
Bioinformatics 21(15): 3201-3212.
that I discussed on Cross Validated. You can for instance run several instances of the algorithm on different random samples (using bootstrap) of the data, for a range of cluster numbers (say, k=1 to 20) and select k according to the optimized criteria taht was considered (average silhouette width, cophenetic correlation, etc.); it can be fully automated, no need for user input.
There exist other forms of clustering, based on density (clusters are seen as regions where objects are unusually common) or distribution (clusters are sets of objects that follow a given probability distribution). Model-based clustering, as it is implemented in Mclust, for example, allows to identify clusters in a multivariate dataset by spanning a range of shape for the variance-covariance matrix for a varying number of clusters and to choose the best model according to the BIC criterion.
This is a hot topic in classification, and some studies focused on SVM to detect outliers especially when they are misclassified. A simple Google query will return a lot of hits, e.g. Support Vector Machine for Outlier Detection in Breast Cancer Survivability Prediction by Thongkam et al. (Lecture Notes in Computer Science 2008 4977/2008 99-109; this article includes comparison to ensemble methods). The very basic idea is to use a one-class SVM to capture the main structure of the data by fitting a multivariate (e.g., gaussian) distribution to it; objects that on or just outside the boundary might be regarded as potential outliers. (In a certain sense, density-based clustering would perform equally well as defining what an outlier really is is more straightforward given an expected distribution.)
Other approaches for unsupervised, semi-supervised, or supervised learning are readily found on Google, e.g.
Hodge, V.J. and Austin, J. A Survey of Outlier Detection Methodologies.
Vinueza, A. and Grudic, G.Z. Unsupervised Outlier Detection and Semi-Supervised Learning.
Escalante, H.J. A Comparison of Outlier Detection Algorithms for Machine Learning.
A related topic is anomaly detection, about which you will find a lot of papers.
That really deserves a new (and probably more focused) question :-)
1) Can we find outliers using k-means, is it a good approach?
Cluster-based approaches are optimal to find clusters, and can be used to detect outliers as
by-products. In the clustering processes, outliers can affect the locations of the cluster centers, even aggregating as a micro-cluster. These characteristics make the cluster-based approaches infeasible to complicated databases.
2) Is there any clustering algorithm which does not accept any input from the user?
Maybe you can achieve some valuable knowledge on this topic:
Dirichlet Process Clustering
Dirichlet-based clustering algorithm can adaptively determine the number of clusters according to the distribution of observation data.
3) Can we use support vector machine or any other supervised learning algorithm for outlier detection?
Any Supervised learning algorithm needs enough labeled training data to construct classifiers. However, a balanced training dataset is not always available for real world problem, such as intrusion detection, medical diagnostics. According to the definition of Hawkins Outlier("Identification of Outliers". Chapman and Hall, London, 1980), the number of normal data is much larger than that of outliers. Most supervised learning algorithms can't achieve an efficient classifier on the above unbalanced dataset.
4) What is the pros and cons of each approach?
Over the past several decades, the research on outlier detection varies from the global computation to the local analysis, and the descriptions of outliers vary from the binary interpretations to probabilistic representations. According to hypotheses of outlier detection models, outlier detection algorithms can be divided into four kinds: Statistic-based algorithms, Cluster-based algorithms, Nearest Neighborhood based algorithms, and Classifier-based algorithms. There are several valuable surveys on outlier detection:
Hodge, V. and Austin, J. "A survey of outlier detection methodologies", Journal of Artificial Intelligence Review, 2004.
Chandola, V. and Banerjee, A. and Kumar, V. "Outlier detection: A survey", ACM Computing Surveys, 2007.
k-means is rather sensitive to noise in the data set. It works best when you remove the outliers beforehand.
No. Any cluster analysis algorithm that claims to be parameter-free usually is heavily restricted, and often has hidden parameters - a common parameter is the distance function, for example. Any flexible cluster analysis algorithm will at least accept a custom distance function.
one-class classifiers are a popular machine-learning approach to outlier detection. However, supervised approaches aren't always appropriate for detecting _previously_unseen_ objects. Plus, they can overfit when the data already contains outliers.
Every approach has its pros and cons, that is why they exist. In a real setting, you will have to try most of them to see what works for your data and setting. It's why outlier detection is called knowledge discovery - you have to explore if you want to discover something new ...
You may want to have a look at the ELKI data mining framework. It is supposedly the largest collection of outlier detection data mining algorithms. It's open source software, implemented in Java, and includes some 20+ outlier detection algorithms. See the list of available algorithms.
Note that most of these algorithms are not based on clustering. Many clustering algorithms (in particular k-means) will try to cluster instances "no matter what". Only few clustering algorithms (e.g. DBSCAN) actually consider the case that maybe not all instance belong into clusters! So for some algorithms, outliers will actually prevent a good clustering!