Ridge estimator in Weka's Logistic function - weka

I'm reading the article "Ridge Estimators in Logistic Regression" by le Cessie and van Houwelingen, which is cited in Weka's documentation on the logistic regression function. I have to say, my maths is shaky in this area (it's been a while). In particular, I'm trying to work out the logic behind how the ridge parameter works, and what its main purpose is.
The authors say that ridge estimators improve parameter estimates and reduce error in future predictions (this is in the abstract). I'm not exactly sure what an "ill posed problem" is, but as I understand it, the ridge estimator is meant to be a method of regularisation for this type of problem.
What do different values of the ridge parameter in Weka's Logistic regression do to change the performance of the logistic regression?
Does the ridge parameter involve the computation of a Tikhonov matrix that favours minimum residuals?
I'm sorry if I combined too many questions into one post. I think I understand what the ridge parameter is meant to do, but not how it does it.
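To make the "how" a bit more concrete, here is a minimal sketch of ridge-penalised logistic regression in plain Python. It assumes the penalty is the ridge value times the sum of squared coefficients added to the negative log-likelihood, which is the form described by le Cessie and van Houwelingen. It is not Weka's code: the function names, the gradient-descent optimiser, the unpenalised intercept, and the toy data are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_ridge_logistic(X, y, ridge, steps=5000, lr=0.1):
    """Minimise -loglik(w, b) + ridge * sum(w_j^2) by plain gradient descent.

    The intercept b is left unpenalised (an assumption of this sketch).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the negative log-likelihood
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(d):
            grad_w[j] += 2.0 * ridge * w[j]  # gradient of the ridge penalty
            w[j] -= lr * grad_w[j] / n
        b -= lr * grad_b / n
    return w, b

# Toy, perfectly separable data: without a penalty the coefficient keeps growing.
X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y = [0, 0, 0, 1, 1, 1]

norms = {}
for ridge in (0.0, 0.1, 1.0, 10.0):
    w, b = fit_ridge_logistic(X, y, ridge)
    norms[ridge] = math.sqrt(sum(wj * wj for wj in w))
    print(ridge, norms[ridge])
```

In this sketch, larger ridge values pull the coefficients towards zero. On separable or nearly collinear data, where the unpenalised estimates are unstable, that shrinkage is what keeps the estimates finite and less variable.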

Related

Ridge penalizing l2 norm of coefficients

What does this mean?:
Ridge penalizes the l2 norm of the coefficients, or the Euclidean length of w.
It is from the textbook Introduction to Machine Learning with Python.
Ridge regression penalizes the model coefficients in order to correct for overfitting. As you have probably learned, overfitting occurs when you have too many parameters in your model which causes it to memorize the data (which leads to poor generalization). In ridge regression you add a bias to the regression estimates in order to reduce the variance. This works by adding a penalty term to the loss function in order to decrease the contribution of each feature to the model outcome. I hope this helps!
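As a small illustration of what "the Euclidean length of w" means, the sketch below computes the l2 norm of a coefficient vector and a ridge-style loss, assuming the usual form of squared error plus alpha times the squared l2 norm. The helper names and the toy numbers are made up for this example.

```python
import math

def l2_norm(w):
    # Euclidean length of the coefficient vector: sqrt(w_1^2 + ... + w_d^2).
    return math.sqrt(sum(wj * wj for wj in w))

def ridge_loss(X, y, w, alpha):
    # Sum of squared errors plus alpha times the squared l2 norm of w.
    sq_err = sum(
        (sum(wj * xj for wj, xj in zip(w, xi)) - yi) ** 2
        for xi, yi in zip(X, y)
    )
    return sq_err + alpha * sum(wj * wj for wj in w)

w = [3.0, 4.0]
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 4.0]

norm_w = l2_norm(w)                       # 5.0
loss_no_penalty = ridge_loss(X, y, w, 0.0)  # 0.0: w fits the data exactly
loss_penalised = ridge_loss(X, y, w, 0.5)   # 12.5: the penalty alone
print(norm_w, loss_no_penalty, loss_penalised)
```

Even though w fits the toy data perfectly, the penalised loss is non-zero, so the optimiser is pushed towards a shorter w unless the extra length pays for itself in reduced error.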

scikit learn averaged perceptron classifier

I am new to machine learning and I want to do a 2-class classification with only a few attributes. From researching online, I have learned that the two-class averaged perceptron algorithm is good for two-class classification with a linear model.
However, I have been reading through the Scikit-learn documentation, and I am unsure whether Scikit-learn provides an averaged perceptron algorithm.
I wonder if the sklearn.linear_model.Perceptron class can be implemented as the two-class averaged perceptron algorithm by setting up the parameters correctly.
I would very much appreciate your help.
I'm sure someone will correct me if I'm wrong but I do not believe Averaged Perceptron is implemented in sklearn. If I recall correctly, Perceptron in sklearn is simply SGD with certain default parameters.
With that said, have you tried good old logistic regression? While it may not be the sexiest algorithm around, it often does provide good results and can serve as a baseline to see if you need to explore more complicated methods.
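For reference, the averaged perceptron itself is short enough to write by hand. The following is a minimal plain-Python sketch, not scikit-learn code: the function names and toy data are invented here, and labels are assumed to be -1 or +1.

```python
def train_averaged_perceptron(X, y, epochs=10):
    """Two-class averaged perceptron; labels y must be -1 or +1."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    w_sum, b_sum = [0.0] * d, 0.0
    count = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score <= 0:
                # Mistake: apply the standard perceptron update.
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
            # Accumulate the current weights after every example,
            # whether or not an update happened.
            w_sum = [s + wj for s, wj in zip(w_sum, w)]
            b_sum += b
            count += 1
    # The averaged model is the mean of all intermediate weight vectors.
    return [s / count for s in w_sum], b_sum / count

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

X = [[2.0, 1.0], [1.0, 3.0], [3.0, 2.5],
     [-1.0, -2.0], [-2.0, -1.0], [-3.0, -1.5]]
y = [1, 1, 1, -1, -1, -1]

w_avg, b_avg = train_averaged_perceptron(X, y)
preds = [predict(w_avg, b_avg, x) for x in X]
print(w_avg, b_avg, preds)
```

Averaging tends to make the final classifier less sensitive to the last few updates. If you would rather stay inside scikit-learn, SGDClassifier has an `average` parameter that may be worth checking in the documentation for your version.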

Logistic Regression with pymc3 - what's the prior for build in glm?

I could not find a good explanation of what exactly is going on when using glm with pymc3 for logistic regression. So I compared the GLM version to an explicit pymc3 model. I started to write an ipython notebook for documentation, see:
http://christianherta.de/lehre/dataScience/machineLearning/mcmc/logisticRegressionPymc3.slides.php
What I don't understand is:
What prior is used for the parameters in GLM? I assume they are also normally distributed. I got different results with my explicit model in comparison to the built-in GLM (see link above).
With less data the sampling gets stuck and/or I get really poor results. With more training data I could not observe this behaviour. Is this normal for MCMC?
There are more issues in the notebook.
Thanks for your answer.
What prior is used for the parameters in GLM?
GLM is the name for a family of methods. There are two popular priors: Gaussian (corresponds to l2 regularization) and Laplacian (corresponds to l1). Usually the first one is used.
With less data the sampling gets stuck and/or I get really poor results. With more training data I could not observe this behaviour. Is this normal for MCMC?
Did you play with the prior parameters? If the model behaves badly with a small amount of data, this may be due to a strong prior (= too much regularization), which becomes the dominant term in the optimization.
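The correspondence between priors and regularization can be written out by looking at the MAP estimate. This is a sketch under the assumption of independent zero-mean priors on each coefficient; the symbols sigma (Gaussian standard deviation) and b (Laplace scale) are introduced here for illustration.

```latex
% MAP estimate: maximise log-likelihood plus log-prior
\hat{w}_{\mathrm{MAP}} = \arg\max_{w}\; \bigl[\log p(y \mid X, w) + \log p(w)\bigr]

% Gaussian prior, w_j \sim \mathcal{N}(0, \sigma^{2}): an l2 penalty
\hat{w}_{\mathrm{MAP}} = \arg\min_{w}\; \Bigl[-\log p(y \mid X, w)
    + \frac{1}{2\sigma^{2}} \sum_{j} w_j^{2}\Bigr]

% Laplace prior, w_j \sim \mathrm{Laplace}(0, b): an l1 penalty
\hat{w}_{\mathrm{MAP}} = \arg\min_{w}\; \Bigl[-\log p(y \mid X, w)
    + \frac{1}{b} \sum_{j} \lvert w_j \rvert\Bigr]
```

A small sigma (a tight prior) gives a large penalty weight 1/(2 sigma^2). With little data the likelihood term is small by comparison, so the prior dominates the result.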

Stata - Tobit - Lagrange Multiplier Test

Community,
I am running a left- and right-censored tobit regression model. The dependent variable is the proportion of cash used in M&A transactions running from 0 to 1.
I assume heteroskedasticity to be prevalent due to the characteristics of my cross-sectional sample as well as the BPCW test for the LS regression model. In order to test the tobit specifications, I used bctobit. However, bctobit is not applicable for right-censored data.
This gives rise to the following question:
- Is there another user-written command to test for the tobit specifications with right- and left-censored data?
Thanks a lot for your efforts!
The most immediate question to me here is statistical. From what you say this approach is inadvisable, so how to implement it is immaterial.
I don't think Tobit makes much sense for variables that are defined to lie in an interval. Censoring to me implies that some high or low values might have been observed in principle, but in practice are recorded as less extreme values. It seems to me that logit or probit are the appropriate link functions for proportional responses, and in Stata that means glm with e.g. a logit link.
Regardless of that, do you regard linear dependence as expected here?
For an excellent concise review making this point, see http://www.stata-journal.com/sjpdf.html?articlenum=st0147

Outlier detection in data mining [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I have a few sets of questions regarding outlier detection:
Can we find outliers using k-means and is this a good approach?
Is there any clustering algorithm which does not accept any input from the user?
Can we use support vector machine or any other supervised learning algorithm for outlier detection?
What are the pros and cons of each approach?
I will limit myself to what I think is essential to give some clues about all of your questions, because this is the topic of a lot of textbooks and they would probably be better addressed in separate questions.
I wouldn't use k-means for spotting outliers in a multivariate dataset, for the simple reason that the k-means algorithm is not built for that purpose: You will always end up with a solution that minimizes the total within-cluster sum of squares (and hence maximizes the between-cluster SS because the total variance is fixed), and the outlier(s) will not necessarily define their own cluster. Consider the following example in R:
set.seed(123)
sim.xy <- function(n, mean, sd) cbind(rnorm(n, mean[1], sd[1]),
                                      rnorm(n, mean[2], sd[2]))
# generate three clouds of points, well separated in the 2D plane
xy <- rbind(sim.xy(100, c(0,0), c(.2,.2)),
            sim.xy(100, c(2.5,0), c(.4,.2)),
            sim.xy(100, c(1.25,.5), c(.3,.2)))
xy[1,] <- c(0,2) # convert 1st obs. to an outlying value
km3 <- kmeans(xy, 3) # ask for three clusters
km4 <- kmeans(xy, 4) # ask for four clusters
As can be seen in the next figure, the outlying value is never recovered as such: It will always belong to one of the other clusters.
One possibility, however, would be to use a two-stage approach in which extreme points (here defined as vectors far away from their cluster centroids) are removed iteratively, as described in the following paper: Improving K-Means by Outlier Removal (Hautamäki, et al.).
This bears some resemblance with what is done in genetic studies to detect and remove individuals which exhibit genotyping error, or individuals that are siblings/twins (or when we want to identify population substructure), while we only want to keep unrelated individuals; in this case, we use multidimensional scaling (which is equivalent to PCA, up to a constant for the first two axes) and remove observations above or below 6 SD on any one of say the top 10 or 20 axes (see for example, Population Structure and Eigenanalysis, Patterson et al., PLoS Genetics 2006 2(12)).
A common alternative is to use ordered robust Mahalanobis distances that can be plotted (in a QQ plot) against the expected quantiles of a Chi-squared distribution, as discussed in the following paper:
R.G. Garrett (1989). The chi-square plot: a tool for multivariate outlier recognition. Journal of Geochemical Exploration 32(1/3): 319-341.
(It is available in the mvoutlier R package.)
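The idea behind the chi-square comparison can be sketched in a few lines of plain Python for the 2D case. Note the simplifications: this uses the classical sample mean and covariance rather than the robust estimates mentioned above, and it applies a single cutoff instead of drawing a QQ plot. The function names, the 0.975 level, and the toy data are assumptions made for illustration.

```python
import math

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2D point from the sample mean,
    using the classical (non-robust) mean and covariance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    dists = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form d' S^{-1} d with the 2x2 inverse written out.
        dists.append((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)
    return dists

def chi2_quantile_2df(p):
    # For 2 degrees of freedom the chi-squared CDF is 1 - exp(-x / 2),
    # so the quantile has a closed form.
    return -2.0 * math.log(1.0 - p)

# A regular 5 x 5 grid of "ordinary" points plus one outlying observation.
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
points = [(gx, gy) for gx in grid for gy in grid]
points.append((4.0, -4.0))

d2 = mahalanobis_sq_2d(points)
cutoff = chi2_quantile_2df(0.975)
flagged = [i for i, d in enumerate(d2) if d > cutoff]
print(cutoff, flagged)
```

Because a single extreme point inflates the classical covariance estimate, its own distance is understated. That masking effect is the reason robust estimates of location and scatter are preferred in practice.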
It depends on what you call user input. I interpret your question as whether some algorithm can process automatically a distance matrix or raw data and stop on an optimal number of clusters. If this is the case, and for any distance-based partitioning algorithm, then you can use any of the available validity indices for cluster analysis; a good overview is given in
Handl, J., Knowles, J., and Kell, D.B. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics 21(15): 3201-3212.
that I discussed on Cross Validated. You can for instance run several instances of the algorithm on different random samples (using bootstrap) of the data, for a range of cluster numbers (say, k=1 to 20), and select k according to the criterion being optimized (average silhouette width, cophenetic correlation, etc.); this can be fully automated, with no need for user input.
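A stripped-down version of this selection loop is sketched below in plain Python. It picks k by average silhouette width on 1D toy data. To stay short it leaves out the bootstrap resampling, uses a deterministic farthest-point initialisation, and starts at k=2 because the silhouette is undefined for a single cluster. All function names and the data are made up for this example.

```python
def kmeans_1d(xs, k, iters=50):
    # Deterministic farthest-point initialisation (an assumption of this
    # sketch; standard k-means usually starts from random centers).
    centers = [xs[0]]
    while len(centers) < k:
        centers.append(max(xs, key=lambda x: min(abs(x - c) for c in centers)))
    labels = [0] * len(xs)
    for _ in range(iters):
        # Assignment step: nearest center.
        labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in xs]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

def mean_silhouette(xs, labels):
    clusters = set(labels)
    total = 0.0
    for i, (x, li) in enumerate(zip(xs, labels)):
        own = [abs(x - y) for j, (y, lj) in enumerate(zip(xs, labels))
               if lj == li and j != i]
        if not own:
            continue  # singleton cluster: silhouette taken as 0
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(abs(x - y) for y, lj in zip(xs, labels) if lj == c) / labels.count(c)
            for c in clusters if c != li
        )
        total += (b - a) / max(a, b)
    return total / len(xs)

xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
scores = {k: mean_silhouette(xs, kmeans_1d(xs, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data the three well-separated groups give the highest average silhouette width, so k=3 is selected without any user input beyond the range of k to search.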
There exist other forms of clustering, based on density (clusters are seen as regions where objects are unusually common) or distribution (clusters are sets of objects that follow a given probability distribution). Model-based clustering, as it is implemented in Mclust, for example, allows one to identify clusters in a multivariate dataset by spanning a range of shapes for the variance-covariance matrix for a varying number of clusters and choosing the best model according to the BIC criterion.
This is a hot topic in classification, and some studies have focused on SVM to detect outliers, especially when they are misclassified. A simple Google query will return a lot of hits, e.g. Support Vector Machine for Outlier Detection in Breast Cancer Survivability Prediction by Thongkam et al. (Lecture Notes in Computer Science 2008 4977/2008 99-109; this article includes comparison to ensemble methods). The very basic idea is to use a one-class SVM to capture the main structure of the data by fitting a multivariate (e.g., Gaussian) distribution to it; objects that lie on or just outside the boundary might be regarded as potential outliers. (In a certain sense, density-based clustering would perform equally well, as defining what an outlier really is becomes more straightforward given an expected distribution.)
Other approaches for unsupervised, semi-supervised, or supervised learning are readily found on Google, e.g.
Hodge, V.J. and Austin, J. A Survey of Outlier Detection Methodologies.
Vinueza, A. and Grudic, G.Z. Unsupervised Outlier Detection and Semi-Supervised Learning.
Escalante, H.J. A Comparison of Outlier Detection Algorithms for Machine Learning.
A related topic is anomaly detection, about which you will find a lot of papers.
That really deserves a new (and probably more focused) question :-)
1) Can we find outliers using k-means, is it a good approach?
Cluster-based approaches are designed to find clusters, and can be used to detect outliers as by-products. During clustering, outliers can affect the locations of the cluster centers, and can even aggregate into a micro-cluster. These characteristics make cluster-based approaches impractical for complicated databases.
2) Is there any clustering algorithm which does not accept any input from the user?
You may find some useful background on this topic here:
Dirichlet Process Clustering
A Dirichlet-based clustering algorithm can adaptively determine the number of clusters according to the distribution of the observed data.
3) Can we use support vector machine or any other supervised learning algorithm for outlier detection?
Any supervised learning algorithm needs enough labeled training data to construct classifiers. However, a balanced training dataset is not always available for real-world problems such as intrusion detection or medical diagnostics. According to Hawkins's definition of an outlier ("Identification of Outliers", Chapman and Hall, London, 1980), normal data points far outnumber outliers. Most supervised learning algorithms cannot learn an effective classifier from such an unbalanced dataset.
4) What are the pros and cons of each approach?
Over the past several decades, research on outlier detection has moved from global computation to local analysis, and descriptions of outliers have moved from binary interpretations to probabilistic representations. According to the hypotheses behind the models, outlier detection algorithms can be divided into four kinds: statistics-based algorithms, cluster-based algorithms, nearest-neighbor-based algorithms, and classifier-based algorithms. There are several valuable surveys on outlier detection:
Hodge, V. and Austin, J. "A survey of outlier detection methodologies", Journal of Artificial Intelligence Review, 2004.
Chandola, V. and Banerjee, A. and Kumar, V. "Outlier detection: A survey", ACM Computing Surveys, 2007.
k-means is rather sensitive to noise in the data set. It works best when you remove the outliers beforehand.
No. Any cluster analysis algorithm that claims to be parameter-free usually is heavily restricted, and often has hidden parameters - a common parameter is the distance function, for example. Any flexible cluster analysis algorithm will at least accept a custom distance function.
One-class classifiers are a popular machine-learning approach to outlier detection. However, supervised approaches aren't always appropriate for detecting previously unseen objects. Plus, they can overfit when the data already contains outliers.
Every approach has its pros and cons, that is why they exist. In a real setting, you will have to try most of them to see what works for your data and setting. It's why outlier detection is called knowledge discovery - you have to explore if you want to discover something new ...
You may want to have a look at the ELKI data mining framework. It is supposedly the largest collection of outlier detection data mining algorithms. It's open source software, implemented in Java, and includes some 20+ outlier detection algorithms. See the list of available algorithms.
Note that most of these algorithms are not based on clustering. Many clustering algorithms (in particular k-means) will try to cluster instances "no matter what". Only a few clustering algorithms (e.g. DBSCAN) actually consider the case that maybe not all instances belong to clusters! So for some algorithms, outliers will actually prevent a good clustering!