plagiarism detection using damerau levenshtein algorithm - levenshtein-distance

how will i simulate the damerau leveshtein distance algorithm so as to detect plagiarism in documents? thanks!

Levenshtein distance is primarily used to compare two strings, such as comparing names or finding alternates in a spell checker. Using this algorithm for a whole document to detect plagiarism is not typical.
There is some work in the area though. Everything points to this article, which requires subscription:
Plagiarism Detection Using the Levenshtein Distance and Smith-Waterman Algorithm
http://www.computer.org/portal/web/csdl/doi/10.1109/ICICIC.2008.422
Plagiarism in texts is issues of increasing concern to the academic community. Now most common text plagiarism occurs by making a variety of minor alterations that include the insertion, deletion, or substitution of words. Such simple changes, however, require excessive string comparisons. In this paper, we present a hybrid plagiarism detection method. We investigate the use of a diagonal line, which is derived from Levenshtein distance, and simplified SmithWaterman algorithm that is a classical tool in the identification and quantification of local similarities in biological sequences, with a view to the application in the plagiarism detection. Our approach avoids globally involved string comparisons and considers psychological factors, which can yield significant speed-up by experiment results. Based on the results, we indicate the practicality of such improvement using Levenshtein distance and Smith-Waterman algorithm and to illustrate the efficiency gains. In the future, it would be interesting to explore appropriate heuristics in the area of text comparison

Related

How to write tests for mathematical optimization procedures?

I'm working on project where I need to minimize functions by several variables like func(input_parameters, variable_parameters) -> min(variable_parameters).
I use optimizing functions from SciPy, so minimization process is a grey box: I can see the code on GitHub and read about used algorithms, but I'd like to think that it's okay and aim to testing of my own project.
Though, particular libraries shouldn't matter in this question.
At the moment I use few approaches:
Create simple examples and find global/local minima by hand and create test that performs optimization and compares its solution with the right one
If method needs gradients, compare analytically calculated gradients with their numerical approximation in tests
For iterative algorithms built upon ones provided by SciPy check that sequence of function values is monotonically nonincreasing in tests
Is there a book or an article about testing of mathematical optimization procedures?
P. S. I'm not talking about Test functions for optimization
, I'm asking about approaches used to test optimization procedure to find bugs faster.
I find the hypothesis library really useful for testing optimisation algorithms in development.
You can set it up to generate random test cases (functions, linear programs, etc) according to some specification. The idea is that you pass these to your algorithm and test for known invariants. For example you could have it throw random problems or subproblems at your algorithm and check that (for example):
Gradient descent methods produce a series of nonincreasing objectives
Local search finds a solution with no better neighbours
Heuristics maintain feasibility
There's a useful PyCon talk here explaining the idea of property based testing. It focuses more on testing APIs than algorithms, but I think the ideas transfer. I've found this approach does a pretty good job finding cases of unexpected behaviour as I'm writing a new algorithm.

Real reason for speed up in fasttext

What is the real reason for speed-up, even though the pipeline mentioned in the fasttext paper uses techniques - negative sampling and heirerchichal softmax; in earlier word2vec papers. I am not able to clearly understand the actual difference, which is making this speed up happen ?
Is there that much of a speed-up?
I don't think there are any algorithmic breakthroughs which make the word2vec-equivalent word-vector training in FastText significantly faster. (And if you're using the character-ngrams option in FastText, to allow post-training synthesis of vectors for unseen words based on substrings shared with training-words, I'd expect the training to be slower, because every word requires training of its substring vectors as well.)
Any speedups in FastText are likely just because the code is well-tuned, with the benefit of more implementation experience.
To be efficient on datasets with a very large number of categories, Fast text uses a hierarchical classifier instead of a flat structure, in which the different categories are organized in a tree (think binary tree instead of list). This reduces the time complexities of training and testing text classifiers from linear to logarithmic with respect to the number of classes. FastText also exploits the fact that classes are imbalanced (some classes appearing more often than other) by using the Huffman algorithm to build the tree used to represent categories. The depth in the tree of very frequent categories is, therefore, smaller than for infrequent ones, leading to further computational efficiency.
Reference link: https://research.fb.com/blog/2016/08/fasttext/

Outlier detection in small sets

Is there a good algorithm for detecting outliers in small sets of decimal numbers? The best idea I have come up with so far is a kind of recursive standard deviation based approach, but it seems a bit computationally expensive.
I'm using c++, so any existing functionality in say Boost or other maths helper libraries is welcome in your answers.
Thanks.
You can do it in O(n) time with an online variance algorithm (http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm) and then a second pass to mark outliers.

Outlier detection in data mining [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I have a few sets of questions regarding outlier detection:
Can we find outliers using k-means and is this a good approach?
Is there any clustering algorithm which does not accept any input from the user?
Can we use support vector machine or any other supervised learning algorithm for outlier detection?
What are the pros and cons of each approach?
I will limit myself to what I think is essential to give some clues about all of your questions, because this is the topic of a lot of textbooks and they might probably be better addressed in separate questions.
I wouldn't use k-means for spotting outliers in a multivariate dataset, for the simple reason that the k-means algorithm is not built for that purpose: You will always end up with a solution that minimizes the total within-cluster sum of squares (and hence maximizes the between-cluster SS because the total variance is fixed), and the outlier(s) will not necessarily define their own cluster. Consider the following example in R:
set.seed(123)
sim.xy <- function(n, mean, sd) cbind(rnorm(n, mean[1], sd[1]),
rnorm(n, mean[2],sd[2]))
# generate three clouds of points, well separated in the 2D plane
xy <- rbind(sim.xy(100, c(0,0), c(.2,.2)),
sim.xy(100, c(2.5,0), c(.4,.2)),
sim.xy(100, c(1.25,.5), c(.3,.2)))
xy[1,] <- c(0,2) # convert 1st obs. to an outlying value
km3 <- kmeans(xy, 3) # ask for three clusters
km4 <- kmeans(xy, 4) # ask for four clusters
As can be seen in the next figure, the outlying value is never recovered as such: It will always belong to one of the other clusters.
One possibility, however, would be to use a two-stage approach where one's removing extremal points (here defined as vector far away from their cluster centroids) in an iterative manner, as described in the following paper: Improving K-Means by Outlier Removal (Hautamäki, et al.).
This bears some resemblance with what is done in genetic studies to detect and remove individuals which exhibit genotyping error, or individuals that are siblings/twins (or when we want to identify population substructure), while we only want to keep unrelated individuals; in this case, we use multidimensional scaling (which is equivalent to PCA, up to a constant for the first two axes) and remove observations above or below 6 SD on any one of say the top 10 or 20 axes (see for example, Population Structure and Eigenanalysis, Patterson et al., PLoS Genetics 2006 2(12)).
A common alternative is to use ordered robust mahalanobis distances that can be plotted (in a QQ plot) against the expected quantiles of a Chi-squared distribution, as discussed in the following paper:
R.G. Garrett (1989). The chi-square plot: a tools for multivariate outlier recognition. Journal of Geochemical Exploration 32(1/3): 319-341.
(It is available in the mvoutlier R package.)
It depends on what you call user input. I interpret your question as whether some algorithm can process automatically a distance matrix or raw data and stop on an optimal number of clusters. If this is the case, and for any distance-based partitioning algorithm, then you can use any of the available validity indices for cluster analysis; a good overview is given in
Handl, J., Knowles, J., and Kell, D.B.
(2005). Computational cluster validation in post-genomic data analysis.
Bioinformatics 21(15): 3201-3212.
that I discussed on Cross Validated. You can for instance run several instances of the algorithm on different random samples (using bootstrap) of the data, for a range of cluster numbers (say, k=1 to 20) and select k according to the optimized criteria taht was considered (average silhouette width, cophenetic correlation, etc.); it can be fully automated, no need for user input.
There exist other forms of clustering, based on density (clusters are seen as regions where objects are unusually common) or distribution (clusters are sets of objects that follow a given probability distribution). Model-based clustering, as it is implemented in Mclust, for example, allows to identify clusters in a multivariate dataset by spanning a range of shape for the variance-covariance matrix for a varying number of clusters and to choose the best model according to the BIC criterion.
This is a hot topic in classification, and some studies focused on SVM to detect outliers especially when they are misclassified. A simple Google query will return a lot of hits, e.g. Support Vector Machine for Outlier Detection in Breast Cancer Survivability Prediction by Thongkam et al. (Lecture Notes in Computer Science 2008 4977/2008 99-109; this article includes comparison to ensemble methods). The very basic idea is to use a one-class SVM to capture the main structure of the data by fitting a multivariate (e.g., gaussian) distribution to it; objects that on or just outside the boundary might be regarded as potential outliers. (In a certain sense, density-based clustering would perform equally well as defining what an outlier really is is more straightforward given an expected distribution.)
Other approaches for unsupervised, semi-supervised, or supervised learning are readily found on Google, e.g.
Hodge, V.J. and Austin, J. A Survey of Outlier Detection Methodologies.
Vinueza, A. and Grudic, G.Z. Unsupervised Outlier Detection and Semi-Supervised Learning.
Escalante, H.J. A Comparison of Outlier Detection Algorithms for Machine Learning.
A related topic is anomaly detection, about which you will find a lot of papers.
That really deserves a new (and probably more focused) question :-)
1) Can we find outliers using k-means, is it a good approach?
Cluster-based approaches are optimal to find clusters, and can be used to detect outliers as
by-products. In the clustering processes, outliers can affect the locations of the cluster centers, even aggregating as a micro-cluster. These characteristics make the cluster-based approaches infeasible to complicated databases.
2) Is there any clustering algorithm which does not accept any input from the user?
Maybe you can achieve some valuable knowledge on this topic:
Dirichlet Process Clustering
Dirichlet-based clustering algorithm can adaptively determine the number of clusters according to the distribution of observation data.
3) Can we use support vector machine or any other supervised learning algorithm for outlier detection?
Any Supervised learning algorithm needs enough labeled training data to construct classifiers. However, a balanced training dataset is not always available for real world problem, such as intrusion detection, medical diagnostics. According to the definition of Hawkins Outlier("Identification of Outliers". Chapman and Hall, London, 1980), the number of normal data is much larger than that of outliers. Most supervised learning algorithms can't achieve an efficient classifier on the above unbalanced dataset.
4) What is the pros and cons of each approach?
Over the past several decades, the research on outlier detection varies from the global computation to the local analysis, and the descriptions of outliers vary from the binary interpretations to probabilistic representations. According to hypotheses of outlier detection models, outlier detection algorithms can be divided into four kinds: Statistic-based algorithms, Cluster-based algorithms, Nearest Neighborhood based algorithms, and Classifier-based algorithms. There are several valuable surveys on outlier detection:
Hodge, V. and Austin, J. "A survey of outlier detection methodologies", Journal of Artificial Intelligence Review, 2004.
Chandola, V. and Banerjee, A. and Kumar, V. "Outlier detection: A survey", ACM Computing Surveys, 2007.
k-means is rather sensitive to noise in the data set. It works best when you remove the outliers beforehand.
No. Any cluster analysis algorithm that claims to be parameter-free usually is heavily restricted, and often has hidden parameters - a common parameter is the distance function, for example. Any flexible cluster analysis algorithm will at least accept a custom distance function.
one-class classifiers are a popular machine-learning approach to outlier detection. However, supervised approaches aren't always appropriate for detecting _previously_unseen_ objects. Plus, they can overfit when the data already contains outliers.
Every approach has its pros and cons, that is why they exist. In a real setting, you will have to try most of them to see what works for your data and setting. It's why outlier detection is called knowledge discovery - you have to explore if you want to discover something new ...
You may want to have a look at the ELKI data mining framework. It is supposedly the largest collection of outlier detection data mining algorithms. It's open source software, implemented in Java, and includes some 20+ outlier detection algorithms. See the list of available algorithms.
Note that most of these algorithms are not based on clustering. Many clustering algorithms (in particular k-means) will try to cluster instances "no matter what". Only few clustering algorithms (e.g. DBSCAN) actually consider the case that maybe not all instance belong into clusters! So for some algorithms, outliers will actually prevent a good clustering!

Least Squares Regression in C/C++

How would one go about implementing least squares regression for factor analysis in C/C++?
the gold standard for this is LAPACK. you want, in particular, xGELS.
When I've had to deal with large datasets and large parameter sets for non-linear parameter fitting I used a combination of RANSAC and Levenberg-Marquardt. I'm talking thousands of parameters with tens of thousands of data-points.
RANSAC is a robust algorithm for minimizing noise due to outliers by using a reduced data set. Its not strictly Least Squares, but can be applied to many fitting methods.
Levenberg-Marquardt is an efficient way to solve non-linear least-squares numerically.
The convergence rate in most cases is between that of steepest-descent and Newton's method, without requiring the calculation of second derivatives. I've found it to be faster than Conjugate gradient in the cases I've examined.
The way I did this was to set up the RANSAC an outer loop around the LM method. This is very robust but slow. If you don't need the additional robustness you can just use LM.
Get ROOT and use TGraph::Fit() (or TGraphErrors::Fit())?
Big, heavy piece of software to install just of for the fitter, though. Works for me because I already have it installed.
Or use GSL.
If you want to implement an optimization algorithm by yourself Levenberg-Marquard seems to be quite difficult to implement. If really fast convergence is not needed, take a look at the Nelder-Mead simplex optimization algorithm. It can be implemented from scratch in at few hours.
http://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
Have a look at
http://www.alglib.net/optimization/
They have C++ implementations for L-BFGS and Levenberg-Marquardt.
You only need to work out the first derivative of your objective function to use these two algorithms.
I've used TNT/JAMA for linear least-squares estimation. It's not very sophisticated but is fairly quick + easy.
Lets talk first about factor analysis since most of the discussion above is about regression. Most of my experience is with software like SAS, Minitab, or SPSS, that solves the factor analysis equations, so I have limited experience in solving these directly. That said, that the most common implementations do not use linear regression to solve the equations. According to this, the most common methods used are principal component analysis and principal factor analysis. In a text on Applied Multivariate Analysis (Dallas Johnson), no less that seven methods are documented each with their own pros and cons. I would strongly recommend finding an implementation that gives you factor scores rather than programming a solution from scratch.
The reason why there's different methods is that you can choose exactly what you're trying to minimize. There a pretty comprehensive discussion of the breadth of methods here.