ELKI: Running LOF with varying k - compare

Can I run LOF with varying k through ELKI so that it is easy to compare which k is the best?
Normally you choose a k, and then you can see the ROCAUC, for example. I want to find the best k for the data set, so I need to compare multiple runs. Is there an easier way to do that than manually changing the value of k and re-running? For example, I want to compare all k in [1-100].
Thanks

The Greedy Ensemble example shows how to run outlier detection methods for a whole range of k at once, efficiently (by computing the nearest neighbors only once, it is a lot faster!), using the ComputeKNNOutlierScores application included with ELKI.
The application EvaluatePrecomputedOutlierScores can be used to bulk-evaluate these results with multiple measures.
This is what we used for the publication
G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent and M. E. Houle
On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study
Data Mining and Knowledge Discovery 30(4): 891-927, 2016, DOI: 10.1007/s10618-015-0444-8
On the supplementary material website, you can look up the best results for many standard data sets, as well as download the raw results.
But beware that outlier detection quality results tend to be inconclusive. On one data set one method performs best; on another data set, another method does. There is no clear winner, because data sets are very diverse.

Related

Query re. how to set up an SVM, which SVM variation … and how to define a metric

I’d like to learn how best to set up an SVM in OpenCV (or another C++ library) for my particular problem (or to find out if there is a more appropriate algorithm).
My goal is to receive a weighting of how well an input set of labeled points on a 2D plane compares or fits with a set of ‘ideal’ sets of labeled 2D points.
I hope my illustrations make this clear. The first three boxes, labeled A through C, indicate different ideal placements of 3 points; in my illustrations the labelling is indicated by colour:
The second graphic gives examples of possible inputs:
If I then pass, for instance, example input set 1 to the algorithm, it will compare that input set with each ideal set, as illustrated here:
I would suggest that most observers would agree that the example input 1 is most similar to ideal set A, then B, then C.
My problem is to get not only this ordering out of an algorithm, but ideally also a weighting of how much the input resembles A relative to B and C.
For the example given it might be something like:
A:60%, B:30%, C:10%
Example input 3 might yield something such as:
A:33%, B:32%, C:35% (i.e. different order, and a less 'determined' result)
My end goal is to interpolate between the ideal settings using these weights.
To get the ordering, I'm guessing the 'cost' of fitting the input to each set may have simply been compared anyway (?) ... if so, could this cost be used to find the weighting? Or maybe the relationship is non-linear and some kind of transformation needs to happen? (But still, obviously, relative comparisons were enough to determine the order.)
Am I on track?
Direct question >> Is the OpenCV SVM appropriate? Or, more specifically:
A series of separate binary SVM classifiers, one for each ideal state, and then a final ordering somehow? (i.e. what is the metric?)
A version of an SVM such as a multiclass or structured SVM from another library? (...which I still find hard to grasp conceptually, as the examples seem so unrelated)
Also, another critical component I'm not fully grasping yet is how to define what determines a good fit between any example input set and an ideal set. I was thinking Euclidean distance: do I simply sum the distances? What about outliers? My vector calculus needs a brush-up, but maybe dot products could nose in there somewhere?
Direct question>> How best to define a metric that describes a fit in this case?
The real case would have 10~20 points per set and, time permitting, as many 'ideal' sets of points as possible; let's go with 30 for now. Could I expect to get away with ~2 ms per iteration on a reasonable machine (a MacBook Pro), or does this kind of thing blow up?
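To make the 'sum of per-label Euclidean distances' idea above concrete, here is a minimal C++ sketch. The struct, the one-to-one label matching, and the inverse-cost weighting are illustrative assumptions, not a recommendation of a particular method:

#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Illustrative only: a labeled 2D point (the label could be the colour in the drawings).
struct LabeledPoint {
    int label;
    double x, y;
};

// Cost of fitting an input set to one ideal set: sum of Euclidean distances
// between points that share a label (assumes labels match one-to-one).
double fitCost(const std::vector<LabeledPoint>& input,
               const std::vector<LabeledPoint>& ideal) {
    std::map<int, LabeledPoint> byLabel;
    for (const auto& p : ideal) byLabel[p.label] = p;
    double cost = 0.0;
    for (const auto& p : input) {
        const auto& q = byLabel.at(p.label);
        cost += std::hypot(p.x - q.x, p.y - q.y);
    }
    return cost;
}

// Turn costs into weights: lower cost -> higher weight. Inverse-cost
// normalization is just one possible (assumed) transformation; a softmax of
// negative costs would be another.
std::vector<double> costsToWeights(const std::vector<double>& costs) {
    std::vector<double> w(costs.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < costs.size(); ++i) {
        w[i] = 1.0 / (costs[i] + 1e-9);  // avoid division by zero
        sum += w[i];
    }
    for (auto& v : w) v /= sum;  // weights sum to 1
    return w;
}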
(disclaimer, I have asked this question more generally on Cross Validated, but there isn't much activity there (?))

How to create a vector containing an (artificially generated) Gaussian (normal) distribution?

Suppose I have data (a daily stock chart is a good example, but it could be anything) in which I only know the range (high - low) within which X units sold, but I don't know the exact price at which any given item sold. Assume for simplicity that the price range contains enough buckets (e.g. forty one-cent increments for a 40-cent range) to make such a distribution practical. How can I go about distributing those items to form a normal bell curve stored in a vector? It doesn't have to be perfect, just realistic.
My (very) naive thinking has been to assume that, since random numbers should form a normal distribution, I can do something like have a binary RNG. If, for example, there are forty buckets, then if a '0' comes up 40 times in a row the 0th bucket gets incremented, and if a '1' comes up 40 times in a row then the 39th bucket gets incremented. If '1' comes up 20 times, the item goes in the middle of the vector. Do this for each item until X units have been accounted for. This may or may not be right, and in any case seems far more inefficient than necessary. I am looking for something more sensible.
This isn't homework, just a problem that has been bugging me and my statistics is not up to snuff. Most literature seems to be about analyzing the distribution after it already exists but not much about how to artificially create one.
I want to write this in C++, so pre-packaged solutions in R or MATLAB or whatnot are not too useful for me.
Thanks. I hope this made sense.
Most literature seems to be about analyzing the distribution after it already exists but not much about how to artificially create one.
There's tons of literature on how to create one. The Box–Muller transform, the Marsaglia polar method (a variant of Box–Muller), and the Ziggurat algorithm are three. (Google those terms.) Both Box–Muller methods are easy to implement.
Better yet, just use an existing random generator that implements one of these algorithms. Both Boost and the new C++11 standard library have such facilities.
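For example, here is a minimal C++11 sketch that distributes X items into equal-width buckets covering a price range, using std::normal_distribution. The function name, the mean-at-midrange choice, and the sigma of one sixth of the range are assumptions for illustration:

#include <random>
#include <vector>

// Distribute `items` samples into `numBuckets` buckets covering [low, high],
// using a normal distribution centred on the middle of the range.
// Assumes high > low and numBuckets > 0.
std::vector<int> fillBuckets(int items, int numBuckets, double low, double high) {
    std::vector<int> buckets(numBuckets, 0);
    std::mt19937 gen{std::random_device{}()};
    const double mean = 0.5 * (low + high);
    const double sigma = (high - low) / 6.0;   // ~99.7% of samples land in range
    std::normal_distribution<double> dist(mean, sigma);
    for (int i = 0; i < items; ++i) {
        double price = dist(gen);
        if (price < low || price >= high) { --i; continue; }  // resample out-of-range draws
        int bin = static_cast<int>((price - low) / (high - low) * numBuckets);
        ++buckets[bin];
    }
    return buckets;
}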
The algorithm that you describe relies on the central limit theorem, which says that a random variable defined as the sum of n random variables from the same distribution tends to a normal distribution as n grows to infinity. Uniformly distributed pseudorandom variables from a computer PRNG are a special case of this general theorem.
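As a small illustration of that idea, the classic trick of summing twelve uniform [0,1) variates and subtracting 6 yields an approximately standard normal variate (twelve is chosen so the variance is exactly 1):

#include <random>

// Approximate a standard normal variate via the central limit theorem:
// the sum of 12 uniform [0,1) variates has mean 6 and variance 1.
double approxStandardNormal(std::mt19937& gen) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    double sum = 0.0;
    for (int i = 0; i < 12; ++i) sum += u(gen);
    return sum - 6.0;
}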
To get a more efficient algorithm, you can view the probability distribution as a sort of space warp that expands the real axis in the middle and shrinks it toward the ends.
Let F: R -> [0,1] be the cumulative distribution function of the normal distribution, let invF be its inverse, and let x be a random variable uniformly distributed on [0,1]; then invF(x) will be a normally distributed random variable.
All you need to implement this is to be able to compute invF(x). Unfortunately this function cannot be expressed in terms of elementary functions; in fact, it is the solution of a nonlinear differential equation. However, you can efficiently solve the equation x = F(y) for y using Newton's method.
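A minimal sketch of that Newton iteration for the standard normal distribution, writing F with std::erfc and using the normal density as F'. The starting guess of 0 and the fixed iteration cap are simplifying assumptions, and extreme tail values would benefit from a better starting point:

#include <cmath>

// Standard normal CDF and PDF.
double normalCdf(double y) { return 0.5 * std::erfc(-y / std::sqrt(2.0)); }
double normalPdf(double y) {
    const double pi = 3.14159265358979323846;
    return std::exp(-0.5 * y * y) / std::sqrt(2.0 * pi);
}

// Solve x = F(y) for y with Newton's method, i.e. compute invF(x) for x in (0,1).
double inverseNormalCdf(double x) {
    double y = 0.0;                      // start at the mean
    for (int i = 0; i < 50; ++i) {
        double step = (normalCdf(y) - x) / normalPdf(y);
        y -= step;
        if (std::fabs(step) < 1e-12) break;
    }
    return y;
}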
What I have described is a simplified presentation of the Inverse transform method. It is a very general approach. There are specialized algorithms for sampling from the normal distribution that are more efficient. These are mentioned in the answer of David Hammen.

Compare similarity algorithms

I want to use string similarity functions to find corrupted data in my database.
I came upon several of them:
Jaro,
Jaro-Winkler,
Levenshtein,
Euclidean and
Q-gram,
I wanted to know: what is the difference between them, and in what situations do they work best?
Expanding on my wiki-walk comment in the errata and noting some of the ground-floor literature on the comparability of algorithms that apply to similar problem spaces, let's explore the applicability of these algorithms before we determine if they're numerically comparable.
From Wikipedia, Jaro-Winkler:
In computer science and statistics, the Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995) and is mainly used in the area of record linkage (duplicate detection). The higher the Jaro–Winkler distance for two strings is, the more similar the strings are. The Jaro–Winkler distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.
Levenshtein distance:
In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance.
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.
Euclidean distance:
In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. By using this formula as distance, Euclidean space (or even any inner product space) becomes a metric space. The associated norm is called the Euclidean norm. Older literature refers to the metric as the Pythagorean metric.
And Q- or n-gram encoding:
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items in question can be phonemes, syllables, letters, words or base pairs according to the application. n-grams are collected from a text or speech corpus.
The two core advantages of n-gram models (and algorithms that use them) are relative simplicity and the ability to scale up: by simply increasing n, a model can be used to store more context with a well-understood space–time tradeoff, enabling small experiments to scale up very efficiently.
The trouble is that these algorithms solve different problems; they have different applicability within the space of all possible algorithms for solving the longest common subsequence problem in your data, or for grafting a usable metric onto it. In fact, not all of these are even metrics, as some of them don't satisfy the triangle inequality.
Instead of going out of your way to define a dubious scheme to detect data corruption, do this properly: use checksums and parity bits for your data. Don't try to solve a much harder problem when a simpler solution will do.
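As a hedged illustration of the checksum idea, here is a minimal sketch using FNV-1a, chosen purely as a simple, well-known non-cryptographic hash; the function name is illustrative, and a CRC or a cryptographic digest would serve the same purpose:

#include <cstdint>
#include <string>

// 64-bit FNV-1a hash of a record's bytes; store it alongside the record and
// recompute it later to detect corruption.
std::uint64_t fnv1a64(const std::string& data) {
    std::uint64_t hash = 14695981039346656037ull;   // FNV offset basis
    for (unsigned char byte : data) {
        hash ^= byte;
        hash *= 1099511628211ull;                   // FNV prime
    }
    return hash;
}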
String similarity helps in a lot of different ways. For example:
Google's "did you mean" results are calculated using string similarity.
String similarity is used to correct OCR errors.
String similarity is used to correct keyboard entry errors.
String similarity is used to find the best-matching alignment of two DNA sequences in bioinformatics.
But one size does not fit all. Every string similarity algorithm is designed for a specific usage, though most of them are similar. For example, the Levenshtein distance measures how many characters you must change to turn one string into the other:
kitten → sitten
Here the distance is 1 character change. You may give different weights to deletion, addition and substitution. For example, OCR-error and keyboard-error models give less weight to certain changes: in OCR, some characters look very similar to others; on a keyboard, some characters sit very near each other. Bioinformatics string similarity allows a lot of insertions.
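For reference, a minimal C++ sketch of the plain, unweighted Levenshtein distance described above, using the classic dynamic-programming recurrence (the function name is illustrative):

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Minimum number of single-character insertions, deletions and substitutions
// needed to turn `a` into `b`.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            curr[j] = std::min({prev[j] + 1,      // deletion
                                curr[j - 1] + 1,  // insertion
                                subst});          // substitution (or match)
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
// levenshtein("kitten", "sitten") == 1, as in the example above.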
Your second example states that the "Jaro–Winkler distance metric is designed and best suited for short strings such as person names"; therefore you should keep your particular problem in mind.
I want to use string similarity functions to find corrupted data in my database.
How is your data corrupted? Is it user error, similar to keyboard input errors? Or is it similar to OCR errors? Or something else entirely?

Identifying local minima in a histogram

I'm interested in finding the local minima in a histogram that roughly resembles the one in the attached plot.
I'd want to find the local minimum at 109.258, and the easiest way to do so would be to identify whether the number of counts at 109.258 is lower than the average number of counts in some interval around (and including) 109.258. It's identifying this interval that's the most difficult part for me.
As for the source of this data, it's a histogram with 100 bins of non-uniform width. Each bin has a value (shown on the x-axis), and a count of the samples falling into that bin (shown on the y-axis). What I'm trying to do is find the "best" place to split the histogram. Each side of the split is propagated down a binary tree, as part of a classification algorithm.
I'm thinking that my best course of action would be to try to fit a curve to this histogram, using something like the Levenberg-Marquardt algorithm, and then compare the local minima to find the "best" one. A proper measure of "best" would include some indication of the significance of that split, measured as the difference between the average counts in the interval to the left and the average counts in the interval to the right, and then maybe weight each difference by the number of counts included to get a composite measure of "best", if that makes sense.
Either way, the computational complexity of the algorithm isn't a huge issue; 100 bins is about the maximum number I'd expect to encounter. However, this calculation will be performed once for each sample, so keeping it linear with respect to the number of bins would, of course, be ideal.
By the way, I'm doing everything in C++, and make use of the boost libraries and STL, so nothing is off-limits in that regard.
Any thoughts or insights concerning best practices would be greatly appreciated!
If I understand correctly, kmore wants to partition two "peaks" based on the largest separation (product of histogram count and bin distance). If this is true:
Find all maxima.
For each maximum, build rectangles as in the figure.
Find the rectangle with the largest white area, which gives you the x range in which to find the desired bin at 109.258.
Levenberg–Marquardt is not a good choice in a rugged optimization terrain, and yours is pretty rugged. There are lots of local minima there. Levenberg–Marquardt might well find the local minimum at about 100. Or it might find one of the two global minima at the extremes of the graph, where the function tails off to zero.
You want something that finds the most significant local minimum. For example, some kind of clustering algorithm. Here is a very simple one:
Step 1:
Find the local extrema in your data set. These are the extrema at the extremes of the range plus the internal local minima and maxima. With your histogram you should have an odd number of such extrema, alternating between minima and maxima.
Step 2:
Find the pair with the smallest delta. This will either be a (local max, local min) or a (local min, local max) pair.
Step 3:
Find a pair of elements to remove, one of
The pair found by step 2
The pair comprising the first element of the pair from step 2 and its predecessor
The pair comprising the last element of the pair from step 2 and its successor
When the found pair includes a boundary point you should use option 2 or 3, as appropriate. For an internal pair, you might want to use some heuristics in choosing between the three choices. Or you could just do the simple thing and use the found pair.
Step 4:
Delete the pair of elements from step 3, keeping track of the deleted pair.
Step 5:
Repeat steps 2 to 4 until there are only three elements left in the extrema data set (the extremes of the range plus a local max, hopefully the global max).
The last-removed minimum is what you want.
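Here is a rough C++ sketch of the pruning procedure above, under some simplifying assumptions: neighbouring counts are never exactly equal, there is at least one internal minimum, and step 3 always removes the found pair itself except at the boundaries, where it shifts inward (the extra heuristics are left out). The struct and function names are illustrative.

#include <cmath>
#include <cstddef>
#include <vector>

struct Extremum {
    std::size_t bin;   // index into the histogram
    double count;      // bin count at that index
    bool isMinimum;
};

// Step 1: boundary points plus internal local minima/maxima of the bin counts.
// Assumes at least two bins and no exactly-equal neighbouring counts.
std::vector<Extremum> findExtrema(const std::vector<double>& y) {
    std::vector<Extremum> ext;
    ext.push_back({0, y.front(), y[1] > y[0]});
    for (std::size_t i = 1; i + 1 < y.size(); ++i) {
        if (y[i] < y[i - 1] && y[i] < y[i + 1]) ext.push_back({i, y[i], true});
        if (y[i] > y[i - 1] && y[i] > y[i + 1]) ext.push_back({i, y[i], false});
    }
    ext.push_back({y.size() - 1, y.back(), y[y.size() - 2] > y.back()});
    return ext;
}

// Steps 2-5: repeatedly remove the adjacent pair with the smallest count
// difference until only three extrema remain; the minimum of the last removed
// pair is the most significant local minimum. Returns its bin index.
std::size_t mostSignificantMinimum(const std::vector<double>& y) {
    std::vector<Extremum> ext = findExtrema(y);
    std::size_t lastRemovedMinBin = 0;
    while (ext.size() > 3) {
        // Step 2: adjacent pair with the smallest delta.
        std::size_t best = 0;
        for (std::size_t i = 0; i + 1 < ext.size(); ++i)
            if (std::fabs(ext[i + 1].count - ext[i].count) <
                std::fabs(ext[best + 1].count - ext[best].count))
                best = i;
        // Step 3 (simplified): never delete a boundary extremum; shift the
        // pair inward instead (options 2/3 in the description above).
        if (best == 0) best = 1;
        if (best + 1 == ext.size() - 1) best -= 1;
        // Step 4: remember the minimum of the removed pair, then delete both.
        const Extremum& a = ext[best];
        const Extremum& b = ext[best + 1];
        lastRemovedMinBin = a.isMinimum ? a.bin : b.bin;
        ext.erase(ext.begin() + best, ext.begin() + best + 2);
    }
    return lastRemovedMinBin;
}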
There are lots of other clustering algorithms. The one I presented is rather crude and obviously isn't particularly fast. One that extends nicely to a lot more data, and to higher-dimensional data, is the Expectation Maximization algorithm. Simulated annealing (Metropolis-Hastings) could also be adapted to this problem.
The problem can, of course, be transformed into one of peak finding by functional manipulation of the data (inversion or negation are obvious candidates).
Alternatively, if the example is typical, one might begin with peak-finding in the untransformed data and seek regions where the peaks are (relatively) widely separated as candidates for containing a good local minimum.
I am forever recommending the method used by the ROOT TSpectrum classes for peak finding.
The underlying algorithm is discussed in detail in:
M. Morhac et al.: Background elimination methods for multidimensional coincidence gamma-ray spectra. Nuclear Instruments and Methods in Physics Research A 401 (1997) 113-132.
M. Morhac et al.: Efficient one- and two-dimensional Gold deconvolution and its application to gamma-ray spectra decomposition. Nuclear Instruments and Methods in Physics Research A 401 (1997) 385-408.
M. Morhac et al.: Identification of peaks in multidimensional coincidence gamma-ray spectra. Nuclear Instruments and Methods in Physics Research A 443 (2000) 108-125.
Copies of these papers are maintained on the ROOT web site and linked in the TSpectrum documentation for those that do not have a subscription to NIM A.
What you want seems to be more complicated than just a local minimum. Also, the local minimum concept depends strongly on your choice of bins.
Have you heard about Otsu's method? It might be more along the lines of what you want.
Here's another Otsu's method link.
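For reference, a minimal sketch of Otsu's method applied to a vector of histogram counts; it treats the bin index as the "intensity", so using the actual (non-uniform) bin centres instead would be a straightforward variation. The function name is illustrative.

#include <cstddef>
#include <vector>

// Otsu's method: choose the split index that maximizes the between-class
// variance of the two sides of the histogram.
std::size_t otsuSplit(const std::vector<double>& counts) {
    double total = 0.0, totalMoment = 0.0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        total += counts[i];
        totalMoment += i * counts[i];
    }
    double wLeft = 0.0, momentLeft = 0.0, bestVariance = -1.0;
    std::size_t bestSplit = 0;
    for (std::size_t t = 0; t + 1 < counts.size(); ++t) {
        wLeft += counts[t];
        momentLeft += t * counts[t];
        double wRight = total - wLeft;
        if (wLeft <= 0.0 || wRight <= 0.0) continue;
        double meanLeft = momentLeft / wLeft;
        double meanRight = (totalMoment - momentLeft) / wRight;
        double betweenVariance =
            wLeft * wRight * (meanLeft - meanRight) * (meanLeft - meanRight);
        if (betweenVariance > bestVariance) {
            bestVariance = betweenVariance;
            bestSplit = t;  // split between bins t and t+1
        }
    }
    return bestSplit;
}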

How to select an unlike number in an array in C++?

I'm using C++ to write a ROOT script for some task. At some point I have an array of doubles in which many are quite similar and one or two are different. I want to average all the numbers except those sore thumbs. How should I approach it? For example, let's consider:
x = [2.3, 2.4, 2.11, 10.5, 1.9, 2.2, 11.2, 2.1]
I want to somehow average all the numbers except 10.5 and 11.2, the dissimilar ones. This algorithm is going to be repeated several thousand times, and the array of doubles has 2000 entries, so optimization (while maintaining readability) is desired. Thanks SO!
Check out:
http://tinypic.com/r/111p0ya/3
The "dissimilar" numbers of the y-values of the pulse.
The point of this to determine the ground value for the waveform. I am comparing the most negative value to the ground and hoped to get a better method for grounding than to average the first N points in the sample.
Given that you are using ROOT you might consider looking at the TSpectrum classes which have support for extracting backgrounds from under an unspecified number of peaks...
I have never used them with so much baseline noise, but they ought to be robust.
BTW: what is the source of this data? The peak looks like a particle-detector pulse, but the high level of background jitter suggests that you could really improve things with some fairly minor adjustments in the DAQ hardware, which might be better than trying to solve a difficult software problem.
Finally, unless you are restricted to some very primitive hardware (in which case, why and how are you running ROOT?), if you only have a couple thousand such spectra you can afford a pretty slow algorithm. Or is that 2000 spectra per event at a high event rate?
If you can, maintain a sorted list; then you can easily chop off the head and the tail of the list each time you work out the average.
This is much like removing outliers based on the median (i.e., you're going to need two passes over the data: one to find the median, which is almost as slow as sorting for floating-point data, and another to calculate the average), but it requires less overhead when working out the average, at the cost of maintaining a sorted list. Which one is fastest will depend entirely on your circumstances. It may be, of course, that what you really want is the median anyway!
If you had discrete data (say, bytes: 256 possible values), you could use 256 histogram 'bins', with a single pass over your data counting the values that fall into each bin; then it's really easy to find the median, approximate the mean, remove outliers, etc. This would be my preferred option if you could afford to lose some of the precision in your data, followed by maintaining a sorted list, if that is appropriate for your data.
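A minimal sketch of the "chop off the head and the tail" idea; for illustration it sorts a copy each time, whereas the suggestion above is to maintain an already-sorted list and skip the repeated sort. The 10% trim fraction and the function name are arbitrary assumptions.

#include <algorithm>
#include <cstddef>
#include <vector>

// Trimmed mean: sort a copy, drop a fraction of the smallest and largest
// values, and average the rest.
double trimmedMean(std::vector<double> values, double trimFraction = 0.1) {
    std::sort(values.begin(), values.end());
    std::size_t drop = static_cast<std::size_t>(values.size() * trimFraction);
    double sum = 0.0;
    std::size_t kept = 0;
    for (std::size_t i = drop; i + drop < values.size(); ++i) {
        sum += values[i];
        ++kept;
    }
    return kept ? sum / kept : 0.0;
}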
A quick way might be to take the median, and then average the numbers that are not too far off from the median.
"Not too far off" being dependent on your project.
A good rule of thumb for determining likely outliers is to calculate the interquartile range (IQR); any values that are more than 1.5*IQR away from the nearest quartile are outliers.
This is the basic method many statistics systems (like R) use to automatically detect outliers.
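A minimal sketch of that rule of thumb; quartiles are taken here at the 25th and 75th percentile positions of the sorted data (exact quartile conventions vary), and the function name is illustrative.

#include <algorithm>
#include <cstddef>
#include <vector>

// Average of all values that are not IQR outliers, i.e. values within
// [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
double meanWithoutIqrOutliers(std::vector<double> values) {
    if (values.empty()) return 0.0;
    std::sort(values.begin(), values.end());
    const std::size_t n = values.size();
    double q1 = values[n / 4];
    double q3 = values[(3 * n) / 4];
    double iqr = q3 - q1;
    double low = q1 - 1.5 * iqr, high = q3 + 1.5 * iqr;
    double sum = 0.0;
    std::size_t kept = 0;
    for (double v : values) {
        if (v >= low && v <= high) { sum += v; ++kept; }
    }
    return kept ? sum / kept : 0.0;
}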
Any method that is statistically sound and a good way to approach it (Dark Eru, Daniel White) will be too computationally intensive to repeat, and I think I've found a workaround that will allow later correction (meaning, leave it un-grounded).
Thanks for the suggestions. I'll look into them if I have time and want to see if their gain is worth the slowdown.
Here's a quick and dirty method that I've used before (it works well if there are very few outliers at the beginning, and you don't have very complicated conditions for what constitutes an outlier).
The algorithm is O(N). The only really expensive part is the division.
The real advantage here is that you can have it up and running in a couple minutes.
#include <cstddef>
#include <vector>

// Running average that ignores points deviating too far from the current average.
// Bug fixes versus the original pseudocode: sumX is initialized to the first
// point, and accumulated with += (the original "sumX =+ x" overwrote the sum).
double runningAverageWithoutOutliers(const std::vector<double>& arr) {
    const double percentDeviation = 0.3; // deviation acceptable for non-outliers
    double avgX = arr[0];                // initialize with the first point
    double sumX = arr[0];
    int count = 1;
    for (std::size_t i = 1; i < arr.size(); ++i) {
        double x = arr[i];
        if (x < avgX + avgX * percentDeviation &&
            x > avgX - avgX * percentDeviation) {
            ++count;
            sumX += x;
            avgX = sumX / count;
        }
    }
    return avgX;
}