ELKI: LOF score as infinite

What is the generally used and accepted way to handle infinite LOF scores in ELKI, caused by duplicate points? If ELKI's LOF scores are to be used, should such scores be treated as maximum scores, as zeros, or as inliers?

The LOF score of a point is infinite if at least one of its neighbors has reachability distance 0 (because they are duplicate points).
If the point itself has a non-zero reachability, its value is thus infinitely higher than the local reachability density (lrd) of the neighbors (or, in terms of density: the point is infinitely less dense than its neighbors), so it is an outlier.
The proper way of handling this is to increase k (minpts) to be larger than the maximum number of duplicate points. If you have too many duplicate points, this usually indicates that LOF may not be a good choice for this data set. LOF requires that a nearest-neighbor density estimate makes sense on the data, and if you run into this kind of problem, the cause is usually the input data, not the algorithm.
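If you do end up consuming scores that contain infinities, one option consistent with the reasoning above is to treat them as the strongest outliers, i.e. cap them just above the largest finite score. A minimal post-processing sketch (my own illustration, not something ELKI does for you; capInfiniteLofScores is a made-up name):

#include <algorithm>
#include <cmath>
#include <vector>

// Replace infinite LOF scores with a value just above the largest
// finite score, so that duplicate-induced infinities still rank as
// the top outliers instead of breaking downstream processing.
void capInfiniteLofScores(std::vector<double>& scores) {
    double maxFinite = 1.0;  // the LOF of a clear inlier is about 1
    for (double s : scores)
        if (std::isfinite(s)) maxFinite = std::max(maxFinite, s);
    for (double& s : scores)
        if (std::isinf(s)) s = maxFinite + 1.0;
}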

What is the fastest algorithm to find the point from a set of points, which is closest to a line?

I have:
- a set of points of known size (in my case, only 6 points)
- a line characterized by x = s + t * r, where x, s and r are 3D vectors
I need to find the point closest to the given line. The actual distance does not matter to me.
I had a look at several different questions that seem related (including this one) and know how to solve this on paper from my high school math classes. But I cannot find a solution without calculating every distance, and I am sure there has to be a better/faster way. Performance is absolutely crucial in my application.
One more thing: All numbers are integers (coordinates of points and elements of s and r vectors). Again, for performance reasons I would like to keep the floating-point math to a minimum.
You have to process every point at least once to know its distance. Unless you want to repeat the process many times with different lines, simply computing the distance of every point is unavoidable, so any algorithm is at least linear in the number of points.
Since you don't care about the actual distance, we can simplify the point-to-line distance computation. The exact distance is computed by (source):
d^2 = |r⨯(p-s)|^2 / |r|^2
where ⨯ is the cross product and |r|^2 is the squared length of the vector r. Since |r|^2 is constant for all points, we can omit it from the distance computation without changing the result:
d^2 = |r⨯(p-s)|^2
Compare these reduced squared distances and keep the minimum. The advantage of this formula is that you can do everything with integers, since you mentioned that all coordinates are integers.
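For illustration, a minimal sketch of that integer-only comparison (the Vec3 type and the cross, lengthSq and closestToLine names are made up for this example; it assumes the coordinates are small enough, roughly below a few tens of thousands in magnitude, that all intermediate products fit in 64 bits):

#include <array>
#include <cstddef>
#include <cstdint>
#include <limits>

struct Vec3 { std::int64_t x, y, z; };

// Cross product of two integer vectors.
static Vec3 cross(const Vec3& a, const Vec3& b) {
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}

// Squared length; proportional to the squared point-line distance
// once the constant factor |r|^2 has been dropped.
static std::int64_t lengthSq(const Vec3& v) {
    return v.x * v.x + v.y * v.y + v.z * v.z;
}

// Index of the point closest to the line x = s + t * r,
// using only integer arithmetic.
std::size_t closestToLine(const std::array<Vec3, 6>& points,
                          const Vec3& s, const Vec3& r) {
    std::size_t best = 0;
    std::int64_t bestScore = std::numeric_limits<std::int64_t>::max();
    for (std::size_t i = 0; i < points.size(); ++i) {
        const Vec3 d{ points[i].x - s.x, points[i].y - s.y, points[i].z - s.z };
        const std::int64_t score = lengthSq(cross(r, d));
        if (score < bestScore) { bestScore = score; best = i; }
    }
    return best;
}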
I'm afraid you can't get away with computing fewer than 6 distances (if you could, at least one point would be left out, possibly the nearest one).
See if it makes sense to preprocess: Is the line fixed and the points vary? Consider rotating coordinates to make the line horizontal.
As there are so few points, it is doubtful that this is your bottleneck. Measure where the hot spots are, redesign algorithms/data representation, turn up compiler optimization, compile to assembly and hand-tune ("bum") that. Strictly in that order.
Jon Bentley's "Writing Efficient Programs" (sadly long out of print) and "Programming Pearls" (2nd edition) are full of advice on practical programming.

Calculating quantiles without storing

I wrote C++ code to calculate 119 quantiles (from 10^-7 to 1 - 10^-7) of 100 million double-precision numbers.
My current implementation stores the numbers in a vector and then sorts the vector.
Is there any way to calculate the quantiles without storing the numbers?
Thank you
ADDENDUM (sorry for my English):
Here is what I'm doing:
1) generate 20 uniformly distributed random numbers in [0, 1)
2) feed those numbers into an algorithm that outputs one random number with unknown mean and unknown variance
3) store the number from step 2
Repeat steps 1, 2 and 3 one hundred million times (so I have collected 10^8 random numbers with unknown mean and unknown variance).
Now I sort those numbers to calculate 119 quantiles from 10^-7 to 1 - 10^-7 using the formula "R-2, SAS-5":
https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
Since the program is multi-threaded, the memory consumption is too high and I can only run 5 threads instead of 8.
This is a problem from the field of streaming algorithms (where you need to operate on a stream of data without storing each element).
There are well-known streaming algorithms for quantiles (e.g., here), but if you are willing to accept approximate quantiles, it's a fairly easy problem. Simply use reservoir sampling to uniformly sample m out of the n elements, and calculate the quantiles on the sample (by the method you already use: storing the m samples in a vector and sorting it). The sample size m controls the approximation's precision (see, e.g., here).
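A minimal sketch of that approach (the ReservoirSampler name and the simple nearest-rank quantile rule are illustrative only; they are not the R-2/SAS-5 formula from the question):

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Keep a uniform random sample of m elements from a stream of
// arbitrary length (Algorithm R), then read quantiles off the sample.
class ReservoirSampler {
public:
    explicit ReservoirSampler(std::size_t m)
        : m_(m), seen_(0), rng_(std::random_device{}()) {}

    void feed(double x) {
        ++seen_;
        if (sample_.size() < m_) {
            sample_.push_back(x);
        } else {
            std::uniform_int_distribution<std::size_t> pick(0, seen_ - 1);
            if (pick(rng_) < m_) sample_[pick(rng_) % m_] = x;  // keep with probability m/seen
        }
    }

    // Crude nearest-rank estimate of the p-quantile, 0 <= p <= 1.
    double quantile(double p) {
        std::sort(sample_.begin(), sample_.end());
        const std::size_t idx =
            static_cast<std::size_t>(p * (sample_.size() - 1));
        return sample_[idx];
    }

private:
    std::size_t m_, seen_;
    std::vector<double> sample_;
    std::mt19937_64 rng_;
};

Note that for quantiles as extreme as 10^-7 the sample has to be very large (well above 10^7 elements) before the estimate means anything, so reservoir sampling mainly helps with the more central quantiles.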
You need to know the set of numbers before you can calculate the quantiles.
This can be done by storing the numbers, but you can also use a multi-pass algorithm that learns a little more about the distribution on each pass.
There are also approximate one-pass algorithms for this problem, if some inaccuracy on the quantiles is acceptable. Here is an example: http://www.cs.umd.edu/~samir/498/manku.pdf
EDIT: I forgot to mention that if your numbers contain many duplicates, you only need to store each distinct value and how many times it appears, not every duplicate. Depending on the input data this can make a significant difference.
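A sketch of that duplicate-counting idea, assuming the values really do repeat exactly (with continuous doubles they often will not); the add and quantile names and the global map are just for illustration:

#include <cstdint>
#include <map>

// Store each distinct value once, together with how often it occurs.
std::map<double, std::uint64_t> counts;
std::uint64_t total = 0;

void add(double x) { ++counts[x]; ++total; }

// Nearest-rank q-quantile: walk the keys in sorted order until the
// accumulated count reaches q * total (assumes at least one value was added).
double quantile(double q) {
    const std::uint64_t target = static_cast<std::uint64_t>(q * total);
    std::uint64_t seen = 0;
    for (const auto& kv : counts) {
        seen += kv.second;
        if (seen > target) return kv.first;
    }
    return counts.rbegin()->first;
}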

Handle very large distance matrix in C (or C++ if it could help)

I am implementing this clustering algorithm http://www.sciencemag.org/content/344/6191/1492.full (free access version) in C in my software and I need to build a distance matrix, but in some cases the size of the dataset (after redundancy removal) is huge (n > 1,500,000, and it goes up to 4,000,000 in more complex cases). My problem is that even allocating just the upper triangular matrix would take ((1500000 * 1500000) - 1500000) * 0.5 * sizeof(float) =~ 4.5e12 bytes, so memory allocation fails (even on our computing nodes with 256 GB of RAM) and writing to disk is not an option in this case.
Besides cutting down the size of the dataset to cluster (which I will look into), does anybody have an idea of a technique I could use to approximate and store this amount of information?
N.B. As I said in the title, I am using C, and I can also use C++. Also, if anybody knows another clustering algorithm (one where the number of clusters is determined by the algorithm itself), please suggest it to me.
Thanks in advance for your time,
You probably have to step back and reconsider your algorithm.
First, perhaps you don't need to have distance matrix between all pairs of data points. Perhaps you could group together similar data points into data bins and then create a matrix of distances between bins.
That is, start by computing pairwise distances between points, but keep only relatively small distances and pointers to "the other" point. Kind of a very sparse matrix of shorter distances. This is straightforward to do in parallel.
Then create data bins that contain groups of points with mutually small distances between them. For example, if you threshold the "short" distances in such a manner that bins hold, on average, say, 50 data points, you'd get 1500000/50 = 30000 bins.
Then go through your data again and compute distances between bins. That would produce 30000^2 distances, which is a matrix of about 4 GB. In addition you still have 30000 bins with 50^2 distances each within them, which is another 300 MB. This amount of data is quite manageable.
If replacing the distance between data points with a distance between the corresponding bins is sufficient precision for your application that would work. It all depends on the kind of data you are dealing with and the precision requirements of your application.
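A rough sketch of the bin-level matrix, assuming 2D points, Euclidean distance and bin centroids as representatives (the answer does not prescribe how to represent a bin, and the Point, centroid and binDistanceMatrix names are made up here):

#include <cmath>
#include <cstddef>
#include <vector>

struct Point { float x, y; };   // 2D only for the sake of the sketch

// Centroid of one bin of points, used as the bin's representative.
static Point centroid(const std::vector<Point>& bin) {
    Point c{0.0f, 0.0f};
    for (const Point& p : bin) { c.x += p.x; c.y += p.y; }
    const float n = static_cast<float>(bin.size());
    c.x /= n;
    c.y /= n;
    return c;
}

// Dense B x B matrix of distances between bin representatives;
// with B around 30,000 this is roughly 3.6 GB of floats.
std::vector<float> binDistanceMatrix(const std::vector<std::vector<Point>>& bins) {
    const std::size_t B = bins.size();
    std::vector<Point> centers(B);
    for (std::size_t i = 0; i < B; ++i) centers[i] = centroid(bins[i]);

    std::vector<float> dist(B * B, 0.0f);
    for (std::size_t i = 0; i < B; ++i)
        for (std::size_t j = i + 1; j < B; ++j) {
            const float dx = centers[i].x - centers[j].x;
            const float dy = centers[i].y - centers[j].y;
            dist[i * B + j] = dist[j * B + i] = std::sqrt(dx * dx + dy * dy);
        }
    return dist;
}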

How to create a vector containing an (artificially generated) Gaussian (normal) distribution?

Suppose I have data (a daily stock chart is a good example, but it could be anything) in which I only know the range (high - low) within which X units sold, but I don't know the exact price at which any given item sold. Assume for simplicity that the price range contains enough buckets (e.g. forty one-cent increments for a 40-cent range) to make such a distribution practical. How can I go about distributing those items to form a normal bell curve stored in a vector? It doesn't have to be perfect, just realistic.
My (very) naive thinking has been to assume that, since random numbers should form a normal distribution, I can do something like use a binary RNG. If, for example, there are forty buckets, then if '0' comes up 40 times the 0th bucket gets incremented, and if '1' comes up 40 times in a row then the 39th bucket gets incremented. If '1' comes up 20 times, the item lands in the middle of the vector. Do this for each item until X units have been accounted for. This may or may not be right, and in any case it seems way more inefficient than necessary. I am looking for something more sensible.
This isn't homework, just a problem that has been bugging me and my statistics is not up to snuff. Most literature seems to be about analyzing the distribution after it already exists but not much about how to artificially create one.
I want to write this in C++, so pre-packaged solutions in R or MATLAB or whatnot are not too useful for me.
Thanks. I hope this made sense.
Most literature seems to be about analyzing the distribution after it already exists but not much about how to artificially create one.
There's tons of literature on how to create one. The Box–Muller transform, the Marsaglia polar method (a variant of Box–Muller), and the Ziggurat algorithm are three (Google those terms). Both Box–Muller methods are easy to implement.
Better yet, just use an existing random number generator that implements one of these algorithms. Both Boost and the new C++11 <random> library provide one.
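For example, with C++11's <random> you could fill the question's price buckets roughly like this (the fillBuckets name, the standard deviation of one sixth of the range, and the clamping of stray draws are all illustrative choices):

#include <cstddef>
#include <random>
#include <vector>

// Distribute `units` items over `numBuckets` price buckets so the
// counts follow a bell curve centred on the middle bucket.
std::vector<int> fillBuckets(std::size_t numBuckets, std::size_t units) {
    std::mt19937 rng(std::random_device{}());
    // Mean in the middle, standard deviation one sixth of the range,
    // so almost all draws land inside [0, numBuckets).
    std::normal_distribution<double> gauss((numBuckets - 1) / 2.0,
                                           numBuckets / 6.0);
    const double maxIndex = static_cast<double>(numBuckets - 1);

    std::vector<int> buckets(numBuckets, 0);
    for (std::size_t i = 0; i < units; ++i) {
        double v = gauss(rng);
        if (v < 0.0) v = 0.0;               // clamp the rare stray draw
        if (v > maxIndex) v = maxIndex;
        ++buckets[static_cast<std::size_t>(v)];
    }
    return buckets;
}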
The algorithm that you describe relies on the central limit theorem, which says that a random variable defined as the sum of n random variables from the same distribution tends to a normal distribution as n grows to infinity. Uniformly distributed pseudorandom variables coming from a computer PRNG are a special case of this general theorem.
To get a more efficient algorithm you can view the probability distribution as a sort of space warp that expands the real axis in the middle and shrinks it towards the ends.
Let F: R -> [0, 1] be the cumulative distribution function of the normal distribution, let invF be its inverse, and let x be a random variable uniformly distributed on [0, 1]; then invF(x) will be a normally distributed random variable.
All you need to implement this is to be able to compute invF(x). Unfortunately this function cannot be expressed in terms of elementary functions; in fact, it is the solution of a nonlinear differential equation. However, you can efficiently solve the equation x = F(y) for y using Newton's method.
What I have described is a simplified presentation of the inverse transform method. It is a very general approach; there are specialized algorithms for sampling from the normal distribution that are more efficient, some of which are mentioned in the answer by David Hammen.
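A sketch of that Newton iteration, using std::erf for F (the normalCdf, normalPdf and inverseNormalCdf names, the starting point, the iteration cap and the tolerance are all arbitrary choices here):

#include <cmath>

// Standard normal CDF and PDF, built on std::erf.
static double normalCdf(double y) {
    return 0.5 * (1.0 + std::erf(y / std::sqrt(2.0)));
}
static double normalPdf(double y) {
    static const double sqrt2pi = std::sqrt(2.0 * 3.141592653589793);
    return std::exp(-0.5 * y * y) / sqrt2pi;
}

// Solve F(y) = x for y with Newton's method; x must lie in (0, 1).
double inverseNormalCdf(double x) {
    double y = 0.0;                       // start at the mean
    for (int i = 0; i < 50; ++i) {
        const double step = (normalCdf(y) - x) / normalPdf(y);
        y -= step;
        if (std::fabs(step) < 1e-12) break;
    }
    return y;
}

Feeding uniform draws from std::uniform_real_distribution<double>(0, 1) through inverseNormalCdf then yields normally distributed values, although the library generators mentioned in the other answer will usually be faster.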

How to select an unlike number in an array in C++?

I'm using C++ to write a ROOT script for some task. At some point I have an array of doubles in which many are quite similar and one or two are different. I want to average all the numbers except those sore thumbs. How should I approach it? As an example, let's consider:
x = [2.3, 2.4, 2.11, 10.5, 1.9, 2.2, 11.2, 2.1]
I want to somehow average all the numbers except 10.5 and 11.2, the dissimilar ones. This algorithm is going to be repeated several thousand times and the array of doubles has 2000 entries, so optimization (while maintaining readability) is desired. Thanks SO!
Check out:
http://tinypic.com/r/111p0ya/3
The "dissimilar" numbers of the y-values of the pulse.
The point of this to determine the ground value for the waveform. I am comparing the most negative value to the ground and hoped to get a better method for grounding than to average the first N points in the sample.
Given that you are using ROOT you might consider looking at the TSpectrum classes which have support for extracting backgrounds from under an unspecified number of peaks...
I have never used them with so much baseline noise, but they ought to be robust.
BTW: what is the source of this data? The peak looks like a particle detector pulse, but the high level of background jitter suggests that you could really improve things with some fairly minor adjustments to the DAQ hardware, which might be better than trying to solve a difficult software problem.
Finally, unless you are restricted to some very primitive hardware (in which case why and how are you running ROOT?), if you only have a couple thousand such spectra you can afford a pretty slow algorithm. Or is that 2000 spectra per event and a high event rate?
If you can, maintain a sorted list; then you can easily chop off the head and the tail of the list each time you work out the average.
This is much like removing outliers based on the median (i.e., you're going to need two passes over the data: one to find the median, which is almost as slow as sorting for floating-point data, and one to calculate the average), but it requires less overhead at the time of working out the average, at the cost of maintaining a sorted list. Which one is fastest will depend entirely on your circumstances. It may be, of course, that what you really want is the median anyway!
If you had discrete data (say, bytes, so 256 possible values), you could use 256 histogram 'bins' and a single pass over your data counting the values that go into each bin; then it's really easy to find the median, approximate the mean, remove outliers, etc. This would be my preferred option, if you can afford to lose some of the precision in your data, followed by maintaining a sorted list, if that is appropriate for your data.
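A sketch of that histogram variant, assuming the samples have been quantised to 8 bits and that trimming a fixed fraction from each tail is an acceptable way to drop the outliers (trimmedMean is a made-up name):

#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Trimmed mean for samples quantised to 8 bits: one pass to build a
// 256-bin histogram, then drop `trim * n` samples from each tail
// before averaging what remains.
double trimmedMean(const std::uint8_t* data, std::size_t n, double trim) {
    std::array<std::size_t, 256> hist{};
    for (std::size_t i = 0; i < n; ++i) ++hist[data[i]];

    std::size_t skipLow  = static_cast<std::size_t>(trim * n);
    std::size_t skipHigh = static_cast<std::size_t>(trim * n);

    for (std::size_t b = 0; b < 256 && skipLow > 0; ++b) {      // low tail
        const std::size_t take = std::min(skipLow, hist[b]);
        hist[b] -= take;
        skipLow -= take;
    }
    for (std::size_t b = 256; skipHigh > 0 && b-- > 0; ) {      // high tail
        const std::size_t take = std::min(skipHigh, hist[b]);
        hist[b] -= take;
        skipHigh -= take;
    }

    double sum = 0.0;
    std::size_t kept = 0;
    for (std::size_t b = 0; b < 256; ++b) {
        sum += static_cast<double>(b) * hist[b];
        kept += hist[b];
    }
    return kept ? sum / kept : 0.0;
}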
A quick way might be to take the median, and then take the average of the numbers not too far off from the median.
"Not too far off" being dependent on your project.
A good rule of thumb for determining likely outliers is to calculate the Interquartile Range (IQR), and then any values that are 1.5*IQR away from the nearest quartile are outliers.
This is the basic method many statistics systems (like R) use to automatically detect outliers.
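A minimal sketch of that rule (std::nth_element gives approximate quartile positions without a full sort; the averageWithoutOutliers name and the copy of the input are just to keep the sketch simple):

#include <algorithm>
#include <cstddef>
#include <vector>

// Average of the values that are not IQR outliers, i.e. the ones
// inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. Takes a copy because
// nth_element reorders its input.
double averageWithoutOutliers(std::vector<double> v) {
    const std::size_t n = v.size();
    std::nth_element(v.begin(), v.begin() + n / 4, v.end());
    const double q1 = v[n / 4];
    std::nth_element(v.begin(), v.begin() + (3 * n) / 4, v.end());
    const double q3 = v[(3 * n) / 4];
    const double iqr = q3 - q1;
    const double lo  = q1 - 1.5 * iqr;
    const double hi  = q3 + 1.5 * iqr;

    double sum = 0.0;
    std::size_t kept = 0;
    for (double x : v)
        if (x >= lo && x <= hi) { sum += x; ++kept; }
    return kept ? sum / kept : 0.0;
}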
Any method that is statistically sound and a good way to approach it (Dark Eru, Daniel White) will be too computationally intensive to repeat that often, and I think I've found a workaround that will allow later correction (meaning, leave it un-grounded).
Thanks for the suggestions. I'll look into them if I have time and want to see if their gain is worth the slowdown.
Here's a quick and dirty method that I've used before. It works well if there are very few outliers at the beginning, and you don't have very complicated conditions for what constitutes an outlier.
The algorithm is O(N). The only really expensive part is the division.
The real advantage here is that you can have it up and running in a couple minutes.
#include <cstddef>
#include <vector>

// Running average that ignores any value deviating more than
// percentDeviation from the current average (assumes positive data,
// as in the waveform-baseline use case above).
double runningAverage(const std::vector<double>& Array)
{
    const double percentDeviation = 0.3;  // acceptable relative deviation for non-outliers
    double avgX = Array[0];               // initialize with the first point
    double sumX = Array[0];
    std::size_t count = 1;

    for (std::size_t i = 1; i < Array.size(); ++i) {
        const double x = Array[i];
        if (x < avgX + avgX * percentDeviation &&
            x > avgX - avgX * percentDeviation) {
            ++count;
            sumX += x;
            avgX = sumX / count;
        }
    }
    return avgX;
}