ELKI GUI output and parameter k (LOF) - data-mining

I am suspicious about "Materializing k nearest neighbors (k=3)" in the following output.
Verbose output from the ELKI GUI, running the LOF algorithm with lof.k=2.
LOF #1/3: Materializing LOF neighborhoods.
de.lmu.ifi.dbs.elki.index.preprocessed.knn.MaterializeKNNPreprocessor.k: 3
Materializing k nearest neighbors (k=3): 198 [100%]
de.lmu.ifi.dbs.elki.index.preprocessed.knn.MaterializeKNNPreprocessor.precomputation-time: 3 ms
LOF #2/3: Computing LRDs.
LOF #3/3: Computing LOFs.
LOF: complete.
Does that mean that ELKI looks at each point's 3 nearest neighbors when I set lof.k=2?

This is correct behaviour.
To compute LOF fast, you need to precompute the k nearest neighbors.
Since the k nearest neighbors in ELKI - as is common in databases - usually include the query point itself, you need the k+1 nearest neighbors for LOF in order to get k other points.
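A toy way to see the effect (this is just an illustration, not ELKI code): with a database-style kNN, asking for k=3 neighbors of a database point returns the point itself plus its 2 true neighbors, which is exactly why lof.k=2 materializes 3 neighbors.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustration only: brute-force kNN over 1-D values. If the query value is
// itself in the database, element 0 of the result is the query point
// (distance 0), so k+1 neighbors are needed to get k *other* points.
std::vector<double> knn(const std::vector<double>& db, double query, std::size_t k)
{
    std::vector<double> sorted(db);
    std::sort(sorted.begin(), sorted.end(),
              [query](double a, double b) {
                  return std::fabs(a - query) < std::fabs(b - query);
              });
    if (sorted.size() > k) sorted.resize(k);
    return sorted;
}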

How to code for a probability map in opencv?

Hope I am posting in the correct forum.
I just want to sound out my ideas and approach to solving the problem. I would welcome any pointers or help (code would definitely be ideal :) )
Problem:
I want to code the probability distribution (in a 400 x 400 map) in order to find the spatial location (x,y) of another line (let us call it fL) based upon the probability in the probability map.
I have obtained a nearly horizontal line cue (let us call it lC) from prior processing, which I use to calculate the probability of fL. fL is estimated to lie at distance D away from this horizontal line cue. My task is to calculate this probability map.
Approach:
1) I would take the probability map distribution to be Gaussian:
P(fL | point) = exp( -(x-D)^2 / sigma^2 )
which gives the probability of the line fL given that the point in line cue lC is at distance D away, depending on sigma (which defines how fast the probability decreases).
2) I would use a LineIterator to find every single pixel that lies on the line cue lC (given that I know the start and end points of the line). Let's say I get n pixels on this line.
3) For every single pixel in the 400 x 400 image, I would calculate the probability using 1) as described above for all n points that I obtained for the line, summing up each line point's contribution.
4) After finishing the calculation for all pixels in the 400x400 image, I would then normalize the probability map. This is the part I am unsure about: should I normalize by the sum of all pixel probabilities, or by the largest pixel probability value found in the step above?
5) After this I would multiply this probability map with other probability maps, so I would get
P(fL | Cuefromthisline, Cuefromsomeother, ...) = P(fL | Cuefromthisline) P(fL | Cuefromsomeother) ...
And I would set pixels with near-0 probability to 0.001.
6) That outlines my approach
Question
1) Is this workable? Or is there a better method of doing this, i.e. getting the probability map?
2) How do I normalize the map: by normalizing with the sum of all pixel probabilities or by normalizing with the max value?
Thanks in advance for reading this long post.
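For what it's worth, here is a minimal OpenCV sketch of steps 1) to 4) (the function name, D and sigma are placeholders for the quantities described above, and lC is assumed to be given by its two endpoints; step 5 would then be a per-pixel multiplication of several such maps):

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

// Sketch of the proposed probability map: each pixel accumulates a Gaussian
// contribution from every pixel of the line cue lC, then the map is
// normalized by its maximum value.
cv::Mat buildProbabilityMap(cv::Size size, cv::Point p1, cv::Point p2,
                            double D, double sigma)
{
    cv::Mat prob(size, CV_64F, cv::Scalar(0.0));

    // Step 2: collect every pixel lying on the line cue lC.
    cv::Mat dummy(size, CV_8U, cv::Scalar(0));
    cv::LineIterator it(dummy, p1, p2);
    std::vector<cv::Point> cuePoints;
    for (int i = 0; i < it.count; ++i, ++it)
        cuePoints.push_back(it.pos());

    // Step 3: for every pixel, sum the contribution of all n cue points.
    for (int y = 0; y < size.height; ++y)
        for (int x = 0; x < size.width; ++x)
            for (const cv::Point& q : cuePoints) {
                double dx = x - q.x, dy = y - q.y;
                double dist = std::sqrt(dx * dx + dy * dy);
                prob.at<double>(y, x) += std::exp(-(dist - D) * (dist - D) / (sigma * sigma));
            }

    // Step 4: normalize by the largest pixel value so the map peaks at 1.
    double maxVal = 0.0;
    cv::minMaxLoc(prob, nullptr, &maxVal);
    if (maxVal > 0.0)
        prob /= maxVal;
    return prob;
}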

Measure variation of data points from a line; To Catch a Dip

How can I measure this area in C++?
(update: I posted the solution and code as an answer rather than edit the question again)
The ideal line (dashed red) is the plot from the starting point with the average rise added at each angle of measurement; this I obtain via the average. I measured the test data in black. How can I quantify the area of the dip in blue? The X-axis is unitized, so slopes and the math are simplified.
I could determine a cutoff for the size of areas like this and then flag this part for retesting or failure. Rarely, there is another dip that appears closer to the right, but setting a cutoff value for standard deviation usually fails those parts.
Update
Diego's answer helped me visualize this. Now that I can see what I'm trying to do, I'll work on the algorithm to implement the "homemade dip detector". :)
Why?
I created a test bench to test throttle position sensors I'm selling. I'm trying to programmatically quantify how straight the plot is by analyzing the data collected. This one particular model is vexing me.
Sample plot of a part I prefer not to sell:
The X axis is evenly spaced angles of throttle opening. The stepper motor turns the input shaft, stopping every 0.75° to measure the output on a 10-bit ADC, which gets translated to the Y axis. The plot is the translation of data[idx] to idx,value mapped to (x,y) bitmap coordinates. Then I draw lines between the points within the bitmap using Bresenham's algorithm.
My other TPS products produce amazingly linear output.
The lower (left) portion of the plot is crucial to normal usage of any motor vehicle; it's when you're driving around town, entering parking lots, etc. This particular part has a tendency to develop a dip around 15° opening and I wish to use the program to quantify this "dip" in the curve and rely less upon the tester's intuition. In the above example, the plot dips but doesn't return to what an ideal line might be.
Even though this is an embedded application, printing the report takes 10 seconds, thus I do not consider stepping through an array of 120 points of data multiple times a waste of cycles. Also, since I'm using a uC32 PIC32 microcontroller, there's plenty of memory, so I have the luxury of being able to ponder this problem within the controller.
What I've tried already
Array of rise between test points: I dismiss the X-axis entirely, considering it unitized, and then make an array of the change from one reading to the next. This array is what contributes to the report's "Min rise between points: 0 Max: 14". I call this array deltas.
I've tried using the standard deviation of deltas; however, during testing I found that a low Std Dev is not a reliable measure for this part. If the dip quickly returns to the original line implied by the early data points, the Std Dev can be deceptively low (observed to be as low as 2.3) but the part is still something I wouldn't want to use. I tried setting a cutoff at 2.6, but it failed too many parts with great plots. The other, more linear part linked to above can reliably count on Std Dev for quality.
Kurtosis seems not to apply for this situation at all. I learned of Kurtosis today and found a Statistics Library which includes Kurtosis and Skewness. During continued testing, I found that of these two measures, there was not a trend of positive, negative, or amplitude which would correspond to either passing or failing. That same gentleman has shared a linear regression library, but I believe Lin Reg is unrelated to my situation, as I am comfortable with the assumption of the AVG of deltas being my ideal line. Linear Regression and R^2 are more for finding a line from less ideal data or much larger sets.
Comparing each delta to AVG and Std Dev: I set up a monitor to check each delta against the final average of the deltas. Here, too, I couldn't find a reliable metric. Too many good parts would not pass a test restricting any delta to within 2x Std Dev of the average. Ultimately, the only variation from AVG I could settle on is being within AVG + Std Dev of the AVG itself. Anything more restrictive would fail otherwise good parts, and the elusive dip around 15° opening can still sneak through this test.
Homemade dip detector: When feeding deltas to the serial monitor of the computer, I observed consecutive negative deltas during the dip, so I programmed in a dip detector, but it feels very crude to me. If there are 5 or more negative deltas in a row, I sum them. I have seen that if I take that sum of the dip's differences from AVG and then divide by the number of negative deltas, a value over 2.9 or 3 could mean a fail. I have observed dips lasting from 6 to 15 deltas. Readily observable dips would have their differences from AVG sum up to -35.
Trending accumulated variation from the AVG: The above made me think that watching the summation of deltas as it wanders away from AVG could be the answer. Meaning, I step through the array and sum the differences of each delta from AVG. I thought I was on to something until a good part blew this theory. I was seeing a trend: the fewer times the running sum strayed from AVG by more than 2x AVG, the straighter the line appeared. Many ideal parts would show only 8 or fewer delta points where sumOfDiffs would stray very far from the AVG.
float sumOfDiffs = 0.0;
for( int idx = 0; idx < stop; idx++ ){
    float spread = deltas[idx] - line->AdcAvgRise;
    sumOfDiffs = sumOfDiffs + spread;
    ...
    testVal = 2*line->AdcAvgRise;
    if( sumOfDiffs > testVal || sumOfDiffs < -testVal ){
        flag = 'S';
    }
    ...
}
And then a part with a fantastic linear plot came through with 58 data points where sumOfDiffs was more than twice the AVG! I find this amazing, as at the end of the ~120 data points, sumOfDiffs value is -0.000057.
During testing, the final sumOfDiffs result would often register as 0.000000 and only on exceptionally bad parts would it be greater than .000100. I found this quite surprising, actually: how a "bad part" can have accumulated great accuracy.
Sample output from monitoring sumOfDiffs: The output below shows a dip happening. The test flags whenever the running sumOfDiffs is more than 2x the AVG away from the AVG, for the whole test. This dip lasts from deltas idx 23 through 49; it starts at 17.25° and lasts for 19.5°.
Avg rise: 6.75 Std dev: 2.577
idx: delta diff from avg sumOfDiffs Flag
23: 5 -1.75 -14.05 S
24: 6 -0.75 -14.80 S
25: 7 0.25 -14.55 S
26: 5 -1.75 -16.30 S
27: 3 -3.75 -20.06 S
28: 3 -3.75 -23.81 S
29: 7 0.25 -23.56 S
30: 4 -2.75 -26.31 S
31: 2 -4.75 -31.06 S
32: 8 1.25 -29.82 S
33: 6 -0.75 -30.57 S
34: 9 2.25 -28.32 S
35: 8 1.25 -27.07 S
36: 5 -1.75 -28.82 S
37: 15 8.25 -20.58 S
38: 7 0.25 -20.33 S
39: 5 -1.75 -22.08 S
40: 9 2.25 -19.83 S
41: 10 3.25 -16.58 S
42: 9 2.25 -14.34 S
43: 3 -3.75 -18.09 S
44: 6 -0.75 -18.84 S
45: 11 4.25 -14.59 S
47: 3 -3.75 -16.10 S
48: 8 1.25 -14.85 S
49: 8 1.25 -13.60 S
Final Sum of diffs: 0.000030
RunningStats analysis:
NumDataValues= 125
Mean= 6.752
StandardDeviation= 2.577
Skewness= 0.251
Kurtosis= -0.277
Sobering note about quality: what started me on this journey was learning how major automotive OEM suppliers consider a 4-point test to be the standard measure for these parts. My first test bench used an Arduino with 8k of RAM, had neither a TFT display nor a printer, and had a mechanical resolution of only 3°! Back then I simply tested that deltas were within arbitrary total bounds and chose a limit on how big any single delta could be. My 120+ point test feels high class compared to that 30-point test from before, but that test had no idea about these dips.
Premises
The mean of a set of data has the mathematical property that the sum of the deviations from the mean is 0.
This explains why both bad and good datasets always give almost 0: for example, the data {3, 5, 10} have mean 6, and the deviations -3, -1 and +4 sum to exactly 0.
Basically, when the result differs from zero it is essentially an accumulation of rounding errors in the diffs, and that is why it unfortunately cannot hold useful information.
The thing that most clearly defines what you're looking for is your image: you're looking for an AREA, and this is why you're not finding the solution in these ways:
looking at a metric in the single points is too local to extract that information
looking at global accumulations or parameters (a global standard deviation) is too global, and you lose the data among too much information and too many sources of variation
kurtosis (you've already said so, I know, but I mention it for completeness) is outside its field of application, since this is not a probability distribution
In the end, the most suitable of the approaches you have already tried is the "Homemade dip detector", because it thinks in a way that is local but not too local.
Last but not least:
Any algorithm you choose has tacit assumptions on which it stands.
On one side, one might look for a super clever algorithm that, with no parametrization or tuning, automatically adapts to the problem and defines its own thresholds.
On the other side, there is an algorithm that stands on the writer's knowledge of the typical data behavior (good and bad), and that is itself "stupid" in the sense that if a different, unexpected behavior appears, the results are unpredictable.
The right way is one of these two, or somewhere in between, depending on the application. So if it works, the "Homemade dip detector" can also be a solution. There is no reason to call it crude; it may simply not be sufficient for the application's needs, but that is another matter.
How to find the area
Once you have the data, the first thing is to clearly define the "theoretical straight line". I give some options:
use the RANSAC algorithm (formally the best option, IMHO)
this gives you the best fit to the aligned points while disregarding the non-aligned ones
it is quite difficult and maybe oversized for this job (IMHO)
consider the line defined by the first and last points
you said that the dip is almost always in the same position, and that it is not near the boundaries, so the first and last points can be considered reliable
very easy to implement
this is an example of using knowledge about the expected behavior, as I said before, so you need to think about whether and how much confidence you give to this assumption
consider a linear fit to the first 10 points and the last 10 points
this is just a more robust version of the previous option: by using more points you worry less that the first or the last point alone was affected by some measurement problem, making everything fail because of it
also quite easy to implement
if I were you, I would use this or something inspired by it
calculate the Y value given by the straight line for each X
calculate the area between the two curves (or, equivalently, the area under the function Y_dev = Y_data - Y_straight) with this procedure (a code sketch follows this list):
PositiveMax = 0; NegativeMax = 0;
start from the first point (its value can be positive or negative) and put it in a temporary area accumulator tmp_Area
for each next point
if the sign is the same, keep accumulating the value
if it is different
stop accumulating
check whether the accumulated value is greater than PositiveMax or below NegativeMax, and if so store it as the new PositiveMax or NegativeMax
in any case reset the accumulator to the current value with tmp_Area = Y_dev;, starting a new accumulation
in the end you will have the value of the maximum contiguous area above the line and of the maximum contiguous area below it, which I think are the scores you're looking for.
if you want, you can track only NegativeMax, based on the observed and expected data behavior
you may find it useful to add a threshold, so that if a value of Y_dev is smaller than the threshold you do not accumulate it
this is in order to avoid large accumulations from many points close to the straight line, which could look similar to the accumulation of a few points far from the line
whether this is needed, and the proper threshold, have to be evaluated on some sample data
you also need an appropriate threshold for this contiguous area, and you can get it only from observation of sample data.
Again: it can be you observing and deciding the threshold, or you can build a repository of good and bad samples and write a program that automatically learns which threshold to use. But this is not the algorithm; this is how to find its operating parameters, and there is nothing wrong with doing that with the human brain. It only depends on whether we're looking for a method to separate bad and good parts or for a self-adaptive algorithm that does it. You decide the target.
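To make that recipe concrete, here is a rough C++ sketch of the whole procedure; the function and variable names, the first/last-10-point reference line, and the dead-band handling are my own choices, so treat it as a starting point rather than a finished detector:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Rough sketch: `data` holds the raw ADC readings (one per 0.75° step).
// The reference line runs through the averages of the first 10 and last 10
// points (a cheap stand-in for the "linear fit to the first/last 10 points"
// option), and `threshold` is the optional dead band (0 disables it).
struct AreaScores { double positiveMax; double negativeMax; };

AreaScores dipAreas(const std::vector<double>& data, double threshold = 0.0)
{
    const std::size_t n = data.size();
    if (n < 20) return AreaScores{0.0, 0.0};

    // Reference straight line through the two window averages.
    double yStart = 0.0, yEnd = 0.0;
    for (std::size_t i = 0; i < 10; ++i) { yStart += data[i]; yEnd += data[n - 10 + i]; }
    yStart /= 10.0;
    yEnd   /= 10.0;
    const double xStart = 4.5;                       // center of the first window
    const double xEnd   = double(n - 10) + 4.5;      // center of the last window
    const double slope  = (yEnd - yStart) / (xEnd - xStart);

    AreaScores s{0.0, 0.0};
    double run = 0.0;                                // current same-sign accumulation
    for (std::size_t i = 0; i < n; ++i) {
        const double yStraight = yStart + slope * (double(i) - xStart);
        double dev = data[i] - yStraight;            // Y_dev = Y_data - Y_straight
        if (std::fabs(dev) < threshold) dev = 0.0;   // ignore points hugging the line

        if (run == 0.0 || (dev >= 0.0) == (run > 0.0)) {
            run += dev;                              // same sign: keep accumulating
        } else {
            s.positiveMax = std::max(s.positiveMax, run);
            s.negativeMax = std::min(s.negativeMax, run);
            run = dev;                               // sign change: start a new run
        }
    }
    s.positiveMax = std::max(s.positiveMax, run);    // close the last run
    s.negativeMax = std::min(s.negativeMax, run);
    return s;                                        // negativeMax scores the dip
}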
It turns out the result of my gut feeling and Diego's method is an average of the integral. I still don't like that name, so I have described the algorithm and asked on Math.SE what to call it; the question got migrated to Cross Validated (Stats.SE).
I updated the graphs after a massive edit of my Math.SE question. It turns out I'm taking the average of a closed integral of the derivative of the data. :P First, we gather the data:
Next is the "derivative": step through the original data array to form the deltas array which is the rise of ADC values from one 0.75° step to the next. "Rise" or "slope" is what the derivative is: dy/dx.
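A minimal sketch of that step (the array names and sizes are illustrative; `data` holds the raw readings and `count` is the number of samples):

// Build the "derivative": the rise from one 0.75° step to the next.
int deltas[120];
for (int idx = 0; idx < count - 1; ++idx)
    deltas[idx] = data[idx + 1] - data[idx];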
With the "slope" or average leveled out, I can find multiple negative deltas in a row, sum them, then divide by the count at the end of the dip. The sum is an integral of the area between the average and the deltas, and when the dip goes back positive, I can divide that sum by the number of deltas in the dip.
During testing, I came up with a cutoff value for this average of the integral at 2.6. That was a great measure of my "gut instinct" looking at the plot thinking a part was good or bad.
In case someone else finds themselves trying to quantify this, here's the code I implemented. Note that it only looks for negative dips. Also, dipCountLimit is defined elsewhere as 5. In addition to the dip detector/accumulator (i.e. a numerical integrator) I also have a spike detector that arbitrarily flags the test as bad if any data point strays from the average by more than average + standard deviation. AVG + STD DEV as a spike limit was chosen arbitrarily based on the observed plots of the parts it would fail.
int dipdx = 0;
// inDipFlag also counts the length of this dip
int inDipFlag = 0;
float dips[140] = { 0.0 };
for( int idx = 0; idx < stop; idx++ ){
    const float diffFromAvg = deltas[idx] - line->AdcAvgRise;
    // state machine to monitor dips
    const int _stop = stop-1;
    if( diffFromAvg < 0 && idx < _stop ) {
        // check NEXT data point for negative diff & set dipFlag to put state in dip
        const float nextDiff = deltas[idx+1] - line->AdcAvgRise;
        if( nextDiff < 0 && inDipFlag == 0 )
            inDipFlag = 1;
        // already IN a dip, and next diff is negative
        if( nextDiff < 0 && inDipFlag > 0 ) {
            inDipFlag++;
        }
        // accumulate this dip
        dips[dipdx] += diffFromAvg;
        // next data point ends this dip and we advance dipdx to next dip
        if( inDipFlag > 0 && nextDiff > 0 ) {
            if( inDipFlag < dipCountLimit ){
                // reset the accumulator, do not advance dipdx to next entry
                dips[dipdx] = 0.0;
            } else {
                // change this entry's value from dip sum to its ratio
                dips[dipdx] = -dips[dipdx]/inDipFlag;
                // advance dipdx to next entry
                dipdx++;
            }
            // Next diff isn't negative, so the dip is done
            inDipFlag = 0;
        }
    }
}

My neural net learns sin x but not cos x

I have built my own neural net and I have a weird problem with it.
The net is quite a simple feed-forward 1-N-1 net with back-propagation learning. Sigmoid is used as the activation function.
My training set is generated with random values between [-PI, PI] and their [0,1]-scaled sine values (this is because the "sigmoid net" only produces values between [0,1], while the unscaled sine function produces values between [-1,1]).
With that training set, and the net set to 1-10-1 with a learning rate of 0.5, everything works great and the net learns the sine function as it should. BUT.. if I do everything exactly the same way for the COSINE function, the net won't learn it. Not with any hidden layer size or learning rate.
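For reference, a minimal sketch of how such a training set could be generated (illustrative only; the sample count and seed are arbitrary). The only change for the cosine set would be swapping std::sin for std::cos:

#include <cmath>
#include <cstdio>
#include <random>

int main()
{
    const double PI = 3.14159265358979323846;
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> dist(-PI, PI);

    for (int i = 0; i < 1000; ++i) {
        double x = dist(rng);
        // scale sin(x) from [-1,1] into [0,1] so a sigmoid output can match it
        double target = (std::sin(x) + 1.0) / 2.0;
        std::printf("%f %f\n", x, target);
    }
    return 0;
}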
Any ideas? Am I missing something?
EDIT: My problem seems to be similar to what can be seen with this applet. It doesn't seem to learn the sine function unless something "easier" is taught to the weights first (like 1400 cycles of a quadratic function). All the other settings in the applet can be left as they initially are. So in the case of sine or cosine it seems that the weights need some boosting in at least partially the right direction before a solution can be found. Why is this?
I'm struggling to see how this could work.
You have, as far as I can see, 1 input, N nodes in 1 layer, then 1 output. So there is no difference between any of the nodes in the hidden layer of the net. Suppose you have an input x and a set of weights w_i. Then the output node y will have the value:
y = Σ_i w_i x = x · Σ_i w_i
So this is always linear.
In order for the nodes to be able to learn differently, they must be wired differently and/or have access to different inputs. So you could supply inputs of the value, the square root of the value (giving some effect of scale), etc., and wire different hidden layer nodes to different inputs, and I suspect you'll need at least one more hidden layer anyway.
The neural net is not magic. It produces a set of specific weights for a weighted sum. Since you can derive a set of weights to approximate a sine or cosine function, that must inform your idea of what inputs the neural net will need in order to have some chance of succeeding.
An explicit example: the Taylor series of the exponential function is:
exp(x) = 1 + x/1! + x^2/2! + x^3/3! + x^4/4! ...
So if you supplied 6 input nodes with 1, x, x^2, etc., then a neural net that simply passed each input to one corresponding node, multiplied it by its weight, and then fed all those outputs to the output node would be capable of the 6-term Taylor expansion of the exponential:
in hid out
1 ---- h0 -\
x -- h1 --\
x^2 -- h2 ---\
x^3 -- h3 ----- y
x^4 -- h4 ---/
x^5 -- h5 --/
Not much of a neural net, but you get the point.
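To see the point numerically, a plain weighted sum with fixed weights 1/k! over inputs 1, x, ..., x^5 already tracks exp(x) closely on a small interval (a toy check, not a neural net):

#include <cmath>
#include <cstdio>

int main()
{
    const double w[6] = { 1.0, 1.0, 1.0/2, 1.0/6, 1.0/24, 1.0/120 };  // weights 1/k!

    for (double x = -1.0; x <= 1.0; x += 0.5) {
        double sum = 0.0, xpow = 1.0;
        for (int k = 0; k < 6; ++k) { sum += w[k] * xpow; xpow *= x; }
        std::printf("x=% .2f   6-term sum=% .6f   exp(x)=% .6f\n", x, sum, std::exp(x));
    }
    return 0;
}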
Further down the Wikipedia page on Taylor series, there are expansions for sin and cos, which are given in terms of odd powers of x and even powers of x respectively (think about it: sin is odd, cos is even, and yes it is that straightforward). So if you supply all the powers of x, I would guess that the sin and cos versions will look pretty similar, with alternating zero weights (sin: 0, 1, 0, -1/6, ...; cos: 1, 0, -1/2, ...).
You can always compute sine and then compute cosine externally, but I think your concern here is why the neural net is not learning the cosine function when it can learn the sine function. Assuming this artifact is not caused by your code, I would suggest the following:
It definitely looks like an error in the learning algorithm. It could be because of your starting point. Try starting with weights that give the correct result for the first input and then march forward.
Check whether there is a heavy bias in your learning - more +ve than -ve.
Since cosine can be computed as sine of (90° minus the angle), you could find the weights for sine and then recompute the weights for cosine in one step.

generating a pseudo-random positive definite matrix

I wanted to test a simple Cholesky code I wrote in C++. So I am generating a random lower-triangular matrix L and multiplying it by its transpose to generate A.
A = L * Lt;
But my code fails to factor A. So I tried this in Matlab:
N=200; L=tril(rand(N, N)); A=L*L'; [lc,p]=chol(A,'lower'); p
This outputs a non-zero p, which means Matlab also fails to factor A. I am guessing the randomness generates rank-deficient matrices. Am I right?
Update:
I forgot to mention that the following Matlab code seems to work, as pointed out by Malife below:
N=200; L=rand(N, N); A=L*L'; [lc,p]=chol(A,'lower'); p
The difference is that L is lower-triangular in the first snippet but not in the second. Why should that matter?
I also tried the following with scipy after reading A simple algorithm for generating positive-semidefinite matrices:
from scipy import random, linalg
A = random.rand(100, 100)
B = A*A.transpose()
linalg.cholesky(B)
But it errors out with:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/scipy/linalg/decomp_cholesky.py", line 66, in cholesky
c, lower = _cholesky(a, lower=lower, overwrite_a=overwrite_a, clean=True)
File "/usr/lib/python2.7/dist-packages/scipy/linalg/decomp_cholesky.py", line 24, in _cholesky
raise LinAlgError("%d-th leading minor not positive definite" % info)
numpy.linalg.linalg.LinAlgError: 2-th leading minor not positive definite
I don't understand why that's happening with scipy. Any ideas?
Thanks,
Nilesh.
The problem is not with the Cholesky factorization. The problem is with the random matrix L.
rand(N,N) is much better conditioned than tril(rand(N,N)). To see this, compare cond(rand(N,N)) to cond(tril(rand(N,N))). I got something like 1e3 for the first and 1e19 for the second, so the condition number of the second matrix is much higher and computations will be less stable numerically.
In the ill-conditioned case this results in some small negative eigenvalues - to see this, look at the eigenvalues using eig(); some small ones will be negative.
So I would suggest using rand(N,N) to generate a numerically stable random matrix.
BTW if you are interested in the theory of why this happens, you can look at this paper:
http://epubs.siam.org/doi/abs/10.1137/S0895479896312869
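Since the original Cholesky test is in C++, here is what that suggestion looks like there; a plain-array sketch with illustrative names (in practice you would use your own matrix type or a library):

#include <random>
#include <vector>

// Sketch: fill a full random matrix M and form A = M * M^T, which is
// symmetric positive semi-definite and, for a random full M, well enough
// conditioned for a Cholesky test at moderate sizes.
std::vector<std::vector<double>> randomSPD(int n, unsigned seed = 42)
{
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    std::vector<std::vector<double>> M(n, std::vector<double>(n));
    for (auto& row : M)
        for (double& v : row)
            v = dist(rng);

    std::vector<std::vector<double>> A(n, std::vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                A[i][j] += M[i][k] * M[j][k];   // (M * M^T)(i, j)

    // Optionally add n to the diagonal, as another answer suggests, to
    // guarantee strict positive definiteness and a small condition number:
    // for (int i = 0; i < n; ++i) A[i][i] += n;

    return A;
}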
As has been said before, the eigenvalues of a triangular matrix lie on the diagonal. Hence, by doing
L=tril(rand(n))
you made sure that eig(L) only yields positive values. You can improve the condition number of L*L' by adding a large enough positive number to the diagonal, e.g.
L=L+n*eye(n)
and then L*L' is positive definite and well conditioned:
> cond(L*L')
ans =
1.8400
To generate a random positive definite matrix in MATLAB your code should read:
N=200;
L=rand(N, N);
A=L*transpose(L);
[lc,p]=chol(A,'lower');
eig(A)
p
And you should indeed find that the eigenvalues are greater than zero and that p is zero.
You ask about the lower triangular case. Let's see what happens, and why there are problems. It is often a good idea to look at a test case.
For a simple 5x5 matrix,
L = tril(rand(5))
L =
0.72194 0 0 0 0
0.027804 0.78422 0 0 0
0.26607 0.097189 0.77554 0 0
0.96157 0.71437 0.98738 0.66828 0
0.024571 0.046486 0.94515 0.38009 0.087634
eig(L)
ans =
0.087634
0.66828
0.77554
0.78422
0.72194
Of course, the eigenvalues of a triangular matrix are just the diagonal elements. Since the elements generated by rand are always between 0 and 1, on average they will be roughly 1/2. Perhaps looking at the distribution of the determinant of L will help. Better is to consider the distribution of log(det(L)). Since the determinant will be simply the product of the diagonal elements, the log is the sum of the logs of the diagonal elements. (Yes, I know the determinant is a poor measure of singularity, but the distribution of log(det(L)) is easily computed and I'm feeling too lazy to think about the distribution of the condition number.)
Ah, but the negative log of a uniform random variable is an exponential variate, in this case an exponential with lambda = 1. The sum of the logs of a set of n uniform random numbers from the interval (0,1) will, by the central limit theorem, be approximately Gaussian. The mean of that sum will be -n. Therefore the determinant of a lower triangular nxn matrix generated by such a scheme will be on the order of exp(-n). When n is 200, MATLAB tells me that
exp(-200)
ans =
1.3839e-87
Thus for a matrix of any appreciable size, we can see that it will be poorly conditioned. Worse, when you form the product L*L', it will generally be numerically singular. The same arguments apply to the condition number. Even for a 20x20 matrix, the condition number of such a lower triangular matrix is fairly large, and when we then form the matrix L*L', the condition number is squared, as expected.
L = tril(rand(20));
cond(L)
ans =
1.9066e+07
cond(L*L')
ans =
3.6325e+14
See how much better things are for a full matrix.
A = rand(20);
cond(A)
ans =
253.74
cond(A*A')
ans =
64384

Calculation of accumulation area

I'm looking for a GIS/geometric algorithm:
I have 1000 points randomly distributed in a large area (such as a city). How can I find all the small areas which contain more than 15 points? Like this picture below:
Each point has its own latitude and longitude coordinates. A small area is less than 200m x 200m.
You should take a look at R-tree structures.
See http://en.wikipedia.org/wiki/R-tree
Such algorithms are implemented e.g. in the SQLite3 engine.
See http://www.sqlite.org/rtree.html
Our Open Source version already includes the RTREE extension for Delphi 6 up to XE, compiled by default since rev. 1.8.
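For illustration, here is a sketch in C++ of that R-tree approach using Boost.Geometry (any R-tree implementation would work the same way); it assumes the coordinates have already been projected to meters, since 200 m is a metric distance, and the names are placeholders:

#include <boost/geometry.hpp>
#include <boost/geometry/index/rtree.hpp>
#include <iostream>
#include <iterator>
#include <vector>

namespace bg  = boost::geometry;
namespace bgi = boost::geometry::index;

using Pt  = bg::model::point<double, 2, bg::cs::cartesian>;
using Box = bg::model::box<Pt>;

int main()
{
    std::vector<Pt> pts = /* ... the 1000 points, projected to meters ... */ {};

    // Bulk-load the points into an R-tree.
    bgi::rtree<Pt, bgi::quadratic<16>> tree(pts.begin(), pts.end());

    for (const Pt& p : pts) {
        // 200 m x 200 m box centred on this point.
        Box box(Pt(bg::get<0>(p) - 100.0, bg::get<1>(p) - 100.0),
                Pt(bg::get<0>(p) + 100.0, bg::get<1>(p) + 100.0));

        std::vector<Pt> hits;
        tree.query(bgi::within(box), std::back_inserter(hits));

        // The count includes the point itself.
        if (hits.size() > 15)
            std::cout << "dense area around (" << bg::get<0>(p) << ", "
                      << bg::get<1>(p) << "): " << hits.size() << " points\n";
    }
    return 0;
}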
Not sure what your performance requirements are, but a naive implementation would be, for each point, to sum up the inverse of the distance to all other points:
for i := 0 to 999 do
  for j := 0 to 999 do
    if i<>j then
      Point[i].Score := Point[i].Score + ( 1 / Distance(Point[i], Point[j]) );
The points near the center of each accumulation area will have the highest score.