Using the Weighted Euclidean Distance Function for DBSCAN in ELKI

Using the Weighted Euclidean Distance Function for DBSCAN in ELKI - euclidean-distance

I'm experimenting with ELKI (which is awesome, btw) and would like to try the weighted Euclidean distance function as a metric for the DBSCAN algorithm.
First of all, I don't know how it works, except for this.
I tried a few values for the parameter distance.weights. If I insert for example 3,1 in its field, this is what I get after I hit ENTER:
I'm still able to run the task but I see no difference in the results, no matter the values I insert.
Could anyone please provide a short explanation on how the function works and eventually what I'm doing wrong?
Thanks in advance.

Related

Integrate RANSAC to compute essential matrix

I have calculated the essential matrix using the 5 point algorithm. I'm not sure how to integrate it with ransac so it gives me a better outcome.
Here is the source code. https://github.com/lunzhang/openar/blob/master/src/utils/5point/computeEssential.js
Currently, I was thinking about computing the essential matrix for 5 random points then convert the essential matrix to fundamental and see the error threshold using this equation x'Fx = 0. But then I'm not sure, what to do after.
How do I know which points to set as outliners? If the errors too big, do I set them as outliners right away? Could it be possible that one point could produce different essential matrices depending on what the other 4 points are?

Well, here is a short explanation, in pseudo-code, of how you can integrate this with ransac. Basically, all Ransac does is compute your model (here the Essential) using a subset of the data, and then sees if the rest of data "is happy" with that result. It keeps the result for which a highest portion of the dataset "is happy".
highest_number_of_happy_points=-1;
best_estimated_essential_matrix=Identity;
for iter=1 to max_iter_number:
n_pts=get_n_random_pts(P);//get a subset of n points from the set of points P. You can use 5, but you can also use more.
E=compute_essential(n_pts);
number_of_happy_points=0;
for pt in P:
//we want to know if pt is happy with the computed E
err=cost_function(pt,E);//for example x^TFx as you propose, or X^TEX with the essential.
if(err<some_threshold):
number_of_happy_points+=1;
if(number_of_happy_points>highest_number_of_happy_points):
highest_number_of_happy_points=number_of_happy_points;
best_estimated_essential_matrix=E;
This should do the trick. Usually, you set some_threshold experimentally to a low value. There are of course more sophisticated Ransacs, you can easily find them by googling.
Your idea of using x^TFx is fine in my opinion.
Once this Ransac completes, you will have best_estimated_essential_matrix. The outliers are those that have a x^TFx value that is greater than your optional threshold.
To answer your final question, yes, a point could produce a different matrix given 4 different points, because their spatial configuration is different (you can have degenerate situations). In an ideal settings this wouldn't be the case, but we always have noise, matching errors and so on, so what happens in the end is that the equations you obtain with 5 points wont produce the exact same results as for 5 other points.
Hope this helps.

Clustering a list of dates

I have a list of dates I'd like to cluster into 3 clusters. Now, I can see hints that I should be looking at k-means, but all the examples I've found so far are related to coordinates, in other words, pairs of list items.
I want to take this list of dates and append them to three separate lists indicating whether they were before, during or after a certain event. I don't have the time for this event, but that's why I'm guessing it by breaking the date/times into three groups.
Can anyone please help with a simple example on how to use something like numpy or scipy to do this?

k-means is exclusively for coordinates. And more precisely: for continuous and linear values.
The reason is the mean functions. Many people overlook the role of the mean for k-means (despite it being in the name...)
On non-numerical data, how do you compute the mean?
There exist some variants for binary or categorial data. IIRC there is k-modes, for example, and there is k-medoids (PAM, partitioning around medoids).
It's unclear to me what you want to achieve overall... your data seems to be 1-dimensional, so you may want to look at the many questions here about 1-dimensional data (as the data can be sorted, it can be processed much more efficiently than multidimensional data).
In general, even if you projected your data into unix time (seconds since 1.1.1970), k-means will likely only return mediocre results for you. The reason is that it will try to make the three intervals have the same length.
Do you have any reason to suspect that "before", "during" and "after" have the same duration? If not, don't use k-means.
You may however want to have a look at KDE; and plot the estimated density. Once you have understood the role of density for your task, you can start looking at appropriate algorithms (e.g. take the derivative of your density estimation, and look for the largest increase / decrease, or estimate an "average" level, and look for the longest above-average interval).

Here are some workaround methods that may not be the best answer but should help.
You can plot the dates as converted durations from a starting date (such as one week)
and convert the dates to number representations for time in minutes or hours from the starting point.
These would all graph along an x-axis but Kmeans should still be possible and clustering still visible on a graph.
Here are more examples of numpy:Python k-means algorithm

Finding the Distance Between Two Lines that represent GPS routes (MATLAB, Java, C++, or Python)

I have been researching and trying to figure this one out to no avail. I have found many ways not to solve this...
The gist of the problem: I am looking for a method to calculate the deviance from an original path traveled by way of GPS coordinates. I have multiple csv files that contain latitude, longitude, and UTC time. I have created KML files from this information for a visual viewing of the deviance and now would like to put a value on this deviation. I ahve chosen a route as a reference and would like to measure the other routes against the reference route. There are multiple routes each having it's own reference route, each of which has many runs. No two runs are the same, and some of the routes deviate more than the next. I cannot use time, only lat and lon since the runs were completed over many weeks of data collection.
What I have tried thus far:
Haversine and Equirectangular formulas (looping through and measuring point to point).
Outcome: The coordinates only line up for a short period of time and the difference in the number of points varies greatly.
Area under each curve: was going to find the difference of the two routes by this method.
Outcome: Really unsure how to proceed, nor find equations suitable for this calculation.
There were a couple more feeble attempts, but have been working on this for a few weeks now, with not much to show for and still unsure on how to proceed.
Any help or ideas would be greatly appreciated.

Possible solution 1: Instead of calculating the "sideways" deviation between the two routes, just compare the respective arc lengths (Matlab: arclength).
Possible solution 2: To compare two routes, each going from the same start A to the same end point B: Draw a straight line between A and B, place a number of equidistant points along AB, and then average the perpendicular distance from these points on AB to the paths you want to compare. The absolute difference between the cumulative deviations from the straight-line reference is your deviation.
Possible solution 3: Calculate the arc length of each route. Place a number of equidistant points along each route. Average the distance between these points.
Both solution 2 and 3 will depend on the number of points you place, but with a higher number of points, the average deviation will converge. Note that these solutions are both related to calculating the area under each curve.

Estimate color distribution with Gaussian mixture model

I am trying to use two Gaussian mixtures with EM algorithm to estimate color distribution of a video frame. For that, I want to use two separate peaks in the color distribution as the two Gaussian means to facilitate the EM calculation. I have several difficulties with the implementation of these in OpenCV.
My first question is: how can I determine the two peaks? I've searched about peak estimation in OpenCV, but still couldn't find any seperate function. So I am going to determine two regions, then find their maximum values as peaks. Is this way correct?
My second question is: how to perform Gaussian mixture model with EM in OpenCV? As far as I know, the "cv::EM::predict" function could give me the index of the most probable mixture component. But I have difficulties with training EM. I've searched and found some other codes, but finding the correct parameters is too much difficult for. Could someone provide me any example code for this? Thank you in advance.

#ederman, try {OpenCV library location}\opencv\samples\cpp\em.cpp instead of the web link. I think the sample code in the link is out of date now. I have successfully compiled the sample code in OpenCV 2.3.1. It shouldn't be a problem for 2.4.2.
Good luck:)

My first question is: how can I determine the two peaks?
I would iterate through the range of sample values possible, and test when the does EM.predict(sample)[0] peaks.

Finding the spread of each cluster from Kmeans

I'm trying to detect how well an input vector fits a given cluster centre. I can find the best match quite easily (the centre with the minimum euclidean distance to the input vector is the best), however, I now need to work how good a match that is.
To do this I need to find the spread (standard deviation?) of the vectors which build up the centroid, then see if the distance from my input vector to the centre is less than the spread. If it's more than the spread than I should be able to say that I have no clusters to fit it (given that the best doesn't fit the input vector well).
I'm not sure how to find the spread per cluster. I have all the centre vectors, and all the training vectors are labelled with their closest cluster, I just can't quite fathom exactly what I need to do to get the spread.
I hope that's clear? If not I'll try to reword it!
TIA
Ian

Use the distance function and calculate the distance from your center point to each labeled point, then figure out the mean of those distances. That should give you the standard deviation.

If you switch to using a different algorithm, such as Mixture of Gaussians, you get the spread (e.g., std. deviation) as part of the model (clustering result).
http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/mixture.html
http://en.wikipedia.org/wiki/Mixture_model

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Using the Weighted Euclidean Distance Function for DBSCAN in ELKI - euclidean-distance

Related

Integrate RANSAC to compute essential matrix

Clustering a list of dates

Finding the Distance Between Two Lines that represent GPS routes (MATLAB, Java, C++, or Python)

Estimate color distribution with Gaussian mixture model

Finding the spread of each cluster from Kmeans

Categories

Resources