How could I get outlier instance with k-means clustering in WEKA? - weka

I have used SimpleKmeans class in WEKA, so I do clustering instance as well. But I have a problem during getting outlier instances.
I supposed, each cluster in this class has a center(or centroid) and a radius, so I could find outliers by checking all clusters' circle with its centroid and its radius. Although I couldn't find any variables or functions that get cluster's radius.
Now, Do you know any other way for finding outliers at SimpleKmeans class in WEKA? Or Any variables that shows radius for each cluster?

I couldn't find any radius variable in SimpleKmeans class, however I use alternative solution.
Clustering find best cluster that has minimum Euclidean distance(or manhattan ,..), so It will outlier instance if the nearest cluster is more than a specific threshold.

Related

Generating anchor boxes using K-means clustering , YOLO

I am trying to understand working of YOLO and how it detects object in an image. My question is, what role does k-means clustering play in detecting bounding box around the object? Thanks.
K -means clustering algorithm is very famous algorithm in data science. This algorithm aims to partition n observation to k clusters. Mainly it includes :
Initialization : K means (i.e centroid) are generated at random.
Assignment : Clustering formation by associating the each observation with nearest centroid.
Updating Cluster : Centroid of a newly created cluster becomes mean.
Assignment and Update are repitatively occurs untill convergence.
The final result is that the sum of squared errors is minimized between points and their respective centroids.
EDIT :
Why use K means
K-means is computationally faster and more efficient compare to other unsupervised learning algorithms. Don't forget time complexity is linear.
It produces a higher cluster then the hierarchical clustering. More number of cluster helps to get more accurate end result.
Instance can change cluster (move to another cluster) when centroid are re-computed.
Works well even if some of your assumption are broken.
what it really does in determining anchor box
It will create a thouasands of anchor box (i.e Clusters in k-means) for each predictor that represent shape, location, size etc.
For each anchor box, calculate which object’s bounding box has the highest overlap divided by non-overlap. This is called Intersection Over Union or IOU.
If the highest IOU is greater than 50% ( This can be customized), tell the anchor box that it should detect the object it has highest IOU.
Otherwise if the IOU is greater than 40%, tell the neural network that the true detection is ambiguous and not to learn from that example.
If the highest IOU is less than 40%, then it should predict that there is no object.
Thanks!
In general, bounding boxes for objects are given by tuples of the form
(x0,y0,x1,y1) where x0,y0 are the coordinates of the lower left corner and x1,y1 are the coordinates of the upper right corner.
Need to extract width and height from these coordinates, and normalize data with respect to image width and height.
Metric for K-mean
Euclidean distance
IoU (Jaccard index)
IoU turns out to better than former
Jaccard index = (Intersection between selected box and cluster head box)/(Union between selected box and cluster head box)
At initialization we can choose k random boxes as our cluster heads. Assign anchor boxes to respective clusters based on IoU value > threshold and calculate mean IoU of cluster.
This process can be repeated until convergence.

What is the fastest way to calculate position cluster centers constriant by a concave polygon

I have a distribution of weighted 2D pose estimates (position + orientation) that are samples of an unknown PDF of a systems pose. All estimates and the underlying real position are constrained by a concave polygon.
The picture shows an exemplary distribution. The magenta colored circles are the estimates, the radius line indicates the estimated direction. The weights are indicated by the circles diameter. The red dot is the weighted mean, the yellow cirlce indicates the variance and the direction but is of no importance for the following problem:
From all estimates I want to derive the most likely position of the system.
Up to now I have evaluated the following approaches:
Using the estimate with the highest weight: Gives poor results since one estimate with a high weight outperforms several coinciding estimates with slightly lower weights.
Weighted Mean: Not applicable since the mean might lie outside the polygon as in the picture (red dot with yellow circle).
Weighted Median: Would work but does neglect potential clusters. E.g. in the image below two clusters are prominent of which one is more likely than the other.
Additionally I have looked into K-Means and K-Medoids. For K-Means I do not know the most efficient way to constrain the centers to the polygon. K-Medoids seems to work, but has poor performance (O(n^2)), which is important since I have a high number of estimates (contrary to explanatory picture)
What would be the ideal algorithm to solve this kind of problem ?
What complexity can be achieved ?
Are there readily available algorithms in c++ that solve this problem, or can be easily adapted to solve it?
k-means may also yield an estimate outside your polygons.
Such constraints are beyond the clustering use case. But nothing prevents you from devising a method to correct the estimates afterwards.
For non-convex data, DBSCAN may be worth a try. You could even incorporate line-of-sight into Generalized DBSCAN easily. But I'm not convinced that clustering will help for your overall objective.

Can we compare clusters with C-Index average?

I use K-Means algorithm to create clusters. As you know K-means algorithm needs cluster count as parameter. I try cluster counts as starting two from eight and calculate all C-Index of clusters in every looping then get the avegare of these C-Indexes. Then compare C-Index avegares and choose the minimum C-Index average as best quality cluster count. Is that true way for detecting cluster count?
There is no one correct way to detect cluster count. See following google search, this is still an active research area. Wikipedia articles says that:
The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user.
Only you can determine if using c-index in this way is a good way to determine cluster numbers in your domain. See another question of using c-index in clustering.

Mahout K-Means - Getting the Mean Square Error?

I have been trying to check the quality of the clusters in Mahout K-Means and I was wondering if it is possible to get the Mean Square Error from the clusterDumper or using another command in mahout?
With ClusterDumper I was only able to get the points assigned to each clusters, the cluster values and the radius values but not the Within cluster sum of squared errors (MSE).
http://mahout.apache.org/users/clustering/cluster-dumper.html

Finding the spread of each cluster from Kmeans

I'm trying to detect how well an input vector fits a given cluster centre. I can find the best match quite easily (the centre with the minimum euclidean distance to the input vector is the best), however, I now need to work how good a match that is.
To do this I need to find the spread (standard deviation?) of the vectors which build up the centroid, then see if the distance from my input vector to the centre is less than the spread. If it's more than the spread than I should be able to say that I have no clusters to fit it (given that the best doesn't fit the input vector well).
I'm not sure how to find the spread per cluster. I have all the centre vectors, and all the training vectors are labelled with their closest cluster, I just can't quite fathom exactly what I need to do to get the spread.
I hope that's clear? If not I'll try to reword it!
TIA
Ian
Use the distance function and calculate the distance from your center point to each labeled point, then figure out the mean of those distances. That should give you the standard deviation.
If you switch to using a different algorithm, such as Mixture of Gaussians, you get the spread (e.g., std. deviation) as part of the model (clustering result).
http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/mixture.html
http://en.wikipedia.org/wiki/Mixture_model