Can we compare clusters with C-Index average?

I use the K-Means algorithm to create clusters. As you know, K-Means needs the cluster count as a parameter. I try cluster counts from two to eight, calculate the C-Index of every cluster in each iteration, and take the average of these C-Indexes. Then I compare the C-Index averages and choose the cluster count with the minimum average as the one giving the best-quality clustering. Is that a valid way of detecting the cluster count?

There is no single correct way to detect the cluster count; this is still an active research area. The Wikipedia article on k-means says:
The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user.
Only you can determine whether using the C-Index in this way is a good way to choose the number of clusters in your domain. See also other questions about using the C-Index in clustering.
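To make the procedure from the question concrete, here is a minimal sketch that implements the C-Index directly, under some assumptions: Euclidean distances, scikit-learn's KMeans, synthetic data, and a single run per k rather than the averaging over several runs mentioned in the question.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def c_index(X, labels):
    # Hubert & Levin C-Index: (S - S_min) / (S_max - S_min), lower is better.
    # S is the sum of within-cluster pairwise distances; S_min / S_max are the
    # sums of the same number of smallest / largest pairwise distances overall.
    d = pdist(X)
    D = squareform(d)
    iu = np.triu_indices_from(D, k=1)
    within = (labels[:, None] == labels[None, :])[iu]
    s = D[iu][within].sum()
    n_w = int(within.sum())
    d_sorted = np.sort(d)
    return (s - d_sorted[:n_w].sum()) / (d_sorted[-n_w:].sum() - d_sorted[:n_w].sum())

# try k = 2..8 on some synthetic 2-D data and keep the k with the smallest C-Index
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in (0, 3, 6)])
scores = {k: c_index(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
          for k in range(2, 9)}
print(scores, min(scores, key=scores.get))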

Related

Django spatial calculations efficiency in PostGIS (RDS) vs. in object manager (EC2)?

From this question, I'd like to decide whether I should use GeoDjango, or roll my own with Python to filter Points within a certain radius of another Point.
There are two excellent answers that take different approaches to the question of how to perform such a calculation here: Django sort by distance
One of them uses GeoDjango to perform the distance calculation in PostGIS. I'm guessing that the compute would be done on the RDS instance?
The other uses a custom manager to implement the Great Circle distance formula. The compute would obviously be done on the EC2 instance.
I would imagine that the PostGIS implementation is more efficient because it's likely that people much smarter than I have optimized it. To what extent have they optimized it? Is there anything special about their implementation?
Assuming I am correct that GeoDjango performs the distance computation using PostGIS on the RDS instance, I would imagine that RDS is not suited for heavy compute tasks and may end up being slower or more expensive in the end. Are my assumptions correct?
What if I don't need a precise distance, and an octagon or even a square would suffice? In the case of a square, it would simply be a matter of filtering Points with latitude and longitude within a certain range. Is GeoDjango/PostGIS able to perform estimates like this?
If I do need a precise distance, I could calculate the furthest bounds that can be reached with the given radius, and only perform precise distance calculations on Points within those bounds. Does GeoDjango/PostGIS do this?
I'll try to address your questions:
One of them uses GeoDjango to perform the distance calculation in PostGIS. I'm guessing that the compute would be done on the RDS instance?
If you are bringing two django models to memory, and doing the calculation using Django, such as
model_a = Foo.objects.get(id=1)
model_b = Bar.objects.get(id=1)
distance = model_a.geometry.distance(model_b.geometry)
This will be done in Python, using GEOS.
https://docs.djangoproject.com/en/1.9/ref/contrib/gis/geos/#django.contrib.gis.geos.GEOSGeometry.distance
There are also distance lookups in Django, such as:
from django.contrib.gis.geos import Point
from django.contrib.gis.measure import D

foos = Foo.objects.filter(geometry__distance_lte=(Point(0, 0, srid=4326), D(km=1)))
This calculation will be done by the backend (i.e. the database).
The other uses a custom manager to implement the Great Circle distance formula. The compute would obviously be done on the EC2 instance.
I would imagine that the PostGIS implementation is more efficient because it's likely that people much smarter than I have optimized it. To what extent have they optimized it? Is there anything special about their implementation?
Django has methods to use the great-circle distance (GCD) in queries. This requires a transformation in PostGIS from geometry fields to geography fields, and only EPSG:4326 is supported for now. If that's all you need, I bet the PostGIS implementation is good enough for almost all applications (if not all).
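For reference, a minimal sketch of a database-side distance query using GeoDjango's Distance function (assuming a hypothetical model Foo with a geometry/geography field named geometry; the actual computation runs in PostGIS, not in Python):

from django.contrib.gis.db.models.functions import Distance
from django.contrib.gis.geos import Point
from myapp.models import Foo  # hypothetical model

origin = Point(0, 0, srid=4326)

# distance is computed by the database; on a geography column this is the
# geodesic distance in metres
nearest = (Foo.objects
           .annotate(distance=Distance("geometry", origin))
           .order_by("distance")[:10])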
Assuming I am correct that GeoDjango performs the distance computation using PostGIS on the RDS instance, I would imagine that RDS is not suited for heavy compute tasks and may end up being slower or more expensive in the end. Are my assumptions correct?
I don't know much about Amazon's products, but without an estimate of size (number of rows, types of calculations such as cross products, etc.), it's hard to help further.
What if I don't need a precise distance, and an octagon or even a square would suffice? In the case of a square, it would simply be a matter of filtering Points with latitude and longitude within a certain range. Is GeoDjango/PostGIS able to perform estimates like this?
What kind of data do you have? There are several components in calculating distances and areas, mainly the spatial reference that you use (datum, ellipsoid, projection).
If you need accurate (or more accurate) distance measurements between two distant parts of the globe, the geography type is more precise and will yield good results. If you do that kind of measurement on a Cartesian plane instead, you will get bad results.
If your data is local, covering only a few square kilometres, consider using a more local spatial reference; WGS84 (EPSG:4326) is more suitable for global data. Local spatial references can give you precise results, but only over much smaller extents.
If I do need a precise distance, I could calculate the furthest bounds that can be reached with the given radius, and only perform precise distance calculations on Points within those bounds. Does GeoDjango/PostGIS do this?
I think you are optimizing too early. I know your question is a bit old, but this is something you should only care about when it starts to hurt. PostGIS and Django have been grinding a lot of data for a long time for me in a government system that checks land registry parcels and runs tons of queries to check several parameters. It has been working for a few years without a hitch.
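That said, if you do want the cheap square-style pre-filter from the question, here is a minimal sketch (again assuming a hypothetical model Foo with a point field named geometry). The bboverlaps lookup maps to the PostGIS && operator, which can use the spatial index, and the candidates can then be refined with a precise distance lookup:

from django.contrib.gis.geos import Polygon
from myapp.models import Foo  # hypothetical model

# square "pre-filter": a bounding box of roughly +/- 0.01 degrees around (0, 0)
bbox = Polygon.from_bbox((-0.01, -0.01, 0.01, 0.01))

# bounding-box overlap; cheap, index-friendly, and only approximate
candidates = Foo.objects.filter(geometry__bboverlaps=bbox)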

How to merge Point Cloud Clusters of different sizes

I am working with 3D point clouds using PCL. I am using the Fast Point Feature Histogram (FPFH) as a descriptor, which is 33-dimensional for a single point. In my work I want to cluster the point cloud data using FPFH, where the clusters are defined by this feature.
However, I am confused: if I compute the FPFH of a cluster containing, say, 200 points, then my feature matrix for that cluster is 200 x 33. Since two clusters will have different sizes, I cannot use feature matrices of varying size like that. My question is: how can I appropriately compute the features and describe a cluster with a single 1 x 33 vector?
I was thinking of using the mean, but the mean does not capture the relative information of all the distinct points.
The FPFH descriptor is calculated around a point (from the points neighbouring that point, typically selected either by k nearest neighbours or by a fixed radius), not from the point itself. So no matter what the size of the cluster is, the FPFH calculated from it will only be 33-dimensional. For each cluster you just need to feed all the points in the cluster to the FPFH calculation routine and get the 33-dimensional feature vector out. You may also need to specify a point cloud containing the points around which to calculate the feature vector; if you do this per cluster, just pass the centroid of the cluster (a single point), and make sure the radius/k is big enough that all points in the cluster are selected.
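A small numpy sketch of the per-cluster setup described above; the FPFH call itself is left to PCL, and cluster_points is a hypothetical (N, 3) array of the cluster's XYZ coordinates:

import numpy as np

def cluster_query_point_and_radius(cluster_points):
    # centroid to use as the single query point for the FPFH estimation,
    # plus a search radius large enough to include every point in the cluster
    centroid = cluster_points.mean(axis=0)
    radius = np.linalg.norm(cluster_points - centroid, axis=1).max()
    return centroid, radius * 1.05  # small margin so boundary points are kept

# pass `centroid` as the query point and `radius` as the search radius to the
# FPFH estimation routine; the result is a single 1 x 33 descriptor per cluster.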

Clustering a list of dates

I have a list of dates I'd like to cluster into 3 clusters. Now, I can see hints that I should be looking at k-means, but all the examples I've found so far are related to coordinates, in other words, pairs of list items.
I want to take this list of dates and append them to three separate lists indicating whether they were before, during or after a certain event. I don't have the timing of this event, which is why I'm trying to guess it by breaking the date/times into three groups.
Can anyone please help with a simple example on how to use something like numpy or scipy to do this?
k-means is exclusively for coordinates. And more precisely: for continuous and linear values.
The reason is the mean function. Many people overlook the role of the mean in k-means (despite it being in the name...).
On non-numerical data, how do you compute the mean?
There exist some variants for binary or categorical data. IIRC there is k-modes, for example, and there is k-medoids (PAM, partitioning around medoids).
It's unclear to me what you want to achieve overall... your data seems to be 1-dimensional, so you may want to look at the many questions here about 1-dimensional data (as the data can be sorted, it can be processed much more efficiently than multidimensional data).
In general, even if you projected your data into unix time (seconds since 1.1.1970), k-means will likely only return mediocre results for you. The reason is that it will try to make the three intervals have the same length.
Do you have any reason to suspect that "before", "during" and "after" have the same duration? If not, don't use k-means.
You may, however, want to have a look at kernel density estimation (KDE) and plot the estimated density. Once you have understood the role of density for your task, you can start looking at appropriate algorithms (e.g. take the derivative of your density estimate and look for the largest increase/decrease, or estimate an "average" level and look for the longest above-average interval).
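A minimal sketch of the KDE idea using scipy, with made-up timestamps (seconds from some origin) that contain a dense burst in the middle:

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
t = np.sort(np.concatenate([
    rng.uniform(0, 3600, 40),       # "before": sparse
    rng.normal(5400, 300, 120),     # "during": dense burst
    rng.uniform(7200, 10800, 40),   # "after": sparse
]))

kde = gaussian_kde(t)
grid = np.linspace(t.min(), t.max(), 500)
density = kde(grid)

# crude split: the span between the first and last above-average density
# points is treated as "during"; everything else is "before" / "after"
above = density > density.mean()
during_start = grid[above.argmax()]
during_end = grid[len(above) - above[::-1].argmax() - 1]
print(during_start, during_end)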
Here are some workaround methods that may not be the best answer but should help.
You can convert the dates to numeric values, such as the duration in minutes or hours from a chosen starting point (for example, the start of a given week). These values all lie along a single axis, but k-means is still possible and the clustering is still visible on a graph; a sketch follows below.
Here are more numpy examples: Python k-means algorithm
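A small scipy sketch of that conversion followed by k-means (the dates here are made up):

import numpy as np
from datetime import datetime
from scipy.cluster.vq import kmeans2

dates = ["2023-05-01 09:15", "2023-05-01 09:40", "2023-05-01 13:05",
         "2023-05-01 13:20", "2023-05-01 18:55", "2023-05-01 19:10"]
start = datetime(2023, 5, 1)

# minutes since the chosen starting point
minutes = np.array(
    [(datetime.strptime(d, "%Y-%m-%d %H:%M") - start).total_seconds() / 60
     for d in dates])

# k-means with k=3 on the 1-D values
centroids, labels = kmeans2(minutes, 3, minit="points")
print(labels)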

assigning new observation to a cluster

Suppose I have a user/item feature matrix in Mahout, I have derived the users' log-likelihood similarity, and I have identified three user clusters. Now I have a new user with a set of items (same format and same set of items). How can I assign the new user to one of these three clusters without recalculating the similarity matrix and repeating the clustering procedure?
The problem is that if I use the current cluster centroids and calculate the log-likelihood similarity or any other distance measure, the centroids are not binary anymore. If I use k-medians, there is a risk of them being all zeros. What is a good way to approach this? Is there any model-based clustering that you would recommend, especially in Mahout?
How about training classifiers for the clusters?
To avoid the zeros, you could use k-medoids instead. The key difference here is that k-medoids will choose the most central object from your dataset, so it will actually have the same sparsity as your data objects.
As I don't use Mahout, I do not know whether k-medoids is available there. As far as I know it is much more computationally intensive than k-means or k-medians.
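A small numpy/scipy sketch of the medoid idea: because medoids are actual (binary) rows from the data, a new binary user can be assigned to the cluster of the closest medoid. Jaccard distance is used here as a stand-in for the log-likelihood similarity, and the vectors are made up:

import numpy as np
from scipy.spatial.distance import cdist

# hypothetical binary user/item rows chosen as medoids by k-medoids
medoids = np.array([[1, 0, 1, 0, 0],
                    [0, 1, 0, 1, 1],
                    [1, 1, 1, 0, 1]])

new_user = np.array([[1, 0, 1, 1, 0]])

d = cdist(new_user, medoids, metric="jaccard")
print(int(d.argmin()))  # index of the assigned cluster

Training a classifier on the existing cluster labels (the first suggestion above) works the same way: fit it once on the clustered users, then predict the cluster of each new user without re-running the clustering.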

Finding the spread of each cluster from Kmeans

I'm trying to detect how well an input vector fits a given cluster centre. I can find the best match quite easily (the centre with the minimum Euclidean distance to the input vector is the best); however, I now need to work out how good a match that is.
To do this I need to find the spread (standard deviation?) of the vectors that build up the centroid, then see if the distance from my input vector to the centre is less than the spread. If it's more than the spread, then I should be able to say that I have no clusters that fit it (given that the best one doesn't fit the input vector well).
I'm not sure how to find the spread per cluster. I have all the centre vectors, and all the training vectors are labelled with their closest cluster, I just can't quite fathom exactly what I need to do to get the spread.
I hope that's clear? If not I'll try to reword it!
TIA
Ian
Use your distance function to calculate the distance from the centre point to each point labelled with that cluster. The mean of those distances gives you an average spread; the root mean square of the distances is closer to a standard deviation.
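A small numpy sketch of that, assuming points is an (N, D) array, labels holds each point's cluster index, and centroids is a (K, D) array from k-means:

import numpy as np

def cluster_spreads(points, labels, centroids):
    # per-cluster root-mean-square distance to the centroid
    # (a standard-deviation-like radius for each cluster)
    spreads = {}
    for k, c in enumerate(centroids):
        dists = np.linalg.norm(points[labels == k] - c, axis=1)
        spreads[k] = np.sqrt((dists ** 2).mean())
    return spreads

def fits_a_cluster(x, centroids, spreads):
    # nearest centroid, and whether x falls within that cluster's spread
    d = np.linalg.norm(centroids - x, axis=1)
    best = int(d.argmin())
    return best, d[best] <= spreads[best]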
If you switch to using a different algorithm, such as Mixture of Gaussians, you get the spread (e.g., std. deviation) as part of the model (clustering result).
http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/mixture.html
http://en.wikipedia.org/wiki/Mixture_model
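For example, a minimal scikit-learn sketch on synthetic 2-D data (using scikit-learn's GaussianMixture rather than the code from the tutorials above):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 2, (100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

print(gmm.means_)                # cluster centres
print(gmm.covariances_)          # per-cluster spread (covariance matrices)
print(gmm.score_samples(X[:5]))  # per-sample log-likelihood: how well each point fits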