Silhouette value increases as the number of clusters increases - python-2.7

I have a matrix in which the rows are brands and the columns are the features of each brand.
First, I calculate the affinity matrix with scikit-learn and then apply spectral clustering on the affinity matrix to do the clustering.
When I calculate the silhouette value for each number of clusters, the silhouette value keeps increasing as the number of clusters increases.
Eventually, when the number of clusters gets large enough, the silhouette calculation returns NaN.
# coding: utf-8
import pandas as pd
import sklearn.cluster as sk
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

data_event = pd.DataFrame.from_csv('\Data\data_of_events.csv', header=0, index_col=0, parse_dates=True,
                                   encoding=None, tupleize_cols=False, infer_datetime_format=False)
data_event_matrix = data_event.as_matrix(columns=['Furniture', 'Food & Drinks', 'Technology', 'Architecture',
                                                  'Show', 'Fashion', 'Travel', 'Art', 'Graphics', 'Product Design'])

# compute the affinity matrix
data_event_affinitymatrix = SpectralClustering().fit(data_event_matrix).affinity_matrix_

# clustering
for n_clusters in range(2, 100, 2):
    print n_clusters
    labels = sk.spectral_clustering(data_event_affinitymatrix, n_clusters=n_clusters, n_components=None,
                                    eigen_solver=None, random_state=None, n_init=10, eigen_tol=0.0,
                                    assign_labels='kmeans')
    silhouette_avg = silhouette_score(data_event_affinitymatrix, labels)
    print("For n_clusters =", n_clusters, "the average silhouette_score of event clustering is:", silhouette_avg)

If your intention is to find the optimal number of clusters, you can try the Elbow method. Multiple variations of this method exist, but the main idea is this: for different values of K (the number of clusters, say 1 to 8), compute the cost function that is most appropriate for your application (for example, the sum of squared distances of all points in a cluster to its centroid, or any other error/cost/variance function). In your case, if it is a distance-based function, then after a certain number of clusters you will notice that the differences along the y-axis become negligible. Plot your metric on the y-axis against the number of clusters on the x-axis, and choose the value of K on the x-axis at the point where the value on the y-axis changes abruptly.
In the example plot referred to here (image source: Wikipedia), the optimal value of K is 4.
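As a rough sketch of the Elbow method with scikit-learn (this assumes k-means inertia as the cost function, which is only one of the possible choices mentioned above; data_event_matrix refers to the matrix built in the question):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_max=10):
    # Cost for each K: sum of squared distances of points to their nearest centroid (inertia)
    costs = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        costs.append(km.inertia_)
    plt.plot(range(1, k_max + 1), costs, marker='o')
    plt.xlabel('Number of clusters K')
    plt.ylabel('Sum of squared distances')
    plt.show()

# e.g. elbow_plot(data_event_matrix, k_max=20)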
Another measure that you can use to validate your clusters is the V-measure score. It is a symmetric measure and is often used when the ground truth is unknown. It is defined as the harmonic mean of homogeneity and completeness. Here is an example in scikit-learn for your reference.
EDIT: V-measure is basically used to compare two different cluster assignments to each other.
Finally, if you are interested, you can also take a look at the Normalized Mutual Information score to validate your results.
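As a small illustration of both scores in scikit-learn (the label assignments below are made up):

from sklearn.metrics import v_measure_score, normalized_mutual_info_score

# Two cluster assignments of the same six points (hypothetical labels)
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [1, 1, 0, 0, 2, 2]

print(v_measure_score(labels_a, labels_b))               # 1.0: identical partitions up to relabelling
print(normalized_mutual_info_score(labels_a, labels_b))  # 1.0 for the same reason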
References:
Biclustering Scikit-Learn
Elbow Method: Coursera
Research Paper on V-Measure
Choosing the right number of clusters
Update: I recently came across Self-Tuning Spectral Clustering. You can give it a try.

Related

Generating anchor boxes using K-means clustering, YOLO

I am trying to understand the workings of YOLO and how it detects objects in an image. My question is: what role does k-means clustering play in detecting the bounding box around an object? Thanks.
The K-means clustering algorithm is a well-known algorithm in data science. It aims to partition n observations into k clusters. Its main steps are:
Initialization: k means (i.e., centroids) are generated at random.
Assignment: clusters are formed by associating each observation with its nearest centroid.
Update: the centroid of each newly formed cluster is recomputed as the mean of its members.
Assignment and update are repeated until convergence.
The final result is that the sum of squared errors between points and their respective centroids is minimized.
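A minimal NumPy sketch of those steps (a toy illustration only, not how YOLO implements it; empty clusters are not handled):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: k random observations serve as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment: each observation joins the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid becomes the mean of the observations assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # convergence
        centroids = new_centroids
    return centroids, labels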
EDIT:
Why use K-means
K-means is computationally faster and more efficient compared to other unsupervised learning algorithms; its time complexity is linear.
It produces tighter clusters than hierarchical clustering, and a larger number of clusters helps to get a more accurate end result.
Instances can change cluster (move to another cluster) when the centroids are recomputed.
It works reasonably well even if some of its assumptions are broken.
What it really does in determining anchor boxes
It creates thousands of anchor boxes (i.e., the clusters in k-means) for each predictor, representing shape, location, size, etc.
For each anchor box, calculate which object's bounding box has the highest overlap divided by non-overlap. This is called Intersection over Union (IoU).
If the highest IoU is greater than 50% (this threshold can be customized), the anchor box is told that it should detect the object with which it has the highest IoU.
Otherwise, if the IoU is greater than 40%, the neural network is told that the true detection is ambiguous and should not learn from that example.
If the highest IoU is less than 40%, the anchor box should predict that there is no object.
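A hedged sketch of the IoU computation and the assignment rule just described (the (x0, y0, x1, y1) box format and the exact thresholds are assumptions for illustration, not YOLO's precise implementation):

def iou(box_a, box_b):
    # Boxes as (x0, y0, x1, y1); returns intersection area divided by union area
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def anchor_target(anchor, object_boxes, pos_thresh=0.5, ignore_thresh=0.4):
    # Apply the >50% / >40% rule above to a single anchor box
    best = max(iou(anchor, obj) for obj in object_boxes)
    if best > pos_thresh:
        return 'positive'   # anchor is responsible for its best-matching object
    if best > ignore_thresh:
        return 'ignore'     # ambiguous: do not learn from this example
    return 'negative'       # predict that there is no object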
Thanks!
In general, bounding boxes for objects are given by tuples of the form (x0, y0, x1, y1), where (x0, y0) are the coordinates of the lower-left corner and (x1, y1) are the coordinates of the upper-right corner.
You need to extract the width and height from these coordinates and normalize the data with respect to the image width and height.
Metric for k-means:
Euclidean distance
IoU (Jaccard index)
IoU turns out to be better than the former.
Jaccard index = (intersection between the selected box and the cluster-head box) / (union between the selected box and the cluster-head box)
At initialization we can choose k random boxes as our cluster heads, assign anchor boxes to their respective clusters based on the IoU value (above a threshold), and calculate the mean IoU of each cluster.
This process can be repeated until convergence; a sketch follows below.
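A small sketch of that procedure in Python, clustering normalized (width, height) pairs with 1 - IoU as the distance (a common choice, but an assumption here; empty clusters are not handled):

import numpy as np

def iou_wh(wh, clusters):
    # IoU between one (width, height) box and every cluster head, all anchored at the origin
    inter = np.minimum(wh[0], clusters[:, 0]) * np.minimum(wh[1], clusters[:, 1])
    union = wh[0] * wh[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    clusters = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]  # k random boxes as cluster heads
    for _ in range(n_iters):
        # Assign each box to the cluster head with the highest IoU (lowest 1 - IoU distance)
        labels = np.array([np.argmax(iou_wh(wh, clusters)) for wh in boxes_wh])
        new_clusters = np.array([boxes_wh[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_clusters, clusters):
            break
        clusters = new_clusters
    return clusters  # the resulting (width, height) pairs are used as anchor boxes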

Input one fixed cluster centroid, find N others (python)

I have a table of shipment destinations in lat, long. I have one fixed origination point (also lat, long). I would like to find other optimal origin locations using clustering. In other words, I want to assign one cluster centroid (and keep it fixed) and find 1, 2, 3, ..., N other cluster centroids. Is this possible with the scikit-learn cluster module?
Rather than recycling clustering for this, treat it as a regular optimization problem. You don't want to "discover structure"; you want to optimize cost.
Beware that the earth is not flat, and Euclidean distance (i.e., k-means) is a bad idea: 1 degree north is approximately the same distance as 1 degree east only at the equator. If your data is, for example, in New York, you have a non-negligible distortion, and your solution will not even be a local optimum.
If you absolutely insist on abusing k-means, it's easy to do:
Choose n-1 centers at random, plus the predefined one.
Then run only one iteration of k-means, replace that center with the desired fixed center again, and repeat with the next iteration.
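A minimal NumPy sketch of that trick (plain Euclidean distance, so the flat-earth caveat above still applies; the function and variable names are made up for illustration):

import numpy as np

def kmeans_with_fixed_center(X, fixed_center, n_free, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # n_free random centers plus the predefined, fixed one
    centers = np.vstack([X[rng.choice(len(X), n_free, replace=False)], fixed_center])
    for _ in range(n_iters):
        # One k-means iteration: assign points to their nearest center, then recompute the means
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(len(centers))])
        # Put the fixed center back before the next iteration
        centers[-1] = fixed_center
    return centers, labels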

Spatial pyramid matching (SPM) for SIFT then input to SVM in C++

I am trying to classify MRI images of brain tumors into benign and malignant using C++ and OpenCV. I plan to use the bag-of-words (BoW) method after clustering SIFT descriptors using k-means. That is, I will represent each image as a histogram, with the whole "codebook"/dictionary on the x-axis and each word's occurrence count in the image on the y-axis. These histograms will then be the input to my SVM (with RBF kernel) classifier.
However, a disadvantage of BoW is that it ignores the spatial information of the descriptors in the image. Someone suggested using SPM instead. I read about it and came across this link giving the following steps:
Compute K visual words from the training set and map each local feature to its visual word.
For each image, initialize K multi-resolution coordinate histograms to zero. Each coordinate histogram consists of L levels, and each level i has 4^i cells that evenly partition the current image.
For each local feature (say its visual word ID is k) in this image, pick out the k-th coordinate histogram, then add one count to each of the L corresponding cells in this histogram, according to the coordinate of the local feature. The L cells are the cells in which the local feature falls at the L different resolutions.
Concatenate the K multi-resolution coordinate histograms to form a final "long" histogram of the image. When concatenating, the k-th histogram is weighted by the probability of the k-th visual word.
To compute the kernel value over two images, sum up all the cells of the intersection of their "long" histograms.
Now, I have the following questions:
What is a coordinate histogram? Doesn't a histogram just show the counts for each grouping on the x-axis? How does it provide information on the coordinates of a point?
How would I compute the probability of the k-th visual word?
What will be the use of the "kernel value" that I get? How will I use it as input to the SVM? If I understand it right, the kernel value is used in the testing phase and not in the training phase? If so, how will I train my SVM?
Or do you think I don't need to burden myself with the spatial info and should just stick with normal BoW for my situation (benign and malignant tumors)?
Someone please help this poor little undergraduate. You'll have my eternal gratitude if you do. If you need any clarifications, please don't hesitate to ask.
Here is the link to the actual paper, http://www.csd.uwo.ca/~olga/Courses/Fall2014/CS9840/Papers/lazebnikcvpr06b.pdf
MATLAB code is provided here http://web.engr.illinois.edu/~slazebni/research/SpatialPyramid.zip
The coordinate histogram (mentioned in your post) is just the histogram computed over a sub-region of the image. These slides explain it visually: http://web.engr.illinois.edu/~slazebni/slides/ima_poster.pdf.
You have multiple histograms here, one for each different region of the image. The probability (or the weight) of the k-th visual word depends on the SIFT points falling in that sub-region.
I think you need to define your pyramid kernel as mentioned in the slides; a sketch follows below.
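To make the kernel idea concrete, here is a hedged Python/NumPy sketch of steps 2-5 and of how the resulting kernel values feed an SVM. It leaves out the per-word probability weighting and the per-level weighting used in the paper, and the dummy data only stands in for real MRI features; it uses scikit-learn for brevity, but LIBSVM, which is usable from C++, also accepts precomputed kernels:

import numpy as np
from sklearn.svm import SVC

def spm_histogram(coords, word_ids, K, L):
    # coords: (N, 2) feature coordinates normalized to [0, 1); word_ids: visual word ID of each feature
    parts = []
    for level in range(L):
        n = 2 ** level                              # level i partitions the image into 4^i cells
        h = np.zeros((K, n, n))
        cells = np.minimum((coords * n).astype(int), n - 1)
        for (cx, cy), w in zip(cells, word_ids):
            h[w, cy, cx] += 1                       # one count in the cell this feature falls into
        parts.append(h.ravel())
    return np.concatenate(parts)                    # the "long" histogram of the image

def intersection_kernel(h1, h2):
    # Kernel value for two images: sum of the element-wise minima of their long histograms
    return np.minimum(h1, h2).sum()

def gram(A, B):
    # Pairwise kernel matrix between two sets of long histograms
    return np.array([[intersection_kernel(a, b) for b in B] for a in A])

# Dummy data standing in for real images, just to show the SVM plumbing
rng = np.random.default_rng(0)
K, L = 50, 3
train_hists = np.array([spm_histogram(rng.random((200, 2)), rng.integers(0, K, 200), K, L) for _ in range(20)])
train_labels = rng.integers(0, 2, 20)               # 0 = benign, 1 = malignant (made up)
test_hists = np.array([spm_histogram(rng.random((200, 2)), rng.integers(0, K, 200), K, L) for _ in range(5)])

svm = SVC(kernel='precomputed')
svm.fit(gram(train_hists, train_hists), train_labels)     # training uses the train-vs-train kernel matrix
pred = svm.predict(gram(test_hists, train_hists))         # testing uses the test-vs-train kernel matrix

So the kernel values are used in both phases: the SVM is trained on the full train-vs-train kernel matrix, and at test time you compute the kernel between each test image and every training image.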
A Convolutional Neural Network may be better suited for your task if you have enough training samples. You can probably have a look at Torch or Caffe.

Estimating/Choosing optimal Hyperparameters for DBSCAN

I need to find naturally occurring classes of nouns based on their distribution with different prepositions (agentive, instrumental, time, place, etc.). I tried using k-means clustering, but it was of little help: it didn't work well, and there was a lot of overlap between the classes I was looking for (probably because of the non-globular shape of the classes and the random initialisation in k-means).
I am now working with DBSCAN, but I have trouble understanding the epsilon value and the minimum-points (minPts) value of this clustering algorithm. Can I use random values, or do I need to compute them? Can anybody help, particularly with epsilon, at least how to compute it if I need to?
Use your domain knowledge to choose the parameters. Epsilon is a radius; you can roughly think of it as a minimum cluster size.
Obviously, random values won't work very well. As a heuristic, you can try to look at a k-distance plot, but it's not automatic either.
The first thing to do either way is to choose a good distance function for your data, and to perform appropriate normalization.
As for minPts, it again depends on your data and needs. One user may want a very different value than another. And of course, minPts and epsilon are coupled: if you double epsilon, you will roughly need to multiply your minPts by 2^d (for Euclidean distance, because that is how the volume of a hypersphere increases!).
If you want lots of small, finely detailed clusters, choose a low minPts. If you want larger and fewer clusters (and more noise), use a larger minPts. If you don't want any clusters at all, choose minPts larger than your data set size...
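As a small illustration of how these two parameters map onto scikit-learn's DBSCAN (the eps/min_samples values and the toy data below are made up, not a recommendation):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)  # toy 2-D data
labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(X)   # eps is the radius, min_samples is minPts
n_noise = (labels == -1).sum()                           # DBSCAN marks noise points with the label -1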
It is highly important to select the hyperparameters of the DBSCAN algorithm correctly for your dataset and the domain to which it belongs.
eps hyperparameter
In order to determine the best value of eps for your dataset, use the K-Nearest Neighbours approach as explained in these two papers: Sander et al. 1998 and Schubert et al. 2017 (both papers from the original DBSCAN authors).
Here's a condensed version of their approach:
If you have N-dimensional data to begin with, then choose n_neighbors in sklearn.neighbors.NearestNeighbors to be equal to 2xN - 1, and find out the distances of the K-nearest neighbors (K being 2xN - 1) for each point in your dataset. Sort these distances and plot them to find the "elbow" which separates noisy points (with high K-nearest-neighbor distance) from points (with relatively low K-nearest-neighbor distance) which will most likely fall into a cluster. The distance at which this "elbow" occurs is your optimal value of eps.
Here's some python code to illustrate how to do this:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def get_kdist_plot(X=None, k=None, radius_nbrs=1.0):
    nbrs = NearestNeighbors(n_neighbors=k, radius=radius_nbrs).fit(X)

    # For each point, compute distances to its k-nearest neighbors
    distances, indices = nbrs.kneighbors(X)

    distances = np.sort(distances, axis=0)
    distances = distances[:, k-1]

    # Plot the sorted K-nearest neighbor distance for each point in the dataset
    plt.figure(figsize=(8, 8))
    plt.plot(distances)
    plt.xlabel('Points/Objects in the dataset', fontsize=12)
    plt.ylabel('Sorted {}-nearest neighbor distance'.format(k), fontsize=12)
    plt.grid(True, linestyle="--", color='black', alpha=0.4)
    plt.show()
    plt.close()

k = 2 * X.shape[-1] - 1  # k = 2 * dim(dataset) - 1
get_kdist_plot(X=X, k=k)
In an example plot produced by the code above (image not reproduced here), the elbow occurred at a distance of around 22, so the optimal value for eps could be assumed to be around 22 for that dataset.
NOTE: I would strongly advise the reader to refer to the two papers cited above (especially Schubert et al. 2017) for additional tips on how to avoid several common pitfalls when using DBSCAN, as well as other clustering algorithms.
There are a few articles online, such as DBSCAN Python Example: The Optimal Value For Epsilon (EPS) and CoronaVirus Pandemic and Google Mobility Trend EDA, which basically use the same approach but fail to mention the crucial choice of K (n_neighbors) as 2xN - 1 when performing the above procedure.
min_samples hyperparameter
As for the min_samples hyperparameter, I agree with the suggestions in the accepted answer. Also, a general guideline for choosing this hyperparameter's optimal value is that it should be set to twice the number of features (Sander et al. 1998). For instance, if each point in the dataset has 10 features, a starting point to consider for min_samples would be 20.
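A brief hedged sketch combining the two guidelines, with X being the same data matrix as in the k-distance plot code above and eps = 22 simply carried over from that example plot:

from sklearn.cluster import DBSCAN

eps = 22.0                        # read off the "elbow" of your own k-distance plot
min_samples = 2 * X.shape[-1]     # twice the number of features (Sander et al. 1998)
labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)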

Is there a way to identify regions that are not very similar from a set of images?

Given an image, I would like to extract several subimages from it, but the resulting subimages must not be overly similar to each other. If the center of each ROI is chosen randomly, then we must make sure that each subimage shares at most a small percentage of its area with the other subimages.
Alternatively, we could decompose the image into small regions over a regular grid and then randomly choose a subimage within each region. This option, however, does not ensure that all subimages are sufficiently different from each other. Obviously I have to choose a good way to compare the resulting subimages, as well as a similarity threshold.
The above procedure must be performed on many images: none of the extracted subimages should be too similar. Is there a way to identify regions that are not very similar across a set of images (for example by inspecting all the histograms)?
One possible way is to split your image into n x n squares (edge cases aside), as you pointed out, reduce each of them to a single value, and group them according to their k-nearest values (relative to the other pieces). After you group them, you can select, for example, one image from each group. Something potentially better is to use a more relevant metric inside each group; see Comparing image in url to image in filesystem in python for two such metrics. Using such a metric, you can select more than one piece from each group.
Here is an example using a duck picture I found online. It uses n = 128. To reduce each piece to a single number, it calculates the Euclidean distance to a pure black piece of n x n.
f = Import["http://fohn.net/duck-pictures-facts/mallard-duck.jpg"];
pieces = Flatten[ImagePartition[ColorConvert[f, "Grayscale"], 128]];
black = Image[ConstantArray[0, {128, 128}]];
dist = Map[ImageDistance[#, black, DistanceFunction -> EuclideanDistance] &, pieces];
nf = Nearest[dist -> pieces];
Then we can see the grouping by considering k = 2:
GraphPlot[
  Flatten[Table[Thread[pieces[[i]] -> nf[dist[[i]], 2]], {i, Length[pieces]}]],
  VertexRenderingFunction -> (Inset[#2, #, Center, .4] &),
  SelfLoopStyle -> None]
Now you could use a metric (better than the distance to black) inside each of these groups to select the pieces you want from there.
Since you would like to apply this to a large number of images, and you already suggested it, let's discuss how to solve this problem by selecting different tiles.
The first step could be to define what "similar" is, so a similarity metric is needed. You already mentioned the tiles' histogram as one source of metric, but there may be many more, for example:
mean intensity,
90th percentile of intensity,
10th percentile of intensity,
mode of intensity, as in peak of the histogram,
variance of pixel intensity in the whole tile,
granularity, which you could quickly approximate by the difference between the raw and the Gaussian-filtered image, or by calculating the average variance in small sub-tiles.
If your image has two channels, the above list leaves you already with 12 metric components. Moreover, there are characteristics that you can obtain from the combination of channels, for example the correlation of pixel intensities between channels. With two channels that's only one characteristic, but with three channels it's already three.
To pick different tiles from this high-dimensional cloud, you could consider that some if not many of these metrics will be correlated, so a principal component analysis (PCA) would be a good first step. http://en.wikipedia.org/wiki/Principal_component_analysis
Then, depending on how many sample tiles you would like to choose, you could look at the projection. For seven tiles, for example, I would look at the first three principal components, choose from the two extremes of each, and then also pick the one tile closest to the center (3 * 2 + 1 = 7).
If you are concerned that choosing from the very extremes of each principal component may not be robust, the 10th and 90th percentiles may be. Alternatively, you could use a clustering algorithm to find well-separated examples, but this would depend on what your cloud looks like. Good luck.
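A hedged NumPy/scikit-learn sketch of that selection scheme, assuming the per-tile metrics listed above have already been computed into a feature matrix (the function name and the choice of exact extremes are illustrative):

import numpy as np
from sklearn.decomposition import PCA

def pick_diverse_tiles(tile_features):
    # tile_features: (n_tiles, n_metrics) matrix of per-tile metrics (mean intensity, percentiles, ...)
    proj = PCA(n_components=3).fit_transform(tile_features)
    picked = set()
    for c in range(3):
        picked.add(int(np.argmin(proj[:, c])))   # one extreme of each principal component
        picked.add(int(np.argmax(proj[:, c])))   # the other extreme
    # Plus the tile closest to the centre of the cloud: 3 * 2 + 1 = 7 tiles in total
    picked.add(int(np.argmin(np.linalg.norm(proj, axis=1))))
    return sorted(picked)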