Generating anchor boxes using K-means clustering, YOLO - computer-vision

I am trying to understand how YOLO works and how it detects objects in an image. My question is, what role does k-means clustering play in determining the bounding box around an object? Thanks.

K-means clustering is a very well-known algorithm in data science. It aims to partition n observations into k clusters, and mainly consists of:
Initialization: k centroids are generated at random.
Assignment: clusters are formed by associating each observation with its nearest centroid.
Update: the centroid of each newly formed cluster is recomputed as the mean of its members.
The assignment and update steps are repeated until convergence.
The final result is that the sum of squared errors between points and their respective centroids is minimized.
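For concreteness, here is a minimal NumPy sketch of those three steps (a toy illustration, not YOLO's actual implementation; the function and variable names are my own):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: k centroids picked at random from the observations.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment: each observation joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # convergence
            break
        centroids = new
    return centroids, labels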
EDIT:
Why use k-means?
K-means is computationally faster and more efficient than most other unsupervised learning algorithms; its time complexity is linear in the number of observations.
It produces tighter clusters than hierarchical clustering, and a larger number of clusters helps produce a more accurate end result.
Instances can change clusters (move to another cluster) when the centroids are recomputed.
It works well even if some of its assumptions are broken.
What it really does in determining anchor boxes
It creates thousands of anchor boxes (i.e., clusters in k-means) for each predictor, representing shape, location, size, etc.
For each anchor box, calculate which object's bounding box has the highest Intersection over Union (IoU), i.e., the area of overlap divided by the area of union.
If the highest IoU is greater than 50% (this can be customized), tell the anchor box that it should detect the object with which it has the highest IoU.
Otherwise, if the IoU is greater than 40%, tell the neural network that the true detection is ambiguous and not to learn from that example.
If the highest IoU is less than 40%, then the anchor box should predict that there is no object. A sketch of this assignment rule is given below.
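To make the IoU test and the 50%/40% rule concrete, here is a small Python sketch; the (x0, y0, x1, y1) corner format and the threshold values mirror the description above, while the function names are hypothetical:

def iou(box_a, box_b):
    # Boxes as (x0, y0, x1, y1); IoU = intersection area / union area.
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def assign(anchor, gt_boxes, pos=0.5, ignore=0.4):
    # Returns 'positive', 'ignore', or 'negative' for one anchor.
    best = max(iou(anchor, gt) for gt in gt_boxes)
    if best > pos:
        return 'positive'   # the anchor should detect its best-matching object
    if best > ignore:
        return 'ignore'     # ambiguous; excluded from the loss
    return 'negative'       # the anchor should predict "no object"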
Thanks!

In general, bounding boxes for objects are given by tuples of the form
(x0, y0, x1, y1), where (x0, y0) are the coordinates of the lower left corner and (x1, y1) are the coordinates of the upper right corner.
We need to extract the width and height from these coordinates and normalize them with respect to the image width and height.
Metrics for k-means:
Euclidean distance
IoU (Jaccard index)
IoU turns out to work better than Euclidean distance.
Jaccard index = (intersection between the selected box and the cluster-head box) / (union between the selected box and the cluster-head box)
At initialization we can choose k random boxes as our cluster heads, assign each box to the cluster head with which it has the highest IoU, and calculate the mean IoU of each cluster.
This process can be repeated until convergence; a sketch is given below.
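Putting this together, a sketch of the IoU-based k-means might look like the following. It assumes each box has already been reduced to a normalized (width, height) pair as described above, so every box is treated as if anchored at the origin and the IoU has a closed form; the names are mine:

import numpy as np

def wh_iou(wh, heads):
    # IoU between (n, 2) boxes and (k, 2) cluster heads, all corner-aligned.
    inter = (np.minimum(wh[:, None, 0], heads[None, :, 0]) *
             np.minimum(wh[:, None, 1], heads[None, :, 1]))
    union = wh[:, 0:1] * wh[:, 1:2] + (heads[:, 0] * heads[:, 1])[None, :] - inter
    return inter / union

def anchor_kmeans(wh, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    heads = wh[rng.choice(len(wh), size=k, replace=False)]  # k random boxes as cluster heads
    for _ in range(iters):
        labels = wh_iou(wh, heads).argmax(axis=1)           # assign by highest IoU
        new = np.array([wh[labels == j].mean(axis=0) if np.any(labels == j)
                        else heads[j] for j in range(k)])
        if np.allclose(new, heads):                         # convergence
            break
        heads = new
    return heads                                            # the anchor box shapes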

Related

Is there a way to convert an undirected graph to an (x,y) coordinate system?

For a project I am working on, I have some txt files that contain from-ids, to-ids, and weights. The ids identify each vertex and the weight represents the distance between the vertices. The graph is undirected and completely connected, and I am using C++ and openFrameworks. How can I translate this data into (x,y) coordinate points for a graph that is 1920x1080, while maintaining the weights specified in the text files?
You can only do this if the dimension of the graph is 2 or less.
You therefore need to determine the dimension of the graph--this is a measure of its connectivity.
If the dimension is 2 or less, then you will always be able to plot the graph on a Euclidean plane while preserving relative edge lengths, as long as you allow the edges to intersect. If you prohibit intersecting edges, then you can only plot the graph on a Euclidean plane if the ratio of cycle size to density of cycles in the graph is sufficiently low throughout the graph (quite hard to compute). I can tell you how to plot the trickiest bit--cycles--and give you a general approach, but you actually have some freedom in how you plot this, so please, drop a comment or edit the question if you have a more specific request.
If you don't know whether you have cycles, then find out! Efficient cycle-detection algorithms exist (a depth-first search that flags back edges is enough).
Plotting cycles. Give the first node in the cycle arbitrary coordinates.
Give the second node in the cycle coordinates bounded by the distance from the first node.
For example, if you plot the first node at (0,0) and the edge between the first and second nodes has length 1, then plot the second node at (1, 0).
Now it gets tricky because you need to calculate angles.
Count up the number n of nodes in the cycle.
In order for the cycle to form a polygon, the sum of the angles formed by the cycle must be (n - 2) * 180 degrees, where n is the number of sides (i.e., the number of nodes).
Now you won't have a regular polygon unless all the edge lengths are the same, so you can't just use (n - 2) / n * 180 degrees as your angle. If you make the angles too sharp, then the edges will be forced to intersect; and if you make them too large, then you will not be able to close the polygon. Compute the internal angles as explained on StackExchange Math.
Repeat this process to plot every cycle in the graph, in arbitrary positions if necessary; you can always translate the cycles later. A sketch of one way to place a single cycle is given below.
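As an illustration, here is a hedged Python sketch of one way to place a single cycle's nodes while preserving edge lengths: treat the cycle as a cyclic polygon (all vertices on one circle) and bisect for the circumradius. It assumes the edge lengths actually admit such a polygon (roughly, that no single edge dominates the others); the function name is my own:

import math

def plot_cycle(lengths):
    # Bisect for the circumradius R at which the chords' central
    # angles, 2 * asin(L / (2R)), sum to exactly 2 * pi.
    lo = max(lengths) / 2.0   # R cannot be smaller than half the longest edge
    hi = sum(lengths)         # generous upper bound; angles shrink as R grows

    def angle_sum(R):
        return sum(2.0 * math.asin(L / (2.0 * R)) for L in lengths)

    for _ in range(100):
        mid = (lo + hi) / 2.0
        if angle_sum(mid) > 2.0 * math.pi:
            lo = mid          # R too small, angles too large
        else:
            hi = mid
    R = (lo + hi) / 2.0

    # Walk around the circle, emitting one (x, y) per node.
    coords, theta = [], 0.0
    for L in lengths:
        coords.append((R * math.cos(theta), R * math.sin(theta)))
        theta += 2.0 * math.asin(L / (2.0 * R))
    return coords

print(plot_cycle([1.0, 1.0, 1.0, 1.0]))  # a unit square, up to rotation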
Plotting everything else. My naive, undoubtedly inefficient approach is to traverse each node in each cycle and plot the adjacent nodes (the 'branches') layer by layer. Then rotate and translate entire cycles (including connected branches) to ensure every edge can reach both of its nodes. Finally, if possible, rotate branches (relative to their connected cycles) as needed to resolve intersecting edges.
Again, please modify the question or drop a comment if you are looking for something more specific. A full algorithm (or full code, in any common language) would make a very long answer.

Clustering Points Algorithm

I've applied three different methods of getting sets of points, as follows.
Every method produces a vector of Points. Each method is shown in a different color: red, blue, and green.
Here is the combined image, overlaying all 3 sets of points.
As you can see in the combined image, there are spots that all three sets "agree" on (i.e., that are generally in the exact same spot). I would like to find these particular spots and combine them into a single coordinate. I'm not sure where to start with this problem. I've looked into K-means clustering, but it seems the problem is that K-means will cluster all the points and take the average with surrounding points, shifting the cluster center from the original position. I could loop through all the points in all the vectors that store them, but as these images get larger with more points, that becomes very costly and inefficient.
Does anybody have any tips on how to approach this problem? I've been using OpenCV with C++.
Notionally, what you want to do is consider the complete tripartite graph on the three sets of points with edges weighted by distance. Then select edges in order of weight until a triangle appears; call those points a corresponding set, choose (say) their centroid to represent them, and remove them from the graph. Stop when the edge length exceeds some tolerance.
The mathematical justification for this approach is that it is independent of point ordering (except in the unlikely case of problematic ties in distances between points).
The practical implementation of this algorithm (for a significant number of points) involves a search data structure that can quickly find nearby points (not just the nearest): bins of the threshold size, a quad trie, or a k-d tree would work. Probably you would create one for each point set and use the other sets’ points as query points.
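Here is a small brute-force Python sketch of that procedure (O(n^2) edge enumeration; for larger point sets you would generate candidate edges with one of the search structures mentioned above instead; all names are mine):

import itertools
import numpy as np

def match_triples(red, green, blue, tol):
    # Enumerate cross-set edges no longer than tol, take them in order of
    # length, and emit a centroid whenever the selected edges close a triangle.
    sets = [np.asarray(red, float), np.asarray(green, float), np.asarray(blue, float)]
    edges = []
    for a, b in itertools.combinations(range(3), 2):
        for i, p in enumerate(sets[a]):
            for j, q in enumerate(sets[b]):
                d = float(np.linalg.norm(p - q))
                if d <= tol:
                    edges.append((d, a, i, b, j))
    edges.sort()

    used = [set(), set(), set()]
    adj = {}        # (set index, point index) -> set of selected neighbours
    centroids = []
    for d, a, i, b, j in edges:
        if i in used[a] or j in used[b]:
            continue
        adj.setdefault((a, i), set()).add((b, j))
        adj.setdefault((b, j), set()).add((a, i))
        # A triangle appears when both endpoints share an unused neighbour.
        common = {n for n in adj[(a, i)] & adj[(b, j)] if n[1] not in used[n[0]]}
        if common:
            tri = [(a, i), (b, j), next(iter(common))]
            pts = np.array([sets[s][k] for s, k in tri])
            centroids.append(pts.mean(axis=0))  # one coordinate per corresponding set
            for s, k in tri:
                used[s].add(k)
    return centroids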

Conceptual Question Regarding the Yolo Object Detection Algorithm

My understanding is that the motivation for Anchor Boxes (in the Yolo v2 algorithm) is that in the first version of Yolo (Yolo v1) it is not possible to detect multiple objects in the same grid cell. I don't understand why this is the case.
Also, the original paper by the authors (Yolo v1) has the following quote:
"Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts."
Doesn't this indicate that a grid cell can recognize more than one object? In their paper, they take B as 2. Why not take B as some arbitrarily higher number, say 10?
Second question: how are the Anchor Box dimensions tied to the Bounding Box dimensions, for detecting a particular object? Some websites say that the Anchor Box defines a shape only, and others say that it defines a shape and a size. In either case, how is the Anchor Box tied to the Bounding Box?
Thanks,
Sandeep
You're right that YOLOv1 has multiple (B) bounding boxes, but these are not assigned to ground truths in an effective or systematic way, and consequently the inferred bounding boxes are not accurate enough.
As you can read in blog posts around the internet, an Anchor/Default Box is a box in the original image which corresponds to a specific cell in a specific feature map, and which is assigned a specific aspect ratio and scale.
The scale is usually dictated by the feature map (deeper feature map -> larger anchor scale), and the aspect ratios vary, e.g. {1:1, 1:2, 2:1} or {1:1, 1:2, 2:1, 1:3, 3:1}.
The scale and aspect ratio together dictate a specific shape, and this shape, positioned according to the current cell in the feature map, is compared to the ground truth bounding boxes in the original image.
Different papers have different assignment schemes, but it usually goes like this: (1) if the IoU of the anchor on the original image with a ground truth (GT) box is over some threshold (e.g. 0.5), then this is a positive assignment to the anchor; (2) if it's under some threshold (e.g. 0.1), then it's a negative assignment; and (3) if it falls between these two thresholds, the anchor is ignored (in the loss computation).
This way, an anchor is in fact like a "detector head" responsible for the specific cases that are most similar to it shape-wise. It is therefore responsible for detecting objects with shapes similar to its own, and it infers both a confidence for each class and bounding box parameters relative to itself, i.e. how much to modify the anchor's height, width, and center (in the two axes) to arrive at the correct bounding box; a sketch of this decoding is given below.
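For intuition, here is a hedged sketch of that decoding in the style of YOLOv2's parameterization (the exact equations vary between papers, and the argument names here are illustrative assumptions):

import numpy as np

def decode_box(t, anchor_w, anchor_h, cell_x, cell_y, stride):
    # t = (tx, ty, tw, th): raw network outputs for one anchor in one cell.
    tx, ty, tw, th = t
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    bx = (cell_x + sigmoid(tx)) * stride   # center x, offset within the cell
    by = (cell_y + sigmoid(ty)) * stride   # center y, offset within the cell
    bw = anchor_w * np.exp(tw)             # anchor width, scaled
    bh = anchor_h * np.exp(th)             # anchor height, scaled
    return bx, by, bw, bh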
Because of this assignment scheme, which distributes the responsibility effectively between the different anchors, the bounding box prediction is more accurate.
Another downside of YOLOv1's scheme is that it decouples bounding box regression and classification. On one hand this saves computation, but on the other hand the classification is at the level of the grid cell, so the B bounding box options all share the same class prediction. This means, for example, that if there are multiple objects of different classes with the same center (e.g. a person holding a cat), then the classification of all but at most one of them will be wrong. Note that it is theoretically possible that predictions from adjacent grid cells will compensate for this wrong classification, but this is not guaranteed, particularly since in YOLOv1's scheme the center is the assignment criterion.

Silhouette value increasing as the number of clusters increases

I have a matrix in which the rows are brands and the columns are the features of each brand.
First, I calculate the affinity matrix with scikit-learn and then apply spectral clustering on the affinity matrix to do the clustering.
When I calculate the silhouette value for each number of clusters, the silhouette value increases as the number of clusters increases.
In the end, when the number of clusters gets bigger and bigger, calculating the silhouette value gives a NaN result:
# coding: utf-8
import pandas as pd
import sklearn.cluster as sk
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

data_event = pd.read_csv(r'\Data\data_of_events.csv', header=0, index_col=0, parse_dates=True)
data_event_matrix = data_event[['Furniture', 'Food & Drinks', 'Technology', 'Architecture', 'Show',
                                'Fashion', 'Travel', 'Art', 'Graphics', 'Product Design']].to_numpy()

# compute the affinity matrix
data_event_affinitymatrix = SpectralClustering().fit(data_event_matrix).affinity_matrix_

# clustering
for n_clusters in range(2, 100, 2):
    labels = sk.spectral_clustering(data_event_affinitymatrix, n_clusters=n_clusters,
                                    n_components=None, eigen_solver=None, random_state=None,
                                    n_init=10, eigen_tol=0.0, assign_labels='kmeans')
    silhouette_avg = silhouette_score(data_event_affinitymatrix, labels)
    print("For n_clusters =", n_clusters, "the average silhouette_score of event clustering is:", silhouette_avg)
If your intention is to find the optimal number of clusters, then you can try the elbow method. Multiple variations of this method exist, but the main idea is that for different values of K (the number of clusters) you compute the cost function most appropriate for your application (for example, the sum of squared distances of all points in a cluster to its centroid, for values of K from, say, 1 to 8, or any other error/cost/variance function). If your metric is a distance function, then beyond a certain number of clusters you will notice that the difference in values along the y-axis becomes negligible. Plotting the number of clusters along the x-axis and your metric along the y-axis, you choose the value of K at the point where the y-axis values stop changing abruptly.
In the example elbow plot (image source: Wikipedia), the optimal value of K is 4.
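A minimal sketch of the elbow method, assuming a feature matrix X (such as the data_event_matrix above) and using k-means inertia as the cost function:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_max=10):
    inertias = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias.append(km.inertia_)  # sum of squared distances to centroids
    plt.plot(range(1, k_max + 1), inertias, marker='o')
    plt.xlabel('number of clusters k')
    plt.ylabel('within-cluster sum of squares')
    plt.show()  # pick k where the curve bends (the "elbow")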
Another measure you can use to validate your clusters is the V-measure score. It is a symmetric measure that is often used when the ground truth is unknown. It is defined as the harmonic mean of homogeneity and completeness; scikit-learn ships an implementation you can use for reference.
EDIT: The V-measure is basically used to compare two different cluster assignments to each other.
Finally, if you are interested, you can take a look at the Normalized Mutual Information score to validate your results as well.
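For example (with hypothetical label vectors), both scores treat a relabelled but otherwise identical partition as a perfect match:

from sklearn.metrics import v_measure_score, normalized_mutual_info_score

labels_a = [0, 0, 1, 1, 2, 2]   # one cluster assignment
labels_b = [1, 1, 0, 0, 2, 2]   # the same partition with permuted labels
print(v_measure_score(labels_a, labels_b))               # 1.0
print(normalized_mutual_info_score(labels_a, labels_b))  # 1.0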
References:
Biclustering in scikit-learn
Elbow method (Coursera)
Research paper on V-measure
Choosing the right number of clusters
Update: I recently came across Self-Tuning Spectral Clustering; you can give it a try.

Is there a way to identify regions that are not very similar from a set of images?

Given an image, I would like to extract multiple subimages from it, but the resulting subimages must not be overly similar to each other. If the center of each ROI is chosen randomly, then we must make sure that each subimage has at most a small percentage of area in common with the other subimages.
Alternatively, we could decompose the image into small regions over a regular grid and randomly choose a subimage within each region. This option, however, does not ensure that all subimages are sufficiently different from each other. Obviously I have to choose a good way to compare the resulting subimages, and also a similarity threshold.
The above procedure must be performed on many images: none of the extracted subimages should be too similar. Is there a way to identify regions that are not very similar from a set of images (e.g., by inspecting all histograms)?
One possible way is to split your image into n x n squares (save edge cases) as you pointed out, reduce each of them to a single value, and group them according to the k-nearest values among the other pieces. After you group them, you can select, for example, one image from each group. Something potentially better is to use a more relevant metric inside each group; see Comparing image in url to image in filesystem in python for two such metrics. With such a metric, you can select more than one piece from each group.
Here is an example using a duck picture I found online. It uses n = 128. To reduce each piece to a single number, it calculates the Euclidean distance to a pure black piece of n x n.
f = Import["http://fohn.net/duck-pictures-facts/mallard-duck.jpg"];
pieces = Flatten[ImagePartition[ColorConvert[f, "Grayscale"], 128]]
black = Image[ConstantArray[0, {128, 128}]];
dist = Map[ImageDistance[#, black, DistanceFunction -> EuclideanDistance] &, pieces];
nf = Nearest[dist -> pieces];
Then we can see the grouping by considering k = 2:
GraphPlot[
Flatten[Table[
Thread[pieces[[i]] -> nf[dist[[i]], 2]], {i, Length[pieces]}]],
VertexRenderingFunction -> (Inset[#2, #, Center, .4] &),
SelfLoopStyle -> None]
Now you could use a metric (better than the distance to black) inside each of these groups to select the pieces you want from there.
Since you would like to apply this to a large number of images, and you already suggested it, let's discuss how to solve this problem by selecting different tiles.
The first step could be to define what "similar" means, so a similarity metric is needed. You already mentioned the tiles' histogram as one source of metrics, but there may be many more, for example:
mean intensity,
90th percentile of intensity,
10th percentile of intensity,
mode of intensity, as in peak of the histogram,
variance of pixel intensity in the whole tile,
granularity, which you could quickly approximate by the difference between the raw and the Gaussian-filtered image, or by calculating the average variance in small sub-tiles.
If your image has two channels, the above list leaves you already with 12 metric components. Moreover, there are characteristics that you can obtain from the combination of channels, for example the correlation of pixel intensities between channels. With two channels that's only one characteristic, but with three channels it's already three.
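A sketch of how those per-tile metrics could be computed for one channel (tile is a 2-D array of intensities; the granularity approximation follows the Gaussian-filter idea above, and the function name is mine):

import numpy as np
from scipy.ndimage import gaussian_filter

def tile_metrics(tile):
    tile = np.asarray(tile, dtype=float)
    hist, edges = np.histogram(tile, bins=64)
    return {
        'mean': tile.mean(),
        'p90': np.percentile(tile, 90),
        'p10': np.percentile(tile, 10),
        'mode': edges[np.argmax(hist)],  # left edge of the peak histogram bin
        'variance': tile.var(),
        'granularity': np.abs(tile - gaussian_filter(tile, sigma=2)).mean(),
    }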
To pick different tiles from this high-dimensional cloud, you could consider that some if not many of these metrics will be correlated, so a principal component analysis (PCA) would be a good first step. http://en.wikipedia.org/wiki/Principal_component_analysis
Then, depending on how many sample tiles you would like to choose, you could look at the projection. For seven tiles, for example, I would look at the first three principal components, choose from the two extremes of each, and then also pick the one tile closest to the center (3 * 2 + 1 = 7).
If you are concerned that choosing from the very extremes of each principal component may not be robust, the 10th and 90th percentiles may be. Alternatively, you could use a clustering algorithm to find separated examples, but this would depend on what your cloud looks like. A sketch of this selection rule is given below. Good luck.
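A sketch of that selection rule (assuming a (num_tiles, num_metrics) feature matrix with one row of metrics per tile; the names are mine):

import numpy as np
from sklearn.decomposition import PCA

def select_tiles(features, n_components=3):
    proj = PCA(n_components=n_components).fit_transform(features)
    picks = set()
    for c in range(n_components):
        picks.add(int(np.argmin(proj[:, c])))  # one extreme of this component
        picks.add(int(np.argmax(proj[:, c])))  # the other extreme
    # ...plus the tile closest to the center of the cloud (3 * 2 + 1 = 7).
    picks.add(int(np.argmin(np.linalg.norm(proj, axis=1))))
    return sorted(picks)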