I want to cluster some (x,y) coordinates based on distance using agglomerative hierarchical clustering, since the number of clusters is not known in advance. Is there any library that supports this task?
I am working in C++ with the OpenCV library.
http://opencv-python-tutroals.readthedocs.org/en/latest/py_tutorials/py_ml/py_kmeans/py_kmeans_opencv/py_kmeans_opencv.html#kmeans-opencv
This is a link to the k-means clustering tutorial in OpenCV for Python.
It shouldn't be too hard to convert this to C++ code once you understand the logic.
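For reference, here is a minimal C++ version of that k-means example applied to a handful of (x,y) points (the coordinates and K below are made up for illustration). Keep in mind that k-means, unlike the hierarchical approach you asked about, needs the number of clusters up front:

    #include <opencv2/core.hpp>
    #include <vector>
    #include <cstdio>

    int main() {
        // Some example (x,y) coordinates (illustrative values only).
        std::vector<cv::Point2f> points = {
            {10.f, 12.f}, {11.f, 13.f}, {80.f, 82.f}, {81.f, 79.f}, {45.f, 40.f}
        };

        cv::Mat data(points);        // N x 1 matrix of type CV_32FC2
        data = data.reshape(1);      // N x 2, CV_32F: one row per point

        const int K = 2;             // k-means needs the cluster count up front
        cv::Mat labels, centers;
        cv::kmeans(data, K, labels,
                   cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 100, 1.0),
                   5, cv::KMEANS_PP_CENTERS, centers);

        for (int i = 0; i < data.rows; ++i)
            std::printf("point %d -> cluster %d\n", i, labels.at<int>(i));
        return 0;
    }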
In the Gesture Recognition Toolkit (GRT) there is a simple module for hierarchical clustering. This is the "bottom up" approach you need, where each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy (a hand-rolled sketch of this procedure is shown after the GRT notes below).
You can train it with any of the following data structures:
UnlabelledData: The only thing you really need to know about the UnlabelledData class is that you must set the number of input dimensions of your dataset before you try and add samples to the training dataset.
ClassificationData: As with UnlabelledData, you must set the number of input dimensions of your dataset before you try to add samples to the training dataset. Also, you can not use the class label 0 when you add a new sample, because the class label 0 is reserved for the special null gesture class.
MatrixDouble: MatrixDouble is the default datatype for storing M by N dimensional data, where M is the number of rows and N is the number of columns.
Furthermore, you can save your models to a file or load them from one, and retrieve the resulting clusters with getClusters().
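If you would rather not add a dependency at all, the bottom-up procedure described above is easy to hand-roll for 2-D points. Below is a minimal, unoptimised single-linkage sketch (plain C++, not GRT's API) that keeps merging the two closest clusters until the closest pair is farther apart than a distance threshold, so no cluster count is needed:

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <limits>
    #include <vector>

    struct Point { double x, y; };

    // Single-linkage distance between two clusters (vectors of point indices).
    static double clusterDist(const std::vector<Point>& pts,
                              const std::vector<int>& a,
                              const std::vector<int>& b) {
        double best = std::numeric_limits<double>::max();
        for (int i : a)
            for (int j : b)
                best = std::min(best, std::hypot(pts[i].x - pts[j].x,
                                                 pts[i].y - pts[j].y));
        return best;
    }

    // Agglomerative (bottom-up) clustering: start with one cluster per point and
    // repeatedly merge the two closest clusters until they are all farther apart
    // than maxDist.
    std::vector<std::vector<int>> agglomerate(const std::vector<Point>& pts,
                                              double maxDist) {
        std::vector<std::vector<int>> clusters;
        for (int i = 0; i < (int)pts.size(); ++i) clusters.push_back({i});

        while (clusters.size() > 1) {
            double best = std::numeric_limits<double>::max();
            size_t bi = 0, bj = 0;
            for (size_t i = 0; i < clusters.size(); ++i)
                for (size_t j = i + 1; j < clusters.size(); ++j) {
                    double d = clusterDist(pts, clusters[i], clusters[j]);
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            if (best > maxDist) break;   // remaining clusters are far enough apart
            clusters[bi].insert(clusters[bi].end(),
                                clusters[bj].begin(), clusters[bj].end());
            clusters.erase(clusters.begin() + bj);
        }
        return clusters;
    }

    int main() {
        std::vector<Point> pts = {{1,1},{2,1},{1.5,2},{10,10},{10.5,11},{30,1}};
        auto clusters = agglomerate(pts, 3.0);   // threshold chosen for illustration
        for (size_t c = 0; c < clusters.size(); ++c)
            for (int idx : clusters[c])
                std::printf("point %d -> cluster %zu\n", idx, c);
        return 0;
    }

This naive version is only suitable for small point sets; for larger data, cache the pairwise distances or use a library implementation.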
I'm not sure whether this is the right forum for this question; if not, I apologize.
I'm quite new to the Bag of Features (BoF) model, and I'm trying to implement it in order to represent an image as a vector (for a CBIR project).
From what I've understood, given a training set S of n images, and supposing that we want to represent an image through a vector of size k, these are the steps for implementing BoF (a code sketch of the vocabulary-building steps follows the list):
1. For each image i, compute its set of keypoints and, from them, its set of descriptors D_i.
2. Pool the descriptor sets of all the images together, giving D.
3. Run k-means (with the k defined above) on D; we now have k clusters, and each descriptor vector belongs to exactly one cluster.
4. Define v_i as the resulting BoF vector (of size k) for image i, with every dimension initialized to 0.
5. For each image i, and for each descriptor d in D_i, find which of the k clusters d belongs to. Supposing d belongs to the j-th cluster, increment v_i[j].
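For concreteness, here is a minimal OpenCV/C++ sketch of steps 1-3 (detect keypoints, pool the descriptors, cluster them into a k-word vocabulary). It assumes OpenCV >= 4.4, where SIFT is available as cv::SIFT (in older builds it lives in the xfeatures2d contrib module), and the image path list is hypothetical:

    #include <opencv2/core.hpp>
    #include <opencv2/features2d.hpp>
    #include <opencv2/imgcodecs.hpp>
    #include <string>
    #include <vector>

    // Build a k-word visual vocabulary from a list of training images.
    cv::Mat buildVocabulary(const std::vector<std::string>& imagePaths, int k) {
        cv::Ptr<cv::Feature2D> sift = cv::SIFT::create();   // OpenCV >= 4.4
        cv::BOWKMeansTrainer trainer(k);                     // runs k-means internally

        for (const std::string& path : imagePaths) {
            cv::Mat img = cv::imread(path, cv::IMREAD_GRAYSCALE);
            if (img.empty()) continue;

            std::vector<cv::KeyPoint> keypoints;
            cv::Mat descriptors;                             // one 128-d SIFT row per keypoint (step 1)
            sift->detectAndCompute(img, cv::noArray(), keypoints, descriptors);
            if (!descriptors.empty())
                trainer.add(descriptors);                    // pool all descriptors into D (step 2)
        }
        return trainer.cluster();                            // k x 128 matrix of cluster centres (step 3)
    }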
What is not clear to me is how to implement point 5: how do we determine which cluster a descriptor belongs to, in particular when the image whose BoF vector we are computing is a query image (and therefore was not part of the initial dataset)? Should we find the nearest neighbor (1-NN) among the cluster centers to decide which cluster the query descriptor belongs to?
WHY I NEED THIS - THE APPLICATION:
I'm implementing the BoF model in order to build a CBIR system: given a query image q, find the image i most similar to q in a dataset of images. To do this, we need to solve the 1-approximate nearest neighbor problem, for example using LSH. The problem is that LSH requires each image to be represented as a vector, so we need the BoF representation to do that! I hope it is now clearer why I need it :)
Please also let me know if I made any mistakes in the procedure described above.
What your algorithm is doing is generating the equivalent of words for an image. The set of "words" is not meant to be a final result, but just something that is easy to use with other machine learning techniques.
In this setup, you generate a set of k clusters from the initial features (the keypoint descriptors from point 1).
Then you describe each image by the number of keypoints that fall in each cluster (just like you have a text composed of words from a dictionary of length k).
Point 3 says that you take all the keypoint descriptors from the training set images and run the k-means algorithm to find a reasonable separation of the points. This essentially establishes what the words are.
So for a new image, you compute its keypoints as you did for the training set, and then, using the clusters you already computed during training, you build the feature vector for the new image. That is, you convert your image into words from the dictionary you've built.
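In code, "figuring out the feature vector" for a new image is exactly the nearest-centroid assignment the question asks about: for each descriptor of the image, find the closest of the k cluster centres and increment that bin. A minimal OpenCV/C++ sketch, where vocabulary is the k x 128 matrix of centres produced by k-means:

    #include <opencv2/core.hpp>
    #include <cfloat>

    // Build the k-bin BoF histogram of an image from its descriptors
    // (one per row) and the vocabulary (one cluster centre per row).
    cv::Mat bofHistogram(const cv::Mat& descriptors, const cv::Mat& vocabulary) {
        cv::Mat hist = cv::Mat::zeros(1, vocabulary.rows, CV_32F);
        for (int d = 0; d < descriptors.rows; ++d) {
            int best = 0;
            double bestDist = DBL_MAX;
            for (int c = 0; c < vocabulary.rows; ++c) {
                double dist = cv::norm(descriptors.row(d), vocabulary.row(c), cv::NORM_L2);
                if (dist < bestDist) { bestDist = dist; best = c; }   // 1-NN over the centres
            }
            hist.at<float>(0, best) += 1.f;                           // v_i[best]++
        }
        // Optionally normalise so images with different keypoint counts are comparable.
        return hist;
    }

This is essentially what OpenCV's BOWImgDescriptorExtractor does for you via a descriptor matcher, so yes, a 1-NN search over the cluster centres is the right way to assign a query descriptor.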
This is all a way to generate a reasonable feature vector from images (a partial result, if you like). It is not a complete machine learning algorithm. To complete it, you need to decide what you want to do. If you just want to find the most similar image(s), then yes, a nearest neighbor search will do that. If you want to label images, then you need to train a classifier (like naive Bayes) on the feature vectors and use it to predict the label of the query.
My objective is to detect text in an image and recognize it.
I have managed to detect the characters using the stroke width transform.
What should I do to recognize them?
From what I understand, I thought of training an SVM on my dataset of letter images in different fonts by detecting feature points and extracting a feature vector from each image (I have used SIFT feature vectors and built the dictionary using k-means clustering).
Once I have detected a character, I will extract its SIFT-based feature vector and feed it into the SVM prediction function.
I don't know how to do the recognition with the SVM. I am confused! Please help me, and correct me wherever I went wrong conceptually.
I followed this tutorial for the recognition part. Is it applicable to recognizing characters?
http://www.codeproject.com/Articles/619039/Bag-of-Features-Descriptor-on-SIFT-Features-with-O
SVM is a supervised classifier. To use it, you will need training data consisting of examples of the kinds of objects you are trying to recognize.
Step 1 - Prepare training data
The training data consists of pairs of feature vectors and their corresponding class labels. In your case, it appears that you have extracted a SIFT-based "bag-of-words" (BOW) feature vector for the characters you detected. So, for your training data, you will need to find many examples of the different characters, extract this feature vector for each of them, and associate it with a class label (typically an integer), which you can then map to a textual description (e.g., the label 0 could be mapped to the character 'a', and so on).
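For illustration, here is a small OpenCV/C++ sketch of assembling such training data; it assumes you already have one BOW feature vector (1 x k, CV_32F) per character image from your pipeline, along with the matching integer labels:

    #include <opencv2/core.hpp>
    #include <vector>

    // Stack per-sample BOW feature vectors (each 1 x k) into the trainData and
    // labels matrices that the SVM expects.
    void prepareTrainingData(const std::vector<cv::Mat>& bowVectors,   // one per character image
                             const std::vector<int>& classLabels,      // e.g. 0 -> 'a', 1 -> 'b', ...
                             cv::Mat& trainData, cv::Mat& labels) {
        trainData.release();
        labels.release();
        for (size_t i = 0; i < bowVectors.size(); ++i) {
            trainData.push_back(bowVectors[i]);   // one feature vector per row
            labels.push_back(classLabels[i]);     // matching integer label per row
        }
        trainData.convertTo(trainData, CV_32F);   // SVM wants CV_32F samples
        labels.convertTo(labels, CV_32S);         // and CV_32S responses
    }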
Step 2 - Training the classifier
The SVM classifier takes as input an array/Mat of feature vectors (one per row) and their associated labels. Tune the parameters of the SVM (i.e., the regularization parameter C and, if applicable, any kernel parameters) on a separate validation set.
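A minimal sketch using OpenCV's ml module (3.x/4.x API); the linear kernel and C = 1.0 are just starting values to tune on your validation set:

    #include <opencv2/core.hpp>
    #include <opencv2/ml.hpp>

    // trainData: N x k CV_32F (one BOW vector per row); labels: N x 1 CV_32S.
    cv::Ptr<cv::ml::SVM> trainCharacterSvm(const cv::Mat& trainData, const cv::Mat& labels) {
        cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
        svm->setType(cv::ml::SVM::C_SVC);       // multi-class classification
        svm->setKernel(cv::ml::SVM::LINEAR);    // start simple: linear kernel
        svm->setC(1.0);                         // regularization parameter to tune
        svm->train(trainData, cv::ml::ROW_SAMPLE, labels);
        svm->save("char_svm.yml");              // optional: reuse without retraining
        return svm;
    }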
Step 3 - Predict for unseen data
At test time, given a sample that the SVM did not see during training, you compute its feature vector (your SIFT-based BOW vector) and pass it to the SVM's predict function, which returns an integer. Remember that when preparing your training data you associated an integer label with each character? This is the label the SVM predicts for the sample, and you can map it back to a character. For example, if you associated 0 with 'a', 1 with 'b', etc., you can use a vector/hashmap to map the integer to its textual counterpart.
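And prediction, including mapping the returned integer back to a character; the labelToChar table is hypothetical and must match whatever ordering you used when labelling the training data:

    #include <opencv2/core.hpp>
    #include <opencv2/ml.hpp>
    #include <vector>

    // bowVector: 1 x k CV_32F feature vector of the detected character.
    char recognizeCharacter(const cv::Ptr<cv::ml::SVM>& svm, const cv::Mat& bowVector) {
        // Same ordering that was used when the training labels were assigned (hypothetical).
        static const std::vector<char> labelToChar = {'a', 'b', 'c' /* ... */};

        int label = static_cast<int>(svm->predict(bowVector));   // predicted class label
        if (label >= 0 && label < (int)labelToChar.size())
            return labelToChar[label];
        return '?';                                               // unknown/invalid label
    }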
Additional Notes
You can check out OpenCV's SVM tutorial for details.
NOTE: Often, for beginners, the hardest part (after getting the data) is tuning the classifier. My advice is to first try a simple classifier with few parameters to tune; a decent choice is the linear SVM, which only requires you to adjust the regularization parameter C. Once you manage to get somewhat decent results (which gives some assurance that the rest of your code is working), you can move on to more "sophisticated" classifiers.
Lastly, the training data and the feature vectors you extract are very important. The training data must be "similar" to the test data you are trying to predict. For example, if you are predicting characters found on road signs, which come with different fonts, lighting conditions, and pose variations, then training data consisting of characters taken from, say, a newspaper or book archive may not give you good results. This is an issue of domain adaptation in machine learning.
Currently I am following the Caffe ImageNet example, but applying it to my own training dataset. My dataset has about 2000 classes with roughly 10-50 images per class. I am classifying vehicle images that were cropped to the front of the vehicle, so the images within each class have the same size and (almost) the same viewing angle.
I've tried the ImageNet schema, but it didn't seem to work well: after about 3000 iterations the accuracy dropped to 0. So I am wondering, is there a practical guide on how to tune the schema?
You can delete the last layer of the ImageNet model, add your own last layer with a different name (sized to your number of classes), give that layer a higher learning rate, and set a lower overall learning rate. There is an official fine-tuning example here: http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html
However, if the accuracy dropped to 0, you should check the model parameters first; it may be a numerical overflow.
Suppose I have a user/item feature matrix in Mahout, I have derived the users' log-likelihood similarity, and I have identified three user clusters. Now I have a new user with a set of items (same format and same set of items). How can I assign the new user to one of these three clusters without recalculating the similarity matrix and repeating the clustering procedure?
The problem is that if I use the current cluster centroids and calculate the log-likelihood similarity, or any other distance measure, the centroids are no longer binary. If I use k-medians, there is a risk of the medians being all zeros. What is a good way to approach this? Is there any model-based clustering that you recommend, especially in Mahout?
How about training classifiers for the clusters?
To avoid the zeros, you could use k-medoids instead. The key difference here is that k-medoids will choose the most central object from your dataset, so it will actually have the same sparsity as your data objects.
As I don't use Mahout, I do not know whether it is available there. As far as I know, it is considerably more computationally intensive than k-means or k-medians.
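To make the assignment step concrete, here is a library-agnostic sketch (written in C++ for brevity rather than Mahout's Java API) of assigning a new binary user vector to the nearest medoid; Tanimoto/Jaccard similarity stands in here for whichever similarity measure you end up using:

    #include <cstddef>
    #include <vector>

    // Tanimoto (Jaccard) similarity between two binary item vectors.
    double tanimoto(const std::vector<int>& a, const std::vector<int>& b) {
        int both = 0, either = 0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            both   += (a[i] && b[i]);
            either += (a[i] || b[i]);
        }
        return either == 0 ? 0.0 : static_cast<double>(both) / either;
    }

    // Assign a new user to the cluster whose medoid is most similar.
    int assignToCluster(const std::vector<int>& newUser,
                        const std::vector<std::vector<int>>& medoids) {
        int best = 0;
        double bestSim = -1.0;
        for (std::size_t c = 0; c < medoids.size(); ++c) {
            double sim = tanimoto(newUser, medoids[c]);
            if (sim > bestSim) { bestSim = sim; best = static_cast<int>(c); }
        }
        return best;   // index of the nearest medoid / cluster
    }

Because the medoids are actual users, they stay binary and sparse, so the same similarity you used for clustering remains meaningful.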
My job is to perform gesture recognition. I want to do that by training a support vector machine on features extracted via PCA (Principal Component Analysis).
But I'm getting a little confused about the procedure.
After going through various articles, I've figured out these steps (sketched in code after the list):
1. Take d images (each n x n) of the same gesture.
2. Convert each n x n image into a single row vector.
3. Form a matrix of size d x (n*n).
4. Compute the eigenvalues and eigenvectors of its covariance matrix.
5. Use the top k eigenvectors to form a subspace.
6. Project each image from the original n*n-dimensional space into the k-dimensional subspace.
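As a rough OpenCV/C++ sketch of steps 2 through 6, assuming the images are already cropped to a common n x n size (variable names are illustrative):

    #include <opencv2/core.hpp>
    #include <vector>

    // images: d gesture images, each n x n, single channel.
    // Fills 'pca' so that new frames can be projected into the same subspace,
    // and returns the d x k matrix of projected training samples.
    cv::Mat pcaProject(const std::vector<cv::Mat>& images, int k, cv::PCA& pca) {
        cv::Mat data;                                  // d x (n*n), one image per row
        for (const cv::Mat& img : images) {
            cv::Mat row;
            img.convertTo(row, CV_32F);
            data.push_back(row.reshape(1, 1));         // flatten n x n into a single row
        }

        // Keep the top-k eigenvectors of the data covariance (the "subspace").
        pca = cv::PCA(data, cv::noArray(), cv::PCA::DATA_AS_ROW, k);

        return pca.project(data);                      // d x k reduced training features
    }

    // Later, for a video frame cropped/resized to n x n:
    //     cv::Mat frameRow;
    //     frame.convertTo(frameRow, CV_32F);
    //     cv::Mat reduced = pca.project(frameRow.reshape(1, 1));   // 1 x k vector for the classifier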
Question:
1) I have a set of 100 gestures, and performing the above 6 steps will give me 100 subspaces. My testing should be done on a real-time video to find which class a gesture falls into. Onto which subspace do I project each video frame to reduce its dimension before feeding it to the classifier?
Thank you in advance.