I'm not sure this is the right forum for this question; apologies if it isn't.
I'm quite new to the Bag of Features (BoF) model and I'm trying to implement it in order to represent an image as a vector (for a CBIR project).
From what I've understood, given a training set S of n images, and supposing that we want to represent an image by a vector of size k, these are the steps for implementing BoF (a rough code sketch follows the list):

1. For each image i, compute its set of keypoints and, from those, its set of descriptors D_i.
2. Pool the descriptors of all the images together, so now we have D.
3. Run the k-means algorithm (with the k defined above) on D, so now we have k clusters, and each descriptor vector belongs to exactly one cluster.
4. Define v_i as the resulting BoF vector (of size k) for image i. Each dimension is initialized to 0.
5. For each image i, and for each descriptor d in D_i, find out which of the k clusters d belongs to. Supposing that d belongs to the j-th cluster, increment v_i[j].
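To make sure I'm on the right track, here is a minimal, untested sketch of how I would implement steps 2-3 with OpenCV (the function name buildVocabulary is mine; I'm assuming float descriptors such as SIFT):

```cpp
// Rough sketch: build the codebook with OpenCV's cv::kmeans.
// Assumes float (CV_32F) descriptors already extracted for every image.
#include <opencv2/core.hpp>
#include <vector>

// allDescriptors: D_i for every training image, one descriptor per row.
// Returns the k cluster centres (the "visual words"), one per row.
cv::Mat buildVocabulary(const std::vector<cv::Mat>& allDescriptors, int k)
{
    cv::Mat D;                          // step 2: pool all descriptors into D
    for (const cv::Mat& Di : allDescriptors)
        D.push_back(Di);

    cv::Mat labels, centers;            // step 3: k-means on D
    cv::kmeans(D, k, labels,
               cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 100, 1e-4),
               3, cv::KMEANS_PP_CENTERS, centers);
    return centers;                     // k x descriptorSize matrix
}
```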
What is not clear to me is how to implement point 5: how do we figure out which cluster a descriptor belongs to, in particular when the image whose BoF vector we are computing is a query image (and so was not part of the initial dataset)? Should we find the nearest neighbour (1-NN) among the k centroids to decide which cluster the query descriptor belongs to?
WHY I NEED THIS - THE APPLICATION:
I'm implementing the BoF model in order to build a CBIR system: given a query image q, find the image i most similar to q in a dataset of images. In order to do this, we need to solve the 1-approximate nearest neighbor problem, for example using LSH. The problem is that in LSH each input image must be represented as a vector, so we need BoF to get that vector! I hope it's now clearer why I need it :)
Please also let me know if I made any mistakes in the procedure described above.
What your algorithm is doing is generating the equivalent of words for an image. The set of "words" is not meant to be a final result, but just something that makes it simple to plug into other machine learning techniques.
In this setup, you generate a set of k clusters from the initial features (the descriptors computed in step 1).
Then you describe each image by the number of keypoints that fall in each cluster (just as you would describe a text composed of words from a dictionary of size k).
Point 3 says that you take all the descriptors from the training set images and run the k-means algorithm to figure out a reasonable separation of the points. This basically establishes what the words are.
So, for a new image, you compute the keypoints and descriptors just as you did for the training set, and then, using the clusters you already computed in training, you build the feature vector for the new image. That is, you convert your image into words from the dictionary you've built.
This is all a way to generate a reasonable feature vector from images (a partial result, if you want). It is not a complete machine learning algorithm. To complete it you need to know what you want to do. If you just want to find the most similar image(s), then yes, a nearest-neighbour search should do that. If you want to label images, then you need to train a classifier (like naive Bayes) on the feature vectors and use it to predict the label of the query.
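To answer the concrete question: yes, assigning a query descriptor means finding its nearest cluster centre (1-NN). A minimal sketch, assuming the centres returned by cv::kmeans are kept around (the function name bofVector is just for illustration):

```cpp
// Minimal sketch: quantize the descriptors of a new/query image against the
// k cluster centres and build the BoF histogram. A brute-force matcher is
// used for the 1-NN step; a kd-tree / FLANN index works the same way.
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <algorithm>
#include <vector>

cv::Mat bofVector(const cv::Mat& queryDescriptors,  // CV_32F, one row each
                  const cv::Mat& centers)           // k x d, from cv::kmeans
{
    cv::Mat hist = cv::Mat::zeros(1, centers.rows, CV_32F);

    // 1-NN: for every descriptor find the closest cluster centre.
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<cv::DMatch> matches;
    matcher.match(queryDescriptors, centers, matches);  // query -> centres

    for (const cv::DMatch& m : matches)
        hist.at<float>(0, m.trainIdx) += 1.0f;           // v_q[j]++

    // Optional: L1-normalise so images with different keypoint counts compare fairly.
    hist /= static_cast<double>(std::max(1, queryDescriptors.rows));
    return hist;
}
```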
Related
I am trying to classify MRI images of brain tumors into benign and malignant using C++ and OpenCV. I am planning to use the bag-of-words (BoW) method after clustering SIFT descriptors with k-means. That is, I will represent each image as a histogram, with the whole "codebook"/dictionary on the x-axis and the occurrence counts in the image on the y-axis. These histograms will then be the input to my SVM (with RBF kernel) classifier.
However, a disadvantage of BoW is that it ignores the spatial information of the descriptors in the image. Someone suggested using spatial pyramid matching (SPM) instead. I read about it and came across this link, which gives the following steps:
1. Compute K visual words from the training set and map every local feature to its visual word.
2. For each image, initialize K multi-resolution coordinate histograms to zero. Each coordinate histogram consists of L levels, and level i has 4^i cells that evenly partition the image.
3. For each local feature in the image (say its visual word ID is k), pick the k-th coordinate histogram and accumulate one count into each of the L corresponding cells, according to the coordinate of the local feature. The L cells are the cells the local feature falls into at the L different resolutions.
4. Concatenate the K multi-resolution coordinate histograms to form the final "long" histogram of the image. When concatenating, the k-th histogram is weighted by the probability of the k-th visual word.
5. To compute the kernel value for two images, sum up all the cells of the intersection of their "long" histograms.
Now, I have the following questions:
What is a coordinate histogram? Doesn't a histogram just show the counts for each bin on the x-axis? How can it provide information on the coordinates of a point?
How would I compute the probability of the k-th visual word?
What will the "kernel value" I get be used for? How do I use it as input to the SVM? If I understand it right, the kernel value is used in the testing phase and not in the training phase, is that correct? If so, then how will I train my SVM?
Or do you think I shouldn't burden myself with the spatial information and just stick with normal BoW for my situation (benign vs. malignant tumors)?
Someone please help this poor little undergraduate. You'll have my forever gratefulness if you do. If you have any clarifications, please don't hesitate to ask.
Here is the link to the actual paper, http://www.csd.uwo.ca/~olga/Courses/Fall2014/CS9840/Papers/lazebnikcvpr06b.pdf
MATLAB code is provided here http://web.engr.illinois.edu/~slazebni/research/SpatialPyramid.zip
A coordinate histogram (mentioned in your post) is just a histogram computed over a sub-region of the image. These slides explain it visually: http://web.engr.illinois.edu/~slazebni/slides/ima_poster.pdf.
So you have multiple histograms here, one for each sub-region of the image. The probability (or count) of a visual word depends on the SIFT points that fall in that sub-region.
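To make the binning concrete, here is a rough sketch of the accumulation step from the quoted list (my own illustration, not the paper's code; Feature and spatialHistograms are names I made up):

```cpp
// Sketch: for each local feature with visual word id "word" and position (x, y),
// add one count to the cell it falls into at every pyramid level 0..L-1.
#include <vector>
#include <algorithm>

struct Feature { float x, y; int word; };   // word in [0, K)

// hist[k] is the coordinate histogram of word k, flattened level by level:
// level l contributes a (2^l x 2^l) grid, i.e. 4^l cells.
std::vector<std::vector<float>> spatialHistograms(
        const std::vector<Feature>& features,
        int K, int L, int width, int height)
{
    int cellsPerWord = 0;
    for (int l = 0; l < L; ++l) cellsPerWord += (1 << l) * (1 << l);

    std::vector<std::vector<float>> hist(K, std::vector<float>(cellsPerWord, 0.f));

    for (const Feature& f : features) {
        int offset = 0;
        for (int l = 0; l < L; ++l) {
            int n = 1 << l;                                   // n x n grid at level l
            int cx = std::min(n - 1, (int)(f.x * n / width));
            int cy = std::min(n - 1, (int)(f.y * n / height));
            hist[f.word][offset + cy * n + cx] += 1.f;        // one count per level
            offset += n * n;
        }
    }
    return hist;
}
```

Concatenating hist[0], ..., hist[K-1] (each weighted as described in step 4, or with the per-level weights used in the Lazebnik paper) then gives the final "long" histogram.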
I think you need to define your pyramid kernel as mentioned in the slides.
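The pyramid kernel from the last quoted step is essentially a histogram intersection over the two long histograms; a minimal sketch is below. As with any kernel SVM, it gets evaluated between training pairs during training and between the query and the support vectors at test time, so you would typically train with a library that accepts precomputed kernels (e.g. LIBSVM), or simply feed the long histograms to your existing RBF-kernel SVM as ordinary feature vectors.

```cpp
// Minimal sketch of the histogram-intersection ("pyramid match") kernel:
// the sum of element-wise minima of the two long histograms.
#include <vector>
#include <cstddef>
#include <algorithm>

double intersectionKernel(const std::vector<float>& a, const std::vector<float>& b)
{
    double k = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        k += std::min(a[i], b[i]);      // overlap in every spatial cell
    return k;
}
```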
A Convolutional Neural Network may be better suited for your task if you have enough training samples. You can probably have a look at Torch or Caffe.
I'm grabbing face images from a camera and storing each face frame until there are enough images to train the Eigenfaces model in OpenCV. I'm able to get a mean eigenface, but I'm wondering how I could store it in a database on a server so that, when a person comes back later, I could compute another mean eigenface, send it to the server, and find the closest match. I was thinking of hashing that eigenface and comparing hashes, or I could just store the mean eigenface itself in the database, but I don't know how I would compare the eigenface on the client with all the eigenfaces in the database without pulling every single record down from the database.
Does anybody have any idea how I might turn the eigenvalues or mean eigenface into a string or number of some kind that I could compare against the values in the database on the server?
how i could store this into a database on a server
Does it have to be an actual database server, i.e. MySQL or the like? Why not store the Eigenface images on disk along with a sidecar file containing meta information about each Eigenface.
I was thinking of either hashing that eigenface and comparing hashes
I would suggest against that. Two images will only have the same hash values if all pixels have identical values (leaving aside the possibility of hash collisions). So any comparison of images is either true or false. Most comparisons would be false, since most new queries won't have an exact match in the database.
but i don't know how i would compare the eigenface on the client with all the eigenfaces in a database without pulling every single record down from the database
Does anybody have any idea how i might turn the eigenvalues or mean eigenface into a string or number of some kind which i could compare the mean eigenface value with the values in the database on the server?
The trick is to use an abstract representation of the images, usually called descriptors, and to define a distance metric on these descriptors to evaluate their similarity. For example, assume a metric d and the descriptors D_A and D_B of two images A and B; then d(D_A, D_A) = 0 and d(D_A, D_B) >= 0.
Given all the descriptors of the Eigenface images in your database and the metric d, you can organize the descriptors in a special data structure, e.g. a KD-tree, in order to find the nearest neighbours of a new query image (i.e. of that image's descriptor). With this kind of matching it is no longer necessary to compare a new query to every image in the database.
If the distance between a query Q and its 1st nearest neighbour NN1 is sufficiently smaller than the distance between Q and its 2nd nearest neighbour NN2, i.e. d(D_Q, D_NN1) < a * d(D_Q, D_NN2) with a < 1, then Q and NN1 can be considered a match.
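A minimal sketch of that nearest-neighbour search plus ratio test with OpenCV's FLANN module (assumptions: float descriptors, one descriptor row per database record, and a separate mapping from row index to the record id on the server):

```cpp
// Sketch only: kd-tree search over the database descriptors plus the ratio test.
#include <opencv2/core.hpp>
#include <opencv2/flann.hpp>

int bestMatch(const cv::Mat& queryDescriptor,      // 1 x d, CV_32F
              const cv::Mat& dbDescriptors,        // N x d, CV_32F
              float ratio = 0.7f)                  // the "a" from above
{
    // Build a kd-tree over the database descriptors
    // (in a real system you would build this once and reuse it).
    cv::flann::Index index(dbDescriptors, cv::flann::KDTreeIndexParams(4));

    cv::Mat indices, dists;
    index.knnSearch(queryDescriptor, indices, dists, 2, cv::flann::SearchParams(32));

    // FLANN returns squared L2 distances, hence ratio*ratio.
    if (dists.at<float>(0, 0) < ratio * ratio * dists.at<float>(0, 1))
        return indices.at<int>(0, 0);              // row id of the confident match
    return -1;                                     // no confident match
}
```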
This is a very broad topic with an abundance of approaches and opinions. But the outline above is commonly used for similar applications.
These keywords could help with your further investigation:
feature extraction and descriptors (SIFT, SURF, ORB, FREAK, AKAZE, etc.)
image descriptors, see e.g. https://stackoverflow.com/a/844113
keypoint and image matching
approximate nearest neighbour
face detection and recognition
and probably more ...
Generally, literature surveys are a good way to get an overview of what has been done, what works well and what doesn't.
I have an image holding the results of a segmentation, like this one.
I need to build a graph of the neighborhood relations between the patches, which are colored in different colors.
As a result I'd like a structure representing the following.
Here numbers represent separate patches, and lines represent the neighborhood relations between patches.
Currently I cannot figure out where to start or which keywords to google.
Could anyone suggest anything useful?
Image is stored in OpenCV's cv::Mat class, as for graph, I plan to use Boost.Graph library.
So, please, give me some links to code samples and algorithms, or keywords.
Thanks.
Update.
After a coffee break and some discussions, the following came to my mind:
Build a large lattice graph, where each node corresponds to each image pixel, and links connect 8 or 4 neighbors.
Label each graph node with a corresponding pixel value.
Somehow merge nodes with the same label.
Another problem is that I'm not familiar with the BGL (but the book is on its way :)).
So, what do you think about this solution?
Update2
Probably, this link can help.
However, I still haven't found a solution.
You could solve it like this (a rough sketch follows the list):
Define regions (your numbers in the graph)
make a 2D array which stores the region number
start at (0,0) and set it to 1 (the first region number)
fill the whole region with 1 using a flood-fill algorithm or similar.
during the flood fill you will probably encounter coordinates that have a different color; store those in a queue. Once the current fill is done, increment the region number and start filling from the queued coordinates.
Make links between regions
iterate through your 2D array.
if two neighbouring cells hold different numbers, store the number pair (probably in a sorted manner; you also have to check whether the pair already exists). If you advance from left to right, you only have to check the element below, the one to the right, and the one diagonally to the right.
Though I have to admit I don't know a thing about this topic... just my simple idea.
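Still, a rough, untested sketch of that idea with OpenCV (cv::floodFill for the labelling, then one scan to collect neighbouring label pairs; the function name regionAdjacency is mine):

```cpp
// Sketch: label each uniformly coloured patch, then record adjacent label pairs.
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <algorithm>
#include <set>
#include <utility>

// seg: segmentation image (e.g. CV_8UC3, one colour per patch).
// Returns the set of adjacent region-id pairs (smaller id first).
std::set<std::pair<int,int>> regionAdjacency(cv::Mat seg)
{
    cv::Mat labels(seg.size(), CV_32S, cv::Scalar(0));   // 0 = not yet labelled
    int next = 0;

    for (int y = 0; y < seg.rows; ++y)
        for (int x = 0; x < seg.cols; ++x)
            if (labels.at<int>(y, x) == 0) {
                ++next;
                // floodFill wants a mask 2 px larger than the image.
                cv::Mat mask = cv::Mat::zeros(seg.rows + 2, seg.cols + 2, CV_8U);
                cv::floodFill(seg, mask, cv::Point(x, y), cv::Scalar(), nullptr,
                              cv::Scalar(), cv::Scalar(), 4 | cv::FLOODFILL_MASK_ONLY);
                labels.setTo(cv::Scalar(next), mask(cv::Rect(1, 1, seg.cols, seg.rows)));
            }

    std::set<std::pair<int,int>> edges;
    for (int y = 0; y < seg.rows - 1; ++y)
        for (int x = 0; x < seg.cols - 1; ++x) {
            int a = labels.at<int>(y, x);
            int r = labels.at<int>(y, x + 1), b = labels.at<int>(y + 1, x);
            if (a != r) edges.insert(std::minmax(a, r));  // right neighbour
            if (a != b) edges.insert(std::minmax(a, b));  // bottom neighbour
        }
    return edges;
}
```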
You could use BFS to mark the regions.
To expose cv::Mat to the BGL you would have to write a lot of code; I think writing your own BFS is much simpler.
Then, for every two neighbours, write their marks into a std::set<std::pair<mark_t, mark_t>>.
And then build the graph from that, for example as sketched below.
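A minimal sketch of that last step (assuming integer region marks 1..N and the pair set from above):

```cpp
// Sketch: turn the set of neighbouring-mark pairs into a BGL graph.
#include <boost/graph/adjacency_list.hpp>
#include <set>
#include <utility>

using PatchGraph = boost::adjacency_list<boost::setS,      // no parallel edges
                                         boost::vecS,
                                         boost::undirectedS>;

PatchGraph buildGraph(const std::set<std::pair<int,int>>& edges, int numRegions)
{
    PatchGraph g(numRegions + 1);          // vertex i corresponds to region mark i
    for (const auto& e : edges)
        boost::add_edge(e.first, e.second, g);
    return g;
}
```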
I think that if your color patches are that random, you will probably need a brute-force algorithm to do what you want. An idea could be:
Do a first brute-force pass to identify all the patches. For example, make a matrix A of the same size as the image and initialize it to 0. For each pixel that is still zero, mark it as the start of a new patch and use a brute-force approach to find the whole extent of the patch. Each cell of A will then hold the number of the patch it belongs to.
The patch numbers have to be powers of two, for example 1, 2, 4, 8, ...
Make another matrix B of the size of the image, where each cell holds two values representing the connections between pixels: the first value is the absolute difference between the cell's patch number and the patch number of the pixel below it, the second the absolute difference with the pixel to its left.
Collect all unique values in matrix B and you have all the possible connections.
This works because the difference between two distinct powers of two is unique (see the small decoding sketch below). For example, if in B you end up with the numbers 3, 6 and 7, it means there are contacts between the patch pairs (4,1), (8,2) and (8,1). A value of 0 of course means that two adjacent pixels belong to the same patch, so you just ignore it.
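For completeness, a tiny sketch of the decoding step (my own illustration): given a non-zero difference d = 2^a - 2^b, the trailing zero bits give b and the remainder gives a. Note that with 64-bit patch numbers this scheme caps out at 64 patches, which is its main practical limitation.

```cpp
// Sketch: recover the patch pair (2^a, 2^b) from a difference d = 2^a - 2^b.
// Works because d = 2^b * (2^(a-b) - 1).
#include <cstdint>
#include <utility>

std::pair<uint64_t, uint64_t> decodePair(uint64_t d)   // requires d != 0
{
    int b = 0;
    while (((d >> b) & 1u) == 0) ++b;                  // trailing zeros -> exponent b
    uint64_t rest = (d >> b) + 1;                      // == 2^(a-b)
    int a = b;
    while (rest > 1) { rest >>= 1; ++a; }              // recover a
    return { uint64_t(1) << a, uint64_t(1) << b };     // the two patch numbers
}
```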
I am using GPC (General Polygon Clipper) to create sets of images, but I am unable to determine whether the images come from disjoint sets.
I am using the gpc_polygon struct defined at the above link, reading the vertex list from the image data (lat/lon of the corners), and adding images sequentially to a polygon.
It is important to separate images that belong to separate regions. While I can't say for sure that the intersection area will be non-zero (that would have been a perfect test), I have noticed that the num_contours of the completed polygon coincides with the number of distinct regions.
I thought I could use num_contours to determine whether an image belongs to a set.
Yet, as I add images, I see num_contours=1 after one image and 2 after the second (whether or not the second image is in the same section, which makes sense)... but it doesn't increase after that until the pattern of disjoint images is really off, so I can't really use it as a test, at least not on its own.
The same happens when I remove images from the polygon using a DIFF operation.
If anyone else has used GPC, or some other method of polygon clipping, perhaps you can give me some advice on what I could use to identify which images belong to each contour, so I can separate them either before or after polygon creation?
I used num_contours with a limiting value of 2 instead of 1, and had to go back iteratively and try to re-add contours until I couldn't add any more. The solution is suboptimal, may be very slow, and there are situations where polygons that don't belong together end up in the same contour.
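An alternative that may be worth trying (a hedged, untested sketch, assuming the standard GPC calls gpc_polygon_clip and gpc_free_polygon): keep one gpc_polygon per region instead of one big polygon, and test each new image footprint against each region by checking whether their union collapses into the region's existing outer contours.

```cpp
// Sketch (C++ calling the C GPC library): merge an image footprint into a
// region polygon only if the union does not create a new outer contour.
#include "gpc.h"

static int outer_contours(const gpc_polygon* p)
{
    int n = 0;
    for (int i = 0; i < p->num_contours; ++i)
        if (!p->hole[i]) ++n;          // ignore hole contours created by the union
    return n;
}

// Returns 1 if `image` overlaps/touches `region` (and replaces `region` by the
// merged polygon), 0 otherwise. The caller still owns and frees `image`.
int try_merge(gpc_polygon* region, gpc_polygon* image)
{
    gpc_polygon merged;
    gpc_polygon_clip(GPC_UNION, region, image, &merged);

    if (outer_contours(&merged) == outer_contours(region)) {
        gpc_free_polygon(region);      // nothing new was split off: they connect
        *region = merged;              // adopt the merged outline
        return 1;
    }
    gpc_free_polygon(&merged);
    return 0;
}
```

An image that merges with no existing region starts a new region of its own; that gives you the image-to-region assignment before ever building the combined polygon.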
Is there a feature extraction method that is scale invariant but not rotation invariant? I would like to match similar images that have been scaled but not rotated...
EDIT: Let me rephrase. How can I check whether an image is a scaled version (or close to) of an original?
Histogram and Gaussian pyramids are used to extract scale-invariant features.
How can I check whether an image is a scaled version (or close to) of an original?
That part puzzles me. Do you mean that, given two images, one is the original and the other is a scaled version of it? Or is one the original and the other a scaled fragment of it, and you want to locate the fragment in the original?
[updated]
Given two images, a and b:
Detect their SIFT or SURF feature points and descriptors.
Find the corresponding regions between a and b. If there are none, return false. Refer to Pattern Matching - Find reference object in second image and Trying to match two images using sift in OpenCv, but too many matches. Call the region in a Ra and the one in b Rb.
Use an algorithm like template matching to determine whether Ra is similar enough to Rb. If it is, calculate the scale ratio (a rough sketch follows).
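A rough sketch of that pipeline with OpenCV (assuming OpenCV >= 4.4, where cv::SIFT is available in features2d; the function estimateScale and the thresholds are mine): match SIFT features with a ratio test and take the median ratio of the matched keypoint sizes as the scale estimate.

```cpp
// Sketch: estimate the scale of image b relative to image a from SIFT matches.
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <algorithm>
#include <vector>

// Returns the estimated scale factor, or a negative value if there are not
// enough good matches (i.e. the images are probably not related by scaling).
double estimateScale(const cv::Mat& a, const cv::Mat& b)
{
    cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
    std::vector<cv::KeyPoint> ka, kb;
    cv::Mat da, db;
    sift->detectAndCompute(a, cv::noArray(), ka, da);
    sift->detectAndCompute(b, cv::noArray(), kb, db);

    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(da, db, knn, 2);

    std::vector<double> ratios;                        // per-match scale estimates
    for (const auto& m : knn)
        if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance)
            ratios.push_back(kb[m[0].trainIdx].size / ka[m[0].queryIdx].size);

    if (ratios.size() < 10) return -1.0;               // not similar enough
    std::nth_element(ratios.begin(), ratios.begin() + ratios.size() / 2, ratios.end());
    return ratios[ratios.size() / 2];                  // median scale ratio
}
```

Since SIFT is also rotation invariant, if you additionally need to reject rotated copies you could compare the angles (cv::KeyPoint::angle) of the matched keypoints, or run a template/correlation check after undoing the estimated scale, as suggested above.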