I'm currently extending an image library used to categorize images and i want to find duplicate images, transformed images, and images that contain or are contained in other images.
I have tested the SIFT implementation from OpenCV and it works very well but would be rather slow for multiple images. Too speed it up I thought I could extract the features and save them in a database as a lot of other image related meta data is already being held there.
What would be the fastest way to compare the features of a new images to the features in the database?
Usually comparison is done calculating the euclidean distance using kd-trees, FLANN, or with the Pyramid Match Kernel that I found in another thread here on SO, but haven't looked much into yet.
Since I don't know of a way to save and search a kd-tree in a database efficiently, I'm currently only seeing three options:
* Let MySQL calculate the euclidean distance to every feature in the database, although I'm sure that that will take an unreasonable time for more than a few images.
* Load the entire dataset into memory at the beginning and build the kd-tree(s). This would probably be fast, but very memory intensive. Plus all the data would need to be transferred from the database.
* Saving the generated trees into the database and loading all of them, would be the fastest method but also generate high amounts of traffic as with new images the kd-trees would have to be rebuilt and send to the server.
I'm using the SIFT implementation of OpenCV, but I'm not dead set on it. If there is a feature extractor more suitable for this task (and roughly equally robust) I'm glad if someone could suggest one.
So I basically did something very similar to this a few years ago. The algorithm you want to look into was proposed a few years ago by David Nister, the paper is: "Scalable Recognition with a Vocabulary Tree". They pretty much have an exact solution to your problem that can scale to millions of images.
Here is a link to the abstract, you can find a download link by googleing the title.
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1641018
The basic idea is to build a tree with a hierarchical k-means algorithm to model the features and then leverage the sparse distribution of features in that tree to quickly find your nearest neighbors... or something like that, it's been a few years since I worked on it. You can find a powerpoint presentation on the authors webpage here: http://www.vis.uky.edu/~dnister/Publications/publications.html
A few other notes:
I wouldn't bother with the pyramid match kernel, it's really more for improving object recognition than duplicate/transformed image detection.
I would not store any of this feature stuff in an SQL database. Depending on your application it is sometimes more effective to compute your features on the fly since their size can exceed the original image size when computed densely. Histograms of features or pointers to nodes in a vocabulary tree are much more efficient.
SQL databases are not designed for doing massive floating point vector calculations. You can store things in your database, but don't use it as a tool for computation. I tried this once with SQLite and it ended very badly.
If you decide to implement this, read the paper in detail and keep a copy handy while implementing it, as there are many minor details that are very important to making the algorithm work efficiently.
The key, I think, is that is this isn't a SIFT question. It is a question about approximate nearest neighbor search. Like image matching this too is an open research problem. You can try googling "approximate nearest neighbor search" and see what type of methods are available. If you need exact results, try: "exact nearest neighbor search".
The performace of all these geometric data structures (such as kd-trees) degrade as the number of dimensions increase, so the key I think is that you may need to represent your SIFT descriptors in a lower number of dimensions (say 10-30 instead of 256-1024) to have really efficient nearest neighbor searches (use PCA for example).
Once you have this I think it will become secondary if the data is stored in MySQL or not.
I think speed is not the main issue here. The main issue is how to use the features to get the results you want.
If you want to categorize the images (e. g. person, car, house, cat), then the Pyramid Match kernel is definitely worth looking at. It is actually a histogram of the local feature descriptors, so there is no need to compare individual features to each other. There is also a class of algorithms known as the "bag of words", which try to cluster the local features to form a "visual vocabulary". Again, in this case once you have your "visual words" you do not need to compute distances between all pairs of SIFT descriptors, but instead determine which cluster each feature belongs to. On the other hand, if you want to get point correspondences between pairs of images, such as to decide whether one image is contained in another, or to compute the transformation between the images, then you do need to find the exact nearest neighbors.
Also, there are local features other than SIFT. For example SURF are features similar to SIFT, but they are faster to extract, and they have been shown to perform better for certain tasks.
If all you want to do is to find duplicates, you can speed up your search considerably by using a global image descriptor, such as a color histogram, to prune out images that are obviously different. Comparing two color histograms is orders of magnitude faster than comparing two sets each containing hundreds of SIFT features. You can create a short list of candidates using color histograms, and then refine your search using SIFT.
I have some tools in python you can play with here . Basically its a package that uses SIFT transformed vectors, and then computes a nearest lattice hashing of each 128d sift vector. The hashing is the important part, as it is locality sensitive, simply meaning that vectors near in R^n space result in equivalent hash collision probabilities. The work I provide is an extension of Andoni that provides a query adaptive heuristic for pruning the LSH exact search lists, as well as an optimized CUDA implementation of the hashing function. I also have a small app that does image database search with nice visual feedback, all under bsd (exception is SIFT which has some additional restrictions).
Related
Given two images, e.g. two cats, is there a library that includes a "quick and dirty" way of telling by how much the two images differ regarding translation and rotation? Image registration is a big field and every application I run into seems to be tailored to medical scans and usually has certain domain specific caps on the transformation ranges. The tool I require should take two images as an input and return an angle of rotation and a translation vector, maybe even a confidence metric, it's that simple. (Most algorithms out there are heavy-duty and focus on minute details for alignment, the tool I'm looking for need not be as exact.)
If it needs not be very precise, you can probably tweak the code from PyImageSearch to better suit your application.
If you know that the two images you are going to compare do contain the same object (i.e., if there is no additional object recognition problem that comes before this step), then you can maybe try using the ORB detector to find the good keypoints, and then estimate the homography using ViSP
I have tried searching google for this for quite some time and have had no luck find what I need. What I am trying to do is I have a few different versions of the same image these versions being derived from the original by dithering them to a palette. The image is split into 8x8 tiles that use one of n palettes. I want to pick which palette looks best. There are a few questions on stackoverflow relating to image difference however they suggest a simple algorithm
Σ(R1n-R2n)^2+(G1n-G2n)^2+(B1n-B2n)^2 and picking the lowest sum. Some suggest that to improve this a better delta algorithm can be used, I have tried CIEDE2000 still found the results unsatisfactory and in many instances I can manually pick a better tile. Note that I did do a bit of verification of my implementation of CIEDE2000 using precomputed data and used this library for http://www.getreuer.info/home/colorspace rgb to CIELab so I doubt that the reason I am a bit disappointed with the results is due to a bug and most of the time it does an okay job. What I want to do is implement a better algorithm than just the delta to pick the best tile. Does anyone know of a good algorithm for this (inclusive) or better terminology so that when I search for this I find more relevant information?
Working on a small side project related to Computer Vision, mostly to try playing around with OpenCV. It lead me to an interesting question:
Using feature detection to find known objects in an image isn't always easy- objects are hard to find, especially if the features of the target object aren't great.
But if I could choose ahead of time what it is I'm looking for, then in theory I could generate for myself an optimal image for detection. Any quality that makes feature detection hard would be absent, and all the qualities that make it easy would exist.
I suspect this sort of thought went into things like QR codes, but with the limitations that they wanted QR codes to be simple, and small.
So my question for you: How would you generate an optimal image for later recognition by a camera? What if you already know that certain problems like skew, or partial obscuring would occur?
Thanks very much
I think you need something like AR markers.
Take a look at ArToolkit, ArToolkitPlus or Aruco libraries, they have marker generators and detectors.
And papeer about marker generation: http://www.uco.es/investiga/grupos/ava/sites/default/files/GarridoJurado2014.pdf
If you plan to use feature detection, than marker should be specific to used feature detector. Common practice for detector design is good response to "corners" or regions with high x,y gradients. Also you should note the scaling of target.
The simplest detection can be performed with BLOBS. It can be faster and more robust than feature points. For example you can detect circular blobs or rectangular.
Depending on the distance you want to see your markers from and viewing conditions/backgrounds you typically use and camera resolution/noise you should choose different images/targets. Under moderate perspective from a longer distance a color target is pretty unique, see this:
https://surf-it.soe.ucsc.edu/sites/default/files/velado_report.pdf
at close distances various bar/QR codes may be a good choice. Other than that any flat textured object will be easy to track using homography as opposed to 3D objects.
http://docs.opencv.org/trunk/doc/py_tutorials/py_feature2d/py_feature_homography/py_feature_homography.html
Even different views of 3d objects can be quickly learned and tracked by such systems as Predator:
https://www.youtube.com/watch?v=1GhNXHCQGsM
then comes the whole field of hardware, structured light, synchronized markers, etc, etc. Kinect, for example, uses a predefined pattern projected on the surface to do stereo. This means it recognizes and matches million of micro patterns per second creating a depth map from the matched correspondences. Note that one camera sees the pattern and while another device - a projector generates it working as a virtual camera, see
http://article.wn.com/view/2013/11/17/Apple_to_buy_PrimeSense_technology_from_the_360s_Kinect/
The quickest way to demonstrate good tracking of a standard checkerboard pattern is to use pNp function of open cv:
http://www.juergenwiki.de/work/wiki/lib/exe/fetch.php?media=public:cameracalibration_detecting_fieldcorners_of_a_chessboard.gif
this literally can be done by calling just two functions
found = findChessboardCorners(src, chessboardSize, corners, camFlags);
drawChessCornersDots(dst, chessboardSize, corners, found);
To sum up, your question is very broad and there are multiple answers and solutions. Formulate your viewing condition, camera specs, backgrounds, distances, amount of motion and perspective you expect to have indoors vs outdoors, etc. There is no such a thing as a general average case in computer vision!
I was given a project on vehicle type identification with neural network and that is how I came to know the awesomeness of neural technology.
I am a beginner with this field, but I have sufficient materials to learn it. I just want to know some good places to start for this project specifically, as my biggest problem is that I don't have very much time. I would really appreciate any help. Most importantly, I want to learn how to match patterns with images (in my case, vehicles).
I'd also like to know if python is a good language to start this in, as I'm most comfortable with it.
I am having some images of cars as input and I need to classify those cars by there model number.
Eg: Audi A4,Audi A6,Audi A8,etc
You didn't say whether you can use an existing framework or need to implement the solution from scratch, but either way Python is excellent language for coding neural networks.
If you can use a framework, check out Theano, which is written in Python and is the most complete neural network framework available in any language:
http://www.deeplearning.net/software/theano/
If you need to write your implementation from scratch, look at the book 'Machine Learning, An Algorithmic Perspective' by Stephen Marsland. It contains example Python code for implementing a basic multilayered neural network.
As for how to proceed, you'll want to convert your images into 1-D input vectors. Don't worry about losing the 2-D information, the network will learn 'receptive fields' on its own that extract 2-D features. Normalize the pixel intensities to a -1 to 1 range (or better yet, 0 mean with a standard deviation of 1). If the images are already centered and normalized to roughly the same size than a simple feed-forward network should be sufficient. If the cars vary wildly in angle or distance from the camera, you may need to use a convolutional neural network, but that's much more complex to implement (there are examples in the Theano documentation). For a basic feed-forward network try using two hidden layers and anywhere from 0.5 to 1.5 x the number of pixels in each layer.
Break your dataset into separate training, validation, and testing sets (perhaps with a 0.6, 0.2, 0.2 ratio respectively) and make sure each image only appears in one set. Train ONLY on the training set, and don't use any regularization until you're getting close to 100% of the training instances correct. You can use the validation set to monitor progress on instances that you're not training on. Performance should be worse on the validation set than the training set. Stop training when the performance on the validation set stops improving. Once you've accomplished this you can try different regularization constants and choose the one that results in the best validation set performance. The test set will tell you how well your final result is performing (but don't change anything based on test set results, or you risk overfitting to that too!).
If your car images are very complex and varied and you cannot get a basic feed-forward net to perform well, you might consider using 'deep learning'. That is, add more layers and pre-train them using unsupervised training. There's a detailed tutorial on how to do this here (though all the code examples are in MatLab/Octave):
http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
Again, that adds a lot of complexity. Try it with a basic feed-forward NN first.
I'm trying to build a lightweight object recognition system using ORB for feature extraction and LDA for classification. But I'm running into an issue do to the varying size of extracted features.
These are my steps:
Extract keypoints using ORB.
Extract trainable features in the image by grouping the keypoints.
(example of whats being extracted: http://imgur.com/gaQWk)
Train the recognizer with the extracted features. (This is where problems arise)
Classify objects in an image from the wild.
If I attempt to create a generalized matrix using cv::gemm, I get an exception due to the varying sizes. My first thought was to just to normalize all the images by resizing them, but this causes a lot of accuracy issues when objects have similar small features.
Is there any solution to this? Is LDA an appropriate method for this? I know it's commonly used with facial recognition algorithms such as fisherfaces.
LDA requires fixed length features, as do most optimization and machine learning methods. You could resize the image patches to be a fixed size, but that is probably not going to be a good feature. Normally people use a scale invariant feature such as SIFT. You also might try a color histogram, or some variation of edge detection and spatial histogram binning such as a GIST vector.
It's hard to say if LDA is an appropriate method for this without knowing what you hope to accomplish. You might also look into using SVM, some form of boosting, or just plain nearest neighbor with a large training set.