I need to implement a simple image search in my app using TensorFlow.
The requirements are these:
The dataset contains around a million images, all of the same size, each containing one unique object and only that object.
The search parameter is an image taken with a phone camera of some object that is potentially in the dataset.
I've managed to extract the object from the camera picture and straighten it into rectangular form, and as a result a reverse image search indexer like TinEye was able to find a match.
Now I want to reproduce that indexer by using TensorFlow to create a model based on my dataset (making each image's file name a unique index).
Could anyone point me to tutorials/code that would explain how to achieve such a thing without diving too much into computer vision terminology?
Much appreciated!
The Wikipedia article on TinEye says that Perceptual Hashing will yield results similar to TinEye's. They reference this detailed description of the algorithm. But TinEye refuses to comment.
The biggest issue with the Perceptual Hashing approach is that while it's efficient for identifying the same image (subject to skews, contrast changes, etc.), it's not great at identifying a completely different image of the same object (e.g. the front of a car vs. the side of a car).
TensorFlow has great support for deep neural nets which might give you better results. Here's a high level description of how you might use a deep neural net in TensorFlow to solve this problem:
Start with a pre-trained NN (such as GoogLeNet) or train one yourself on a dataset like ImageNet. Now we're given a new picture we're trying to identify. Feed that into the NN. Look at the activations of a fairly deep layer in the NN. This vector of activations is like a 'fingerprint' for the image. Find the picture in your database with the closest fingerprint. If it's sufficiently close, it's probably the same object.
The intuition behind this approach is that unlike Perceptual Hashing, the NN is building up a high-level representation of the image including identifying edges, shapes, and important colors. For example, the fingerprint of an apple might include information about its circular shape, red color, and even its small stem.
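A minimal sketch of that fingerprint-and-nearest-neighbour idea, assuming TensorFlow 2.x with Keras, a pre-trained InceptionV3 as the feature extractor, and cosine similarity for the comparison (the file names and the choice of layer/pooling are illustrative assumptions, not part of the original answer):

```python
import numpy as np
import tensorflow as tf

# Pre-trained InceptionV3 without the classification head; global average
# pooling turns the last convolutional feature map into a 2048-d "fingerprint".
model = tf.keras.applications.InceptionV3(include_top=False, pooling="avg",
                                          weights="imagenet")

def fingerprint(path):
    """Load an image file and return its activation vector."""
    img = tf.keras.utils.load_img(path, target_size=(299, 299))
    x = tf.keras.utils.img_to_array(img)
    x = tf.keras.applications.inception_v3.preprocess_input(x[np.newaxis])
    return model.predict(x, verbose=0)[0]

# Index the dataset once: file name -> fingerprint (hypothetical file names).
database = {name: fingerprint(name) for name in ["img_000001.jpg", "img_000002.jpg"]}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def search(query_path):
    """Return the file name whose fingerprint is closest to the query's."""
    q = fingerprint(query_path)
    return max(database, key=lambda name: cosine(q, database[name]))

print("Closest match:", search("camera_crop.jpg"))
```

For a million images you would put the fingerprints in an approximate nearest-neighbour index rather than a plain dictionary, but the overall scheme stays the same.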
You could also try something like this 2012 paper on image retrieval which uses a slew of hand-picked features such as SIFT, regional color moments and object contour fragments. This is probably a lot more work and it's not what TensorFlow is best at.
UPDATE
OP has provided an example pair of images from his application:
Here are the results of using the demo on the pHash.org website on that pair of similar images as well as on a pair of completely dissimilar images.
Comparing the two images provided by the OP:
RADISH (radial hash): pHash determined your images are not similar with PCC = 0.518013.
DCT hash: pHash determined your images are not similar with hamming distance = 32.000000.
Marr/Mexican hat wavelet: pHash determined your images are not similar with normalized hamming distance = 0.480903.
Comparing one of his images with a random image from my machine:
RADISH (radial hash): pHash determined your images are not similar with PCC = 0.690619.
DCT hash: pHash determined your images are not similar with hamming distance = 27.000000.
Marr/Mexican hat wavelet: pHash determined your images are not similar with normalized hamming distance = 0.519097.
Conclusion
We'll have to test more images to really know. But so far pHash does not seem to be doing very well. With the default thresholds it doesn't consider the similar images to be similar. And for one algorithm, it actually considers a completely random image to be more similar.
https://github.com/wuzhenyusjtu/VisualSearchServer
It is a simple implementation of similar-image search using TensorFlow and the InceptionV3 model. The code implements two parts: a server that handles image search, and a simple indexer that does nearest-neighbor matching based on the extracted pool3 features.
Related
I'm new to OpenCV. I want to compare two images by their corner features rather than other features. I tried SURF, SIFT, ORB, etc., but they don't suit my needs. Could someone give me some suggestions about this? For example, image1 and image2 below are similar because they share a lot of the same corners (although it isn't exact), but image3 isn't similar to 1 and 2. Thanks.
There are a lot of things to try here, and others can suggest other solutions, but here is mine:
Use the Minutiae algorithm (the algorithm used for fingerprint identification). In the minutiae algorithm, a set of common features is extracted, such as ridge endings and bifurcations.
These features are called minutiae. As you can see, they are quite similar to the corners in your images. I suggest the following procedure:
1) Find the corners (You already did this)
2) Assign each corner to a minutiae class (e.g. one of the classes mentioned above). You can do this using a local binary pattern algorithm or just follow the usual recipe from the fingerprint algorithm (see the link at the end or search on Google).
3) To compute the similarity, just do some voting. For instance, assume an image (call it A) has four minutiae of type a) and two of type e). To compare it with a new image, compute the minutiae in the new image and then see how many items per class both images share. You can add as many features or kinds of minutiae as you want to make your algorithm more robust (and also more complicated).
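A rough Python/OpenCV sketch of steps 1-3, where the "class" of each corner is approximated by a simple 3x3 local binary pattern around it; the corner parameters, the class definition and the histogram-intersection score are all illustrative assumptions rather than the canonical minutiae recipe:

```python
import cv2
import numpy as np

def corner_class_histogram(gray, max_corners=200):
    """Detect corners and count how many fall into each crude 'class' (a 3x3 LBP code)."""
    corners = cv2.goodFeaturesToTrack(gray, max_corners, 0.01, 10)
    hist = np.zeros(256, dtype=float)
    if corners is None:
        return hist
    for (x, y) in corners.reshape(-1, 2).astype(int):
        if 1 <= y < gray.shape[0] - 1 and 1 <= x < gray.shape[1] - 1:
            patch = gray[y - 1:y + 2, x - 1:x + 2]
            bits = (patch.flatten() >= patch[1, 1]).astype(int)
            code = int("".join(map(str, np.delete(bits, 4))), 2)  # 8 neighbours -> 0..255
            hist[code] += 1
    return hist

def similarity(hist_a, hist_b):
    """Voting: how many corners per class both images share (histogram intersection)."""
    return np.minimum(hist_a, hist_b).sum() / max(hist_a.sum(), 1.0)

img1 = cv2.imread("image1.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file names
img2 = cv2.imread("image2.png", cv2.IMREAD_GRAYSCALE)
print("similarity:", similarity(corner_class_histogram(img1), corner_class_histogram(img2)))
```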
In any case, you can look up the minutiae fingerprint recognition algorithm on Google (it is one of the most famous digital image processing algorithms). Here is one of the many slide decks you can find explaining the algorithm:
https://is.muni.cz/el/1433/jaro2008/PV204/um/finger/MinutiaeBasedFpMatching.pdf
Just take care to modify the main algorithm to suit your needs.
Hope it helps
I am a new user of OpenCV. I am currently doing a project on performing product inspection with OpenCV. I was planning to extract the edges of a good product and a bad product and then compare their edges, maybe with mean squared difference. However, it is already quite difficult to extract the edges clearly as the first step.
Good sample: Good product
When I use Canny edge detection, the edge of the good product (the blue part of the picture) only includes part of the product and is as follows:
Edge of good product
I also tried to use adaptiveThreshold to make the greyscale picture clearer and then run the edge detection, but the detected edges are not as good as expected because of a lot of noise.
Therefore, I would like to ask for a solution for extracting the edges, or any better way of comparing the good product and the bad product with OpenCV. Sorry for my bad English above.
This task can be made simple if some assumptions like those below are valid. For example:
Products move along the same production line, and therefore the lighting across all product images stays the same.
The products lie on a flat surface parallel to the focal plane of the camera, so the rotation of objects will be only around the axis of the lens.
The lighting can be controlled (this will help you to get better edges, and is actually used in production line image processing)
Since we cannot see the images that you added, it is a bit hard to see what exactly the situation is. But if the above assumptions are valid, image differencing combined with image signatures is one possible approach.
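A minimal sketch of the image-differencing idea under those assumptions (fixed lighting, aligned products); the file names, difference threshold and defect-area threshold are assumptions you would tune:

```python
import cv2
import numpy as np

# Reference image of a known good product and the image to inspect
# (hypothetical file names).
reference = cv2.imread("good_product.png", cv2.IMREAD_GRAYSCALE)
test = cv2.imread("candidate.png", cv2.IMREAD_GRAYSCALE)

# Pixel-wise absolute difference; with constant lighting and alignment,
# large differences should correspond to defects.
diff = cv2.absdiff(reference, test)
_, mask = cv2.threshold(diff, 40, 255, cv2.THRESH_BINARY)

# Suppress single-pixel noise, then measure how much area actually differs.
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
defect_area = cv2.countNonZero(mask)
print("defective" if defect_area > 50 else "good", defect_area)
```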
Another possibility is to train a Haar Cascade classifier using good and bad products. With this, the edges and all that will be taken care of. But you will have to collect a lot of data and train the classifier.
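Training the cascade is done offline with OpenCV's cascade-training tools; once you have a trained XML file, using it looks roughly like this (the cascade file name and detection parameters are assumptions):

```python
import cv2

# Hypothetical cascade trained on images of the good product.
cascade = cv2.CascadeClassifier("good_product_cascade.xml")

frame = cv2.imread("production_line_frame.png")  # hypothetical camera frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Each detection is an (x, y, w, h) box where the cascade fired.
boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in boxes:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
print("detections:", len(boxes))
```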
I have performed the thinning operation on a binary image with the code provided here. The source image which I used was this one.
And the result image which I obtained after applying the thinning operation to the source image was this one.
The problem I am facing is how to remove the noise in the image, which is visible around the thinned white lines.
In this particular case, the easiest and safest solution is to label the connected components (union-find algorithm) and delete the ones whose area is smaller than one or two pixels.
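A short sketch of that, using OpenCV's connected-components labelling on the thinned binary image (the area threshold and file names are assumptions):

```python
import cv2
import numpy as np

# Thinned binary image: white skeleton lines on a black background
# (hypothetical file name).
thinned = cv2.imread("thinned.png", cv2.IMREAD_GRAYSCALE)

# Label the 8-connected components and get their areas in one call.
n_labels, labels, stats, _ = cv2.connectedComponentsWithStats(thinned, connectivity=8)

# Keep only components larger than a couple of pixels; label 0 is the background.
cleaned = np.zeros_like(thinned)
for label in range(1, n_labels):
    if stats[label, cv2.CC_STAT_AREA] > 2:
        cleaned[labels == label] = 255

cv2.imwrite("thinned_clean.png", cleaned)
```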
FiReTiTi and kcc__ have already provided good answers, but I thought I'd provide another perspective. Having looked through some of your previous posts, it appears that you're trying to build software that uses vascular patterns on the hand to identify people. So at some point, you will need to build some kind of classification algorithm.
I bring this up because many such algorithms are quite robust in the presence of this kind of noise. For example, if you intend to use supervised learning to train a convolutional neural net (which would be a reasonable approach assuming you can collect a decent amount of training samples), you may find that extensive pre-processing of this sort is unnecessary, and may even degrade the performance.
Just some thoughts to consider. Cheers!
Another simple but perhaps less robust approach is to use contour area to remove small connected regions, then use erode/dilate before applying the thinning process.
However, you can also process your thinned image directly by using cv::findContours() and masking out contours with a small area. This is similar to what FiReTiTi answered.
You can use the findContours example from OpenCV to build contour detection using an edge detector such as Canny. The example can be ported directly as part of your requirement.
Once you have the contours in vector<vector<Point> > contours;, you can iterate over each contour and use cv::contourArea to find the area of each region. Using a pre-defined threshold, you can remove unwanted areas.
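A Python sketch of that filtering (the C++ calls are analogous); the area threshold is an assumption, and note that one-pixel-wide skeleton lines enclose almost no area, so on an already-thinned image you may prefer to filter on contour length instead:

```python
import cv2

# Hypothetical binary image (before or after thinning).
binary = cv2.imread("binary.png", cv2.IMREAD_GRAYSCALE)

# OpenCV 4 signature: findContours returns (contours, hierarchy).
contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

cleaned = binary.copy()
for c in contours:
    # Erase regions whose area falls below the (assumed) threshold by
    # filling them with the background colour.
    if cv2.contourArea(c) < 20:
        cv2.drawContours(cleaned, [c], -1, 0, thickness=cv2.FILLED)

cv2.imwrite("binary_clean.png", cleaned)
```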
In my opinion, why don't you use a distance transform on the first image and then apply a size filter to the resulting image to de-speckle it?
I would like to know how I can use OpenCV to detect an image on my video camera feed. The image can be one of 500 images.
What I'm doing at the moment:
- (void)viewDidLoad
{
[super viewDidLoad];
// Do any additional setup after loading the view.
self.videoCamera = [[CvVideoCamera alloc] initWithParentView:imageView];
self.videoCamera.delegate = self;
self.videoCamera.defaultAVCaptureDevicePosition = AVCaptureDevicePositionBack;
self.videoCamera.defaultAVCaptureSessionPreset = AVCaptureSessionPresetHigh;
self.videoCamera.defaultAVCaptureVideoOrientation = AVCaptureVideoOrientationPortrait;
self.videoCamera.defaultFPS = 30;
self.videoCamera.grayscaleMode = NO;
}
-(void)viewDidAppear:(BOOL)animated{
[super viewDidAppear:animated];
[self.videoCamera start];
}
#pragma mark - Protocol CvVideoCameraDelegate
#ifdef __cplusplus
- (void)processImage:(cv::Mat&)image
{
// Do some OpenCV stuff with the image
cv::Mat image_copy;
cvtColor(image, image_copy, CV_BGRA2BGR);
// invert image
//bitwise_not(image_copy, image_copy);
//cvtColor(image_copy, image, CV_BGR2BGRA);
}
#endif
The images that I would like to detect are small (2-5 KB). A few have text on them, but others are just signs. Here is an example:
Do you guys know how I can do that?
There are several things in here. I will break down your problem and point you towards some possible solutions.
Classification: Your main task consists of determining whether a certain image belongs to a class. This problem by itself can be decomposed into several problems:
Feature representation: You need to decide how you are going to model your features, i.e. how you are going to represent each image in a feature space so you can train a classifier to separate the classes. The feature representation by itself is already a big design decision. You could (i) calculate a histogram of the images using n bins and train a classifier, or (ii) choose a sequence of random patch comparisons, as in a random forest. However, after training, you need to evaluate the performance of your algorithm to see how good your decision was.
There is a well-known problem called overfitting, which is when you learn the training data so well that your classifier cannot generalize. This can usually be avoided with cross-validation. If you are not familiar with the concepts of false positives and false negatives, take a look at this article.
Once you define your feature space, you need to choose an algorithm to train on that data, and this might be considered your biggest decision. There are new algorithms coming out every day. To name a few of the classical ones: Naive Bayes, SVM, Random Forests, and more recently the community has obtained great results using deep learning. Each of these has its own specific usage (e.g. SVMs are great for binary classification), and you need to be familiar with your problem. You can start with simple assumptions, such as independence between random variables, and train a Naive Bayes classifier to try to separate your images.
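As a deliberately simple, concrete instance of that pipeline, here is a sketch using per-channel colour histograms as the feature space and scikit-learn's Gaussian Naive Bayes as the classifier; the file names, labels, bin count and train/test split are all assumptions:

```python
import cv2
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

def histogram_feature(path, bins=16):
    """Represent an image as a concatenated per-channel colour histogram."""
    img = cv2.imread(path)
    feats = [cv2.calcHist([img], [c], None, [bins], [0, 256]).flatten() for c in range(3)]
    feats = np.concatenate(feats)
    return feats / (feats.sum() + 1e-9)  # normalise so image size doesn't matter

# Hypothetical image paths and integer class labels (one class per sign).
paths = ["sign_a_01.png", "sign_a_02.png", "sign_b_01.png", "sign_b_02.png"]
labels = [0, 0, 1, 1]

X = np.array([histogram_feature(p) for p in paths])
y = np.array(labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
clf = GaussianNB().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```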
Patches: Now, you mentioned that you would like to recognize the images from your webcam. If you are going to print the images and show them in the video, you need to handle several things. It is necessary to define patches on your big image (the input from the webcam), build a feature representation for each patch, and classify them in the same way you did in the previous step. To do that, you could slide a window and classify all the patches to see whether they belong to the negative class or to one of the positive ones. There are other alternatives.
Scale: Assuming that you are able to detect the location of images in the big image and classify them, the next step is to relax the toy assumption of fixed scale. To handle multiple scales, you could use an image pyramid, which pretty much allows you to perform the detection at multiple resolutions. Alternative approaches could consider keypoint detectors, such as SIFT and SURF. SIFT internally uses an image pyramid, which provides the scale invariance.
Projection: So far we have assumed that your images are under orthographic projection, but most likely you will have slight perspective projections, which will make all the previous assumptions fail. One naive solution would be, for instance, to detect the corners of the white background of your image and rectify the image before building the feature vector for classification. If you use SIFT or SURF, you could design a way to avoid explicitly handling that. Nevertheless, if your input is just going to be square patches, such as in ARToolKit, I would go for manual rectification.
I hope I might have given you a better picture of your problem.
I would recommend using SURF for that, because the pictures can be at different distances from your camera, i.e. changing scale. I did a similar experiment and SURF worked just as expected. But SURF is very difficult to tune (and its operations are expensive); you should try different setups before you get the results you need.
Here is a link: http://docs.opencv.org/modules/nonfree/doc/feature_detection.html
youtube video (in C#, but can give an idea): http://www.youtube.com/watch?v=zjxWpKCQqJc
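A small sketch of that kind of keypoint matching. SURF itself lives in the non-free opencv-contrib module, so this uses ORB (free, and also fairly scale/rotation tolerant) as a stand-in; the feature count, Lowe ratio and match-count threshold are assumptions:

```python
import cv2

# Reference sign and current camera frame (hypothetical file names).
sign = cv2.imread("sign.png", cv2.IMREAD_GRAYSCALE)
frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute binary descriptors.
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(sign, None)
kp2, des2 = orb.detectAndCompute(frame, None)

# Match with Hamming distance and keep only clearly-best matches (Lowe ratio test).
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2) if m.distance < 0.75 * n.distance]

# Assumed minimum number of good matches before calling it a detection.
print("match" if len(good) > 20 else "no match", len(good))
```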
I might not be qualified enough to answer this problem. The last time I seriously used OpenCV, it was still version 1.1. But here are some thoughts on it, and I hope they help (currently I am interested in DIP and ML).
I think it will probably be an easier task if you only need to classify an image, i.e. if the image is just one of (or very similar to) your 500 images. For this you could use an SVM or some neural network (Felix already gave an excellent enumeration of those).
However, your problem seems to be that you first need to find this candidate image in your webcam feed, and you have little clue about its location beforehand. (Let us know whether this is so; I think it is important.)
If so, the harder problem is the detection/localization of your candidate image.
I don't have a general solution for that. The first thing I would do is to see if there is some common feature across your 500 images (e.g. whether all of them are enclosed by a red circle, or half of them contain a circle and half of them contain a rectangle). If this can be done, the problem will be simpler (it would be similar to the face detection problem, which has good solutions).
In other words, this means that you first classify the 500 images into a few groups with a common feature (by hand), detect the group first, then scale and use the above-mentioned techniques to classify them into the final result. In this way, it will be more computationally acceptable than trying to detect 500 images one by one.
BTW, this ppt would help to give a visual clue of what is going on for feature extraction and image matching http://courses.cs.washington.edu/courses/cse455/09wi/Lects/lect6.pdf.
Detect vs. recognize: detecting the image is just finding it against the background, and from your comments I realized you may have your signs surrounded by background. It might help your algorithm if you can somehow crop your signs from the background (detect) before trying to recognize them. Recognition is the next stage, which presumes you can classify the cropped image correctly as one seen before.
If you need real-time speed and scale/rotation invariance, neither SIFT nor SURF will do this fast. Nowadays you can do much better if you shift the burden of image processing to a learning stage, as was done by Lepetit. In short, he subjected each pattern to a bunch of affine transformations and trained a binary classification tree to recognize each point correctly by doing a lot of binary comparison tests. Trees are extremely fast and the way to go, not to mention that most of the processing is done offline. This method is also more robust to out-of-plane rotations than SIFT or SURF. You will also learn about tree classification, which may help in your last processing stage.
Finally, the recognition stage is based not only on the number of matches but also on their geometric consistency. Since your signs look flat, I suggest finding either an affine or a homography transformation that has the most inliers when calculated between matched points (a sketch of this check follows below).
Looking at your code, though, I realized that you may not be following any of these recommendations. It may be a good starting point for you to read about decision trees and then play with some sample code (see mushroom.cpp in the above-mentioned link).
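Here is a sketch of the geometric-consistency check mentioned above: fit a homography between matched points with RANSAC and count the inliers. The tree-based keypoints from the answer are replaced by ORB purely for illustration (any matcher that yields point correspondences works), and the inlier threshold is an assumption:

```python
import cv2
import numpy as np

sign = cv2.imread("sign.png", cv2.IMREAD_GRAYSCALE)    # hypothetical reference sign
frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical camera frame

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(sign, None)
kp2, des2 = orb.detectAndCompute(frame, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

# Geometric consistency: fit a homography with RANSAC and count the inliers.
src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

inliers = int(inlier_mask.sum()) if inlier_mask is not None else 0
print("recognized" if inliers > 15 else "rejected", inliers)
```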
I am trying to do image detection in C++. I have two images:
Image Scene: 1024x786
Person: 36x49
And I need to identify this particular person in the scene. I've tried to use correlation, but the image is too noisy and therefore doesn't give correct/accurate results.
I've been thinking/researching methods that would best solve this task and these seem the most logical:
Gaussian filters
Convolution
FFT
Basically, I would like to remove the noise from the images, so that I can then use correlation to find the person more effectively.
I understand that an FFT will be hard to implement and/or may be slow especially with the size of the image I'm using.
Could anyone offer any pointers to solving this? What would the best technique/algorithm be?
In Andrew Ng's Machine Learning class we did this exact problem using neural networks and a sliding window:
Train a neural network to recognize the particular feature you're looking for, using tagged data describing what the images are, with a 36x49 window (or whatever other size you want).
To recognize a new image, take the 36x49 rectangle and slide it across the image, testing at each location. When you move to a new location, move the window right by a certain number of pixels, call it the jump_size (say 5 pixels). When you reach the right-hand side of the image, go back to x = 0 and increment the y of your window by jump_size.
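A bare-bones sketch of that sliding-window loop; the classifier itself is abstracted behind a placeholder is_person() function (whatever network you trained), and the file name is an assumption:

```python
import cv2

scene = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)  # hypothetical 1024x786 scene
win_w, win_h, jump_size = 36, 49, 5

def is_person(patch):
    """Placeholder: replace with the trained network's decision for this 36x49 patch."""
    return False

detections = []
for y in range(0, scene.shape[0] - win_h + 1, jump_size):
    for x in range(0, scene.shape[1] - win_w + 1, jump_size):
        if is_person(scene[y:y + win_h, x:x + win_w]):
            detections.append((x, y))

print("candidate locations:", detections)
```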
Neural networks are good for this because the noise isn't a huge issue: you don't need to remove it. It's also good because it can recognize images similar to ones it has seen before, but are slightly different (the face is at a different angle, the lighting is slightly different, etc.).
Of course, the downside is that you need the training data to do it. If you don't have a set of pre-tagged images then you might be out of luck - although if you have a Facebook account you can probably write a script to pull all of yours and your friends' tagged photos and use that.
An FFT only makes sense when you have already sorted the images with a k-d tree or a hierarchical tree. I would suggest mapping the image's 2D RGB values to a 1D curve and reducing some of the complexity before a frequency analysis.
I do not have an exact algorithm to propose, because I have found that target detection methods depend greatly on the specific situation. Instead, I have some tips and advice. Here is what I would suggest: find a specific characteristic of your target and design your code around it.
For example, if you have access to the color image, use the fact that Wally doesn't have much green or blue on him. Subtract the average of the blue and green channels from the red channel, and you'll have a much better starting point. (Apply the same operation to both the image and the target.) This will not work, though, if the noise is color-dependent (i.e. different in each channel).
You could then use correlation on the transformed images with better results. The downside of correlation is that it only works with an exact cut-out of the first image... not very useful if you need to find the target in order to find the target! Instead, I suppose that an averaged version of your target (a combination of many Wally pictures) would work up to a point.
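A sketch that combines the colour trick above with plain normalized cross-correlation via cv2.matchTemplate; the file names and acceptance threshold are assumptions, and the template would ideally be the averaged Wally suggested above:

```python
import cv2
import numpy as np

def red_emphasis(bgr):
    """Subtract the average of the blue and green channels from the red channel."""
    b, g, r = cv2.split(bgr.astype(np.float32))
    out = r - (b + g) / 2.0
    return cv2.normalize(out, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

scene = red_emphasis(cv2.imread("scene.png"))   # hypothetical scene image
target = red_emphasis(cv2.imread("wally.png"))  # hypothetical (averaged) target

# Normalized cross-correlation of the target against every position in the scene.
response = cv2.matchTemplate(scene, target, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(response)
print("best match at", max_loc, "score", max_val)  # e.g. accept only if score > 0.6
```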
My final advice: in my personal experience of working with noisy images, spectral analysis is usually a good idea, because the noise tends to contaminate only one particular scale (which would hopefully be a different scale than Wally's!). In addition, correlation is mathematically equivalent to comparing the spectral characteristics of your image and the target.