What are Eigenfaces generated from? - computer-vision

I'm working with eigenfaces for a facial recognition program I am writing. I have a couple questions about how eigenfaces are actually generated:
Are they generated from a lot of pictures of different people, or a lot of pictures of the same person?
Do these people need to include the people you want to recognize? If not, then how would any type of comparison be made?
Is an eigenface determined for every image you provide, or do multiple pictures go towards creating one eigenface?
This is all about the generation or learning phase of the eigenfaces. Thanks for any help or pointing me in the right direction!

I actually find the description for Eigenfaces on Wikipedia quite useful. To answer your questions:
Yes, you should take pictures from many different people.
No, the eigenfaces basically give you a way to describe other faces. You can think of the eigenfaces as a basis in a vector space. You have to make sure that you can describe the face that you want to recognise with the eigenfaces that you have. If you only use Caucasian faces to determine the eigenfaces, you might have problems describing a variety of Asian faces with them and vice versa.
The eigenfaces are computed from a set of images, i.e. multiple images lead to multiple eigenfaces.
Edit: Answering the question, that Kevin added in the comment to the question:
The idea behind using eigenfaces, is that you can express an image of a face by mixing eigenfaces together. Let's suppose you have three eigenfaces ef_1, ef_2, ef_3 and you have an image of a face f_1 = a_1 * ef_1 + a_2 * ef_2 + a_3 * ef_3. The eigenfaces do not change, regardless which face you want to express with them, however, the coefficients a = (a_1, a_2, a_3) are characteristic to the face. This is what you would use to compare two faces.
But in order to get to the stage where you can use eigenfaces, you first have to align (register) an observed face with the eigenfaces, which is not trivial and a completely different topic (see pxu's answer).
P.S.: I recommend, that you keep an eye on Area 51: Computer Vision, which is a Stack Overflow sister site about computer vision in the making.

Many different people are highly necessary to achieve support to cover all possible faces.
No need for that, although you need to represent all dimensions. A good analogy is to barycentric coordinates for describing the location of a point in a triangle. You are getting a weighted average to the vertices. If you don't have sufficient vector support (for example, only having two points), then you can't describe points that lie outside the line no matter how you play with the weighted average. This is essentially bjoernz's point for Caucasian vs. Asian faces. Note that this analogy is a gross simplification. The weights in eigenfaces are actually more like PCA or Fourier coefficients.
Each image gets turned into an eigenface which is a vector of principal components.
Nota bene: you need very good registration of the faces. Eigenfaces is notoriously bad about translation/rotation invariance. Your results are likely to be terrible unless you register well. The original Turk and Pentland paper was groundbreaking not just because of the technique but for the scale and quality of data set they gathered which enabled said technique.

Related

Perspective projection based on 4 points in 2D

I'm writing to ask about homography and perspective projection.
I'm trying to write a piece of code, that will "warp" my image so that its corners align with 4 reference points that are in the 3D space - however, the game engine that I'm running it in, already allows me to get the screen position of them, so I already have their screen-space coordinates of both xi,yi and ui,vi, normalized to values between 0 and 1.
I have to mention that I don't have a degree in mathematics, which seems to be a requirement in the posts I've seen on this topic so far, but I'm hoping there is actually a solution to this problem that one can comprehend. I never had a chance to take classes in Computer Vision.
The reason I came here is that in all the posts I've seen online, the simple explanation that I came across is that each point must be put into a 1x3 matrix and multiplied by a 3x3 homography, which consists of 9 components h1,h2,h3...h9, and this transformation matrix will transform each point to the correct perspective. And that's where I'm hitting a brick wall - how do I calculate the transformation matrix? It feels like it should be a relatively simple algebraic task, but apparently it's not.
At this point I spent days reading on the topic, and the solutions I've come across are either based on matlab (which have a ton of mathematical functions built into them), or include elaborations and discussions that don't really explain much; sometimes they suggest tons of different parameters and simplifications, but rarely explain why and what's their purpose, or they are referencing books and studies that have been since removed from the web, and I found myself more confused than I was in the beginning. Most of the resources I managed to find online are also made in a different context - image stitching and 3d engine development.
I also want to mention that I need to run this code each frame on the CPU, and I'm fairly concerned about the effect of having to run too many matrix transformations and solving a ton of linear algebra equations.
I apologize for not asking about any specific code, but my general question is - can anyone point me in the right direction with this issue?
Limit the problem you deal with.
For example, if you always warp the entire rectangular image, you can treat that the coordinates of the image corners are {(0,0), (1,0), (0,1), (1,1)}.
This can simplify the equation, and you'll be able to solve the equation by yourself.
So you'll be able to implement the answer.
Note : Homograpy is scale invariant. So you can decrease the freedom to 8. (e.g. you can solve the equation under h9=1).
Best advice I can give: read a good book on the subject. For example, "Multiple View Geometry" by Hartley and Zisserman

how to detect the coordinates of certain points on image

I'm using the ORB algorithm to detect and get the coordinates of the crossings of rope shown in the image, which is represented by the red dot. I want to detect the coordinates of the four points surrounding the crossing represented by the blue dots. All the four points have the same distance from the red spot.
Any idea how to get their coordinates by getting use of the red spot coordinate.
Thank you in Advance
Although you're using ORB, you're still going to need an algorithm to segment the rope from the background, or at least some technique to identify image chunks that belong to the rope and that are equidistant from the red dot. There are a number of options to explore.
It's important to consider your lighting & imaging as separate problems to be solved if this is meant to be a real-world application. This looks a bit like a problem for a class rather than for a application you'll sell and support, but you should still consider lighting:
Will your algorithm(s) still work when light level is reduced?
How will detection be affected by changes in camera pose relative to the surface where the rope will be located?
If you'll be detecting "black" rope, will the algorithm also be required to detect rope of different colors? dirty rope? rope on different backgrounds?
Since you're object of interest is rope, you have to consider a class of algorithms suitable for detection of non-rigid objects. Always consider the simplest solution first!
Connected Components
Connected components labeling is a traditional image processing algorithm and still suitable as the starting point for many applications. The last I knew, this was implemented in OpenCV as findContours(). This can also be called "blob finding" or some variant thereof.
https://en.wikipedia.org/wiki/Connected-component_labeling
https://docs.opencv.org/2.4/modules/imgproc/doc/structural_analysis_and_shape_descriptors.html?highlight=findcontours
Depending on lighting, you may have to take different steps to binarize the image before running connected components. As a start, convert the color image to grayscale, which will simplify the task significantly.
Try a manual threshold since you can quickly test a number of values to see the effect. Don't be too discouraged if the binarization isn't quite right--this can often be fixed with preprocessing.
If a range of manual thresholds works (e.g. 52 - 76 in an 8-bit grayscale range), then use an algorithm that will automatically calculate the threshold for you: Otsu, entropy-based methods, etc., will all offer comparable performance. Whichever technique works best, the code/algorithm can be tweaked further to optimize for your rope application.
If thresholding and binarization don't work--which for your rope application seems unlikely, at least how you've presented it--then switch to thinking in terms of gradient-based (edge-based, energy-based) techniques.
But assuming you can separate the rope from the background, you're still going to need a method to start at the red dot [within the rope] and move equal distances out to the blue points. More about that later after a discussion of other rope segmentation methods.
Note: connected components labeling can work in scenarios beyond just binarizing black & white images. If you can create a texture field or some other 2D representation of the image that makes it possible to distinguish the black rope from the relatively light background, you may be able to use a connected components algorithm. (Finding a "more complicated" or "more modern" algorithm isn't necessarily going to be the right approach.)
In a binarized image, blobs can be nested: on a white background you can have several black blobs, inside of one or more of which are white blobs, inside of which are black blobs, etc. An earlier version of OpenCV handled this reasonably well. (OpenCV is a nice starting point, and a touchpoint for many, but for a number of reasons it doesn't always compare favorably to other open source and commercial packages; popularity notwithstanding, OpenCV has some issues.)
Once you have a "blob" (a 4-connected region of pixels) in a 2D digital image, you can treat the blob as an object, at which point you have a number of options:
Edge tracing: trace around the inside and outside edges of the blob. From what I recall, OpenCV does (or at least should) have some relatively straightforward method to get the edges.
Split the blob into component blobs, each of which can be treated separately
Convert the blob to a polygon
...
A connected components algorithm should be high on the list of techniques to try if you have a non-rigid object.
Boolean Operations
Once you have the rope as a connected component (and possibly even without this), you can use boolean image operations to find the spots at the blue dots in your image:
Create a circular region in data, or even in the image
Find the intersection of the circle (an annulus) and the black region representing the rope. Using your original image, you should have four regions.
Find the center point of the intersection regions.
You could even try this without using connected components at all, but using connected components as part of the solution could make it more robust.
Polygon Simplification
If you have a blob, which in your application would be a connected set of black pixels representing the rope on the floor, then you can consider converting this blob to one or more polygons for further processing. There are advantages to working with polygons.
If you consider only the outside boundary of the rope, then you can see that the set of pixels defining the boundary represents a polygon. It's a polygon with a lot of points, and not a convex polygon, but a polygon nonetheless.
To simplify the polygon, you can use an algorithm such as Ramer-Douglas-Puecker:
https://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93Peucker_algorithm
Once you have a simplified polygon, you can try a few techniques to render useful data from the polygon
Angle Bisector Network
Triangulation (e.g. using ear clipping)
Triangulation is typically dependent on initial conditions, so the resulting triangulation for slighting different polygons (that is, rope -> blob -> polygon -> simplified polygon). So in your application it might be useful to triangulate the dark rope region, and then to connect the center of one triangle to the center of the next nearest triangle. You'll also have to deal with crossings, such as the rope overlap. Ultimately this can yield a "skeletonization" of the rope. Speaking of which...
Skeletonization
If the rope problem was posed to you as a class exercise, then it may have been a prompt to try skeletonization. You can read about it here:
https://en.wikipedia.org/wiki/Topological_skeleton
Skeletonization and thinning have their own problems to solve, but you should dig into them a bit and see those problems themselves.
The Medial Axis Transform (MAT) is a related concept. Long story there.
Edge-based techniques
There are a number of techniques to generate "edge images" based on edge strength, energy, entropy, etc. Making them robust takes a little effort. If you've had academic training in image processing you've likely heard of Harris, Sobel, Canny, and similar processing methods--none are magic bullets, but they're simple and dependable and will yield data you need.
An "edge image" consists of pixels representing the image gradient strength [and sometimes the gradient direction]. People may call this edge image something else, but it's the concept that matters.
What you then do with the edge data is another subject altogether. But one reason to think of edge images (or at least object borders) is that it reduces the amount of information your algorithm(s) will need to process.
Mean Shift (and related)
To get back to segmentation mentioned in the section on connected components, there are other methods for segmenting figures from a background: K-means, mean shift, and so on. You probably won't need any of those, but they're neat and worth studying.
Stroke Width Transform
This is an intriguing technique used to extract text from noisy backgrounds. Although it's intended for OCR, it could work for rope since the rope width is relatively constant, the rope shape varies, there are crossings, etc.
In short, and simplifying quite a bit, you can think of SWT as a means to find "strokes" (thick lines) by finding gradients antiparallel to each other. On either side of a stroke (or line), the edge gradient points normal to the object edge. The normal on one side of the stroke points opposite the direction of the normal on the other side of the stroke. By filtering for pixel-gradient pairs within a certain distance of each other, you can isolate certain strokes--even automatically. For your example the collection of points representing edge pairs for the rope would be much more common than other point pairs.
Non-Rigid Matching
There are techniques for matching non-rigid shapes, but they would not be worth exploring. If any of the techniques I mentioned above is unfamiliar to you, explore some of those first before you try any fancier algorithms.
CNNs, machine learning, etc.
Just don't even think of these methods as a starting point.
Other Considerations
If this were an application for industry, security, or whatnot, you'd have to determine how well your image processing worked under all environmental considerations. That's not an easy task, and can make all the difference between a setup that "works" in the lab and a setup that actually works in practice.
I hope that's of some help. Feel free to post a reply if I've confused more than helped, or if you want to explore some idea in more detail. Though I tried to touch on some common(ish) techniques, I didn't mention all the different ways of addressing this problem.
And briefly: once you have a skeleton, point network, or whatever representing a reduced data set for the rope and the red dot (the identified feature), a few techniques to find the items at the blue dots:
For a skeleton, trace along each "branch" of the rope outward from the know until the geodesic distance or straight-line 2D distance is the distance D that you want.
To use geometry, create a circle of width 1 - 2 pixels. Find the intersection of that circle and the rope. Find the center point of the intersections of circle and rope. (Also described above.)
Good luck!

OpenCV detect image against a image set

I would like to know how I can use OpenCV to detect on my VideoCamera a Image. The Image can be one of 500 images.
What I'm doing at the moment:
- (void)viewDidLoad
{
[super viewDidLoad];
// Do any additional setup after loading the view.
self.videoCamera = [[CvVideoCamera alloc] initWithParentView:imageView];
self.videoCamera.delegate = self;
self.videoCamera.defaultAVCaptureDevicePosition = AVCaptureDevicePositionBack;
self.videoCamera.defaultAVCaptureSessionPreset = AVCaptureSessionPresetHigh;
self.videoCamera.defaultAVCaptureVideoOrientation = AVCaptureVideoOrientationPortrait;
self.videoCamera.defaultFPS = 30;
self.videoCamera.grayscaleMode = NO;
}
-(void)viewDidAppear:(BOOL)animated{
[super viewDidAppear:animated];
[self.videoCamera start];
}
#pragma mark - Protocol CvVideoCameraDelegate
#ifdef __cplusplus
- (void)processImage:(cv::Mat&)image;
{
// Do some OpenCV stuff with the image
cv::Mat image_copy;
cvtColor(image, image_copy, CV_BGRA2BGR);
// invert image
//bitwise_not(image_copy, image_copy);
//cvtColor(image_copy, image, CV_BGR2BGRA);
}
#endif
The images that I would like to detect are 2-5kb small. Few got text on them but others are just signs. Here a example:
Do you guys know how I can do that?
There are several things in here. I will break down your problem and point you towards some possible solutions.
Classification: Your main task consists on determining if a certain image belongs to a class. This problem by itself can be decomposed in several problems:
Feature Representation You need to decide how you are gonna model your feature, i.e. how are you going to represent each image in a feature space so you can train a classifier to separate those classes. The feature representation by itself is already a big design decision. One could (i) calculate the histogram of the images using n bins and train a classifier or (ii) you could choose a sequence of random patches comparison such as in a random forest. However, after the training, you need to evaluate the performance of your algorithm to see how good your decision was.
There is a known problem called overfitting, which is when you learn too well that you can not generalize your classifier. This can usually be avoided with cross-validation. If you are not familiar with the concept of false positive or false negative, take a look in this article.
Once you define your feature space, you need to choose an algorithm to train that data and this might be considered as your biggest decision. There are several algorithms coming out every day. To name a few of the classical ones: Naive Bayes, SVM, Random Forests, and more recently the community has obtained great results using Deep learning. Each one of those have their own specific usage (e.g. SVM ares great for binary classification) and you need to be familiar with the problem. You can start with simple assumptions such as independence between random variables and train a Naive Bayes classifier to try to separate your images.
Patches: Now you mentioned that you would like to recognize the images on your webcam. If you are going to print the images and display in a video, you need to handle several things. it is necessary to define patches on your big image (input from the webcam) in which you build a feature representation for each patch and classify in the same way you did in the previous step. For doing that, you could slide a window and classify all the patches to see if they belong to the negative class or to one of the positive ones. There are other alternatives.
Scale: Considering that you are able to detect the location of images in the big image and classify it, the next step is to relax the toy assumption of fixes scale. To handle a multiscale approach, you could image pyramid which pretty much allows you to perform the detection in multiresolution. Alternative approaches could consider keypoint detectors, such as SIFT and SURF. Inside SIFT, there is an image pyramid which allows the invariance.
Projection So far we assumed that you had images under orthographic projection, but most likely you will have slight perspective projections which will make the whole previous assumption fail. One naive solution for that would be for instance detect the corners of the white background of your image and rectify the image before building the feature vector for classification. If you used SIFT or SURF, you could design a way of avoiding explicitly handling that. Nevertheless, if your input is gonna be just squares patches, such as in ARToolkit, I would go for manual rectification.
I hope I might have given you a better picture of your problem.
I would recommend using SURF for that, because pictures can be on different distances form your camera, i.e changing the scale. I had one similar experiment and SURF worked just as expected. But SURF has very difficult adjustment (and expensive operations), you should try different setups before you get the needed results.
Here is a link: http://docs.opencv.org/modules/nonfree/doc/feature_detection.html
youtube video (in C#, but can give an idea): http://www.youtube.com/watch?v=zjxWpKCQqJc
I might not be qualified enough to answer this problem. Last time I seriously use OpenCV it was still 1.1. But just some thought on it, and hope it would help (currently I am interested in DIP and ML).
I think it will probably an easier task if you only need to classify an image, if the image is just one from (or very similar to) your 500 images. For this you could use SVM or some neural network (Felix already gave an excellent enumeration on that).
However, your problem seems to be that you need to first find this candidate image in your webcam, the location of which you have little clue beforehand. (let us know whether it is so. I think it is important.)
If so, the harder problem is the detection/localization of your candidate image.
I don't have a general solution for that. The first thing I would do is to see if there is some common feature in your 500 images (e.g., whether all of them enclosed by a red circle, or, half of them have circle and half of them have rectangle). If this can be done, the problem will be simpler (it would be similar to face detection problem, which have good solution).
In other words, this means that you first classify the 500 images to a few groups with common feature (by human), and detect the group first, then scale and use above mentioned technique to classify them into fine result. In this way, it will be more computationally acceptable than trying to detect 500 images one by one.
BTW, this ppt would help to give a visual clue of what is going on for feature extraction and image matching http://courses.cs.washington.edu/courses/cse455/09wi/Lects/lect6.pdf.
Detect vs recognize: detecting the image is just finding it on the background and from your comments I realized you may have your sings surrounded by the background. It might facilitate your algorithm if you can somehow crop your signs from the background (detect) before trying to recognize them. Recognizing is a next stage that presumes you can classify the cropped image correctly as the one seen before.
If you need real time speed and scale/rotation invariance neither SIFT no SURF will do this fast. Nowadays you can do much better if you shift the burden of image processing to a learning stage as was done by Lepitit. In short, he subjected each pattern to a bunch of affine transformations and trained a binary classification tree to recognize each point correctly by doing a lot of binary comparison tests. Trees are extremely fast and a way to go not to mention that most of the processing is done offline. This method is also more robust to off-plane rotations than SIFT or SURF. You will also learn about tree classification which may facilitate you last processing stage.
Finally a recognition stage is based not only on the number of matches but also on their geometric consistency. Since your signs look flat I suggest finding either affine or homography transformation that has most inliers when calculated between matched points.
Looking at your code though I realized that you may not follow any of these recommendations. It may be a good starting point for you to read about decision trees and then play with some sample code (see mushroom.cpp in the above mentioned link)

Image Processing: Algorithm Improvement for 'Coca-Cola Can' Recognition

One of the most interesting projects I've worked on in the past couple of years was a project about image processing. The goal was to develop a system to be able to recognize Coca-Cola 'cans' (note that I'm stressing the word 'cans', you'll see why in a minute). You can see a sample below, with the can recognized in the green rectangle with scale and rotation.
Some constraints on the project:
The background could be very noisy.
The can could have any scale or rotation or even orientation (within reasonable limits).
The image could have some degree of fuzziness (contours might not be entirely straight).
There could be Coca-Cola bottles in the image, and the algorithm should only detect the can!
The brightness of the image could vary a lot (so you can't rely "too much" on color detection).
The can could be partly hidden on the sides or the middle and possibly partly hidden behind a bottle.
There could be no can at all in the image, in which case you had to find nothing and write a message saying so.
So you could end up with tricky things like this (which in this case had my algorithm totally fail):
I did this project a while ago, and had a lot of fun doing it, and I had a decent implementation. Here are some details about my implementation:
Language: Done in C++ using OpenCV library.
Pre-processing: For the image pre-processing, i.e. transforming the image into a more raw form to give to the algorithm, I used 2 methods:
Changing color domain from RGB to HSV and filtering based on "red" hue, saturation above a certain threshold to avoid orange-like colors, and filtering of low value to avoid dark tones. The end result was a binary black and white image, where all white pixels would represent the pixels that match this threshold. Obviously there is still a lot of crap in the image, but this reduces the number of dimensions you have to work with.
Noise filtering using median filtering (taking the median pixel value of all neighbors and replace the pixel by this value) to reduce noise.
Using Canny Edge Detection Filter to get the contours of all items after 2 precedent steps.
Algorithm: The algorithm itself I chose for this task was taken from this awesome book on feature extraction and called Generalized Hough Transform (pretty different from the regular Hough Transform). It basically says a few things:
You can describe an object in space without knowing its analytical equation (which is the case here).
It is resistant to image deformations such as scaling and rotation, as it will basically test your image for every combination of scale factor and rotation factor.
It uses a base model (a template) that the algorithm will "learn".
Each pixel remaining in the contour image will vote for another pixel which will supposedly be the center (in terms of gravity) of your object, based on what it learned from the model.
In the end, you end up with a heat map of the votes, for example here all the pixels of the contour of the can will vote for its gravitational center, so you'll have a lot of votes in the same pixel corresponding to the center, and will see a peak in the heat map as below:
Once you have that, a simple threshold-based heuristic can give you the location of the center pixel, from which you can derive the scale and rotation and then plot your little rectangle around it (final scale and rotation factor will obviously be relative to your original template). In theory at least...
Results: Now, while this approach worked in the basic cases, it was severely lacking in some areas:
It is extremely slow! I'm not stressing this enough. Almost a full day was needed to process the 30 test images, obviously because I had a very high scaling factor for rotation and translation, since some of the cans were very small.
It was completely lost when bottles were in the image, and for some reason almost always found the bottle instead of the can (perhaps because bottles were bigger, thus had more pixels, thus more votes)
Fuzzy images were also no good, since the votes ended up in pixel at random locations around the center, thus ending with a very noisy heat map.
In-variance in translation and rotation was achieved, but not in orientation, meaning that a can that was not directly facing the camera objective wasn't recognized.
Can you help me improve my specific algorithm, using exclusively OpenCV features, to resolve the four specific issues mentioned?
I hope some people will also learn something out of it as well, after all I think not only people who ask questions should learn. :)
An alternative approach would be to extract features (keypoints) using the scale-invariant feature transform (SIFT) or Speeded Up Robust Features (SURF).
You can find a nice OpenCV code example in Java, C++, and Python on this page: Features2D + Homography to find a known object
Both algorithms are invariant to scaling and rotation. Since they work with features, you can also handle occlusion (as long as enough keypoints are visible).
Image source: tutorial example
The processing takes a few hundred ms for SIFT, SURF is bit faster, but it not suitable for real-time applications. ORB uses FAST which is weaker regarding rotation invariance.
The original papers
SURF: Speeded Up Robust Features
Distinctive Image Features
from Scale-Invariant Keypoints
ORB: an efficient alternative to SIFT or SURF
To speed things up, I would take advantage of the fact that you are not asked to find an arbitrary image/object, but specifically one with the Coca-Cola logo. This is significant because this logo is very distinctive, and it should have a characteristic, scale-invariant signature in the frequency domain, particularly in the red channel of RGB. That is to say, the alternating pattern of red-to-white-to-red encountered by a horizontal scan line (trained on a horizontally aligned logo) will have a distinctive "rhythm" as it passes through the central axis of the logo. That rhythm will "speed up" or "slow down" at different scales and orientations, but will remain proportionally equivalent. You could identify/define a few dozen such scanlines, both horizontally and vertically through the logo and several more diagonally, in a starburst pattern. Call these the "signature scan lines."
Searching for this signature in the target image is a simple matter of scanning the image in horizontal strips. Look for a high-frequency in the red-channel (indicating moving from a red region to a white one), and once found, see if it is followed by one of the frequency rhythms identified in the training session. Once a match is found, you will instantly know the scan-line's orientation and location in the logo (if you keep track of those things during training), so identifying the boundaries of the logo from there is trivial.
I would be surprised if this weren't a linearly-efficient algorithm, or nearly so. It obviously doesn't address your can-bottle discrimination, but at least you'll have your logos.
(Update: for bottle recognition I would look for coke (the brown liquid) adjacent to the logo -- that is, inside the bottle. Or, in the case of an empty bottle, I would look for a cap which will always have the same basic shape, size, and distance from the logo and will typically be all white or red. Search for a solid color eliptical shape where a cap should be, relative to the logo. Not foolproof of course, but your goal here should be to find the easy ones fast.)
(It's been a few years since my image processing days, so I kept this suggestion high-level and conceptual. I think it might slightly approximate how a human eye might operate -- or at least how my brain does!)
Fun problem: when I glanced at your bottle image I thought it was a can too. But, as a human, what I did to tell the difference is that I then noticed it was also a bottle...
So, to tell cans and bottles apart, how about simply scanning for bottles first? If you find one, mask out the label before looking for cans.
Not too hard to implement if you're already doing cans. The real downside is it doubles your processing time. (But thinking ahead to real-world applications, you're going to end up wanting to do bottles anyway ;-)
Isn't it difficult even for humans to distinguish between a bottle and a can in the second image (provided the transparent region of the bottle is hidden)?
They are almost the same except for a very small region (that is, width at the top of the can is a little small while the wrapper of the bottle is the same width throughout, but a minor change right?)
The first thing that came to my mind was to check for the red top of bottle. But it is still a problem, if there is no top for the bottle, or if it is partially hidden (as mentioned above).
The second thing I thought was about the transparency of bottle. OpenCV has some works on finding transparent objects in an image. Check the below links.
OpenCV Meeting Notes Minutes 2012-03-19
OpenCV Meeting Notes Minutes 2012-02-28
Particularly look at this to see how accurately they detect glass:
OpenCV Meeting Notes Minutes 2012-04-24
See their implementation result:
They say it is the implementation of the paper "A Geodesic Active Contour Framework for Finding Glass" by K. McHenry and J. Ponce, CVPR 2006.
It might be helpful in your case a little bit, but problem arises again if the bottle is filled.
So I think here, you can search for the transparent body of the bottles first or for a red region connected to two transparent objects laterally which is obviously the bottle. (When working ideally, an image as follows.)
Now you can remove the yellow region, that is, the label of the bottle and run your algorithm to find the can.
Anyway, this solution also has different problems like in the other solutions.
It works only if your bottle is empty. In that case, you will have to search for the red region between the two black colors (if the Coca Cola liquid is black).
Another problem if transparent part is covered.
But anyway, if there are none of the above problems in the pictures, this seems be to a better way.
I really like Darren Cook's and stacker's answers to this problem. I was in the midst of throwing my thoughts into a comment on those, but I believe my approach is too answer-shaped to not leave here.
In short summary, you've identified an algorithm to determine that a Coca-Cola logo is present at a particular location in space. You're now trying to determine, for arbitrary orientations and arbitrary scaling factors, a heuristic suitable for distinguishing Coca-Cola cans from other objects, inclusive of: bottles, billboards, advertisements, and Coca-Cola paraphernalia all associated with this iconic logo. You didn't call out many of these additional cases in your problem statement, but I feel they're vital to the success of your algorithm.
The secret here is determining what visual features a can contains or, through the negative space, what features are present for other Coke products that are not present for cans. To that end, the current top answer sketches out a basic approach for selecting "can" if and only if "bottle" is not identified, either by the presence of a bottle cap, liquid, or other similar visual heuristics.
The problem is this breaks down. A bottle could, for example, be empty and lack the presence of a cap, leading to a false positive. Or, it could be a partial bottle with additional features mangled, leading again to false detection. Needless to say, this isn't elegant, nor is it effective for our purposes.
To this end, the most correct selection criteria for cans appear to be the following:
Is the shape of the object silhouette, as you sketched out in your question, correct? If so, +1.
If we assume the presence of natural or artificial light, do we detect a chrome outline to the bottle that signifies whether this is made of aluminum? If so, +1.
Do we determine that the specular properties of the object are correct, relative to our light sources (illustrative video link on light source detection)? If so, +1.
Can we determine any other properties about the object that identify it as a can, including, but not limited to, the topological image skew of the logo, the orientation of the object, the juxtaposition of the object (for example, on a planar surface like a table or in the context of other cans), and the presence of a pull tab? If so, for each, +1.
Your classification might then look like the following:
For each candidate match, if the presence of a Coca Cola logo was detected, draw a gray border.
For each match over +2, draw a red border.
This visually highlights to the user what was detected, emphasizing weak positives that may, correctly, be detected as mangled cans.
The detection of each property carries a very different time and space complexity, and for each approach, a quick pass through http://dsp.stackexchange.com is more than reasonable for determining the most correct and most efficient algorithm for your purposes. My intent here is, purely and simply, to emphasize that detecting if something is a can by invalidating a small portion of the candidate detection space isn't the most robust or effective solution to this problem, and ideally, you should take the appropriate actions accordingly.
And hey, congrats on the Hacker News posting! On the whole, this is a pretty terrific question worthy of the publicity it received. :)
Looking at shape
Take a gander at the shape of the red portion of the can/bottle. Notice how the can tapers off slightly at the very top whereas the bottle label is straight. You can distinguish between these two by comparing the width of the red portion across the length of it.
Looking at highlights
One way to distinguish between bottles and cans is the material. A bottle is made of plastic whereas a can is made of aluminum metal. In sufficiently well-lit situations, looking at the specularity would be one way of telling a bottle label from a can label.
As far as I can tell, that is how a human would tell the difference between the two types of labels. If the lighting conditions are poor, there is bound to be some uncertainty in distinguishing the two anyways. In that case, you would have to be able to detect the presence of the transparent/translucent bottle itself.
Please take a look at Zdenek Kalal's Predator tracker. It requires some training, but it can actively learn how the tracked object looks at different orientations and scales and does it in realtime!
The source code is available on his site. It's in MATLAB, but perhaps there is a Java implementation already done by a community member. I have succesfully re-implemented the tracker part of TLD in C#. If I remember correctly, TLD is using Ferns as the keypoint detector. I use either SURF or SIFT instead (already suggested by #stacker) to reacquire the object if it was lost by the tracker. The tracker's feedback makes it easy to build with time a dynamic list of sift/surf templates that with time enable reacquiring the object with very high precision.
If you're interested in my C# implementation of the tracker, feel free to ask.
If you are not limited to just a camera which wasn't in one of your constraints perhaps you can move to using a range sensor like the Xbox Kinect. With this you can perform depth and colour based matched segmentation of the image. This allows for faster separation of objects in the image. You can then use ICP matching or similar techniques to even match the shape of the can rather then just its outline or colour and given that it is cylindrical this may be a valid option for any orientation if you have a previous 3D scan of the target. These techniques are often quite quick especially when used for such a specific purpose which should solve your speed problem.
Also I could suggest, not necessarily for accuracy or speed but for fun you could use a trained neural network on your hue segmented image to identify the shape of the can. These are very fast and can often be up to 80/90% accurate. Training would be a little bit of a long process though as you would have to manually identify the can in each image.
I would detect red rectangles: RGB -> HSV, filter red -> binary image, close (dilate then erode, known as imclose in matlab)
Then look through rectangles from largest to smallest. Rectangles that have smaller rectangles in a known position/scale can both be removed (assuming bottle proportions are constant, the smaller rectangle would be a bottle cap).
This would leave you with red rectangles, then you'll need to somehow detect the logos to tell if they're a red rectangle or a coke can. Like OCR, but with a known logo?
This may be a very naive idea (or may not work at all), but the dimensions of all the coke cans are fixed. So may be if the same image contains both a can and a bottle then you can tell them apart by size considerations (bottles are going to be larger). Now because of missing depth (i.e. 3D mapping to 2D mapping) its possible that a bottle may appear shrunk and there isn't a size difference. You may recover some depth information using stereo-imaging and then recover the original size.
Hmm, I actually think I'm onto something (this is like the most interesting question ever - so it'd be a shame not to continue trying to find the "perfect" answer, even though an acceptable one has been found)...
Once you find the logo, your troubles are half done. Then you only have to figure out the differences between what's around the logo. Additionally, we want to do as little extra as possible. I think this is actually this easy part...
What is around the logo? For a can, we can see metal, which despite the effects of lighting, does not change whatsoever in its basic colour. As long as we know the angle of the label, we can tell what's directly above it, so we're looking at the difference between these:
Here, what's above and below the logo is completely dark, consistent in colour. Relatively easy in that respect.
Here, what's above and below is light, but still consistent in colour. It's all-silver, and all-silver metal actually seems pretty rare, as well as silver colours in general. Additionally, it's in a thin slither and close enough to the red that has already been identified so you could trace its shape for its entire length to calculate a percentage of what can be considered the metal ring of the can. Really, you only need a small fraction of that anywhere along the can to tell it is part of it, but you still need to find a balance that ensures it's not just an empty bottle with something metal behind it.
And finally, the tricky one. But not so tricky, once we're only going by what we can see directly above (and below) the red wrapper. Its transparent, which means it will show whatever is behind it. That's good, because things that are behind it aren't likely to be as consistent in colour as the silver circular metal of the can. There could be many different things behind it, which would tell us that it's an empty (or filled with clear liquid) bottle, or a consistent colour, which could either mean that it's filled with liquid or that the bottle is simply in front of a solid colour. We're working with what's closest to the top and bottom, and the chances of the right colours being in the right place are relatively slim. We know it's a bottle, because it hasn't got that key visual element of the can, which is relatively simplistic compared to what could be behind a bottle.
(that last one was the best I could find of an empty large coca cola bottle - interestingly the cap AND ring are yellow, indicating that the redness of the cap probably shouldn't be relied upon)
In the rare circumstance that a similar shade of silver is behind the bottle, even after the abstraction of the plastic, or the bottle is somehow filled with the same shade of silver liquid, we can fall back on what we can roughly estimate as being the shape of the silver - which as I mentioned, is circular and follows the shape of the can. But even though I lack any certain knowledge in image processing, that sounds slow. Better yet, why not deduce this by for once checking around the sides of the logo to ensure there is nothing of the same silver colour there? Ah, but what if there's the same shade of silver behind a can? Then, we do indeed have to pay more attention to shapes, looking at the top and bottom of the can again.
Depending on how flawless this all needs to be, it could be very slow, but I guess my basic concept is to check the easiest and closest things first. Go by colour differences around the already matched shape (which seems the most trivial part of this anyway) before going to the effort of working out the shape of the other elements. To list it, it goes:
Find the main attraction (red logo background, and possibly the logo itself for orientation, though in case the can is turned away, you need to concentrate on the red alone)
Verify the shape and orientation, yet again via the very distinctive redness
Check colours around the shape (since it's quick and painless)
Finally, if needed, verify the shape of those colours around the main attraction for the right roundness.
In the event you can't do this, it probably means the top and bottom of the can are covered, and the only possible things that a human could have used to reliably make a distinction between the can and the bottle is the occlusion and reflection of the can, which would be a much harder battle to process. However, to go even further, you could follow the angle of the can/bottle to check for more bottle-like traits, using the semi-transparent scanning techniques mentioned in the other answers.
Interesting additional nightmares might include a can conveniently sitting behind the bottle at such a distance that the metal of it just so happens to show above and below the label, which would still fail as long as you're scanning along the entire length of the red label - which is actually more of a problem because you're not detecting a can where you could have, as opposed to considering that you're actually detecting a bottle, including the can by accident. The glass is half empty, in that case!
As a disclaimer, I have no experience in nor have ever thought about image processing outside of this question, but it is so interesting that it got me thinking pretty deeply about it, and after reading all the other answers, I consider this to possibly be the easiest and most efficient way to get it done. Personally, I'm just glad I don't actually have to think about programming this!
EDIT
Additionally, look at this drawing I did in MS Paint... It's absolutely awful and quite incomplete, but based on the shape and colours alone, you can guess what it's probably going to be. In essence, these are the only things that one needs to bother scanning for. When you look at that very distinctive shape and combination of colours so close, what else could it possibly be? The bit I didn't paint, the white background, should be considered "anything inconsistent". If it had a transparent background, it could go over almost any other image and you could still see it.
Am a few years late in answering this question. With the state of the art pushed to its limits by CNNs in the last 5 years I wouldn't use OpenCV to do this task now! (I know you specifically wanted OpenCv features in the question) I feel object detection algorithms such as Faster-RCNNs, YOLO, SSD etc would ace this problem with a significant margin compared to OpenCV features. If I were to tackle this problem now (after 6 years !!) I would definitely use Faster-RCNN.
I'm not aware of OpenCV but looking at the problem logically I think you could differentiate between bottle and can by changing the image which you are looking for i.e. Coca Cola. You should incorporate till top portion of can as in case of can there is silver lining at top of coca cola and in case of bottle there will be no such silver lining.
But obviously this algorithm will fail in cases where top of can is hidden, but in such case even human will not be able to differentiate between the two (if only coca cola portion of bottle/can is visible)
I like the challenge and wanted to give an answer, which solves the issue, I think.
Extract features (keypoints, descriptors such as SIFT, SURF) of the logo
Match the points with a model image of the logo (using Matcher such as Brute Force )
Estimate the coordinates of the rigid body (PnP problem - SolvePnP)
Estimate the cap position according to the rigid body
Do back-projection and calculate the image pixel position (ROI) of the cap of the bottle (I assume you have the intrinsic parameters of the camera)
Check with a method whether the cap is there or not. If there, then this is the bottle
Detection of the cap is another issue. It can be either complicated or simple. If I were you, I would simply check the color histogram in the ROI for a simple decision.
Please, give the feedback if I am wrong. Thanks.
I like your question, regardless of whether it's off topic or not :P
An interesting aside; I've just completed a subject in my degree where we covered robotics and computer vision. Our project for the semester was incredibly similar to the one you describe.
We had to develop a robot that used an Xbox Kinect to detect coke bottles and cans on any orientation in a variety of lighting and environmental conditions. Our solution involved using a band pass filter on the Hue channel in combination with the hough circle transform. We were able to constrain the environment a bit (we could chose where and how to position the robot and Kinect sensor), otherwise we were going to use the SIFT or SURF transforms.
You can read about our approach on my blog post on the topic :)
Deep Learning
Gather at least a few hundred images containing cola cans, annotate the bounding box around them as positive classes, include cola bottles and other cola products label them negative classes as well as random objects.
Unless you collect a very large dataset, perform the trick of using deep learning features for small dataset. Ideally using a combination of Support Vector Machines(SVM) with deep neural nets.
Once you feed the images to a previously trained deep learning model(e.g. GoogleNet), instead of using neural network's decision (final) layer to do classifications, use previous layer(s)' data as features to train your classifier.
OpenCV and Google Net:
http://docs.opencv.org/trunk/d5/de7/tutorial_dnn_googlenet.html
OpenCV and SVM:
http://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html
There are a bunch of color descriptors used to recognise objects, the paper below compares a lot of them. They are specially powerful when combined with SIFT or SURF. SURF or SIFT alone are not very useful in a coca cola can image because they don't recognise a lot of interest points, you need the color information to help. I use BIC (Border/Interior Pixel Classification) with SURF in a project and it worked great to recognise objects.
Color descriptors for Web image retrieval: a comparative study
You need a program that learns and improves classification accuracy organically from experience.
I'll suggest deep learning, with deep learning this becomes a trivial problem.
You can retrain the inception v3 model on Tensorflow:
How to Retrain Inception's Final Layer for New Categories.
In this case, you will be training a convolutional neural network to classify an object as either a coca-cola can or not.
As alternative to all these nice solutions, you can train your own classifier and make your application robust to errors. As example, you can use Haar Training, providing a good number of positive and negative images of your target.
It can be useful to extract only cans and can be combined with the detection of transparent objects.
There is a computer vision package called HALCON from MVTec whose demos could give you good algorithm ideas. There is plenty of examples similar to your problem that you could run in demo mode and then look at the operators in the code and see how to implement them from existing OpenCV operators.
I have used this package to quickly prototype complex algorithms for problems like this and then find how to implement them using existing OpenCV features. In particular for your case you could try to implement in OpenCV the functionality embedded in the operator find_scaled_shape_model. Some operators point to the scientific paper regarding algorithm implementation which can help to find out how to do something similar in OpenCV.
Maybe too many years late, but nevertheless a theory to try.
The ratio of bounding rectangle of red logo region to the overall dimension of the bottle/can is different. In the case of Can, should be 1:1, whereas will be different in that of bottle (with or without cap).
This should make it easy to distinguish between the two.
Update:
The horizontal curvature of the logo region will be different between the Can and Bottle due their respective size difference. This could be specifically useful if your robot needs to pick up can/bottle, and you decide the grip accordingly.
If you are interested in it being realtime, then what you need is to add in a pre-processing filter to determine what gets scanned with the heavy-duty stuff. A good fast, very real time, pre-processing filter that will allow you to scan things that are more likely to be a coca-cola can than not before moving onto more iffy things is something like this: search the image for the biggest patches of color that are a certain tolerance away from the sqrt(pow(red,2) + pow(blue,2) + pow(green,2)) of your coca-cola can. Start with a very strict color tolerance, and work your way down to more lenient color tolerances. Then, when your robot runs out of an allotted time to process the current frame, it uses the currently found bottles for your purposes. Please note that you will have to tweak the RGB colors in the sqrt(pow(red,2) + pow(blue,2) + pow(green,2)) to get them just right.
Also, this is gona seem really dumb, but did you make sure to turn on -oFast compiler optimizations when you compiled your C code?
The first things I would look for are color - like RED , when doing Red eye detection in an image - there is a certain color range to detect , some characteristics about it considering the surrounding area and such as distance apart from the other eye if it is indeed visible in the image.
1: First characteristic is color and Red is very dominant. After detecting the Coca Cola Red there are several items of interest
1A: How big is this red area (is it of sufficient quantity to make a determination of a true can or not - 10 pixels is probably not enough),
1B: Does it contain the color of the Label - "Coca-Cola" or wave.
1B1: Is there enough to consider a high probability that it is a label.
Item 1 is kind of a short cut - pre-process if that doe snot exist in the image - move on.
So if that is the case I can then utilize that segment of my image and start looking more zoom out of the area in question a little bit - basically look at the surrounding region / edges...
2: Given the above image area ID'd in 1 - verify the surrounding points [edges] of the item in question.
A: Is there what appears to be a can top or bottom - silver?
B: A bottle might appear transparent , but so might a glass table - so is there a glass table/shelf or a transparent area - if so there are multiple possible out comes. A Bottle MIGHT have a red cap, it might not, but it should have either the shape of the bottle top / thread screws, or a cap.
C: Even if this fails A and B it still can be a can - partial..
This is more complex when it is partial because a partial bottle / partial can might look the same , so some more processing of measurement of the Red region edge to edge.. small bottle might be similar in size ..
3: After the above analysis that is when I would look at the lettering and the wave logo - because I can orient my search for some of the letters in the words As you might not have all of the text due to not having all of the can, the wave would align at certain points to the text (distance wise) so I could search for that probability and know which letters should exist at that point of the wave at distance x.

Target Detection - Algorithm suggestions

I am trying to do image detection in C++. I have two images:
Image Scene: 1024x786
Person: 36x49
And I need to identify this particular person from the scene. I've tried to use Correlation but the image is too noisy and therefore doesn't give correct/accurate results.
I've been thinking/researching methods that would best solve this task and these seem the most logical:
Gaussian filters
Convolution
FFT
Basically, I would like to move the noise around the images, so then I can use Correlation to find the person more effectively.
I understand that an FFT will be hard to implement and/or may be slow especially with the size of the image I'm using.
Could anyone offer any pointers to solving this? What would the best technique/algorithm be?
In Andrew Ng's Machine Learning class we did this exact problem using neural networks and a sliding window:
train a neural network to recognize the particular feature you're looking for using data with tags for what the images are, using a 36x49 window (or whatever other size you want).
for recognizing a new image, take the 36x49 rectangle and slide it across the image, testing at each location. When you move to a new location, move the window right by a certain number of pixels, call it the jump_size (say 5 pixels). When you reach the right-hand side of the image, go back to 0 and increment the y of your window by jump_size.
Neural networks are good for this because the noise isn't a huge issue: you don't need to remove it. It's also good because it can recognize images similar to ones it has seen before, but are slightly different (the face is at a different angle, the lighting is slightly different, etc.).
Of course, the downside is that you need the training data to do it. If you don't have a set of pre-tagged images then you might be out of luck - although if you have a Facebook account you can probably write a script to pull all of yours and your friends' tagged photos and use that.
A FFT does only make sense when you already have sort the image with kd-tree or a hierarchical tree. I would suggest to map the image 2d rgb values to a 1d curve and reducing some complexity before a frequency analysis.
I do not have an exact algorithm to propose because I have found that target detection method depend greatly on the specific situation. Instead, I have some tips and advices. Here is what I would suggest: find a specific characteristic of your target and design your code around it.
For example, if you have access to the color image, use the fact that Wally doesn't have much green and blue color. Subtract the average of blue and green from the red image, you'll have a much better starting point. (Apply the same operation on both the image and the target.) This will not work, though, if the noise is color-dependent (ie: is different on each color).
You could then use correlation on the transformed images with better result. The negative point of correlation is that it will work only with an exact cut-out of the first image... Not very useful if you need to find the target to help you find the target! Instead, I suppose that an averaged version of your target (a combination of many Wally pictures) would work up to some point.
My final advice: In my personal experience of working with noisy images, spectral analysis is usually a good thing because the noise tend to contaminate only one particular scale (which would hopefully be a different scale than Wally's!) In addition, correlation is mathematically equivalent to comparing the spectral characteristic of your image and the target.