How does the Viola-Jones face detection method work?

How does the Viola-Jones face detection method work? - c++

Please explain to me, in few words, how the Viola-Jones face detection method works.

The Viola-Jones detector is a strong, binary classifier build of several weak
detectors
Each weak detector is an extremely simple binary classifier
During the learning stage, a cascade of weak detectors is trained so as to
gain the desired hit rate / miss rate (or precision / recall) using Adaboost
To detect objects, the original image is partitioned in several rectangular
patches, each of which is submitted to the cascade
If a rectangular image patch passes through all of the cascade stages, then
it is classified as “positive”
The process is repeated at different scales
Actually, at a low level, the
basic component of an object detector
is just something required to say if
a certain sub-region of the original
image contains an istance of the
object of interest or not. That is
what a binary classifier does.
The basic, weak classifier is based on a very simple visual feature (those
kind of features are often referred to as “Haar-like features”)
Haar-like features consist of a class of local
features that are calculated by subtracting the sum of a
subregion of the feature from the sum of the remaining
region of the feature.
These feature are characterised by the fact that they are easy to calculate and with the use of an integral image, very efficient to calculate.
Lienhart introduced an extended set of twisted Haar-like feature (see image)
These are the standard Haar-like feature that have been twisted by 45 degrees. Lienhart did not originally make use of the twisted checker board Haar-like feature (x2y2) since the diagonal elements that they represent can be simply represented using twisted
features, however it is clear that a twisted version of this feature can also be implemented and used.
These twisted Haar-like features can also be fast and efficiently calculated using an integral image that has been twisted 45 degrees. The only implementation issue is that
the twisted features must be rounded to integer values so that they are aligned with pixel boundaries. This process is similar to the rounding used when scaling a Haar-like
feature for larger or smaller windows, however one difference is that for a 45 degrees
twisted feature, the integer number of pixels used for the height and width of the
feature mean that the diagonal coordinates of the pixel will be always on the same diagonal set of pixels
This means that the number of different sized 45 degrees twisted features available is significantly reduced as compared to the standard vertically and horizontally
aligned features.
So we have something like:
About the formula, the Fast computation of Haar-like features using integral images looks like:
Finally, here is a c++ implementation which uses ViolaJones.h by Ivan Kusalic
to see the complete c++ project go here

The Viola-Jones detector is a strong binary classifier build of several weak detectors. Each weak detector is an extremely simple binary classifier
The detection consists of below parts:
Haar Filter: extract features from image to calssify(features act to encode ad-hoc domain knowledge)
Integral Image: allows for very fast feature evaluation
Cascade Classifier: A cascade classifier consists of multiple stages of filters, to classify a image( sliding window of a image) is a face.
Below is an overview of how to detect a face in image.
A detection window shifts around the whole image extract feature(by haar filter computed by Integral Image then send the extracted feature to Cascade Classifier to classify if it is a face). The sliding window shifts pixel-by-pixel. Each time the window shifts, the image region within the window will go through the cascade classifier.
Haar Filter: You can understand the the filter can extract features like eyes, bridge of the nose and so on.
Integral Image: allows for very fast feature evaluation
Cascade Classifier:
A cascade classifier consists of multiple stages of filters, as shown in the figure below. Each time the sliding window shifts, the new region within the sliding window will go through the cascade classifier stage-by-stage. If the input region fails to pass the threshold of a stage, the cascade classifier will immediately reject the region as a face. If a region pass all stages successfully, it will be classified as a candidate of face, which may be refined by further processing.
For more details:
Firstly, I suggest you to read the source paper Rapid Object Detection using a Boosted Cascade of Simple Features to have a overview understanding of the method.
If you can't understand it clearly, you can see Viola-Jones Face Detection or Implementing the Viola-Jones Face Detection Algorithm or Study of Viola-Jones Real Time Face Detector for more details.
Here is a python code Python implementation of the face detection algorithm by Paul Viola and Michael J. Jones.
matlab code here .

Related

What is the best method to train the faces with FaceRecognizer OpenCV to have the best result?

Here I say that I have tried many tutorials to implement face recognition in OpenCV 3.2 by using the FaceRecognizer class in face module. But I did not get the accepted result as I wish.
Here I want to ask and I want to know, that what is the best way or what are the conditions to be care off during training and recognizing?
What I have done to improve the accuracy:
Create (at least) 10 faces for training each person in the best quality, size, and angle.
Try to fit the face in the image.
Equalize the HIST of the images
And then I have tried all the three face recognizer (EigenFaceRecognizer, FisherFaceRecognizer, LBPHFaceRecognizer), the result all was the same, but the recognition rate was really very low, I have trained only for three persons, but also cannot recognize very well (the fist person was recognized as the second and so on problems).
Questions:
Do the training and the recognition images must be from the same
camera?
Do the training images cropped manually (photoshop -> read images then train) or this task
must be done programmatically (detect-> crop-> resize then train)?
And what are the best parameters for the each face recognizer (int num_components, double threshold)
And how to set training Algorithm to return -1 when it is an unknown
person.

Expanding on my comment, Chapter 8 in Mastering OpenCV provides really helpful tips for pre-processing faces to make aid the recognition process, such as:
taking a sample only when both eyes are detected (via haar cascade)
Geometrical transformation and cropping: This process would include scaling, rotating, and translating the images so that the eyes are aligned, followed by the removal of the forehead, chin, ears, and background from the face image.
Separate histogram equalization for left and right sides: This process standardizes the brightness and contrast on both the left- and right-hand sides of the face independently.
Smoothing: This process reduces the image noise using a bilateral filter.
Elliptical mask: The elliptical mask removes some remaining hair and background from the face image.
I've added a hacky load/save to my fork of the example code, feel free to try it/tweak it as you need it. Currently it's very limited, but it's a start.
Additionally, you should also check OpenFace and it's DNN face recognizer.
I haven't played with that yet so can't provide details, but it looks really cool.

Ratio of positive to negative data to use when training a cascade classifier (opencv)

So I'm using OpenCV's LBP detector. The shapes I'm detecting are all roughly circular (differing mostly in aspect ratio), with some wide changes in brightness/contrast, and a little bit of occlusion.
OpenCV's guide on how to train the detector is here
My main question to anyone with experience using it is how numPos and numNeg should be in relation to eachother? I have roughly 1000 positive samples (so ~900 being used per stage)
What I need to decide is how many negative samples to use per stage for training. I have about 20000 images from which to draw negative data, so redundancy isn't really an issue.
In general the rule I hear is 1:2, but that seems like under-utilization, given how much negative data I have at my disposal. On the flip side, what effects should I expect if I train my detector with 1:20? How should I determine the proper ratio?

Viewpoint invariant detection and recognition of simple 3d objects from image

I have a set of simple rigid 3D objects that I wish to detect and recognize from an image (let's say 5 to 10 classes). The objects are simple in sense that they are cylinders in one color or rectangles with simple patterns (stripes for example) or some similarly simple shape. The objects are significantly different from one another (there aren't for example two classes where one is a large cylinder and another one is the same but smaller cylinder).
Because the textures are pretty simple (solids and/or simple patterns), bag-of-words approach fails (they do not contain significant number of unique edges).
While one possible approach is coding manually each classifier (manual feature extraction etc), is there a simple data driven approach (Haar/LBP classifier for example) that would work? If Haar or LBP are good for solving this problem, how would one solve the problem of unknown relative viewpoint (and by such perspective distortion, rotation, etc)? Would just providing positive images from all possible viewpoints for an object converge or is there something else that's usually done? The detection and recognition should run in real-time.

Based on your description of your problem, I see several drawbacks of a Haar or LBP-based detector. First, these features do not use color, which seems to be important here. Second, a classifier using Haar or LBP features is sensitive to in-plane and out-of-plane rotation. If your objects can be in any 3D orientation, you would need to discretize the range of 3D rotations and train a separate detector for each one. For example, for face detection you typically use two detectors: one for frontal faces, and one for profile faces. Finally, if there is not enough texture for bag-of-words, there also may not be enough texture for Haar or LBP.
Since your objects are simple 3D shapes, I would start by trying to detect straight lines and circles using the Hough transform, and trying to group them to form the object's outlines.

Applying Haar-like features to an image / defining the features

I know the general idea of haar-like features and how a shape is computed using the integral image.
However my question is, after defining a shape and computing the integral image how to get the feature.
Meaning, do I apply the shape on every possible position (similar like a Gaussian filter)?
Is the integral image tiled and the shape is computed on each tile?
Or is the position of the shape in the image fixed and has to be predefined?
After that what exactly is the feature the classifier is trained on? E.g. if the image is tiled, would the new 'image' (combining all tiles to a vector) be the feature or would each tile be a feature on its own?
Everything I've found about it just said 'plug it into code library XY'.

A feature for the haar-like feature algorithm is a single shape located in the selected window.
So each feature is binary valued and includes both the 'shape' of the feature and its relative position in the detection window.
The image is processed by selecting many subwindows. The goal then is to discard any subwindows, not representing the desired object, as soon as possible.
This is accomplished by applying above features to each subwindow. From this feature set a classifier is learned.
In case of the Viola-Jones detection framework a chain of classifiers is used, where the first classifiers use less features, therefore being faster to compute.
If a classifier in the chain discards the subwindow further computation on this window is stopped.
The paper by Paul Viola and Michael Jones can be found here. Another useful paper on haar-like feature detection was published by Sri-Kaushik Pavani, David Delgadoand Alejandro F. Frangi under the name Haar-like features with optimally weighted rectangles for rapid object detection.

OpenCV detect image against a image set

I would like to know how I can use OpenCV to detect on my VideoCamera a Image. The Image can be one of 500 images.
What I'm doing at the moment:
- (void)viewDidLoad
{
[super viewDidLoad];
// Do any additional setup after loading the view.
self.videoCamera = [[CvVideoCamera alloc] initWithParentView:imageView];
self.videoCamera.delegate = self;
self.videoCamera.defaultAVCaptureDevicePosition = AVCaptureDevicePositionBack;
self.videoCamera.defaultAVCaptureSessionPreset = AVCaptureSessionPresetHigh;
self.videoCamera.defaultAVCaptureVideoOrientation = AVCaptureVideoOrientationPortrait;
self.videoCamera.defaultFPS = 30;
self.videoCamera.grayscaleMode = NO;
}
-(void)viewDidAppear:(BOOL)animated{
[super viewDidAppear:animated];
[self.videoCamera start];
}
#pragma mark - Protocol CvVideoCameraDelegate
#ifdef __cplusplus
- (void)processImage:(cv::Mat&)image;
{
// Do some OpenCV stuff with the image
cv::Mat image_copy;
cvtColor(image, image_copy, CV_BGRA2BGR);
// invert image
//bitwise_not(image_copy, image_copy);
//cvtColor(image_copy, image, CV_BGR2BGRA);
}
#endif
The images that I would like to detect are 2-5kb small. Few got text on them but others are just signs. Here a example:
Do you guys know how I can do that?

There are several things in here. I will break down your problem and point you towards some possible solutions.
Classification: Your main task consists on determining if a certain image belongs to a class. This problem by itself can be decomposed in several problems:
Feature Representation You need to decide how you are gonna model your feature, i.e. how are you going to represent each image in a feature space so you can train a classifier to separate those classes. The feature representation by itself is already a big design decision. One could (i) calculate the histogram of the images using n bins and train a classifier or (ii) you could choose a sequence of random patches comparison such as in a random forest. However, after the training, you need to evaluate the performance of your algorithm to see how good your decision was.
There is a known problem called overfitting, which is when you learn too well that you can not generalize your classifier. This can usually be avoided with cross-validation. If you are not familiar with the concept of false positive or false negative, take a look in this article.
Once you define your feature space, you need to choose an algorithm to train that data and this might be considered as your biggest decision. There are several algorithms coming out every day. To name a few of the classical ones: Naive Bayes, SVM, Random Forests, and more recently the community has obtained great results using Deep learning. Each one of those have their own specific usage (e.g. SVM ares great for binary classification) and you need to be familiar with the problem. You can start with simple assumptions such as independence between random variables and train a Naive Bayes classifier to try to separate your images.
Patches: Now you mentioned that you would like to recognize the images on your webcam. If you are going to print the images and display in a video, you need to handle several things. it is necessary to define patches on your big image (input from the webcam) in which you build a feature representation for each patch and classify in the same way you did in the previous step. For doing that, you could slide a window and classify all the patches to see if they belong to the negative class or to one of the positive ones. There are other alternatives.
Scale: Considering that you are able to detect the location of images in the big image and classify it, the next step is to relax the toy assumption of fixes scale. To handle a multiscale approach, you could image pyramid which pretty much allows you to perform the detection in multiresolution. Alternative approaches could consider keypoint detectors, such as SIFT and SURF. Inside SIFT, there is an image pyramid which allows the invariance.
Projection So far we assumed that you had images under orthographic projection, but most likely you will have slight perspective projections which will make the whole previous assumption fail. One naive solution for that would be for instance detect the corners of the white background of your image and rectify the image before building the feature vector for classification. If you used SIFT or SURF, you could design a way of avoiding explicitly handling that. Nevertheless, if your input is gonna be just squares patches, such as in ARToolkit, I would go for manual rectification.
I hope I might have given you a better picture of your problem.

I would recommend using SURF for that, because pictures can be on different distances form your camera, i.e changing the scale. I had one similar experiment and SURF worked just as expected. But SURF has very difficult adjustment (and expensive operations), you should try different setups before you get the needed results.
Here is a link: http://docs.opencv.org/modules/nonfree/doc/feature_detection.html
youtube video (in C#, but can give an idea): http://www.youtube.com/watch?v=zjxWpKCQqJc

I might not be qualified enough to answer this problem. Last time I seriously use OpenCV it was still 1.1. But just some thought on it, and hope it would help (currently I am interested in DIP and ML).
I think it will probably an easier task if you only need to classify an image, if the image is just one from (or very similar to) your 500 images. For this you could use SVM or some neural network (Felix already gave an excellent enumeration on that).
However, your problem seems to be that you need to first find this candidate image in your webcam, the location of which you have little clue beforehand. (let us know whether it is so. I think it is important.)
If so, the harder problem is the detection/localization of your candidate image.
I don't have a general solution for that. The first thing I would do is to see if there is some common feature in your 500 images (e.g., whether all of them enclosed by a red circle, or, half of them have circle and half of them have rectangle). If this can be done, the problem will be simpler (it would be similar to face detection problem, which have good solution).
In other words, this means that you first classify the 500 images to a few groups with common feature (by human), and detect the group first, then scale and use above mentioned technique to classify them into fine result. In this way, it will be more computationally acceptable than trying to detect 500 images one by one.
BTW, this ppt would help to give a visual clue of what is going on for feature extraction and image matching http://courses.cs.washington.edu/courses/cse455/09wi/Lects/lect6.pdf.

Detect vs recognize: detecting the image is just finding it on the background and from your comments I realized you may have your sings surrounded by the background. It might facilitate your algorithm if you can somehow crop your signs from the background (detect) before trying to recognize them. Recognizing is a next stage that presumes you can classify the cropped image correctly as the one seen before.
If you need real time speed and scale/rotation invariance neither SIFT no SURF will do this fast. Nowadays you can do much better if you shift the burden of image processing to a learning stage as was done by Lepitit. In short, he subjected each pattern to a bunch of affine transformations and trained a binary classification tree to recognize each point correctly by doing a lot of binary comparison tests. Trees are extremely fast and a way to go not to mention that most of the processing is done offline. This method is also more robust to off-plane rotations than SIFT or SURF. You will also learn about tree classification which may facilitate you last processing stage.
Finally a recognition stage is based not only on the number of matches but also on their geometric consistency. Since your signs look flat I suggest finding either affine or homography transformation that has most inliers when calculated between matched points.
Looking at your code though I realized that you may not follow any of these recommendations. It may be a good starting point for you to read about decision trees and then play with some sample code (see mushroom.cpp in the above mentioned link)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js