What is an ROC curve? - computer-vision

Can someone be kind enough to explain what an ROC curve actually represents with respect to tracking in a test sequence please? An example of an ROC curve is shown in the figure below.

The comments to the original question contain some very useful links for understanding ROC curves in general and the discrimination threshold in question. Here is an attempt to explain the reference (Ref. 1) used by the OP, along with further information specific to the problem of detecting pedestrians.
How the ROC curves are obtained in (Ref. 1), and what the discrimination threshold is in this case:
Motion filters and appearance filters ("f_i" in eq. (2), p. 156) are evaluated using "integral images" of various temporal/spatial difference images from the video sequences.
Using all these filters, the learning algorithm builds the best classifier (C in eq. (1), p. 156) separating positive examples (e.g., pedestrians) from negative examples (e.g., a selection of non-pedestrian examples). The classifier, C, is a thresholded sum of features, F, as given in eq. (1). A feature, F, is a filter "f_i" thresholded by a feature threshold "t_i".
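A toy Python sketch of that structure (the filters, their thresholds, and the classifier threshold below are placeholders, not the values AdaBoost would actually learn, and the per-feature weights from the paper are omitted for brevity):

import numpy as np

def boosted_classifier(patch, filters, filter_thresholds, theta):
    """Thresholded sum of features in the spirit of eq. (1):
    feature F_i fires when the filter response f_i(patch) exceeds its threshold t_i,
    and the patch is accepted when the sum of features exceeds Theta."""
    F = [1.0 if f(patch) > t_i else 0.0
         for f, t_i in zip(filters, filter_thresholds)]
    return int(np.sum(F) > theta)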
The thresholds involved (i.e., the filter thresholds "t_i" and the classifier threshold "Theta") are computed during AdaBoost training, which chooses the features with the lowest weighted error on the training examples.
As in (Ref. 2), a cascade of such classifiers is used to make the detector extremely efficient. During training, each stage of the cascade (a boosted classifier) is trained using 2250 positive examples (example in Fig. 5, p. 158) and 2250 negative examples.
The final cascade detector is run over validation sequences to obtain the true positive rate and the false positive rate. This is based on comparing the output of the cascade detector (e.g., presence or absence of a pedestrian) to the ground truth (presence or absence of a pedestrian in the same region, based on ground-truth annotation or manual review of the video sequence). For a given set of threshold values for the entire cascade ("t_i" and "Theta" over all classifiers in the cascade), a certain true positive rate and false positive rate are obtained. This gives one point on the ROC curve.
A simple MATLAB example for measuring True Positive Rate and False Positive Rate from a given set of classifier outputs and ground truth can be found here: http://www.mathworks.com/matlabcentral/fileexchange/21212-confusion-matrix---matching-matrix-along-with-precision--sensitivity--specificity-and-model-accuracy
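As a rough Python counterpart to that MATLAB example (the function below is a hypothetical helper, assuming binary detector outputs and ground-truth labels per region):

import numpy as np

def tpr_fpr(predictions, ground_truth):
    """True positive rate and false positive rate from binary labels."""
    predictions = np.asarray(predictions, dtype=bool)
    ground_truth = np.asarray(ground_truth, dtype=bool)

    tp = np.sum(predictions & ground_truth)    # detected pedestrian, pedestrian present
    fp = np.sum(predictions & ~ground_truth)   # detected pedestrian, none present
    fn = np.sum(~predictions & ground_truth)   # missed pedestrian
    tn = np.sum(~predictions & ~ground_truth)  # correctly rejected region

    return tp / (tp + fn), fp / (fp + tn)      # (TPR, FPR)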
So in this case, each point on the ROC curve depends on the thresholds chosen for all the cascade layers (hence the discrimination threshold is not a single number in this case). By adjusting these thresholds, one at a time, and repeating the evaluation step above, the true positive rate and false positive rate change, giving the other points on the ROC curve.
This process is repeated for both the dynamic and the static detector to obtain the two ROC curves in the figure.
Some more good descriptions and examples of ROC curves can be found through this link on ROC:
ROC curves can be used to compare the performance of classifiers in distinguishing between classes, as example, pedestrian versus non-pedestrian input samples. The area under the ROC curve is used as a measure of the classifier's ability to distinguish between the classes.
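If the classifier produces a continuous score rather than a hard decision, the whole curve can be traced by sweeping a single threshold; a minimal scikit-learn sketch with toy placeholder data:

import numpy as np
from sklearn.metrics import roc_curve, auc

# toy placeholder data: 1 = pedestrian, 0 = non-pedestrian
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])  # detector confidence scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)  # area under the ROC curve

# each (fpr[i], tpr[i]) pair is one point on the ROC curve,
# obtained at discrimination threshold thresholds[i]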
Quotes are from:
(Ref. 1) P. Viola, M. J. Jones, and D. Snow, "Detecting Pedestrians Using Patterns of Motion and Appearance", International Journal of Computer Vision 63(2), 153-161, 2005. [online: as of April 2015] http://lear.inrialpes.fr/people/triggs/student/vj/viola-ijcv05.pdf
(Ref. 2) P. Viola and M. J. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features", in IEEE Conference on Computer Vision and Pattern Recognition, 2001. More information at Viola-Jones Algorithm - "Sum of Pixels"?

Related

Using class weight to balance data set lowers accuracy in RBF SVM

I have been using sklearn to learn on some data. This is a binary classification task and I am using an RBF kernel. My data set is quite unbalanced (80:20) and I'm using only 120 samples, with roughly 10 features (I've been experimenting with a few less). Since I set class_weight="auto", the accuracy I've calculated from a cross-validated (10-fold) grid search has dropped dramatically. Why?
I will include a couple of validation accuracy heatmaps to demonstrate the difference.
NOTE: top heatmap is before classweight was changed to auto.
Accuracy is not the best metric to use when dealing with an unbalanced dataset. Say you have 99 positive examples and 1 negative example: if you predict all outputs to be positive, you still get 99% accuracy, even though you have misclassified the only negative example. You might have gotten high accuracy in the first case because your predictions fall on the side that has the larger number of samples.
When you set class_weight to "auto", it takes the imbalance into consideration, and hence your predictions may have moved towards the centre; you can cross-check this by plotting histograms of the predictions.
My suggestion is: don't use accuracy as the performance metric; use something like the F1 score or AUC.
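A minimal sketch of that suggestion with scikit-learn (the synthetic data mirrors the question's 120-sample, 80:20 setting, and the parameter grid is only an illustrative guess):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# stand-in for the real data: 120 samples, ~10 features, 80:20 imbalance
X, y = make_classification(n_samples=120, n_features=10, weights=[0.8, 0.2], random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
grid = GridSearchCV(
    SVC(kernel="rbf", class_weight="balanced"),  # "balanced" is the current name for "auto"
    param_grid,
    scoring="roc_auc",  # or "f1"; both are more informative than accuracy here
    cv=10,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)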

Ratio of positive to negative data to use when training a cascade classifier (opencv)

So I'm using OpenCV's LBP detector. The shapes I'm detecting are all roughly circular (differing mostly in aspect ratio), with some wide changes in brightness/contrast, and a little bit of occlusion.
OpenCV's guide on how to train the detector is here
My main question to anyone with experience using it is how numPos and numNeg should relate to each other. I have roughly 1000 positive samples (so ~900 being used per stage).
What I need to decide is how many negative samples to use per stage for training. I have about 20000 images from which to draw negative data, so redundancy isn't really an issue.
In general the rule I hear is 1:2, but that seems like under-utilization, given how much negative data I have at my disposal. On the flip side, what effects should I expect if I train my detector with 1:20? How should I determine the proper ratio?

What does eigen value of structure tensor matrix denote?

It is known that a good feature point across two images can be determined properly if the two eigenvalues of the above matrix are greater than 0. Can someone explain what it means to have both eigenvalues greater than 0, and why the feature point is not good if either of them is approximately equal to 0?
Note that this matrix always has nonnegative eigenvalues. Basically this rule says that one should favor rapid change in all directions, that is corners are better features than edges or flat surfaces.
The biggest eigenvalue corresponds to the eigenvector pointing towards the direction of the most significant change in the image at the point u.
If the two eigenvalues are small the image at point u does not change much.
If one of the eigenvalues is large and the other is small, this point might lie on an edge in the image, but it will be difficult to figure out where exactly on that edge.
If both are large, the point is like a corner.
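A minimal numpy sketch of this criterion (the window size is an arbitrary choice, and the gradients are plain finite differences rather than the smoothed derivatives a real detector would use):

import numpy as np

def structure_tensor_eigenvalues(gray, x, y, half_win=2):
    """Eigenvalues of the 2x2 structure tensor in a small window around (x, y)."""
    Iy, Ix = np.gradient(gray.astype(float))  # image derivatives along y and x

    win = np.s_[y - half_win:y + half_win + 1, x - half_win:x + half_win + 1]
    Ixx = np.sum(Ix[win] ** 2)
    Iyy = np.sum(Iy[win] ** 2)
    Ixy = np.sum(Ix[win] * Iy[win])

    M = np.array([[Ixx, Ixy],
                  [Ixy, Iyy]])                    # symmetric, positive semi-definite
    lam_small, lam_large = np.linalg.eigvalsh(M)  # eigenvalues in ascending order
    return lam_small, lam_large                   # both large -> corner; one ~0 -> edge; both ~0 -> flat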
There is a nice presentation with examples in the panoramic stitching slide deck from a course taught by Rajesh Rao at the University of Washington.
Here E(u,v) denotes the Euclidean distance between the two areas in the vicinities of pixels shifted by the vector (u,v) from each other. This distance tells you how easy it is to distinguish the two pixels from one another.
Edit: The matrix of image derivatives is denoted H in this illustration, probably because of its relation to the Harris corner detection algorithm.
That is related to the concept of texturedness in the Shi-Tomasi paper "Good Features to Track".
The idea of texturedness is to provide a rating of texture that makes features (within a window) identifiable and unique. For instance, lines are not good features since they are not unique (see Figure 3.9a).
To solve the optical flow equation, it must be possible to invert J (the Hessian matrix). In practice, the following conditions must be satisfied:
Eigenvalues of J cannot differ by several orders of magnitude.
The eigenvalues of the Hessian must overcome the image noise level λnoise: this implies that both eigenvalues of J must be large.
For the first condition we know that the greatest eigenvalue cannot be arbitrarily large because intensity variations in a window are bounded by the maximum allowable pixel value.
Regarding the second condition, with λ1 and λ2 being the two eigenvalues of J, the following situations may arise (see Figure 3.10):
• Two small eigenvalues λ1 and λ2: a roughly constant intensity profile within the window (pink region). The problem of Figure 3.9-b.
• A large and a small eigenvalue: a unidirectional texture pattern (violet or gray region). The problem of Figure 3.9-a.
• λ1 and λ2 both large: can represent a corner, salt-and-pepper texture, or any other pattern that can be tracked reliably (green region).
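This minimum-eigenvalue criterion is essentially what OpenCV's Shi-Tomasi detector implements; a minimal usage sketch (the file name and parameter values are illustrative only):

import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# Shi-Tomasi: keep points whose smaller structure-tensor eigenvalue exceeds
# qualityLevel times the largest such eigenvalue found in the image
corners = cv2.goodFeaturesToTrack(gray,
                                  maxCorners=200,
                                  qualityLevel=0.01,
                                  minDistance=10,
                                  blockSize=5)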
Some references:
1 - Ortiz Cayon, R. J. (2013). Online video stabilization for UAV: Motion estimation and compensation for unmanned aerial vehicles.
2 - Shi, J., & Tomasi, C. (1994). Good features to track. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '94), pp. 593-600. IEEE.
3 - Szeliski, R. (2006). Image alignment and stitching: A tutorial. Foundations and Trends in Computer Graphics and Vision, 2(1), 1-104.

Estimating/Choosing optimal Hyperparameters for DBSCAN

I need to find naturally occurring classes of nouns based on their distribution with different prepositions (agentive, instrumental, time, place, etc.). I tried k-means clustering, but it was of little help: there was a lot of overlap between the classes I was looking for (probably because of the non-globular shape of the classes and the random initialisation in k-means).
I am now working with DBSCAN, but I have trouble understanding the epsilon value and the minimum-points (minPts) value in this clustering algorithm. Can I use random values, or do I need to compute them? Can anybody help, particularly with epsilon: how do I compute it if I need to?
Use your domain knowledge to choose the parameters. Epsilon is a radius. You can think of it as a minimum cluster size.
Obviously random values won't work very well. As a heuristic, you can try to look at a k-distance plot; but it's not automatic either.
The first thing to do either way is to choose a good distance function for your data. And perform appropriate normalization.
As for "minPts" it again depends on your data and needs. One user may want a very different value than another. And of course minPts and Epsilon are coupled. If you double epsilon, you will roughly need to increase your minPts by 2^d (for Euclidean distance, because that is how the volume of a hypersphere increases!)
If you want lots of small and fine detailed clusters, choose a low minpts. If you want larger and fewer clusters (and more noise), use a larger minpts. If you don't want any clusters at all, choose minpts larger than your data set size...
It is highly important to select the hyperparameters of the DBSCAN algorithm correctly for your dataset and the domain it belongs to.
eps hyperparameter
In order to determine the best value of eps for your dataset, use the K-Nearest Neighbours approach as explained in these two papers: Sander et al. 1998 and Schubert et al. 2017 (both papers from the original DBSCAN authors).
Here's a condensed version of their approach:
If you have N-dimensional data to begin with, then choose n_neighbors in sklearn.neighbors.NearestNeighbors to be equal to 2*N - 1, and find the distances of the K-nearest neighbors (K being 2*N - 1) for each point in your dataset. Sort these distances and plot them to find the "elbow" which separates the noisy points (with high K-nearest-neighbor distance) from the points (with relatively low K-nearest-neighbor distance) that will most likely fall into a cluster. The distance at which this "elbow" occurs is your optimal value of eps.
Here's some python code to illustrate how to do this:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def get_kdist_plot(X=None, k=None, radius_nbrs=1.0):
    nbrs = NearestNeighbors(n_neighbors=k, radius=radius_nbrs).fit(X)

    # For each point, compute distances to its k-nearest neighbors
    distances, indices = nbrs.kneighbors(X)

    distances = np.sort(distances, axis=0)
    distances = distances[:, k - 1]

    # Plot the sorted K-nearest neighbor distance for each point in the dataset
    plt.figure(figsize=(8, 8))
    plt.plot(distances)
    plt.xlabel('Points/Objects in the dataset', fontsize=12)
    plt.ylabel('Sorted {}-nearest neighbor distance'.format(k), fontsize=12)
    plt.grid(True, linestyle="--", color='black', alpha=0.4)
    plt.show()
    plt.close()

k = 2 * X.shape[-1] - 1  # k = 2 * {dim(dataset)} - 1
get_kdist_plot(X=X, k=k)
Here's an example resultant plot from the code above:
From the plot above, it can be inferred that the optimal value for eps can be assumed at around 22 for the given dataset.
NOTE: I would strongly advise the reader to refer to the two papers cited above (especially Schubert et al. 2017) for additional tips on how to avoid several common pitfalls when using DBSCAN, as well as other clustering algorithms.
There are a few articles online (DBSCAN Python Example: The Optimal Value For Epsilon (EPS), and CoronaVirus Pandemic and Google Mobility Trend EDA) which basically use the same approach but fail to mention the crucial choice of K, or n_neighbors, as 2*N - 1 when performing the above procedure.
min_samples hyperparameter
As for the min_samples hyperparameter, I agree with the suggestions in the accepted answer. Also, a general guideline for choosing this hyperparameter's optimal value is that it should be set to twice the number of features (Sander et al. 1998). For instance, if each point in the dataset has 10 features, a starting point to consider for min_samples would be 20.
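Putting the two choices together, a minimal sketch (X is your feature matrix, and the eps value is whatever you read off your own k-distance elbow, not a recommendation):

from sklearn.cluster import DBSCAN

eps_from_elbow = 22.0          # placeholder: taken from the k-distance plot above
min_samples = 2 * X.shape[-1]  # rule of thumb: twice the number of features

labels = DBSCAN(eps=eps_from_elbow, min_samples=min_samples).fit_predict(X)
# points labelled -1 are treated as noise by DBSCAN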

How does the Viola-Jones face detection method work?

Please explain to me, in few words, how the Viola-Jones face detection method works.
The Viola-Jones detector is a strong, binary classifier built from several weak detectors:
• Each weak detector is an extremely simple binary classifier.
• During the learning stage, a cascade of weak detectors is trained so as to gain the desired hit rate / miss rate (or precision / recall) using AdaBoost.
• To detect objects, the original image is partitioned into several rectangular patches, each of which is submitted to the cascade.
• If a rectangular image patch passes through all of the cascade stages, then it is classified as "positive".
• The process is repeated at different scales.
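A minimal Python sketch of that detection loop (the stage classifiers, window size, and step are hypothetical placeholders; it only illustrates the control flow of a cascade over sliding windows):

def cascade_detect(image, stages, win=24, step=4):
    """Slide a window over the image; a patch is positive only if every stage accepts it."""
    detections = []
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, step):
        for x in range(0, w - win + 1, step):
            patch = image[y:y + win, x:x + win]
            # early rejection: most patches are discarded by the first few stages
            if all(stage(patch) for stage in stages):
                detections.append((x, y, win, win))
    return detections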
Actually, at a low level, the basic component of an object detector is just something required to say whether a certain sub-region of the original image contains an instance of the object of interest or not. That is what a binary classifier does.
The basic, weak classifier is based on a very simple visual feature (these kinds of features are often referred to as "Haar-like features"). Haar-like features are a class of local features that are calculated by subtracting the sum of one subregion of the feature from the sum of the remaining region of the feature.
These features are characterised by the fact that they are simple and, with the use of an integral image, very efficient to calculate.
Lienhart introduced an extended set of tilted ("twisted") Haar-like features (see image). These are the standard Haar-like features rotated by 45 degrees. Lienhart did not originally make use of the tilted checkerboard Haar-like feature (x2y2), since the diagonal elements it represents can be simply represented using tilted features; however, it is clear that a tilted version of this feature can also be implemented and used.
These tilted Haar-like features can also be calculated quickly and efficiently using an integral image that has itself been tilted by 45 degrees. The only implementation issue is that the tilted features must be rounded to integer values so that they are aligned with pixel boundaries. This process is similar to the rounding used when scaling a Haar-like feature for larger or smaller windows; one difference is that, for a 45-degree tilted feature, using an integer number of pixels for the height and width of the feature means that the diagonal coordinates of the pixels will always lie on the same diagonal set of pixels.
This means that the number of differently sized 45-degree tilted features available is significantly reduced compared to the standard vertically and horizontally aligned features.
So we have a set of rectangular features like those shown above. As for the formula, the fast computation of Haar-like features using integral images works as sketched below.
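A small numpy sketch of that computation (the image and rectangle coordinates are arbitrary; the point is that any rectangle sum costs only four lookups in the integral image):

import numpy as np

def integral_image(img):
    """ii[y, x] = sum of all pixels above and to the left of (x, y), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] from four integral-image lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

# a two-rectangle Haar-like feature: difference between two adjacent rectangles
img = np.random.rand(24, 24)
ii = integral_image(img)
feature = rect_sum(ii, 0, 0, 11, 23) - rect_sum(ii, 12, 0, 23, 23)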
Finally, here is a C++ implementation which uses ViolaJones.h by Ivan Kusalic; to see the complete C++ project, go here.
The Viola-Jones detector is a strong binary classifier built from several weak detectors. Each weak detector is an extremely simple binary classifier.
The detection consists of the following parts:
Haar Filter: extracts features from the image to classify (the features act to encode ad-hoc domain knowledge).
Integral Image: allows for very fast feature evaluation
Cascade Classifier: a cascade classifier consists of multiple stages of filters, used to classify whether an image region (a sliding window over the image) is a face.
Below is an overview of how to detect a face in an image.
A detection window shifts around the whole image, extracting features (by the Haar filters, computed with the integral image) and sending the extracted features to the cascade classifier to decide whether the region is a face. The sliding window shifts pixel by pixel; each time the window shifts, the image region within the window goes through the cascade classifier.
Haar Filter: you can think of the filters as extracting features like the eyes, the bridge of the nose, and so on.
Integral Image: allows for very fast feature evaluation
Cascade Classifier:
A cascade classifier consists of multiple stages of filters, as shown in the figure below. Each time the sliding window shifts, the new region within the sliding window goes through the cascade classifier stage by stage. If the input region fails to pass the threshold of a stage, the cascade classifier immediately rejects it as a non-face. If a region passes all stages successfully, it is classified as a face candidate, which may be refined by further processing.
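In practice you rarely write this loop yourself; a minimal sketch using OpenCV's pretrained Haar cascade (the image path and tuning parameters are illustrative only):

import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# scaleFactor controls the image-pyramid step; minNeighbors filters overlapping detections
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)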
For more details:
Firstly, I suggest you read the source paper, Rapid Object Detection using a Boosted Cascade of Simple Features, to get an overview understanding of the method.
If you can't understand it clearly, you can see Viola-Jones Face Detection or Implementing the Viola-Jones Face Detection Algorithm or Study of Viola-Jones Real Time Face Detector for more details.
Here is a Python implementation of the face detection algorithm by Paul Viola and Michael J. Jones.
MATLAB code here.