HOG: What is done in the contrast-normalization step?

According to the HOG process, as described in the paper Histogram of Oriented Gradients for Human Detection (see link below), the contrast normalization step is done after the binning and the weighted vote.
I don't understand something: if I have already computed the cells' weighted gradients, how can normalizing the image's contrast help me now?
As far as I understand, contrast normalization is done on the original image, whereas for computing the gradients I have already computed the X,Y derivatives of the ORIGINAL image. So, if I normalize the contrast and want it to take effect, I would have to compute everything again.
Is there something I don't understand well?
Should I normalize the cells' values?
Is the normalization in HOG not about contrast at all, but about the histogram values (the counts in each bin)?
Link to the paper:
http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf

The contrast normalization is achieved by normalization of each block's local histogram.
The whole HOG extraction process is well explained here: http://www.geocities.ws/talh_davidc/#cst_extract
When you normalize the block histogram, you actually normalize the contrast in this block, provided that your histogram really contains the sum of magnitudes for each direction.
The term "histogram" is confusing here, because you do not count how many pixels have direction k; instead, you sum the magnitudes of those pixels. Thus you can normalize the contrast after computing the block's vector, or even after you have computed the whole descriptor vector, as long as you know at which indices in the vector each block starts and ends.
The steps of the algorithm, as I understand them (this worked for me with a 95% success rate), are as follows; a short code sketch of the per-cell voting and block normalization follows the list:
Define the following parameters (In this example, the parameters are like HOG for Human Detection paper):
A cell size in pixels (e.g. 6x6)
A block size in cells (e.g. 3x3 ==> Means that in pixels it is 18x18)
Block overlapping rate (e.g. 50% ==> Means that both block width and block height in pixels have to be even. It is satisfied in this example, because the cell width and cell height are even (6 pixels), making the block width and height also even)
Detection window size. The size must be divisible by half of the block size without remainder (so it is possible to exactly place the blocks within it with 50% overlap). For example, the block width is 18 pixels, so the window width must be a multiple of 9 (e.g. 18, 27, 36, ...). Same for the window height. In our example, the window width is 63 pixels, and the window height is 126 pixels.
Calculate gradient:
Compute the X difference using convolution with the vector [-1 0 1]
Compute the Y difference using convolution with the transpose of the above vector
Compute the gradient magnitude in each pixel using sqrt(diffX^2 + diffY^2)
Compute the gradient direction at each pixel using atan(diffY / diffX). Note that atan returns values between -90 and 90, while you will probably want values between 0 and 180. So simply flip every negative value by adding 180 degrees to it. Note that in HOG for Human Detection, they use unsigned directions (between 0 and 180). If you want to use signed directions (covering the full 360 degrees, equivalent to atan2(diffY, diffX) up to a multiple of 360), you need a little more effort:
If diffX and diffY are positive, the atan value will be between 0 and 90 - leave it as is.
If diffX and diffY are negative, you get the same range of possible values - here, add 180, so the direction is flipped to the other side.
If diffX is positive and diffY is negative, you get values between -90 and 0 - leave them as they are (you can add 360 if you want them positive).
If diffY is positive and diffX is negative, you again get the same range, so add 180 to flip the direction to the other side.
"Bin" the directions. For example, 9 unsigned bins: 0-20, 20-40, ..., 160-180. You can easily achieve that by dividing each value by 20 and flooring the result. Your new binned directions will be between 0 and 8.
Do the following for each block separately, using copies of the original matrix (because some blocks overlap and we do not want to destroy their data):
Split to cells
For each cell, create a vector with 9 members (one for each bin). For each bin index, store the sum of the magnitudes of all the pixels in the cell with that binned direction. We have 6x6 = 36 pixels in a cell in total. So, for example, if 2 pixels have direction 0, where the magnitude of the first one is 0.231 and the magnitude of the second one is 0.13, you should write into index 0 of your vector the value 0.361 (= 0.231 + 0.13).
Concatenate all the vectors of all the cells in the block into a large vector. This vector size should of course be NUMBER_OF_BINS * NUMBER_OF_CELLS_IN_BLOCK. In our example, it is 9 * (3 * 3) = 81.
Now, normalize this vector. Use k = sqrt(v[0]^2 + v[1]^2 + ... + v[n]^2 + eps^2) (I used eps = 1). After you computed k, divide each value in the vector by k - thus your vector will be normalized.
Create final vector:
Concatenate all the vectors of all the blocks into 1 large vector. In my example, the size of this vector was 6318 (78 blocks of 81 values each).
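For concreteness, here is a minimal C++ sketch of the per-cell voting and the per-block normalization described above. The 6x6 cell, 9 bins and eps = 1 are just this example's parameters; the gradient computation and the block/window traversal are omitted:

#include <cmath>
#include <vector>

// Sum of gradient magnitudes per binned direction for one 6x6 cell.
// binnedDir[y][x] is in 0..8; magnitude[y][x] is the gradient magnitude.
std::vector<double> cellHistogram(const int binnedDir[6][6],
                                  const double magnitude[6][6]) {
    std::vector<double> hist(9, 0.0);
    for (int y = 0; y < 6; ++y)
        for (int x = 0; x < 6; ++x)
            hist[binnedDir[y][x]] += magnitude[y][x];  // weighted vote
    return hist;
}

// Normalize a concatenated block vector (81 values for a 3x3-cell block)
// by k = sqrt(v[0]^2 + ... + v[n]^2 + eps^2), as in the step above.
void normalizeBlock(std::vector<double>& v, double eps = 1.0) {
    double sumSq = eps * eps;
    for (double x : v) sumSq += x * x;
    const double k = std::sqrt(sumSq);
    for (double& x : v) x /= k;  // this is the contrast normalization
}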

Histogram Binning of Gradient Vectors

I am working on a project that has a small component requiring the comparison of distributions over image gradients. Assume I have computed the image gradients in the x and y directions using a Sobel filter, so I have a 2-vector for each pixel. Obviously, getting the magnitude and direction is reasonably trivial: mag = sqrt(gx^2 + gy^2) and dir = atan2(gy, gx).
However, what is not clear to me is how to bin these two components into a two-dimensional histogram for an arbitrary number of bins.
I had considered something along these lines (written in the browser):
//Assuming normalised magnitudes.
//Histogram dimensions are bins * bins.
int getHistIdx(float mag, float dir, int bins) {
    const int magInt = reinterpret_cast<int>(mag);
    const int dirInt = reinterpret_cast<int>(dir);
    const int magMod = reinterpret_cast<int>(static_cast<float>(1.0));
    const int dirMod = reinterpret_cast<int>(static_cast<float>(TWO_PI));
    const int idxMag = (magInt % magMod) & bins;
    const int idxDir = (dirInt % dirMod) & bins;
    return idxMag * bins + idxDir;
}
However, I suspect that the mod operation will introduce a lot of incorrect overlap, i.e. completely different gradients getting placed into the same bin.
Any insight into this problem would be very much appreciated.
I would like to avoid using any off-the-shelf libraries, as I want to keep this project as dependency-light as possible. Also, I intend to implement this in CUDA.
This is more of a "what is a histogram?" question than one about your tags. Two things:
In a 2D plane, two directions that are equal modulo 2*pi are in fact the same - so it makes sense to take the direction modulo 2*pi.
I see no practical or logical reason for taking the norms modulo anything.
Next, you say you want a "two dimensional histogram", but you return a single number. A 2D histogram, which is what would make sense in your context, is a 3D plot: the plane is indexed by theta/R, and the third axis is the count.
So first suggestion, return
return std::pair<int,int>(idxMag, idxDir);
Then you can make a 2D histogram, or 2 1D histograms.
Regarding the "number of bins":
This is use-case dependent. You need to define the number of bins you want (maybe a different number for theta and for R). Maybe just some constant 10 bins? Maybe it should depend on the amount of vectors? In any case, you need a function that receives either the number of vectors, or the total set of vectors, and returns the number of bins for each axis. This could be a constant (10 bins) initially, and you can play with it. Once you decide on the number of bins:
Determine the bins
For a bounded case such as 0 <= theta < 2*pi, this is easy. Divide the interval equally into the number of bins, assuming a flat distribution. Your modulo operation would actually handle this well - if you had actually taken the value modulo 2*pi, which you didn't (you took one bit pattern modulo another). You would still need to determine the bin bounds, though.
For R this gets trickier, as it is unbounded. There are two options here, but both rely on the same tactic - choose a maximal bin. Either arbitrarily (say, R = 10), so any vector longer than that is placed in the "longer than max" bin, with the rest of the range divided equally (for example; you could choose other distributions). Or let the longest vector determine the edge of the maximal bin.
Getting the index
Once you have the bins, you need to search for the magnitude/direction of the current vector in your bins. If bins are pairs representing the min/max of a bin (and maybe an index), say in a linked list, then it would be something like this (for the magnitude, for example):
bin = histogram.first;
while ( mag > bin.max ) bin = bin.next;
magIdx = bin.index;
If the bin does not hold the index, you can just use a counter and increase it inside the while loop. Also, for the magnitude, the final bin should hold "infinity" or some other large number as its upper limit. Note that this has nothing to do with taking a modulo, though a modulo would work for your direction - as you have coded it. I don't see how it makes sense for the norm.
Bottom line though, you have to think a bit about what you want. In any case, all the "objects" here are trivial enough to write yourself, or you could even use small arrays.
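To make the above concrete, here is a minimal C++ sketch that returns the pair of bin indices, assuming equal-width direction bins over [0, 2*pi) and an arbitrarily chosen maximal magnitude rMax whose overflow goes into the last bin (rMax and the names are illustrative, not part of the question's code):

#include <cmath>
#include <utility>

const float TWO_PI = 6.2831853f;

std::pair<int,int> getHistIdx(float mag, float dir, int bins, float rMax) {
    // Direction is periodic: wrap it into [0, 2*pi) before binning.
    dir = std::fmod(dir, TWO_PI);
    if (dir < 0) dir += TWO_PI;
    int idxDir = static_cast<int>(dir / TWO_PI * bins);
    if (idxDir == bins) idxDir = bins - 1;  // guard against rounding

    // Magnitude is unbounded: clamp everything >= rMax into the last bin.
    int idxMag = (mag >= rMax) ? bins - 1
                               : static_cast<int>(mag / rMax * bins);
    return std::make_pair(idxMag, idxDir);
}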
I think you should arrange your bins in a square array, and then bin by vx and vy independently.
If your gradients are reasonably evenly distributed, you just need to scan the data first to accumulate the min and max in x and y, and then split that range evenly between the bins.
If the gradients are very unevenly distributed, you might want to sort the (e.g.) vx values first and arrange the bin boundaries so that each bin receives an equal share of the values.
An intermediate solution might be to obtain the min and max while ignoring the (e.g.) 10% most extreme values.

Choose rectangles for maximizing the area

I've got a 2D binary matrix of arbitrary size. I want to find a set of rectangles in this matrix with maximum total area. The constraints are:
Rectangles may only cover "0"-fields in the matrix and no "1"-fields.
Each rectangle has to keep a given minimum distance from every other rectangle.
So let me illustrate this a bit further by this matrix:
1 0 0 1
0 0 0 0
0 0 1 0
0 0 0 0
0 1 0 0
Let the minimal distance between two rectangles be 1. Consequently, the optimal solution is to choose the rectangles with corners (1,0)-(3,1) and (1,3)-(4,3). These rectangles are at least 1 field apart from each other and do not lie on "1"-fields. Additionally, this solution has the maximum area (6+4=10).
If the minimal distance were 2, the optimum would be (1,0)-(4,0) and (1,3)-(4,3) with area 4+4=8.
So far, I have managed to find such rectangles following this post:
Find largest rectangle containing only zeros in an N×N binary matrix
I saved all these rectangles in a list:
list<rectangle> rectangles;
with
struct rectangle {
int i,j; // bottom left corner of rectangle
int width,length; // width=size in neg. i direction, length=size in pos. j direction
};
So far I have only thought about brute-force methods, but of course I am not happy with those.
I hope you can give me some hints and tips on how to find the corresponding rectangles in my list, and I hope my problem is clear to you.
The following counterexample shows that even brute-force checking of all combinations of maximal-area rectangles can fail to find the optimum:
110
000
110
In the above example, there are 2 maximal-area rectangles, each of area 3, one vertical and one horizontal. You can't pick both, so if you are restricted to choosing a subset of these rectangles, the best you can do is to pick (either) one for a total area of 3. But if you instead picked the vertical area-3 rectangle, and then also took the non-maximal 1x2 rectangle consisting of just the leftmost two 0s, you could get a better total area of 5. (That's for a minimum separation distance of 0; if the minimum separation distance is 1, as in your own example, then you could instead pick just the leftmost 0 as a 1x1 rectangle for a total area of 4, which is still better than 3.)
For the special case when the separation distance is 0, there's a trivial algorithm: you can simply put a 1x1 rectangle on every single 0 in the matrix. When the separation distance is strictly greater than 0, I don't yet see a fast algorithm, though I'm less sure that the problem is NP-hard now than I was a few minutes ago...
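Whichever search strategy you end up with, you will need a pairwise separation test. Here is a minimal C++ sketch, assuming the distance between two rectangles is the number of free fields separating their row and column ranges (a Chebyshev-style gap, which matches both worked examples in the question); the Rect representation is illustrative, not the questioner's struct:

#include <algorithm>

// Inclusive cell ranges: rows r1..r2, columns c1..c2.
struct Rect { int r1, r2, c1, c2; };

// Gap between intervals [a1,a2] and [b1,b2]; 0 if they overlap or touch.
int gap(int a1, int a2, int b1, int b2) {
    if (b1 > a2) return b1 - a2 - 1;
    if (a1 > b2) return a1 - b2 - 1;
    return 0;
}

// True if a and b are at least minDist free fields apart.
bool farEnough(const Rect& a, const Rect& b, int minDist) {
    int rowGap = gap(a.r1, a.r2, b.r1, b.r2);
    int colGap = gap(a.c1, a.c2, b.c1, b.c2);
    return std::max(rowGap, colGap) >= minDist;
}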

How to calculate the width of the middle 98% mass of the gray-level histogram of an image

I need to calculate the contrast of a color image, so the steps that were given to me are:
compute the histogram for each RGB channel separately and combine them together as Histogram = histOfRedC + histOfBlueC + histOfgreenC.
normalize it to unit length, as each image is of a different size.
The contrast quality is equal to the width of the middle 98% mass of the histogram.
I have done the first 2 steps, but I am unable to understand what to compute in the 3rd step. Can somebody please explain to me what it means?
Let the total mass of the histogram be M.
Accumulate the mass in the bins, starting from index zero, until you pass 0.01 M. You get an index Q01.
Accumulate the mass in the bins, starting from the maximum index and going down, until you pass 0.01 M. You get an index Q99.
These indexes are the so-called 1st and 99th percentiles. The contrast is estimated as Q99 - Q01.
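A minimal C++ sketch of that accumulation, assuming the histogram has already been normalized to unit mass (M = 1):

#include <vector>

// Width of the middle 98% mass of a histogram with unit total mass.
int contrastWidth(const std::vector<double>& hist) {
    const int n = static_cast<int>(hist.size());
    double acc = 0.0;
    int q01 = 0;
    for (; q01 < n - 1; ++q01) {   // accumulate from the bottom
        acc += hist[q01];
        if (acc > 0.01) break;     // passed 1% of the mass
    }
    acc = 0.0;
    int q99 = n - 1;
    for (; q99 > 0; --q99) {       // accumulate from the top
        acc += hist[q99];
        if (acc > 0.01) break;     // passed 1% from the top, i.e. 99%
    }
    return q99 - q01;              // estimated contrast
}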

openCV filter image - replace kernel with local maximum

Some details about my problem:
I'm trying to implement a corner detector in OpenCV (a different algorithm from the built-in ones: Canny, Harris, etc.).
I've got a matrix filled with response values. The bigger the response value, the higher the probability that a corner was detected.
The problem is that several corners are detected in the neighborhood of a point (when there is really only one). I need to reduce the number of falsely detected corners.
Exact problem:
I need to walk through the matrix with a kernel, find the maximum value within each kernel window, keep that maximum, and set all the other values in the window to zero.
Are there built-in OpenCV functions to do this?
This is how I would do it:
Create a kernel; it defines a pixel's neighbourhood.
Create a new image by dilating your image using this kernel. This dilated image contains the maximum neighbourhood value for every point.
Do an equality comparison between these two arrays. Wherever they are equal is a valid neighbourhood maximum, and is set to 255 in the comparison array.
Multiply the comparison array, and the original array together (scaling appropriately).
This is your final array, containing only neighbourhood maxima.
This is illustrated by these zoomed in images:
9 pixel by 9 pixel original image:
After processing with a 5 by 5 pixel kernel, only the local neighbourhood maxima remain (i.e. maxima separated by more than 2 pixels from any pixel with a greater value):
There is one caveat. If two nearby maxima have the same value then they will both be present in the final image.
Here is some Python code that does it; it should be very easy to convert to C++:
import cv  # the legacy OpenCV 1.x Python API

im = cv.LoadImage('fish2.png',cv.CV_LOAD_IMAGE_GRAYSCALE)
maxed = cv.CreateImage((im.width, im.height), cv.IPL_DEPTH_8U, 1)
comp = cv.CreateImage((im.width, im.height), cv.IPL_DEPTH_8U, 1)

#Create a 5*5 kernel anchored at 2,2
kernel = cv.CreateStructuringElementEx(5, 5, 2, 2, cv.CV_SHAPE_RECT)

#Dilate: each pixel in maxed becomes the max of its 5*5 neighbourhood
cv.Dilate(im, maxed, element=kernel, iterations=1)

#comp is 255 wherever a pixel equals its neighbourhood maximum
cv.Cmp(im, maxed, comp, cv.CV_CMP_EQ)

#Keep only those maxima (comp/255 acts as a 0/1 mask)
cv.Mul(im, comp, im, 1/255.0)

cv.ShowImage("local max only", im)
cv.WaitKey(0)
I didn't realise until now, but this is what @sansuiso suggested in his/her answer.
This is possibly better illustrated with this image. Before:
After processing with a 5 by 5 kernel:
The solid regions are due to shared local-maximum values.
I would suggest an original 2-step procedure (there may be more efficient approaches) that uses OpenCV built-in functions:
Step 1 : morphological dilation with a square kernel (corresponding to your neighborhood). This step gives you another image, after replacing each pixel value by the maximum value inside the kernel.
Step 2 : test if the cornerness value of each pixel of the original response image is equal to the max value given by the dilation step. If not, then obviously there exists a better corner in the neighborhood.
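For reference, a sketch of these two steps with the OpenCV C++ API (the 5x5 kernel size is illustrative):

#include <opencv2/imgproc/imgproc.hpp>

// response: the corner response matrix (e.g. CV_32F).
cv::Mat localMaxima(const cv::Mat& response) {
    // Step 1: each pixel becomes the maximum of its 5x5 neighbourhood.
    cv::Mat dilated;
    cv::dilate(response, dilated, cv::Mat::ones(5, 5, CV_8U));

    // Step 2: keep only the pixels equal to that neighbourhood maximum.
    cv::Mat isMax;
    cv::compare(response, dilated, isMax, cv::CMP_EQ);  // 255 where equal

    cv::Mat result = cv::Mat::zeros(response.size(), response.type());
    response.copyTo(result, isMax);  // isMax acts as a mask
    return result;
}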
If you are looking for some built-in functionality, FilterEngine will help you make a custom filter (kernel).
http://docs.opencv.org/modules/imgproc/doc/filtering.html#filterengine
Also, I would recommend some kind of noise reduction, usually a blur, before all processing - unless you really want to work on the raw image.

compact representation and delivery of point data

I have an array of point data; the values of the points are represented as x and y coordinates.
There could be anywhere from 500 up to 2000 points or more.
The data represents a motion path which could range from the simple to very complex and can also have cusps in it.
Can I represent this data as one spline, or a collection of splines, or some other format with very tight compression?
I have tried representing them as a collection of beziers, but at best I am getting a saving of 40%.
For instance, if I have an array of 500 points, that gives me 500 x and 500 y values, so I have 1000 pieces of data.
I get around 100 quadratic beziers from this. Each bezier is represented as controlx, controly, anchorx, anchory,
which gives me 100 x 4 = 400 pieces of data.
So input = 1000 pieces, output = 400 pieces.
I would like to tighten this further; any suggestions?
By its nature, a spline is an approximation. You can reduce the number of splines you use to reach a higher compression ratio.
You can also achieve lossless compression by using some kind of encoding scheme. I am just making this up as I type, using the range example from the other answer (0..1000 for x and 0..400 for y):
Each point only needs 19 bits (10 for x, 9 for y), so with a prefix you can use 3 bytes to represent a full coordinate.
Use 2 bytes to represent a displacement of up to +/- 63 in each axis.
Use 1 byte to represent a short displacement of up to +/- 7 for x and +/- 3 for y.
To decode the sequence properly, you need a prefix to identify the type of encoding. Let's say we use 110 for a full point, 10 for a displacement, and 0 for a short displacement.
The bit layout will look like this:
Coordinates: 110xxxxxxxxxxyyyyyyyyy (3 + 10 + 9 = 22 bits, padded to 3 bytes)
Displacement: 10xxxxxxxyyyyyyy (2 + 7 + 7 = 16 bits)
Short displacement: 0xxxxyyy (1 + 4 + 3 = 8 bits)
Unless your sequence is totally random, you can easily achieve high compression ratio with this scheme.
Let's see how it works using a short example.
3 points: A(500, 400), B(550, 380), C(545, 381)
Let's say you were using 2 bytes for each coordinate. It would take 12 bytes to encode this without compression.
To encode the sequence using the compression scheme,
A is the first point, so the full coordinate form is used: 3 bytes.
B's displacement from A is (50, -20), which can be encoded as a displacement: 2 bytes.
C's displacement from B is (-5, 1), which fits the range of a short displacement: 1 byte.
So you save 6 bytes out of 12. The real compression ratio depends entirely on the data pattern. The scheme works best on points forming a moving path. If the points are random, only a 25% saving can be achieved.
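A C++ sketch of what the encoder could look like (the BitWriter is a made-up helper, and this version packs the bits back to back, without the byte padding assumed above):

#include <cstdint>
#include <cstdlib>
#include <vector>

// Minimal MSB-first bit writer.
struct BitWriter {
    std::vector<uint8_t> bytes;
    int bitsUsed = 0;
    void put(uint32_t value, int width) {  // append 'width' bits
        for (int b = width - 1; b >= 0; --b) {
            if (bitsUsed % 8 == 0) bytes.push_back(0);
            if ((value >> b) & 1)
                bytes.back() |= 1 << (7 - bitsUsed % 8);
            ++bitsUsed;
        }
    }
};

// Encode one point given the previous one, choosing the shortest form:
// 0xxxxyyy, 10xxxxxxxyyyyyyy, or 110 + full coordinates.
void encodePoint(BitWriter& w, int x, int y, int prevX, int prevY) {
    const int dx = x - prevX, dy = y - prevY;
    if (std::abs(dx) <= 7 && std::abs(dy) <= 3) {
        w.put(0, 1);          // prefix 0: short displacement
        w.put(dx & 0xF, 4);   // 4-bit two's complement
        w.put(dy & 0x7, 3);   // 3-bit two's complement
    } else if (std::abs(dx) <= 63 && std::abs(dy) <= 63) {
        w.put(2, 2);          // prefix 10: displacement
        w.put(dx & 0x7F, 7);
        w.put(dy & 0x7F, 7);
    } else {
        w.put(6, 3);          // prefix 110: full coordinates
        w.put(x, 10);         // x in 0..1000
        w.put(y, 9);          // y in 0..400
    }
}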
If for example you use 32-bit integers for point coords and there is range limit, like x: 0..1000, y:0..400, you can pack (x, y) into a single 32-bit variable.
That way you achieve another 50% compression.
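For instance (a minimal sketch; the 16-bit split is arbitrary as long as each field fits its range):

#include <cstdint>

// x needs 10 bits (0..1000) and y needs 9 bits (0..400): both fit in 32.
uint32_t packPoint(uint32_t x, uint32_t y) { return (x << 16) | y; }
uint32_t unpackX(uint32_t p) { return p >> 16; }
uint32_t unpackY(uint32_t p) { return p & 0xFFFF; }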
You could do a frequency analysis of the numbers you are trying to encode and use varying bit lengths to represent them; of course, here I am vaguely describing Huffman coding.
Firstly, only keep as many decimal places in your data as you actually need. Removing decimal places reduces your accuracy, but it is a calculated loss. To do that, try converting your number to a string, locating the dot's position, and cutting off that many characters from the end. That could process faster than math, IMO. Lastly, you can convert it back to a number.
150.234636746 -> "150.234636746" -> "150.23" -> 150.23
Secondly, try storing each number relative to the previous one ("relative values"). Basically, subtract the previous number from the current one. Later, to "decompress" it, you keep an accumulator variable and add the values back up.
A    A    A   ->  A    R   R
150, 200, 250 ->  150, 50, 50
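A small C++ sketch of both ideas together; note it rounds numerically instead of using the string trick, which achieves the same truncation:

#include <cmath>
#include <vector>

// Keep 2 decimal places, then store each value relative to the previous.
std::vector<double> deltaEncode(const std::vector<double>& values) {
    std::vector<double> out;
    double prev = 0.0;
    for (double v : values) {
        const double r = std::round(v * 100.0) / 100.0;  // keep 2 decimals
        out.push_back(r - prev);                         // relative value
        prev = r;
    }
    return out;  // decode by accumulating: x[i] = x[i-1] + out[i]
}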