Spatial hashing for bounding box - c++

I have a function that calculates spatial hash as follows:
int generateHash(int x, int y, int bucketWidth, int bucketsPerSide)
{
return static_cast<int>(floor(x / bucketWidth) + floor(y / bucketWidth) * bucketsPerSide);
}
I have a bounding box defined as x,y,width,height and I'd like to retrieve all hashes for it. How can I do it? The problem seems trivial but I spent all day trying to figure it out and I just can't find a solution. It makes me feel hopeless.
Please note that the box can be so large that it will have more than 4 hashes (corners) assigned to it. I need to generate all hashes including the ones inside.
The naive solution would be to start from x and y, increment both by 1 in a nested loop, and add each hash to a std::set so that it appears only once, but this is extremely inefficient. I know there must be a better way to do it. I tried incrementing by bucketWidth, but then it doesn't generate hashes for the rightmost side in some cases.
The closest I got is this:
std::vector<int> getOccupiedBucketIds(const cv::Rect& rect)
{
std::vector<int> occupiedBucketIds;
auto xIncrement = rect.width < bucketWidth ? rect.width : bucketWidth ;
auto yIncrement = rect.height < bucketWidth ? rect.height : bucketWidth ;
for (auto x = rect.x; x <= rect.x + rect.width; x += xIncrement)
{
for (auto y = rect.y; y <= rect.y + rect.width; y += yIncrement)
{
occupiedBucketIds.push_back(generateHash(x, y, bucketWidth , cellsPerSide));
}
}
return occupiedBucketIds;
}
This however leaves the following case unsolved when rect.width%bucketWidth > 0:

You want something like this:
std::vector<hash_t> hashes;
for(double yi=y+bucket_width/2;yi<ymax;yi+=bucket_width)
for(double xi=x+bucket_width/2;xi<xmax;xi+=bucket_width)
hashes.push_back(generateHash(xi,yi,bucket_width,buckets_per_side));
You're not getting the rightmost edge because you are using floating-point mathematics. That is, when you are calculating numbers close to the edges of cells they may be slightly smaller, or slightly larger, than you would expect.
The solution is to instead calculate the locations of the centers of cells, which are far away from the edges.

One of my friends helped me solve it. Like I said, it was kind of trivial...
std::vector<int> getOccupiedBucketIds(const cv::Rect& rect)
{
auto minBucketX = rect.x / bucketWidth;
auto minBucketY = rect.y / bucketWidth;
auto maxBucketX = (rect.x + rect.width) / bucketWidth;
auto maxBucketY = (rect.y + rect.height) / bucketWidth;
std::vector<int> occupiedBucketIds;
occupiedBucketIds.reserve((maxBucketX - minBucketX + 1) * (maxBucketY - minBucketY + 1));
for (auto y = minBucketY; y <= maxBucketY; y++)
{
for (auto x = minBucketX; x <= maxBucketX; x++)
{
occupiedBucketIds.push_back(y * cellsPerSide + x);
}
}
return occupiedBucketIds;
}
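For readers without OpenCV at hand, the same idea can be sketched as a free function. This is only an illustration: the Box struct and the name occupiedBuckets are hypothetical stand-ins for cv::Rect and the member function above, and the sketch assumes non-negative coordinates so that integer division rounds down.

```cpp
#include <vector>

// Hypothetical stand-in for cv::Rect so the sketch compiles without OpenCV.
struct Box { int x, y, width, height; };

// Returns the ids of all grid cells overlapped by the box.
// Assumes non-negative coordinates, so integer division rounds down.
std::vector<int> occupiedBuckets(const Box& box, int bucketWidth, int bucketsPerSide)
{
    // Convert the box corners from pixel coordinates to bucket coordinates.
    const int minX = box.x / bucketWidth;
    const int minY = box.y / bucketWidth;
    const int maxX = (box.x + box.width) / bucketWidth;
    const int maxY = (box.y + box.height) / bucketWidth;

    std::vector<int> ids;
    ids.reserve((maxX - minX + 1) * (maxY - minY + 1));
    // Walk the bucket grid directly; every id is produced exactly once.
    for (int by = minY; by <= maxY; ++by)
        for (int bx = minX; bx <= maxX; ++bx)
            ids.push_back(by * bucketsPerSide + bx);
    return ids;
}
```

Because the loops run over bucket indices rather than pixels, no std::set is needed to remove duplicates.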

Related

SSE optimization of Gaussian blur

I'm working on a school project. I have to optimize part of the code with SSE, but I've been stuck on one part for a few days now.
I don't see any smart way of using vector SSE instructions (inline assembler / intrinsics) in this code (it's part of a Gaussian blur algorithm). I would be glad if somebody could give me just a small hint.
for (int x = x_start; x < x_end; ++x) // vertical blur...
{
float sum = image[x + (y_start - radius - 1)*image_w];
float dif = -sum;
for (int y = y_start - 2*radius - 1; y < y_end; ++y)
{ // inner vertical Radius loop
float p = (float)image[x + (y + radius)*image_w]; // next pixel
buffer[y + radius] = p; // buffer pixel
sum += dif + fRadius*p;
dif += p; // accumulate pixel blur
if (y >= y_start)
{
float s = 0, w = 0; // border blur correction
sum -= buffer[y - radius - 1]*fRadius; // addition for fraction blur
dif += buffer[y - radius] - 2*buffer[y]; // sum up differences: +1, -2, +1
// cut off accumulated blur area of pixel beyond the border
// assume: added pixel values beyond border = value at border
p = (float)(radius - y); // top part to cut off
if (p > 0)
{
p = p*(p-1)/2 + fRadius*p;
s += buffer[0]*p;
w += p;
}
p = (float)(y + radius - image_h + 1); // bottom part to cut off
if (p > 0)
{
p = p*(p-1)/2 + fRadius*p;
s += buffer[image_h - 1]*p;
w += p;
}
new_image[x + y*image_w] = (unsigned char)((sum - s)/(weight - w)); // set blurred pixel
}
else if (y + radius >= y_start)
{
dif -= 2*buffer[y];
}
} // for y
} // for x
One more feature you can use is logical operations and masks:
for example instead of:
// process only 1
if (p > 0)
p = p*(p-1)/2 + fRadius*p;
you can write
// processes 4 floats
const __m128 zero = _mm_setzero_ps();
const __m128 mask = _mm_cmpgt_ps(p, zero); // all-ones in lanes where p > 0
const __m128 p_tmp = ( p*(p-1)/2 + fRadius*p ); // shorthand: needs overloaded operators or _mm_mul_ps/_mm_add_ps
p = _mm_or_ps(_mm_and_ps(mask, p_tmp), _mm_andnot_ps(mask, p)); // = (p_tmp & mask) | (p & ~mask)
I can also recommend using a library that overloads the operators for vector types. For example: http://code.compeng.uni-frankfurt.de/projects/vc
The dif variable makes the iterations of the inner loop dependent on each other. You should try to parallelize the outer loop instead. Without operator overloading, however, the code will become unmanageable.
Also consider rethinking the whole algorithm. The current one doesn't look parallel. Maybe you can sacrifice some precision, or accept a slightly slower scalar version?
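The mask trick can be written out with plain SSE intrinsics, without any operator overloading. The following is a minimal sketch, not a drop-in replacement for the blur loop: the function name cutOffWeight is made up for illustration, and it only covers the branch "if (p > 0) p = p*(p-1)/2 + fRadius*p" for four values at once.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Applies  p = p*(p-1)/2 + fRadius*p  to the lanes where p > 0
// and leaves all other lanes unchanged.
inline __m128 cutOffWeight(__m128 p, float fRadius)
{
    const __m128 one  = _mm_set1_ps(1.0f);
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 r    = _mm_set1_ps(fRadius);

    // p*(p-1)/2 + fRadius*p, computed for all four lanes
    const __m128 tri   = _mm_mul_ps(_mm_mul_ps(p, _mm_sub_ps(p, one)), half);
    const __m128 p_tmp = _mm_add_ps(tri, _mm_mul_ps(r, p));

    // all-ones in lanes where p > 0, all-zeros elsewhere
    const __m128 mask = _mm_cmpgt_ps(p, _mm_setzero_ps());

    // branch-free select: (p_tmp & mask) | (p & ~mask)
    return _mm_or_ps(_mm_and_ps(mask, p_tmp), _mm_andnot_ps(mask, p));
}
```

Both results are always computed and the mask picks one per lane, which is why this form only pays off when the branch body is cheap.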

Linear interpolation code on wikipedia - I don't understand it

I'm reading the following code (taken from here)
void linear_interpolation_CPU(float2* result, float2* data,
float* x_out, int M, int N) {
float a;
for(int j = 0; j < N; j++) {
int k = floorf(x_out[j]);
a = x_out[j] - floorf(x_out[j]);
result[j].x = a*data[k+1].x + (-data[k].x*a + data[k].x);
result[j].y = a*data[k+1].y + (-data[k].y*a + data[k].y);
}
}
but I don't get it.
Why isn't result[j].y calculated using the standard linear interpolation formula y = y0 + (y1-y0)*(x-x0)/(x1-x0)?
It is calculated that way.
Look at the first two lines:
int k = floorf(x_out[j]);
a = x_out[j] - floorf(x_out[j]);
The first line defines x0 using the floor function. This is because the article assumes a lattice spacing of one for the sample points, as per the line:
the samples are obtained on the 0,1,...,M lattice
Now we could rewrite the second line for clarity as:
a = x_out[j] - k;
The second line is therefore x-x0.
Now, let us examine the equation:
result[j].y = a*data[k+1].y + (-data[k].y*a + data[k].y);
Rewriting this in terms of y, x, and x0 gives:
y = (x-x0)*data[k+1].y + (-data[k].y*(x-x0) + data[k].y);
Let's rename data[k+1].y as y1 and data[k].y as y0:
y = (x-x0)*y1 + (-y0*(x-x0) + y0);
Let's rearrange this by pulling out x-x0:
y = (x-x0)*(y1-y0) + y0;
And rearrange again:
y = y0 + (y1-y0)*(x-x0);
Again, the lattice spacing is important:
the samples are obtained on the 0,1,...,M lattice
Thus, x1-x0 is always 1. If we put it back in, we get
y = y0 + (y1-y0)*(x-x0)/(x1-x0);
Which is just the equation you were looking for.
Granted, it's ridiculous that the code is not written so as to make that apparent.
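To make the correspondence with the formula visible, the interpolation can be written in the rearranged form directly. This is a small sketch under assumptions: it uses plain doubles instead of float2, the function name lerpUnitLattice is invented here, and it expects at least two samples and 0 <= x <= data.size()-1.

```cpp
#include <cmath>
#include <vector>

// Linear interpolation on samples taken at x = 0, 1, ..., data.size()-1.
// Assumes at least two samples and 0 <= x <= data.size()-1.
double lerpUnitLattice(const std::vector<double>& data, double x)
{
    int k = static_cast<int>(std::floor(x));          // x0
    if (k >= static_cast<int>(data.size()) - 1)       // x sits on the last sample:
        k = static_cast<int>(data.size()) - 2;        // step back so data[k + 1] is valid
    const double a = x - k;                           // x - x0 (and x1 - x0 == 1)
    return data[k] + (data[k + 1] - data[k]) * a;     // y0 + (y1 - y0)*(x - x0)
}
```

Unlike the quoted code, the sketch guards the case where x falls exactly on the last sample, which would otherwise read data[k+1] out of range.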

How to fit the 2D scatter data with a line with C++

I used to work with MATLAB, where I can use p = polyfit(x,y,1) to estimate the best-fit line for scatter data in a plane. Which resources can I rely on to implement a line-fitting algorithm in C++? I understand there are many algorithms for this; I need one that is fast and achieves accuracy comparable to MATLAB's polyfit function.
This page describes the algorithm easier than Wikipedia, without extra steps to calculate the means etc. : http://faculty.cs.niu.edu/~hutchins/csci230/best-fit.htm . Almost quoted from there, in C++ it's:
#include <vector>
#include <cmath>
struct Point {
double _x, _y;
};
struct Line {
double _slope, _yInt;
double getYforX(double x) {
return _slope*x + _yInt;
}
// Construct line from points
bool fitPoints(const std::vector<Point> &pts) {
int nPoints = pts.size();
if( nPoints < 2 ) {
// Fail: infinitely many lines passing through this single point
return false;
}
double sumX=0, sumY=0, sumXY=0, sumX2=0;
for(int i=0; i<nPoints; i++) {
sumX += pts[i]._x;
sumY += pts[i]._y;
sumXY += pts[i]._x * pts[i]._y;
sumX2 += pts[i]._x * pts[i]._x;
}
double xMean = sumX / nPoints;
double yMean = sumY / nPoints;
double denominator = sumX2 - sumX * xMean;
// You can tune the eps (1e-7) below for your specific task
if( std::fabs(denominator) < 1e-7 ) {
// Fail: it seems a vertical line
return false;
}
_slope = (sumXY - sumX * yMean) / denominator;
_yInt = yMean - _slope * xMean;
return true;
}
};
Please, be aware that both this algorithm and the algorithm from Wikipedia ( http://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line ) fail in case the "best" description of points is a vertical line. They fail because they use
y = k*x + b
line equation, which by its nature cannot describe vertical lines. If you also want to cover the cases where the data points are "best" described by a vertical line, you need a line fitting algorithm that uses the
A*x + B*y + C = 0
line equation. You can still modify the current algorithm to produce that equation:
y = k*x + b <=>
y - k*x - b = 0 <=>
B=1, A=-k, C=-b
In terms of the above code:
B=1, A=-_slope, C=-_yInt
And in "then" block of the if checking for denominator equal to 0, instead of // Fail: it seems a vertical line, produce the following line equation:
x = xMean <=>
x - xMean = 0 <=>
A=1, B=0, C=-xMean
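Put together, the modification described above might look like the following sketch. The names Pt, GeneralLine and fitGeneralLine are hypothetical and not part of the code above; the 1e-7 threshold is the same tunable eps.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };               // hypothetical point type
struct GeneralLine { double A, B, C; };   // represents A*x + B*y + C = 0

// Least-squares fit of y on x, reported in general form so that a
// vertical line can be returned instead of failing.
bool fitGeneralLine(const std::vector<Pt>& pts, GeneralLine& out)
{
    const std::size_t n = pts.size();
    if (n < 2) return false;              // a single point does not define a line

    double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0;
    for (const Pt& p : pts) {
        sumX  += p.x;
        sumY  += p.y;
        sumXY += p.x * p.y;
        sumX2 += p.x * p.x;
    }
    const double xMean = sumX / n;
    const double yMean = sumY / n;
    const double denominator = sumX2 - sumX * xMean;

    if (std::fabs(denominator) < 1e-7) {  // all x (nearly) equal: vertical line
        out = GeneralLine{1.0, 0.0, -xMean};   // x - xMean = 0
        return true;
    }
    const double slope = (sumXY - sumX * yMean) / denominator;
    const double yInt  = yMean - slope * xMean;
    out = GeneralLine{-slope, 1.0, -yInt};     // y - slope*x - yInt = 0
    return true;
}
```

Note that this still minimizes vertical distances whenever the denominator is not tiny, so nearly vertical data is fitted poorly; only the exactly degenerate case is handled.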
I've just noticed that the original article I was referring to has been deleted. And this web page proposes a little different formula for line fitting: http://hotmath.com/hotmath_help/topics/line-of-best-fit.html
double denominator = sumX2 - 2 * sumX * xMean + nPoints * xMean * xMean;
...
_slope = (sumXY - sumY*xMean - sumX * yMean + nPoints * xMean * yMean) / denominator;
The formulas are identical because nPoints*xMean == sumX and nPoints*xMean*yMean == sumX * yMean == sumY * xMean.
I would suggest coding it from scratch. It is a very simple implementation in C++. You can code up both the intercept and gradient for least-squares fit (the same method as polyfit) from your data directly from the formulas here
http://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line
These are closed form formulas that you can easily evaluate yourself using loops. If you were using higher degree fits then I would suggest a matrix library or more sophisticated algorithms but for simple linear regression as you describe above this is all you need. Matrices and linear algebra routines would be overkill for such a problem (in my opinion).
The equation of a line is Ax + By + C = 0.
It can easily be converted (when B is not too close to zero) to y = (-A/B)*x + (-C/B)
#include <array>
#include <cmath>
#include <utility>
#include <vector>
typedef double scalar_type;
typedef std::array< scalar_type, 2 > point_type;
typedef std::vector< point_type > cloud_type;
bool fit( scalar_type & A, scalar_type & B, scalar_type & C, cloud_type const& cloud )
{
if( cloud.size() < 2 ){ return false; }
scalar_type X=0, Y=0, XY=0, X2=0, Y2=0;
for( auto const& point: cloud )
{ // Do all calculation symmetric regarding X and Y
X += point[0];
Y += point[1];
XY += point[0] * point[1];
X2 += point[0] * point[0];
Y2 += point[1] * point[1];
}
X /= cloud.size();
Y /= cloud.size();
XY /= cloud.size();
X2 /= cloud.size();
Y2 /= cloud.size();
A = - ( XY - X * Y ); //!< Common for both solution
scalar_type Bx = X2 - X * X;
scalar_type By = Y2 - Y * Y;
if( fabs( Bx ) < fabs( By ) ) //!< Test verticality/horizontality
{ // Line is more Vertical.
B = By;
std::swap(A,B);
}
else
{ // Line is more Horizontal.
// Classical solution, when we expect more horizontal-like line
B = Bx;
}
C = - ( A * X + B * Y );
//Optional normalization:
// scalar_type D = sqrt( A*A + B*B );
// A /= D;
// B /= D;
// C /= D;
return true;
}
You can also use or go over this implementation; there is also documentation here.
Fitting a line can be accomplished in different ways.
Least squares means minimizing the sum of the squared distances.
You could use another cost function, for example the (not squared) distance, but normally you use the squared distance (least squares).
It is also possible to define the distance in different ways. Normally you measure the distance along the y-axis only, but you could also use the total (orthogonal) distance, which is measured in both the x- and y-directions. This can give a better fit if you also have errors in the x-direction (say x is the time of measurement, and you didn't start the measurement at exactly the time saved in the data). Closed-form algorithms exist for both least-squares and total-least-squares line fits. So if you fit with one of those, you get the line with the minimal sum of squared distances to the data points. You can't fit a better line in the sense of your definition. You can only change the definition, for example by taking another cost function or by defining distance in another way.
There is a lot you could consider when fitting models to data, but normally people use the least-squares line fit, and it should be fine most of the time. If you have a special case, though, it can be necessary to think about what you're doing. A least-squares fit is done in maybe a few minutes. Working out which method best fits your problem involves understanding the math, which can take an indefinite amount of time :-).
Note: This answer is NOT AN ANSWER TO THIS QUESTION but to this one "Line closest to a set of points" that has been flagged as "duplicate" of this one (incorrectly in my opinion), no way to add new answers to it.
The question asks for:
Find the line whose distance from all the points is minimum ? By
distance I mean the shortest distance between the point and the line.
The most usual interpretation of distance "between the point and the line" is the euclidean distance and the most common interpretation of "from all points" is the sum of distances (in absolute or squared value).
When the target is to minimize the sum of squared Euclidean distances, linear regression (LST) is not the algorithm to use. In addition, linear regression cannot result in a vertical line. The algorithm to use is "total least squares". See, for example, Wikipedia for the problem description and this answer on Math Stack Exchange for details about the formulation.
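For the 2D case, total least squares has a closed form: the line passes through the centroid and points along the direction of largest variance. The sketch below is one way to write that down, under the assumption that this principal-axis formulation is what is wanted; the names P2 and fitOrthogonal are invented for illustration.

```cpp
#include <cmath>
#include <vector>

struct P2 { double x, y; };   // hypothetical point type

// Total least squares (orthogonal regression) line through a point cloud.
// Result: A*x + B*y + C = 0 with A*A + B*B == 1, so |A*x + B*y + C| is the
// Euclidean distance of (x, y) from the line.
bool fitOrthogonal(const std::vector<P2>& pts, double& A, double& B, double& C)
{
    if (pts.size() < 2) return false;

    // Centroid: the fitted line always passes through it.
    double mx = 0, my = 0;
    for (const P2& p : pts) { mx += p.x; my += p.y; }
    mx /= pts.size();
    my /= pts.size();

    // Centered second moments.
    double sxx = 0, syy = 0, sxy = 0;
    for (const P2& p : pts) {
        const double dx = p.x - mx, dy = p.y - my;
        sxx += dx * dx;
        syy += dy * dy;
        sxy += dx * dy;
    }

    // Angle of the direction of largest variance (principal axis).
    // If the cloud has no preferred direction, atan2(0, 0) yields 0 and an
    // arbitrary (horizontal) line through the centroid is returned.
    const double theta = 0.5 * std::atan2(2.0 * sxy, sxx - syy);

    // The normal of the line is perpendicular to that direction.
    A = -std::sin(theta);
    B =  std::cos(theta);
    C = -(A * mx + B * my);
    return true;
}
```

Because x and y are treated symmetrically, vertical lines need no special case here.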
To fit a line y = param[0]*x + param[1], simply do this:
// loop over data:
for (int i = 0; i < ninliers; i++)
{
sum_x += x[i];
sum_y += y[i];
sum_xy += x[i] * y[i];
sum_x2 += x[i] * x[i];
}
// means
double mean_x = sum_x / ninliers;
double mean_y = sum_y / ninliers;
double varx = sum_x2 - sum_x * mean_x;
double cov = sum_xy - sum_x * mean_y;
// check for zero varx before dividing
param[0] = cov / varx;
param[1] = mean_y - param[0] * mean_x;
More on the topic http://easycalculation.com/statistics/learn-regression.php
(the formulas are the same; they are just multiplied and divided by N, the sample size). If you want to fit a plane to 3D data, use a similar approach:
http://www.mymathforum.com/viewtopic.php?f=13&t=8793
Disclaimer: all quadratic fits are linear and optimal in the sense that they reduce the noise in the parameters. However, you might be interested in reducing the noise in the data instead. You might also want to ignore outliers, since they can bias your solution greatly. Both problems can be solved with RANSAC. See my post at:

Better way than if else if else... for linear interpolation

The question is simple.
Let's say you have a function
double interpolate (double x);
and a table that maps known x -> y,
for example
5 15
7 18
10 22
Note: real tables are bigger, of course; this is just an example.
so for 8 you would return 18+((8-7)/(10-7))*(22-18)=19.3333333
One cool way I found is
http://www.bnikolic.co.uk/blog/cpp-map-interp.html
(long story short it uses std::map, key= x, value = y for x->y data pairs).
If somebody asks what is the if else if else way in title
it is basically:
if ((x>=5) && (x<=7))
{
//interpolate
}
else
if((x>=7) && x<=10)
{
//interpolate
}
So is there a more clever way to do it, or is the map way the state of the art? :)
By the way, I prefer solutions in C++, but obviously a solution in any language that maps 1:1 to C++ is fine.
Well, the easiest way I can think of would be to use a binary search to find the interval in which your point lies. Try to avoid maps if you can, as they are very slow in practice.
This is a simple way:
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>
using namespace std;
const double INF = 1.e100;
vector<pair<double, double> > table;
double interpolate(double x) {
// Assumes that "table" is sorted by .first
// Check if x is out of bound
if (x > table.back().first) return INF;
if (x < table[0].first) return -INF;
vector<pair<double, double> >::iterator it, it2;
// INFINITY is defined in math.h in the glibc implementation
it = lower_bound(table.begin(), table.end(), make_pair(x, -INF));
// Corner case
if (it == table.begin()) return it->second;
it2 = it;
--it2;
return it2->second + (it->second - it2->second)*(x - it2->first)/(it->first - it2->first);
}
int main() {
table.push_back(make_pair(5., 15.));
table.push_back(make_pair(7., 18.));
table.push_back(make_pair(10., 22.));
// If you are not sure if table is sorted:
sort(table.begin(), table.end());
printf("%f\n", interpolate(8.));
printf("%f\n", interpolate(10.));
printf("%f\n", interpolate(10.1));
}
You can use a binary search tree to store the interpolation data. This is beneficial when you have a large set of N interpolation points, as interpolation can then be performed in O(log N) time. However, in your example, this does not seem to be the case, and the linear search suggested by RedX is more appropriate.
#include <stdio.h>
#include <assert.h>
#include <map>
static double interpolate (double x, const std::map<double, double> &table)
{
assert(table.size() > 0);
std::map<double, double>::const_iterator it = table.lower_bound(x);
if (it == table.end()) {
return table.rbegin()->second;
} else {
if (it == table.begin()) {
return it->second;
} else {
double x2 = it->first;
double y2 = it->second;
--it;
double x1 = it->first;
double y1 = it->second;
double p = (x - x1) / (x2 - x1);
return (1 - p) * y1 + p * y2;
}
}
}
int main ()
{
std::map<double, double> table;
table.insert(std::pair<double, double>(5, 6));
table.insert(std::pair<double, double>(8, 4));
table.insert(std::pair<double, double>(9, 5));
double y = interpolate(5.1, table);
printf("%f\n", y);
}
Store your points sorted:
index X Y
1 1 -> 3
2 3 -> 7
3 10-> 8
Then loop from max to min; as soon as you reach an x that is at or below the value you want, you have found your interval.
Let's say you want 6:
// pseudo
for i = 3 to 1
if x[i] <= 6
// you found your range!
// interpolate between x[i] and x[i + 1]
break; // Do not look any further
end
end
Yes, I guess you should think of a mapping between those intervals and the natural numbers. I mean, just label the intervals and use a switch:
switch(I) {
case Int1: //whatever
break;
...
default:
}
I don't know, it's the first thing that I thought of.
EDIT Switch is more efficient than if-else if your numbers are within a relative small interval (that's something to take into account when doing the mapping)
If your x-coordinates must be irregularly spaced, then store the x-coordinates in sorted order, and use a binary search to find the nearest coordinate, for example using Daniel Fleischman's answer.
However, if your problem permits it, consider pre-interpolating to regularly spaced data. So
5 15
7 18
10 22
becomes
5 15
6 16.5
7 18
8 19.3333333
9 20.6666667
10 22
Then at run-time you can interpolate with O(1) using something like this:
double interp1( double x0, double dx, double* y, int n, double xi )
{
double f = ( xi - x0 ) / dx;
if (f<0) return y[0];
if (f>=(n-1)) return y[n-1];
int i = (int) f;
double w = f-(double)i;
return y[i]*(1.0-w) + y[i+1]*w;
}
using
double y[6] = {15,16.5,18,19.3333333, 20.6666667, 22 };
double yi = interp1( 5.0 , 1.0 , y, 6, xi );
This isn't necessarily suitable for every problem -- you could end up losing accuracy (if there's no nice grid that contains all your x-samples), and it could have a bad cache penalty if it would make your table much much bigger. But it's a good option for cases where you have some control over the x-coordinates to begin with.
What you've already got is fairly readable and understandable, and there's a lot to be said for that over a "clever" solution. You can, however, do away with the lower bounds check and the clumsy && because the sequence is ordered:
if (x < 5)
return 0;
else if (x <= 7)
// interpolate
else if (x <= 10)
// interpolate
...

Speeding up self-similarity in an image

I'm writing a program that will generate images. One measurement that I want is the amount of "self-similarity" in the image. I wrote the following code that looks for the countBest-th best matches for each sizeWindow * sizeWindow window in the picture:
double Pattern::selfSimilar(int sizeWindow, int countBest) {
std::vector<int> *pvecount;
double similarity;
int match;
int x1;
int x2;
int xWindow;
int y1;
int y2;
int yWindow;
similarity = 0.0;
// (x1, y1) is the original that's looking for matches.
for (x1 = 0; x1 < k_maxX - sizeWindow; x1++) {
for (y1 = 0; y1 < k_maxY - sizeWindow; y1++) {
pvecount = new std::vector<int>();
// (x2, y2) is the possible match.
for (x2 = 0; x2 < k_maxX - sizeWindow; x2++) {
for (y2 = 0; y2 < k_maxY - sizeWindow; y2++) {
// Testing...
match = 0;
for (xWindow = 0; xWindow < sizeWindow; xWindow++) {
for (yWindow = 0; yWindow < sizeWindow; yWindow++) {
if (m_color[x1 + xWindow][y1 + yWindow] == m_color[x2 + xWindow][y2 + yWindow]) {
match++;
}
}
}
pvecount->push_back(match);
}
}
nth_element(pvecount->begin(), pvecount->end()-countBest, pvecount->end());
similarity += (1.0 / ((k_maxX - sizeWindow) * (k_maxY - sizeWindow))) *
(*(pvecount->end()-countBest) / (double) (sizeWindow * sizeWindow));
delete pvecount;
}
}
return similarity;
}
The good news is that the algorithm does what I want it to: it will return a value from 0.0 to 1.0 about how 'self-similar' a picture is.
The bad news -- as I'm sure that you've already noted -- is that the algorithm is extremely slow. It takes (k_maxX - sizeWindow) * (k_maxY - sizeWindow) * (k_maxX - sizeWindow) * (k_maxY - sizeWindow) * sizeWindow * sizeWindow steps for a run.
Some typical values for the variables:
k_maxX = 1280
k_maxY = 1024
sizeWindow = between 5 and 25
countBest = 3, 4, or 5
m_color[x][y] is defined as short m_color[k_maxX][k_maxY] with values between 0 and 3 (but may increase in the future.)
Right now, I'm not worried about the memory footprint taken by pvecount. Later, I can use a sorted data set that doesn't add another element when it's smaller than countBest. I am only worried about algorithm speed.
How can I speed this up?
Ok, first, this approach is not stable at all. If you add random noise to your image, it will greatly decrease the similarity between the two images. More importantly, from an image processing standpoint, it's not efficient or particularly good. I suggest another approach; for example, using a wavelet-based approach. If you performed a 2d DWT on your image for a few levels and compared the scaling coefficients, you would probably get better results. Plus, the discrete wavelet transform is O(n).
The downside is that wavelets are an advanced mathematical topic. There are some good OpenCourseWare notes on wavelets and filterbanks here.
Your problem strongly reminds me of the calculations that have to be done for motion compensation in video compression. Maybe you should take a closer look what's done in that area.
As rlbond already pointed out, counting the number of points in a window where the colors exactly match isn't what's normally done in comparing pictures. A conceptually simpler method than using discrete cosine or wavelet transformations is to add the squares of the differences
diff = (m_color[x1 + xWindow][y1 + yWindow] - m_color[x2 + xWindow][y2 + yWindow]);
sum += diff*diff;
and use sum instead of match as criterion for similarity (now smaller means better).
Back to what you really asked: I think it is possible to reduce the running time to roughly 2/sizeWindow of its current value (maybe squared?), but it is a little bit messy. It's based on the fact that certain pairs of squares you compare stay almost the same when y1 is incremented by 1. If the offsets xOff = x2-x1 and yOff = y2-y1 stay the same, only the top stripe of each square is no longer compared, and the bottom stripe is now compared where it was not before. If you keep the values you calculate for match in a two-dimensional array indexed by the offsets xOff = x2-x1 and yOff = y2-y1, then you can calculate the new value of match[xOff][yOff] for y1 increased by 1 (with x1 staying the same) using 2*sizeWindow comparisons:
for (int x = x1; x < x1 + sizeWindow; x++) {
if (m_color[x][y1] == m_color[x + xOff][y1 + yOff]) {
match[xOff][yOff]--; // top stripe is no longer compared
}
if (m_color[x][y1+sizeWindow] == m_color[x + xOff][y1 + sizeWindow + yOff]) {
match[xOff][yOff]++; // bottom stripe is compared now, but wasn't before
}
}
(as the possible values for yOff changed - by incrementing y1 - from the interval [y2 - y1, k_maxY - sizeWindow - y1 - 1] to the interval [y2 - y1 - 1, k_maxY - sizeWindow - y1 - 2] you can discard the matches with second index yOff = k_maxY - sizeWindow - y1 - 1 and have to calculate the matches with second index yOff = y2 - y1 - 1 differently). Maybe you can also keep the values by how much you increase/decrease match[][] during the loop in an array to get another 2/sizeWindow speed-up.
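The incremental update can be checked against the brute-force count on a small image. The sketch below is only an illustration of the sliding step for a single offset: the Image alias and the function names countMatches and slideDown are assumptions introduced here, and the caller must keep all indices inside the image.

```cpp
#include <vector>

using Image = std::vector<std::vector<short>>;   // indexed as img[x][y]

// Brute force: number of equal pixels between the window at (x1, y1)
// and the window at (x1 + xOff, y1 + yOff); size*size comparisons.
int countMatches(const Image& img, int x1, int y1, int xOff, int yOff, int size)
{
    int match = 0;
    for (int dx = 0; dx < size; ++dx)
        for (int dy = 0; dy < size; ++dy)
            if (img[x1 + dx][y1 + dy] == img[x1 + xOff + dx][y1 + yOff + dy])
                ++match;
    return match;
}

// Incremental: given the match count for the window at (x1, y1), returns the
// count for the window at (x1, y1 + 1) with the same offset, using only
// 2*size comparisons.
int slideDown(const Image& img, int match, int x1, int y1, int xOff, int yOff, int size)
{
    for (int x = x1; x < x1 + size; ++x) {
        if (img[x][y1] == img[x + xOff][y1 + yOff])
            --match;   // stripe at y1 leaves the window
        if (img[x][y1 + size] == img[x + xOff][y1 + size + yOff])
            ++match;   // stripe at y1 + size enters the window
    }
    return match;
}
```

In a full implementation one such running count would be kept per (xOff, yOff) pair, as described above.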