Determining template type when accessing OpenCV Mat elements - c++

I'm using the following code to add some noise to an image (straight out of the OpenCV reference, page 449 -- explanation of cv::Mat::begin):
void
simulate_noise(Mat const &in, double stddev, Mat &out)
{
cv::Size s = in.size();
vector<double> noise = generate_noise(s.width*s.height, stddev);
typedef cv::Vec<unsigned char, 3> V4;
cv::MatConstIterator_<V4> in_itr = in.begin<V4>();
cv::MatConstIterator_<V4> in_end = in.end<V4>();
cv::MatIterator_<V4> out_itr = out.begin<V4>();
cv::MatIterator_<V4> out_end = out.end<V4>();
for (; in_itr != in_end && out_itr != out_end; ++in_itr, ++out_itr)
{
int noise_index = my_rand(noise.size());
for (int j = 0; j < 3; ++j)
(*out_itr)[j] = (*in_itr)[j] + noise[noise_index];
}
}
Nothing overly complicated:
in and out are allocated cv::Mat objects of the same dimensions and type
iterate over the input image in
at each position, pick a random value from noise (my_rand(int n) returns a random number in [0..n-1]
sum the pixel from in with the random noise value
put the summation result into out
I don't like this code because the following statement seems unavoidable:
typedef cv::Vec<unsigned char, 3> V4;
It has hard-coded two things:
The images have 3 channels
The channel depth is 8bpp
If I get this typedef wrong (e.g. wrong channel depth or wrong number of channels), then my program segfaults. I originally used typedef cv::Vec<unsigned char, 4> V4 to handle images with an arbitrary number of channels (the max OpenCV supports is 4), but this caused a segfault.
Is there any way I can avoid hard-coding the two things above? Ideally, I want something that's as generic as:
typedef cv::Vec<in.type(), in.size()> V4;

I know this comes late. However, the real solution to your problem is to use OpenCV functionality to do what you want to do.
create noise vector as you do already (or use the functions that OpenCV provides hint!)
shuffle noise vector so you don't need individual noise_index for each pixel; or create vector of randomised noise beforehand
build a matrix header around your shuffled/random vector: cv::Mat_<double>(noise);
use matrix operations for computation: out = in + noise; or cv::add(in, noise, out);
PROFIT!
Another advantage of this method is that OpenCV might employ multithreading, SSE or whatever to speed-up this massive-element operation, which you do not. Your code is simpler, cleaner, and OpenCV does all the nasty type handling for you.

The problem is that you need determine to determine type and number of channels at runtime, but templates need the information at compile time. You can avoid hardcoding the number of channels by either using cv::split and cv::merge, or by changing the iteration to
for(int row = 0; row < in.rows; ++row) {
unsigned char* inp = in.ptr<unsigned char>(row);
unsigned char* outp = out.ptr<unsigned char>(row);
for (int col = 0; col < in.cols; ++col) {
for (int c = 0; c < in.channels(); ++c) {
*outp++ = *inp++ + noise();
}
}
}
If you want to get rid of the dependance of the type, I'd suggest putting the above in a templated function and calling that from your function, depending on the type of the matrix.

They are hardcoded because performance is better that way.
In OpenCV1.x there is cvGet2D() , which can be used here since Mat can be casted as an IplImage.
But it's slow since each time you access a pixel the function will find out the type, size, etc. Specially inefficient in loops.

Related

How to access matrix data in opencv by another mat with locations (indexing)

Suppose I have a Mat of indices (locations) called B, We can say that this Mat has dimensions of 1 x 100 and We suppose to have another Mat, called A, full of data of the same dimensions of B.
Now, I would access to the data of A with B. Usually I would create a for loop and I would take for each elements of B, the right elements of A. For the most fussy of the site, this is the code that I would write:
for(int i=0; i < B.cols; i++){
int index = B.at<int>(0, i);
std::cout<<A.at<int>(0, index)<<std:endl;
}
Ok, now that I showed you what I could do, I ask you if there is a way to access the matrix A, always using the B indices, in a more intelligent and fast way. As someone could do in python thanks to the numpy.take() function.
This operation is called remapping. In OpenCV, you can use function cv::remap for this purpose.
Below I present the very basic example of how remap algorithm works; please note that I don't handle border conditions in this example, but cv::remap does - it allows you to use mirroring, clamping, etc. to specify what happens if the indices exceed the dimensions of the image. I also don't show how interpolation is done; check the cv::remap documentation that I've linked to above.
If you are going to use remapping you will probably have to convert indices to floating point; you will also have to introduce another array of indices that should be trivial (all equal to 0) if your image is one-dimensional. If this starts to represent a problem because of performance, I'd suggest you implement the 1-D remap equivalent yourself. But benchmark first before optimizing, of course.
For all the details, check the documentation, which covers everything you need to know to use te algorithm.
cv::Mat<float> remap_example(cv::Mat<float> image,
cv::Mat<float> positions_x,
cv::Mat<float> positions_y)
{
// sizes of positions arrays must be the same
int size_x = positions_x.cols;
int size_y = positions_x.rows;
auto out = cv::Mat<float>(size_y, size_x);
for(int y = 0; y < size_y; ++y)
for(int x = 0; x < size_x; ++x)
{
float ps_x = positions_x(x, y);
float ps_y = positions_y(x, y);
// use interpolation to determine intensity at image(ps_x, ps_y),
// at this point also handle border conditions
// float interpolated = bilinear_interpolation(image, ps_x, ps_y);
out(x, y) = interpolated;
}
return out;
}
One fast way is to use pointer for both A (data) and B (indexes).
const int* pA = A.ptr<int>(0);
const int* pIndexB = B.ptr<int>(0);
int sum = 0;
for(int i = 0; i < Bi.cols; ++i)
{
sum += pA[*pIndexB++];
}
Note: Be carefull with pixel type, in this case (as you write in your code) is int!
Note2: Using cout for each point access put the optimization useless!
Note3: In this article Satya compare four methods for pixel access and fastest seems "foreach": https://www.learnopencv.com/parallel-pixel-access-in-opencv-using-foreach/

Opencv obatin certain pixel RGB value based on mask

My title may not be clear enough, but please look carefully on the following description.Thanks in advance.
I have a RGB image and a binary mask image:
Mat img = imread("test.jpg")
Mat mask = Mat::zeros(img.rows, img.cols, CV_8U);
Give some ones to the mask, assume the number of ones is N. Now the nonzero coordinates are known, based on these coordinates, we can surely obtain the corresponding pixel RGB value of the origin image.I know this can be accomplished by the following code:
Mat colors = Mat::zeros(N, 3, CV_8U);
int counter = 0;
for (int i = 0; i < mask.rows; i++)
{
for (int j = 0; j < mask.cols; j++)
{
if (mask.at<uchar>(i, j) == 1)
{
colors.at<uchar>(counter, 0) = img.at<Vec3b>(i, j)[0];
colors.at<uchar>(counter, 1) = img.at<Vec3b>(i, j)[1];
colors.at<uchar>(counter, 2) = img.at<Vec3b>(i, j)[2];
counter++;
}
}
}
And the coords will be as follows:
enter image description here
However, this two layer of for loop costs too much time. I was wondering if there is a faster method to obatin colors, hope you guys can understand what I was trying to convey.
PS:If I can use python, this can be done in only one sentence:
colors = img[mask == 1]
The .at() method is the slowest way to access Mat values in C++. Fastest is to use pointers, but best practice is an iterator. See the OpenCV tutorial on scanning images.
Just a note, even though Python's syntax is nice for something like this, it still has to loop through all of the elements at the end of the day---and since it has some overhead before this, it's de-facto slower than C++ loops with pointers. You necessarily need to loop through all the elements regardless of your library, you're doing comparisons with the mask for every element.
If you are flexible with using any other open source library using C++, try Armadillo. You can do all linear algebra operations with it and also, you can reduce above code to one line(similar to your Python code snippet).
Or
Try findNonZero()function and find all coordinates in image containing non-zero values. Check this: https://stackoverflow.com/a/19244484/7514664
Compile with optimization enabled, try profiling this version and tell us if it is faster:
vector<Vec3b> colors;
if (img.isContinuous() && mask.isContinuous()) {
auto pimg = img.ptr<Vec3b>();
for (auto pmask = mask.datastart; pmask < mask.dataend; ++pmask, ++pimg) {
if (*pmask)
colors.emplace_back(*pimg);
}
}
else {
for (int r = 0; r < img.rows; ++r) {
auto prowimg = img.ptr<Vec3b>(r);
auto prowmask = img.ptr(r);
for (int c = 0; c < img.cols; ++c) {
if (prowmask[c])
colors.emplace_back(prowimg[c]);
}
}
}
If you know the size of colors, reserve the space for it beforehand.

Add 1 to vector<unsigned char> value - Histogram in C++

I guess it's such an easy question (I'm coming from Java), but I can't figure out how it works.
I simply want to increment an vector element by one. The reason for this is, that I want to compute a histogram out of image values. But whatever I try I just can accomplish to assign a value to the vector. But not to increment it by one!
This is my histogram function:
void histogram(unsigned char** image, int height,
int width, vector<unsigned char>& histogramArray) {
for (int i = 0; i < width; i++) {
for (int j = 0; j < height; j++) {
// histogramArray[1] = (int)histogramArray[1] + (int)1;
// add histogram position by one if greylevel occured
histogramArray[(int)image[i][j]]++;
}
}
// display output
for (int i = 0; i < 256; i++) {
cout << "Position: " << i << endl;
cout << "Histogram Value: " << (int)histogramArray[i] << endl;
}
}
But whatever I try to add one to the histogramArray position, it leads to just 0 in the output. I'm only allowed to assign concrete values like:
histogramArray[1] = 2;
Is there any simple and easy way? I though iterators are hopefully not necesarry at this point, because I know the exakt index position where I want to increment something.
EDIT:
I'm so sorry, I should have been more precise with my question, thank you for your help so far! The code above is working, but it shows a different mean value out of the histogram (difference of around 90) than it should. Also the histogram values are way different than in a graphic program - even though the image values are exactly the same! Thats why I investigated the function and found out if I set the histogram to zeros and then just try to increase one element, nothing happens! This is the commented code above:
for (int i = 0; i < width; i++) {
for (int j = 0; j < height; j++) {
histogramArray[1]++;
// add histogram position by one if greylevel occured
// histogramArray[(int)image[i][j]]++;
}
}
So the position 1 remains 0, instead of having the value height*width. Because of this, I think the correct calculation histogramArray[image[i][j]]++; is also not working properly.
Do you have any explanation for this? This was my main question, I'm sorry.
Just for completeness, this is my mean function for the histogram:
unsigned char meanHistogram(vector<unsigned char>& histogram) {
int allOccurences = 0;
int allValues = 0;
for (int i = 0; i < 256; i++) {
allOccurences += histogram[i] * i;
allValues += histogram[i];
}
return (allOccurences / (float) allValues) + 0.5f;
}
And I initialize the image like this:
unsigned char** image= new unsigned char*[width];
for (int i = 0; i < width; i++) {
image[i] = new unsigned char[height];
}
But there shouldn't be any problem with the initialization code, since all other computations work perfectly and I am able to manipulate and safe the original image. But it's true, that I should change width and height - since I had only square images it didn't matter so far.
The Histogram is created like this and then the function is called like that:
vector<unsigned char> histogramArray(256);
histogram(array, adaptedHeight, adaptedWidth, histogramArray);
So do you have any clue why this part histogramArray[1]++; don't increases my histogram? histogramArray[1] remains 0 all the time! histogramArray[1] = 2; is working perfectly. Also histogramArray[(int)image[i][j]]++; seems to calculate something, but as I said, I think it's wrongly calculating.
I appreciate any help very much! The reason why I used a 2D Array is simply because it is asked for. I like the 1D version also much more, because it's way simpler!
You see, the current problem in your code is not incrementing a value versus assigning to it; it's the way you index your image. The way you've written your histogram function and the image access part puts very fine restrictions on how you need to allocate your images for this code to work.
For example, assuming your histogram function is as you've written it above, none of these image allocation strategies will work: (I've used char instead of unsigned char for brevity.)
char image [width * height]; // Obvious; "char[]" != "char **"
char * image = new char [width * height]; // "char*" != "char **"
char image [height][width]; // Most surprisingly, this won't work either.
The reason why the third case won't work is tough to explain simply. Suffice it to say that a 2D array like this will not implicitly decay into a pointer to pointer, and if it did, it would be meaningless. Contrary to what you might read in some books or hear from some people, in C/C++, arrays and pointers are not the same thing!
Anyway, for your histogram function to work correctly, you have to allocate your image like this:
char** image = new char* [height];
for (int i = 0; i < height; ++i)
image[i] = new char [width];
Now you can fill the image, for example:
for (int i = 0; i < height; ++i)
for (int j = 0; j < width; ++j)
image[i][j] = rand() % 256; // Or whatever...
On an image allocated like this, you can call your histogram function and it will work. After you're done with this image, you have to free it like this:
for (int i = 0; i < height; ++i)
delete[] image[i];
delete[] image;
For now, that's enough about allocation. I'll come back to it later.
In addition to the above, it is vital to note the order of iteration over your image. The way you've written it, you iterate over your columns on the outside, and your inner loop walks over the rows. Most (all?) image file formats and many (most?) image processing applications I've seen do it the other way around. The memory allocations I've shown above also assume that the first index is for the row, and the second is for the column. I suggest you do this too, unless you've very good reasons not to.
No matter which layout you choose for your images (the recommended row-major, or your current column-major,) it is in issue that you should always keep in your mind and take notice of.
Now, on to my recommended way of allocating and accessing images and calculating histograms.
I suggest that you allocate and free images like this:
// Allocate:
char * image = new char [height * width];
// Free:
delete[] image;
That's it; no nasty (de)allocation loops, and every image is one contiguous block of memory. When you want to access row i and column j (note which is which) you do it like this:
image[i * width + j] = 42;
char x = image[i * width + j];
And you'd calculate the histogram like this:
void histogram (
unsigned char * image, int height, int width,
// Note that the elements here are pixel-counts, not colors!
vector<unsigned> & histogram
) {
// Make sure histogram has enough room; you can do this outside as well.
if (histogram.size() < 256)
histogram.resize (256, 0);
int pixels = height * width;
for (int i = 0; i < pixels; ++i)
histogram[image[i]]++;
}
I've eliminated the printing code, which should not be there anyway. Note that I've used a single loop to go through the whole image; this is another advantage of allocating a 1D array. Also, for this particular function, it doesn't matter whether your images are row-major or column major, since it doesn't matter in what order we go through the pixels; it only matters that we go through all the pixels and nothing more.
UPDATE: After the question update, I think all of the above discussion is moot and notwithstanding! I believe the problem could be in the declaration of the histogram vector. It should be a vector of unsigned ints, not single bytes. Your problem seems to be that the value of the vector elements seem to stay at zero when your simplify the code and increment just one element, and are off from the values they need to be when you run the actual code. Well, this could be a symptom of numeric wrap-around. If the number of pixels in your image are a a multiple of 256 (e.g. 32x32 or 1024x1024 image) then it is natural that the sum of their number would be 0 mod 256.
I've already alluded to this point in my original answer. If you read my implementation of the histogram function, you see in the signature that I've declared my vector as vector<unsigned> and have put a comment above it that says this victor counts pixels, so its data type should be suitable.
I guess I should have made it bolder and clearer! I hope this solves your problem.

How to improve sorting pixels in cvMat?

I am trying to sort pixel values of an image (example 80x20) from lowest to highest.
Below is the some code:
bool sortPixel(int first, int second)
{
return (first < second);
}
vector<int>vect_sortPixel;
for(int y=0; y<height; y++)
{
for(int x=0; x<width; x++)
{
vect_sortPixel.push_back(cvGetReal2D(srcImg, y, x));
sort(vect_sortPixel.begin(), vect_sortPixel.end(), sortPixel);
}
}
But it takes quite long time to compute. Any suggestion to reduce the processing time?
Thank you.
Don't use getReal2D. It's quite slow.
Convert image to cv::Mat or Mat. Use its data pointer to get the pixel values. Mat.data() will give you pointer to the original matrix. Use that.
And as far as sorting is concerned, I would advise you to first make an array of all the pixels, then sort it using Merge sort (time complexity O(n log n))
#include<opencv2/highgui/highgui.hpp>
#include<stdio.h>
using namespace cv;
using namespace std;
int main()
{
Mat img = imread("filename.jpg",CV_LOAD_IMAGE_COLOR);
unsigned char *input = (unsigned char*)(img.data);
int i,j,r,g,b;
for(int i = 0;i < img.cols;i++){
for(int j = 0;j < img.rows;j++){
b = input[img.cols * j + i] ;
g = input[img.cols * j+ i + 1];
r = input[img.cols *j + i +2];
}
}
return 0;
}
Using this you can access pixel values from the main matrix.
Warning: This is not how you compare it. I'm suggesting that by using something like this, you can access pixel values.
Mat.data() gives you pointer to the original matrix. This matrix is a 1 D matrix with all the given pixel values.
Image => (x,y,z),(x1,y1,z1), etc..
Mat(original matrix) => x,y,z,x1,y1,z1,...
If you still have some doubts regarding how to extract data from Mat, visit this link OpenCV get pixel channel value from Mat image
and here's a link regarding Merge Sort http://www.cplusplus.happycodings.com/Algorithms/code17.html
There are few problems in your code:
As Froyo already said you use cvGetReal2D which is actually not very fast. You have to convert your cvMat to cv::Mat. To do this there's cv::Mat constructor:
// converts old-style CvMat to the new matrix; the data is not copied by default
Mat(const CvMat* m, bool copyData=false);
And after this use direct pixels acces as mentioned in this SO question.
Another problem is that you use push_back which actually also not very fast. You know the size of array, so why don't you allocate needed memory at the beginning? Like this:
vector<int> vect_sortPixel(mat.cols*mat.rows);
And than just use vect_sortPixel[i] to get needed pixel.
Why do you call sort in the loop? You have to call it after loop, when array is already created! Default STL's sort should work fast:
Complexity
Approximately N*logN comparisons on average (where N is
last-first). In the worst case, up to N^2, depending on specific
sorting algorithm used by library implementation.

Variable block size sum of absolute difference calculation in C++

I would like to perform a variable block size sum of absolute difference calculation with a 2-D array of 16 bit integers in a C++ program as efficiently as possible. I am interested in a real time block matching code. I was wondering if there were any software libraries available to do this? The code is running on windows XP and I'm stuck using Visual Studio 2010 to do the compiling. The CPU is a 2-core AMD Athlon 64 x2 4850e.
By variable block size sum of absolute difference(SAD) calculation I mean the following.
I have one smaller 2-D array I will call the template_grid, and one larger 2-D array I will call the image. I want to find the region in the image that minimizes the sum of the absolute difference between the pixels in the template and the pixels in the region in the image.
The simplest way to calculate the SAD in C++ if would be the following:
for(int shiftY = 0; shiftY < rangeY; shiftY++) {
for(int shiftX = 0; shiftX < rangeX; shiftX++) {
for(int x = 0; x < lenTemplateX; x++) {
for(int y = 0; y < lenTemplateY; y++) {
SAD[shiftY][shiftX]=abs(template_grid[x][y] - image[y + shiftY][x + shiftX]);
}
}
}
}
The SAD calculation for specific array sizes has been optimized in the Intel performance primitives library. However, the arrays I'm working with don't fit the sizes in these libraries.
There are two search ranges I work with,
a large range: rangeY = 45, rangeX = 10
a small range: rangeY = 4, rangeX = 2
There is only one template size and it is:
lenTemplateY = 61, lenTemplateX = 7
Minor optimisation:
for(int shiftY = 0; shiftY < rangeY; shiftY++) {
for(int shiftX = 0; shiftX < rangeX; shiftX++) {
// if you can assume SAD is already filled with 0-es,
// you don't need the next line
SAD[shiftX][shiftY]=0;
for(int tx = 0, imx=shiftX; x < lenTemplateX; tx++,imx++) {
for(int ty = 0, imy=shiftY; y < lenTemplateY; ty++,imy++) {
// two increments of imx/imy may be cheaper than
// two addition with offsets
SAD[shiftY][shiftX]+=abs(template_grid[tx][ty] - image[imx][imy]);
}
}
}
}
Loop unrolling using C++ templates
May be a crazy idea for your configuration (C++ compiler worries me), but it may work. I offer no warranties, but give it a try.
The idea may work because your template_grid sizes and the ranges are constant - thus known at compilation time.Also, for this to work, your image and template_grid must be organised with the same layout (column first or row first) - the way your "sample code" is depicted in the question mixes the SAD x/y with template_grid y/x.
In the followings, I'll assume a "column first" organisation, so that SAD[ix] denotes the ixth column of your SAD** matrix. The code goes just the same for "row first", except the name of the variables won't match the meaning of your value arrays.
So, let's start:
template <
typename sad_type, typename val_type,
size_t template_len
> struct sad1D_simple {
void operator()(
const val_type* img, const val_type* templ,
sad_type& result
) {
// template specialization recursion, with one less element to add
sad1D_simple<sad_type, val_type, template_len-1> one_shorter;
// call it incrementing the img and template offsets
one_shorter(img+1, templ+1, result);
// the add the contribution of the first diff we skipped over above
result+=abs(*(img+template_len-1)-*(templ+template_len-1));
}
};
// at len of 0, the result is zero. We need it to stop the
template <
typename sad_type, typename val_type
>
struct sad1D_simple<sad_type, val_type, 0> {
void operator()(
const val_type* img, const val_type* templ,
sad_type& result
) {
result=0;
}
};
Why a functor struct - struct with operator? The C++ doesn't allow partial specialization of function templates.
What the sad1D_simple does: unrolls a for cycle that computes the SAD of two arrays in input without any offsetting, based on the fact that the length of your template_grid array is a constant known at compile time. It's in the same vein as "computing the factorial of compile time using C++ templates"
How this helps?
Example of use in the code below:
typedef ulong SAD_t;
typedef int16_t pixel_val_t;
const size_t lenTemplateX = 7; // number of cols in the template_grid
const size_t lenTemplateY = 61;
const size_t rangeX=10, rangeY=45;
pixel_val_t **image, **template_grid;
SAD_t** SAD;
// assume those are initialized somehow
for(size_t tgrid_col=0; tgrid_col<lenTemplateX; tgrid_col++) {
pixel_val_t* template_col=template_grid[tgrid_col];
// the X axis - horizontal - is the column axis, right?
for(size_t shiftX=0; shiftX < rangeX; shiftX++) {
pixel_val_t* img_col=image[shiftX];
for(size_t shiftY = 0; shiftY < rangeY; shiftY++) {
// the Y axis - vertical - is the "offset in a column"=row, isn't it?
pixel_val_t* img_col_offsetted=img_col+shiftY;
// this functor is made by recursive specialization
// there's no cycle inside it, it was unrolled into
// lenTemplateY individual subtractions, abs-es and additions
sad1D_simple<SAD_t, pixel_val_t, lenTemplateY> calc;
calc(img_col_offsetted, template_col, SAD[shiftX][shiftY]);
}
}
}
Mmmm... can we do better? No, it won't be the X-axis unrolling, we still want to stay in 1D area, but... well, maybe if we create a ranged sad1D and unroll one more loop on the same axis?It will work iff the rangeX is also constant.
template <
typename sad_type, typename val_type,
size_t range, size_t template_len
> struct sad1D_ranged {
void operator()(
const val_type* img, const val_type* templ,
// result is assumed to have at least `range` slots
sad_type* result
) {
// we'll compute here the first slot of the result
sad1D_simple<sad_type, val_type, template_len>
calculator_for_first_sad;
calculator_for_first_sad(img, templ, *(result));
// now, ask for a recursive specialization for
// the next (range-1) sad-s
sad1D_ranged<sad_type, val_type, range-1, template_len>
one_less_in_range;
// when calling, pass the shifted img and result
one_less_in_range(img+1, templ, result+1);
}
};
// for a range of 0, there's nothing to do, but we need it
// to stop the template specialization recursion
template <
typename sad_type, typename val_type,
size_t template_len
> struct sad1D_ranged<sad_type, val_type, 0, template_len> {
void operator()(
const val_type* img, const val_type* templ,
// result is assumed to have at least `range` slots
sad_type* result
) {
}
};
And here's how you use it:
for(size_t tgrid_col=0; tgrid_col<lenTemplateX; tgrid_col++) {
pixel_val_t* template_col=template_grid[tgrid_col];
for(size_t shiftX=0; shiftX < rangeX; shiftX++) {
pixel_val_t* img_col=image[shiftX];
SAD_t* sad_col=SAD[shiftX];
sad1D_ranged<SAD_t, pixel_val_t, rangeY, lenTemplateY> calc;
calc(img_col, template_col, sad_col);
}
}
Yes... but the question is: will this improve performance?
The heck if I know. For small number of loops within a cycle and for strong data locality (values close one to the other so that they are in the CPU caches), loop unrolling should improve the performance. For a larger number of loops, you may negatively interfere with the CPU branch prediction and other mumbo-jumbo-I-know-may-impact-performance-but-I-don't-know-how.
Feeling of guts: even if the same unrolling technique may work for the other two loops, using it may well result in a degradation of performance: we'll need to jump from one contiguous vector (an image column) to the other - the entire image may not fit into the CPU cache.
Note: if your template_grid data is constant as well (or you have a finite set of constant template grids), one may take one step further and create struct functors with dedicated masks. But I'm out of steam for today.
you could try with OpenCV template matching with the square difference parameter see the tutorial here. OpenCV is optimized with OpenCL but i don't know for this specific function. I think you should give it a try.
I'm not sure how much you are restricted to using SAD, or if you are generally interested in finding the region in the image that matches the template the best. In the last case, you can use a convolution instead of SAD. This can be solved in the Fourier domain in O(N log N), including the Fourier transform (FFT).
In short, you can use the FFT (for example using http://www.fftw.org/) to convert both the template and the image to the frequency domain, then multiply them, and convert back to the time domain.
Of course, this is all irrelevant if you are bound to using SAD.