Total Energy of a signal - c++

I have been given this equation to calculate the total energy of a signal:
E_x = \sum_{n} |x[n]|^2
Which to me suggests that you square each sample in the block and then sum over the whole block. Is the code I have written correct for this equation, and is it the most efficient way to do it?
double totalEnergy(vector<double> data, const int rows, const int cols)
{
    vector<double> temp;
    double energy = 0;
    for(int i=0; (i < 2); i++)
    {
        for(int j=0; (j < 2); j++)
        {
            temp.push_back( (data[i*2+j]*data[i*2+j]) );
        }
    }
    energy = accumulate (temp.begin(), temp.begin()+(rows*cols), 0);
    return energy;
}
int main(int argc, char *argv[]) {
    vector<double> data;
    totalEnergy(data, 2, 2);
}
Result: 64
Any help / advice would be greatly appreciated :)!

This is certainly not the most efficient way to do this computation although I think the implementation is nearly correct: the multiplication by n somehow got lost, though. Since I can't see what the sum index and bounds are I'm not going to fix this but I'll reproduce the results of the implementation, just "better". There are two obvious points which can be improved:
The code doesn't make any use of being in a particular column or row. That is, it could as well just consider the input as a flat array whose size is actually known from the size of the input vector.
The function uses two temporary vectors (the one passed to the function and one inside the function). Creating a std::vector<T> needs to allocate memory which isn't a cheap operation.
As a first approximation, I would transform the input vector in-place and then accumulate the result:
double square(double value) {
    return value * value;
}
double totalEnergy(std::vector<double> data) {
    std::transform(data.begin(), data.end(), data.begin(), &square);
    // 0.0, not 0: the type of the initial value is the accumulator type
    return std::accumulate(data.begin(), data.end(), 0.0);
}
The function still makes a copy of the data and modifies it. I don't like this. Oddly enough, the operation you implemented is basically an inner product of a vector with itself, i.e., this yields the same result without creating an extra vector:
double totalEnergy(std::vector<double> const& data) {
    // 0.0 keeps the sum in double; an int 0 would truncate it
    return std::inner_product(data.begin(), data.end(), data.begin(), 0.0);
}
Assuming this implements the correct formula (although I'm still suspicious of the n in the original formula), this is probably considerably faster. It seems to be more concise, too...
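One detail is easy to miss with std::accumulate and std::inner_product: the type of the initial value decides the type the sum is accumulated in. The following is a minimal sketch of that point, not a definitive implementation; the names total_energy and total_energy_truncated are made up for illustration.

```cpp
#include <numeric>
#include <vector>

// Sum of squared samples: E_x = sum over n of x[n]^2.
// The initial value 0.0 is a double, so the sum is accumulated in double.
double total_energy(const std::vector<double>& data)
{
    return std::inner_product(data.begin(), data.end(), data.begin(), 0.0);
}

// Same call with an int initial value, kept only to show the pitfall:
// every partial sum is converted back to int and loses its fraction.
int total_energy_truncated(const std::vector<double>& data)
{
    return std::inner_product(data.begin(), data.end(), data.begin(), 0);
}
```

For four samples of 0.5 the first function returns 1.0 while the second returns 0, because each partial sum of 0.25 is truncated before the next term is added.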


Having a hard time figuring out logic behind array manipulation

I am given a filled array of size WxH and need to create a new array by scaling both the width and the height by a power of 2. For example, 2x3 becomes 8x12 when scaled by 4, 2^2. My goal is to make sure all the old values in the array are placed in the new array such that 1 value in the old array fills up multiple new corresponding parts in the scaled array. For example:
old_array = [[1,2],
             [3,4]]
new_array = [[1,1,2,2],
             [1,1,2,2],
             [3,3,4,4],
             [3,3,4,4]]
when scaled by a factor of 2. Could someone explain to me the logic on how I would go about programming this?
It's actually very simple. I use a vector of vectors for simplicity, noting that a vector of vectors is not an efficient 2D matrix. Any 2D matrix class that supports [] indexing syntax can be substituted, and for efficiency should be.
#include <vector>
using std::vector;
int main()
{
    vector<vector<int>> vin{ {1,2},{3,4},{5,6} };
    size_t scaleW = 2;
    size_t scaleH = 3;
    vector<vector<int>> vout(scaleH * vin.size(), vector<int>(scaleW * vin[0].size()));
    for (size_t i = 0; i < vout.size(); i++)
        for (size_t ii = 0; ii < vout[0].size(); ii++)
            vout[i][ii] = vin[i / scaleH][ii / scaleW];
    auto x = vout[8][3]; // last element, should be 6
}
Here is my take. It is very similar to @Tudor's, but I figure between our two, you can pick what you like or understand best.
First, let's define a suitable 2D array type because C++'s standard library is very lacking in this regard. I've limited myself to a rather simple struct, in case you don't feel comfortable with object oriented programming.
#include <vector>
// using std::vector
struct Array2d
{
    unsigned rows, cols;
    std::vector<int> data;
};
This print function should give you an idea how the indexing works:
#include <cstdio>
// using std::putchar, std::printf, std::fputs
void print(const Array2d& arr)
{
    std::putchar('[');
    for(std::size_t row = 0; row < arr.rows; ++row) {
        std::putchar('[');
        for(std::size_t col = 0; col < arr.cols; ++col)
            std::printf("%d, ", arr.data[row * arr.cols + col]);
        std::fputs("]\n ", stdout);
    }
    std::fputs("]\n", stdout);
}
Now to the heart, the array scaling. The amount of nesting is … bothersome.
Array2d scale(const Array2d& in, unsigned rowfactor, unsigned colfactor)
{
    Array2d out;
    out.rows = in.rows * rowfactor;
    out.cols = in.cols * colfactor;
    out.data.resize(out.rows * out.cols);
    for(std::size_t inrow = 0; inrow < in.rows; ++inrow) {
        for(unsigned rowoff = 0; rowoff < rowfactor; ++rowoff) {
            std::size_t outrow = inrow * rowfactor + rowoff;
            for(std::size_t incol = 0; incol < in.cols; ++incol) {
                std::size_t in_idx = inrow * in.cols + incol;
                int inval = in.data[in_idx];
                for(unsigned coloff = 0; coloff < colfactor; ++coloff) {
                    std::size_t outcol = incol * colfactor + coloff;
                    std::size_t out_idx = outrow * out.cols + outcol;
                    out.data[out_idx] = inval;
                }
            }
        }
    }
    return out;
}
Let's pull it all together for a little demonstration:
int main()
{
    Array2d in;
    in.rows = 2;
    in.cols = 3;
    in.data.resize(in.rows * in.cols);
    for(std::size_t i = 0; i < in.rows * in.cols; ++i)
        in.data[i] = static_cast<int>(i);
    print(in);
    print(scale(in, 3, 2));
}
This prints
[[0, 1, 2, ]
 [3, 4, 5, ]
 ]
[[0, 0, 1, 1, 2, 2, ]
 [0, 0, 1, 1, 2, 2, ]
 [0, 0, 1, 1, 2, 2, ]
 [3, 3, 4, 4, 5, 5, ]
 [3, 3, 4, 4, 5, 5, ]
 [3, 3, 4, 4, 5, 5, ]
 ]
To be honest, I'm not strong at algorithms, but I gave it a shot.
I am not sure whether this can be done with only one matrix, or with a lower time complexity.
Edit: You can estimate the number of operations as W*H*S*S, where S is the scale factor, W is the width and H is the height of the input matrix.
I used two matrices, m and r, where m is your input and r is your result. All that needs to be done is to copy each element of m at position [i][j] into a square block of scale_factor x scale_factor elements with the same value inside r.
Simply put:
int main()
{
    Matrix<int> m(2, 2);
    // initial values in your example
    m[0][0] = 1;
    m[0][1] = 2;
    m[1][0] = 3;
    m[1][1] = 4;
    // pick some scale factor and create the new matrix
    unsigned long scale = 2;
    Matrix<int> r(m.rows*scale, m.columns*scale);
    // i know this is bad but it is the most
    // straightforward way of doing this
    // it is also the only way i can think of :(
    for(unsigned long i1 = 0; i1 < m.rows; i1++)
        for(unsigned long j1 = 0; j1 < m.columns; j1++)
            for(unsigned long i2 = i1*scale; i2 < (i1+1)*scale; i2++)
                for(unsigned long j2 = j1*scale; j2 < (j1+1)*scale; j2++)
                    r[i2][j2] = m[i1][j1];
    // the output in your example
    r.Print();
    std::cout << "\n\n";
    return 0;
}
I do not think it is relevant to the question, but I used a class Matrix to store all the elements of the extended matrix. I know it is a distraction, but this is still C++ and we have to manage memory. What you are trying to achieve with this algorithm needs a lot of memory if the scale_factor is big, so I wrapped it up using this:
template <typename type_t>
class Matrix
{
private:
    type_t** Data;
public:
    // should be private and have Getters but
    // that would make the code larger...
    unsigned long rows;
    unsigned long columns;
    // 2d Arrays get big pretty fast with what you are
    // trying to do.
    Matrix(unsigned long rows, unsigned long columns)
    {
        this->rows = rows;
        this->columns = columns;
        Data = new type_t*[rows];
        for(unsigned long i = 0; i < rows; i++)
            Data[i] = new type_t[columns];
    }
    // It is true, a copy constructor is needed
    // as HolyBlackCat pointed out
    Matrix(const Matrix& m)
    {
        rows = m.rows;
        columns = m.columns;
        Data = new type_t*[rows];
        for(unsigned long i = 0; i < rows; i++)
        {
            Data[i] = new type_t[columns];
            for(unsigned long j = 0; j < columns; j++)
                Data[i][j] = m.Data[i][j]; // operator[] is non-const, so read Data directly
        }
    }
    ~Matrix()
    {
        for(unsigned long i = 0; i < rows; i++)
            delete [] Data[i];
        delete [] Data;
    }
    void Print()
    {
        for(unsigned long i = 0; i < rows; i++)
        {
            for(unsigned long j = 0; j < columns; j++)
                std::cout << Data[i][j] << " ";
            std::cout << "\n";
        }
    }
    type_t* operator [] (unsigned long row)
    {
        return Data[row];
    }
};
First of all, having a suitable 2D matrix class is presumed but not the question. But I don't know the API of yours, so I'll illustrate with something typical:
struct coord {
    size_t x; // x position or column count
    size_t y; // y position or row count
};
template <typename T>
class Matrix2D {
    ⋮ // implementation details
public:
    ⋮ // all needed special members (ctors dtor, assignment)
    Matrix2D (coord dimensions);
    coord dimensions() const; // return height and width
    const T& cell (coord position) const; // read-only access
    T& cell (coord position); // read-write access
    // handy synonym:
    const T& operator[](coord position) const { return cell(position); }
    T& operator[](coord position) { return cell(position); }
};
I just showed the public members I need: create a matrix with a given size, query the size, and indexed access to the individual elements.
So, given that, your problem description is:
template<typename T>
Matrix2D<T> scale_pow2 (const Matrix2D<T>& input, size_t pow)
{
    const auto scale_factor = size_t{1} << pow; // size_t, so the braces below do not narrow
    const auto size_in = input.dimensions();
    Matrix2D<T> result ({size_in.x*scale_factor, size_in.y*scale_factor});
    ⋮ // fill up result
    return result;
}
OK, so now the problem is precisely defined: what code goes in the big blank immediately above?
Each cell in the input gets put into a bunch of cells in the output. So you can either iterate over the input and write a clump of cells in the output all having the same value, or you can iterate over the output and each cell you need the value for is looked up in the input.
The latter is simpler since you don't need a nested loop (or pair of loops) to write a clump.
for (coord outpos : /* ?? every cell of the output ?? */) {
    coord frompos {
        outpos.x >> pow,
        outpos.y >> pow };
    result[outpos] = input[frompos];
}
Now that's simple!
Calculating the "from" position for a given output position must match the way the scale was defined: the low pow bits give the position relative to the clump, and the higher bits are the index of the input cell that the clump came from.
Now, we want to set outpos to every legal position in the output matrix indexes. That's what I need. How to actually do that is another sub-problem and can be pushed off with top-down decomposition.
a bit more advanced
Maybe nested loops is the easiest way to get that done, but I won't put those directly into this code, pushing my nesting level even deeper. And looping 0..max is not the simplest thing to write in bare C++ without libraries, so that would just be distracting. And, if you're working with matrices, this is something you'll have a general need for, including (say) printing out the answer!
So here's the double-loop, put into its own code:
struct all_positions {
    coord current {0,0};
    coord end;
    all_positions (coord end) : end{end} {}
    bool next() {
        if (++current.x < end.x) return true; // not reached the end yet
        current.x = 0; // reset to the start of the row
        if (++current.y < end.y) return true;
        return false; // I don't have a valid position now.
    }
};
This does not follow the iterator/collection API that you could use in a range-based for loop. For information on how to do that, see my article on Code Project or use the Ranges stuff in the C++20 standard library.
Given this "old fashioned" iteration helper, I can write the loop as:
all_positions scanner {result.dimensions()}; // starts at {0,0}
const auto& outpos= scanner.current;
do {
    ⋮ // body of the loop shown earlier, using outpos
} while (scanner.next());
Because of the simple implementation, it starts at {0,0} and advancing it also tests at the same time, and it returns false when it can't advance any more. Thus, you have to declare it (gives the first cell), use it, then advance&test. That is, a test-at-the-end loop. A for loop in C++ checks the condition before each use, and advances at the end, using different functions. So, making it compatible with the for loop is more work, and surprisingly making it work with the ranged-for is not much more work. Separating out the test and advance the right way is the real work; the rest is just naming conventions.
As long as this is "custom", you can further modify it for your needs. For example, add a flag inside to tell you when the row changed, or that it's the first or last of a row, to make it handy for pretty-printing.
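To see the pieces working together, here is a minimal self-contained sketch under stated assumptions. The Matrix2D below is a hypothetical stand-in (a row-major std::vector with only the members the algorithm needs), not the matrix class from the question, and the scanner is the "old fashioned" helper described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct coord {
    std::size_t x;  // column
    std::size_t y;  // row
};

// Hypothetical minimal matrix with row-major storage (an assumption
// made for this sketch only).
template <typename T>
class Matrix2D {
    coord dims_;
    std::vector<T> cells_;
public:
    explicit Matrix2D(coord dimensions)
        : dims_(dimensions), cells_(dimensions.x * dimensions.y) {}
    coord dimensions() const { return dims_; }
    const T& operator[](coord p) const { return cells_[p.y * dims_.x + p.x]; }
    T& operator[](coord p) { return cells_[p.y * dims_.x + p.x]; }
};

// Starts at {0,0}; next() advances and reports whether the new
// position is still inside the matrix.
struct all_positions {
    coord current{0, 0};
    coord end;
    explicit all_positions(coord e) : end(e) {}
    bool next() {
        if (++current.x < end.x) return true;
        current.x = 0;
        return ++current.y < end.y;
    }
};

template <typename T>
Matrix2D<T> scale_pow2(const Matrix2D<T>& input, std::size_t pow)
{
    const coord in = input.dimensions();
    assert(in.x > 0 && in.y > 0);  // the do-while needs one valid cell
    Matrix2D<T> result(coord{in.x << pow, in.y << pow});
    all_positions scanner(result.dimensions());
    do {
        const coord out = scanner.current;
        // Dropping the low `pow` bits maps an output cell to its source.
        result[out] = input[coord{out.x >> pow, out.y >> pow}];
    } while (scanner.next());
    return result;
}
```

Scaling a 2x2 matrix holding 1, 2, 3, 4 with pow = 1 gives a 4x4 result in which each input value fills a 2x2 block.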
You need a bunch of things working in addition to the little piece of code you actually want to write. Here, it's a usable Matrix class. Very often, it's prompting for input, opening files, handling command-line options, and that kind of stuff. It distracts from the real problem, so get that out of the way first.
Write your code (the real code you came for) in its own function, separate from any other stuff you also need in order to house it. Get it elsewhere if you can; it's not part of the lesson and just serves as a distraction. Worse, it may be "hard" in ways you are not prepared for (or to do well) as it's unrelated to the actual lesson being worked on.
Figure out the algorithm (flowchart, pseudocode, whatever) in a general way before translating that to legal syntax and API on the objects you are using. If you're just learning C++, don't get bogged down in the formal syntax when you are trying to figure out the logic. Until you naturally start to think in C++ when doing that kind of planning, don't force it. Use whiteboard doodles, tinkertoys, whatever works for you.
Get feedback and review of the idea, the logic of how to make it happen, from your peers and mentors if available, before you spend time coding. Why write up an idea that doesn't work? Fix the logic, not the code.
Finally, sketch the needed control flow, functions and data structures you need. Use pseudocode and placeholder notes.
Then fill in the placeholders and replace the pseudo with the legal syntax. You already planned it out, so now you can concentrate on learning the syntax and library details of the programming language. You can concentrate on "how do I express (some tiny detail) in C++" rather than keeping the entire program in your head. More generally, isolate a part that you will be learning; be learning/practicing one thing without worrying about the entire edifice.
To a large extent, some of those ideas translate to the code as well. Top-Down Design means you state things at a high level and then implement that elsewhere, separately. It makes code readable and maintainable, as well as easier to write in the first place. Functions should be written this way: the function explains how to do (what it does) as a list of details that are just one level of detail further down. Each of those steps then becomes a new function. Functions should be short and expressed at one semantic level of abstraction. Don't dive down into the most primitive details inside the function that explains the task as a set of simpler steps.
Good luck, and keep it up!

Eigen MatrixXd push back in c++

Eigen is a well-known matrix library in C++. I am having trouble finding a built-in function to simply push an item onto the end of a matrix. Currently I know that it can be done like this:
Eigen::MatrixXd matrix(10, 3);
long int count = 0;
long int topCount = 10;
for (int i = 0; i < listLength; ++i) {
    matrix(count, 0) = list.x;
    matrix(count, 1) = list.y;
    matrix(count, 2) = list.z;
    count++;
    if (count == topCount) {
        topCount *= 2;
        matrix.conservativeResize(topCount, 3);
    }
}
matrix.conservativeResize(count, 3);
And this will work (some of the syntax may be off). But it's pretty convoluted for such a simple thing. Is there already a built-in function?
There is no such function for Eigen matrices. The reason for this is such a function would either be very slow or use excessive memory.
For a push_back function to not be prohibitively expensive it must increase the matrix's capacity by some factor when it runs out of space as you have done. However when dealing with matrices, memory usage is often a concern so having a matrix's capacity be larger than necessary could be problematic.
If it instead increased the size by rows() or cols() each time the operation would be O(n*m). Doing this to fill an entire matrix would be O(n*n*m*m) which for even moderately sized matrices would be quite slow.
Additionally, in linear algebra matrix and vector sizes are nearly always constant and known beforehand. Often when resizing a matrix you don't care about the previous values in the matrix. This is why Eigen's resize function does not retain old values, unlike std::vector's resize.
The only case I can think of where you wouldn't know the matrix's size beforehand is when reading from a file. In this case I would either load the data first into a standard container such as std::vector using push_back and then copy it into an already sized matrix, or if memory is tight run through the file once to get the size and then a second time to copy the values.
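The trade-off between growing by a factor and growing by one row can be made concrete with a small cost model. This is only a sketch: it counts how many already-stored rows a resize that keeps old values would have to copy, and it allocates no matrix. The names copied_rows, grow_by_one and grow_double are invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>

// Counts how many stored rows are copied while appending `rows` rows
// one at a time, when a full buffer of `capacity` rows is regrown to
// grow(capacity) rows and the old rows are carried over.
template <typename GrowthPolicy>
std::size_t copied_rows(std::size_t rows, GrowthPolicy grow)
{
    std::size_t capacity = 1, size = 0, copies = 0;
    for (std::size_t i = 0; i < rows; ++i) {
        if (size == capacity) {
            const std::size_t bigger = grow(capacity);
            assert(bigger > capacity);  // the policy must actually grow
            copies += size;             // every stored row is copied
            capacity = bigger;
        }
        ++size;
    }
    return copies;
}

inline std::size_t grow_by_one(std::size_t c) { return c + 1; }
inline std::size_t grow_double(std::size_t c) { return 2 * c; }
```

Appending 1000 rows copies 499500 rows with grow_by_one but only 1023 with grow_double; the price of doubling is that up to half of the final capacity may be unused until a last shrinking resize.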
There is no such function, however, you can build something like this yourself:
#include <Eigen/Dense>
#include <iostream>
using Eigen::MatrixXd;
using Eigen::Vector3d;
template <typename DynamicEigenMatrix>
void push_back(DynamicEigenMatrix& m, Vector3d&& values, std::size_t row)
{
    if(row >= m.rows()) {
        m.conservativeResize(row + 1, Eigen::NoChange);
    }
    m.row(row) = values;
}
int main()
{
    MatrixXd matrix(10, 3);
    for (std::size_t i = 0; i < 10; ++i) {
        push_back(matrix, Vector3d(1,2,3), i);
    }
    std::cout << matrix << "\n";
    return 0;
}
If this needs to perform too many resizes though, it's going to be horrendously slow.

c++ matrix insert value using iterators (homework)

I'm pretty new to C++ and got an assignment to make a matrix using only STL containers. I've used a vector (rows) of vectors (columns). The problem I'm having is in the 'write' operation - for which I may only use an iterator-based implementation. Problem is, quite simply: it writes nothing.
I've tested with a matrix filled with different values, and while the iterator ends up on exactly the right spot, it doesn't change the value.
Here's my code:
void write(matrix mat, int row, int col, int input)
{
    assert(row>=0 && col>=0);
    assert(row<=mat.R && col<=mat.C);
    //I set up the iterators.
    vector<vector<int> >::iterator rowit;
    vector<int>::iterator colit;
    rowit = mat.rows.begin();
    //I go to the row.
    for(int i = 0; i<row-1; ++i)
    {
        ++rowit;
    }
    colit = rowit->begin();
    //I go to the column.
    for(int j = 0; j<col-1; ++j)
    {
        ++colit;
    }
    *colit = input; //Does nothing.
}
What am I overlooking?
matrix mat is a parameter by value, it copies the matrix and hence you are writing to a copy.
You should pass the matrix by reference instead, like matrix & mat.
But wait... You are passing the matrix every time as the first parameter, this is a bad sign!
This usually indicates that the parameter should be turned into an object on which you can run the methods; that way, you don't need to pass the parameter at all. So, create a Matrix class instead.
Please note that there is std::vector::operator[].
So, you could just do it like this:
void write(matrix & mat, int row, int col, int input)
{
    // operator[] is zero-based, so valid indices are 0..R-1 and 0..C-1
    assert(row>=0 && col>=0);
    assert(row<mat.R && col<mat.C);
    mat.rows[row][col] = input;
}
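If the assignment really does require iterators, the class-based suggestion can still be followed. The sketch below is one possible shape, with assumptions stated plainly: the Matrix type, its zero-based indices and the read helper are hypothetical and not taken from the question. Because write is a member function, it modifies the object itself rather than a copy.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical stand-in for the assignment's matrix: a vector of rows.
struct Matrix {
    std::vector<std::vector<int>> rows;

    Matrix(std::size_t r, std::size_t c) : rows(r, std::vector<int>(c, 0)) {}

    // Iterator-only write with zero-based indices.
    void write(std::size_t row, std::size_t col, int value)
    {
        assert(row < rows.size());
        auto rowit = std::next(rows.begin(), static_cast<std::ptrdiff_t>(row));
        assert(col < rowit->size());
        auto colit = std::next(rowit->begin(), static_cast<std::ptrdiff_t>(col));
        *colit = value;
    }

    // Iterator-only read, used here to check the write.
    int read(std::size_t row, std::size_t col) const
    {
        assert(row < rows.size());
        auto rowit = std::next(rows.begin(), static_cast<std::ptrdiff_t>(row));
        assert(col < rowit->size());
        return *std::next(rowit->begin(), static_cast<std::ptrdiff_t>(col));
    }
};
```

std::next advances a copy of the iterator by the given number of steps, which replaces the hand-written loops and avoids the row-1 / col-1 offsets.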

Fastest way to calculate minimum euclidean distance between two matrices containing high dimensional vectors

I started a similar question on another thread, but then I was focusing on how to use OpenCV. Having failed to achieve what I originally wanted, I will ask here exactly what I want.
I have two matrices. Matrix a is 2782x128 and Matrix b is 4000x128, both unsigned char values. The values are stored in a single array. For each vector in a, I need the index of the vector in b with the closest euclidean distance.
Ok, now my code to achieve this:
#include <windows.h>
#include <stdlib.h>
#include <stdio.h>
#include <cstdio>
#include <math.h>
#include <time.h>
#include <sys/timeb.h>
#include <iostream>
#include <fstream>
#include "main.h"
using namespace std;
int main(int argc, char* argv[])
{
    int a_size;
    unsigned char* a = NULL;
    read_matrix(&a, a_size,"matrixa");
    int b_size;
    unsigned char* b = NULL;
    read_matrix(&b, b_size,"matrixb");
    LARGE_INTEGER liStart;
    LARGE_INTEGER liEnd;
    LARGE_INTEGER liPerfFreq;
    QueryPerformanceFrequency( &liPerfFreq );
    QueryPerformanceCounter( &liStart );
    int* indexes = NULL;
    min_distance_loop(&indexes, b, b_size, a, a_size);
    QueryPerformanceCounter( &liEnd );
    cout << "loop time: " << (liEnd.QuadPart - liStart.QuadPart) / static_cast<double>(liPerfFreq.QuadPart) << "s." << endl;
    if (a)
        delete[] a;      // allocated with new[] in read_matrix
    if (b)
        delete[] b;
    if (indexes)
        free(indexes);   // allocated with malloc in min_distance_loop
    return 0;
}
void read_matrix(unsigned char** matrix, int& matrix_size, char* matrixPath)
{
    ofstream myfile;
    float f;
    FILE * pFile;
    pFile = fopen (matrixPath,"r");
    fscanf (pFile, "%d", &matrix_size);
    *matrix = new unsigned char[matrix_size*128];
    for (int i=0; i<matrix_size*128; ++i)
    {
        unsigned int matPtr;
        fscanf (pFile, "%u", &matPtr);
        (*matrix)[i]=(unsigned char)matPtr; // write through the out-parameter
    }
    fclose (pFile);
}
void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size)
{
    const int descrSize = 128;
    *indexes = (int*)malloc(a_size*sizeof(int));
    int dataIndex=0;
    int vocIndex=0;
    int min_distance;
    int distance;
    int multiply;
    unsigned char* dataPtr;
    unsigned char* vocPtr;
    for (int i=0; i<a_size; ++i)
    {
        min_distance = INT_MAX; // min_distance is an int
        for (int j=0; j<b_size; ++j)
        {
            distance = 0;
            dataPtr = &a[dataIndex];
            vocPtr = &b[vocIndex];
            for (int k=0; k<descrSize; ++k)
            {
                multiply = *dataPtr++-*vocPtr++;
                distance += multiply*multiply;
                // If the distance is greater than the previously calculated, exit
                if (distance>min_distance)
                    break;
            }
            // if distance smaller
            if (distance<min_distance)
            {
                min_distance = distance;
                (*indexes)[i] = j;
            }
            vocIndex += descrSize;
        }
        dataIndex += descrSize;
        vocIndex = 0;
    }
}
And attached are the files with sample matrices.
I am using windows.h just to calculate the consuming time, so if you want to test the code in another platform than windows, just change windows.h header and change the way of calculating the consuming time.
This code in my computer is about 0.5 seconds. The problem is that I have another code in Matlab that makes this same thing in 0.05 seconds. In my experiments, I am receiving several matrices like matrix a every second, so 0.5 seconds is too much.
Now the matlab code to calculate this:
aa=sum(a.*a,2); bb=sum(b.*b,2); ab=a*b';
d = sqrt(abs(repmat(aa,[1 size(bb,1)]) + repmat(bb',[size(aa,1) 1]) - 2*ab));
[minz index]=min(d,[],2);
Ok. The Matlab code uses the identity (a-b)^2 = a^2 + b^2 - 2ab.
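Written out for row vectors, with a_i denoting row i of a and b_j denoting row j of b (notation introduced here only for the derivation), the expansion behind the Matlab code is:

```latex
\begin{aligned}
\lVert a_i - b_j \rVert^2
  &= (a_i - b_j) \cdot (a_i - b_j) \\
  &= \lVert a_i \rVert^2 + \lVert b_j \rVert^2 - 2\, a_i \cdot b_j
\end{aligned}
```

In the Matlab snippet, aa(i) holds the first term, bb(j) the second, and ab(i,j) the dot product, so the whole distance matrix costs one matrix product plus two vectors of row norms.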
So my next attempt was to do the same thing. I rewrote my own code to make the same calculations, but it took about 1.2 seconds.
Then, I tried to use different external libraries. The first attempt was Eigen:
const int descrSize = 128;
MatrixXi a(a_size, descrSize);
MatrixXi b(b_size, descrSize);
MatrixXi ab(a_size, b_size);
unsigned char* dataPtr = matrixa;
for (int i=0; i<nframes; ++i)
{
    for (int j=0; j<descrSize; ++j)
    {
        a(i,j)=(int)*dataPtr++;
    }
}
unsigned char* vocPtr = matrixb;
for (int i=0; i<vocabulary_size; ++i)
{
    for (int j=0; j<descrSize; ++j)
    {
        b(i,j)=(int)*vocPtr ++;
    }
}
ab = a*b.transpose();
// squared row norms, matching sum(a.*a,2) and sum(b.*b,2) in the Matlab code
MatrixXi aa = a.cwiseAbs2().rowwise().sum();
MatrixXi bb = b.cwiseAbs2().rowwise().sum();
MatrixXi d = (aa.replicate(1,vocabulary_size) + bb.transpose().replicate(nframes,1) - 2*ab).cwiseAbs2();
int* index = NULL;
index = (int*)malloc(nframes*sizeof(int));
for (int i=0; i<nframes; ++i)
{
    d.row(i).minCoeff(&index[i]);
}
This Eigen code takes approximately 1.2 seconds for just the line that says: ab = a*b.transpose();
Similar code using OpenCV was also tried, and the cost of ab = a*b.transpose(); was 0.65 seconds.
So, it is really annoying that Matlab is able to do this same thing so quickly and I am not able to in C++! Of course being able to run my experiment would be great, but I think the lack of knowledge is what is really annoying me. How can I achieve at least the same performance as in Matlab? Any kind of solution is welcome: any external library (free if possible), loop unrolling, templates, SSE instructions (I know they exist), cache tricks. As I said, my main purpose is to increase my knowledge so that I can write code like this with better performance.
Thanks in advance
EDIT: more code, as suggested by David Hammen. I cast the arrays to int before making any calculations. Here is the code:
void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size)
{
    const int descrSize = 128;
    int* a_int;
    int* b_int;
    QueryPerformanceFrequency( &liPerfFreq );
    QueryPerformanceCounter( &liStart );
    a_int = (int*)malloc(a_size*descrSize*sizeof(int));
    b_int = (int*)malloc(b_size*descrSize*sizeof(int));
    for(int i=0; i<descrSize*a_size; ++i)
        a_int[i]=(int)a[i];
    for(int i=0; i<descrSize*b_size; ++i)
        b_int[i]=(int)b[i];
    QueryPerformanceCounter( &liEnd );
    cout << "Casting time: " << (liEnd.QuadPart - liStart.QuadPart) / static_cast<double>(liPerfFreq.QuadPart) << "s." << endl;
    *indexes = (int*)malloc(a_size*sizeof(int));
    int dataIndex=0;
    int vocIndex=0;
    int min_distance;
    int distance;
    int multiply;
    /*unsigned char* dataPtr;
    unsigned char* vocPtr;*/
    int* dataPtr;
    int* vocPtr;
    for (int i=0; i<a_size; ++i)
    {
        min_distance = INT_MAX; // min_distance is an int
        for (int j=0; j<b_size; ++j)
        {
            distance = 0;
            dataPtr = &a_int[dataIndex];
            vocPtr = &b_int[vocIndex];
            for (int k=0; k<descrSize; ++k)
            {
                multiply = *dataPtr++-*vocPtr++;
                distance += multiply*multiply;
                // If the distance is greater than the previously calculated, exit
                if (distance>min_distance)
                    break;
            }
            // if distance smaller
            if (distance<min_distance)
            {
                min_distance = distance;
                (*indexes)[i] = j;
            }
            vocIndex += descrSize;
        }
        dataIndex += descrSize;
        vocIndex = 0;
    }
}
The entire process now takes 0.6 seconds, and the casting loops at the beginning take 0.001 seconds. Maybe I did something wrong?
EDIT2: Anything about Eigen? When I look for external libraries, people always talk about Eigen and its speed. Did I do something wrong? Here is simple code using Eigen that shows it is not so fast. Maybe I am missing some configuration or some flag, or ...
MatrixXd A = MatrixXd::Random(1000, 1000);
MatrixXd B = MatrixXd::Random(1000, 500);
MatrixXd X;
X = A*B;
This code takes about 0.9 seconds.
As you observed, your code is dominated by the matrix product, which represents about 2.8e9 arithmetic operations. You say that Matlab (or rather the highly optimized MKL) computes it in about 0.05s. This represents a rate of 57 GFLOPS, showing that it is using not only vectorization but also multi-threading. With Eigen, you can enable multi-threading by compiling with OpenMP enabled (-fopenmp with gcc). On my 5-year-old computer (2.66GHz Core2), using floats and 4 threads, your product takes about 0.053s, and 0.16s without OpenMP, so there must be something wrong with your compilation flags. To summarize, to get the best out of Eigen:
compile in 64bits mode
use floats (doubles are twice as slow owing to vectorization)
enable OpenMP
if your CPU has hyper-threading, then either disable it or define the OMP_NUM_THREADS environment variable to the number of physical cores (this is very important, otherwise the performance will be very bad!)
if you have other task running, it might be a good idea to reduce OMP_NUM_THREADS to nb_cores-1
use the most recent compiler that you can, GCC, clang and ICC are best, MSVC is usually slower.
One thing that is definitely hurting you in your C++ code is that it has a boatload of char to int conversions. By boatload, I mean up to 2*2782*4000*128 char to int conversions. Those char to int conversions are slow, very slow.
You can reduce this to (2782+4000)*128 such conversions by allocating a pair of int arrays, one 2782*128 and the other 4000*128, to contain the cast-to-integer contents of your char* a and char* b arrays. Work with these int* arrays rather than your char* arrays.
Another problem might be your use of int versus long. I don't work on Windows, so this might not be applicable. On the machines I work on, int is 32 bits and long is now 64 bits. 32 bits is more than enough because 255*255*128 < 256*256*128 = 2^23.
That obviously isn't the problem.
What's striking is that the code in question is not calculating that huge 2782 by 4000 array that the Matlab code is creating. What's even more striking is that Matlab is most likely doing this with doubles rather than ints -- and it's still beating the pants off the C/C++ code.
One big problem is cache. That 4000*128 array is far too big for level 1 cache, and you are iterating over that big array 2782 times. Your code is doing far too much waiting on memory. To overcome this problem, work with smaller chunks of the b array so that your code works with level 1 cache for as long as possible.
Another problem is the optimization if (distance>min_distance) break;. I suspect that this is actually a dis-optimization. Having if tests inside your innermost loop is oftentimes a bad idea. Blast through that inner product as fast as possible. Other than wasted computations, there is no harm in getting rid of this test. Sometimes it is better to make apparently unneeded computations if doing so can remove a branch in an innermost loop. This is one of those cases. You might be able to solve your problem just by eliminating this test. Try doing that.
Getting back to the cache problem, you need to get rid of this branch so that you can split the operations over the a and b matrix into smaller chunks, chunks of no more than 256 rows at a time. That's how many rows of 128 unsigned chars fit into one of the two modern Intel chip's L1 caches. Since 250 divides 4000, look into logically splitting that b matrix into 16 chunks. You may well want to form that big 2782 by 4000 array of inner products, but do so in small chunks. You can add that if (distance>min_distance) break; back in, but do so at a chunk level rather than at the byte by byte level.
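The chunking idea can be sketched as follows. This is a minimal illustration under assumptions, not a tuned implementation: the function name nearest_rows_blocked and its parameters are invented here, both matrices are assumed to be row-major byte arrays, and dim is assumed small enough (such as 128) that a squared distance fits in an int. With dim = 128 and block_rows = 250, one block of b occupies 32000 bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// For each row of `a`, returns the index of the closest row of `b`
// by squared Euclidean distance. `b` is visited in blocks of
// `block_rows` rows, and every row of `a` is compared against one
// block before the next block is touched.
std::vector<int> nearest_rows_blocked(const std::vector<unsigned char>& a,
                                      const std::vector<unsigned char>& b,
                                      std::size_t dim,
                                      std::size_t block_rows)
{
    assert(dim > 0 && block_rows > 0);
    assert(a.size() % dim == 0 && b.size() % dim == 0);
    const std::size_t a_rows = a.size() / dim;
    const std::size_t b_rows = b.size() / dim;

    std::vector<int> best_index(a_rows, -1);
    std::vector<int> best_dist(a_rows, INT_MAX);

    for (std::size_t j0 = 0; j0 < b_rows; j0 += block_rows) {
        const std::size_t j1 = std::min(j0 + block_rows, b_rows);
        for (std::size_t i = 0; i < a_rows; ++i) {
            const unsigned char* pa = &a[i * dim];
            for (std::size_t j = j0; j < j1; ++j) {
                const unsigned char* pb = &b[j * dim];
                int dist = 0;
                // No early exit here: the innermost loop has no branch.
                for (std::size_t k = 0; k < dim; ++k) {
                    const int diff = int(pa[k]) - int(pb[k]);
                    dist += diff * diff;
                }
                if (dist < best_dist[i]) {
                    best_dist[i] = dist;
                    best_index[i] = static_cast<int>(j);
                }
            }
        }
    }
    return best_index;
}
```

The result does not depend on block_rows; only the memory access order changes, so the block size can be tuned by measurement on the target machine.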
You should be able to beat Matlab because it almost certainly is working with doubles, but you can work with unsigned chars and ints.
Matrix multiply generally uses the worst possible cache access pattern for one of the two matrices, and the solution is to transpose one of the matrices and use a specialized multiply algorithm that works on data stored that way.
Your matrix already IS stored transposed. By transposing it into the normal order and then using a normal matrix multiply, you are absolutely killing performance.
Write your own matrix multiply loop that inverts the order of indices to the second matrix (which has the effect of transposing it, without actually moving anything around and breaking cache behavior). And pass your compiler whatever options it has for enabling auto-vectorization.
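A minimal sketch of such a loop, assuming both matrices are stored row-major with dim values per row (the function name multiply_by_transpose is made up for this example). Entry (i, j) of the product is the dot product of row i of A with row j of B, so both operands are read sequentially and nothing is physically transposed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C = A * B^T for A (a_rows x dim) and B (b_rows x dim),
// both row-major. C is returned row-major with b_rows columns.
std::vector<int> multiply_by_transpose(const std::vector<int>& A,
                                       const std::vector<int>& B,
                                       std::size_t dim)
{
    assert(dim > 0 && A.size() % dim == 0 && B.size() % dim == 0);
    const std::size_t a_rows = A.size() / dim;
    const std::size_t b_rows = B.size() / dim;
    std::vector<int> C(a_rows * b_rows, 0);
    for (std::size_t i = 0; i < a_rows; ++i) {
        for (std::size_t j = 0; j < b_rows; ++j) {
            int sum = 0;
            // Both inner accesses walk contiguous memory.
            for (std::size_t k = 0; k < dim; ++k)
                sum += A[i * dim + k] * B[j * dim + k];
            C[i * b_rows + j] = sum;
        }
    }
    return C;
}
```

The contiguous inner loop is also the shape that auto-vectorizers handle most readily, though whether a given compiler vectorizes it depends on its flags and version.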

Variable block size sum of absolute difference calculation in C++

I would like to perform a variable block size sum of absolute difference calculation with a 2-D array of 16 bit integers in a C++ program as efficiently as possible. I am interested in a real time block matching code. I was wondering if there were any software libraries available to do this? The code is running on windows XP and I'm stuck using Visual Studio 2010 to do the compiling. The CPU is a 2-core AMD Athlon 64 x2 4850e.
By variable block size sum of absolute difference(SAD) calculation I mean the following.
I have one smaller 2-D array I will call the template_grid, and one larger 2-D array I will call the image. I want to find the region in the image that minimizes the sum of the absolute difference between the pixels in the template and the pixels in the region in the image.
The simplest way to calculate the SAD in C++ would be the following:
for(int shiftY = 0; shiftY < rangeY; shiftY++) {
    for(int shiftX = 0; shiftX < rangeX; shiftX++) {
        for(int x = 0; x < lenTemplateX; x++) {
            for(int y = 0; y < lenTemplateY; y++) {
                // accumulate, so SAD must start out filled with zeros
                SAD[shiftY][shiftX]+=abs(template_grid[x][y] - image[y + shiftY][x + shiftX]);
            }
        }
    }
}
The SAD calculation for specific array sizes has been optimized in the Intel performance primitives library. However, the arrays I'm working with don't fit the sizes in these libraries.
There are two search ranges I work with,
a large range: rangeY = 45, rangeX = 10
a small range: rangeY = 4, rangeX = 2
There is only one template size and it is:
lenTemplateY = 61, lenTemplateX = 7
Minor optimisation:
for(int shiftY = 0; shiftY < rangeY; shiftY++) {
    for(int shiftX = 0; shiftX < rangeX; shiftX++) {
        // if you can assume SAD is already filled with 0-es,
        // you don't need the next line
        SAD[shiftY][shiftX] = 0;
        for(int tx = 0, imx = shiftX; tx < lenTemplateX; tx++, imx++) {
            for(int ty = 0, imy = shiftY; ty < lenTemplateY; ty++, imy++) {
                // two increments of imx/imy may be cheaper than
                // two additions with offsets
                SAD[shiftY][shiftX] += abs(template_grid[tx][ty] - image[imx][imy]);
            }
        }
    }
}
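For reference, here is a self-contained sketch of the same brute-force search that also picks the best shift. It is not the code from the question: it assumes the image and the template are stored row-major in flat vectors, and the names `SadMatch` and `findBestMatch` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Result of the search: the shift with the smallest SAD, and that SAD.
struct SadMatch {
    std::size_t shiftX;
    std::size_t shiftY;
    unsigned long sad;
};

// image is imgW x imgH and tmpl is tW x tH, both row-major.
// Every shift in [0, rangeX) x [0, rangeY) is tried.
SadMatch findBestMatch(const std::vector<std::int16_t>& image,
                       std::size_t imgW, std::size_t imgH,
                       const std::vector<std::int16_t>& tmpl,
                       std::size_t tW, std::size_t tH,
                       std::size_t rangeX, std::size_t rangeY) {
    assert(image.size() == imgW * imgH && tmpl.size() == tW * tH);
    assert(rangeX >= 1 && rangeY >= 1);
    // The template must stay inside the image for every shift.
    assert(rangeX - 1 + tW <= imgW && rangeY - 1 + tH <= imgH);

    SadMatch best = {0, 0, 0};
    bool haveBest = false;
    for (std::size_t shiftY = 0; shiftY < rangeY; ++shiftY) {
        for (std::size_t shiftX = 0; shiftX < rangeX; ++shiftX) {
            unsigned long sad = 0;
            for (std::size_t y = 0; y < tH; ++y) {
                // Both rows are contiguous, so the inner loop is stride 1.
                const std::int16_t* imgRow =
                    image.data() + (y + shiftY) * imgW + shiftX;
                const std::int16_t* tmplRow = tmpl.data() + y * tW;
                for (std::size_t x = 0; x < tW; ++x) {
                    sad += std::abs(int(imgRow[x]) - int(tmplRow[x]));
                }
            }
            if (!haveBest || sad < best.sad) {
                best.shiftX = shiftX;
                best.shiftY = shiftY;
                best.sad = sad;
                haveBest = true;
            }
        }
    }
    return best;
}
```

Keeping the running sum in a local variable instead of writing to the SAD array on every pixel is a small change that often helps the optimizer, though that too should be measured.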
Loop unrolling using C++ templates
This may be a crazy idea for your configuration (the C++ compiler worries me), but it may work. I offer no warranties, but give it a try.
The idea may work because your template_grid sizes and the ranges are constant, and thus known at compilation time. Also, for this to work, your image and template_grid must be organised with the same layout (column first or row first); the way your "sample code" is depicted in the question mixes the SAD x/y with the template_grid y/x.
In the following, I'll assume a "column first" organisation, so that SAD[ix] denotes the ix-th column of your SAD** matrix. The code is just the same for "row first", except that the names of the variables won't match the meaning of your value arrays.
So, let's start:
template <
    typename sad_type, typename val_type,
    size_t template_len
> struct sad1D_simple {
    void operator()(
        const val_type* img, const val_type* templ,
        sad_type& result
    ) {
        // template specialization recursion, with one less element to add
        sad1D_simple<sad_type, val_type, template_len-1> one_shorter;
        // call it incrementing the img and template offsets
        one_shorter(img+1, templ+1, result);
        // then add the contribution of the first diff we skipped over above
        result += abs(*img - *templ);
    }
};
// at len of 0, the result is zero. We need it to stop the
// template specialization recursion
template <
    typename sad_type, typename val_type
> struct sad1D_simple<sad_type, val_type, 0> {
    void operator()(
        const val_type* img, const val_type* templ,
        sad_type& result
    ) {
        result = 0;
    }
};
Why a functor struct, i.e. a struct with operator()? C++ doesn't allow partial specialization of function templates.
What sad1D_simple does: it unrolls a for loop that computes the SAD of two input arrays without any offsetting, based on the fact that the length of your template_grid array is a constant known at compile time. It's in the same vein as "computing the factorial at compile time using C++ templates".
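To see the unrolling idea in isolation, here is a stripped-down sketch with the element type fixed to `short` and the result returned by value. The names `UnrolledSad` and `loopSad` are made up for this illustration, and a plain loop is included so the two can be compared:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Primary template: handle the first element, then recurse on the
// remaining N-1 elements. The recursion is resolved at compile time.
template <std::size_t N>
struct UnrolledSad {
    static unsigned long run(const short* img, const short* templ) {
        return static_cast<unsigned long>(std::abs(int(img[0]) - int(templ[0])))
             + UnrolledSad<N - 1>::run(img + 1, templ + 1);
    }
};

// Specialization for N == 0: nothing left to add, which ends the recursion.
template <>
struct UnrolledSad<0> {
    static unsigned long run(const short*, const short*) { return 0; }
};

// The same computation as an ordinary run-time loop.
unsigned long loopSad(const short* img, const short* templ, std::size_t n) {
    assert(n == 0 || (img != 0 && templ != 0));
    unsigned long sad = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sad += std::abs(int(img[i]) - int(templ[i]));
    }
    return sad;
}
```

`UnrolledSad<61>::run(img, templ)` would expand into 61 subtractions, abs-es and additions with no loop counter, provided the compiler inlines the calls, which is not guaranteed.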
How does this help?
Example of use in the code below:
typedef unsigned long SAD_t;
typedef int16_t pixel_val_t;
const size_t lenTemplateX = 7; // number of cols in the template_grid
const size_t lenTemplateY = 61;
const size_t rangeX=10, rangeY=45;
pixel_val_t **image, **template_grid;
SAD_t** SAD;
// assume those are initialized somehow, with SAD filled with zeros
for(size_t tgrid_col=0; tgrid_col<lenTemplateX; tgrid_col++) {
    pixel_val_t* template_col=template_grid[tgrid_col];
    // the X axis - horizontal - is the column axis, right?
    for(size_t shiftX=0; shiftX < rangeX; shiftX++) {
        // template column tgrid_col lines up with image column shiftX+tgrid_col
        pixel_val_t* img_col=image[shiftX+tgrid_col];
        for(size_t shiftY = 0; shiftY < rangeY; shiftY++) {
            // the Y axis - vertical - is the "offset in a column"=row, isn't it?
            pixel_val_t* img_col_offsetted=img_col+shiftY;
            // this functor is made by recursive specialization
            // there's no cycle inside it, it was unrolled into
            // lenTemplateY individual subtractions, abs-es and additions
            sad1D_simple<SAD_t, pixel_val_t, lenTemplateY> calc;
            SAD_t column_sad = 0;
            calc(img_col_offsetted, template_col, column_sad);
            // accumulate the contribution of this template column
            SAD[shiftX][shiftY] += column_sad;
        }
    }
}
Mmmm... can we do better? No, it won't be the X-axis unrolling, we still want to stay in the 1D area, but... well, maybe if we create a ranged sad1D and unroll one more loop on the same axis? It will work iff the range along that axis (rangeY, in the column-first layout) is also constant.
template <
    typename sad_type, typename val_type,
    size_t range, size_t template_len
> struct sad1D_ranged {
    void operator()(
        const val_type* img, const val_type* templ,
        // result is assumed to have at least `range` slots
        sad_type* result
    ) {
        // we'll compute here the first slot of the result
        sad_type first_sad = 0;
        sad1D_simple<sad_type, val_type, template_len> calculator_for_first_sad;
        calculator_for_first_sad(img, templ, first_sad);
        // accumulate, so that successive template columns add up
        // (the caller is assumed to have zeroed the result slots)
        *result += first_sad;
        // now, ask for a recursive specialization for
        // the next (range-1) sad-s
        sad1D_ranged<sad_type, val_type, range-1, template_len> one_less_in_range;
        // when calling, pass the shifted img and result
        one_less_in_range(img+1, templ, result+1);
    }
};
// for a range of 0, there's nothing to do, but we need it
// to stop the template specialization recursion
template <
    typename sad_type, typename val_type,
    size_t template_len
> struct sad1D_ranged<sad_type, val_type, 0, template_len> {
    void operator()(
        const val_type* img, const val_type* templ,
        // result is assumed to have at least `range` slots
        sad_type* result
    ) {
    }
};
And here's how you use it:
// SAD is assumed to be filled with zeros beforehand
for(size_t tgrid_col=0; tgrid_col<lenTemplateX; tgrid_col++) {
    pixel_val_t* template_col=template_grid[tgrid_col];
    for(size_t shiftX=0; shiftX < rangeX; shiftX++) {
        // template column tgrid_col lines up with image column shiftX+tgrid_col
        pixel_val_t* img_col=image[shiftX+tgrid_col];
        SAD_t* sad_col=SAD[shiftX];
        sad1D_ranged<SAD_t, pixel_val_t, rangeY, lenTemplateY> calc;
        calc(img_col, template_col, sad_col);
    }
}
Yes... but the question is: will this improve performance?
Heck if I know. For a small number of iterations per loop and for strong data locality (values close to one another so that they are in the CPU caches), loop unrolling should improve the performance. For a larger number of iterations, you may negatively interfere with the CPU branch prediction and other mumbo-jumbo-I-know-may-impact-performance-but-I-don't-know-how.
Gut feeling: even if the same unrolling technique may work for the other two loops, using it may well result in a degradation of performance: we'll need to jump from one contiguous vector (an image column) to the other, and the entire image may not fit into the CPU cache.
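Since the honest answer is "measure it", here is a rough timing sketch using only `std::clock` from the standard library. The helper `secondsFor` and the functor `LoopSad` are hypothetical names for this illustration; any functor with a matching `operator()`, such as one wrapping an unrolled variant, can be timed the same way.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <ctime>

// Hypothetical functor: plain-loop SAD over two arrays of length len.
struct LoopSad {
    const short* img;
    const short* templ;
    std::size_t len;
    unsigned long operator()() const {
        unsigned long sad = 0;
        for (std::size_t i = 0; i < len; ++i) {
            sad += std::abs(int(img[i]) - int(templ[i]));
        }
        return sad;
    }
};

// Calls fn() `repetitions` times and returns the elapsed CPU time in seconds.
// Each result is added to `checksum` so that the results are actually used.
template <typename Fn>
double secondsFor(const Fn& fn, std::size_t repetitions,
                  unsigned long& checksum) {
    assert(repetitions > 0);
    const std::clock_t start = std::clock();
    for (std::size_t i = 0; i < repetitions; ++i) {
        checksum += fn();
    }
    const std::clock_t stop = std::clock();
    return double(stop - start) / CLOCKS_PER_SEC;
}
```

`std::clock` is coarse, so the repetition count has to be large enough to give a measurable duration. An optimizer may also notice that every call computes the same value, so a serious benchmark should vary the input between calls and compare the checksums of the variants being timed.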
Note: if your template_grid data is constant as well (or you have a finite set of constant template grids), one may take one step further and create struct functors with dedicated masks. But I'm out of steam for today.
You could try OpenCV template matching with the squared-difference parameter (see the OpenCV template matching tutorial). OpenCV is optimized with OpenCL, but I don't know whether that applies to this specific function. I think you should give it a try.
I'm not sure how much you are restricted to using SAD, or if you are generally interested in finding the region in the image that matches the template best. In the latter case, you can use a convolution instead of SAD. This can be solved in the Fourier domain in O(N log N), including the Fourier transform (FFT).
In short, you can use the FFT (for example using an FFT library) to convert both the template and the image to the frequency domain, then multiply them, and convert back to the time domain.
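Written out, as a sketch in my own notation (image I, template T, shift (u, v), with T zero-padded to the image size and the correlation taken circularly), the quantity computed this way is the cross-correlation, and it is the only shift-dependent cross term in the sum of squared differences:

```latex
% Cross-correlation of the image with the template at shift (u, v)
c(u,v) = \sum_{x,y} I(x+u,\, y+v)\, T(x,y)

% Correlation theorem: one forward FFT per input, a pointwise product
% with the complex conjugate of the template spectrum, one inverse FFT
c = \mathcal{F}^{-1}\!\left\{ \mathcal{F}\{I\} \cdot \overline{\mathcal{F}\{T\}} \right\}

% Sum of squared differences expressed through c(u, v)
\mathrm{SSD}(u,v) = \sum_{x,y} I(x+u,\, y+v)^2 \;-\; 2\, c(u,v) \;+\; \sum_{x,y} T(x,y)^2
```

The conjugate is what turns the convolution into a correlation. The last term is constant and the first is a windowed sum of the squared image, so minimizing the SSD mostly comes down to c(u, v). The absolute value in SAD does not expand this way, which is why the trick applies to squared differences and not to SAD itself.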
Of course, this is all irrelevant if you are bound to using SAD.