OpenCV Sum of squared differences speed - c++

I've been using the openCV to do some block matching and I've noticed it's sum of squared differences code is very fast compared to a straight forward for loop like this:
int SSD = 0;
for(int i =0; i < arraySize; i++)
SSD += (array1[i] - array2[i] )*(array1[i] - array2[i]);
If I look at the source code to see where the heavy lifting happens, the
OpenCV folks have their for loops do 4 squared difference calculations at a time in each iteration of the loop. The function to do the block matching looks like this.
int64
icvCmpBlocksL2_8u_C1( const uchar * vec1, const uchar * vec2, int len )
{
int i, s = 0;
int64 sum = 0;
for( i = 0; i <= len - 4; i += 4 )
{
int v = vec1[i] - vec2[i];
int e = v * v;
v = vec1[i + 1] - vec2[i + 1];
e += v * v;
v = vec1[i + 2] - vec2[i + 2];
e += v * v;
v = vec1[i + 3] - vec2[i + 3];
e += v * v;
sum += e;
}
for( ; i < len; i++ )
{
int v = vec1[i] - vec2[i];
s += v * v;
}
return sum + s;
}
This calculation is for unsigned 8 bit integers. They perform a similar calculation for 32-bit floats in this function:
double
icvCmpBlocksL2_32f_C1( const float *vec1, const float *vec2, int len )
{
double sum = 0;
int i;
for( i = 0; i <= len - 4; i += 4 )
{
double v0 = vec1[i] - vec2[i];
double v1 = vec1[i + 1] - vec2[i + 1];
double v2 = vec1[i + 2] - vec2[i + 2];
double v3 = vec1[i + 3] - vec2[i + 3];
sum += v0 * v0 + v1 * v1 + v2 * v2 + v3 * v3;
}
for( ; i < len; i++ )
{
double v = vec1[i] - vec2[i];
sum += v * v;
}
return sum;
}
I was wondering if anyone had any idea if breaking a loop up into chunks of 4 like this might speed up code? I should add that there is no multithreading occuring in this code.

My guess is that this is just a simple implementation of unrolling the loop - it saves 3 additions and 3 compares on each pass of the loop, which can be a great savings if, for example, checking len involves a cache miss. The downside is that this optimization adds code complexity (e.g. the additional for loop at the end to finish the loop for the len % 4 items left if the length is not evenly divisible by 4) and, of course, it's an architecture-dependent optimization whose magnitude of improvement will vary by hardware/compiler/etc...
Still, it's straightforward to follow compared to most optimizations and will probably result in some sort of performance increase regardless of the architecture, so it's low risk to just throw it in there and hope for the best. Since OpenCV is such a well-supported chunk of code, I'm sure that someone instrumented these chunks of code and found them to be well worth it - as you yourself have done.

There is one obvious optimisation of your code, viz:
int SSD = 0;
for(int i = 0; i < arraySize; i++)
{
int v = array1[i] - array2[i];
SSD += v * v;
}

Related

Example on FFT from Numerical Recipes book results in runtime error

I am trying to implement the FFT algorithm on C. I wrote a code based on the function "four1" from the book "Numerical Recipes in C". I know that using external libraries such as FFTW would be more efficient, but I just wanted to try this as a first approach. But I am getting an error at runtime.
After trying to debug for a while, I have decided to copy the exact same function provided in the book, but I still have the same problem. The problem seems to be in the following commands:
tempr = wr * data[j] - wi * data[j + 1];
tempi = wr * data[j + 1] + wi * data[j];
and
data[j + 1] = data[i + 1] - tempi;
the j is sometimes as high as the last index of the array, so you cannot add one when indexing.
As I said, I didnĀ“t change anything from the code, so I am very surprised that it is not working for me; it is a well-known reference for numerical methods in C, and I doubt there are errors in it. Also, I have found some questions regarding the same code example but none of them seemed to have the same issue (see C: Numerical Recipies (FFT), for example). What am I doing wrong?
Here is the code:
#include <iostream>
#include <stdio.h>
using namespace std;
#define SWAP(a,b) tempr=(a);(a)=(b);(b)=tempr
void four1(double* data, unsigned long nn, int isign)
{
unsigned long n, mmax, m, j, istep, i;
double wtemp, wr, wpr, wpi, wi, theta;
double tempr, tempi;
n = nn << 1;
j = 1;
for (i = 1; i < n; i += 2) {
if (j > i) {
SWAP(data[j], data[i]);
SWAP(data[j + 1], data[i + 1]);
}
m = n >> 1;
while (m >= 2 && j > m) {
j -= m;
m >>= 1;
}
j += m;
}
mmax = 2;
while (n > mmax) {
istep = mmax << 1;
theta = isign * (6.28318530717959 / mmax);
wtemp = sin(0.5 * theta);
wpr = -2.0 * wtemp * wtemp;
wpi = sin(theta);
wr = 1.0;
wi = 0.0;
for (m = 1; m < mmax; m += 2) {
for (i = m; i <= n; i += istep) {
j = i + mmax;
tempr = wr * data[j] - wi * data[j + 1];
tempi = wr * data[j + 1] + wi * data[j];
data[j] = data[i] - tempr;
data[j + 1] = data[i + 1] - tempi;
data[i] += tempr;
data[i + 1] += tempi;
}
wr = (wtemp = wr) * wpr - wi * wpi + wr;
wi = wi * wpr + wtemp * wpi + wi;
}
mmax = istep;
}
}
#undef SWAP
int main()
{
// Testing with random data
double data[] = {1, 1, 2, 0, 1, 3, 4, 0};
four1(data, 4, 1);
for (int i = 0; i < 7; i++) {
cout << data[i] << " ";
}
}
The first 2 editions of Numerical Recipes in C use the unusual (for C) convention that arrays are 1-based. (This was probably because the Fortran (1-based) version came first and the translation to C was done without regard to conventions.)
You should read section 1.2 Some C Conventions for Scientific
Computing, specifically the paragraphs on Vectors and One-Dimensional Arrays. As well as trying to justify their 1-based decision, this section does explain how to adapt pointers appropriately to match their code.
In your case, this should work -
int main()
{
// Testing with random data
double data[] = {1, 1, 2, 0, 1, 3, 4, 0};
double *data1based = data - 1;
four1(data1based, 4, 1);
for (int i = 0; i < 7; i++) {
cout << data[i] << " ";
}
}
However, as #Some programmer dude mentions in the comments the workaround advocated by the book is undefined behaviour as data1based points outside the bounds of the data array.
Whilst this way well work in practice, an alternative and non-UB workaround would be to change your interpretation to match their conventions -
int main()
{
// Testing with random data
double data[] = { -1 /*dummy value*/, 1, 1, 2, 0, 1, 3, 4, 0};
four1(data, 4, 1);
for (int i = 1; i < 8; i++) {
cout << data[i] << " ";
}
}
I'd be very wary of this becoming contagious though and infecting your code too widely.
The third edition tacitly recognised this 'mistake' and, as part of supporting C++ and standard library collections, switched to use the C & C++ conventions of zero-based arrays.

Software Prefetch and Backward Looping

Did I used the pre-fetch instruction correctly to reduce memory latency?
Can I do better than this?
When I compile the code with -O3, g++ seems to unroll the inner loop (code at godbolt.org).
The architecture of the CPU is Broadwell.
Thanks.
Backward loop over an array and read/write elements.
Each calculation depends on the previous calculation.
#include <stdlib.h>
#include <iostream>
int main() {
const int N = 25000000;
float* x = reinterpret_cast<float*>(
aligned_alloc(16, 4*N)
); // 0.1 GB
x[N - 1] = 1.0f;
// fetch last cache line of the array
__builtin_prefetch(&x[N - 16], 0, 3);
// Backward loop over the i^th cache line.
for (int i = N - 16; i >= 0; i -= 16) {
for (int j = 15; j >= 1; --j) {
x[i + j - 1] += x[i + j];
}
__builtin_prefetch(&x[i - 16], 0, 3);
x[i - 1] = x[i];
}
std::cout << x[0] << "\n";
free(x);
}

Addition of every subset of two multiplied

I have an array with the elements {7,2,1} and the idea is to do 7 * 2 + 7 * 1 + 2 * 1 which is basically this algorithm:
for(int i=0;i<n-1;++i)
for(int k=i+1;k<n;++k)
sum += a[i] * a[k];
Where a is the array in which I have the numbers and n is the number of elements, I need a more efficient algorithm for doing this, and I have no clue how to do it, can someone give me a hand?
Thank you!
You can do better in the general case. Time to do some math. Let's look at the 3-element version, we have:
ab + ac + bc
= 1/2 * (2ab + 2ac + 2bc)
= 1/2 * (2ab + 2ac + 2bc + a^2 + b^2 + c^2 - (a^2 + b^2 + c^2))
= 1/2 * ((a+b+c)^2 - (a^2 + b^2 + c^2))
That is:
int sum = 0;
int sum_sq = 0;
for (int i : arr) {
sum += i;
sum_sq += i*i;
}
int result = (sum*sum - sum_sq) / 2;
This is O(n) multiplications, instead of O(n^2). This'll certainly be better than the naive implementation at some point. Whether or not it's better for just 3 elements is something I haven't timed.
#chux's suggestion is essentially to redistribute operations:
ai * ai + 1 + ai * ai + 2 + ... + ai * an
-->
ai * (ai + 1 + ... + an)
combined with the avoiding unnecessary recomputation of partial sums of the (ai + 1 + ... + an) terms by leveraging the fact that each differs from the next by the value of one element of the input array.
Here's a one-pass implementation with O(1) overhead:
int psum(size_t n, int array[n]) {
int result = 0;
int rsum = array[n - 1];
for (int i = n - 2; i >= 0; i--) {
result += array[i] * rsum;
rsum += array[i];
}
return result;
}
The sum of all elements to the right of index i is maintained from iteration to iteration in variable rsum. It's unnecessary to track its various values in an array, because we need each value only for one iteration of the loop.
This scales linearly with the number of elements in the input array. You'll see that the number and type of operations is quite similar to #Barry's answer, but nothing analogous to his final step is required, which saves a few operations.
As #Barry observes in comments, the iteration can also be run in the other direction, in conjunction with tracking the left-hand partial sums intead of the right-hand ones. That would diverge a bit more from #chux's description, but it relies on exactly the same principles.
We have (a + b + c + ...)2 = (a2 + b2 + c2 + ...) + 2(ab + bc + ca + ...)
You want the sum S = ab + bc + ca + ..., which has O(n2) pairs (using 2 nested loops)
You can do 2 separated loops, one calculates P = a2 + b2 + c2 + ... in O(n) time, and another calculates Q = (a + b + c + ...)2 also in O(n) time. Then take S = (Q - P) / 2.
Make 1 pass, walk from the end of [a] to the front and form a sum of all the elements "to the right".
2nd pass, Multiple a[i] * sum[i].
O(n).
long sum0(int a[], int n) {
long sum = 0;
for (int i = 0; i < n - 1; ++i)
for (int k = i + 1; k < n; ++k)
sum += a[i] * a[k];
return sum;
}
long sum1(int a[], int n) {
int long sums[n];
sums[n - 1] = 0;
for (int i = n - 2; i >= 0; i--) {
sums[i] = a[i+1] + sums[i + 1];
}
long sum = 0;
for (int i = 0; i < n - 1; ++i)
sum += a[i] * sums[i];
return sum;
}
void test(int a[], int n) {
long s0 = sum0(a, n);
long s1 = sum1(a, n);
if (s0 != s1) printf("%9ld %9ld\n", s0, s1);
}
void tests(int k) {
while (k--) {
int n = rand() % 10 + 2;
int a[n + 1];
for (int m = 0; m < n; m++)
a[m] = rand() % 256;
test(a, n);
}
}
int main() {
int a[3] = { 7, 2, 1 };
printf("%d\n", sum1(a, 3));
tests(1000000);
puts("Done");
}
As it turns out the sums[] array is not needed either as the the running sums needs only 1 location. This effectively makes this answers similar to others
long sum1(int a[], int n) {
int long sums = 0;
long sum = 0;
for (int i = n - 2; i >= 0; i--) {
sums = a[i+1] + sums;
sum += a[i] * sums;
}
return sum;
}

C++: replicating matlab's interp1 spline interpolation function

Can anyone give me some direction to replicating MATLAB's interp1 function, using spline interpolation? I tried closely replicating the algorithm on the wikipedia page, but the results don't really match up.
#include <stdio.h>
#include <stdint.h>
#include <iostream>
#include <vector>
//MATLAB: interp1(x,test_array,query_points,'spline')
int main(){
int size = 10;
std::vector<float> test_array(10);
test_array[0] = test_array[4] = test_array[8] = 1;
test_array[1] = test_array[3] = test_array[5] = test_array[7] = test_array[9] = 4;
test_array[2] = test_array[6] = 7;
std::vector<float> query_points;
for (int i = 0; i < 10; i++)
query_points.push_back(i +.05);
int n = (size - 1);
std::vector<float> a(n+1);
std::vector<float> x(n+1); //sample_points vector
for (int i = 0; i < (n+1); i++){
x[i] = i + 1.0;
a[i] = test_array[i];
}
std::vector<float> b(n);
std::vector<float> d(n);
std::vector<float> h(n);
for (int i = 0; i < (n); ++i)
h[i] = x[i+1] - x[i];
std::vector<float> alpha(n);
for (int i = 1; i < n; ++i)
alpha[i] = ((3 / h[i]) * (a[i+1] - a[i])) - ((3 / h[i-1]) * (a[i] - a[i-1]));
std::vector<float> c(n+1);
std::vector<float> l(n+1);
std::vector<float> u(n+1);
std::vector<float> z(n+1);
l[0] = 1.0;
u[0] = z[0] = 0.0;
for (int i = 1; i < n; ++i){
l[i] = (2 * (x[i+1] - x[i-1])) - (h[i-1] * u[i-1]);
u[i] = h[i] / l[i];
z[i] = (alpha[i] - (h[i-1] * z[i-1])) / l[i];
}
l[n] = 1.0;
z[n] = c[n] = 0.0;
for (int j = (n - 1); j >= 0; j--){
c[j] = z[j] - (u[j] * c[j+1]);
b[j] = ((a[j+1] - a[j]) / h[j]) - ((h[j] / 3) * (c[j+1] + (2 * c[j])));
d[j] = (c[j+1] - c[j]) / (3 * h[j]);
}
std::vector<float> output_array(10);
for (int i = 0; i < n-1; i++){
float eval_point = (query_points[i] - x[i]);
output_array[i] = a[i] + (eval_point * b[i]) + ( eval_point * eval_point * c[i]) + (eval_point * eval_point * eval_point * d[i]);
std::cout << output_array[i] << std::endl;
}
system("pause");
return 0;
}
In hindsight, your code seems to be coded properly referring to the Wikipedia article. However, there is something you need to know about interp1 which I don't think you have taken into account when using it to check your answers.
MATLAB's interp1 when you specify the spline flag assumes that the end point conditions are not-a-knot. The algorithm specified on Wikipedia is the code for a natural spline.
As such, this is probably why your points do not match up. FWIW, consult: http://www.cs.tau.ac.il/~turkel/notes/numeng/spline_note.pdf and look at the diagram on the last page. You'll see that not-a-knot splines and natural splines bear the same shape, but have different y-values when your data consists of just the end points of your spline. However, should you have data points in between the end points, all of the different kinds of splines (more or less) have the same y values.
For the sake of completeness, here is the figure extracted from the PDF notes I referenced above:
If you want to use natural splines, use csape instead of interp1. This provides a cubic spline with end conditions. You call csape like this:
pp = csape(x,y);
x and y are the control points defined for your spline. By default, this returns a natural spline, which is what you're after, and is a struct of type ppform. You can then figure out what the spline evaluates to by using fnval:
yval = fnval(pp, xval);
xval and yval is the input x co-ordinate and the output evaluated for the spline at this particular x.
Use this, then check to see if your code matches up with the values provided by csape.
Minor Note
You need the Curve Fitting Toolbox in MATLAB to use csape. If you don't have this, then unfortunately this method will not work.
I think the interp1 is supported by MATLAB CODER.
Just use the CODER to generate the C code and you have what you need.

Optimize log entropy calculation in sparse matrix

I have a 3007 x 1644 dimensional matrix of terms and documents. I am trying to assign weights to frequency of terms in each document so I'm using this log entropy formula http://en.wikipedia.org/wiki/Latent_semantic_indexing#Term_Document_Matrix (See entropy formula in the last row).
I'm successfully doing this but my code is running for >7 minutes.
Here's the code:
int N = mat.cols();
for(int i=1;i<=mat.rows();i++){
double gfi = sum(mat(i,colon()))(1,1); //sum of occurrence of terms
double g =0;
if(gfi != 0){// to avoid divide by zero error
for(int j = 1;j<=N;j++){
double tfij = mat(i,j);
double pij = gfi==0?0.0:tfij/gfi;
pij = pij + 1; //avoid log0
double G = (pij * log(pij))/log(N);
g = g + G;
}
}
double gi = 1 - g;
for(int j=1;j<=N;j++){
double tfij = mat(i,j) + 1;//avoid log0
double aij = gi * log(tfij);
mat(i,j) = aij;
}
}
Anyone have ideas how I can optimize this to make it faster? Oh and mat is a RealSparseMatrix from amlpp matrix library.
UPDATE
Code runs on Linux mint with 4gb RAM and AMD Athlon II dual core
Running time before change: > 7mins
After #Kereks answer: 4.1sec
Here's a very naive rewrite that removes some redundancies:
int const N = mat.cols();
double const logN = log(N);
for (int i = 1; i <= mat.rows(); ++i)
{
double const gfi = sum(mat(i, colon()))(1, 1); // sum of occurrence of terms
double g = 0;
if (gfi != 0)
{
for (int j = 1; j <= N; ++j)
{
double const pij = mat(i, j) / gfi + 1;
g += pij * log(pij);
}
g /= logN;
}
for (int j = 1; j <= N; ++j)
{
mat(i,j) = (1 - g) * log(mat(i, j) + 1);
}
}
Also make sure that the matrix data structure is sane (e.g. a flat array accessed in strides; not a bunch of dynamically allocated rows).
Also, I think the first + 1 is a bit silly. You know that x -> x * log(x) is continuous at zero with limit zero, so you should write:
double const pij = mat(i, j) / gfi;
if (pij != 0) { g += pij + log(pij); }
In fact, you might even write the first inner for loop like this, avoiding a division when it isn't needed:
for (int j = 1; j <= N; ++j)
{
if (double pij = mat(i, j))
{
pij /= gfi;
g += pij * log(pij);
}
}