OpenCV: Quantifying the difference between two images quickly - c++

I have two images and I want to get a sense for how much they differ on a pixel-by-pixel level. My basic idea was to take the two images, apply absdiff, then go over each pixel in the difference, take its norm, and store that norm in another array. The problem with this approach is that it is too slow for my application. Does anyone know any alternatives?
Many Thanks,
Hillary
The code for calculating the normed difference:
Vec3b* row_pointer_image_difference; // the difference image holds Vec3b pixels, so use a Vec3b* row pointer
double* row_pointer_normed_difference;
Vec3b bgrPixel;
double pixel_distance;

for (long int r = 0; r < rows; r++) {
    row_pointer_image_difference = image_difference.ptr<Vec3b>(r);
    row_pointer_normed_difference = normed_difference.ptr<double>(r);
    for (long int c = 0; c < columns; c++) {
        // calculate pixel distance
        bgrPixel = row_pointer_image_difference[c];
        pixel_distance = norm(bgrPixel);
        row_pointer_normed_difference[c] = pixel_distance;
    }
}

You need to clarify your use case better, in order to see what shortcuts are available. Ask yourself: What do you use the difference for? Can you live with an approximate difference? Do you only need to tell if the images are exactly equal?
Also, what computation time do you want to optimize? Worst case? Average? Can you live with a large variance in computation times?
For example, if you are only interested in testing for exact equality, early termination at the first difference is very fast, and will have low expected time if most images are different from each other.
If the fraction of duplicates is expected to be large, random pixel sampling may be a viable approach, and from the sample rate you can quantify the likelihood of false positives and negatives.

Related

Strange behavior in matrix formation (C++, Armadillo)

I have a while loop that continues as long as an energy variable (type double) has not converged below a certain threshold. One of the variables needed to calculate this energy is an Armadillo matrix of doubles, named f_mo. In the while loop, f_mo updates iteratively, so I recalculate it at the beginning of each loop as:
arma::mat f_mo = h_core_mo; // h_core_mo is an Armadillo matrix of doubles
for (size_t p = 0; p < n_mo; p++) {          // n_mo is of type size_t
    for (size_t q = 0; q < n_mo; q++) {
        double sum = 0.0;
        for (size_t i = 0; i < n_occ; i++) { // n_occ is of type size_t
            //f_mo(p,q) += 2.0*g_mat_full_qp1_qp1_mo(p*n_mo + q, i*n_mo + i) - g_mat_full_qp1_qp1_mo(p*n_mo + i, i*n_mo + q); // all g_mat_ are Armadillo matrices of doubles
            sum += 2.0*g_mat_full_qp1_qp1_mo(p*n_mo + q, i*n_mo + i) - g_mat_full_qp1_qp1_mo(p*n_mo + i, i*n_mo + q);
        }
        for (size_t i2 = 0; i2 < n_occ2; i2++) { // n_occ2 is of type size_t
            //f_mo(p,q) -= 1.0*g_mat_full_qp1_qp2_mo(p*n_mo + q, i2*n_mo2 + i2);
            sum -= 1.0*g_mat_full_qp1_qp2_mo(p*n_mo + q, i2*n_mo2 + i2);
        }
        f_mo(p,q) += sum;
    }
}
But say I replace the sum (which I add to f_mo(p,q) at the end) with direct addition to f_mo(p,q) (the commented-out code). The output f_mo matrices are identical to machine precision. Nothing about the code should change. The only variables affected in the loop are sum and f_mo. And YET, the code converges to a different energy and in a vastly different number of while-loop iterations. I am at a loss as to the cause of the difference. When I run the same code 2, 3, 4, 5 times, I get the same result every time. When I recompile with no optimization, I get the same issue. When I run on a different computer (controlling for environment), I again get a discrepancy in the number of while-loop iterations despite identical f_mo, and the total number of iterations for each method (sum += and f_mo(p,q) +=) differ.
It is worth noting that the point at which the code outputs differ is always g_mat_full_qp1_qp2_mo, which is recalculated later in the while loop. HOWEVER, every variable going into the calculation of g_mat_full_qp1_qp2_mo is identical between the two codes. This leads me to think there is something more profound about C++ that I do not understand. I welcome any ideas as to how you would proceed in debugging this behavior (I am all but certain it is not a typical bug, and I've controlled for environment and optimization).
I'm going to assume this is a Hartree-Fock, or some other kind of electronic structure calculation where you are adding the two-electron integrals to the core Hamiltonian, and apply some domain knowledge.
Part of that assumption is that the individual elements of the two-electron integrals are very small, in particular compared to the core Hamiltonian. Hence, as 1201ProgramAlarm mentions in their comment, the order of addition will matter. You will get a more accurate result if you add the smaller numbers together first, to avoid losing precision when adding two numbers many orders of magnitude apart. Because you iterate this process until the Fock matrix f_mo has tightly converged, you eventually converge to the same value.
In order to add up the numbers in a more accurate order, and hopefully converge faster, most electronic structure programs have a separate routine to calculate the two-electron integrals and then add them to the core Hamiltonian, which is what you are doing, element by element, in your example code.
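To make the order-of-addition point concrete, here is a tiny self-contained illustration (not taken from the original code) of small terms being rounded away when added one at a time into a large accumulator, but surviving when summed separately first:

```cpp
// Adding many small terms directly into a large accumulator loses them
// one at a time: each addition rounds back to the large value. Summing
// the small terms separately first preserves their total.
float add_individually(float big, float small, int count) {
    for (int i = 0; i < count; ++i)
        big += small;          // each small term is rounded away
    return big;
}

float add_summed(float big, float small, int count) {
    float sum = 0.0f;
    for (int i = 0; i < count; ++i)
        sum += small;          // small terms accumulate accurately
    return big + sum;          // one addition of a now-large correction
}
```

With big = 1e8f and small = 1.0f, the spacing between adjacent floats near 1e8 is 8, so each individual +1 rounds back to 1e8, while the pre-summed version lands on the correct total.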
Presentation on numerical computing.

Fast data structure or algorithm to find mean of each pixel in a stack of images

I have a stack of images in which I want to calculate the mean of each pixel down the stack.
For example, let (x_n,y_n) be the (x,y) pixel in the nth image. Thus, the mean of pixel (x,y) for three images in the image stack is:
mean-of-(x,y) = (1/3) * ((x_1,y_1) + (x_2,y_2) + (x_3,y_3))
My first thought was to load all pixel intensities from each image into a data structure with a single linear buffer like so:
|All pixels from image 1| All pixels from image 2| All pixels from image 3|
To find the sum of a pixel down the image stack, I perform a series of nested for loops like so:
for (int col = 0; col < img_cols; col++)
{
    for (int row = 0; row < img_rows; row++)
    {
        for (int img = 0; img < num_of_images; img++)
        {
            sum_of_px += px_buffer[(img*img_rows*img_cols) + col*img_rows + row];
        }
    }
}
Basically img*img_rows*img_cols gives the buffer element of the first pixel in the nth image and col*img_rows+row gives the (x,y) pixel that I want to find for each n image in the stack.
Is there a data structure or algorithm that will help me sum up pixel intensities down an image stack that is faster and more organized than my current implementation?
I am aiming for portability so I will not be using OpenCV and am using C++ on linux.
The problem with the nested loop in the question is that it's not very cache friendly. You go skipping through memory with a long stride, effectively rendering your data cache useless. You're going to spend a lot of time just accessing the memory.
If you can spare the memory, you can create an extra image-sized buffer to accumulate totals for each pixel as you walk through all the pixels in all the images in memory order. Then you do a single pass through the buffer for the division.
Your accumulation buffer may need to use a larger type than you use for individual pixel values, since it has to accumulate many of them. If your pixel values are, say, 8-bit integers, then your accumulation buffer might need 32-bit integers or floats.
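A sketch of that accumulation-buffer approach (the function name and the 8-bit-pixel assumption are illustrative; each image is assumed to be one contiguous rows*cols buffer):

```cpp
#include <cstdint>
#include <vector>

// Accumulate a stack of 8-bit images into a 32-bit buffer by walking
// each image in memory order, then divide once at the end.
std::vector<float> stack_mean(const std::vector<const uint8_t*>& images,
                              int rows, int cols)
{
    const int npx = rows * cols;
    std::vector<uint32_t> acc(npx, 0); // wider type than the pixels

    // Sequential passes: every access walks memory with stride 1,
    // so the cache and hardware prefetcher are used effectively.
    for (const uint8_t* img : images)
        for (int p = 0; p < npx; ++p)
            acc[p] += img[p];

    // Single pass for the division.
    std::vector<float> mean(npx);
    for (int p = 0; p < npx; ++p)
        mean[p] = (float)acc[p] / images.size();
    return mean;
}
```

The accumulator trades one image-sized buffer of memory for cache-friendly access; with 8-bit pixels, a uint32_t accumulator overflows only after ~16 million images per pixel.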
Usually, a stack of pixels
(x_1,y_1),...,(x_n,y_n)
is conditionally independent from a stack
(a_1,b_1),...,(a_n,b_n)
And even if they weren't (assuming a particular dataset), modeling their interactions is a complex task and would give you only an estimate of the mean. So, if you want to compute the exact mean for each stack, you have no choice but to iterate through the three loops that you supply. Languages such as MATLAB/Octave and libraries such as Theano (Python) or Torch7 (Lua) all parallelize these iterations. If you are using C++, what you do is well suited for CUDA or OpenMP. As for portability, I think OpenMP is the easier solution.
A portable, fast data structure specifically for the average calculation could be:
std::vector<std::vector<std::vector<sometype> > > VoVoV;
VoVoV.resize(img_cols);
for (int i = 0; i < img_cols; ++i)
{
    VoVoV[i].resize(img_rows);
    for (int j = 0; j < img_rows; ++j)
    {
        VoVoV[i][j].resize(num_of_images);
        // The values of all images at this pixel are stored contiguously,
        // and therefore should be fast to access.
    }
}
VoVoV[col][row][img] = foo;
As a side note, 1/3 in your example will evaluate to 0 which is not what you want.
For fast summation/averaging you can now do:
sometype sum = 0;
std::vector<sometype>::iterator it = VoVoV[col][row].begin();
std::vector<sometype>::iterator it_end = VoVoV[col][row].end();
for ( ; it != it_end; ++it)
    sum += *it;
sometype avg = sum / num_of_images; // or similar for integers; check for num_of_images == 0
Basically, you should not rely on the compiler to optimize away the repeated calculation of the same offsets.

How is performance dependent on the underlying data values

I have the following C++ code snippet (the C++ part is the profiler class, which is omitted here), compiled with VS2010 (64-bit Intel machine). The code simply multiplies an array of floats (arr2) by a scalar and puts the result into another array (arr1):
int M = 150, N = 150;
int niter = 20000; // do many iterations to get a significant run-time

float *arr1 = (float *)calloc(M*N, sizeof(float));
float *arr2 = (float *)calloc(M*N, sizeof(float));

// Read data from file into arr2

float scale = float(6.6e-14);

// START_PROFILING
for (int iter = 0; iter < niter; ++iter) {
    for (int n = 0; n < M*N; ++n) {
        arr1[n] += scale * arr2[n];
    }
}
// END_PROFILING

free(arr1);
free(arr2);
The reading-from-file part and profiling (i.e run-time measurement) is omitted here for simplicity.
When arr2 is initialized to random numbers in the range [0, 1], the code runs about 10 times faster than when arr2 is initialized to a sparse array in which about 2/3 of the values are zeros. I have played with the compiler options /fp and /O, which changed the run-time a little, but the ratio of 1:10 was approximately kept.
How come the performance is dependent on the actual values? What does the CPU do differently that makes the sparse data run ~10 times slower?
Is there a way to make the "slow data" run faster, or will any optimization (e.g. vectorizing the calculation) have the same effect on both arrays (i.e., the "slow data" will still run slower than the "fast data")?
EDIT
Complete code is here: https://gist.github.com/1676742, the command line for compiling is in a comment in test.cpp.
The data files are here:
https://ccrma.stanford.edu/~itakatz/tmp/I.bin
https://ccrma.stanford.edu/~itakatz/tmp/I0.bin
Probably that's because your "fast" data consists only of normal floating point numbers, but your "slow" data contains lots of denormalized numbers.
As for your second question, you can try to improve speed with this (and treat all denormalized numbers as exact zeros):
#include <xmmintrin.h>
// 0x8000 sets the FTZ (flush-to-zero) bit, 0x0040 sets the DAZ
// (denormals-are-zero) bit in the MXCSR register:
_mm_setcsr(_mm_getcsr() | 0x8040);
I can think of two reasons for this.
First, the branch predictor may be making incorrect decisions. This is one potential cause of performance gaps caused by data changes without code changes. However, in this case, it seems very unlikely.
The second possible reason is that your "mostly zeros" data doesn't really consist of zeros, but rather of almost-zeros, or that you're keeping arr1 in the almost-zero range. See this Wikipedia link.
There is nothing strange that the data from I.bin takes longer to process: you have lots of numbers like '1.401e-045#DEN' or '2.214e-043#DEN', where #DEN means the number cannot be normalized to the standard float precision. Given that you are going to multiply it by 6.6e-14 you'll definitely have underflow exceptions, which significantly slows down calculations.
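For reference, those #DEN values can be detected directly: denormals (subnormals) are the floats that fill the gap between zero and FLT_MIN, the smallest normalized float. A quick check, assuming IEEE 754 floats with gradual underflow enabled (the default, i.e. FTZ/DAZ not set):

```cpp
#include <cfloat>
#include <cmath>

// A denormal float is nonzero but smaller in magnitude than FLT_MIN.
// Arithmetic on such values typically drops to a slow microcode path,
// which is the ~10x slowdown observed in the question.
bool is_denormal(float x) {
    return x != 0.0f && std::fabs(x) < FLT_MIN;
}
```

For example, `FLT_MIN / 2.0f` underflows into the denormal range, as do values like the 1.401e-45 reported for I.bin.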

What's the numerically best way to calculate the average

What's the best way to calculate the average? With this question I want to know which algorithm for calculating the average is best in a numerical sense: it should have the least rounding error, should not be sensitive to over- or underflow, and so on.
Thank you.
Additional information: incremental approaches preferred since the number of values may not fit into RAM (several parallel calculations on files larger than 4 GB).
If you want an O(N) algorithm, look at Kahan summation.
You can have a look at http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.3535 (Nick Higham, "The accuracy of floating point summation", SIAM Journal of Scientific Computation, 1993).
If I remember correctly, compensated summation (Kahan summation) is good if all numbers are positive: at least as good as sorting them and adding them in ascending order (unless there are very, very many numbers). The story is much more complicated if some numbers are positive and some are negative, so that you get cancellation. In that case, there is an argument for adding them in descending order.
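For reference, a compact version of compensated (Kahan) summation; the mean is then just the compensated sum divided by the count:

```cpp
#include <cstddef>

// Kahan (compensated) summation: the compensation term `c` captures
// the low-order bits lost when a small term is added to a large
// running total, and feeds them back into the next addition.
// Note: aggressive flags like -ffast-math can optimize this away.
double kahan_mean(const double* data, std::size_t n)
{
    double sum = 0.0, c = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double y = data[i] - c;   // correct the incoming term
        double t = sum + y;       // big + corrected small; low bits of y lost
        c = (t - sum) - y;        // algebraically 0; recovers the lost bits
        sum = t;
    }
    return n ? sum / n : 0.0;
}
```

This keeps the error bound essentially independent of n, at the cost of four floating point operations per element instead of one.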
Sort the numbers in ascending order of magnitude. Sum them, low magnitude first. Divide by the count.
I always use the following pseudocode:
float mean = 0.0; // could use double
int n = 0;        // could use long
for each x in data:
    ++n;
    mean += (x - mean)/n;
I don't have a formal proof of its stability, but you can see that we won't have problems with numerical overflow, assuming the data values are well behaved. It's referred to in Knuth's The Art of Computer Programming.
Just to add one possible answer for further discussion:
Incrementally calculate the average for each step:
AVG_n = AVG_(n-1) * (n-1)/n + VALUE_n / n
or pairwise combination
AVG_(n_a + n_b) = (n_a * AVG_a + n_b * AVG_b) / (n_a + n_b)
(I hope the formulas are clear enough)
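The pairwise-combination formula in code (a sketch; the function name is illustrative). This is useful when the files mentioned in the question are processed as independent chunks, possibly in parallel:

```cpp
// Merge two partial means without forming the raw sums: given chunk A
// with n_a values averaging avg_a, and chunk B likewise, weight each
// mean by its share of the combined count. Working with weights rather
// than raw sums avoids overflow from accumulating huge totals.
double merge_means(double avg_a, long n_a, double avg_b, long n_b)
{
    if (n_a + n_b == 0) return 0.0;   // nothing to average
    double wa = (double)n_a / (n_a + n_b);
    double wb = (double)n_b / (n_a + n_b);
    return avg_a * wa + avg_b * wb;
}
```

Each chunk's mean can itself come from the incremental formula above, so the whole computation stays within the range of the data values.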
A very late post, but since I don't have enough reputation to comment: @Dave's method is the one used (as at December 2020) by the GNU Scientific Library.
Here is the code, extracted from mean_source.c:
double
FUNCTION (gsl_stats, mean) (const BASE data[], const size_t stride, const size_t size)
{
  /* Compute the arithmetic mean of a dataset using the recurrence relation
     mean_(n) = mean(n-1) + (data[n] - mean(n-1))/(n+1) */

  long double mean = 0;
  size_t i;

  for (i = 0; i < size; i++)
    {
      mean += (data[i * stride] - mean) / (i + 1);
    }

  return mean;
}
GSL uses the same algorithm to calculate the variance, which is, after all, just a mean of squared differences from a given number.

openMP histogram comparison

I am working on code that compares image histograms by calculating correlation, intersection, chi-square, and a few other measures. These functions all look very similar to each other.
Usually I work with pthreads, but this time I decided to build a small prototype with OpenMP (due to its simplicity) and see what kind of results I get.
This is an example of comparison by correlation; the code is identical to the serial implementation except for the single OpenMP pragma.
double comp(CHistogram* h1, CHistogram* h2) {
    double Sa = 0;
    double Sb = 0;
    double Saa = 0;
    double Sbb = 0;
    double Sab = 0;
    double a, b;
    int N = h1->length;

    #pragma omp parallel for reduction(+:Sa,Sb,Saa,Sbb,Sab) private(a,b)
    for (int i = 0; i < N; i++) {
        a = h1->data[i];
        b = h2->data[i];
        Sa += a;
        Sb += b;
        Saa += a*a;
        Sbb += b*b;
        Sab += a*b;
    }

    double sUp = Sab - Sa*Sb / N;
    double sDown = (Saa - Sa*Sa / N) * (Sbb - Sb*Sb / N);
    return sUp / sqrt(sDown);
}
Are there more ways to speed up this function with openMP ?
Thanks!
PS: I know that the fastest way would be to compare different pairs of histograms across multiple threads, but this is not applicable to my situation since only 2 histograms are available at a time.
Tested on a quad-core machine.
I have a little bit of uncertainty: over a longer run, OpenMP seems to perform better than the serial version. But if I compare just a single histogram and measure time in microseconds, then the serial version is about 20 times faster.
I guess OpenMP performs some optimization once it sees the outer for loop. But in a real solution I will have some code in between histogram comparisons, and I'm not sure if it will behave the same way.
OpenMP takes some time to set up the parallel region. This overhead means you need to be careful that it isn't greater than the performance gained by parallelizing. In your case this means that OpenMP will only speed up the calculation once N reaches a certain size.
You should think about ways to reduce the total number of parallel regions; for instance, is it possible to set up a parallel region outside this function, so that you compare different histograms in parallel?
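Since only two histograms are available at a time, another way to amortize the setup cost is to run several comparison measures on the same pair inside one parallel region, so the thread team is created once and reused by each worksharing loop. A sketch (CHistogram is modeled as a plain struct here; the pragmas are harmlessly ignored if OpenMP is disabled):

```cpp
#include <cmath>
#include <vector>

struct Histogram { std::vector<double> data; }; // stand-in for CHistogram

// Compute correlation and intersection of two histograms inside a
// single parallel region: the same thread team executes both loops
// instead of being set up once per comparison method.
void compare_both(const Histogram& h1, const Histogram& h2,
                  double* correlation, double* intersection)
{
    const int N = (int)h1.data.size();
    double Sa = 0, Sb = 0, Saa = 0, Sbb = 0, Sab = 0, inter = 0;

    #pragma omp parallel
    {
        #pragma omp for reduction(+:Sa,Sb,Saa,Sbb,Sab)
        for (int i = 0; i < N; i++) {
            double a = h1.data[i]; // loop-local, so no private() clause needed
            double b = h2.data[i];
            Sa += a; Sb += b;
            Saa += a*a; Sbb += b*b; Sab += a*b;
        }
        #pragma omp for reduction(+:inter)
        for (int i = 0; i < N; i++)
            inter += std::fmin(h1.data[i], h2.data[i]);
    }

    *correlation  = (Sab - Sa*Sb/N) / std::sqrt((Saa - Sa*Sa/N)*(Sbb - Sb*Sb/N));
    *intersection = inter;
}
```

Declaring a and b inside the loop also sidesteps the private() clause from the original code, which is generally considered cleaner OpenMP style.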