How to multiply a sparse matrix and a dense vector? - c++

I am trying the following:
Eigen::SparseMatrix<double> bijection(2 * face_count, 2 * vert_count);
/* initialization */
Eigen::VectorXd toggles(2 * vert_count);
toggles.setOnes();
Eigen::SparseMatrix<double> deformed;
deformed = bijection * toggles;
Eigen is returning an error claiming:
error: static assertion failed: THE_EVAL_EVALTO_FUNCTION_SHOULD_NEVER_BE_CALLED_FOR_DENSE_OBJECTS
586 | EIGEN_STATIC_ASSERT((internal::is_same<Dest,void>::value),THE_EVAL_EVALTO_FUNCTION_SHOULD_NEVER_BE_CALLED_FOR_DENSE_OBJECTS);
According to the Eigen documentation, sparse matrix and vector products are allowed. What am I doing wrong?

The problem is you have the wrong output type for the product.
The Eigen documentation states that the following type of multiplication is defined:
dv2 = sm1 * dv1;
Sparse matrix times dense vector equals dense vector.
If you actually do need a sparse representation, I think there is no better way of getting one than performing the multiplication as above and then converting the product to a sparse matrix with the sparseView member function. e.g.
Eigen::SparseMatrix<double> bijection(2 * face_count, 2 * vert_count);
/* initialization */
Eigen::VectorXd toggles(2 * vert_count);
toggles.setOnes();
Eigen::VectorXd deformedDense = bijection * toggles;
Eigen::SparseMatrix<double> deformedSparse = deformedDense.sparseView();
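To illustrate why the product is naturally dense, here is a minimal sketch in plain C++ that does not use Eigen. The CsrMatrix struct and the multiply and to_sparse helpers are hypothetical names made up for this example; they are not Eigen's implementation. Every row produces one output entry, so the result is a dense vector, and a sparse form is obtained afterwards by dropping the zeros, which is similar in spirit to sparseView.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical minimal CSR (compressed sparse row) matrix, for illustration only.
struct CsrMatrix {
    std::size_t rows = 0, cols = 0;
    std::vector<std::size_t> row_start;  // size rows + 1
    std::vector<std::size_t> col_index;  // column of each stored value
    std::vector<double> value;           // stored (non-zero) values
};

// Sparse matrix times dense vector: each row yields one output entry,
// so the natural result type is a dense vector.
std::vector<double> multiply(const CsrMatrix& a, const std::vector<double>& x) {
    assert(x.size() == a.cols);
    std::vector<double> y(a.rows, 0.0);
    for (std::size_t r = 0; r < a.rows; ++r)
        for (std::size_t k = a.row_start[r]; k < a.row_start[r + 1]; ++k)
            y[r] += a.value[k] * x[a.col_index[k]];
    return y;
}

// Keep only the non-zero entries as (index, value) pairs.
std::vector<std::pair<std::size_t, double>> to_sparse(const std::vector<double>& y) {
    std::vector<std::pair<std::size_t, double>> out;
    for (std::size_t i = 0; i < y.size(); ++i)
        if (y[i] != 0.0) out.emplace_back(i, y[i]);
    return out;
}
```

The second pass over the dense result is cheap compared with the product itself, which is one reason the "multiply, then sparsify" route is usually good enough.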

The following computes the product directly into a sparse vector. This can be faster than outputting to a dense vector if the result is very sparse. Otherwise, 99 times out of 100 the conventional product is faster.
void sparsem_densev_sparsev(const SparseMatrix<double>& A, const VectorX<double>& x, SparseVector<double>& Ax)
{
Ax.resize(A.outerSize()); // one entry per outer vector of A (equals x.size() only when A is square)
for (int j = 0; j < A.outerSize(); ++j)
{
if (A.outerIndexPtr()[j + 1] - A.outerIndexPtr()[j] > 0)
{
Ax.insertBack(j) = 0;
}
}
for (int j_idx = 0; j_idx < Ax.nonZeros(); j_idx++)
{
int j = Ax.innerIndexPtr()[j_idx];
for (int k = A.outerIndexPtr()[j]; k < A.outerIndexPtr()[j + 1]; ++k)
{
int i = A.innerIndexPtr()[k];
Ax.valuePtr()[j_idx] += A.valuePtr()[k] * x.coeff(i);
}
}
}
For a (probably not optimal) self-adjoint version (lower triangle), change the j_idx loop to:
for (int j_idx = 0; j_idx < Ax.nonZeros(); j_idx++)
{
int j = Ax.innerIndexPtr()[j_idx];
int i_idx = j_idx;//i>= j, trick to improve binary search
for (int k = A.outerIndexPtr()[j]; k < A.outerIndexPtr()[j + 1]; ++k)
{
int i = A.innerIndexPtr()[k];
Ax.valuePtr()[j_idx] += A.valuePtr()[k] * x.coeff(i);
if (i != j)
{
i_idx = std::distance(Ax.innerIndexPtr(), std::lower_bound(Ax.innerIndexPtr() + i_idx, Ax.innerIndexPtr() + Ax.nonZeros(), i));
Ax.valuePtr()[i_idx] += A.valuePtr()[k] * x.coeff(j);
}
}
}

Related

Error using Eigen: Perform element-wise multiplication between a vector and matrix

I am trying to perform an element-wise multiplication of a row vector with matrix. In MATLAB this would be simply done by the "dot" operator or:
deriv = 1i * k .* fk;
where k is row vector and fk is a matrix.
Now in C++ I have this code:
static const int nx = 10;
static const int ny = 10;
static const int nyk = ny/2 + 1;
static const int nxk = nx/2 + 1;
static const int ncomp = 2;
Matrix <double, 1, nx> eK;
eK.setZero();
for(int i = 0; i < nx; i++){
eK[i] = //some expression
}
fftw_complex *UOut;
UOut= (fftw_complex*) fftw_malloc((((nx)*(ny+1))*nyk)* sizeof(fftw_complex));
for (int i = 0; i < nx; i++){
for (int j = 0; j < ny+1; j++){
for (int k = 0; k < ncomp; k++){
UOut[i*(ny+1)+j][k] = //FFT of some expression
}
}
}
Eigen::Map<Eigen::MatrixXcd, Eigen::Unaligned> U(reinterpret_cast<std::complex<double>*>(UOut),(ny+1),nx);
Now, I am trying to take the product of eK, which is a 1 x 10 row vector, and the matrix U, which is 11 x 10. I tried a few things, none of which seem to work:
U = 1i * eKX.array() * euhX.array() ; //ERROR
static assertion failed: YOU_MIXED_MATRICES_OF_DIFFERENT_SIZES
( \
| ~~~
176 | (int(Eigen::internal::size_of_xpr_at_compile_time<TYPE0>::ret)==0 && int(Eigen::internal::size_of_xpr_at_compile_time<TYPE1>::ret)==0) \
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
177 | || (\
| ^~~~~
178 | (int(TYPE0::RowsAtCompileTime)==Eigen::Dynamic \
Eigen doesn't do broadcasting the same way MATLAB or NumPy do unless you explicitly ask for it, for example with matrix.array().rowwise() * vector.array()
The IMHO clearer form is to interpret the vector as a diagonal matrix.
Eigen::VectorXd eK = ...;
Eigen::Map<Eigen::MatrixXcd, Eigen::Unaligned> U = ...;
Eigen::MatrixXcd result = U * (eK * 1i).asDiagonal();
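To make the diagonal-matrix interpretation concrete: right-multiplying by a diagonal matrix scales column j by the j-th diagonal entry. The following is a minimal sketch in plain C++ without Eigen, assuming a row-major layout; the function name scale_columns_by_ik is hypothetical and exists only for this illustration.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Equivalent of M * diag(1i * k): column c of the row-major
// rows x cols matrix m is multiplied by the scalar 1i * k[c].
void scale_columns_by_ik(std::vector<cplx>& m, std::size_t rows, std::size_t cols,
                         const std::vector<double>& k) {
    assert(m.size() == rows * cols && k.size() == cols);
    const cplx imag_unit(0.0, 1.0);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            m[r * cols + c] *= imag_unit * k[c];
}
```

Each row sees the same vector of factors, which is exactly what the MATLAB expression 1i * k .* fk broadcasts.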

Constructing distance matrix in parallel in C++11 using OpenMP

I would like to construct a distance matrix in parallel in C++11 using OpenMP. I have read various documentation, introductions, examples, etc. Yet, I still have a few questions. To facilitate answering this post, I state my questions as assumptions numbered 1 through 7. This way, you can quickly browse through them and point out which ones are correct and which ones are not.
Let us begin with a simple serially executed function computing a dense Armadillo matrix:
// [[Rcpp::export]]
arma::mat compute_dist_mat(arma::mat &coordinates, unsigned int n_points) {
arma::mat dist_mat(n_points, n_points, arma::fill::zeros);
double dist {};
for(unsigned int i {0}; i < n_points; i++) {
for(unsigned int j = i + 1; j < n_points; j++) {
dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
dist_mat.at(i, j) = dist;
dist_mat.at(j, i) = dist;
}
}
return dist_mat;
}
As a side note: this function is supposed to be called from R through the Rcpp interface - indicated by the // [[Rcpp::export]]. And accordingly the top of the file includes
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(cpp11)]]
#include <omp.h>
// [[Rcpp::plugins(openmp)]]
using namespace Rcpp;
using namespace arma;
However, the function should work also fine without the R interface.
In an attempt to parallelize the code, I replace the loops with
unsigned int i {};
unsigned int j {};
# pragma omp parallel for private(dist, i, j) num_threads(n_threads) if(n_threads > 1)
for(i = 0; i < n_points; i++) {
for(j = i + 1; j < n_points; j++) {
dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
dist_mat.at(i, j) = dist;
dist_mat.at(j, i) = dist;
}
}
and add n_threads as an argument to the compute_dist_mat function.
1. This distributes the iterations of the outer loop across threads, with the iterations of the inner loop executed by the respective thread handling the outer loop.
2. The two loop levels cannot be combined because the inner loop depends on the outer one.
3. dist, i, and j are all to be initialized above the # pragma line and then declared private rather than initializing them in the loops.
4. The # pragma line does not have any effect when n_threads = 1, inducing a serial execution.
Extending the dense matrix application, the following code block illustrates the serial sparse matrix case with batch insertion. To motivate the use of sparse matrices here, I set distances below a certain threshold to zero.
// [[Rcpp::export]]
arma::sp_mat compute_dist_spmat(arma::mat &coordinates, unsigned int n_points, double dist_threshold) {
std::vector<double> dists;
std::vector<unsigned int> dist_i;
std::vector<unsigned int> dist_j;
double dist {};
for(unsigned long int i {0}; i < n_points; i++) {
for(unsigned long int j = i + 1; j < n_points; j++) {
dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
if(dist >= dist_threshold) {
dists.push_back(dist);
dist_i.push_back(i);
dist_j.push_back(j);
}
}
}
unsigned int mat_size = dist_i.size();
arma::umat index_mat(2, mat_size * 2);
arma::vec dists_vec(mat_size * 2);
unsigned int j {};
for(unsigned int i {0}; i < mat_size; i++) {
j = i * 2;
index_mat.at(0, j) = dist_i[i];
index_mat.at(1, j) = dist_j[i];
index_mat.at(0, j + 1) = dist_j[i];
index_mat.at(1, j + 1) = dist_i[i];
dists_vec.at(j) = dists[i];
dists_vec.at(j + 1) = dists[i];
}
arma::sp_mat dist_mat(index_mat, dists_vec, n_points, n_points);
return dist_mat;
}
Because the function does ex ante not know how many distances are above the threshold, it first stores the non-zero values in standard vectors and then constructs the Armadillo objects from them.
I parallelize the function as follows:
// [[Rcpp::export]]
arma::sp_mat compute_dist_spmat(arma::mat &coordinates, unsigned int n_points, double dist_threshold, unsigned short int n_threads) {
std::vector<std::vector<double>> dists(n_points);
std::vector<std::vector<unsigned int>> dist_j(n_points);
double dist {};
unsigned int i {};
unsigned int j {};
# pragma omp parallel for private(dist, i, j) num_threads(n_threads) if(n_threads > 1)
for(i = 0; i < n_points; i++) {
for(j = i + 1; j < n_points; j++) {
dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
if(dist >= dist_threshold) {
dists[i].push_back(dist);
dist_j[i].push_back(j);
}
}
}
unsigned int vec_intervals[n_points + 1];
vec_intervals[0] = 0;
for (i = 0; i < n_points; i++) {
vec_intervals[i + 1] = vec_intervals[i] + dist_j[i].size();
}
unsigned int mat_size {vec_intervals[n_points]};
arma::umat index_mat(2, mat_size * 2);
arma::vec dists_vec(mat_size * 2);
unsigned int vec_begins_i {};
unsigned int vec_length_i {};
unsigned int k {};
# pragma omp parallel for private(i, j, k, vec_begins_i, vec_length_i) num_threads(n_threads) if(n_threads > 1)
for(i = 0; i < n_points; i++) {
vec_begins_i = vec_intervals[i];
vec_length_i = vec_intervals[i + 1] - vec_begins_i;
for(j = 0; j < vec_length_i; j++) {
k = (vec_begins_i + j) * 2;
index_mat.at(0, k) = i;
index_mat.at(1, k) = dist_j[i][j];
index_mat.at(0, k + 1) = dist_j[i][j];
index_mat.at(1, k + 1) = i;
dists_vec.at(k) = dists[i][j];
dists_vec.at(k + 1) = dists[i][j];
}
}
arma::sp_mat dist_mat(index_mat, dists_vec, n_points, n_points);
return dist_mat;
}
5. Using dynamic vectors in the loop is thread-safe.
6. dist, i, j, k, vec_begins_i, and vec_length_i are all to be initialized above the # pragma line and then declared private rather than initializing them in the loops.
7. Nothing has to be marked as a section.
Are any of the seven statements incorrect?
The following does not directly answer your question (it's just some dev code I copied from a personal GitHub repo), but it makes several points clear that may be of use in your application:
OpenMP automatically determines private members so long as you are not doing any dynamic memory allocation within the parallel loop
For sparse matrix distance calculations, it becomes important to move beyond a simple calculation of distance at each non-zero index and instead consider the structure of sparsity that is expected, and optimize for that. In the example below, I assume both matrices are very sparse and their intersection is less than their union. Thus, I "precondition" each distance calculation with squared column sums (for calculating Euclidean distance), and then adjust the calculation for the intersection only. This avoids complicated iterator structures and is very fast.
Using as few temporaries as possible is much to your benefit, and sparse matrix iterators do as good of a job of this as any alternative code anyone may ever write.
Eigen provides better vectorization than Armadillo (across the board, I might add) which means you want Eigen instead of Armadillo if those last 20% of performance gains are important to you.
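As a sketch of the first point: variables declared inside the parallel loop body, and the loop counter itself, are private to each thread without any private clause. The example below is plain C++ with a hypothetical distance_matrix function, and it uses std::hypot as a stand-in for compute_dist, whose definition is not shown in the question. schedule(dynamic) is my own addition, chosen because the triangular loop gives early rows more work than late ones.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Fill a symmetric n x n distance matrix (row-major) from 2-D points.
std::vector<double> distance_matrix(const std::vector<double>& xs,
                                    const std::vector<double>& ys) {
    assert(xs.size() == ys.size());
    const long n = static_cast<long>(xs.size());
    std::vector<double> d(xs.size() * xs.size(), 0.0);
    // If OpenMP is not enabled, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < n; ++i) {              // loop counter: private automatically
        for (long j = i + 1; j < n; ++j) {      // declared in the loop: private
            const double dist = std::hypot(xs[i] - xs[j], ys[i] - ys[j]);  // private
            d[i * n + j] = dist;  // each (i, j) pair is written by exactly one thread
            d[j * n + i] = dist;
        }
    }
    return d;
}
```

Only d, xs and ys are shared, and no two threads write the same element, so no critical section is needed here.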
This function calculates the Euclidean distance between all unique pairs of columns in an Eigen::SparseMatrix<double> object:
// sparse column-wise Euclidean distance between all columns
Eigen::MatrixXd distance(Eigen::SparseMatrix<double>& A) {
Eigen::MatrixXd dists(A.cols(), A.cols());
Eigen::VectorXd sq_colsums = Eigen::VectorXd::Zero(A.cols()); // must start at zero before accumulating
for (int col = 0; col < A.cols(); ++col)
for (Eigen::SparseMatrix<double>::InnerIterator it(A, col); it; ++it)
sq_colsums(col) += it.value() * it.value();
#pragma omp parallel for
for (unsigned int i = 0; i < (A.cols() - 1); ++i) {
for (unsigned int j = (i + 1); j < A.cols(); ++j) {
double dist = sq_colsums(i) + sq_colsums(j);
Eigen::SparseMatrix<double>::InnerIterator it1(A, i), it2(A, j);
while (it1 && it2) {
if (it1.row() < it2.row()) ++it1;
else if (it1.row() > it2.row()) ++it2;
else {
dist -= it1.value() * it1.value();
dist -= it2.value() * it2.value();
dist += std::pow(it1.value() - it2.value(), 2);
++it1; ++it2;
}
}
dists(i, j) = std::sqrt(dist);
dists(j, i) = dists(i, j);
}
}
dists.diagonal().array() = 0; // a column has zero Euclidean distance to itself
return dists;
}
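The "precondition, then correct on the intersection" idea used above rests on the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, where only indices present in both vectors contribute to a.b. Here is a minimal plain C++ sketch of that step for two sparse vectors stored as index-sorted (index, value) pairs. The SparseVec alias and both function names are hypothetical and not part of Eigen.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// (index, value) pairs, assumed sorted by strictly increasing index.
using SparseVec = std::vector<std::pair<std::size_t, double>>;

double squared_norm(const SparseVec& v) {
    double s = 0.0;
    for (const auto& e : v) s += e.second * e.second;
    return s;
}

double sparse_distance(const SparseVec& a, const SparseVec& b) {
    // Start from ||a||^2 + ||b||^2, then correct on shared indices only.
    double d2 = squared_norm(a) + squared_norm(b);
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].first < b[j].first) ++i;
        else if (a[i].first > b[j].first) ++j;
        else {
            d2 -= 2.0 * a[i].second * b[j].second;  // the -2 a.b term
            ++i; ++j;
        }
    }
    return std::sqrt(std::max(d2, 0.0));  // guard against tiny negative rounding
}
```

In an all-pairs setting the squared norms would be computed once per column rather than once per pair, as the function above does with sq_colsums.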
As Dirk and others have said, there are packages out there (e.g. ParallelDist) that seem to do everything you're after (for dense matrices). Look at wordspace for fast cosine distance calculations. See here for some comparisons. Cosine distance is easy to calculate efficiently in R without Rcpp by using crossprod operations (see the qlcMatrix::cosSparse source code for algorithmic inspiration).

Implement 2d array coordinates in 1d array in C++

The code inside the for loop is for the x and y (j and i) "coordinates" from a 2d array. How could I implement this neighbor/index finding in a 1d array?
I think I could implement it for the first four equations, but I'm confused as to how to implement up-left etc.
for(int i=0; i<cols*rows; i++){
//Counts current index's 8 neigbour int values
int count=0;
int x = i%cols;
int y = i/rows;
//rows y i
//cols x j
count+= [grid][i][(j-1+cols)%cols] //left
+[grid][i][(j+1+cols)%cols] //right
+[grid][(i-1+rows)%rows][j] //up
+[grid][(i+1+rows)%rows][j] //down
+[grid][(i-1+rows)%rows][ (j-1+cols)%cols] //up-left
+[grid][(i+1+rows)%rows][ (j+1+cols)%cols] //down-right
+[grid][(i+1+rows)%rows][ (j-1+cols)%cols] //down-left
+[grid][(i-1+rows)%rows][ (j+1+cols)%cols] ;//up-right
}
Starting with a 1-D vector:
int rows = 10;
int cols = 10;
vector<int> grid(rows * cols);
You can manage this in different ways, for example:
for(int y = 0; y < rows; y++)
{
for(int x = 0; x < cols; x++)
{
int point = grid[y * cols + x]; // row index times row length (cols), plus column index
}
}
Where you can access any point at any given x and y in a 2-dimensional plane.
Top-left is:
x = 0;
y = 0;
bottom-right is
x = cols - 1;
y = rows - 1;
And so on.
Use a function like this
inline int idx(const int i, const int j, const int cols)
{
return i * cols + j; // row index times row length, plus column index
}
to convert the 2d indices to 1d indices.
This way you don't have to change your algorithm.
Usage would be grid[idx(i, (j-1+cols)%cols, cols)].
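Putting this together, here is a minimal sketch of the wrap-around 8-neighbour count on a flat vector, assuming a row-major layout. The name count_neighbours is made up for this example, and the block carries its own copy of idx so that it compiles on its own.

```cpp
#include <cassert>
#include <vector>

// Row-major 2d -> 1d index: row index times row length (cols), plus column.
inline int idx(int row, int col, int cols) { return row * cols + col; }

// Sum of the 8 neighbours of linear index i, wrapping around at the edges.
int count_neighbours(const std::vector<int>& grid, int rows, int cols, int i) {
    assert(static_cast<int>(grid.size()) == rows * cols);
    assert(i >= 0 && i < rows * cols);
    const int row = i / cols;  // divide by the row length, not the row count
    const int col = i % cols;
    int count = 0;
    for (int dr = -1; dr <= 1; ++dr) {
        for (int dc = -1; dc <= 1; ++dc) {
            if (dr == 0 && dc == 0) continue;        // skip the cell itself
            const int r = (row + dr + rows) % rows;  // wrap vertically
            const int c = (col + dc + cols) % cols;  // wrap horizontally
            count += grid[idx(r, c, cols)];
        }
    }
    return count;
}
```

The diagonal neighbours (up-left and so on) need no special handling: they are simply the combinations where both dr and dc are non-zero.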
The basic formula for computing the 1d coordinate from the 2d index pattern is usually one of the following:
row_index * row_length + column_index
column_index * column_length + row_index
Which one applies to your case depends on whether you would like to have a row-based or column-based memory layout for your 2d array. It makes sense to factor out the computation of this index into a separate function, as suggested in the other answer.
Then you just need to fill in the values somehow.
You could do it like this, for example:
// iterate big picture
// TODO: make sure to handle the edge cases appropriately
for (int i_row = 1; i_row < n_rows - 1; i_row++) {
for (int i_col = 1; i_col < n_cols -1; i_col++) {
// compute values
dst[i_row*n_cols+i_col] = 0;
for (int r = i_row-1; r < i_row+2; r++) {
for (int c = i_col-1; c < i_col+2; c++) {
dst[i_row*n_cols+i_col] += src[r*n_cols + c];
}
}
}
}
Assuming src and dst are distinct 1d vectors of size n_rows*n_cols...

C++ Pattern Matching with FFT cross-correlation (Images)

Hi everyone, I am trying to implement pattern matching with FFT, but I am not sure what the result should be. I think I am missing something, even though I have read a lot about the problem and tried many different implementations; this one is the best so far. Here is my FFT correlation function.
void fft2d(fftw_complex**& a, int rows, int cols, bool forward = true)
{
fftw_plan p;
for (int i = 0; i < rows; ++i)
{
p = fftw_plan_dft_1d(cols, a[i], a[i], forward ? FFTW_FORWARD : FFTW_BACKWARD, FFTW_ESTIMATE);
fftw_execute(p);
}
fftw_complex* t = (fftw_complex*)fftw_malloc(rows * sizeof(fftw_complex));
for (int j = 0; j < cols; ++j)
{
for (int i = 0; i < rows; ++i)
{
t[i][0] = a[i][j][0];
t[i][1] = a[i][j][1];
}
p = fftw_plan_dft_1d(rows, t, t, forward ? FFTW_FORWARD : FFTW_BACKWARD, FFTW_ESTIMATE);
fftw_execute(p);
for (int i = 0; i < rows; ++i)
{
a[i][j][0] = t[i][0];
a[i][j][1] = t[i][1];
}
}
fftw_free(t);
}
int findCorrelation(int argc, char* argv[])
{
BMP bigImage;
BMP keyImage;
BMP result;
RGBApixel blackPixel = { 0, 0, 0, 1 };
const bool swapQuadrants = (argc == 4);
if (argc < 3 || argc > 4) {
cout << "correlation img1.bmp img2.bmp" << endl;
return 1;
}
if (!keyImage.ReadFromFile(argv[1])) {
return 1;
}
if (!bigImage.ReadFromFile(argv[2])) {
return 1;
}
//Preparations
const int maxWidth = std::max(bigImage.TellWidth(), keyImage.TellWidth());
const int maxHeight = std::max(bigImage.TellHeight(), keyImage.TellHeight());
const int rowsCount = maxHeight;
const int colsCount = maxWidth;
BMP bigTemp = bigImage;
BMP keyTemp = keyImage;
keyImage.SetSize(maxWidth, maxHeight);
bigImage.SetSize(maxWidth, maxHeight);
for (int i = 0; i < rowsCount; ++i)
for (int j = 0; j < colsCount; ++j) {
RGBApixel p1;
if (i < bigTemp.TellHeight() && j < bigTemp.TellWidth()) {
p1 = bigTemp.GetPixel(j, i);
} else {
p1 = blackPixel;
}
bigImage.SetPixel(j, i, p1);
RGBApixel p2;
if (i < keyTemp.TellHeight() && j < keyTemp.TellWidth()) {
p2 = keyTemp.GetPixel(j, i);
} else {
p2 = blackPixel;
}
keyImage.SetPixel(j, i, p2);
}
//Here is where the transforms begin
fftw_complex **a = (fftw_complex**)fftw_malloc(rowsCount * sizeof(fftw_complex*));
fftw_complex **b = (fftw_complex**)fftw_malloc(rowsCount * sizeof(fftw_complex*));
fftw_complex **c = (fftw_complex**)fftw_malloc(rowsCount * sizeof(fftw_complex*));
for (int i = 0; i < rowsCount; ++i) {
a[i] = (fftw_complex*)fftw_malloc(colsCount * sizeof(fftw_complex));
b[i] = (fftw_complex*)fftw_malloc(colsCount * sizeof(fftw_complex));
c[i] = (fftw_complex*)fftw_malloc(colsCount * sizeof(fftw_complex));
for (int j = 0; j < colsCount; ++j) {
RGBApixel p1;
p1 = bigImage.GetPixel(j, i);
a[i][j][0] = (0.299*p1.Red + 0.587*p1.Green + 0.114*p1.Blue);
a[i][j][1] = 0.0;
RGBApixel p2;
p2 = keyImage.GetPixel(j, i);
b[i][j][0] = (0.299*p2.Red + 0.587*p2.Green + 0.114*p2.Blue);
b[i][j][1] = 0.0;
}
}
fft2d(a, rowsCount, colsCount);
fft2d(b, rowsCount, colsCount);
result.SetSize(maxWidth, maxHeight);
for (int i = 0; i < rowsCount; ++i)
for (int j = 0; j < colsCount; ++j) {
fftw_complex& y = a[i][j];
fftw_complex& x = b[i][j];
double u = x[0], v = x[1];
double m = y[0], n = y[1];
c[i][j][0] = u*m + n*v;
c[i][j][1] = v*m - u*n;
int fx = j;
if (fx>(colsCount / 2)) fx -= colsCount;
int fy = i;
if (fy>(rowsCount / 2)) fy -= rowsCount;
float r2 = (fx*fx + fy*fy);
const double cuttoffCoef = (maxWidth * maxHeight) / 37992.;
if (r2<128 * 128 * cuttoffCoef)
c[i][j][0] = c[i][j][1] = 0;
}
fft2d(c, rowsCount, colsCount, false);
const int halfCols = colsCount / 2;
const int halfRows = rowsCount / 2;
if (swapQuadrants) {
for (int i = 0; i < halfRows; ++i)
for (int j = 0; j < halfCols; ++j) {
std::swap(c[i][j][0], c[i + halfRows][j + halfCols][0]);
std::swap(c[i][j][1], c[i + halfRows][j + halfCols][1]);
}
for (int i = halfRows; i < rowsCount; ++i)
for (int j = 0; j < halfCols; ++j) {
std::swap(c[i][j][0], c[i - halfRows][j + halfCols][0]);
std::swap(c[i][j][1], c[i - halfRows][j + halfCols][1]);
}
}
for (int i = 0; i < rowsCount; ++i)
for (int j = 0; j < colsCount; ++j) {
const double& g = c[i][j][0];
RGBApixel pixel;
pixel.Alpha = 0;
int gInt = 255 - static_cast<int>(std::floor(g + 0.5));
pixel.Red = gInt;
pixel.Green = gInt;
pixel.Blue = gInt;
result.SetPixel(j, i, pixel);
}
BMP res;
res.SetSize(maxWidth, maxHeight);
result.WriteToFile("result.bmp");
return 0;
}
Sample output
This question would probably be more appropriately posted on another site like Cross Validated (metaoptimize.com used to also be a good one, but it appears to be gone).
That said:
There are two similar operations you can perform with FFT: convolution and correlation. Convolution is used for determining how two signals interact with each other, whereas correlation can be used to express how similar two signals are to each other. Make sure you're doing the right operation, as they're both commonly implemented through a DFT.
For this type of application of DFTs you usually wouldn't extract any useful information from the Fourier spectrum unless you were looking for frequencies common to both data sources (e.g., if you were comparing two bridges to see if their supports are spaced similarly).
Your 3rd image looks a lot like the power domain; normally I see the correlation output entirely grey except where overlap occurred. Your code definitely appears to be computing the inverse DFT, so unless I'm missing something the only other explanation I've come up with for the fuzzy look could be some of the "fudge factor" code in there like:
if (r2<128 * 128 * cuttoffCoef)
c[i][j][0] = c[i][j][1] = 0;
As for what you should expect: wherever there are common elements between the two images you'll see a peak. The larger the peak, the more similar the two images are near that region.
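To show what such a peak looks like in the simplest setting, here is a one-dimensional sketch: transform both signals, multiply one spectrum by the complex conjugate of the other, transform back, and take the index of the largest value. It uses a naive O(n^2) DFT as a stand-in for an FFT library, and the names dft and best_shift are hypothetical. Multiplying without the conjugate would give a convolution instead.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive unnormalised DFT: sign = -1 for forward, +1 for inverse.
std::vector<cplx> dft(const std::vector<cplx>& in, int sign) {
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(n, cplx(0.0, 0.0));
    for (std::size_t k = 0; k < n; ++k) {
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = sign * 2.0 * pi *
                static_cast<double>((k * t) % n) / static_cast<double>(n);
            out[k] += in[t] * std::polar(1.0, angle);
        }
    }
    return out;
}

// Circular cross-correlation via the frequency domain. Returns the shift s
// that maximises sum_t image[(t + s) % n] * pattern[t].
std::size_t best_shift(const std::vector<double>& image,
                       const std::vector<double>& pattern) {
    assert(image.size() == pattern.size() && !image.empty());
    const std::vector<cplx> a(image.begin(), image.end());
    const std::vector<cplx> b(pattern.begin(), pattern.end());
    const std::vector<cplx> fa = dft(a, -1);
    const std::vector<cplx> fb = dft(b, -1);
    std::vector<cplx> prod(a.size());
    for (std::size_t k = 0; k < a.size(); ++k)
        prod[k] = fa[k] * std::conj(fb[k]);  // conjugate => correlation
    const std::vector<cplx> corr = dft(prod, +1);  // scale factor 1/n omitted
    std::size_t best = 0;
    for (std::size_t s = 1; s < corr.size(); ++s)
        if (corr[s].real() > corr[best].real()) best = s;
    return best;
}
```

For a pattern {1, 2, 3} zero-padded to the image length and an image containing the same values starting at index 2, the largest correlation value sits at shift 2. The 2-D image case works the same way along both axes.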
Some comments and/or recommended changes:
1) Convolution & correlation are not scale invariant operations. In other words, the size of your pattern image can make a significant difference in your output.
2) Normalize your images before correlation.
When you get the image data ready for the forward DFT pass:
a[i][j][0] = (0.299*p1.Red + 0.587*p1.Green + 0.114*p1.Blue);
a[i][j][1] = 0.0;
/* ... */
How you grayscale the image is your business (though I would've picked something like sqrt( r*r + b*b + g*g )). However, I don't see you doing anything to normalize the image.
The word "normalize" can take on a few different meanings in this context. Two common types:
normalize the range of values between 0.0 and 1.0
normalize the "whiteness" of the images
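Both kinds of normalization can be sketched in a few lines of plain C++ on a flat vector of grayscale values. The function names are hypothetical, and reading "whiteness" as the mean brightness of the image is my own assumption about what is meant here.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Rescale values linearly into [0, 1]; a constant image maps to all zeros.
std::vector<double> normalize_range(std::vector<double> v) {
    assert(!v.empty());
    const auto mm = std::minmax_element(v.begin(), v.end());
    const double lo = *mm.first;
    const double span = *mm.second - lo;
    for (double& x : v) x = (span > 0.0) ? (x - lo) / span : 0.0;
    return v;
}

// Subtract the mean so that overall brightness does not dominate
// the correlation (one possible reading of normalizing "whiteness").
std::vector<double> subtract_mean(std::vector<double> v) {
    assert(!v.empty());
    const double mean =
        std::accumulate(v.begin(), v.end(), 0.0) / static_cast<double>(v.size());
    for (double& x : v) x -= mean;
    return v;
}
```

Either function would be applied to the grayscale values before they are copied into the real parts of the FFT input arrays.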
3) Run your pattern image through an edge enhancement filter. I've personally made use of canny, sobel, and I think I messed with a few others. As I recall, canny was "quick'n dirty", sobel was more expensive, but I got comparable results when it came time to do correlation. See chapter 24 of the "dsp guide" book that's freely available online. The whole book is worth your time, but if you're low on time then at a minimum chapter 24 will help a lot.
4) Re-scale the output image between [0, 255]; if you want to implement thresholds, do it after this step because the thresholding step is lossy.
My memory on this one is hazy, but as I recall (edited for clarity):
You can scale the final image pixels (before rescaling) between [-1.0, 1.0] by dividing off the largest power spectrum value from the entire power spectrum
The largest power spectrum value is, conveniently enough, the center-most value in the power spectrum (corresponding to the lowest frequency)
If you divide it off the power spectrum, you'll end up doing twice the work; since FFTs are linear, you can delay the division until after the inverse DFT pass to when you're re-scaling the pixels between [0..255].
If after rescaling most of your values end up so black you can't see them, you can use a solution to the ODE y' = y(1 - y) (one example is the sigmoid f(x) = 1 / (1 + exp(-c*x) ), for some scaling factor c that gives better gradations). This has more to do with improving your ability to interpret the results visually than anything you might use to programmatically find peaks.
Edit: I said [0, 255] above. I suggest you rescale to [128, 255] or some other lower bound that is gray rather than black.
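A minimal sketch of that final rescaling step, assuming the real part of the inverse transform has been copied into a flat vector of doubles. The function name rescale_to_gray and its default bounds are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Map values linearly onto [lo, hi] gray levels; a constant input maps to lo.
std::vector<unsigned char> rescale_to_gray(const std::vector<double>& v,
                                           int lo = 128, int hi = 255) {
    assert(!v.empty() && lo >= 0 && hi <= 255 && lo <= hi);
    const auto mm = std::minmax_element(v.begin(), v.end());
    const double vmin = *mm.first;
    const double span = *mm.second - vmin;
    std::vector<unsigned char> out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) {
        const double t = (span > 0.0) ? (v[i] - vmin) / span : 0.0;  // in [0, 1]
        out[i] = static_cast<unsigned char>(std::lround(lo + t * (hi - lo)));
    }
    return out;
}
```

Any thresholding would then operate on the returned gray levels, after the rescaling, as recommended in point 4.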

Optimize log entropy calculation in sparse matrix

I have a 3007 x 1644 dimensional matrix of terms and documents. I am trying to assign weights to frequency of terms in each document so I'm using this log entropy formula http://en.wikipedia.org/wiki/Latent_semantic_indexing#Term_Document_Matrix (See entropy formula in the last row).
I'm successfully doing this but my code is running for >7 minutes.
Here's the code:
int N = mat.cols();
for(int i=1;i<=mat.rows();i++){
double gfi = sum(mat(i,colon()))(1,1); //sum of occurrence of terms
double g =0;
if(gfi != 0){// to avoid divide by zero error
for(int j = 1;j<=N;j++){
double tfij = mat(i,j);
double pij = gfi==0?0.0:tfij/gfi;
pij = pij + 1; //avoid log0
double G = (pij * log(pij))/log(N);
g = g + G;
}
}
double gi = 1 - g;
for(int j=1;j<=N;j++){
double tfij = mat(i,j) + 1;//avoid log0
double aij = gi * log(tfij);
mat(i,j) = aij;
}
}
Anyone have ideas how I can optimize this to make it faster? Oh and mat is a RealSparseMatrix from amlpp matrix library.
UPDATE
Code runs on Linux mint with 4gb RAM and AMD Athlon II dual core
Running time before change: > 7mins
After #Kereks answer: 4.1sec
Here's a very naive rewrite that removes some redundancies:
int const N = mat.cols();
double const logN = log(N);
for (int i = 1; i <= mat.rows(); ++i)
{
double const gfi = sum(mat(i, colon()))(1, 1); // sum of occurrence of terms
double g = 0;
if (gfi != 0)
{
for (int j = 1; j <= N; ++j)
{
double const pij = mat(i, j) / gfi + 1;
g += pij * log(pij);
}
g /= logN;
}
for (int j = 1; j <= N; ++j)
{
mat(i,j) = (1 - g) * log(mat(i, j) + 1);
}
}
Also make sure that the matrix data structure is sane (e.g. a flat array accessed in strides; not a bunch of dynamically allocated rows).
Also, I think the first + 1 is a bit silly. You know that x -> x * log(x) is continuous at zero with limit zero, so you should write:
double const pij = mat(i, j) / gfi;
if (pij != 0) { g += pij * log(pij); }
In fact, you might even write the first inner for loop like this, avoiding a division when it isn't needed:
for (int j = 1; j <= N; ++j)
{
if (double pij = mat(i, j))
{
pij /= gfi;
g += pij * log(pij);
}
}
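Since the matrix is sparse, a further step in the same direction is to visit only the stored entries of each row. The sketch below is plain C++ and does not use the amlpp API: SparseRow and log_entropy_weight are hypothetical names. It also assumes the usual log-entropy definition g_i = 1 + sum_j p_ij log(p_ij) / log(n), where the sum is never positive once the + 1 inside the logarithm is dropped; note that this sign differs from the 1 - g used in the loops above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// One term's row: (document index, term frequency) for non-zero entries only.
using SparseRow = std::vector<std::pair<std::size_t, double>>;

// In-place log-entropy weighting of one row:
//   g_i  = 1 + sum_j p_ij * log(p_ij) / log(n_docs),  p_ij = tf_ij / gf_i
//   a_ij = g_i * log(tf_ij + 1)
// Zero entries contribute nothing to either formula, so they are never visited.
void log_entropy_weight(SparseRow& row, std::size_t n_docs) {
    assert(n_docs > 1);
    double gf = 0.0;  // global frequency of the term
    for (const auto& e : row) gf += e.second;
    if (gf == 0.0) return;  // empty row: nothing to weight
    double sum = 0.0;
    for (const auto& e : row) {
        const double p = e.second / gf;
        if (p > 0.0) sum += p * std::log(p);
    }
    const double g = 1.0 + sum / std::log(static_cast<double>(n_docs));
    for (auto& e : row) e.second = g * std::log(e.second + 1.0);
}
```

A term that occurs in a single document keeps its full weight (g = 1), while a term spread evenly over all documents gets weight 0.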