Strange behaviours in porting code from Eigen2 to Eigen3 - c++

I'm considering to use this library to perform spectral clustering in my research project.
But, to do so, I need to port it from Eigen2 to Eigen3 (which is what I use in my code).
There's a portion of code that is causing me some troubles.
This is for Eigen2:
double Evrot::evqual(Eigen::MatrixXd& X) {
// take the square of all entries and find max of each row
Eigen::MatrixXd X2 = X.cwise().pow(2);
Eigen::VectorXd max_values = X2.rowwise().maxCoeff();
// compute cost
for (int i=0; i<mNumData; i++ ) {
X2.row(i) = X2.row(i) / max_values[i];
}
double J = 1.0 - (X2.sum()/mNumData -1.0)/mNumDims;
if( DEBUG )
std::cout << "Computed quality = "<< J << std::endl;
return J;
}
as explained here, Eigen3 replaces .cwise() with the slightly different .array() functionality.
So, I wrote:
double Evrot::evqual(Eigen::MatrixXd& X) {
// take the square of all entries and find max of each row
Eigen::MatrixXd X2 = X.array().pow(2);
Eigen::VectorXd max_values = X2.rowwise().maxCoeff();
// compute cost
for (int i=0; i<mNumData; i++ ) {
X2.row(i) = X2.row(i) / max_values[i];
}
double J = 1.0 - (X2.sum()/mNumData -1.0)/mNumDims;
if( DEBUG )
std::cout << "Computed quality = "<< J << std::endl;
return J;
}
and I got no compiler errors.
But, if I give to the two programs the same input (and check that they're actually getting consistent inputs), in the first case I get numbers and in the second only NANs.
My idea is that this is caused by the fact that max_values is badly computed and then using this vector in a division causes all the NANs. But I have no clue on how to fix that.
Can, please, someone explain me how to solve this problem?
Thanks!

Have you checked when the values start to diverge ? Are you sure there is no empty rows or that X^2 do not underflow. Anyways, you should had a guard before dividing by max_values[i]. Moreover, to avoid underflow in squaring you could rewrite it like that:
VectorXd max_values = X.array().abs().rowwise().maxCoeff();
double sum = 0;
for (int i=0; i<mNumData; i++ ) {
if(max_values[i]>0)
sum += (X.row(i)/max_values[i]).squaredNorm();
}
double J = 1.0 - (sum/mNumData -1.0)/mNumDims;
This will work even if X.abs().maxCoeff()==1e-170 whereas your code will underflow and produces NaN. Of course, if you are in a such a case, maybe you should check your inputs first as you are already on dangerous side regarding numerical issues.

Related

How to avoid using nested loops in cpp?

I am working on digital sampling for sensor. I have following code to compute the highest amplitude and the corresponding time.
struct LidarPoints{
float timeStamp;
float Power;
}
std::vector<LidarPoints> measurement; // To store Lidar points of current measurement
Currently power and energy are the same (because of delta function)and vector is arranged in ascending order of time. I would like to change this to step function. Pulse duration is a constant 10ns.
uint32_t pulseDuration = 5;
The problem is to find any overlap between the samples and if any to add up the amplitudes.
I currently use following code:
for(auto i= 0; i< measurement.size(); i++){
for(auto j=i+1; i< measurement.size(); j++){
if(measurement[j].timeStamp - measurement[i].timeStamp) < pulseDuration){
measurement[i].Power += measurement[j].Power;
measurement[i].timeStamp = (measurement[i].timeStamp + measurement[j].timeStamp)/2.0f;
}
}
}
Is it possible to code this without two for loops since I cannot afford the amount of time being taken by nested loops.
You can take advantage that the vector is sorted by timeStamp and find the next pulse with binary search, thus reducing the complexity from O(n^2) to O(n log n):
#include <vector>
#include <algorithm>
#include <numeric>
#include <iterator
auto it = measurement.begin();
auto end = measurement.end();
while (it != end)
{
// next timestamp as in your code
auto timeStampLower = it->timeStamp + pulseDuration;
// next value in measurement with a timestamp >= timeStampLower
auto lower_bound = std::lower_bound(it, end, timeStampLower, [](float a, const LidarPoints& b) {
return a < b.timeStamp;
});
// sum over [timeStamp, timeStampLower)
float sum = std::accumulate(it, lower_bound, 0.0f, [] (float a, const LidarPoints& b) {
return a + b.timeStamp;
});
auto num = std::distance(it, lower_bound);
// num should be >= since the vector is sorted and pulseDuration is positive
// you should uncomment next line to catch unexpected error
// Expects(num >= 1); // needs GSL library
// assert(num >= 1); // or standard C if you don't want to use GSL
// average over [timeStamp, timeStampLower)
it->timeStamp = sum / num;
// advance it
it = lower_bound;
}
https://en.cppreference.com/w/cpp/algorithm/lower_bound
https://en.cppreference.com/w/cpp/algorithm/accumulate
Also please note that my algorithm will produce different result than yours because you don't really compute the average over multiple values with measurement[i].timeStamp = (measurement[i].timeStamp + measurement[j].timeStamp)/2.0f
Also to consider: (I am by far not an expert in the field, so I am just throwing the ideea, it's up to you to know if its valid or not): with your code you just "squash" together close measurement, instead of having a vector of measurement with periodic time. It might be what you intend or not.
Disclaimer: not tested beyond "it compiles". Please don't just copy-paste it. It could be incomplet and incorrekt. But I hope I gave you a direction to investigate.
Due to jitter and other timing complexities, instead of simple summation, you need to switch to [Numerical Integration][۱] (eg. Trapezoidal Integration...).
If your values are in ascending order of timeStamp adding else break to the if statement shouldn't effect the result but should be a lot quicker.
for(auto i= 0; i< measurement.size(); i++){
for(auto j=i+1; i< measurement.size(); j++){
if(measurement[j].timeStamp - measurement[i].timeStamp) < pulseDuration){
measurement[i].Power += measurement[j].Power;
measurement[i].timeStamp = (measurement[i].timeStamp + measurement[j].timeStamp)/2.0f;
} else {
break;
}
}
}

Best way to indexing a matrix in opencv

Let say, A and B are matrices of the same size.
In Matlab, I could use simple indexing as below.
idx = A>0;
B(idx) = 0
How can I do this in OpenCV? Should I just use
for (i=0; ... rows)
for(j=0; ... cols)
if (A.at<double>(i,j)>0) B.at<double>(i,j) = 0;
something like this? Is there a better (faster and more efficient) way?
Moreover, in OpenCV, when I try
Mat idx = A>0;
the variable idx seems to be a CV_8U matrix (not boolean but integer).
You can easily convert this MATLAB code:
idx = A > 0;
B(idx) = 0;
// same as
B(A>0) = 0;
to OpenCV as:
Mat1d A(...)
Mat1d B(...)
Mat1b idx = A > 0;
B.setTo(0, idx) = 0;
// or
B.setTo(0, A > 0);
Regarding performance, in C++ it's usually faster (it depends on the enabled optimizations) to work on raw pointers (but is less readable):
for (int r = 0; r < B.rows; ++r)
{
double* pA = A.ptr<double>(r);
double* pB = B.ptr<double>(r);
for (int c = 0; c < B.cols; ++c)
{
if (pA[c] > 0.0) pB[c] = 0.0;
}
}
Also note that in OpenCV there isn't any boolean matrix, but it's a CV_8UC1 matrix (aka a single channel matrix of unsigned char), where 0 means false, and any value >0 is true (typically 255).
Evaluation
Note that this may vary according to optimization enabled with OpenCV. You can test the code below on your PC to get accurate results.
Time in ms:
my results my results #AdrienDescamps
(OpenCV 3.0 No IPP) (OpenCV 2.4.9)
Matlab : 13.473
C++ Mask: 640.824 5.81815 ~5
C++ Loop: 5.24414 4.95127 ~4
Note: I'm not entirely sure about the performance drop with OpenCV 3.0, so I just remark: test the code below on your PC to get accurate results.
As #AdrienDescamps stated in comments:
It seems that the performance drop with OpenCV 3.0 is related to the OpenCL option, that is now enabled in the comparison operator.
C++ Code
#include <opencv2/opencv.hpp>
#include <iostream>
using namespace std;
using namespace cv;
int main()
{
// Random initialize A with values in [-100, 100]
Mat1d A(1000, 1000);
randu(A, Scalar(-100), Scalar(100));
// B initialized with some constant (5) value
Mat1d B(A.rows, A.cols, 5.0);
// Operation: B(A>0) = 0;
{
// Using mask
double tic = double(getTickCount());
B.setTo(0, A > 0);
double toc = (double(getTickCount()) - tic) * 1000 / getTickFrequency();
cout << "Mask: " << toc << endl;
}
{
// Using for loop
double tic = double(getTickCount());
for (int r = 0; r < B.rows; ++r)
{
double* pA = A.ptr<double>(r);
double* pB = B.ptr<double>(r);
for (int c = 0; c < B.cols; ++c)
{
if (pA[c] > 0.0) pB[c] = 0.0;
}
}
double toc = (double(getTickCount()) - tic) * 1000 / getTickFrequency();
cout << "Loop: " << toc << endl;
}
getchar();
return 0;
}
Matlab Code
% Random initialize A with values in [-100, 100]
A = (rand(1000) * 200) - 100;
% B initialized with some constant (5) value
B = ones(1000) * 5;
tic
B(A>0) = 0;
toc
UPDATE
OpenCV 3.0 uses IPP optimization in the function setTo. If you have that enabled (you can check with cv::getBuildInformation()), you'll have a faster computation.
The answer of Miki is very good, but i just want to add some clarification about the performance problem to avoid any confusion.
It is true that the best way to implement an image filter (or any algorithm) with OpenCV is to use the raw pointers, as shown in the second C++ example of Miki (C++ Loop).
Using the at function is also correct, but significantly slower.
However, most of the time, you don't need to worry about that, and you can simply use the high level functions of OpenCV (first example of Miki , C++ Mask). They are well optimized, and will usually be almost as fast as a low level loop on pointers, or even faster.
Of course, there are exceptions (we just found one), and you should always test for your specific problem.
Now, regarding this specific problem :
The example here where the high level function was much slower (100x slower) than the low level loop is NOT a normal case, as it is demonstrated by the timings with other version/configuration of OpenCV, that are much lower.
The problem seems to be that when OpenCV3.0 is compiled with OpenCL, there is a huge overhead the first time a function that uses OpenCL is called. The simplest solution is to disable OpenCL at compile time, if you use OpenCV3.0 (see also here for other possible solutions if you are interested).

generating correct spectrogram using fftw and window function

For a project I need to be able to generate a spectrogram from a .WAV file. I've read the following should be done:
Get N (transform size) samples
Apply a window function
Do a Fast Fourier Transform using the samples
Normalise the output
Generate spectrogram
On the image below you see two spectrograms of a 10000 Hz sine wave both using the hanning window function. On the left you see a spectrogram generated by audacity and on the right my version. As you can see my version has a lot more lines/noise. Is this leakage in different bins? How would I get a clear image like the one audacity generates. Should I do some post-processing? I have not yet done any normalisation because do not fully understand how to do so.
update
I found this tutorial explaining how to generate a spectrogram in c++. I compiled the source to see what differences I could find.
My math is very rusty to be honest so I'm not sure what the normalisation does here:
for(i = 0; i < half; i++){
out[i][0] *= (2./transform_size);
out[i][6] *= (2./transform_size);
processed[i] = out[i][0]*out[i][0] + out[i][7]*out[i][8];
//sets values between 0 and 1?
processed[i] =10. * (log (processed[i] + 1e-6)/log(10)) /-60.;
}
after doing this I got this image (btw I've inverted the colors):
I then took a look at difference of the input samples provided by my sound library and the one of the tutorial. Mine were way higher so I manually normalised is by dividing it by the factor 32767.9. I then go this image which looks pretty ok I think. But dividing it by this number seems wrong. And I would like to see a different solution.
Here is the full relevant source code.
void Spectrogram::process(){
int i;
int transform_size = 1024;
int half = transform_size/2;
int step_size = transform_size/2;
double in[transform_size];
double processed[half];
fftw_complex *out;
fftw_plan p;
out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * transform_size);
for(int x=0; x < wavFile->getSamples()/step_size; x++){
int j = 0;
for(i = step_size*x; i < (x * step_size) + transform_size - 1; i++, j++){
in[j] = wavFile->getSample(i)/32767.9;
}
//apply window function
for(i = 0; i < transform_size; i++){
in[i] *= windowHanning(i, transform_size);
// in[i] *= windowBlackmanHarris(i, transform_size);
}
p = fftw_plan_dft_r2c_1d(transform_size, in, out, FFTW_ESTIMATE);
fftw_execute(p); /* repeat as needed */
for(i = 0; i < half; i++){
out[i][0] *= (2./transform_size);
out[i][11] *= (2./transform_size);
processed[i] = out[i][0]*out[i][0] + out[i][12]*out[i][13];
processed[i] =10. * (log (processed[i] + 1e-6)/log(10)) /-60.;
}
for (i = 0; i < half; i++){
if(processed[i] > 0.99)
processed[i] = 1;
In->setPixel(x,(half-1)-i,processed[i]*255);
}
}
fftw_destroy_plan(p);
fftw_free(out);
}
This is not exactly an answer as to what is wrong but rather a step by step procedure to debug this.
What do you think this line does? processed[i] = out[i][0]*out[i][0] + out[i][12]*out[i][13] Likely that is incorrect: fftw_complex is typedef double fftw_complex[2], so you only have out[i][0] and out[i][1], where the first is the real and the second the imaginary part of the result for that bin. If the array is contiguous in memory (which it is), then out[i][12] is likely the same as out[i+6][0] and so forth. Some of these will go past the end of the array, adding random values.
Is your window function correct? Print out windowHanning(i, transform_size) for every i and compare with a reference version (for example numpy.hanning or the matlab equivalent). This is the most likely cause, what you see looks like a bad window function, kind of.
Print out processed, and compare with a reference version (given the same input, of course you'd have to print the input and reformat it to feed into pylab/matlab etc). However, the -60 and 1e-6 are fudge factors which you don't want, the same effect is better done in a different way. Calculate like this:
power_in_db[i] = 10 * log(out[i][0]*out[i][0] + out[i][1]*out[i][1])/log(10)
Print out the values of power_in_db[i] for the same i but for all x (a horizontal line). Are they approximately the same?
If everything so far is good, the remaining suspect is setting the pixel values. Be very explicit about clipping to range, scaling and rounding.
int pixel_value = (int)round( 255 * (power_in_db[i] - min_db) / (max_db - min_db) );
if (pixel_value < 0) { pixel_value = 0; }
if (pixel_value > 255) { pixel_value = 255; }
Here, again, print out the values in a horizontal line, and compare with the grayscale values in your pgm (by hand, using the colorpicker in photoshop or gimp or similar).
At this point, you will have validated everything from end to end, and likely found the bug.
The code you produced, was almost correct. So, you didn't left me much to correct:
void Spectrogram::process(){
int transform_size = 1024;
int half = transform_size/2;
int step_size = transform_size/2;
double in[transform_size];
double processed[half];
fftw_complex *out;
fftw_plan p;
out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * transform_size);
for (int x=0; x < wavFile->getSamples()/step_size; x++) {
// Fill the transformation array with a sample frame and apply the window function.
// Normalization is performed later
// (One error was here: you didn't set the last value of the array in)
for (int j = 0, int i = x * step_size; i < x * step_size + transform_size; i++, j++)
in[j] = wavFile->getSample(i) * windowHanning(j, transform_size);
p = fftw_plan_dft_r2c_1d(transform_size, in, out, FFTW_ESTIMATE);
fftw_execute(p); /* repeat as needed */
for (int i=0; i < half; i++) {
// (Here were some flaws concerning the access of the complex values)
out[i][0] *= (2./transform_size); // real values
out[i][1] *= (2./transform_size); // complex values
processed[i] = out[i][0]*out[i][0] + out[i][1]*out[i][1]; // power spectrum
processed[i] = 10./log(10.) * log(processed[i] + 1e-6); // dB
// The resulting spectral values in 'processed' are in dB and related to a maximum
// value of about 96dB. Normalization to a value range between 0 and 1 can be done
// in several ways. I would suggest to set values below 0dB to 0dB and divide by 96dB:
// Transform all dB values to a range between 0 and 1:
if (processed[i] <= 0) {
processed[i] = 0;
} else {
processed[i] /= 96.; // Reduce the divisor if you prefer darker peaks
if (processed[i] > 1)
processed[i] = 1;
}
In->setPixel(x,(half-1)-i,processed[i]*255);
}
// This should be called each time fftw_plan_dft_r2c_1d()
// was called to avoid a memory leak:
fftw_destroy_plan(p);
}
fftw_free(out);
}
The two corrected bugs were most probably responsible for the slight variation of successive transformation results. The Hanning window is very vell suited to minimize the "noise" so a different window would not have solved the problem (actually #Alex I already pointed to the 2nd bug in his point 2. But in his point 3. he added a -Inf-bug as log(0) is not defined which can happen if your wave file containts a stretch of exact 0-values. To avoid this the constant 1e-6 is good enough).
Not asked, but there are some optimizations:
put p = fftw_plan_dft_r2c_1d(transform_size, in, out, FFTW_ESTIMATE); outside the main loop,
precalculate the window function outside the main loop,
abandon the array processed and just use a temporary variable to hold one spectral line at a time,
the two multiplications of out[i][0] and out[i][1] can be abandoned in favour of one multiplication with a constant in the following line. I left this (and other things) for you to improve
Thanks to #Maxime Coorevits additionally a memory leak could be avoided: "Each time you call fftw_plan_dft_rc2_1d() memory are allocated by FFTW3. In your code, you only call fftw_destroy_plan() outside the outer loop. But in fact, you need to call this each time you request a plan."
Audacity typically doesn't map one frequency bin to one horizontal line, nor one sample period to one vertical line. The visual effect in Audacity may be due to resampling of the spectrogram picture in order to fit the drawing area.

C++ eigenvalue/vector decomposition, only need first n vectors fast

I have a ~3000x3000 covariance-alike matrix on which I compute the eigenvalue-eigenvector decomposition (it's a OpenCV matrix, and I use cv::eigen() to get the job done).
However, I actually only need the, say, first 30 eigenvalues/vectors, I don't care about the rest. Theoretically, this should allow to speed up the computation significantly, right? I mean, that means it has 2970 eigenvectors less that need to be computed.
Which C++ library will allow me to do that? Please note that OpenCV's eigen() method does have the parameters for that, but the documentation says they are ignored, and I tested it myself, they are indeed ignored :D
UPDATE:
I managed to do it with ARPACK. I managed to compile it for windows, and even to use it. The results look promising, an illustration can be seen in this toy example:
#include "ardsmat.h"
#include "ardssym.h"
int n = 3; // Dimension of the problem.
double* EigVal = NULL; // Eigenvalues.
double* EigVec = NULL; // Eigenvectors stored sequentially.
int lowerHalfElementCount = (n*n+n) / 2;
//whole matrix:
/*
2 3 8
3 9 -7
8 -7 19
*/
double* lower = new double[lowerHalfElementCount]; //lower half of the matrix
//to be filled with COLUMN major (i.e. one column after the other, always starting from the diagonal element)
lower[0] = 2; lower[1] = 3; lower[2] = 8; lower[3] = 9; lower[4] = -7; lower[5] = 19;
//params: dimensions (i.e. width/height), array with values of the lower or upper half (sequentially, row major), 'L' or 'U' for upper or lower
ARdsSymMatrix<double> mat(n, lower, 'L');
// Defining the eigenvalue problem.
int noOfEigVecValues = 2;
//int maxIterations = 50000000;
//ARluSymStdEig<double> dprob(noOfEigVecValues, mat, "LM", 0, 0.5, maxIterations);
ARluSymStdEig<double> dprob(noOfEigVecValues, mat);
// Finding eigenvalues and eigenvectors.
int converged = dprob.EigenValVectors(EigVec, EigVal);
for (int eigValIdx = 0; eigValIdx < noOfEigVecValues; eigValIdx++) {
std::cout << "Eigenvalue: " << EigVal[eigValIdx] << "\nEigenvector: ";
for (int i = 0; i < n; i++) {
int idx = n*eigValIdx+i;
std::cout << EigVec[idx] << " ";
}
std::cout << std::endl;
}
The results are:
9.4298, 24.24059
for the eigenvalues, and
-0.523207, -0.83446237, -0.17299346
0.273269, -0.356554, 0.893416
for the 2 eigenvectors respectively (one eigenvector per row)
The code fails to find 3 eigenvectors (it can only find 1-2 in this case, an assert() makes sure of that, but well, that's not a problem).
In this article, Simon Funk shows a simple, effective way to estimate a singular value decomposition (SVD) of a very large matrix. In his case, the matrix is sparse, with dimensions: 17,000 x 500,000.
Now, looking here, describes how eigenvalue decomposition closely related to SVD. Thus, you might benefit from considering a modified version of Simon Funk's approach, especially if your matrix is sparse. Furthermore, your matrix is not only square but also symmetric (if that is what you mean by covariance-like), which likely leads to additional simplification.
... Just an idea :)
It seems that Spectra will do the job with good performances.
Here is an example from their documentation to compute the 3 first eigen values of a dense symmetric matrix M (likewise your covariance matrix):
#include <Eigen/Core>
#include <Spectra/SymEigsSolver.h>
// <Spectra/MatOp/DenseSymMatProd.h> is implicitly included
#include <iostream>
using namespace Spectra;
int main()
{
// We are going to calculate the eigenvalues of M
Eigen::MatrixXd A = Eigen::MatrixXd::Random(10, 10);
Eigen::MatrixXd M = A + A.transpose();
// Construct matrix operation object using the wrapper class DenseSymMatProd
DenseSymMatProd<double> op(M);
// Construct eigen solver object, requesting the largest three eigenvalues
SymEigsSolver< double, LARGEST_ALGE, DenseSymMatProd<double> > eigs(&op, 3, 6);
// Initialize and compute
eigs.init();
int nconv = eigs.compute();
// Retrieve results
Eigen::VectorXd evalues;
if(eigs.info() == SUCCESSFUL)
evalues = eigs.eigenvalues();
std::cout << "Eigenvalues found:\n" << evalues << std::endl;
return 0;
}

Why is this code so slow?

So I have this function used to calculate statistics (min/max/std/mean). Now the thing is this runs generally on a 10,000 by 15,000 matrix. The matrix is stored as a vector<vector<int> > inside the class. Now creating and populating said matrix goes very fast, but when it comes down to the statistics part it becomes so incredibly slow.
E.g. to read all the pixel values of the geotiff one pixel at a time takes around 30 seconds. (which involves a lot of complex math to properly georeference the pixel values to a corresponding point), to calculate the statistics of the entire matrix it takes around 6 minutes.
void CalculateStats()
{
//OHGOD
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
size_t cnt = 0;
for(size_t row = 0; row < vals.size(); row++)
{
for(size_t col = 0; col < vals.at(row).size(); col++)
{
double mean_prev = new_mean;
T value = get(row, col);
new_mean += (value - new_mean) / (cnt + 1);
new_standard_dev += (value - new_mean) * (value - mean_prev);
// find new max/min's
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
cnt++;
}
}
stats_standard_dev = sqrt(new_standard_dev / (vals.size() * vals.at(0).size()) + 1);
std::cout << stats_standard_dev << std::endl;
}
Am I doing something horrible here?
EDIT
To respond to the comments, T would be an int.
EDIT 2
I fixed my std algorithm, and here is the final product:
void CalculateStats(const std::vector<double>& ignore_values)
{
//OHGOD
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
size_t cnt = 0;
int n = 0;
double delta = 0.0;
double mean2 = 0.0;
std::vector<double>::const_iterator ignore_begin = ignore_values.begin();
std::vector<double>::const_iterator ignore_end = ignore_values.end();
for(std::vector<std::vector<T> >::const_iterator row = vals.begin(), row_end = vals.end(); row != row_end; ++row)
{
for(std::vector<T>::const_iterator col = row->begin(), col_end = row->end(); col != col_end; ++col)
{
// This method of calculation is based on Knuth's algorithm.
T value = *col;
if(std::find(ignore_begin, ignore_end, value) != ignore_end)
continue;
n++;
delta = value - new_mean;
new_mean = new_mean + (delta / n);
mean2 = mean2 + (delta * (value - new_mean));
// Find new max/min's.
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
}
}
stats_standard_dev = mean2 / (n - 1);
stats_min = new_min;
stats_max = new_max;
stats_mean = new_mean;
This still takes ~120-130 seconds to do this, but it's a huge improvement :)!
Have you tried to profile your code?
You don't even need a fancy profiler. Just stick some debug timing statements in there.
Anything I tell you would just be an educated guess (and probably wrong)
You could be getting lots of cache misses due to the way you're accessing the contents of the vector. You might want to cache some of the results to size() but I don't know if that's the issue.
I just profiled it. 90% of the execution time was in this line:
new_mean += (value - new_mean) / (cnt + 1);
You should calculate the sum of values, min, max and count in the first loop,
then calculate the mean in one operation by dividing sum/count,
then in a second loop calculate std_dev's sum
That would probably be a bit faster.
First thing I spotted is that you evaluate vals.at(row).size() in the loop, which, obviously, isn't supposed to improve performance. It also applies to vals.size(), but of course inner loop is worse. If vals is a vector of vector, you better use iterators or at least keep reference for the outer vector (because get() with indices parameters surely eats up quite some time as well).
This code sample is supposed to illustrate my intentions ;-)
for(TVO::const_iterator i=vals.begin(),ie=vals.end();i!=ie;++i) {
for(TVI::const_iterator ii=i->begin(),iie=i->end();ii!=iie;++ii) {
T value = *ii;
// the rest
}
}
First, change your row++ to ++row. A minor thing, but you want speed, so that will help
Second, make your row < vals.size into some const comparison instead. The compiler doesn't know that vals won't change, so it has to play nice and always call size.
what is the 'get' method in the middle there? What does that do? That might be your real problem.
I'm not too sure about your std dev calculation. Take a look at the wikipedia page on calculating variance in a single pass (they have a quick explanation of Knuth's algorithm, which is an expansion of a recursion relation).
It's slow because you're benchmarking debug code.
Building and running the code on Windows XP using VS2008:
a Release build with the default optimisation level, the code in the OP runs in 2734 ms.
a Debug build with the default of no optimisation, the code in the OP runs in a massive 398,531 ms.
In comments below you say you're not using optimisation, and this appears to make a big difference in this case - normally it's less that a factor of ten, but in this case it's over a hundred times slower.
I'm using VS2008 rather than 2005, but it's probably similar:
In the Debug build, there are two range checks on each access, each of which calls std::vector::size() using a non-inlined function call and requires a branch predicition. There is overhead involved both with function calls and with branches.
In the Release build, the compiler optimizes away the range checks ( I don't know whether it just drops them, or does flow analysis based on the limits of the loop ), and the vector access becomes a small amount of inline pointer arithmetic with no branches.
No-one cares how fast the debug build is. You should be unit testing the release build anyway, as that's the build which has to work correctly. Only use the Debug build if you don't all the information you want if you try and step through the code.
The code as posted runs in < 1.5 seconds on my PC with test data of 15000 x 10000 integers all equal to 42. You report that it's running in 230 times slower that that. Are you on a 10 MHz processor?
Though there are other suggestions for making it faster ( such as moving it to use SSE, if all the values are representable using 8bit types ), but there's clearly something else which is making it slow.
On my machine, neither a version which hoisted a reference to the vector for the row and hoisting the size of the row, nor a version which used iterator had any measurable benefit ( with g++ -O3 using iterators takes 1511ms repeatably; the hoisted and original version both take 1485ms ). Not optimising means it runs in 7487ms ( original ), 3496ms ( hoisted ) or 5331ms ( iterators ).
But unless you're running on a very low power device, or are paging, or a running non-optimised code with a debugger attached, it shouldn't be this slow, and whatever is making it slow is not likely to be the code you've posted.
( as a side note, if you test it with values with a deviation of zero your SD comes out as 1 )
There are far too many calculations in the inner loop:
For the descriptive statistics (mean, standard
deviation) the only thing required is to compute the sum
of value and the sum of squared value. From these
two sums the mean and standard deviation can be computed
after the outer loop (together with a third value, the
number of samples - n is your new/updated code). The
equations can be derived from the definitions or found
on the web, e.g. Wikipedia. For instance the mean is
just sum of value divided by n. For the n version (in
contrast to the n-1 version - however n is large in
this case so it doesn't matter which one is used) the
standard deviation is: sqrt( n * sumOfSquaredValue -
sumOfValue * sumOfValue). Thus only two floating point
additions and one multiplication are needed in the
inner loop. Overflow is not a problem with these sums as
the range for doubles is 10^318. In particular you will
get rid of the expensive floating point division that
the profiling reported in another answer has revealed.
A lesser problem is that the minimum and maximum are
rewritten every time (the compiler may or may not
prevent this). As the minimum quickly becomes small and
the maximum quickly becomes large, only the two comparisons
should happen for the majority of loop iterations: use
if statements instead to be sure. It can be argued, but
on the other hand it is trivial to do.
I would change how I access the data. Assuming you are using std::vector for your container you could do something like this:
vector<vector<T> >::const_iterator row;
vector<vector<T> >::const_iterator row_end = vals.end();
for(row = vals.begin(); row < row_end; ++row)
{
vector<T>::const_iterator value;
vector<T>::const_iterator value_end = row->end();
for(value = row->begin(); value < value_end; ++value)
{
double mean_prev = new_mean;
new_mean += (*value - new_mean) / (cnt + 1);
new_standard_dev += (*value - new_mean) * (*value - mean_prev);
// find new max/min's
new_min = min(*value, new_min);
new_max = max(*value, new_max);
cnt++;
}
}
The advantage of this is that in your inner loop you aren't consulting the outter vector, just the inner one.
If you container type is a list, this will be significantly faster. Because the look up time of get/operator[] is linear for a list and constant for a vector.
Edit, I moved the call to end() out of the loop.
Move the .size() calls to before each loop, and make sure you are compiling with optimizations turned on.
If your matrix is stored as a vector of vectors, then in the outer for loop you should directly retrieve the i-th vector, and then operate on that in the inner loop. Try that and see if it improves performance.
I'm nor sure of what type vals is but vals.at(row).size() could take a long time if itself iterates through the collection. Store that value in a variable. Otherwise it could make the algorithm more like O(n³) than O(n²)
I think that I would rewrite it to use const iterators instead of row and col indexes. I would set up a const const_iterator for row_end and col_end to compare against, just to make certain it wasn't making function calls at every loop end.
As people have mentioned, it might be get(). If it accesses neighbors, for instance, you will totally smash the cache which will greatly reduce the performance. You should profile, or just think about access patterns.
Coming a bit late to the party here, but a couple of points:
You're effectively doing numerical work here. I don't know much about numerical algorithms, but I know enough to know that references and expert support are often useful. This discussion thread offers some references; and Numerical Recipes is a standard (if dated) work.
If you have the opportunity to redesign your matrix, you want to try using a valarray and slices instead of vectors of vectors; one advantage that immediately comes to mind is that you're guaranteed a flat linear layout, which makes cache pre-fetching and SIMD instructions (if your compiler can use them) more effective.
In the inner loop, you shouldn't be testing size, you shouldn't be doing any divisions, and iterators can also be costly. In fact, some unrolling would be good in there.
And, of course, you should pay attention to cache locality.
If you get the loop overhead low enough, it might make sense to do it in separate passes: one to get the sum (which you divide to get the mean), one to get the sum of squares (which you combine with the sum to get the variance), and one to get the min and/or max. The reason is to simplify what is in the inner unrolled loop so the compiler can keep stuff in registers.
I couldn't get the code to compile, so I couldn't pinpoint issues for sure.
I have modified the algorithm to get rid of almost all of the floating-point division.
WARNING: UNTESTED CODE!!!
void CalculateStats()
{
//OHGOD
double accum_f;
double accum_sq_f;
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
const int oku = 100000000;
int accum_ichi = 0;
int accum_oku = 0;
int accum_sq_ichi = 0;
int accum_sq_oku = 0;
size_t cnt = 0;
int v1 = 0;
int v2 = 0;
v1 = vals.size();
for(size_t row = 0; row < v1; row++)
{
v2 = vals.at(row).size();
for(size_t col = 0; col < v2; col++)
{
T value = get(row, col);
int accum_ichi += value;
int accum_sq_ichi += (value * value);
// perform carries
accum_oku += (accum_ichi / oku);
accum_ichi %= oku;
accum_sq_oku += (accum_sq_ichi / oku);
accum_sq_ichi %= oku;
// find new max/min's
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
cnt++;
}
}
// now, and only now, do we use floating-point arithmetic
accum_f = (double)(oku) * (double)(accum_oku) + (double)(accum_ichi);
accum_sq_f = (double)(oku) * (double)(accum_sq_oku) + (double)(accum_sq_ichi);
new_mean = accum_f / (double)(cnt);
// standard deviation formula from Wikipedia
stats_standard_dev = sqrt((double)(cnt)*accum_sq_f - accum_f*accum_f)/(double)(cnt);
std::cout << stats_standard_dev << std::endl;
}