sqrt Matlab and C++ numerical differences - c++

I am seeing numerical differences between Matlab and C++ code. The discrepancy seems to stem from different output of the sqrt function in Matlab and C++. For very small numbers (< 10^-5) the relative difference seems quite big.
Which approach would you suggest to:
1. make sure the differences come from sqrt
2. tune the C++ code so that it replicates the Matlab code to float precision
EDIT
Here are more details about the code.
float* buttonVar = new float[nBut];
for (int_T ibut = 0; ibut < nBut; ibut++)
{
for (int_T id = start_idx; id <= stop_idx; id++)
{
inputArray[id - start_idx] = arr[ibut * nDepth + id];
}
reduceVector(inputArray, reducedArray, inputarray_size, d1, d2);
buttonMean[ibut] = 0;
buttonVar[ibut] = 0;
for (int_T id = 0; id < min(nd, nDepth); id++)
{
buttonMean[ibut] += reducedArray[id] / float(nd);
}
for (int_T id = 0; id < min(nd, nDepth); id++)
{
buttonVar[ibut] += (reducedArray[id] - buttonMean[ibut])
*(reducedArray[id] - buttonMean[ibut]);
}
buttonVar[ibut] = sqrtf(buttonVar[ibut] / float(nd));
}
In Matlab, I convert the argument of sqrt to single. The discrepancy in the code appears in buttonVar.
The final results that are compared in Google Tests come from several more operations with no other call to mathematical functions. These additional operations are in methods which were thoroughly Google Tested, and their outputs match to float precision in these tests.
The numerical difference in buttonVar is up to 15% in relative terms (= 100*abs(cpp_res - matlab_res)/matlab_res). Significant relative differences occur when buttonVar is of order of magnitude 10e-6.

Converting to double in C++ inside the computation solves the difference problem. We reach a very satisfactory match after converting to double.
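A minimal sketch of the two variants, assuming the population standard deviation computed in the loop above. The names stddev_float and stddev_double are hypothetical and not part of the original code. The first keeps every intermediate in float, the second accumulates in double and rounds to float once at the end, which is what "converting to double inside the computation" amounts to. Whether Matlab did the same in the original script is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Population standard deviation with every intermediate kept in float,
// mirroring the structure of the loop in the question.
float stddev_float(const std::vector<float>& x)
{
    assert(!x.empty());
    const float n = static_cast<float>(x.size());
    float mean = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i)
        mean += x[i] / n;
    float var = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i)
        var += (x[i] - mean) * (x[i] - mean);
    return std::sqrt(var / n);
}

// Same computation, but accumulating in double and rounding to float once.
float stddev_double(const std::vector<float>& x)
{
    assert(!x.empty());
    const double n = static_cast<double>(x.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        sum += x[i];
    const double mean = sum / n;
    double var = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double d = x[i] - mean;
        var += d * d;
    }
    return static_cast<float>(std::sqrt(var / n));
}
```

Running both on the same input and comparing against the Matlab value is a cheap way to check whether the accumulation, rather than sqrt itself, is responsible for the gap.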

Related

Strange behaviours in porting code from Eigen2 to Eigen3

I'm considering to use this library to perform spectral clustering in my research project.
But, to do so, I need to port it from Eigen2 to Eigen3 (which is what I use in my code).
There's a portion of code that is causing me some troubles.
This is for Eigen2:
double Evrot::evqual(Eigen::MatrixXd& X) {
// take the square of all entries and find max of each row
Eigen::MatrixXd X2 = X.cwise().pow(2);
Eigen::VectorXd max_values = X2.rowwise().maxCoeff();
// compute cost
for (int i=0; i<mNumData; i++ ) {
X2.row(i) = X2.row(i) / max_values[i];
}
double J = 1.0 - (X2.sum()/mNumData -1.0)/mNumDims;
if( DEBUG )
std::cout << "Computed quality = "<< J << std::endl;
return J;
}
as explained here, Eigen3 replaces .cwise() with the slightly different .array() functionality.
So, I wrote:
double Evrot::evqual(Eigen::MatrixXd& X) {
// take the square of all entries and find max of each row
Eigen::MatrixXd X2 = X.array().pow(2);
Eigen::VectorXd max_values = X2.rowwise().maxCoeff();
// compute cost
for (int i=0; i<mNumData; i++ ) {
X2.row(i) = X2.row(i) / max_values[i];
}
double J = 1.0 - (X2.sum()/mNumData -1.0)/mNumDims;
if( DEBUG )
std::cout << "Computed quality = "<< J << std::endl;
return J;
}
and I got no compiler errors.
But, if I give to the two programs the same input (and check that they're actually getting consistent inputs), in the first case I get numbers and in the second only NANs.
My idea is that this is caused by the fact that max_values is badly computed and then using this vector in a division causes all the NANs. But I have no clue on how to fix that.
Can someone please explain to me how to solve this problem?
Thanks!
Have you checked when the values start to diverge? Are you sure there are no empty rows and that X^2 does not underflow? In any case, you should add a guard before dividing by max_values[i]. Moreover, to avoid underflow when squaring, you could rewrite it like this:
VectorXd max_values = X.array().abs().rowwise().maxCoeff();
double sum = 0;
for (int i=0; i<mNumData; i++ ) {
if(max_values[i]>0)
sum += (X.row(i)/max_values[i]).squaredNorm();
}
double J = 1.0 - (sum/mNumData -1.0)/mNumDims;
This will work even if X.abs().maxCoeff()==1e-170, whereas your code will underflow and produce NaN. Of course, if you are in such a case, maybe you should check your inputs first, as you are already on the dangerous side regarding numerical issues.
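The same idea can be sketched without Eigen, assuming the matrix is held as a vector of rows. The Matrix typedef and the function name evqual_scaled are hypothetical stand-ins used only for illustration. Each row is divided by its largest absolute entry before squaring, and rows whose maximum is zero are skipped, so neither the underflow nor the 0/0 case can occur.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the Eigen matrix: one inner vector per row.
typedef std::vector<std::vector<double> > Matrix;

// Quality measure J: scale each row by its largest absolute entry first,
// then square. Rows that are entirely zero contribute nothing.
double evqual_scaled(const Matrix& X, std::size_t numDims)
{
    assert(!X.empty() && numDims > 0);
    double sum = 0.0;
    for (std::size_t r = 0; r < X.size(); ++r) {
        double maxAbs = 0.0;
        for (std::size_t c = 0; c < X[r].size(); ++c)
            maxAbs = std::max(maxAbs, std::fabs(X[r][c]));
        if (maxAbs > 0.0) {  // guard against division by zero
            for (std::size_t c = 0; c < X[r].size(); ++c) {
                const double scaled = X[r][c] / maxAbs;
                sum += scaled * scaled;
            }
        }
    }
    const double numData = static_cast<double>(X.size());
    return 1.0 - (sum / numData - 1.0) / static_cast<double>(numDims);
}
```

With entries of magnitude 1e-170, squaring first gives 0 in double precision and the division becomes 0/0, while the scaled version stays finite.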

Computing analytical signal using FFT in C++

I'm currently lost trying to figure out how to implement an equivalent version of MATLAB's hilbert() function in C++. I'm very new to signal processing, but, ultimately, I would like to figure out a way to phase shift any given signal by 90 degrees. I was attempting to follow the method suggested in this question on MATLAB central, which appears to work based on tests using GNU Octave.
I have what I believe to be a working implementation of both FFT and the inverse FFT, and I have tried implementing the method described in the answer to this post in order to compute the analytical signal. I have tried doing this by applying the FFT, setting the upper half of the array to zero, and then applying the inverse FFT, but, based on graphs I made of output from a test, there must be a problem with the way I have implemented finding the analytical signal.
What would be a suitable way to implement the hilbert() function from MATLAB in C++ given a working implementation of FFT and inverse FFT? Is there a better way of achieving the 90 degree phase shift?
Judging from the MATLAB implementation, the following should return the same result as the hilbert function. You'll obviously have to modify it to match your specific implementation. I'm assuming a signal class of some sort exists.
signal hilbert(const signal &x)
{
int limit1, limit2;
signal xfreq = fft(x);
if (x.numel % 2 == 0) {
limit1 = x.numel/2;
limit2 = limit1 + 1;
} else {
limit1 = (x.numel + 1)/2;
limit2 = limit1;
}
// multiply the first half by 2 (except the first element)
for (int i = 1; i < limit1; ++i) {
xfreq[i].real *= 2;
xfreq[i].imag *= 2;
}
for (int i = limit2; i < x.numel; ++i) {
xfreq[i].real = 0;
xfreq[i].imag = 0;
}
return ifft(xfreq);
}
Edit: Forgot to set the second half to zeros.
Edit2: Fixed a logical error. I coded the following up in MATLAB which matches hilbert.
function h = hil(x)
n = numel(x);
if (mod(n,2) == 0)
limit1 = n/2;
limit2 = limit1 + 2;
else
limit1 = (n+1)/2;
limit2 = limit1+1;
end
xfreq = fft(x);
for i = 2:limit1
xfreq(i) = xfreq(i)*2;
end
for i = limit2:n
xfreq(i) = 0;
end
h = ifft(xfreq);
end
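For readers without a signal class, here is a self-contained sketch of the same steps using std::complex. The naive O(n^2) dft helper is only a placeholder for whatever FFT and inverse FFT you already have, and the names dft and analytic_signal are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> cplx;

// Naive O(n^2) DFT standing in for a real FFT.
// sign = -1 gives the forward transform, sign = +1 the inverse (scaled by 1/n).
std::vector<cplx> dft(const std::vector<cplx>& x, int sign)
{
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx acc(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = sign * 2.0 * pi * double((k * t) % n) / double(n);
            acc += x[t] * cplx(std::cos(angle), std::sin(angle));
        }
        out[k] = (sign > 0) ? acc / double(n) : acc;
    }
    return out;
}

// Analytic signal: keep bin 0 (and the Nyquist bin for even n), double the
// positive-frequency bins, zero the negative-frequency bins, transform back.
std::vector<cplx> analytic_signal(const std::vector<double>& x)
{
    assert(!x.empty());
    const std::size_t n = x.size();
    std::vector<cplx> freq = dft(std::vector<cplx>(x.begin(), x.end()), -1);
    const std::size_t doubleEnd = (n + 1) / 2;  // bins 1 .. doubleEnd-1 are doubled
    const std::size_t firstZero = n / 2 + 1;    // bins firstZero .. n-1 are zeroed
    for (std::size_t i = 1; i < doubleEnd; ++i)
        freq[i] *= 2.0;
    for (std::size_t i = firstZero; i < n; ++i)
        freq[i] = cplx(0.0, 0.0);
    return dft(freq, +1);
}
```

The real part of the result reproduces the input and the imaginary part is the 90 degree phase-shifted signal, so for a sampled cosine the imaginary part should come out as the matching sine.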

Complex matrix exponential in C++

Is it actually possible to calculate the Matrix Exponential of a Complex Matrix in c / c++?
I've managed to take the product of two complex matrices using BLAS functions from the GNU Scientific Library. For matC = matA * matB:
gsl_blas_zgemm (CblasNoTrans, CblasNoTrans, GSL_COMPLEX_ONE, matA, matB, GSL_COMPLEX_ZERO, matC);
And I've managed to get the matrix exponential of a matrix by using the undocumented
gsl_linalg_exponential_ss(&m.matrix, &em.matrix, .01);
But this doesn't seem to accept complex arguments.
Is there any way to do this? I used to think c++ was capable of anything. Now I think it's outdated and cryptic...
Several options:
modify the gsl_linalg_exponential_ss code to accept complex matrices
write your complex NxN matrix as real 2N x 2N matrix
Diagonalize the matrix, take the exponential of the eigenvalues, and rotate the matrix back to the original basis
Using the complex matrix product that is available, implement the matrix exponential according to its definition: exp(A) = sum_(n=0)^(n=infinity) A^n/(n!)
You have to check which methods are appropriate for your problem.
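As an illustration of the last option, here is a minimal sketch of the truncated power series using std::complex and nested vectors instead of GSL types. The names CMatrix, multiply and expm_taylor are hypothetical. A plain truncated series is only reasonable when the norm of A is small; library implementations typically rescale the matrix first (scaling and squaring), which is not done here.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> cplx;
typedef std::vector<std::vector<cplx> > CMatrix;

// Plain O(n^3) product of two square complex matrices.
CMatrix multiply(const CMatrix& a, const CMatrix& b)
{
    const std::size_t n = a.size();
    CMatrix c(n, std::vector<cplx>(n, cplx(0.0, 0.0)));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// exp(A) approximated by the first `terms` terms of sum_k A^k / k!.
CMatrix expm_taylor(const CMatrix& a, int terms)
{
    assert(!a.empty() && terms > 0);
    const std::size_t n = a.size();
    CMatrix term(n, std::vector<cplx>(n, cplx(0.0, 0.0)));
    for (std::size_t i = 0; i < n; ++i)
        term[i][i] = cplx(1.0, 0.0);      // term = A^0 / 0! = identity
    CMatrix result = term;
    for (int k = 1; k < terms; ++k) {
        term = multiply(term, a);         // term = A^k / (k-1)!
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                term[i][j] /= double(k);  // term = A^k / k!
                result[i][j] += term[i][j];
            }
    }
    return result;
}
```

A quick sanity check is the 1x1 matrix containing i*pi, whose exponential should be close to -1.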
C++ is a general purpose language. As mentioned above, if you need specific functionality you have to find a library that can do it or implement it yourself. Alternatively you could use software like MatLab and Mathematica. If that's too expensive there are open source alternatives, e.g. Sage and Octave.
"I used to think c++ was capable of anything" - if a general-purpose language has built-in complex math in its core, then something is wrong with that language.
For such very specific tasks there is a well-accepted solution: libraries. Either write your own or, much better, use an already existing one.
I myself rarely need complex matrices in C++, I always used Matlab and similar tools for that. However, this http://www.mathtools.net/C_C__/Mathematics/index.html might be of interest to you if you know Matlab.
There are a couple other libraries which might be of help:
http://eigen.tuxfamily.org/index.php?title=Main_Page
http://math.nist.gov/lapack++/
I was also thinking to do the same, writing your complex NxN matrix as real 2N x 2N matrix is the best way to solve the problem, then use gsl_linalg_exponential_ss().
Suppose A=Ar+i*Ai, where A is the complex matrix and Ar and Ai are real matrices. Then write the new matrix B=[Ar Ai ;-Ai Ar] (here the matrix is written in Matlab notation). Now calculate the exponential of B, that is eB=[eB1 eB2 ;eB3 eB4]; then the exponential of A is given by eA=eB1+1i.*eB2 (summing the matrices eB1 and 1i.*eB2).
I have written a code to calculate the matrix exponential of the complex matrices with the gsl function, gsl_linalg_exponential_ss(&m.matrix, &em.matrix, .01);
Here is the complete code. I have checked the result against Matlab and the results agree.
#include <stdio.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_linalg.h>
#include <gsl/gsl_complex.h>
#include <gsl/gsl_complex_math.h>
void my_gsl_complex_matrix_exponential(gsl_matrix_complex *eA, gsl_matrix_complex *A, int dimx)
{
int j,k=0;
gsl_complex temp;
gsl_matrix *matreal =gsl_matrix_alloc(2*dimx,2*dimx);
gsl_matrix *expmatreal =gsl_matrix_alloc(2*dimx,2*dimx);
//Converting the complex matrix into real one using A=[Areal, Aimag;-Aimag,Areal]
for (j = 0; j < dimx;j++)
for (k = 0; k < dimx;k++)
{
temp=gsl_matrix_complex_get(A,j,k);
gsl_matrix_set(matreal,j,k,GSL_REAL(temp));
gsl_matrix_set(matreal,dimx+j,dimx+k,GSL_REAL(temp));
gsl_matrix_set(matreal,j,dimx+k,GSL_IMAG(temp));
gsl_matrix_set(matreal,dimx+j,k,-GSL_IMAG(temp));
}
gsl_linalg_exponential_ss(matreal,expmatreal,.01);
double realp;
double imagp;
for (j = 0; j < dimx;j++)
for (k = 0; k < dimx;k++)
{
realp=gsl_matrix_get(expmatreal,j,k);
imagp=gsl_matrix_get(expmatreal,j,dimx+k);
gsl_matrix_complex_set(eA,j,k,gsl_complex_rect(realp,imagp));
}
gsl_matrix_free(matreal);
gsl_matrix_free(expmatreal);
}
int main()
{
int dimx=4;
int i, j ;
gsl_matrix_complex *A = gsl_matrix_complex_alloc (dimx, dimx);
gsl_matrix_complex *eA = gsl_matrix_complex_alloc (dimx, dimx);
for (i = 0; i < dimx;i++)
{
for (j = 0; j < dimx;j++)
{
gsl_matrix_complex_set(A,i,j,gsl_complex_rect(i+j,i-j));
if ((i-j)>=0)
printf("%d+%di ",i+j,i-j);
else
printf("%d%di ",i+j,i-j);
}
printf(";\n");
}
my_gsl_complex_matrix_exponential(eA,A,dimx);
printf("\n Printing the complex matrix exponential\n");
gsl_complex compnum;
for (i = 0; i < dimx;i++)
{
for (j = 0; j < dimx;j++)
{
compnum=gsl_matrix_complex_get(eA,i,j);
if (GSL_IMAG(compnum)>=0)
printf("%f+%fi\t ",GSL_REAL(compnum),GSL_IMAG(compnum));
else
printf("%f%fi\t ",GSL_REAL(compnum),GSL_IMAG(compnum));
}
printf("\n");
}
return(0);
}

Controlling the index variables in C++ AMP

I have just started trying C++ AMP and I decided to give it a shot with the current project I am working on. At some point, I have to build a distance matrix for the vectors I have and I have written the code below for this
unsigned int samplesize=samplelist.size();
unsigned int vs = samplelist.front().size();
vector<double> samplevec(samplesize*vs);
vector<double> distancevec(samplesize*samplesize,0);
it1=samplelist.begin();
for(int i=0 ; i<samplesize; ++i){
for(int j = 0 ; j<vs ; ++j){
samplevec[j + i*vs] = (*it1)[j];
}
++it1;
}
array_view<const double,2> samplearray(samplesize,vs,samplevec);
array_view<writeonly<double>,2> distances(samplesize,samplesize,distancevec);
parallel_for_each(distances.grid, [=](index<2> idx) restrict(direct3d){
double sqrsum=0;
double tempd=0;
for ( unsigned int i=0 ; i<vs ; ++i)
{
tempd = samplearray(idx.x,i) - samplearray(idx.y,i);
sqrsum += tempd*tempd;
}
distances[idx]=sqrsum;
});
However, as you can see, this does not take into account the symmetry property of distance matrices. When I calculate sqrsum of matrices i and j, I don't want to do the same calculation again when the order of the i and j are reversed. Is there any way to accomplish this? I came up with the following trick, but I don't know if this would bump up the performance significantly
for ( unsigned int i=0 ; i<vs ; ++i)
{
if(idx.x<=idx.y){
break;
}
tempd = samplearray(idx.x,i) - samplearray(idx.y,i);
sqrsum += tempd*tempd;
}
Can the if-condition do the job? Or do you think the if statement would hurt the performance unnecessarily? I couldn't come up with any alternative to it.
BTW, I just noticed that the code above does not work on my machine, whose GPU only supports single precision. Is there anything I can do to get around that problem? The error message is as follows:
"runtime_exception: Concurrency::parallel_for_each uses features unsupported by the selected accelerator.
ID3D11Device::CreateComputeShader: Shader uses double precision float ops which are not supported on the current device."
I think you can eliminate the if-condition if you schedule only as many threads as you need, instead of scheduling the entire rectangle that covers your output matrix. What you need is the upper or lower triangle without the diagonal, whose size you can calculate using an arithmetic sequence.
The alternative would be to organize the input data such that it is in two 1D vectors; each thread would read a value from vector 1, then from vector 2, calculate the distance, and store it in one of the input vectors.
Finally, the error on double precision shows up because the card you are using does not support double precision operations. Please check your card specification to confirm that. You can work around it by switching to a single precision type, i.e. "float", in the array_view template.
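The index arithmetic for the triangle can be sketched on the CPU as follows. This is not C++ AMP code; the names Pair, pair_count and pair_from_linear are hypothetical. A matrix of n samples has n*(n-1)/2 pairs below the diagonal, and each linear thread index k in that range maps to exactly one (row, col) pair with col < row.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

struct Pair { std::size_t row; std::size_t col; };

// Number of entries strictly below the diagonal of an n x n matrix.
std::size_t pair_count(std::size_t n)
{
    return n * (n - 1) / 2;
}

// Map a linear index k (0-based) to the k-th pair of the strict lower
// triangle, enumerated row by row: (1,0), (2,0), (2,1), (3,0), ...
Pair pair_from_linear(std::size_t k)
{
    // Closed-form estimate of the row from the triangular-number formula.
    std::size_t i = static_cast<std::size_t>(
        (1.0 + std::sqrt(1.0 + 8.0 * static_cast<double>(k))) / 2.0);
    // Correct possible rounding errors of the floating-point estimate.
    while (i * (i - 1) / 2 > k) --i;
    while ((i + 1) * i / 2 <= k) ++i;
    Pair p = { i, k - i * (i - 1) / 2 };
    return p;
}
```

Each thread would then compute the distance for its pair once and write it to both (row, col) and (col, row), or only to one of them if the consumer knows the matrix is symmetric.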

Why is this code so slow?

So I have this function used to calculate statistics (min/max/std/mean). Now the thing is this runs generally on a 10,000 by 15,000 matrix. The matrix is stored as a vector<vector<int> > inside the class. Now creating and populating said matrix goes very fast, but when it comes down to the statistics part it becomes so incredibly slow.
E.g. to read all the pixel values of the geotiff one pixel at a time takes around 30 seconds. (which involves a lot of complex math to properly georeference the pixel values to a corresponding point), to calculate the statistics of the entire matrix it takes around 6 minutes.
void CalculateStats()
{
//OHGOD
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
size_t cnt = 0;
for(size_t row = 0; row < vals.size(); row++)
{
for(size_t col = 0; col < vals.at(row).size(); col++)
{
double mean_prev = new_mean;
T value = get(row, col);
new_mean += (value - new_mean) / (cnt + 1);
new_standard_dev += (value - new_mean) * (value - mean_prev);
// find new max/min's
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
cnt++;
}
}
stats_standard_dev = sqrt(new_standard_dev / (vals.size() * vals.at(0).size()) + 1);
std::cout << stats_standard_dev << std::endl;
}
Am I doing something horrible here?
EDIT
To respond to the comments, T would be an int.
EDIT 2
I fixed my std algorithm, and here is the final product:
void CalculateStats(const std::vector<double>& ignore_values)
{
//OHGOD
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
size_t cnt = 0;
int n = 0;
double delta = 0.0;
double mean2 = 0.0;
std::vector<double>::const_iterator ignore_begin = ignore_values.begin();
std::vector<double>::const_iterator ignore_end = ignore_values.end();
for(std::vector<std::vector<T> >::const_iterator row = vals.begin(), row_end = vals.end(); row != row_end; ++row)
{
for(std::vector<T>::const_iterator col = row->begin(), col_end = row->end(); col != col_end; ++col)
{
// This method of calculation is based on Knuth's algorithm.
T value = *col;
if(std::find(ignore_begin, ignore_end, value) != ignore_end)
continue;
n++;
delta = value - new_mean;
new_mean = new_mean + (delta / n);
mean2 = mean2 + (delta * (value - new_mean));
// Find new max/min's.
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
}
}
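// mean2 / (n - 1) is the sample variance; take the square root for the standard deviation.
stats_standard_dev = sqrt(mean2 / (n - 1));
stats_min = new_min;
stats_max = new_max;
stats_mean = new_mean;
}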
This still takes ~120-130 seconds to do this, but it's a huge improvement :)!
Have you tried to profile your code?
You don't even need a fancy profiler. Just stick some debug timing statements in there.
Anything I tell you would just be an educated guess (and probably wrong)
You could be getting lots of cache misses due to the way you're accessing the contents of the vector. You might want to cache some of the results of size(), but I don't know if that's the issue.
I just profiled it. 90% of the execution time was in this line:
new_mean += (value - new_mean) / (cnt + 1);
You should calculate the sum of values, min, max and count in the first loop,
then calculate the mean in one operation by dividing sum/count,
then in a second loop calculate std_dev's sum
That would probably be a bit faster.
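A minimal sketch of that two-pass layout, assuming the matrix is a vector<vector<int> > as in the question. The Stats struct and the function name two_pass_stats are hypothetical, and the population (divide by n) form of the standard deviation is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Stats { double mean; double stddev; int min; int max; };

Stats two_pass_stats(const std::vector<std::vector<int> >& vals)
{
    // First pass: sum, min, max and count. No division in the inner loop.
    long long sum = 0;
    std::size_t count = 0;
    int lo = std::numeric_limits<int>::max();
    int hi = std::numeric_limits<int>::min();
    for (std::size_t r = 0; r < vals.size(); ++r) {
        const std::vector<int>& row = vals[r];
        for (std::size_t c = 0; c < row.size(); ++c) {
            const int v = row[c];
            sum += v;
            if (v < lo) lo = v;
            if (v > hi) hi = v;
            ++count;
        }
    }
    assert(count > 0);
    const double mean = static_cast<double>(sum) / static_cast<double>(count);

    // Second pass: sum of squared deviations from the mean.
    double ss = 0.0;
    for (std::size_t r = 0; r < vals.size(); ++r) {
        const std::vector<int>& row = vals[r];
        for (std::size_t c = 0; c < row.size(); ++c) {
            const double d = row[c] - mean;
            ss += d * d;
        }
    }
    Stats s = { mean, std::sqrt(ss / static_cast<double>(count)), lo, hi };
    return s;
}
```

Whether this is faster than the single-pass version depends on the build settings and on memory bandwidth, so it is worth timing both.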
The first thing I spotted is that you evaluate vals.at(row).size() in the loop, which, obviously, isn't going to improve performance. The same applies to vals.size(), but of course the inner loop is worse. If vals is a vector of vectors, you had better use iterators or at least keep a reference to the current row vector (because get() with index parameters surely eats up quite some time as well).
This code sample is supposed to illustrate my intentions ;-)
for(TVO::const_iterator i=vals.begin(),ie=vals.end();i!=ie;++i) {
for(TVI::const_iterator ii=i->begin(),iie=i->end();ii!=iie;++ii) {
T value = *ii;
// the rest
}
}
First, change your row++ to ++row. A minor thing, but you want speed, so that will help.
Second, make your row < vals.size() into a comparison against a const value instead. The compiler doesn't know that vals won't change, so it has to play nice and always call size().
what is the 'get' method in the middle there? What does that do? That might be your real problem.
I'm not too sure about your std dev calculation. Take a look at the wikipedia page on calculating variance in a single pass (they have a quick explanation of Knuth's algorithm, which is an expansion of a recursion relation).
It's slow because you're benchmarking debug code.
Building and running the code on Windows XP using VS2008:
a Release build with the default optimisation level, the code in the OP runs in 2734 ms.
a Debug build with the default of no optimisation, the code in the OP runs in a massive 398,531 ms.
In comments below you say you're not using optimisation, and this appears to make a big difference in this case. Normally it's less than a factor of ten, but in this case it's over a hundred times slower.
I'm using VS2008 rather than 2005, but it's probably similar:
In the Debug build, there are two range checks on each access, each of which calls std::vector::size() using a non-inlined function call and requires a branch prediction. There is overhead involved both with function calls and with branches.
In the Release build, the compiler optimizes away the range checks ( I don't know whether it just drops them, or does flow analysis based on the limits of the loop ), and the vector access becomes a small amount of inline pointer arithmetic with no branches.
No-one cares how fast the debug build is. You should be unit testing the release build anyway, as that's the build which has to work correctly. Only use the Debug build if you don't get all the information you want when you try to step through the code.
The code as posted runs in < 1.5 seconds on my PC with test data of 15000 x 10000 integers all equal to 42. You report that it's running 230 times slower than that. Are you on a 10 MHz processor?
There are other suggestions for making it faster ( such as moving it to use SSE, if all the values are representable using 8-bit types ), but there's clearly something else which is making it slow.
On my machine, neither a version which hoisted a reference to the vector for the row and hoisting the size of the row, nor a version which used iterator had any measurable benefit ( with g++ -O3 using iterators takes 1511ms repeatably; the hoisted and original version both take 1485ms ). Not optimising means it runs in 7487ms ( original ), 3496ms ( hoisted ) or 5331ms ( iterators ).
But unless you're running on a very low power device, or are paging, or are running non-optimised code with a debugger attached, it shouldn't be this slow, and whatever is making it slow is not likely to be the code you've posted.
( as a side note, if you test it with values with a deviation of zero your SD comes out as 1 )
There are far too many calculations in the inner loop:
For the descriptive statistics (mean, standard deviation) the only thing required is to compute the sum of value and the sum of squared value. From these two sums the mean and standard deviation can be computed after the outer loop (together with a third value, the number of samples, n in your new/updated code). The equations can be derived from the definitions or found on the web, e.g. Wikipedia. For instance the mean is just the sum of value divided by n. For the n version (in contrast to the n-1 version; however n is large in this case so it doesn't matter which one is used) the standard deviation is: sqrt(n * sumOfSquaredValue - sumOfValue * sumOfValue) / n. Thus only two floating point additions and one multiplication are needed in the inner loop. Overflow is not a problem with these sums as the range for doubles is about 10^308. In particular you will get rid of the expensive floating point division that the profiling reported in another answer has revealed.
A lesser problem is that the minimum and maximum are rewritten every time (the compiler may or may not prevent this). As the minimum quickly becomes small and the maximum quickly becomes large, only the two comparisons should happen for the majority of loop iterations: use if statements instead to be sure. It can be argued, but on the other hand it is trivial to do.
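A sketch of the sum and sum-of-squares approach, assuming int pixel values as stated in the question. The names Sums, accumulate_sums and stddev_from_sums are hypothetical. Integer accumulators are used here instead of doubles, which is a deviation from the description above; for small values such as 8-bit pixels a long long does not overflow and the sums are exact. The final subtraction can lose precision when the mean is large compared to the spread, so this is a sketch under those assumptions rather than a general-purpose routine.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Sums { long long n; long long sum; long long sumSq; };

// Single pass: only additions and one multiplication per element.
Sums accumulate_sums(const std::vector<std::vector<int> >& vals)
{
    Sums s = { 0, 0, 0 };
    for (std::size_t r = 0; r < vals.size(); ++r) {
        const std::vector<int>& row = vals[r];
        for (std::size_t c = 0; c < row.size(); ++c) {
            const long long v = row[c];
            s.sum += v;
            s.sumSq += v * v;
            ++s.n;
        }
    }
    return s;
}

// Population standard deviation: sqrt(sumSq/n - (sum/n)^2), which equals
// sqrt(n*sumSq - sum*sum) / n.
double stddev_from_sums(const Sums& s)
{
    assert(s.n > 0);
    const double n = static_cast<double>(s.n);
    const double mean = static_cast<double>(s.sum) / n;
    const double var = static_cast<double>(s.sumSq) / n - mean * mean;
    return std::sqrt(var > 0.0 ? var : 0.0);  // clamp tiny negative rounding
}
```

The mean falls out of the same sums as sum / n, so no division is needed inside the loops.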
I would change how I access the data. Assuming you are using std::vector for your container you could do something like this:
vector<vector<T> >::const_iterator row;
vector<vector<T> >::const_iterator row_end = vals.end();
for(row = vals.begin(); row < row_end; ++row)
{
vector<T>::const_iterator value;
vector<T>::const_iterator value_end = row->end();
for(value = row->begin(); value < value_end; ++value)
{
double mean_prev = new_mean;
new_mean += (*value - new_mean) / (cnt + 1);
new_standard_dev += (*value - new_mean) * (*value - mean_prev);
// find new max/min's
new_min = min(*value, new_min);
new_max = max(*value, new_max);
cnt++;
}
}
The advantage of this is that in your inner loop you aren't consulting the outer vector, just the inner one.
If your container type is a list, this will be significantly faster, because the look-up time of get/operator[] is linear for a list and constant for a vector.
Edit, I moved the call to end() out of the loop.
Move the .size() calls to before each loop, and make sure you are compiling with optimizations turned on.
If your matrix is stored as a vector of vectors, then in the outer for loop you should directly retrieve the i-th vector, and then operate on that in the inner loop. Try that and see if it improves performance.
I'm not sure what type vals is, but vals.at(row).size() could take a long time if it iterates through the collection itself. Store that value in a variable. Otherwise it could make the algorithm more like O(n³) than O(n²).
I think that I would rewrite it to use const iterators instead of row and col indexes. I would set up a const const_iterator for row_end and col_end to compare against, just to make certain it wasn't making function calls at every loop end.
As people have mentioned, it might be get(). If it accesses neighbors, for instance, you will totally smash the cache which will greatly reduce the performance. You should profile, or just think about access patterns.
Coming a bit late to the party here, but a couple of points:
You're effectively doing numerical work here. I don't know much about numerical algorithms, but I know enough to know that references and expert support are often useful. This discussion thread offers some references; and Numerical Recipes is a standard (if dated) work.
If you have the opportunity to redesign your matrix, you want to try using a valarray and slices instead of vectors of vectors; one advantage that immediately comes to mind is that you're guaranteed a flat linear layout, which makes cache pre-fetching and SIMD instructions (if your compiler can use them) more effective.
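A small sketch of what the valarray layout could look like, assuming a row-major matrix stored in one flat std::valarray<int>. The helper names row_sum and column_max are hypothetical. std::slice(start, size, stride) selects a row with stride 1 and a column with stride equal to the number of columns.

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Sum of row r of a row-major matrix with `cols` columns.
long long row_sum(const std::valarray<int>& m, std::size_t cols, std::size_t r)
{
    // Contiguous run of `cols` elements starting at r * cols.
    const std::valarray<int> row = m[std::slice(r * cols, cols, 1)];
    return row.sum();
}

// Maximum of column c of a row-major rows x cols matrix.
int column_max(const std::valarray<int>& m, std::size_t rows,
               std::size_t cols, std::size_t c)
{
    // Every cols-th element starting at c.
    const std::valarray<int> col = m[std::slice(c, rows, cols)];
    return col.max();
}
```

Note that row.sum() accumulates in int here, so for large rows of large values a manual loop with a wider accumulator would be safer.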
In the inner loop, you shouldn't be testing size, you shouldn't be doing any divisions, and iterators can also be costly. In fact, some unrolling would be good in there.
And, of course, you should pay attention to cache locality.
If you get the loop overhead low enough, it might make sense to do it in separate passes: one to get the sum (which you divide to get the mean), one to get the sum of squares (which you combine with the sum to get the variance), and one to get the min and/or max. The reason is to simplify what is in the inner unrolled loop so the compiler can keep stuff in registers.
I couldn't get the code to compile, so I couldn't pinpoint issues for sure.
I have modified the algorithm to get rid of almost all of the floating-point division.
WARNING: UNTESTED CODE!!!
void CalculateStats()
{
//OHGOD
double accum_f;
double accum_sq_f;
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
const int oku = 100000000;
int accum_ichi = 0;
int accum_oku = 0;
int accum_sq_ichi = 0;
int accum_sq_oku = 0;
size_t cnt = 0;
int v1 = 0;
int v2 = 0;
v1 = vals.size();
for(size_t row = 0; row < v1; row++)
{
v2 = vals.at(row).size();
for(size_t col = 0; col < v2; col++)
{
T value = get(row, col);
accum_ichi += value;
accum_sq_ichi += (value * value);
// perform carries
accum_oku += (accum_ichi / oku);
accum_ichi %= oku;
accum_sq_oku += (accum_sq_ichi / oku);
accum_sq_ichi %= oku;
// find new max/min's
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
cnt++;
}
}
// now, and only now, do we use floating-point arithmetic
accum_f = (double)(oku) * (double)(accum_oku) + (double)(accum_ichi);
accum_sq_f = (double)(oku) * (double)(accum_sq_oku) + (double)(accum_sq_ichi);
new_mean = accum_f / (double)(cnt);
// standard deviation formula from Wikipedia
stats_standard_dev = sqrt((double)(cnt)*accum_sq_f - accum_f*accum_f)/(double)(cnt);
std::cout << stats_standard_dev << std::endl;
}