Large block coefficient-wise multiplication fails in Eigen library C++

I've read through a lot of the documentation, but if you find something I've missed that explains away my issue, I'll be pleased. For background, I'm compiling on x86 Windows 10 in Visual Studio 2015 using the 3.2.7 Eigen library. The 3.2.7 version is from May, and while there have been releases since then, I haven't seen anything in the changelog indicating my issue has been fixed.
The issue seems to only appear for matrices above a certain size. I don't know if this is a byproduct of something specific to my system or something inherent to Eigen.
The following code produces an access violation in both Debug and Release mode.
int mx1Rows = 255, cols = 254;
{ // this has an access violation at the assignment of mx2
    Eigen::MatrixXd mx1(mx1Rows, cols);
    Eigen::MatrixXd mx2(mx1Rows + 1, cols);
    Eigen::Block<Eigen::MatrixXd, -1, -1, false> temp = mx2.topRows(mx1Rows);
    mx2 = temp.array() * mx1.array(); // error
}
I believe the assignment of the coefficient-wise multiplication to be safe since the result should be aliased.
This issue becomes interesting when mx1Rows is reduced to 254: the access violation then disappears. That's right, mx2 dimensions of 256 by 254 produce the problem, but dimensions of 255 by 254 don't. If I increase the column size I can also get the access violation, so the problem likely has something to do with the total number of entries. The issue appears even if mx1 and mx2 are filled with values; filled matrices are not necessary to reproduce it.
Similar code that does not assign the topRows() block to temp does not produce the access violation in Release mode. I believe there is more to this, since I originally identified the issue in considerably more complex code where it appeared only after a certain number of loops (the matrix sizes were consistent between loops). There is so much going on in that code that I haven't been able to isolate the conditions under which the access violation appears only after a certain number of loops.
What I am curious to know is
1) Am I using Eigen in some obviously wrong way?
2) Are you able to reproduce this issue? (What are your environment's particulars?)
3) Is this a bug in the Eigen library?
It's easy enough to work around this problem by assigning the block to a temporary matrix instead of a block, even if it is inefficient, so I'm not interested in hearing about that.

The problem is that temp references the coefficients held by mx2, but in the last assignment, mx2 is first resized before the expression gets evaluated. Therefore, during the actual evaluation of the expression, temp references garbage. More precisely, here is what is actually generated (in a simplified manner):
double* temp_data = mx2.data;
free(mx2.data);
mx2.data = malloc(sizeof(double)*mx1Rows*cols);
for(j=0; j<cols; ++j)
    for(i=0; i<mx1Rows; ++i)
        mx2(i,j) = temp_data[i+j*(mx1Rows+1)] * mx1(i,j);
This is called an aliasing issue.
You can work around this by evaluating the expression into a temporary:
mx2 = (temp.array() * mx1.array()).eval();
Another solution is to copy mx2.topRows(.) into a true MatrixXd holding its own memory:
MatrixXd temp = mx2.topRows(mx1Rows);
mx2 = temp.array() * mx1.array();
Yet another solution is to evaluate into temp and resize afterward:
Block<MatrixXd, -1, -1, false> temp = mx2.topRows(mx1Rows);
temp = temp.array() * mx1.array();
mx2.conservativeResize(mx1Rows,cols);

Looks like a bug that affects small dimensions as well. Uncomment the .eval() call in the bug-inducing line below to get correct results.
Correction: as ggael's answer points out, it is aliasing, of the type often encountered when using auto to create a temporary that is later used on the same object.
#include <iostream>
#include <Eigen/Dense>
int main()
{ // this has an access violation at the assignment of mx2
    //const int mx1Rows = 255, cols = 254;
    const int mx1Rows = 3, cols = 2;
    Eigen::MatrixXd mx1(mx1Rows, cols);
    int value = 0;
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < mx1Rows; i++)
            mx1(i, j) = value++;
    Eigen::MatrixXd mx2(mx1Rows + 1, cols);
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < mx1Rows + 1; i++)
            mx2(i, j) = value++;
    Eigen::Block<Eigen::MatrixXd, -1, -1> temp = mx2.topRows(mx1Rows);
    mx2 = temp.array()/*.eval().array()*/ * mx1.array();
    std::cout << mx2.array() << std::endl;
}
// with /*.eval().array()*/ uncommented
//0 30
//7 44
//16 60
// Original showing bug
//-0 -4.37045e+144
//-1.45682e+144 -5.82726e+144
//-2.91363e+144 -7.28408e+144

Related

How can I make this as fast as possible? - Iterating through an image mat

The question is quite straightforward. I'll also explain what I do in case there is a faster way to do this without optimizing this specific way.
I go through an image and its rgb values. I have bins of size 256 for each color. So for every pixel I calculate the 3 bins of its rgb values. The bins essentially give me the index to access data for the specific color in a large vector. With this data, I do some calculations which are irrelevant. What I want to optimize is the accessing part.
Keep in mind that the large vector has an extra dimension. Every pixel belongs to some defined areas of the image, and for every area it belongs to, it has an element in the big vector. So if a pixel belongs to 4 areas (e.g. 3, 9, 12, 13), then the data I want to access is: data[colorIndex][3], data[colorIndex][9], data[colorIndex][12], data[colorIndex][13].
I think that's enough to explain the code which is the following:
//Just filling with data for the sake of the example
int cols = 200; int rows = 200;
cv::Mat image(200, 200, CV_8UC3);
image.setTo(Scalar(100, 100, 100));
int numberOfAreas = 50;
//For every pixel (first dimension) we have a vector<int> containing ones for every area the pixel belongs to.
//For this example, every pixel belongs to every area.
vector<vector<int>> areasThePixelBelongs(200 * 200, vector<int>(numberOfAreas, 1));
int numberOfBins = 32;
int sizeOfBin = 256 / numberOfBins;
vector<vector<float>> data(pow(numberOfBins, 3), vector<float>(numberOfAreas, 1));
//Filling complete
//Part I need to optimize
uchar* matPointer;
for (int y = 0; y < rows; y++) {
    matPointer = image.ptr<uchar>(y);
    for (int x = 0; x < cols; x++) {
        int red = matPointer[x * 3 + 2];
        int green = matPointer[x * 3 + 1];
        int blue = matPointer[x * 3];
        int binNumberRed = red / sizeOfBin;
        int binNumberGreen = green / sizeOfBin;
        int binNumberBlue = blue / sizeOfBin;
        //Instead of a 3d vector where I access the elements like: color[binNumberRed][binNumberGreen][binNumberBlue]
        //I use a 1d vector where I just have to calculate the 1d index as follows
        int index = binNumberRed * numberOfBins * numberOfBins + binNumberGreen * numberOfBins + binNumberBlue;
        vector<int>& areasOfPixel = areasThePixelBelongs[y*cols+x];
        int numberOfPixelAreas = areasOfPixel.size();
        for (int i = 0; i < numberOfPixelAreas; i++) {
            float valueOfInterest = data[index][areasOfPixel[i]];
            //Some calculations here...
        }
    }
}
Would it be better accessing each mat element as a Vec3b? I think I'm essentially accessing an element 3 times for each pixel using uchar. Would accessing one Vec3b be faster?
First of all, vector<vector<T>> is not stored efficiently in memory, as it is not contiguous. This often has a big impact on performance and should be avoided as much as possible (especially when the inner arrays are all the same size). Instead, you can use std::array for fixed-size arrays or a flattened std::vector (with size dim1 * dim2 * ... dimN).
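As a minimal sketch of that flattening (assuming numberOfBins and numberOfAreas from the question's code; flatData and colorCount are illustrative names):
// one contiguous allocation instead of pow(numberOfBins, 3) separate vectors
int colorCount = numberOfBins * numberOfBins * numberOfBins;
vector<float> flatData((size_t)colorCount * numberOfAreas, 1.0f);
// the old data[index][areasOfPixel[i]] lookup becomes:
float valueOfInterest = flatData[(size_t)index * numberOfAreas + areasOfPixel[i]];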
Moreover, the loop is a good candidate for parallelization. You can parallelize this code easily with OpenMP, assuming Some calculations here can be implemented in a thread-safe way (you should be careful about shared writes, if any). If this code is embarrassingly parallel, then the resulting parallel code can be much faster. Still, multi-threading introduces some overhead, which may be too big compared to the overall computation time (this is highly dependent on the content of Some calculations here). A sketch follows below.
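A hedged sketch of what that OpenMP parallelization could look like (compile with -fopenmp; processImage is a hypothetical wrapper around the question's loop, and the per-area work is assumed to only read shared data, otherwise atomics or per-thread buffers would be needed):
#include <opencv2/core.hpp>
#include <vector>
using namespace cv;
using namespace std;
void processImage(const Mat& image, const vector<vector<int>>& areasThePixelBelongs,
                  const vector<vector<float>>& data, int numberOfBins, int sizeOfBin)
{
    const int rows = image.rows, cols = image.cols;
    #pragma omp parallel for
    for (int y = 0; y < rows; y++) {
        const uchar* matPointer = image.ptr<uchar>(y); // thread-local row pointer
        for (int x = 0; x < cols; x++) {
            int binNumberRed = matPointer[x * 3 + 2] / sizeOfBin;
            int binNumberGreen = matPointer[x * 3 + 1] / sizeOfBin;
            int binNumberBlue = matPointer[x * 3] / sizeOfBin;
            int index = binNumberRed * numberOfBins * numberOfBins
                      + binNumberGreen * numberOfBins + binNumberBlue;
            const vector<int>& areasOfPixel = areasThePixelBelongs[y * cols + x];
            for (size_t i = 0; i < areasOfPixel.size(); i++) {
                float valueOfInterest = data[index][areasOfPixel[i]];
                (void)valueOfInterest; // "Some calculations here..."
            }
        }
    }
}
Each iteration of the outer loop touches a disjoint image row and disjoint entries of areasThePixelBelongs, which is what makes the loop safe to split across threads.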
Finally, regarding the content of Some calculations here, it may or may not be possible to adapt the code so the compiler uses SIMD instructions. The data[index][areasOfPixel[i]] lookup will likely prevent most compilers from doing that, but the computation that follows it could be vectorized. Note that software prefetching and gather instructions may help speed up the data[index][areasOfPixel[i]] access a bit.
Note that the way you access pixels should not have a significant impact on the runtime, as the computation should be bound by the speed of the inner loop iterating over areas and containing some unknown code (unless that unknown code actually accesses pixels too).
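For completeness, the Vec3b access the question asks about would look like the sketch below (assuming the same BGR cv::Mat as above); it fetches the pixel once instead of issuing three separate uchar reads, though compilers often generate comparable code for both:
const cv::Vec3b& pix = image.at<cv::Vec3b>(y, x);
int blue = pix[0];
int green = pix[1];
int red = pix[2];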

Add a matrix of 2x2 into a vector in c++

I am trying to fill a vector with a matrix of values in C++. I'm not very confident with this procedure (I don't know much about pointers, and I don't know if I need them here); however, I am trying this:
int auxMat[gray.rows][gray.cols];
vector<int> collectionSum;
collectionSum.push_back(auxMat);
When I try to compile I receive an error which says
invalid arguments 'Candidates are: void push_back(const int &)
Can anyone tell me whether it's possible to do, and how can I solve it?
I read something about erasing cache memory, changing my Eclipse compiler, my C++ version; however, I don't think the problem is that big.
You cannot push back a matrix into a vector. What you can do is preallocate memory for your vector (for speeding things up) then use the std::vector<>::assign member function to "copy" from the matrix into the vector:
vector<int> collectionSum(gray.rows * gray.cols); // reserve memory, faster
collectionSum.assign(*auxMat, *auxMat + gray.rows * gray.cols);
This should be pretty fast. Otherwise, you can push back each individual element in a loop.
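That loop alternative would look something like this (a sketch, assuming auxMat and gray from the question; reserve() avoids repeated reallocation during the push_backs):
vector<int> collectionSum;
collectionSum.reserve(gray.rows * gray.cols);
for (int i = 0; i < gray.rows; ++i)
    for (int j = 0; j < gray.cols; ++j)
        collectionSum.push_back(auxMat[i][j]);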
EDIT
See May I treat a 2D array as a contiguous 1D array? for some technicalities regarding possible undefined behaviour (thanks #juanchopanza for the comment). I believe the code is safe, due to the fact that the storage of the matrix is contiguous.
Because the array auxMat is contiguous in memory, you can copy it directly from memory into your vector. Here, you are telling the vector constructor to copy from the start of auxMat until its end in memory using pointer arithmetic:
std::vector<int> collectionSum(auxMat, auxMat + (gray.rows * gray.cols));
EDIT:
Sorry, I read your question as being about a 1D array (int*) rather than a 2D (int**) array. I honestly recommend switching over to a 1D array, because it is often faster and easier to work with. Depending on whether you're using row-first order or column-first order, you can access the element you want by:
elem = y * width + x; // for row-first order
elem = x * height + y; // for column-first order
For instance:
// Create a 3x3 matrix but represent it continuously as a 1D array
const int A[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
const unsigned width = 3;
const unsigned height = 3;
for (int y = 0; y < height; ++y)
{
    for (int x = 0; x < width; ++x)
    {
        printf("%d ", A[y * width + x]);
    }
    printf("\n");
}

Add 1 to vector<unsigned char> value - Histogram in C++

I guess it's such an easy question (I'm coming from Java), but I can't figure out how it works.
I simply want to increment a vector element by one. The reason for this is that I want to compute a histogram out of image values. But whatever I try, I only manage to assign a value to the vector element, not to increment it by one!
This is my histogram function:
void histogram(unsigned char** image, int height,
               int width, vector<unsigned char>& histogramArray) {
    for (int i = 0; i < width; i++) {
        for (int j = 0; j < height; j++) {
            // histogramArray[1] = (int)histogramArray[1] + (int)1;
            // add histogram position by one if greylevel occurred
            histogramArray[(int)image[i][j]]++;
        }
    }
    // display output
    for (int i = 0; i < 256; i++) {
        cout << "Position: " << i << endl;
        cout << "Histogram Value: " << (int)histogramArray[i] << endl;
    }
}
But whatever I try to add one to the histogramArray position, it leads to just 0 in the output. I'm only allowed to assign concrete values like:
histogramArray[1] = 2;
Is there any simple and easy way? I thought iterators are hopefully not necessary at this point, because I know the exact index position where I want to increment something.
EDIT:
I'm so sorry, I should have been more precise with my question; thank you for your help so far! The code above is working, but it shows a different mean value for the histogram (a difference of around 90) than it should. Also, the histogram values are way different than in a graphics program, even though the image values are exactly the same! That's why I investigated the function and found out that if I set the histogram to zeros and then just try to increase one element, nothing happens! This is the commented code above:
for (int i = 0; i < width; i++) {
    for (int j = 0; j < height; j++) {
        histogramArray[1]++;
        // add histogram position by one if greylevel occurred
        // histogramArray[(int)image[i][j]]++;
    }
}
So the position 1 remains 0, instead of having the value height*width. Because of this, I think the correct calculation histogramArray[image[i][j]]++; is also not working properly.
Do you have any explanation for this? This was my main question, I'm sorry.
Just for completeness, this is my mean function for the histogram:
unsigned char meanHistogram(vector<unsigned char>& histogram) {
    int allOccurences = 0;
    int allValues = 0;
    for (int i = 0; i < 256; i++) {
        allOccurences += histogram[i] * i;
        allValues += histogram[i];
    }
    return (allOccurences / (float) allValues) + 0.5f;
}
And I initialize the image like this:
unsigned char** image = new unsigned char*[width];
for (int i = 0; i < width; i++) {
    image[i] = new unsigned char[height];
}
But there shouldn't be any problem with the initialization code, since all other computations work perfectly and I am able to manipulate and save the original image. It's true, though, that I should swap width and height; since I've only had square images, it didn't matter so far.
The Histogram is created like this and then the function is called like that:
vector<unsigned char> histogramArray(256);
histogram(array, adaptedHeight, adaptedWidth, histogramArray);
So do you have any clue why this part, histogramArray[1]++;, doesn't increase my histogram? histogramArray[1] remains 0 all the time! histogramArray[1] = 2; works perfectly. Also, histogramArray[(int)image[i][j]]++; seems to calculate something, but as I said, I think it's calculating wrongly.
I appreciate any help very much! The reason why I used a 2D Array is simply because it is asked for. I like the 1D version also much more, because it's way simpler!
You see, the current problem in your code is not incrementing a value versus assigning to it; it's the way you index your image. The way you've written your histogram function and the image access part puts very specific restrictions on how you need to allocate your images for this code to work.
For example, assuming your histogram function is as you've written it above, none of these image allocation strategies will work: (I've used char instead of unsigned char for brevity.)
char image [width * height]; // Obvious; "char[]" != "char **"
char * image = new char [width * height]; // "char*" != "char **"
char image [height][width]; // Most surprisingly, this won't work either.
The reason why the third case won't work is tough to explain simply. Suffice it to say that a 2D array like this will not implicitly decay into a pointer to pointer, and if it did, it would be meaningless. Contrary to what you might read in some books or hear from some people, in C/C++, arrays and pointers are not the same thing!
Anyway, for your histogram function to work correctly, you have to allocate your image like this:
char** image = new char* [height];
for (int i = 0; i < height; ++i)
    image[i] = new char [width];
Now you can fill the image, for example:
for (int i = 0; i < height; ++i)
    for (int j = 0; j < width; ++j)
        image[i][j] = rand() % 256; // Or whatever...
On an image allocated like this, you can call your histogram function and it will work. After you're done with this image, you have to free it like this:
for (int i = 0; i < height; ++i)
    delete[] image[i];
delete[] image;
For now, that's enough about allocation. I'll come back to it later.
In addition to the above, it is vital to note the order of iteration over your image. The way you've written it, you iterate over your columns on the outside, and your inner loop walks over the rows. Most (all?) image file formats and many (most?) image processing applications I've seen do it the other way around. The memory allocations I've shown above also assume that the first index is for the row, and the second is for the column. I suggest you do this too, unless you've very good reasons not to.
No matter which layout you choose for your images (the recommended row-major, or your current column-major), it is an issue that you should always keep in mind and take notice of.
Now, on to my recommended way of allocating and accessing images and calculating histograms.
I suggest that you allocate and free images like this:
// Allocate:
char * image = new char [height * width];
// Free:
delete[] image;
That's it; no nasty (de)allocation loops, and every image is one contiguous block of memory. When you want to access row i and column j (note which is which) you do it like this:
image[i * width + j] = 42;
char x = image[i * width + j];
And you'd calculate the histogram like this:
void histogram (
    unsigned char * image, int height, int width,
    // Note that the elements here are pixel-counts, not colors!
    vector<unsigned> & histogram
) {
    // Make sure histogram has enough room; you can do this outside as well.
    if (histogram.size() < 256)
        histogram.resize (256, 0);
    int pixels = height * width;
    for (int i = 0; i < pixels; ++i)
        histogram[image[i]]++;
}
I've eliminated the printing code, which should not be there anyway. Note that I've used a single loop to go through the whole image; this is another advantage of allocating a 1D array. Also, for this particular function, it doesn't matter whether your images are row-major or column major, since it doesn't matter in what order we go through the pixels; it only matters that we go through all the pixels and nothing more.
UPDATE: After the question update, I think all of the above discussion is moot and notwithstanding! I believe the problem could be in the declaration of the histogram vector. It should be a vector of unsigned ints, not single bytes. Your problem seems to be that the values of the vector elements stay at zero when you simplify the code and increment just one element, and are off from the values they need to be when you run the actual code. Well, this could be a symptom of numeric wrap-around: if the number of pixels in your image is a multiple of 256 (e.g. a 32x32 or 1024x1024 image), then it is natural that their count would be 0 mod 256.
I've already alluded to this point in my original answer. If you read my implementation of the histogram function, you'll see in the signature that I've declared the vector as vector<unsigned> and put a comment above it saying that this vector counts pixels, so its data type should be suitable.
I guess I should have made it bolder and clearer! I hope this solves your problem.
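A minimal, self-contained demonstration of the suspected wrap-around (independent of the question's code): counting 1024 events in an unsigned char lands back on 0, since 1024 mod 256 == 0.
#include <iostream>
int main()
{
    unsigned char byteCounter = 0;
    unsigned int wideCounter = 0;
    for (int i = 0; i < 1024; ++i) { ++byteCounter; ++wideCounter; }
    std::cout << (int)byteCounter << '\n'; // prints 0
    std::cout << wideCounter << '\n';      // prints 1024
}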

triangular matrix conversion and auto parallelization

I'm playing a bit with auto parallelization in ICC (11.1; old, but I can't do anything about it), and I'm wondering why the compiler can't parallelize the inner loop of a simple Gaussian elimination:
void makeTriangular(float **matrix, float *vector, int n) {
    for (int pivot = 0; pivot < n - 1; pivot++) {
        // swap row so that the row with the largest value is
        // at pivot position for numerical stability
        int swapPos = findPivot(matrix, pivot, n);
        std::swap(matrix[pivot], matrix[swapPos]);
        std::swap(vector[pivot], vector[swapPos]);
        float pivotVal = matrix[pivot][pivot];
        for (int row = pivot + 1; row < n; row++) { // line 72; should be parallelized
            float tmp = matrix[row][pivot] / pivotVal;
            for (int col = pivot + 1; col < n; col++) { // line 74
                matrix[row][col] -= matrix[pivot][col] * tmp;
            }
            vector[row] -= vector[pivot] * tmp;
        }
    }
}
We're only writing to the arrays dependent on the private row (and col) variable and row is guaranteed to be larger than pivot, so it should be obvious to the compiler that we aren't overwriting anything.
I'm compiling with -O3 -fno-alias -parallel -par-report3 and get lots of dependencies, à la: assumed FLOW dependence between matrix line 75 and matrix line 73, or assumed ANTI dependence between matrix line 73 and matrix line 75, and the same for line 75 alone. What problem does the compiler have? Obviously I could tell it exactly what to do with some pragmas, but I want to understand what the compiler can figure out alone.
Basically the compiler can't figure out that there's no dependency, because the names matrix and vector are both read from and written to (even though in different regions). You might be able to get around this in the following fashion (though it's slightly dirty):
void makeTriangular(float **matrix, float *vector, int n)
{
    for (int pivot = 0; pivot < n - 1; pivot++)
    {
        // swap row so that the row with the largest value is
        // at pivot position for numerical stability
        int swapPos = findPivot(matrix, pivot, n);
        std::swap(matrix[pivot], matrix[swapPos]);
        std::swap(vector[pivot], vector[swapPos]);
        float pivotVal = matrix[pivot][pivot];
        float **matrixForWriting = matrix; // COPY THE POINTER
        float *vectorForWriting = vector;  // COPY THE POINTER
        // (then parallelize this next for loop as you were)
        for (int row = pivot + 1; row < n; row++) {
            float tmp = matrix[row][pivot] / pivotVal;
            for (int col = pivot + 1; col < n; col++) {
                // WRITE TO THE matrixForWriting VERSION
                matrixForWriting[row][col] = matrix[row][col] - matrix[pivot][col] * tmp;
            }
            // WRITE TO THE vectorForWriting VERSION
            vectorForWriting[row] = vector[row] - vector[pivot] * tmp;
        }
    }
}
Bottom line: just give the arrays you're writing to a temporarily different name to trick the compiler. I know it's a little dirty, and I wouldn't recommend this kind of programming in general. But if you're sure that you have no data dependency, it's perfectly fine.
In fact, I'd put some very clear comments around it so that future readers of this code know this was a workaround, and why you did it.
Edit: I think the answer was basically touched on by #FPK, and an answer was posted by #Evgeny Kluev. However, #Evgeny Kluev's answer suggests making the output an extra parameter; that might parallelize, but it won't give the correct values, since the entries in matrix won't be updated. I think the code I posted above will give the correct answer too.
The same auto-parallelization problem is on icc 12.1. So I used this newer version for experiments.
Adding an output matrix to your function's parameter list and changing the body of the third loop to this
out[row][col] = matrix[row][col] - matrix[pivot][col] * tmp;
fixed the "FLOW dependence" problem. Which means, "-fno-alias" affects only function parameters, while contents of the single parameter remain under suspicion of being aliased. I don't know why this option does not affect everything. Since different parts of your matrix do not really alias each other, you can just leave this additional parameter to the function and pass the same matrix through this parameter.
Interestingly, while complaining about 'matrix', the compiler says nothing about 'vector', which really does have sharing problems: the line vector[row] -= vector[pivot] * tmp; may lead to false sharing (writing to vector[row] in one thread may touch the cache line storing vector[pivot], which is used by every thread).
"FLOW dependence" is not the only problem in this code. After it was fixed, compiler still refuses to parallelize second and third loops because of "insufficient computational work". So I tried to give it some extra work:
float tmp = matrix[row][pivot] * pivotVal;
...
out[row][col] = matrix[row][col] - matrix[pivot][col] *tmp /pivotVal /pivotVal;
And after all this, the second loop was at last parallelized, though I'm not sure if it gained any speed improvement.
Update: I found a better alternative to giving the compiler "some extra work". The option -par-threshold50 does the trick.
I have no access to an icc to test my idea, but I suspect the compiler fears aliasing: matrix is defined as float**, an array of pointers pointing to arrays of floats. All those pointers could point to the same float array, so parallelizing this would be very dangerous. This would make no sense, but the compiler cannot know that.

Why is this code so slow?

So I have this function used to calculate statistics (min/max/std/mean). Now, the thing is this generally runs on a 10,000 by 15,000 matrix. The matrix is stored as a vector<vector<int>> inside the class. Now, creating and populating said matrix goes very fast, but when it comes down to the statistics part it becomes incredibly slow.
E.g., reading all the pixel values of the GeoTIFF one pixel at a time takes around 30 seconds (and that involves a lot of complex math to properly georeference the pixel values to a corresponding point), while calculating the statistics of the entire matrix takes around 6 minutes.
void CalculateStats()
{
    //OHGOD
    double new_mean = 0;
    double new_standard_dev = 0;
    int new_min = 256;
    int new_max = 0;
    size_t cnt = 0;
    for(size_t row = 0; row < vals.size(); row++)
    {
        for(size_t col = 0; col < vals.at(row).size(); col++)
        {
            double mean_prev = new_mean;
            T value = get(row, col);
            new_mean += (value - new_mean) / (cnt + 1);
            new_standard_dev += (value - new_mean) * (value - mean_prev);
            // find new max/min's
            new_min = value < new_min ? value : new_min;
            new_max = value > new_max ? value : new_max;
            cnt++;
        }
    }
    stats_standard_dev = sqrt(new_standard_dev / (vals.size() * vals.at(0).size()) + 1);
    std::cout << stats_standard_dev << std::endl;
}
Am I doing something horrible here?
EDIT
To respond to the comments, T would be an int.
EDIT 2
I fixed my std algorithm, and here is the final product:
void CalculateStats(const std::vector<double>& ignore_values)
{
    //OHGOD
    double new_mean = 0;
    double new_standard_dev = 0;
    int new_min = 256;
    int new_max = 0;
    size_t cnt = 0;
    int n = 0;
    double delta = 0.0;
    double mean2 = 0.0;
    std::vector<double>::const_iterator ignore_begin = ignore_values.begin();
    std::vector<double>::const_iterator ignore_end = ignore_values.end();
    for(std::vector<std::vector<T> >::const_iterator row = vals.begin(), row_end = vals.end(); row != row_end; ++row)
    {
        for(std::vector<T>::const_iterator col = row->begin(), col_end = row->end(); col != col_end; ++col)
        {
            // This method of calculation is based on Knuth's algorithm.
            T value = *col;
            if(std::find(ignore_begin, ignore_end, value) != ignore_end)
                continue;
            n++;
            delta = value - new_mean;
            new_mean = new_mean + (delta / n);
            mean2 = mean2 + (delta * (value - new_mean));
            // Find new max/min's.
            new_min = value < new_min ? value : new_min;
            new_max = value > new_max ? value : new_max;
        }
    }
    stats_standard_dev = sqrt(mean2 / (n - 1)); // mean2/(n-1) is the variance
    stats_min = new_min;
    stats_max = new_max;
    stats_mean = new_mean;
}
This still takes ~120-130 seconds, but it's a huge improvement :)!
Have you tried to profile your code?
You don't even need a fancy profiler. Just stick some debug timing statements in there.
Anything I tell you would just be an educated guess (and probably wrong)
You could be getting lots of cache misses due to the way you're accessing the contents of the vector. You might want to cache some of the results of size(), but I don't know if that's the issue.
I just profiled it. 90% of the execution time was in this line:
new_mean += (value - new_mean) / (cnt + 1);
You should calculate the sum of values, min, max, and count in a first loop, then compute the mean in one operation by dividing sum/count, and then in a second loop calculate the sum for std_dev. That would probably be a bit faster; see the sketch below.
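A sketch of that two-pass version, reusing the question's vals, get() and element type T (note there is now a single division overall instead of one per element):
double sum = 0; size_t cnt = 0;
int mn = 256, mx = 0;
for (size_t r = 0; r < vals.size(); r++)
    for (size_t c = 0; c < vals[r].size(); c++) {
        T v = get(r, c);
        sum += v; cnt++;
        mn = v < mn ? v : mn;
        mx = v > mx ? v : mx;
    }
double mean = sum / cnt;
double devSum = 0; // second pass accumulates squared deviations
for (size_t r = 0; r < vals.size(); r++)
    for (size_t c = 0; c < vals[r].size(); c++) {
        double d = get(r, c) - mean;
        devSum += d * d;
    }
double std_dev = sqrt(devSum / cnt);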
The first thing I spotted is that you evaluate vals.at(row).size() in the loop, which obviously isn't going to improve performance. The same applies to vals.size(), though of course the inner loop is worse. If vals is a vector of vectors, you'd better use iterators, or at least keep a reference to the outer vector (because get() with index parameters surely eats up quite some time as well).
This code sample is supposed to illustrate my intentions ;-)
// TVO/TVI stand for the outer and inner vector types
for(TVO::const_iterator i = vals.begin(), ie = vals.end(); i != ie; ++i) {
    for(TVI::const_iterator ii = i->begin(), iie = i->end(); ii != iie; ++ii) {
        T value = *ii;
        // the rest
    }
}
First, change your row++ to ++row. A minor thing, but you want speed, so that will help.
Second, make your row < vals.size() comparison use a const local instead. The compiler doesn't know that vals won't change, so it has to play nice and always call size().
what is the 'get' method in the middle there? What does that do? That might be your real problem.
I'm not too sure about your std dev calculation. Take a look at the wikipedia page on calculating variance in a single pass (they have a quick explanation of Knuth's algorithm, which is an expansion of a recursion relation).
It's slow because you're benchmarking debug code.
Building and running the code on Windows XP using VS2008:
In a Release build with the default optimisation level, the code in the OP runs in 2734 ms.
In a Debug build with the default of no optimisation, the code in the OP runs in a massive 398,531 ms.
In comments below you say you're not using optimisation, and this appears to make a big difference in this case. Normally it's less than a factor of ten, but here it's over a hundred times slower.
I'm using VS2008 rather than 2005, but it's probably similar:
In the Debug build, there are two range checks on each access, each of which calls std::vector::size() using a non-inlined function call and requires a branch prediction. There is overhead involved both with function calls and with branches.
In the Release build, the compiler optimizes away the range checks (I don't know whether it just drops them, or does flow analysis based on the limits of the loop), and the vector access becomes a small amount of inline pointer arithmetic with no branches.
No-one cares how fast the debug build is. You should be unit testing the release build anyway, as that's the build which has to work correctly. Only use the Debug build if you don't get all the information you want when you try to step through the code.
The code as posted runs in under 1.5 seconds on my PC with test data of 15000 x 10000 integers all equal to 42. You report that it's running 230 times slower than that. Are you on a 10 MHz processor?
There are other suggestions for making it faster (such as moving to SSE, if all the values are representable using 8-bit types), but there's clearly something else making it slow.
On my machine, neither a version which hoisted a reference to the vector for the row and hoisted the size of the row, nor a version which used iterators, had any measurable benefit (with g++ -O3, using iterators takes 1511 ms repeatably; the hoisted and original versions both take 1485 ms). Not optimising means it runs in 7487 ms (original), 3496 ms (hoisted) or 5331 ms (iterators).
But unless you're running on a very low power device, or are paging, or a running non-optimised code with a debugger attached, it shouldn't be this slow, and whatever is making it slow is not likely to be the code you've posted.
(As a side note, if you test it with values with a deviation of zero, your SD comes out as 1.)
There are far too many calculations in the inner loop:
For the descriptive statistics (mean, standard deviation), the only thing required is to compute the sum of value and the sum of squared value. From these two sums, the mean and standard deviation can be computed after the outer loop (together with a third value, the number of samples; n in your new/updated code). The equations can be derived from the definitions or found on the web, e.g. on Wikipedia. For instance, the mean is just the sum of values divided by n. For the n version (in contrast to the n-1 version; however, n is so large in this case that it doesn't matter which one is used), the standard deviation is sqrt(n * sumOfSquaredValue - sumOfValue * sumOfValue) / n. Thus only two floating point additions and one multiplication are needed in the inner loop. Overflow is not a problem with these sums, as the range for doubles is about 10^308. In particular, you will get rid of the expensive floating point division that the profiling reported in another answer revealed.
A lesser problem is that the minimum and maximum are rewritten every time (the compiler may or may not prevent this). As the minimum quickly becomes small and the maximum quickly becomes large, only the two comparisons should happen for the majority of loop iterations: use if statements instead to be sure. It can be argued, but on the other hand it is trivial to do.
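A hedged sketch of that inner loop, reusing the question's vals, get() and stats_* names (population form of the standard deviation):
double sum = 0.0, sumSq = 0.0;
size_t n = 0;
for (size_t row = 0; row < vals.size(); row++)
{
    for (size_t col = 0; col < vals[row].size(); col++)
    {
        double v = get(row, col);
        sum += v;       // one addition ...
        sumSq += v * v; // ... plus one multiplication and one addition
        n++;
    }
}
stats_mean = sum / n;
stats_standard_dev = sqrt(n * sumSq - sum * sum) / n;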
I would change how I access the data. Assuming you are using std::vector for your container you could do something like this:
vector<vector<T> >::const_iterator row;
vector<vector<T> >::const_iterator row_end = vals.end();
for (row = vals.begin(); row < row_end; ++row)
{
    vector<T>::const_iterator value;
    vector<T>::const_iterator value_end = row->end();
    for (value = row->begin(); value < value_end; ++value)
    {
        double mean_prev = new_mean;
        new_mean += (*value - new_mean) / (cnt + 1);
        new_standard_dev += (*value - new_mean) * (*value - mean_prev);
        // find new max/min's
        new_min = min(*value, new_min);
        new_max = max(*value, new_max);
        cnt++;
    }
}
The advantage of this is that in your inner loop you aren't consulting the outer vector, just the inner one.
If your container type is a list, this will be significantly faster, because the lookup time of get/operator[] is linear for a list and constant for a vector.
Edit: I moved the call to end() out of the loop.
Move the .size() calls to before each loop, and make sure you are compiling with optimizations turned on.
If your matrix is stored as a vector of vectors, then in the outer for loop you should directly retrieve the i-th vector, and then operate on that in the inner loop. Try that and see if it improves performance.
I'm not sure what type vals is, but vals.at(row).size() could take a long time if it itself iterates through the collection. Store that value in a variable. Otherwise it could make the algorithm more like O(n³) than O(n²).
I think that I would rewrite it to use const iterators instead of row and col indexes. I would set up a const const_iterator for row_end and col_end to compare against, just to make certain it wasn't making function calls at every loop end.
As people have mentioned, it might be get(). If it accesses neighbors, for instance, you will totally smash the cache which will greatly reduce the performance. You should profile, or just think about access patterns.
Coming a bit late to the party here, but a couple of points:
You're effectively doing numerical work here. I don't know much about numerical algorithms, but I know enough to know that references and expert support are often useful. This discussion thread offers some references; and Numerical Recipes is a standard (if dated) work.
If you have the opportunity to redesign your matrix, you'll want to try using a valarray and slices instead of vectors of vectors; one advantage that immediately comes to mind is that you're guaranteed a flat linear layout, which makes cache pre-fetching and SIMD instructions (if your compiler can use them) more effective. A small sketch follows below.
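A hedged valarray sketch (the sizes and names are illustrative): the matrix lives in one flat, contiguous block, and slices act as row views.
#include <valarray>
#include <cstddef>
void valarraySketch()
{
    std::size_t rows = 10000, cols = 15000;
    std::valarray<float> m(rows * cols);        // one contiguous allocation
    m[std::slice(2 * cols, cols, 1)] = 42.0f;   // write row 2 through a slice
    float total = m.sum();                      // whole-matrix reduction
    (void)total;
}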
In the inner loop, you shouldn't be testing size, you shouldn't be doing any divisions, and iterators can also be costly. In fact, some unrolling would be good in there.
And, of course, you should pay attention to cache locality.
If you get the loop overhead low enough, it might make sense to do it in separate passes: one to get the sum (which you divide to get the mean), one to get the sum of squares (which you combine with the sum to get the variance), and one to get the min and/or max. The reason is to simplify what is in the inner unrolled loop so the compiler can keep stuff in registers.
I couldn't get the code to compile, so I couldn't pinpoint issues for sure.
I have modified the algorithm to get rid of almost all of the floating-point division.
WARNING: UNTESTED CODE!!!
void CalculateStats()
{
    //OHGOD
    double accum_f;
    double accum_sq_f;
    double new_mean = 0;
    double new_standard_dev = 0;
    int new_min = 256;
    int new_max = 0;
    const int oku = 100000000; // 10^8 ("oku" is the Japanese word for 10^8)
    int accum_ichi = 0;        // low-order accumulator
    int accum_oku = 0;         // high-order accumulator (multiples of 10^8)
    int accum_sq_ichi = 0;
    int accum_sq_oku = 0;
    size_t cnt = 0;
    size_t v1 = vals.size();
    for (size_t row = 0; row < v1; row++)
    {
        size_t v2 = vals.at(row).size();
        for (size_t col = 0; col < v2; col++)
        {
            T value = get(row, col);
            accum_ichi += value;
            accum_sq_ichi += (value * value);
            // perform carries into the high-order accumulators
            accum_oku += (accum_ichi / oku);
            accum_ichi %= oku;
            accum_sq_oku += (accum_sq_ichi / oku);
            accum_sq_ichi %= oku;
            // find new max/min's
            new_min = value < new_min ? value : new_min;
            new_max = value > new_max ? value : new_max;
            cnt++;
        }
    }
    // now, and only now, do we use floating-point arithmetic
    accum_f = (double)(oku) * (double)(accum_oku) + (double)(accum_ichi);
    accum_sq_f = (double)(oku) * (double)(accum_sq_oku) + (double)(accum_sq_ichi);
    new_mean = accum_f / (double)(cnt);
    // standard deviation formula from Wikipedia
    stats_standard_dev = sqrt((double)(cnt)*accum_sq_f - accum_f*accum_f)/(double)(cnt);
    std::cout << stats_standard_dev << std::endl;
}
}