Variable block size sum of absolute difference calculation in C++

I would like to perform a variable block size sum of absolute difference (SAD) calculation on a 2-D array of 16-bit integers in a C++ program as efficiently as possible. I am interested in real-time block matching code. I was wondering if there are any software libraries available to do this? The code runs on Windows XP and I'm stuck using Visual Studio 2010 for compiling. The CPU is a 2-core AMD Athlon 64 X2 4850e.
By variable block size sum of absolute difference (SAD) calculation I mean the following.
I have one smaller 2-D array I will call the template_grid, and one larger 2-D array I will call the image. I want to find the region in the image that minimizes the sum of the absolute difference between the pixels in the template and the pixels in the region in the image.
The simplest way to calculate the SAD in C++ would be the following:
for(int shiftY = 0; shiftY < rangeY; shiftY++) {
    for(int shiftX = 0; shiftX < rangeX; shiftX++) {
        for(int x = 0; x < lenTemplateX; x++) {
            for(int y = 0; y < lenTemplateY; y++) {
                // accumulate into SAD (assumed zero-initialized)
                SAD[shiftY][shiftX] += abs(template_grid[x][y] - image[y + shiftY][x + shiftX]);
            }
        }
    }
}
The SAD calculation for specific array sizes has been optimized in the Intel Performance Primitives (IPP) library. However, the arrays I'm working with don't match the sizes that library supports.
There are two search ranges I work with,
a large range: rangeY = 45, rangeX = 10
a small range: rangeY = 4, rangeX = 2
There is only one template size and it is:
lenTemplateY = 61, lenTemplateX = 7

Minor optimisation:
for(int shiftY = 0; shiftY < rangeY; shiftY++) {
    for(int shiftX = 0; shiftX < rangeX; shiftX++) {
        // if you can assume SAD is already filled with 0-es,
        // you don't need the next line
        SAD[shiftY][shiftX]=0;
        for(int tx = 0, imx=shiftX; tx < lenTemplateX; tx++,imx++) {
            for(int ty = 0, imy=shiftY; ty < lenTemplateY; ty++,imy++) {
                // two increments of imx/imy may be cheaper than
                // two additions with offsets
                SAD[shiftY][shiftX]+=abs(template_grid[tx][ty] - image[imx][imy]);
            }
        }
    }
}
Loop unrolling using C++ templates
This may be a crazy idea for your configuration (the C++ compiler worries me), but it may work. I offer no warranties, but give it a try.
The idea may work because your template_grid sizes and the ranges are constant - thus known at compilation time. Also, for this to work, your image and template_grid must be organised with the same layout (column first or row first) - the way your "sample code" is depicted in the question mixes the SAD x/y with the template_grid y/x.
In what follows, I'll assume a "column first" organisation, so that SAD[ix] denotes the ix-th column of your SAD matrix. The code goes just the same for "row first", except the names of the variables won't match the meaning of your value arrays.
So, let's start:
template <
    typename sad_type, typename val_type,
    size_t template_len
> struct sad1D_simple {
    void operator()(
        const val_type* img, const val_type* templ,
        sad_type& result
    ) {
        // template specialization recursion, with one less element to add
        sad1D_simple<sad_type, val_type, template_len-1> one_shorter;
        // call it, incrementing the img and templ offsets
        one_shorter(img+1, templ+1, result);
        // then add the contribution of the first diff we skipped over above
        result += abs(*img - *templ);
    }
};
// at a length of 0 there's nothing left to add; result is left untouched
// (so successive calls can accumulate into a pre-zeroed slot). We need this
// specialization to stop the template recursion.
template <
    typename sad_type, typename val_type
>
struct sad1D_simple<sad_type, val_type, 0> {
    void operator()(
        const val_type* img, const val_type* templ,
        sad_type& result
    ) {
    }
};
Why a functor struct (a struct with operator())? Because C++ doesn't allow partial specialization of function templates.
What sad1D_simple does: it unrolls a for cycle that accumulates the SAD of the two input arrays into result (which the caller must zero first), without any offsetting, based on the fact that the length of your template_grid column is a constant known at compile time. It's in the same vein as "computing the factorial at compile time using C++ templates".
How does this help?
Example of use in the code below:
typedef unsigned long SAD_t;
typedef int16_t pixel_val_t;
const size_t lenTemplateX = 7; // number of cols in the template_grid
const size_t lenTemplateY = 61;
const size_t rangeX=10, rangeY=45;
pixel_val_t **image, **template_grid;
SAD_t** SAD;
// assume those are initialized somehow (with SAD filled with zeroes)
for(size_t tgrid_col=0; tgrid_col<lenTemplateX; tgrid_col++) {
    pixel_val_t* template_col=template_grid[tgrid_col];
    // the X axis - horizontal - is the column axis, right?
    for(size_t shiftX=0; shiftX < rangeX; shiftX++) {
        // the image column that lines up with this template column at this shift
        pixel_val_t* img_col=image[shiftX+tgrid_col];
        for(size_t shiftY = 0; shiftY < rangeY; shiftY++) {
            // the Y axis - vertical - is the "offset in a column"=row, isn't it?
            pixel_val_t* img_col_offsetted=img_col+shiftY;
            // this functor is made by recursive specialization:
            // there's no cycle inside it, it was unrolled into
            // lenTemplateY individual subtractions, abs-es and additions
            sad1D_simple<SAD_t, pixel_val_t, lenTemplateY> calc;
            calc(img_col_offsetted, template_col, SAD[shiftX][shiftY]);
        }
    }
}
Mmmm... can we do better? No, it won't be the X-axis unrolling, we still want to stay in 1D territory, but... well, maybe if we create a ranged sad1D and unroll one more loop on the same axis? It will work iff rangeY is also constant.
template <
    typename sad_type, typename val_type,
    size_t range, size_t template_len
> struct sad1D_ranged {
    void operator()(
        const val_type* img, const val_type* templ,
        // result is assumed to have at least `range` slots
        sad_type* result
    ) {
        // we'll accumulate here into the first slot of the result
        sad1D_simple<sad_type, val_type, template_len>
            calculator_for_first_sad;
        calculator_for_first_sad(img, templ, *(result));
        // now, ask for a recursive specialization for
        // the next (range-1) sad-s
        sad1D_ranged<sad_type, val_type, range-1, template_len>
            one_less_in_range;
        // when calling, pass the shifted img and result
        one_less_in_range(img+1, templ, result+1);
    }
};
// for a range of 0, there's nothing to do, but we need it
// to stop the template specialization recursion
template <
    typename sad_type, typename val_type,
    size_t template_len
> struct sad1D_ranged<sad_type, val_type, 0, template_len> {
    void operator()(
        const val_type* img, const val_type* templ,
        sad_type* result
    ) {
    }
};
And here's how you use it:
for(size_t tgrid_col=0; tgrid_col<lenTemplateX; tgrid_col++) {
    pixel_val_t* template_col=template_grid[tgrid_col];
    for(size_t shiftX=0; shiftX < rangeX; shiftX++) {
        // again, the image column has to line up with this template column
        pixel_val_t* img_col=image[shiftX+tgrid_col];
        SAD_t* sad_col=SAD[shiftX];
        sad1D_ranged<SAD_t, pixel_val_t, rangeY, lenTemplateY> calc;
        calc(img_col, template_col, sad_col);
    }
}
Yes... but the question is: will this improve performance?
The heck if I know. For a small number of iterations within a cycle and for strong data locality (values close to one another so that they sit in the CPU caches), loop unrolling should improve performance. For a larger number of iterations, you may negatively interfere with the CPU branch prediction and other mumbo-jumbo-I-know-may-impact-performance-but-I-don't-know-how.
Gut feeling: even though the same unrolling technique may work for the other two loops, using it may well degrade performance: we'd need to jump from one contiguous vector (an image column) to another, and the entire image may not fit into the CPU cache.
Note: if your template_grid data is constant as well (or you have a finite set of constant template grids), one may take one step further and create struct functors with dedicated masks. But I'm out of steam for today.

You could try OpenCV's template matching with the squared-difference parameter (see the template matching tutorial). OpenCV is optimized with OpenCL, but I don't know whether this specific function is. I think you should give it a try.
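For illustration, an untested sketch of what that call could look like (matchTemplate wants 8-bit or 32-bit float input, so the 16-bit data is converted first; bestMatch is my own helper name, and TM_SQDIFF minimises squared rather than absolute differences):
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>

// Hedged sketch: image16 and templ16 are assumed to be CV_16S cv::Mat objects
// holding the image and the template_grid.
cv::Point bestMatch(const cv::Mat& image16, const cv::Mat& templ16)
{
    cv::Mat image32f, templ32f, result;
    image16.convertTo(image32f, CV_32F);
    templ16.convertTo(templ32f, CV_32F);
    // result(x, y) holds the summed squared difference for the shift (x, y)
    cv::matchTemplate(image32f, templ32f, result, CV_TM_SQDIFF);
    double minVal, maxVal;
    cv::Point minLoc, maxLoc;
    cv::minMaxLoc(result, &minVal, &maxVal, &minLoc, &maxLoc);
    return minLoc; // for CV_TM_SQDIFF the best match is the minimum
}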

I'm not sure how much you are restricted to using SAD, or whether you are generally interested in finding the region in the image that matches the template best. In the latter case, you can use a convolution instead of SAD. This can be solved in the Fourier domain in O(N log N), including the Fourier transform (FFT).
In short, you can use the FFT (for example using http://www.fftw.org/) to convert both the template and the image to the frequency domain, then multiply them, and convert back to the spatial domain.
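For illustration, an untested sketch of that with FFTW (double precision, no normalisation, and it assumes the template has already been zero-padded to the image size; correlate_fft is my own name, not an FFTW function):
#include <fftw3.h>
#include <vector>

// Cross-correlation via the frequency domain:
// corr = IFFT( FFT(image) * conj(FFT(templ)) ); the best match is the maximum of corr.
std::vector<double> correlate_fft(std::vector<double> image,  // rows*cols, row-major
                                  std::vector<double> templ,  // zero-padded to rows*cols
                                  int rows, int cols)
{
    const int nComplex = rows * (cols / 2 + 1);  // size of the real-to-complex output
    std::vector<double> corr(rows * cols);
    fftw_complex* imgF = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * nComplex);
    fftw_complex* tplF = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * nComplex);

    fftw_plan pImg = fftw_plan_dft_r2c_2d(rows, cols, &image[0], imgF, FFTW_ESTIMATE);
    fftw_plan pTpl = fftw_plan_dft_r2c_2d(rows, cols, &templ[0], tplF, FFTW_ESTIMATE);
    fftw_plan pInv = fftw_plan_dft_c2r_2d(rows, cols, imgF, &corr[0], FFTW_ESTIMATE);

    fftw_execute(pImg);
    fftw_execute(pTpl);
    // pointwise multiply the image spectrum by the conjugate of the template spectrum
    for (int k = 0; k < nComplex; ++k) {
        const double re = imgF[k][0]*tplF[k][0] + imgF[k][1]*tplF[k][1];
        const double im = imgF[k][1]*tplF[k][0] - imgF[k][0]*tplF[k][1];
        imgF[k][0] = re;
        imgF[k][1] = im;
    }
    fftw_execute(pInv);  // unnormalised; divide by rows*cols if the actual values matter

    fftw_destroy_plan(pImg); fftw_destroy_plan(pTpl); fftw_destroy_plan(pInv);
    fftw_free(imgF); fftw_free(tplF);
    return corr;
}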
Of course, this is all irrelevant if you are bound to using SAD.

Related

How can I optimize this function which handles large c++ vectors?

According to Visual Studio's performance analyzer, the following function is consuming what seems to me to be an abnormally large amount of processor power, seeing as all it does is add between 1 and 3 numbers from several vectors and store the result in one of those vectors.
//Relevant class members:
//vector<double> cache (~80,000);
//int inputSize;
//Notes:
//RealFFT::real is a typedef for POD double.
//RealFFT::RealSet is a wrapper class for a c-style array of RealFFT::real.
//This is because of the FFT library I'm using (FFTW).
//Its bracket operator is overloaded to return a const reference to the appropriate array element
vector<RealFFT::real> Convolver::store(vector<RealFFT::RealSet>& data)
{
    int cr = inputSize; //'cache' read position
    int cw = 0;         //'cache' write position
    int di = 0;         //index within 'data' vector (ex. data[di])
    int bi = 0;         //index within 'data' element (ex. data[di][bi])
    int blockSize = irBlockSize();
    int dataSize = data.size();
    int cacheSize = cache.size();
    //Basically, this takes the existing values in 'cache', sums them with the
    //values in 'data' at the appropriate positions, and stores them back in
    //the cache at a new position.
    while (cw < cacheSize)
    {
        int n = 0;
        if (di < dataSize)
            n = data[di][bi];
        if (di > 0 && bi < inputSize)
            n += data[di - 1][blockSize + bi];
        if (++bi == blockSize)
        {
            di++;
            bi = 0;
        }
        if (cr < cacheSize)
            n += cache[cr++];
        cache[cw++] = n;
    }
    //Take the first 'inputSize' number of values and return them to a new vector.
    return Common::vecTake<RealFFT::real>(inputSize, cache, 0);
}
Granted, the vectors in question have sizes of around 80,000 items, but by comparison, a function which multiplies similar vectors of complex numbers (complex multiplication requires 4 real multiplications and 2 additions each) consumes about 1/3 the processor power.
Perhaps it has something to do with the fact that it has to jump around within the vectors rather than just accessing them linearly? I really have no idea though. Any thoughts on how this could be optimized?
Edit: I should mention I also tried writing the function to access each vector linearly, but this requires more total iterations and actually the performance was worse that way.
Turn on compiler optimization as appropriate. A guide for MSVC is here:
http://msdn.microsoft.com/en-us/library/k1ack8f1.aspx

Eigen MatrixXd push back in c++

Eigen is a well known matrix library in C++. I am having trouble finding a built-in function to simply push an item onto the end of a matrix. Currently I know that it can be done like this:
Eigen::MatrixXd matrix(10, 3);
long int count = 0;
long int topCount = 10;
for (int i = 0; i < listLength; ++i) {
    matrix(count, 0) = list[i].x;
    matrix(count, 1) = list[i].y;
    matrix(count, 2) = list[i].z;
    count++;
    if (count == topCount) {
        topCount *= 2;
        matrix.conservativeResize(topCount, 3);
    }
}
matrix.conservativeResize(count, 3);
And this will work (some of the syntax may be off). But it's pretty convoluted for such a simple thing. Is there already a built-in function?
There is no such function for Eigen matrices. The reason is that such a function would either be very slow or use excessive memory.
For a push_back function not to be prohibitively expensive, it must increase the matrix's capacity by some factor when it runs out of space, as you have done. However, when dealing with matrices, memory usage is often a concern, so having a matrix's capacity be larger than necessary could be problematic.
If it instead increased the size by rows() or cols() each time, the operation would be O(n*m). Doing this to fill an entire matrix would be O(n*n*m*m), which even for moderately sized matrices would be quite slow.
Additionally, in linear algebra matrix and vector sizes are nearly always constant and known beforehand. Often when resizing a matrix you don't care about the previous values it held. This is why Eigen's resize function does not retain old values, unlike std::vector's resize.
The only case I can think of where you wouldn't know the matrix's size beforehand is when reading from a file. In this case I would either load the data first into a standard container such as std::vector using push_back and then copy it into an already sized matrix, or if memory is tight run through the file once to get the size and then a second time to copy the values.
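For the file-reading case, a minimal sketch of that first option might look like this (readPoints is a hypothetical stand-in for whatever parsing you do):
#include <Eigen/Dense>
#include <vector>

// Hypothetical: parse the file into a growable standard container first.
std::vector<Eigen::Vector3d> readPoints(const char* filename);

Eigen::MatrixXd toMatrix(const std::vector<Eigen::Vector3d>& points)
{
    // Size the matrix exactly once, now that the count is known.
    Eigen::MatrixXd m(points.size(), 3);
    for (std::size_t i = 0; i < points.size(); ++i)
        m.row(i) = points[i].transpose();  // copy each point into its row
    return m;
}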
There is no such function, however, you can build something like this yourself:
#include <Eigen/Dense>
#include <iostream>

using Eigen::MatrixXd;
using Eigen::Vector3d;

template <typename DynamicEigenMatrix>
void push_back(DynamicEigenMatrix& m, Vector3d&& values, std::size_t row)
{
    if (row >= static_cast<std::size_t>(m.rows())) {
        m.conservativeResize(row + 1, Eigen::NoChange);
    }
    m.row(row) = values.transpose();
}

int main()
{
    MatrixXd matrix(10, 3);
    for (std::size_t i = 0; i < 10; ++i) {
        push_back(matrix, Vector3d(1, 2, 3), i);
    }
    std::cout << matrix << "\n";
    return 0;
}
If this needs to perform too many resizes though, it's going to be horrendously slow.

C++ performance: checking a block of memory for having specific values in specific cells

I'm doing research on 2D bin packing algorithms. I've asked a similar question regarding PHP's performance - it was too slow to pack - and now the code has been converted to C++.
It's still pretty slow. What my program does is repeatedly allocate blocks of dynamic memory and populate them with the character 'o':
char* bin;
bin = new (nothrow) char[area];
if (bin == 0) {
    cout << "Error: " << area << " bytes could not be allocated";
    return false;
}
for (int i=0; i<area; i++) {
    bin[i]='o';
}
(their size is between 1 KB and 30 KB for my datasets)
Then the program checks different combinations of 'x' characters inside the current memory block.
void place(char* bin, int* best, int width)
{
    for (int i=best[0]; i<best[0]+best[1]; i++)
        for (int j=best[2]; j<best[2]+best[3]; j++)
            bin[i*width+j] = 'x';
}
One of the functions, which checks for non-overlap, gets called millions of times during a run:
bool fits(char* bin, int* pos, int width)
{
    for (int i=pos[0]; i<pos[0]+pos[1]; i++)
        for (int j=pos[2]; j<pos[2]+pos[3]; j++)
            if (bin[i*width+j] == 'x')
                return false;
    return true;
}
All other stuff takes only a percent of the runtime, so I need to make these two guys (fits and place) faster. Who's the culprit?
Since I only have the two options 'x' and 'o', I could try to use just one bit instead of the whole byte a char takes. But I'm more concerned with speed - do you think it would make things faster?
Thanks!
Update: I replaced int* pos with rect pos (the same for best), as MSalters suggested. At first I saw improvement, but I tested more with bigger datasets and it seems to be back to normal runtimes. I'll try other techniques suggested and will keep you posted.
Update: using memset and memchr sped things up about twofold. Replacing 'x' and 'o' with '\1' and '\0' didn't show any improvement. __restrict wasn't helpful either. Overall, I'm satisfied with the performance of the program now, since I also made some improvements to the algorithm itself. I've yet to try using a bitmap and compiling with -O2 (-O3)... Thanks again everybody.
The best possibility would be to use an algorithm with better complexity.
But even your current algorithm could be sped up. Try using SSE instructions to test ~16 bytes at once. Also, you can make a single large allocation and split it up yourself; that will be faster than using the library allocator (the library allocator has the advantage of letting you free blocks individually, but I don't think you need that feature).
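For illustration, an untested SSE2 sketch of fits along those lines (SSE2 is available on the Athlon 64 X2; rowFreeSSE is my own helper name, and it relies on the bin containing only 'o' and 'x'):
#include <emmintrin.h>  // SSE2 intrinsics

// Checks one row span of `len` bytes for any 'x', 16 bytes per iteration.
static bool rowFreeSSE(const char* row, int len)
{
    const __m128i xs = _mm_set1_epi8('x');
    int j = 0;
    for (; j + 16 <= len; j += 16) {
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(row + j));
        __m128i eq = _mm_cmpeq_epi8(chunk, xs);   // 0xFF where a byte equals 'x'
        if (_mm_movemask_epi8(eq) != 0)           // any 'x' in these 16 bytes?
            return false;
    }
    for (; j < len; ++j)                          // scalar tail
        if (row[j] == 'x')
            return false;
    return true;
}

bool fits(char* bin, int* pos, int width)
{
    for (int i = pos[0]; i < pos[0] + pos[1]; i++)
        if (!rowFreeSSE(bin + i * width + pos[2], pos[3]))
            return false;
    return true;
}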
[ Of course: profile it!]
Using a bit rather than a byte will not be faster in the first instance.
However, consider that with characters, you can cast blocks of 4 or 8 bytes to unsigned 32 bit or 64 bit integers (making sure you handle alignment), and compare that to the value for 'oooo' or 'oooooooo' in the block. That allows a very fast compare.
Now, having gone down the integer approach, you can see that you could do the same with the bit approach and handle, say, 64 bits in a single compare. That should surely give a real speed-up.
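An untested sketch of that 8-bytes-at-a-time compare (fits8 is my own name; it works only because the bin holds nothing but 'o' and 'x', so any byte that isn't 'o' must be an 'x'):
#include <cstdint>
#include <cstring>

bool fits8(const char* bin, const int* pos, int width)
{
    const uint64_t allO = 0x6F6F6F6F6F6F6F6FULL;  // 'o' (0x6F) repeated 8 times
    for (int i = pos[0]; i < pos[0] + pos[1]; i++) {
        const char* row = bin + i * width + pos[2];
        const int len = pos[3];
        int j = 0;
        for (; j + 8 <= len; j += 8) {
            uint64_t word;
            std::memcpy(&word, row + j, 8);   // memcpy sidesteps alignment issues
            if (word != allO)                  // some byte is not 'o', so it's an 'x'
                return false;
        }
        for (; j < len; ++j)                   // leftover bytes
            if (row[j] == 'x')
                return false;
    }
    return true;
}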
Bitmaps will increase the speed as well, since they involve touching less memory and thus more memory references will come from the cache. Also, in place, you might want to copy the elements of best into local variables so that the compiler knows your writes to bin will not change best. If your compiler supports some spelling of restrict, you might want to use that as well. You can also replace the inner loop in place with the memset library function, and the inner loop in fits with memchr; those may not be large performance improvements, though.
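For illustration, here is roughly what those two library-call replacements could look like (untested; same signatures as in the question):
#include <cstring>

void place(char* bin, int* best, int width)
{
    for (int i = best[0]; i < best[0] + best[1]; i++)
        std::memset(bin + i * width + best[2], 'x', best[3]);  // one whole row span at a time
}

bool fits(char* bin, int* pos, int width)
{
    for (int i = pos[0]; i < pos[0] + pos[1]; i++)
        if (std::memchr(bin + i * width + pos[2], 'x', pos[3]) != 0)  // any 'x' in the span?
            return false;
    return true;
}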
First of all, have you remembered to tell your compiler to optimize?
And turn off slow array index bounds checking and such?
That done, you will get substantial speed-up by representing your binary values as individual bits, since you can then set or clear say 32 or 64 bits at a time.
Also, I would tend to assume that the dynamic allocations add a fair bit of overhead, but apparently you have measured and found that they don't. If, however, the memory management actually contributes significantly to the time, then the solution depends a bit on the usage pattern. Possibly your code generates stack-like alloc/free behavior, in which case you can optimize the allocations down to almost nothing: just allocate a big chunk of memory at the start and then sub-allocate stack-like from that.
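A minimal sketch of such a stack-like sub-allocator (my own illustration, not code from the question; no error handling, and releases must happen in reverse order of allocations):
#include <cstddef>

class StackArena {
public:
    explicit StackArena(std::size_t bytes) : buffer_(new char[bytes]), top_(0) {}
    ~StackArena() { delete[] buffer_; }
    char* alloc(std::size_t bytes) { char* p = buffer_ + top_; top_ += bytes; return p; }  // bump the top
    void release(std::size_t bytes) { top_ -= bytes; }  // give back the most recent block(s)
private:
    char*       buffer_;
    std::size_t top_;
};

// Usage: one big allocation up front, then each bin is a cheap pointer bump.
// StackArena arena(64 * 1024 * 1024);
// char* bin = arena.alloc(area);
// ... use bin ...
// arena.release(area);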
Considering your current code:
void place(char* bin, int* best, int width)
{
    for (int i=best[0]; i<best[0]+best[1]; i++)
        for (int j=best[2]; j<best[2]+best[3]; j++)
            bin[i*width+j] = 'x';
}
Due to possible aliasing the compiler may not realize that e.g. best[0] will be constant during the loop.
So, tell it:
void place(char* bin, int const* best, int const width)
{
    int const maxY = best[0] + best[1];
    int const maxX = best[2] + best[3];
    for( int y = best[0]; y < maxY; ++y )
    {
        for( int x = best[2]; x < maxX; ++x )
        {
            bin[y*width + x] = 'x';
        }
    }
}
Most probably your compiler will hoist the y*width computation out of the inner loop, but why not tell it to do that as well:
void place(char* bin, int* best, int const width)
{
    int const maxY = best[0]+best[1];
    int const maxX = best[2]+best[3];
    for( int y = best[0]; y < maxY; ++y )
    {
        int const startOfRow = y*width;
        for( int x = best[2]; x < maxX; ++x )
        {
            bin[startOfRow + x] = 'x';
        }
    }
}
This manual optimization (also applicable to the other routine) may or may not help; it depends on how smart your compiler is.
Next, if that doesn't help enough, consider replacing inner loop with std::fill (or memset), doing a whole row in one fell swoop.
And if that doesn't help or doesn't help enough, switch over to bit-level representation.
It is perhaps worth noting, and trying out, that every PC has built-in hardware support for optimizing bit-level operations, namely the graphics accelerator card (in old times called a blitter chip). So, you might just use an image library and a black/white bitmap. But since your rectangles are small I'm not sure whether the setup overhead will outweigh the speed of the actual operation – it needs to be measured. ;-)
Cheers & hth.,
The biggest improvement I'd expect is from a non-trivial change:
// changed pos to class rect for cleaner syntax
bool fits(char* bin, rect pos, int width)
{
    // check the four corners first - they are cheap early-outs
    if (bin[pos.top()*width + pos.left()] == 'x')
        return false;
    if (bin[pos.bottom()*width + pos.right()] == 'x')
        return false;
    if (bin[pos.bottom()*width + pos.left()] == 'x')
        return false;
    if (bin[pos.top()*width + pos.right()] == 'x')
        return false;
    for (int i=pos.top(); i<=pos.bottom(); i++)
        for (int j=pos.left(); j<=pos.right(); j++)
            if (bin[i*width+j] == 'x')
                return false;
    return true;
}
Sure, you're testing bin[pos.bottom()*width+pos.right()] twice. But the first time you do so is much earlier in the algorithm. You add boxes, which means that there is a strong correlation between adjacent bins. Therefore, by checking the corners first, you often return a lot earlier. You could even consider adding a 5th check in the middle.
Beyond the obligatory statement about using a profiler,
The advice above about replacing things with a bit map is a very good idea. If that does not appeal to you...
Consider replacing
for (int i=0; i<area; i++) {
    bin[i]='o';
}
By
memset(bin, 'o', area);
Typically memset will be faster, as the library implementation writes many bytes per instruction rather than one at a time.
Also
void place(char* bin, int* best, int width)
{
    for (int i=best[0]; i<best[0]+best[1]; i++)
        for (int j=best[2]; j<best[2]+best[3]; j++)
            bin[i*width+j] = 'x';
}
has a bit of room for improvement:
void place(char* bin, int* best, int width)
{
    for (int i=best[0]; i<best[0]+best[1]; i++)
        memset(bin + (i * width) + best[2],  // start of this row's span
               'x',
               best[3]);                     // the span is best[3] cells wide
}
by eliminating one of the loops.
A last idea is to change your data representation.
Consider using the '\0' character as a replacement for your 'o' and '\1' as a replacement for your 'x' character. This is sort of like using a bit map.
This would enable you to test a cell like this:
if (bin[i*width+j])
{
    // Is an 'x'
}
else
{
    // Is an 'o'
}
Which might produce faster code. Again the profiler is your friend :)
This representation would also enable you to simply sum a set of characters to determine how many 'x's and 'o's there are.
int sum = 0;
for (int i = 0; i < 12; i++)
{
    sum += bin[i];
}
cout << "There are " << sum << " 'x's in the range" << endl;
Best of luck to you
Evil.
If you have 2 values for your basic type, I would first try to use bool. Then the compiler knows you have 2 values and might be able to optimize some things better.
Apart from that, add const where possible (for example the parameter of fits(bool const*, ...)).
I'd think about memory cache breaks. These functions run through sub-matrices inside a bigger matrix - I suppose one that is many times bigger in both width and height.
That means each sub-matrix line is contiguous memory, but moving between lines may jump across memory cache pages.
Consider laying out the big matrix's cells in memory in an order that keeps sub-matrix elements as close to each other as possible, instead of keeping a vector of contiguous full lines. The first option that comes to my mind is to break your big matrix recursively into matrices of size [2^i, 2^i], ordered { top-left, top-right, bottom-left, bottom-right }.
1)
i.e. if your matrix is size [X,Y], represented in an array of size X*Y, then element [x,y] is at position(x,y) in the array:
use this instead of (y*X+x):
unsigned position( unsigned rx, unsigned ry )
{
    unsigned x = rx;
    unsigned y = ry;
    unsigned part = 1;
    unsigned pos = 0;
    while( ( x != 0 ) || ( y != 0 ) ) {
        unsigned const lowest_bit_x = ( x % 2 );
        unsigned const lowest_bit_y = ( y % 2 );
        pos += ( ((2*lowest_bit_y) + lowest_bit_x) * part );
        x /= 2; //throw away lowest bit
        y /= 2;
        part *= 4; //size grows by square(2)
    }
    return pos;
}
I didn't check this code; it's just to explain what I mean.
If you need to, also try to find a faster way to implement it.
But note that the array you allocate will be bigger than X*Y: it has to be the smallest possible (2^(2*k)), and that would be wasteful unless X and Y are of roughly the same scale. It can be solved by further breaking the big matrix into squares first.
And then the cache benefits might outweigh the more complex position(x,y).
2) Then try to find the best way to run through the elements of a sub-matrix in fits() and place(). I'm not sure yet what it is, and it's not necessarily what you do now. Basically, a sub-matrix of size [x,y] should break into no more than y*log(x)*log(y) blocks that are contiguous in the array representation, but they all fit inside no more than 4 blocks of size 4*x*y. So finally, for sub-matrices smaller than a memory cache page, you'll get no more than 4 memory cache breaks, while your original code could break y times.

Determining template type when accessing OpenCV Mat elements

I'm using the following code to add some noise to an image (straight out of the OpenCV reference, page 449 -- explanation of cv::Mat::begin):
void
simulate_noise(Mat const &in, double stddev, Mat &out)
{
    cv::Size s = in.size();
    vector<double> noise = generate_noise(s.width*s.height, stddev);
    typedef cv::Vec<unsigned char, 3> V4;
    cv::MatConstIterator_<V4> in_itr = in.begin<V4>();
    cv::MatConstIterator_<V4> in_end = in.end<V4>();
    cv::MatIterator_<V4> out_itr = out.begin<V4>();
    cv::MatIterator_<V4> out_end = out.end<V4>();
    for (; in_itr != in_end && out_itr != out_end; ++in_itr, ++out_itr)
    {
        int noise_index = my_rand(noise.size());
        for (int j = 0; j < 3; ++j)
            (*out_itr)[j] = (*in_itr)[j] + noise[noise_index];
    }
}
Nothing overly complicated:
in and out are allocated cv::Mat objects of the same dimensions and type
iterate over the input image in
at each position, pick a random value from noise (my_rand(int n) returns a random number in [0..n-1])
sum the pixel from in with the random noise value
put the summation result into out
I don't like this code because the following statement seems unavoidable:
typedef cv::Vec<unsigned char, 3> V4;
It has hard-coded two things:
The images have 3 channels
The channel depth is 8bpp
If I get this typedef wrong (e.g. wrong channel depth or wrong number of channels), then my program segfaults. I originally used typedef cv::Vec<unsigned char, 4> V4 to handle images with an arbitrary number of channels (the max OpenCV supports is 4), but this caused a segfault.
Is there any way I can avoid hard-coding the two things above? Ideally, I want something that's as generic as:
typedef cv::Vec<in.type(), in.size()> V4;
I know this comes late. However, the real solution to your problem is to use OpenCV functionality to do what you want to do.
create the noise vector as you do already (or use the functions that OpenCV provides - hint!)
shuffle noise vector so you don't need individual noise_index for each pixel; or create vector of randomised noise beforehand
build a matrix header around your shuffled/random vector: cv::Mat_<double>(noise);
use matrix operations for computation: out = in + noise; or cv::add(in, noise, out);
PROFIT!
Another advantage of this method is that OpenCV might employ multithreading, SSE or whatever to speed up this massive-element operation, which your hand-written loop does not. Your code is simpler and cleaner, and OpenCV does all the nasty type handling for you.
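For illustration, an untested sketch of that approach (it uses cv::randn instead of your generate_noise helper, and goes through a float intermediate so negative noise isn't clipped before the final saturation):
#include <opencv2/core/core.hpp>

void simulate_noise(const cv::Mat& in, double stddev, cv::Mat& out)
{
    // Gaussian noise with the same size and channel count as the input.
    cv::Mat noise(in.size(), CV_32FC(in.channels()));
    cv::randn(noise, cv::Scalar::all(0), cv::Scalar::all(stddev));

    cv::Mat tmp;
    in.convertTo(tmp, CV_32F);      // widen so negative noise isn't lost
    tmp += noise;                   // whole-matrix add, element loop handled by OpenCV
    tmp.convertTo(out, in.type());  // convertTo saturates back to the original depth
}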
The problem is that you need to determine the type and number of channels at runtime, but templates need that information at compile time. You can avoid hardcoding the number of channels by either using cv::split and cv::merge, or by changing the iteration to
for(int row = 0; row < in.rows; ++row) {
    unsigned char* inp = in.ptr<unsigned char>(row);
    unsigned char* outp = out.ptr<unsigned char>(row);
    for (int col = 0; col < in.cols; ++col) {
        for (int c = 0; c < in.channels(); ++c) {
            *outp++ = *inp++ + noise();
        }
    }
}
If you want to get rid of the dependence on the type as well, I'd suggest putting the above in a templated function and calling it from your function, dispatching on the type of the matrix.
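Something along these lines, for illustration (untested; add_noise_impl and add_noise are names I made up, my_rand is the helper from the question, and only a few depths are handled):
#include <opencv2/core/core.hpp>
#include <stdexcept>
#include <vector>

int my_rand(int n);  // from the question: uniform random number in [0..n-1]

// Channel-count-agnostic inner loop; T is the per-channel element type.
template <typename T>
void add_noise_impl(const cv::Mat& in, cv::Mat& out, const std::vector<double>& noise)
{
    for (int row = 0; row < in.rows; ++row) {
        const T* inp = in.ptr<T>(row);
        T* outp = out.ptr<T>(row);
        for (int i = 0; i < in.cols * in.channels(); ++i)
            outp[i] = cv::saturate_cast<T>(inp[i] + noise[my_rand((int)noise.size())]);
    }
}

void add_noise(const cv::Mat& in, cv::Mat& out, const std::vector<double>& noise)
{
    switch (in.depth()) {  // runtime type selects the compile-time template
        case CV_8U:  add_noise_impl<unsigned char>(in, out, noise); break;
        case CV_16S: add_noise_impl<short>(in, out, noise);         break;
        case CV_32F: add_noise_impl<float>(in, out, noise);         break;
        default:     throw std::runtime_error("unsupported matrix depth");
    }
}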
They are hardcoded because performance is better that way.
In OpenCV 1.x there is cvGet2D(), which can be used here since a Mat can be cast to an IplImage.
But it's slow, since each time you access a pixel the function has to find out the type, size, etc. It's especially inefficient in loops.

Why is this code so slow?

So I have this function used to calculate statistics (min/max/std/mean). Now the thing is this runs generally on a 10,000 by 15,000 matrix. The matrix is stored as a vector<vector<int> > inside the class. Now creating and populating said matrix goes very fast, but when it comes down to the statistics part it becomes so incredibly slow.
E.g. reading all the pixel values of the GeoTIFF one pixel at a time (which involves a lot of complex math to properly georeference each pixel value to a corresponding point) takes around 30 seconds, while calculating the statistics of the entire matrix takes around 6 minutes.
void CalculateStats()
{
    //OHGOD
    double new_mean = 0;
    double new_standard_dev = 0;
    int new_min = 256;
    int new_max = 0;
    size_t cnt = 0;
    for(size_t row = 0; row < vals.size(); row++)
    {
        for(size_t col = 0; col < vals.at(row).size(); col++)
        {
            double mean_prev = new_mean;
            T value = get(row, col);
            new_mean += (value - new_mean) / (cnt + 1);
            new_standard_dev += (value - new_mean) * (value - mean_prev);
            // find new max/min's
            new_min = value < new_min ? value : new_min;
            new_max = value > new_max ? value : new_max;
            cnt++;
        }
    }
    stats_standard_dev = sqrt(new_standard_dev / (vals.size() * vals.at(0).size()) + 1);
    std::cout << stats_standard_dev << std::endl;
}
Am I doing something horrible here?
EDIT
To respond to the comments, T would be an int.
EDIT 2
I fixed my std algorithm, and here is the final product:
void CalculateStats(const std::vector<double>& ignore_values)
{
    //OHGOD
    double new_mean = 0;
    double new_standard_dev = 0;
    int new_min = 256;
    int new_max = 0;
    size_t cnt = 0;
    int n = 0;
    double delta = 0.0;
    double mean2 = 0.0;
    std::vector<double>::const_iterator ignore_begin = ignore_values.begin();
    std::vector<double>::const_iterator ignore_end = ignore_values.end();
    for(std::vector<std::vector<T> >::const_iterator row = vals.begin(), row_end = vals.end(); row != row_end; ++row)
    {
        for(std::vector<T>::const_iterator col = row->begin(), col_end = row->end(); col != col_end; ++col)
        {
            // This method of calculation is based on Knuth's algorithm.
            T value = *col;
            if(std::find(ignore_begin, ignore_end, value) != ignore_end)
                continue;
            n++;
            delta = value - new_mean;
            new_mean = new_mean + (delta / n);
            mean2 = mean2 + (delta * (value - new_mean));
            // Find new max/min's.
            new_min = value < new_min ? value : new_min;
            new_max = value > new_max ? value : new_max;
        }
    }
    // mean2 / (n - 1) is the sample variance; take the square root for the standard deviation
    stats_standard_dev = sqrt(mean2 / (n - 1));
    stats_min = new_min;
    stats_max = new_max;
    stats_mean = new_mean;
}
This still takes ~120-130 seconds to do this, but it's a huge improvement :)!
Have you tried to profile your code?
You don't even need a fancy profiler. Just stick some debug timing statements in there.
Anything I tell you would just be an educated guess (and probably wrong)
You could be getting lots of cache misses due to the way you're accessing the contents of the vector. You might want to cache some of the results of size(), but I don't know if that's the issue.
I just profiled it. 90% of the execution time was in this line:
new_mean += (value - new_mean) / (cnt + 1);
You should calculate the sum of values, min, max and count in the first loop,
then calculate the mean in one operation by dividing sum/count,
then in a second loop calculate std_dev's sum
That would probably be a bit faster.
The first thing I spotted is that you evaluate vals.at(row).size() inside the loop, which obviously isn't going to improve performance. The same applies to vals.size(), but of course the inner loop is worse. If vals is a vector of vectors, you'd better use iterators, or at least keep a reference to the outer vector (because get() with index parameters surely eats up quite some time as well).
This code sample is supposed to illustrate my intentions ;-)
for(TVO::const_iterator i=vals.begin(),ie=vals.end();i!=ie;++i) {
    for(TVI::const_iterator ii=i->begin(),iie=i->end();ii!=iie;++ii) {
        T value = *ii;
        // the rest
    }
}
First, change your row++ to ++row. A minor thing, but you want speed, so that will help.
Second, make your row < vals.size() comparison use a const value instead. The compiler doesn't know that vals won't change, so it has to play nice and always call size().
What is the 'get' method in the middle there? What does it do? That might be your real problem.
I'm not too sure about your std dev calculation. Take a look at the Wikipedia page on calculating variance in a single pass (they have a quick explanation of Knuth's algorithm, which is an expansion of a recursion relation).
It's slow because you're benchmarking debug code.
Building and running the code on Windows XP using VS2008:
In a Release build with the default optimisation level, the code in the OP runs in 2734 ms.
In a Debug build with the default of no optimisation, the code in the OP runs in a massive 398,531 ms.
In comments below you say you're not using optimisation, and this appears to make a big difference in this case - normally it's less than a factor of ten, but here it's over a hundred times slower.
I'm using VS2008 rather than 2005, but it's probably similar:
In the Debug build, there are two range checks on each access, each of which calls std::vector::size() using a non-inlined function call and requires a branch prediction. There is overhead involved both with function calls and with branches.
In the Release build, the compiler optimizes away the range checks (I don't know whether it just drops them, or does flow analysis based on the limits of the loop), and the vector access becomes a small amount of inline pointer arithmetic with no branches.
No-one cares how fast the debug build is. You should be unit testing the release build anyway, as that's the build which has to work correctly. Only use the Debug build when you can't get all the information you want by stepping through the Release build.
The code as posted runs in < 1.5 seconds on my PC with test data of 15000 x 10000 integers all equal to 42. You report that it runs about 230 times slower than that. Are you on a 10 MHz processor?
There are other suggestions for making it faster (such as moving it to use SSE, if all the values are representable using 8-bit types), but there's clearly something else making it slow.
On my machine, neither a version which hoisted a reference to the vector for the row and hoisted the size of the row, nor a version which used iterators, had any measurable benefit (with g++ -O3, using iterators takes 1511 ms repeatably; the hoisted and original versions both take 1485 ms). Not optimising means it runs in 7487 ms (original), 3496 ms (hoisted) or 5331 ms (iterators).
But unless you're running on a very low-power device, or are paging, or running non-optimised code with a debugger attached, it shouldn't be this slow, and whatever is making it slow is not likely to be the code you've posted.
( as a side note, if you test it with values with a deviation of zero your SD comes out as 1 )
There are far too many calculations in the inner loop:
For the descriptive statistics (mean, standard deviation) the only thing required is to compute the sum of value and the sum of squared value. From these two sums the mean and standard deviation can be computed after the outer loop (together with a third value, the number of samples - n in your new/updated code). The equations can be derived from the definitions or found on the web, e.g. Wikipedia. For instance the mean is just the sum of value divided by n. For the n version (in contrast to the n-1 version - however n is large in this case so it doesn't matter which one is used) the standard deviation is: sqrt(n * sumOfSquaredValue - sumOfValue * sumOfValue) / n. Thus only two floating-point additions and one multiplication are needed in the inner loop. Overflow is not a problem with these sums, as the range for doubles goes up to about 10^308. In particular you will get rid of the expensive floating-point division that the profiling reported in another answer revealed.
A lesser problem is that the minimum and maximum are rewritten every time (the compiler may or may not prevent this). As the minimum quickly becomes small and the maximum quickly becomes large, only the two comparisons should happen for the majority of loop iterations: use if statements instead to be sure. It can be argued, but on the other hand it is trivial to do.
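For illustration, a minimal sketch of that sums-only single pass (my own free function mirroring the question's vector<vector<T>> storage and the 0..255 value range; calculate_stats is a made-up name, not the poster's code):
#include <cmath>
#include <cstddef>
#include <vector>

// Sums-only single pass: no divisions inside the inner loop, everything derived afterwards.
template <typename T>
void calculate_stats(const std::vector<std::vector<T> >& vals,
                     double& mean, double& std_dev, int& mn, int& mx)
{
    double sum = 0.0, sumSq = 0.0;
    std::size_t n = 0;
    mn = 256;
    mx = 0;
    for (std::size_t row = 0, rows = vals.size(); row < rows; ++row)
    {
        const std::vector<T>& r = vals[row];          // hoist the inner vector once
        for (std::size_t col = 0, cols = r.size(); col < cols; ++col)
        {
            const T value = r[col];
            sum   += value;
            sumSq += double(value) * value;
            if (value < mn) mn = value;
            if (value > mx) mx = value;
            ++n;
        }
    }
    mean = sum / n;
    std_dev = std::sqrt(n * sumSq - sum * sum) / n;   // population ("n") form
}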
I would change how I access the data. Assuming you are using std::vector for your container you could do something like this:
vector<vector<T> >::const_iterator row;
vector<vector<T> >::const_iterator row_end = vals.end();
for(row = vals.begin(); row < row_end; ++row)
{
    vector<T>::const_iterator value;
    vector<T>::const_iterator value_end = row->end();
    for(value = row->begin(); value < value_end; ++value)
    {
        double mean_prev = new_mean;
        new_mean += (*value - new_mean) / (cnt + 1);
        new_standard_dev += (*value - new_mean) * (*value - mean_prev);
        // find new max/min's
        new_min = min(*value, new_min);
        new_max = max(*value, new_max);
        cnt++;
    }
}
The advantage of this is that in your inner loop you aren't consulting the outer vector, just the inner one.
If your container type were a list, this would be significantly faster, because the lookup time of get/operator[] is linear for a list and constant for a vector.
Edit: I moved the call to end() out of the loop.
Move the .size() calls to before each loop, and make sure you are compiling with optimizations turned on.
If your matrix is stored as a vector of vectors, then in the outer for loop you should directly retrieve the i-th vector, and then operate on that in the inner loop. Try that and see if it improves performance.
I'm not sure what type vals is, but vals.at(row).size() could take a long time if it itself iterates through the collection. Store that value in a variable; otherwise it could make the algorithm more like O(n³) than O(n²).
I think that I would rewrite it to use const iterators instead of row and col indexes. I would set up a const const_iterator for row_end and col_end to compare against, just to make certain it wasn't making function calls at every loop end.
As people have mentioned, it might be get(). If it accesses neighbors, for instance, you will totally smash the cache which will greatly reduce the performance. You should profile, or just think about access patterns.
Coming a bit late to the party here, but a couple of points:
You're effectively doing numerical work here. I don't know much about numerical algorithms, but I know enough to know that references and expert support are often useful. This discussion thread offers some references; and Numerical Recipes is a standard (if dated) work.
If you have the opportunity to redesign your matrix, you want to try using a valarray and slices instead of vectors of vectors; one advantage that immediately comes to mind is that you're guaranteed a flat linear layout, which makes cache pre-fetching and SIMD instructions (if your compiler can use them) more effective.
In the inner loop, you shouldn't be testing size, you shouldn't be doing any divisions, and iterators can also be costly. In fact, some unrolling would be good in there.
And, of course, you should pay attention to cache locality.
If you get the loop overhead low enough, it might make sense to do it in separate passes: one to get the sum (which you divide to get the mean), one to get the sum of squares (which you combine with the sum to get the variance), and one to get the min and/or max. The reason is to simplify what is in the inner unrolled loop so the compiler can keep stuff in registers.
I couldn't get the code to compile, so I couldn't pinpoint issues for sure.
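To illustrate the valarray point, a rough sketch of a flat layout with the stats computed in one linear sweep (my own code, not from the question; FlatMatrix is a made-up name):
#include <valarray>
#include <cmath>
#include <cstddef>

// Flat rows*cols storage instead of vector<vector<int>>; element (r,c) lives at r*cols + c.
struct FlatMatrix {
    std::size_t rows, cols;
    std::valarray<int> data;

    FlatMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}

    int& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }

    void CalculateStats(double& mean, double& std_dev, int& mn, int& mx) const
    {
        mn = data.min();                       // valarray provides min()/max() directly
        mx = data.max();
        double sum = 0.0, sumSq = 0.0;
        for (std::size_t i = 0; i < data.size(); ++i) {  // one linear sweep, cache friendly
            const double v = data[i];
            sum += v;
            sumSq += v * v;
        }
        const double n = double(data.size());
        mean = sum / n;
        std_dev = std::sqrt(n * sumSq - sum * sum) / n;
    }
};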
I have modified the algorithm to get rid of almost all of the floating-point division.
WARNING: UNTESTED CODE!!!
void CalculateStats()
{
    //OHGOD
    double accum_f;
    double accum_sq_f;
    double new_mean = 0;
    double new_standard_dev = 0;
    int new_min = 256;
    int new_max = 0;
    const int oku = 100000000;
    int accum_ichi = 0;
    int accum_oku = 0;
    int accum_sq_ichi = 0;
    int accum_sq_oku = 0;
    size_t cnt = 0;
    int v1 = 0;
    int v2 = 0;
    v1 = vals.size();
    for(size_t row = 0; row < v1; row++)
    {
        v2 = vals.at(row).size();
        for(size_t col = 0; col < v2; col++)
        {
            T value = get(row, col);
            accum_ichi += value;
            accum_sq_ichi += (value * value);
            // perform carries
            accum_oku += (accum_ichi / oku);
            accum_ichi %= oku;
            accum_sq_oku += (accum_sq_ichi / oku);
            accum_sq_ichi %= oku;
            // find new max/min's
            new_min = value < new_min ? value : new_min;
            new_max = value > new_max ? value : new_max;
            cnt++;
        }
    }
    // now, and only now, do we use floating-point arithmetic
    accum_f = (double)(oku) * (double)(accum_oku) + (double)(accum_ichi);
    accum_sq_f = (double)(oku) * (double)(accum_sq_oku) + (double)(accum_sq_ichi);
    new_mean = accum_f / (double)(cnt);
    // standard deviation formula from Wikipedia
    stats_standard_dev = sqrt((double)(cnt)*accum_sq_f - accum_f*accum_f)/(double)(cnt);
    std::cout << stats_standard_dev << std::endl;
}