Count 'white' pixels in an OpenCV binary image (efficiently) - C++

I am trying to count all the white pixels in an OpenCV binary image. My current code is as follows:
whitePixels = 0;
for (int i = 0; i < height; ++i)
    for (int j = 0; j < width; ++j)
        if (binary.at<int>(i, j) != 0)
            ++whitePixels;
However, after profiling with gprof I've found that this is a very slow piece of code, and a large bottleneck in the program.
Is there a method which can compute the same value faster?

Use cv::countNonZero. Usually the OpenCV implementation of a task is heavily optimized.
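For reference, a minimal usage sketch (cv::countNonZero is declared in <opencv2/core.hpp> and expects a single-channel array; binary is the image from the question):
int whitePixels = cv::countNonZero(binary);   // counts all non-zero (white) pixels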

You can use parallel computing: divide the image into N parts, run your counting code in different threads, and then add up the partial result of each thread to obtain the final amount.
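As a rough sketch of the idea, using OpenMP as one possible way to split the rows between threads (this assumes binary is an 8-bit, single-channel image):
int whitePixels = 0;
#pragma omp parallel for reduction(+:whitePixels)
for (int i = 0; i < binary.rows; ++i)
{
    const uchar *row = binary.ptr<uchar>(i);   // each thread handles its own rows
    for (int j = 0; j < binary.cols; ++j)
        if (row[j] != 0)
            ++whitePixels;                     // partial counts are summed by the reduction
}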

The last pixel in a row is usually followed directly by the first pixel of the next row, so the whole image can be scanned with a single loop (C-style code; this relies on the matrix data being stored contiguously):
limit = width * height;
i = 0;
while (i < limit)
{
    if (binary.at<int>(0, i) != 0) ++whitePixels;
    ++i;
}
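Note that this trick only works when the rows are stored back to back with no padding. A hedged variant that checks cv::Mat::isContinuous() first (and assumes an 8-bit image) could look like this, falling back to the row-by-row loop otherwise:
if (binary.isContinuous())
{
    const uchar *p = binary.ptr<uchar>(0);   // all pixels form one flat block
    size_t limit = binary.total();
    for (size_t i = 0; i < limit; ++i)
        if (p[i] != 0)
            ++whitePixels;
}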

Actually, binary.at<int>(i, j) is a slow way to access pixels!
Here is a simple version that accesses them faster than yours:
for (int i = 0; i < height; ++i)
{
    uchar *pixel = binary.ptr<uchar>(i);   // pointer to the start of row i
    for (int j = 0; j < width; ++j)
    {
        if (pixel[j] != 0)
        {
            ++whitePixels;   // do your job here
        }
    }
}

Related

How to calculate Matrix efficiently in C++?

I am new to C++ and programming, so I think I am writing inefficient code.
I was wondering whether there is any way I can speed up the matrix calculation process.
For example, this is a sample code I wrote which finds the maximum difference (in absolute value) between the 3D arrays 'V' and 'Vnew'.
First, I take the subtraction.
Then, I put the value of tempdiff[0][0][0] into 'dif'.
Then, I compare 'dif' with tempdiff[i][j][k] and replace it if the latter is larger than the former.
This is just a part of my code, and there are lots of matrix calculations inside, so I have too many 'for' statements.
So I was wondering whether there is any way I could avoid using 'for' in the matrix calculations.
Thanks in advance.
for (int i = 0; i < Na; i++) {
    for (int j = 0; j < Nd; j++) {
        for (int k = 0; k < Ny; k++) {
            tempdiff[i][j][k] = abs(V[i][j][k] - Vnew[i][j][k]);
        }
    }
}

dif = tempdiff[0][0][0];

for (int i = 0; i < Na; i++) {
    for (int j = 0; j < Nd; j++) {
        for (int k = 0; k < Ny; k++) {
            if (tempdiff[i][j][k] > dif) {
                dif = tempdiff[i][j][k];
            }
            else {
                dif = dif;
            }
        }
    }
}
There's not much you can do with the for loops, as the maximum difference can be located anywhere. You have already succeeded in iterating the array in the correct, linear order.
Compilers are generally quite efficient in optimising, but they apparently fail to flatten a contiguous array, such as float V[Na][Nd][Ny];. After you flatten it manually to float V[Na*Nd*Ny], at least clang can auto-vectorise and produce SIMD code for x64 and arm.
A further optimisation is to avoid making this in two steps, as the total memory throughput is exactly doubled with the temporary array compared to a one-pass solution.
I was assuming your matrices are of type float -- if you can select int, gcc can auto-vectorise this as well (relates to NaN handling); furthermore int16_t or int8_t types are even quicker to evaluate, as more operations can be packed to a single SIMD instruction.
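As a minimal sketch of the flattened, one-pass version described above (this assumes V and Vnew have been allocated as flat float arrays of length Na*Nd*Ny, as suggested; std::fabs comes from <cmath>):
float dif = 0.0f;
for (int n = 0; n < Na * Nd * Ny; ++n)
{
    float d = std::fabs(V[n] - Vnew[n]);   // difference computed on the fly, no temporary array
    if (d > dif)
        dif = d;
}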

Implementation of sequential LU decomposition in C++

I am trying to follow the Gaussian elimination algorithm in https://courses.engr.illinois.edu/cs554/fa2015/notes/06_lu_8up.pdf in order to implement LU factorization and eventually parallelize it with OpenMP. Does the following algorithm look correct, where l is the multiplier and m is the matrix?
void decompose2(double **m) {
    begin = clock();
    int i = 0, j = 0, k = 0;
    for (k = 1; k < size - 1; k++)
    {
        for (i = k + 1; i < size; i++)
        {
            l[i][k] = m[i][k] / m[k][k];
        }
        for (j = k + 1; j < size; j++)
        {
            for (i = k + 1; k < size; k++)
            {
                m[i][j] = m[i][j] - (l[i][k] * m[k][j]);
            }
        }
    }
    end = clock();
}
I don't think it is correct, because the times I am getting after parallelization on the same number of processors are completely different from those reported in a different paper.
"Does the following algorithm look correct, …" -- No, because
arrays are 0-index in C++,
double[size][size] (which you are likely using) is not convertible to double**,
int is not a good type for iterators (use size_t instead),
you don't check if m[k][k] might be (close to) zero, when you might have to swap rows.
Please notice that I only looked at the obvious implementation errors, not at possible instances to make the code better, e.g. increasing the stability of the calculation.
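For reference, a minimal corrected sketch of the sequential factorization without pivoting (l, m and size are taken from the question; std::fabs comes from <cmath>; a full implementation would swap rows instead of skipping a tiny pivot):
void decompose2(double **m)
{
    const size_t n = static_cast<size_t>(size);
    for (size_t k = 0; k + 1 < n; ++k)
    {
        // partial pivoting (row swaps) would go here
        if (std::fabs(m[k][k]) < 1e-12)
            continue;

        for (size_t i = k + 1; i < n; ++i)
            l[i][k] = m[i][k] / m[k][k];          // multipliers for column k

        for (size_t j = k + 1; j < n; ++j)
            for (size_t i = k + 1; i < n; ++i)    // inner loop runs over i, not k
                m[i][j] -= l[i][k] * m[k][j];
    }
}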

Tetris shifting 2D array

I'm currently writing a Tetris clone in C++ and I am at the last stage: I need to delete a row when it is full. Once a piece lands, it is stored in a boolean array grid[20][10]. I check which row is full (all true); if it is, I call the method deleteRow, where n is the row number:
void Grid::deleteRow(int n)
{
    for (j = 0; j < WIDTH; j++)
    {
        grid[n][j] = false;
    }
}
Once the row is deleted I call a method moveRowDown:
void Grid::moveRowDown()
{
    for (i = 0; i < HEIGHT; i++)
    {
        for (j = 0; j < WIDTH; j++)
        {
            grid[i+1][j] = grid[i][j];
        }
    }
}
So this method does not work, and all of the pieces disappear. I know I am missing something in the logic. Thanks for the help in advance!
They disappear because you copy the (empty) 1st row into the 2nd, then the 2nd into the 3rd, and so on.
You need to rewrite the first loop in Grid::moveRowDown() to work from the bottom of the playfield upwards:
for (i = HEIGHT-2; i>=0; i--)
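Fleshing that out, one possible version goes a step further than the one-line fix (an assumption on my part): it takes the cleared row n as a parameter and only shifts the rows above it, which is what Tetris needs (grid and WIDTH are from the question):
void Grid::moveRowDown(int n)
{
    // start just above the cleared row and move upwards,
    // so each row is copied before it is overwritten
    for (int i = n - 1; i >= 0; i--)
        for (int j = 0; j < WIDTH; j++)
            grid[i + 1][j] = grid[i][j];

    // the topmost row becomes empty
    for (int j = 0; j < WIDTH; j++)
        grid[0][j] = false;
}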

Can race conditions lower the code's performance?

I'm running the following code for matrix multiplication the performance of which I'm supposed to measure:
for (int j = 0; j < COLUMNS; j++)
    #pragma omp for schedule(dynamic, 10)
    for (int k = 0; k < COLUMNS; k++)
        for (int i = 0; i < ROWS; i++)
            matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
Yes, I know it's really slow, but that's not the point - it's purely for performance measuring purposes. I'm running 3 versions of the code depending on where I put the #pragma omp directive, and therefore depending on where the parallelization happens. The code is run in Microsoft Visual Studio 2012 in release mode and profiled in CodeXL.
One thing I've noticed from the measurements is that the option in the code snippet (with parallelization before the k loop) is the slowest, then the version with the directive before the j loop, then the one with it before the i loop. The presented version is also the one which calculates a wrong result because of race conditions - multiple threads accessing the same cell of the result matrix at the same time. I understand why the i loop version is the fastest - all the particular threads process only part of the range of the i variable, increasing the temporal locality. However, I don't understand what causes the k loop version to be the slowest - does it have something to do with the fact that it produces the wrong result?
Of course race conditions can slow the code down. When two or more threads access the same part of memory (the same cache line), that part must be loaded into the cache of the given cores over and over again, as the other thread invalidates the content of the cache by writing into it. They compete for a shared resource.
When two variables located too close together in memory are written and read by different threads, it also results in a slowdown. This is known as false sharing. In your case it is even worse: they are not just too close, they actually coincide.
Your assumption is correct. But if we are talking about performance, and not just validating your assumption, there is more to the story.
The order of your indexes is a big issue, multi-threaded or not. Given that the distance between mat[x][y] and mat[x][y+1] is one, while the distance between mat[x][y] and mat[x+1][y] is dim(mat[x]), you want x to be the outer index and y the inner one, to minimise the distance between iterations. Given __[i][j] += __[i][k] * __[k][j];, you can see that the proper order for spatial locality is i -> k -> j.
Whatever the order, there is one value which can be saved for later. Given your snippet
for (int j = 0; j < COLUMNS; j++)
    for (int k = 0; k < COLUMNS; k++)
        for (int i = 0; i < ROWS; i++)
            matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
the value of matrix_b[k][j] will be fetched from memory ROWS times (once per iteration of i). You could have started with
for (int j = 0; j < COLUMNS; j++)
    for (int k = 0; k < COLUMNS; k++)
    {
        int temp = matrix_b[k][j];   // fetched once, reused across the whole i loop
        for (int i = 0; i < ROWS; i++)
            matrix_r[i][j] += matrix_a[i][k] * temp;
    }
But given that you are writing to matrix_r[i][j], that is the better access to optimise, since writing is slower than reading.
Unnecessary write accesses to memory
for (int i = 0; i < ROWS; i++)
    matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
will write to the memory of matrix_r[i][j] ROWS times. Using a temporary variable would reduce the accesses to one.
for (int i = 0; i < ...; i++)
    for (int j = 0; j < ...; j++)
    {
        int temp = 0;
        for (int k = 0; k < ...; k++)
            temp += matrix_a[i][k] * matrix_b[k][j];
        matrix_r[i][j] = temp;   // a single write per output element
    }
This decreases write accesses from n^3 to n^2.
Now you are using threads. To maximise the efficiency of multithreading, you should isolate each thread's memory accesses from the others as much as possible. One way to do that would be to give each thread a column and prefetch that column once. A simple way is to work with the transpose of matrix_b, such that
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j]; becomes
matrix_r[i][j] += matrix_a[i][k] * matrix_b_trans[j][k];
so that the innermost loop on k always deals with contiguous memory in both matrix_a and matrix_b_trans:
for (int i = 0; i < ROWS; i++)
    for (int j = 0; j < COLS; j++)
    {
        int temp = 0;
        for (int k = 0; k < SAMEDIM; k++)
            temp += matrix_a[i][k] * matrix_b_trans[j][k];
        matrix_r[i][j] = temp;
    }
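Putting the pieces together, a rough sketch that also parallelises the outer loop with OpenMP, so each thread writes to disjoint rows of matrix_r (names follow the question; COLUMNS is assumed to be the shared dimension, as in the original snippet, and matrix_b_trans is the transpose built once beforehand):
#pragma omp parallel for
for (int i = 0; i < ROWS; i++)
    for (int j = 0; j < COLUMNS; j++)
    {
        int temp = 0;
        for (int k = 0; k < COLUMNS; k++)
            temp += matrix_a[i][k] * matrix_b_trans[j][k];   // both reads are contiguous in k
        matrix_r[i][j] = temp;
    }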

OpenMP even/odd decomposition of a nested loop

I have a part in my code that could be done in parallel, so I started to read about OpenMP and worked through the introductory examples. Now I am trying to apply it to the following problem, schematically presented here:
Grid.h
class Grid
{
public:
    // has a grid member variable
    std::vector<std::vector<int>> grid2D;
    // modifies the components of grid2D; no push_back() etc. is used that could disturb the use of OpenMP
    void update_grid(int, int, int, int);
};
Test.h
class Test
{
public:
    Grid grid1;
    Grid grid2;
    void update();
    void repeat_update();
};
Test.cc
.
.
.
void Test::repeat_update() {
    for (int i = 0; i < 100000; i++)
        update();
}

void Test::update() {
    int colIndex = 0;
    int rowIndex = 0;
    int rowIndexPlusOne = rowIndex + 1;
    int colIndexPlusOne = colIndex + 1;

    // DIRECTION_X (grid[0].size()) and DIRECTION_Y (grid.size()) are the dimensions of the grid
    for (int i = 0; i < DIRECTION_Y; i++) {
        // periodic boundary conditions
        if (rowIndexPlusOne > DIRECTION_Y - 1)
            rowIndexPlusOne = 0;
        // The following could be done in parallel!!!
        for (int j = 0; j < DIRECTION_X - 1; j++) {
            grid1.update_grid(rowIndex, colIndex, rowIndexPlusOne, colIndexPlusOne);
            grid2.update_grid(rowIndex, colIndex, rowIndexPlusOne, colIndexPlusOne);
            colIndexPlusOne++;
            colIndex++;
        }
        colIndex = 0;
        colIndexPlusOne = 1;
        rowIndex++;
        rowIndexPlusOne++;
    }
}
.
.
.
The thing is, the updates done in Test::update(...) could be done in a parallel manner, since Grid::update_grid(...) only depends on the nearest neighbours of the grid. So, for example, in the inner loop multiple threads could do the work for colIndex = 0,2,4,... independently; that would be the even decomposition. After that, the odd indices colIndex = 1,3,5,... could be updated. Then the outer loop iterates one step forward and the updates in direction x could again be done in parallel. I have 16 cores at my disposal and the parallelization could be a nice time saving. But I totally don't have the perspective to see how this could be done, mainly because I don't know how to keep track of colIndex, rowIndex, etc., since #pragma omp parallel for is applied to the i, j indices. I would be grateful if somebody could show me the path out of the darkness.
Without knowing exactly what update_grid(int,int,int,int) does, it's kinda tricky to give a definitive answer. You show a nested pair of loops of the type
for(int i = 0; i < Y; i++)
{
for(int j = 0; j < X; j++)
{
//...
}
}
and assert that the j loop can be done in parallel. This would be an example of fine grained parallelism. You could alternatively parallelize the i loop, in what would be a more coarse grained parallelization. If the amount of work of each individual thread is roughly equal, the coarse graining method has the advantage of less overhead (assuming that the parallelization of the two loops is equivalent).
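Schematically (work(i, j) is just a hypothetical placeholder for the loop body):
// coarse-grained: one parallel region, the i iterations are split between threads
#pragma omp parallel for
for (int i = 0; i < Y; i++)
    for (int j = 0; j < X; j++)
        work(i, j);

// fine-grained: a parallel region (and its fork/join overhead) is entered for every i
for (int i = 0; i < Y; i++)
{
    #pragma omp parallel for
    for (int j = 0; j < X; j++)
        work(i, j);
}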
There are a few things that you have to be careful of when parallelizing the loops. For starters, you increment colIndexPlusOne and colIndex in the inner loop. If you have multiple threads and a single variable for colIndexPlusOne and colIndex, then each thread will increment the variable and/or have race conditions. You can bypass that in several manners, either giving each thread a copy of the variable, or making the increment atomic or critical, or by removing the dependency of the variable altogether and calculating what it should be for each step of the loop on the fly.
I would start with parallelizing the entire update function as such:
void Test::update()
{
    #pragma omp parallel
    {
        int colIndex = 0;
        int colIndexPlusOne = colIndex + 1;

        // DIRECTION_X (grid[0].size()) and DIRECTION_Y (grid.size()) are the dimensions of the grid
        #pragma omp for
        for (int i = 0; i < DIRECTION_Y; i++)
        {
            int rowIndex = i;
            int rowIndexPlusOne = rowIndex + 1;
            // periodic boundary conditions
            if (rowIndexPlusOne > DIRECTION_Y - 1)
                rowIndexPlusOne = 0;
            // The following could be done in parallel!!!
            for (int j = 0; j < DIRECTION_X - 1; j++)
            {
                grid1.update_grid(rowIndex, colIndex, rowIndexPlusOne, colIndexPlusOne);
                grid2.update_grid(rowIndex, colIndex, rowIndexPlusOne, colIndexPlusOne);
                // The following two can be replaced by j and j+1...
                colIndexPlusOne++;
                colIndex++;
            }
            colIndex = 0;
            colIndexPlusOne = 1;
            // No longer needed:
            // rowIndex++;
            // rowIndexPlusOne++;
        }
    }
}
By placing #pragma omp parallel at the beginning, all the variables declared inside are local to each thread. Also, at the beginning of the i loop I assigned rowIndex = i, since, at least in the code shown, that is the case. The same could be done for the j loop and colIndex.
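If you do want the even/odd (red-black) scheme from your question, a minimal sketch could look like the following, assuming update_grid really only touches a cell and its nearest neighbours, so two cells of the same parity never conflict:
for (int i = 0; i < DIRECTION_Y; i++)
{
    int rowIndex = i;
    int rowIndexPlusOne = (i + 1) % DIRECTION_Y;   // periodic boundary condition

    // first pass: even columns
    #pragma omp parallel for
    for (int j = 0; j < DIRECTION_X - 1; j += 2)
    {
        grid1.update_grid(rowIndex, j, rowIndexPlusOne, j + 1);
        grid2.update_grid(rowIndex, j, rowIndexPlusOne, j + 1);
    }

    // second pass: odd columns
    #pragma omp parallel for
    for (int j = 1; j < DIRECTION_X - 1; j += 2)
    {
        grid1.update_grid(rowIndex, j, rowIndexPlusOne, j + 1);
        grid2.update_grid(rowIndex, j, rowIndexPlusOne, j + 1);
    }
}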