UBLAS Matrix Finding Surrounding Values of a Cell? - c++

I am looking for an elegant way to implement this. I have an m x n matrix in which each cell holds a pixel value, and the rows and columns correspond to the pixel rows and pixel columns of the image.
Because I mapped points from an HDF file, along with their corresponding pixel values, there are a lot of empty pixels, which are filled with 0.
What I need to do is take the average of the surrounding cells to estimate a pixel value for each missing cell.
I can brute-force this, but it becomes ugly fast. Is there an elegant solution for this?

There's a well-known optimization to this filtering problem.
Integrate the cells in one direction (say horizontally).
Integrate the cells in the other direction (say vertically).
Take the difference between each cell and its N'th neighbor to the right.
Take the difference between each cell and its N'th neighbor below.
Like this:
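// Pass 1: running sum along each row.
for (i = 0; i < h; ++i)
    for (j = 0; j < w-1; ++j)
        A[i][j+1] += A[i][j];
// Pass 2: running sum down each column.
for (i = 0; i < h-1; ++i)
    for (j = 0; j < w; ++j)
        A[i+1][j] += A[i][j];
// Pass 3: difference with the cell N columns to the right.
for (i = 0; i < h; ++i)
    for (j = 0; j < w-N; ++j)
        A[i][j] = A[i][j+N] - A[i][j];
// Pass 4: difference with the cell N rows below.
for (i = 0; i < h-N; ++i)
    for (j = 0; j < w; ++j)
        A[i][j] = A[i+N][j] - A[i][j];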
What this does is:
The first pass makes each cell the sum of all of the cells on that row to its left, including itself.
After the 2nd pass, each cell is the sum of all of the cells in a rectangle above and to the left of itself (including its own row and column).
After the 3rd pass, each cell is the sum of a rectangle above and to the right of itself, N columns wide.
After the 4th pass each cell is the sum of an NxN rectangle below and to the right of itself.
This takes 4 operations per cell to compute the sum, as opposed to 8 for brute force (assuming you're doing a 3x3 averaging filter).
The cool thing is that if you use ordinary two's-complement arithmetic, you don't have to worry about any overflows in the first two passes; they cancel out in the last two passes.
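As a minimal sketch of the four passes above, here is a self-contained version. The names Grid and boxSumInPlace are made up for this illustration, and it writes each difference with an explicit sign so that every intermediate result is a plain (non-negated) sum. It assumes a rectangular grid and 1 <= n < min(h, w).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<long long>>;  // hypothetical alias: a[row][col]

// Four-pass in-place box sum. Afterwards, for i < h - n and j < w - n,
// a[i][j] holds the sum of the original n x n block covering rows
// i+1..i+n and columns j+1..j+n. Cells outside that range hold
// intermediate values and should be ignored.
void boxSumInPlace(Grid& a, std::size_t n) {
    const std::size_t h = a.size();
    const std::size_t w = h ? a[0].size() : 0;
    assert(n >= 1 && n < h && n < w);

    for (std::size_t i = 0; i < h; ++i)        // pass 1: running sum along each row
        for (std::size_t j = 0; j + 1 < w; ++j)
            a[i][j + 1] += a[i][j];
    for (std::size_t i = 0; i + 1 < h; ++i)    // pass 2: running sum down each column
        for (std::size_t j = 0; j < w; ++j)
            a[i + 1][j] += a[i][j];
    for (std::size_t i = 0; i < h; ++i)        // pass 3: difference n columns apart
        for (std::size_t j = 0; j + n < w; ++j)
            a[i][j] = a[i][j + n] - a[i][j];
    for (std::size_t i = 0; i + n < h; ++i)    // pass 4: difference n rows apart
        for (std::size_t j = 0; j < w; ++j)
            a[i][j] = a[i + n][j] - a[i][j];
}
```

Dividing a valid cell by n*n gives the block mean. To average only the non-empty pixels, one option (an assumption on my part, not something stated above) is to run the same passes on a 0/1 mask of non-zero pixels and divide the value sum by the mask sum.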

The main issues here are utilizing all available cores and cache efficiency.
You might be interested in checking a fast implementation of convolution.
However, since you are doing this with Boost, you can check how it is done in this Boost example.
I believe you only have to change the convolution kernel for your specialized task.

Related

Euclidean distance between each record with other records in an array

So I have an array [n][m], and I need to code in C++ the Euclidean distance between each row and every other row in the array, and store it in a new distance array [n][n] in which every cell's value is the distance between the two intersecting rows.
distance-array:
      r0   r1   .    .    rn
r0    0
r1         0
.               0
.                    0
rn                        0
The Euclidean distance between two rows (or two records) is computed as follows.
Assume we have these records:
r0: 1 8 7
r1: 2 5 3
r2
.
.
rn
Euclidean distance between r0 and r1 = sqrt((1-2)^2+(8-5)^2+(7-3)^2)
To code this I used 4 loops (which I think is too many), but I couldn't get it right. Can someone help me code this without using a 3-D array?
This is my code:
int norarr1[row][column] = { 1,1,1,2,2,2,3,3,3 };
int i = 0; int j = 0; int k = 0; int l = 0;
for (i = 0; i < column; i++){
    for(j = 0; j < column; j++){
        sumd = 0;
        for (k = 0; k < row; k++) {
            for (l = 0; l < row; l++) {
                dist = sqrt((norarr1[i][k] - norarr1[j][l]) ^ 2);
                sumd = sumd + dist;
                cout << "sumd =" << sumd << " ";
            }
            cout << endl;
        }
        disarr[j][i] = sumd;
        disarr[i][j] = sumd;
        cout << disarr[i][j];
    }
    cout << endl;
}
There are several problems with your code. For now, let's ignore the for loops. We'll get to that later.
The first thing is that ^ is the bitwise exclusive or (XOR) operator. It does not do exponentiation like in some other languages. Instead, you need to use std::pow().
Second, you are summing square roots, which is not the correct way to calculate Euclidean distance. Instead, you need to calculate a sum and then take the square root.
Now let's think about the for loops. Assume that you already know which two rows you want to calculate the distance between. Call these r1 and r2. Now you just need to pair one coordinate from r1 with one coordinate from r2. Note that these coordinates will always be in the same column. This means that you only need one loop to calculate the squares of the differences of each pair of coordinates. Then you sum these squares. Finally after this single loop you take the square root.
With that out of the way, we need to iterate over the rows to choose each r1 and r2. Okay, this will take two loops since we want each of these to take on the value of each row.
In total, we will need three for loops. You can make this easier to understand by designing your code well. For example, you can create a class or struct that holds each row. If you know that every row is only three dimensions, then create a point or vector3 class. Now you can write a function which calculates the distance between two points. Finally, store the list of points as a 1D array. In fact, breaking up the data and calculation in this way makes the previous discussion about calculating the distance even easier to understand.
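To make the three-loop structure concrete, here is a minimal sketch. The names Row, euclideanDistance and distanceMatrix are invented for this example, and it stores rows as std::vector<double> rather than the fixed-size int array from the question.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Row = std::vector<double>;  // hypothetical alias for one record

// Distance between two rows of equal length: sum the squared
// differences first, then take a single square root at the end.
double euclideanDistance(const Row& a, const Row& b) {
    assert(a.size() == b.size());
    double sumSq = 0.0;
    for (std::size_t c = 0; c < a.size(); ++c) {
        const double d = a[c] - b[c];
        sumSq += d * d;  // plain multiplication; ^ would be XOR
    }
    return std::sqrt(sumSq);
}

// n x n symmetric matrix of pairwise row distances (zero diagonal).
std::vector<Row> distanceMatrix(const std::vector<Row>& rows) {
    const std::size_t n = rows.size();
    std::vector<Row> dist(n, Row(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            dist[i][j] = dist[j][i] = euclideanDistance(rows[i], rows[j]);
    return dist;
}
```

The inner loop lives in euclideanDistance and the two outer loops pick the pair of rows. Starting j at i + 1 uses the symmetry of the distance so each pair is computed only once.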

How can I most efficiently map a kernel range for a hermitian (symmetric) matrix in OpenCL?

I'm working on an OpenCL project to generate very large hermitian (symmetric) matrices, and I am trying to determine the best way to generate the work IDs.
A hermitian matrix is symmetric along the diagonal, so that M(i,j) = M*(j,i).
In the brute force way, the for loop looks like:
for(int i = 0; i < N; i++)
{
    for(int j = 0; j < N; j++)
    {
        complex<float> result = doSomeCalculation();
        M(i,j) = result;
    }
}
However, taking advantage of the hermitian property, the loop can be made to be twice as efficient by only calculating the upper triangular part of the matrix and duplicating the result in the lower triangular part:
for(int i = 0; i < N; i++)
{
    for(int j = i; j < N; j++)
    {
        complex<float> result = doSomeCalculation();
        M(i,j) = result;
        M(j,i) = conj(result);
    }
}
In both loops, doSomeCalculation() is an expensive operation, and each entry in the matrix is completely uncorrelated from every other entry (i.e. the problem is stupidly parallel).
My question is this:
How can I implement the second loop with doSomeCalculation as an OpenCL kernel so that the thread IDs are most efficiently used (i.e. so that the thread calculates both M(i,j) and M(j,i) without having to call doSomeCalculation() twice)?
You need to use a linear index, for example you can index every element of your matrix in this way:
0   1   2     ...  N-1
*   N   N+1   ...  2N-2
*   *   2N-1  ...  3N-4
....
*   *   *     ...  N(N+1)/2-1
That is, the index k is given by:
k = i*N - i*(i+1)/2 + j
Where N is the size of the matrix and (i,j) are respectively the 0-based indices of the row and the column.
This relationship can be inverted; see the answer of this question, which I report here for completeness:
i = floor( ( 2*N+1 - sqrt( (2*N+1)*(2*N+1) - 8*k ) ) / 2 ) ;
j = k - N*i + i*(i+1)/2 ;
So you need to enqueue a 1D kernel with N(N+1)/2 work items, and you can decide by yourself the size of the workgroup (usually 64 items per work group is a good choice).
Then in the OpenCL code you can retrieve the index k by using:
int k = get_group_id(0)*64 + get_local_id(0);
Then use the two relationships above to recover the indices (i,j) of the matrix element you need to compute.
Moreover, notice that you can also save space by representing your hermitian matrix as a linear vector with N(N+1)/2 elements.
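For checking the two index formulas on the host before moving them into a kernel, a small sketch along these lines may help. The function names toLinear and fromLinear are made up here, and the correction steps after the square root are a defensive addition of mine against floating-point rounding, not part of the formulas above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

// (i, j) with 0 <= i <= j < n  ->  linear index k into the packed upper triangle.
std::int64_t toLinear(std::int64_t i, std::int64_t j, std::int64_t n) {
    assert(0 <= i && i <= j && j < n);
    return i * n - i * (i + 1) / 2 + j;
}

// Inverse mapping: linear index k -> (i, j).
std::pair<std::int64_t, std::int64_t> fromLinear(std::int64_t k, std::int64_t n) {
    assert(0 <= k && k < n * (n + 1) / 2);
    const double b = 2.0 * static_cast<double>(n) + 1.0;
    const double root = std::sqrt(b * b - 8.0 * static_cast<double>(k));
    auto i = static_cast<std::int64_t>(std::floor((b - root) / 2.0));

    // Index of the first (diagonal) element of row r.
    auto rowStart = [n](std::int64_t r) { return r * n - r * (r - 1) / 2; };
    // Defensive correction in case sqrt rounding put i one row off.
    while (i > 0 && rowStart(i) > k) --i;
    while (i + 1 < n && rowStart(i + 1) <= k) ++i;

    const std::int64_t j = k - n * i + i * (i + 1) / 2;
    return {i, j};
}
```

In a kernel, the same arithmetic would be applied to the work-item's k. Work-items with k >= N(N+1)/2 (which appear when the global size is rounded up to a multiple of the work-group size) would simply return early.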
If your matrices are really big, then you can dice up your NxN matrix into (N/k)x(N/k) tiles, each of size kxk. Since you need only half of the data, you create a 1D NDRange of size roughly local_group_size * (N/k)x(N/k)/2.
Every tile of the matrix is processed by one work-group (the work-group size is your choice). The idea is that you create an array on the host side that contains the position of every work-group in the matrix. The kernel stub should look as follows:
__kernel void myKernel(
    __global int* coords,
    ....)
{
    int2 WorkGroupPositionInMatrix = vload2(get_group_id(0), coords);
    ...
    DoCalculation();
    ...
    WriteResultTwice();
    ...
    return;
}
What you need to handle by hand is those work-groups that lie on the matrix diagonal. If the matrix is big, then the overhead for the work-groups on the diagonal is negligible.
A right triangle can be cut in half vertically and the smaller portion rotated to fit with the larger portion to form a rectangle of equal area. Therefore it is easy to make your triangular global work area into one that is rectangular, which fits OpenCL.
See my answer here: OpenCL efficient way to group a lower triangular matrix
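One possible mapping in that spirit is sketched below. It is not necessarily the one from the linked answer: it pairs row r of the triangle with row n-1-r so that together they fill one rectangle row of n+1 cells. The function name rectToUpperTriangle is hypothetical, and the sketch assumes n is even and that the upper triangle includes the diagonal.

```cpp
#include <cassert>
#include <utility>

// Maps cell (r, c) of an (n/2) x (n+1) rectangle to an upper-triangular
// entry (i, j), i <= j, of an n x n matrix. Assumes n is even.
std::pair<int, int> rectToUpperTriangle(int r, int c, int n) {
    assert(n % 2 == 0 && 0 <= r && r < n / 2 && 0 <= c && c <= n);
    if (c < n - r)
        return {r, r + c};      // first n - r cells cover row r of the triangle
    return {n - 1 - r, c - 1};  // remaining r + 1 cells cover row n - 1 - r
}
```

With this, a 2D global range of (n/2) x (n+1) work-items would touch every upper-triangular entry exactly once, with r and c taken from get_global_id(0) and get_global_id(1).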

Iterate through 2D Array block by block in C++

I'm working on a homework assignment for an image shrinking program in C++. My picture is represented by a 2D array of pixels; each pixel is an object with members "red", "green" and "blue." To solve the problem I am trying to access the 2D array one block at a time and then call a function which finds the average RGB value of each block and adds a new pixel to a smaller image array. The size of each block (or scale factor) is input by the user.
As an example, imagine a 100-item 2D array like myArray[10][10]. If the user input a shrink factor of 3, I would need to break out mini 2D arrays of size 3 by 3. I do not have to account for overflow, so in this example I can ignore the last row and the last column.
I have most of the program written, including the function to find the average color. I am confused about how to traverse the 2D array. I know how to cycle through a 2D array sequentially (one row at a time), but I'm not sure how to get little squares within an array.
Any help would be greatly appreciated!
Something like this should work:
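// Only full blocks are visited; a partial block at the right or bottom edge is skipped.
for(size_t bx = 0; bx + block_width <= width; bx += block_width)
    for(size_t by = 0; by + block_height <= height; by += block_height) {
        float sum = 0;
        for(size_t x = 0; x < block_width; ++x)
            for(size_t y = 0; y < block_height; ++y) {
                sum += array[bx + x][by + y];
            }
        float average = sum / (block_width * block_height);
        // One output cell per block, so divide the block origin by the block size.
        new_array[bx / block_width][by / block_height] = average;
    }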
Here width is the width of the whole image, and block_width is the side length of the blue squares in the diagram.
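For a version you can compile and test on its own, here is a rough sketch. It assumes a single-channel image stored as img[row][col] and a square block; for RGB pixels you would keep one sum per channel. The names Image and shrink are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Image = std::vector<std::vector<double>>;  // hypothetical: img[row][col], one channel

// Shrinks img by averaging each block x block square into one output cell.
// Rows and columns that do not fill a whole block are dropped.
Image shrink(const Image& img, std::size_t block) {
    assert(block >= 1 && !img.empty());
    const std::size_t outH = img.size() / block;
    const std::size_t outW = img[0].size() / block;
    Image out(outH, std::vector<double>(outW, 0.0));
    for (std::size_t by = 0; by < outH; ++by) {
        for (std::size_t bx = 0; bx < outW; ++bx) {
            double sum = 0.0;
            for (std::size_t y = 0; y < block; ++y)
                for (std::size_t x = 0; x < block; ++x)
                    sum += img[by * block + y][bx * block + x];
            out[by][bx] = sum / static_cast<double>(block * block);
        }
    }
    return out;
}
```

Looping over output cells (by, bx) instead of input offsets means the output index never needs a division, and the integer division in outH and outW takes care of ignoring the leftover row and column.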
This is how you traverse an array in C++:
for(i=0; i < m; i++) {
    for(j=0; j < n; j++) {
        // do something with myArray[i][j] where i represents the row and j the column
    }
}
I'll leave figuring out how to cycle through the array in different ways as an exercise for the reader.
You could use two nested loops, one for x and one for y, and move the start point of those loops across the image. As this is homework I won't put any code up, but you should be able to work it out.

Histogram approximation for streaming data

This question is a slight extension of the one answered here. I am working on re-implementing a version of the histogram approximation found in Section 2.1 of this paper, and I would like to get all my ducks in a row before beginning this process again. Last time, I used boost::multi_index, but performance wasn't the greatest, and I would like to avoid the logarithmic in number of buckets insert/find complexity of a std::set. Because of the number of histograms I'm using (one per feature per class per leaf node of a random tree in a random forest), the computational complexity must be as close to constant as possible.
A standard technique used to implement a histogram involves mapping the input real value to a bin number. To accomplish this, one method is to:
initialize a standard C array of size N, where N = number of bins; and
multiply the input value (real number) by some factor and floor the result to get its index in the C array.
This works well for histograms with uniform bin size, and is quite efficient. However, Section 2.1 of the above-linked paper provides a histogram algorithm without uniform bin sizes.
Another issue is that simply multiplying the input real value by a factor and using the resulting product as an index fails with negative numbers. To resolve this, I considered identifying a '0' bin somewhere in the array. This bin would be centered at 0.0; the bins above/below it could be calculated using the same multiply-and-floor method just explained, with the slight modification that the floored product be added to two or subtracted from two as necessary.
This then raises the question of merges: the algorithm in the paper merges the two closest bins, as measured from center to center. In practice, this creates a 'jagged' histogram approximation, because some bins would have extremely large counts and others would not. Of course, this is due to non-uniform-sized bins, and doesn't result in any loss of precision. A loss of precision does, however, occur if we try to normalize the non-uniform-sized bins to make them uniform. This is because of the assumption that m/2 samples fall to the left and right of the bin center, where m = bin count. We could model each bin as a Gaussian, but this will still result in a loss of precision (albeit minimal).
So that's where I'm stuck right now, leading to this major question: What's the best way to implement a histogram accepting streaming data and storing each sample in bins of uniform size?
Keep four variables.
int N; // assume for simplicity that N is even
int count[N];
double lower_bound;
double bin_size;
When a new sample x arrives, compute int i = floor((x - lower_bound) / bin_size). If i >= 0 && i < N, then increment count[i]. If i >= N, then repeatedly double bin_size until x - lower_bound < N * bin_size. On every doubling, adjust the counts (optimize this by exploiting sparsity for multiple doublings).
for (int j = 0; j < N / 2; j++) count[j] = count[2 * j] + count[2 * j + 1];
for (int j = N / 2; j < N; j++) count[j] = 0;
The case i < 0 is trickier, since we need to decrease lower_bound as well as increase bin_size (again, optimize for sparsity or adjust the counts in one step).
while (lower_bound > x) {
    lower_bound -= N * bin_size;
    bin_size += bin_size;
    for (int j = N - 1; j > N / 2 - 1; j--) count[j] = count[2 * j - N] + count[2 * j - N + 1];
    for (int j = 0; j < N / 2; j++) count[j] = 0;
}
The exceptional cases are expensive but happen only a logarithmic number of times in the range of your data over the initial bin size.
If you implement this in floating-point, be mindful that floating-point numbers are not real numbers and that statements like lower_bound -= N * bin_size may misbehave (in this case, if N * bin_size is much smaller than lower_bound). I recommend that bin_size be a power of the radix (usually two) at all times.
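Putting the pieces above together, a minimal sketch could look like this. The class name StreamingHistogram and its accessors are made up for illustration, it performs one doubling at a time without the sparsity optimization, and it assumes finite samples and a bin size that is a power of two so the bound arithmetic stays exact.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of a fixed-size histogram whose uniform bins double in width
// whenever a sample falls outside the covered range.
class StreamingHistogram {
public:
    // numBins must be even; binSize is ideally a power of two.
    StreamingHistogram(std::size_t numBins, double lowerBound, double binSize)
        : count_(numBins, 0), lower_(lowerBound), size_(binSize) {
        assert(numBins >= 2 && numBins % 2 == 0 && binSize > 0.0);
    }

    void add(double x) {
        assert(std::isfinite(x));
        const double n = static_cast<double>(count_.size());
        while (x < lower_) growDown();
        while (x - lower_ >= n * size_) growUp();
        auto i = static_cast<std::size_t>(std::floor((x - lower_) / size_));
        if (i >= count_.size()) i = count_.size() - 1;  // rounding guard at the top edge
        ++count_[i];
    }

    const std::vector<long>& counts() const { return count_; }
    double lowerBound() const { return lower_; }
    double binSize() const { return size_; }

private:
    // Keep lower_ fixed, double the bin width, merge neighbouring pairs.
    void growUp() {
        const std::size_t n = count_.size();
        for (std::size_t j = 0; j < n / 2; ++j)
            count_[j] = count_[2 * j] + count_[2 * j + 1];
        for (std::size_t j = n / 2; j < n; ++j) count_[j] = 0;
        size_ *= 2.0;
    }

    // Extend the range downward by its current width, double the bin
    // width, and merge the old bins into the upper half.
    void growDown() {
        const std::size_t n = count_.size();
        lower_ -= static_cast<double>(n) * size_;
        size_ *= 2.0;
        for (std::size_t j = n; j-- > n / 2;)
            count_[j] = count_[2 * j - n] + count_[2 * j - n + 1];
        for (std::size_t j = 0; j < n / 2; ++j) count_[j] = 0;
    }

    std::vector<long> count_;
    double lower_;
    double size_;
};
```

For example, starting with 4 bins of width 1 at lower bound 0 and adding 0.5, 1.5, 3.5, 5.0 and then -1.0 ends with lower bound -8, bin width 4, and counts 0, 1, 3, 1.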

How do you multiply a matrix by itself?

This is what I have so far, but I do not think it is right.
for (int i = 0 ; i < 5; i++)
{
    for (int j = 0; j < 5; j++)
    {
        matrix[i][j] += matrix[i][j] * matrix[i][j];
    }
}
Suggestion: if it's not homework, don't write your own linear algebra routines; use one of the many peer-reviewed libraries that are out there.
Now, about your code: if you want to do a term-by-term product, then you're doing it wrong. What you're doing is assigning to each value its square plus the original value (n*n+n or (1+n)*n, whichever you like best).
But if you want to do an authentic matrix multiplication in the algebraic sense, remember that you have to take the scalar product of the rows of the first matrix with the columns of the second matrix... something like:
for i in rows:
    for j in cols:
        result(i,j)=m(i,:)·m(:,j)
and the scalar product "·" is
v·w = sum(v(i)*w(i)) for all i in the range of the indices.
Of course, with this method you cannot do the product in place, because you'll need the values that you're overwriting in the next steps.
Also, explaining Tyler McHenry's comment a little further: as a consequence of having to multiply rows by columns, the inner dimensions of the matrices must match (if A is m x n and B is n x o, then A*B is m x o), so in your case a matrix can be squared only if it's square (he he he).
And if you just want to play a little bit with matrices, then you can try Octave, for example; squaring a matrix is as easy as M*M or M**2.
I don't think you can multiply a matrix by itself in-place.
for (i = 0; i < 5; i++) {
    for (j = 0; j < 5; j++) {
        product[i][j] = 0;
        for (k = 0; k < 5; k++) {
            product[i][j] += matrix[i][k] * matrix[k][j];
        }
    }
}
Even if you use a less naïve matrix multiplication (i.e. something other than this O(n^3) algorithm), you still need extra storage.
That's not any matrix multiplication definition I've ever seen. The standard definition is
for (i = 1 to m)
    for (j = 1 to n)
        result(i, j) = 0
        for (k = 1 to s)
            result(i, j) += a(i, k) * b(k, j)
to give the algorithm in a sort of pseudocode. In this case, a is an m x s matrix and b is an s x n matrix, the result is an m x n matrix, and subscripts begin with 1.
Note that multiplying a matrix in place is going to get the wrong answer, since you're going to be overwriting values before using them.
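As a rough C++ rendering of that pseudocode (0-based, with a hypothetical Matrix alias and function name multiply), the result goes into a separate matrix, which is what avoids the overwriting problem:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // hypothetical alias: m[row][col]

// Returns a * b for an m x s matrix a and an s x n matrix b.
// The inputs are never modified, so multiply(x, x) squares x safely.
Matrix multiply(const Matrix& a, const Matrix& b) {
    assert(!a.empty() && !b.empty() && a[0].size() == b.size());
    const std::size_t m = a.size();
    const std::size_t s = b.size();
    const std::size_t n = b[0].size();
    Matrix result(m, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < s; ++k)
                result[i][j] += a[i][k] * b[k][j];
    return result;
}
```

Squaring is then just multiply(matrix, matrix), and the assertion catches the case where the inner dimensions do not match.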
It's been too long since I've done matrix math (and I only did a little bit of it, on top), but the += operator takes the value of matrix[i][j] and adds to it the value of matrix[i][j] * matrix[i][j], which I don't think is what you want to do.
Well it looks like what it's doing is squaring the row/column, then adding it to the row/column. Is that what you want it to do? If not, then change it.