The OpenCV documentation says GpuMat only supports two-dimensional matrices, so I flatten a three-dimensional matrix into a two-dimensional one and use a loop to multiply the two matrices with gpu::gemm:
for (int k = 0; k < C; ++k) {
    gpu::gemm(matrix_a.colRange(Range(k * M, (k + 1) * M)),
              matrix_b.rowRange(Range(k * M, (k + 1) * M)),
              N[k + 1], matrix_c, 1, matrix_c);
}
Here M is the block size used to offset into each matrix, and matrix_a, matrix_b, and matrix_c are gpu::GpuMat.
As you can see, the loop is serial; what I would like to do is parallelize the whole operation. Does anyone have any suggestions?
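One way to collapse the loop, sketched below under assumptions the question doesn't confirm: since each iteration accumulates a block product into the same matrix_c, the block-matrix identity sum_k A[:, kM:(k+1)M] * B[kM:(k+1)M, :] = A * B means the whole loop is a single full-size multiplication, provided the per-slice weights N[k + 1] are folded into matrix_a first. This assumes matrix_a may be modified in place and that the scalar overload of gpu::multiply is available:

// Hypothetical sketch: fold each slice's weight into matrix_a, then one gemm.
for (int k = 0; k < C; ++k) {
    gpu::GpuMat block = matrix_a.colRange(k * M, (k + 1) * M);  // a view, no copy
    gpu::multiply(block, Scalar::all(N[k + 1]), block);         // scale the slice in place
}
gpu::gemm(matrix_a, matrix_b, 1, matrix_c, 1, matrix_c);        // one large multiply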
I am attempting to calculate a point cloud from an OpenCV Mat depth image and an intrinsic matrix. Currently I do it as follows (the K matrix values were extracted into fx, fy, cx, cy earlier):
for (int i = 0; i < depth.rows; i++)
{
    const float* row_ptr = depth.ptr<float>(i);
    for (int j = 0; j < depth.cols; j++)
    {
        // Only add valid depth points
        if (row_ptr[j] != 0)
        {
            const float x = (j - cx) * row_ptr[j] / fx;
            const float y = (i - cy) * row_ptr[j] / fy;
            pointcloud[cnt] = pcl::PointXYZ(x / 1000, y / 1000, row_ptr[j] / 1000); // mm -> m
            cnt++;
        }
    }
}
However, I am wondering whether it is possible to turn this into a matmul operation and use Eigen for better performance. I am aware that:
[x, y, z]^T = depth value * inv(K) * [u, v, 1]^T, with

        [fx   0  cx]
    K = [ 0  fy  cy]
        [ 0   0   1]
How would I go about turning this into a full matrix multiplication? My depth image is 1280x800, and obviously directly multiplying a 1280x800 matrix by a 3x3 and then a 3x1 won't work, so in what ways, if any, can this be done?
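One way to batch it, as a sketch rather than a drop-in solution: stack every pixel's homogeneous coordinate [u, v, 1] as a column of a 3xN matrix (N = 1280 * 800), apply inv(K) to all columns with a single multiplication, then scale each column by its depth. The names pixels, depths, and points below are my own; fx, fy, cx, cy, and depth come from the question, and zero-depth points would still need to be filtered out afterwards:

#include <Eigen/Dense>

// Intrinsic matrix from the extracted values.
Eigen::Matrix3f K;
K << fx, 0.f, cx,
     0.f, fy, cy,
     0.f, 0.f, 1.f;

const int N = depth.rows * depth.cols;
Eigen::Matrix3Xf pixels(3, N);   // each column is [u, v, 1]
Eigen::RowVectorXf depths(N);    // matching depth value per column
int idx = 0;
for (int i = 0; i < depth.rows; i++) {
    const float* row_ptr = depth.ptr<float>(i);
    for (int j = 0; j < depth.cols; j++, idx++) {
        pixels.col(idx) << (float)j, (float)i, 1.f;
        depths(idx) = row_ptr[j];
    }
}

// One 3x3 * 3xN multiplication for all rays, then a per-column depth scale
// (the division by 1000 converts millimeters to meters, as in the loop above).
Eigen::Matrix3Xf points = (K.inverse() * pixels) * (depths / 1000.f).asDiagonal();

Whether this beats the plain loop is worth measuring: the loop already does little arithmetic per pixel, so any win comes from vectorization and memory access patterns rather than from the matmul itself.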
In my program I have a function that performs the fast Fourier transform. I know there are very good implementations freely available, but this is a learning exercise, so I don't want to use those. I ended up finding this comment with the following implementation (it originated from the Italian Wikipedia entry on the FFT):
void transform(complex<double>* f, int N)  // assumes N is a power of two, N >= 4
{
    ordina(f, N);  // first: put the input into bit-reversed order
    complex<double>* W = (complex<double>*)malloc(N / 2 * sizeof(complex<double>));
    W[1] = polar(1., -2. * M_PI / N);  // the principal N-th root of unity
    W[0] = 1;
    for (int i = 2; i < N / 2; i++)
        W[i] = pow(W[1], i);  // cache the twiddle factors W_N^i
    int n = 1;
    int a = N / 2;
    for (int j = 0; j < log2(N); j++) {
        for (int k = 0; k < N; k++) {
            if (!(k & n)) {
                // Butterfly: combine f[k] and f[k + n] using a cached twiddle factor.
                complex<double> temp = f[k];
                complex<double> Temp = W[(k * a) % (n * a)] * f[k + n];
                f[k] = temp + Temp;
                f[k + n] = temp - Temp;
            }
        }
        n *= 2;
        a = a / 2;
    }
    free(W);
}
I've made a lot of changes by now, but this was my starting point. One of the changes I made was to not cache the twiddle factors, because I decided to see whether that was needed first. Now I've decided I do want to cache them. The way this implementation seems to do it is with the array W of length N/2, where index k holds the value W_N^k = e^(-2πik/N). What I don't understand is this expression:
W[(k * a) % (n * a)]
Note that n * a is always equal to N/2. I get that this is supposed to be equal to W_{2n}^k = e^(-2πik/(2n)), and I can see that W_N^(k·a) = W_{2n}^k, which this relies on. I also get that modulo can be used here because the twiddle factors are cyclic. But there's one thing I don't get: this is a length-N DFT, yet only N/2 twiddle factors are ever calculated. Shouldn't the array be of length N, with the modulo taken by N?
The twiddle factors are equally spaced points on the unit circle, and there is an even number of points because N is a power of two. After going around half of the circle (starting at 1 and going counterclockwise above the X-axis), the second half is a repeat of the first half, but below the X-axis: each point in the second half is the reflection through the origin of a point in the first half. That is why Temp is subtracted the second time; the subtraction is the negation of the twiddle factor.
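Written out, the symmetry that makes the half-length table sufficient is

W_N^(k + N/2) = e^(-2πi(k + N/2)/N) = e^(-iπ) · W_N^k = -W_N^k

so every twiddle factor with index in [N/2, N) is the negative of one already in the table, and the subtraction in f[k + n] = temp - Temp supplies exactly that sign.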
I'm trying to implement a Cholesky decomposition in Halide. Part of a common algorithm such as Crout's consists of an iteration over a triangular matrix: the diagonal elements of the decomposition are computed by subtracting a partial column sum from the diagonal element of the input matrix, where the column sum is calculated over the squared elements of a triangular part of the input matrix, excluding the diagonal element.
Using BLAS, the code in C++ would look as follows:
double* a;             /* input matrix, column-major, lda = n */
int n;                 /* dimension */
const int c__1 = 1;
const double c_b12 = 1.;
const double c_b10 = -1.;
for (int j = 0; j < n; ++j) {
    /* Diagonal: subtract the sum of squares of row j left of the diagonal. */
    double ajj = a[j + j * n] - ddot(&j, &a[j], &n, &a[j], &n);
    ajj = sqrt(ajj);
    a[j + j * n] = ajj;
    if (j < n - 1) {
        int i__2 = n - j - 1;
        /* Update, then scale, the column below the diagonal. */
        dgemv("No transpose", &i__2, &j, &c_b10, &a[j + 1], &n, &a[j], &n,
              &c_b12, &a[j + 1 + j * n], &c__1);
        double d__1 = 1. / ajj;
        dscal(&i__2, &d__1, &a[j + 1 + j * n], &c__1);
    }
}
My question is whether a pattern like this is expressible in Halide at all, and if so, what it would look like.
I think Andrew may have a more complete answer, but in the interest of a timely response, you can use an RDom predicate (introduced via RDom::where) to enumerate triangular regions (or their generalization to more dimensions). A sketch of the pattern is:
Halide::RDom triangular(0, extent, 0, extent);
triangular.where(triangular.x < triangular.y);
Then use triangular in a reduction.
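As a concrete sketch, the partial sum of squares from the question could be written roughly like this (the names sq_sum, j, k, A, and extent are mine; note that an RDom::where predicate is allowed to refer to the pure variable j of the update definition):

Halide::Func sq_sum("sq_sum");
Halide::Var j("j");
Halide::RDom k(0, extent);
k.where(k.x < j);  // only the elements strictly left of the diagonal

sq_sum(j) = 0.f;
sq_sum(j) += A(k.x, j) * A(k.x, j);  // squared row elements for the diagonal update

The diagonal entry of the decomposition is then sqrt(A(j, j) - sq_sum(j)).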
I once had a fast Cholesky written in Halide. Unfortunately I can't find the code. I put the outer loop in C and wrote a good block-panel update routine that operated on something like a 32-wide panel at a time. This was before Halide had triangular iteration, so maybe you can do better now.
Given an n-by-m matrix, I would like to build an n-sized vector containing the minimum of each matrix row, in CUDA.
So far I've come up with this:
__global__ void OnMin(float* Mins, const float* Matrix, const int n, const int m)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        // One thread scans one row serially.
        Mins[i] = Matrix[m * i];
        for (int j = 1; j < m; ++j) {
            if (Matrix[m * i + j] < Mins[i])
                Mins[i] = Matrix[m * i + j];
        }
    }
}
called in:
OnMin<<<(n + TPB - 1) / TPB, TPB>>>(Mins, Matrix, n, m);
However, I think something more optimized might exist.
I tried invoking cublasIsamin in a loop, but it is slower.
I also tried launching a child kernel from the OnMin kernel (dynamic parallelism), without success: building for sm_35/compute_35 raises compile errors, and my GTX 670 is only compute capability 3.0, while dynamic parallelism requires 3.5.
Any ideas? Thanks!
Finding the min of the rows of a row-major matrix is a parallel reduction question that has been discussed many times on Stack Overflow. For example, this one:
Reduce matrix rows with CUDA
The basic idea is to use n blocks in a grid. Each block contains a fixed number of threads, typically 256, and each block does a parallel reduction over a row of m elements to find the minimum collaboratively.
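A minimal sketch of that pattern (the kernel name and variables are mine; the block size is fixed at 256 threads to match the shared-memory buffer):

#include <cfloat>  // FLT_MAX

__global__ void RowMin(float* Mins, const float* Matrix, const int m)
{
    __shared__ float smem[256];
    const int row = blockIdx.x;  // one block per row
    float v = FLT_MAX;
    // Each thread strides over the row, folding its elements into a register.
    for (int j = threadIdx.x; j < m; j += blockDim.x)
        v = fminf(v, Matrix[row * m + j]);
    smem[threadIdx.x] = v;
    __syncthreads();
    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smem[threadIdx.x] = fminf(smem[threadIdx.x], smem[threadIdx.x + s]);
        __syncthreads();
    }
    if (threadIdx.x == 0)
        Mins[row] = smem[0];
}

// Launched with one block per row:
// RowMin<<<n, 256>>>(Mins, Matrix, m);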
For a matrix large enough to fully utilize the GPU, the performance upper bound is half of the time of copying the matrix once: the reduction reads each element exactly once, while a copy both reads and writes it.
I am trying to write my own code for a Gaussian pyramid using C++.
I tried both the reduce and expand equations as stated in http://persci.mit.edu/pub_pdfs/pyramid83.pdf, equations (1) and (2). However, my array index goes out of bounds when I try to access
[2i + m][2j + n] and [(i - m) / 2][(j - n) / 2], respectively.
My Gaussian kernel is a 5x5 matrix; g1Image is the original image reduced by one level, with both row and column dimensions half those of the original image.
My m and n range over -2 <= m, n <= 2, so when I access my Gaussian kernel I add 2 to the index, giving
w[m + 2][n + 2] * original_image[2i + m][2j + n]
I also tried setting m and n to 0 <= m, n <= 4, so the equation becomes
w[m][n] * original_image[2i + m][2j + n] or w[m][n] * original_image[2i + m - 2][2j + n - 2]
All of the mentioned variants go out of bounds.
w[m][n] * original_image[2i][2j] for the reduce equation and
w[m][n] * g1Image[i / 2][j / 2] for the expand equation do work, though.
However, the displayed image then shows no smoothing effect, which makes sense: m and n no longer offset the image index, so each output pixel samples a single input pixel.
Can anyone explain how I should set my image dimensions for each Gaussian pyramid reduction and expansion, and the boundaries of m and n?
I have solved the problem by adding a bounds check around the access (i and j are the output row and column indices):
index1 = (2 * i) + m;
index2 = (2 * j) + n;
if (index1 >= 0 && index1 < Height && index2 >= 0 && index2 < Width)
    temp = w[m + 2][n + 2] * original_image[index1][index2];
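Putting it together, the reduce step with that boundary check might look roughly like this (a sketch only: outHeight, outWidth, and reduced_image are my names, and the kernel is renamed g to avoid clashing with a width variable called w):

// Reduce one level: each output pixel is a 5x5 weighted sum of the input,
// sampled at twice the output coordinates; out-of-range taps are skipped.
for (int i = 0; i < outHeight; i++) {        // outHeight = Height / 2
    for (int j = 0; j < outWidth; j++) {     // outWidth  = Width / 2
        float sum = 0.0f;
        for (int m = -2; m <= 2; m++) {
            for (int n = -2; n <= 2; n++) {
                int r = 2 * i + m;
                int c = 2 * j + n;
                if (r >= 0 && r < Height && c >= 0 && c < Width)
                    sum += g[m + 2][n + 2] * original_image[r][c];
            }
        }
        reduced_image[i][j] = sum;
    }
}

Note that simply skipping out-of-range taps slightly darkens the borders; renormalizing by the sum of the weights actually used, or clamping r and c to the image, avoids that.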
More information at:
http://www.songho.ca/dsp/convolution/convolution.html