Mat.inv() yielding all zeroes in opencv - c++

I have the following code:
cv::Mat temp0 = R.t();
cv::Mat temp1 = R * temp0;
cv::Mat temp2(960, 960, CV_32FC1);
temp2 = temp1.inv();
cv::Size s = temp1.size();
std::cout<<s.height<<" "<<s.width<<std::endl;
std::cout<<cv::format(temp1, "numpy" ) << std::endl;
std::cout<<cv::format(temp2, "numpy" ) << std::endl;
The transpose works correctly, and so does the matrix multiplication, so the Mat temp1 has a size of 960x960. However, when I do temp2 = temp1.inv(), I receive all zeros in temp2, that is, zeros in all of the 960x960 cells. R is of type CV_32FC1, so it is probably not a datatype issue. I cannot understand what is going wrong here, and a lot of searching has not helped. Can you please help?
EDIT
I am copying below the gdb output for the Mat::inv() function. I am having a hard time figuring it all out, but if someone is more familiar with OpenCV, maybe it will be of help :)
Breakpoint 1, CreateShares::ConstructShares (this=0x80556d0, channel=..., k=2, n=4) at CreateShares.cpp:165
165 temp2 = temp1.inv();
(gdb) step
cv::Mat::operator= (this=0xbffff294, e=...) at /usr/include/opencv2/core/mat.hpp:1373
1373 e.op->assign(e, *this);
(gdb)
1374 return *this;
(gdb) step
1375 }
(gdb) step
cv::MatExpr::~MatExpr (this=0xbfffef64, __in_chrg=<optimized out>) at /usr/include/opencv2/core/mat.hpp:1167
1167 class CV_EXPORTS MatExpr
(gdb) step
cv::Mat::~Mat (this=0xbfffefdc, __in_chrg=<optimized out>) at /usr/include/opencv2/core/mat.hpp:295
295 release();
(gdb) step
cv::Mat::release (this=0xbfffefdc) at /usr/include/opencv2/core/mat.hpp:381
381 if( refcount && CV_XADD(refcount, -1) == 1 )
(gdb) step
383 data = datastart = dataend = datalimit = 0;
(gdb) step
384 size.p[0] = 0;
(gdb) step
385 refcount = 0;
(gdb) step
386 }
(gdb) step
cv::Mat::~Mat (this=0xbfffefdc, __in_chrg=<optimized out>) at /usr/include/opencv2/core/mat.hpp:296
296 if( step.p != step.buf )
(gdb) step
298 }
(gdb) step
cv::Mat::~Mat (this=0xbfffefa4, __in_chrg=<optimized out>) at /usr/include/opencv2/core/mat.hpp:295
295 release();
(gdb) step
cv::Mat::release (this=0xbfffefa4) at /usr/include/opencv2/core/mat.hpp:381
381 if( refcount && CV_XADD(refcount, -1) == 1 )
(gdb) step
383 data = datastart = dataend = datalimit = 0;
(gdb) step
384 size.p[0] = 0;
(gdb) step
385 refcount = 0;
(gdb) step
386 }
(gdb) step
cv::Mat::~Mat (this=0xbfffefa4, __in_chrg=<optimized out>) at /usr/include/opencv2/core/mat.hpp:296
296 if( step.p != step.buf )
(gdb) step
298 }
(gdb) step
cv::Mat::~Mat (this=0xbfffef6c, __in_chrg=<optimized out>) at /usr/include/opencv2/core/mat.hpp:295
295 release();
(gdb) step
cv::Mat::release (this=0xbfffef6c) at /usr/include/opencv2/core/mat.hpp:381
381 if( refcount && CV_XADD(refcount, -1) == 1 )
(gdb) step
383 data = datastart = dataend = datalimit = 0;
(gdb) step
384 size.p[0] = 0;
(gdb) step
385 refcount = 0;
(gdb) step
386 }
(gdb) step
cv::Mat::~Mat (this=0xbfffef6c, __in_chrg=<optimized out>) at /usr/include/opencv2/core/mat.hpp:296
296 if( step.p != step.buf )
(gdb) step
298 }
(gdb) step
CreateShares::ConstructShares (this=0x80556d0, channel=..., k=2, n=4) at CreateShares.cpp:167
167 cv::Size s = temp1.size();
(gdb) step
cv::Mat::MSize::operator() (this=0xbffff284) at /usr/include/opencv2/core/mat.hpp:705
705 return Size(p[1], p[0]);
(gdb) step
cv::Size_<int>::Size_ (this=0xbffff2f8, _width=960, _height=960) at /usr/include/opencv2/core/operations.hpp:1624
1624 : width(_width), height(_height) {}
(gdb) step
cv::Mat::MSize::operator() (this=0xbffff284) at /usr/include/opencv2/core/mat.hpp:706
706 }
(gdb) step

Most likely, the determinant is zero.
From Wikipedia:
A square matrix that is not invertible is called singular or
degenerate. A square matrix is singular if and only if its determinant
is 0.
You can display the determinant like so...
std::cout<<"determinant(temp1)="<<cv::determinant(temp1)<<"\n";
From the documentation for Mat::inv(), there are three methods to choose from:
DECOMP_LU (default) is the LU decomposition. The matrix must be non-singular.
DECOMP_CHOLESKY is the Cholesky decomposition for symmetrical positively defined matrices only. This type is about twice faster than LU on big matrices.
DECOMP_SVD is the SVD decomposition. If the matrix is singular or even non-square, the pseudo inversion is computed.
From the documentation for invert(), which is presumably used internally by Mat::inv():
In the case of DECOMP_LU method, the function returns the src
determinant ( src must be square). If it is 0, the matrix is not
inverted and dst is filled with zeros.
This agrees with the results that you are seeing.
notes about the math
I'm no mathematician, but I get the impression that inverting a matrix can be a messy business -- all the more so if your matrix is very large. In fact, it may be true that these inverses exist in principle, but are practically impossible to calculate with any accuracy. In running some experiments with your code, I found that in many cases I would get determinants that were not exactly zero, but were very close to zero -- perhaps indicating that numerical precision may be the limiting factor. I tried specifying the matrices using 64-bit values instead of 32, and got different, but not necessarily better answers.
It may be useful to recognize that, based on the way you are calculating the temp1 matrix, it will always be symmetric. The DECOMP_CHOLESKY method is specifically designed to work on symmetric positive definite matrices, so using that might provide some advantages.
Experimentally, I found that normalizing (as suggested by @cedrou) makes it more likely that the inverse function returns a non-zero matrix (with DECOMP_LU but not with DECOMP_CHOLESKY). However, based on my guesses of how you might be initializing the R matrix, the resulting matrices never seemed to satisfy the definition of an inverse: A*inverse(A)=Identity. But you don't necessarily care about that -- which is perhaps why the SVD method computes a pseudo inverse.
Finally, it seems that this deeper question of why inversion is failing might be a math question rather than a programming question. Based on that I did some searching on the math site, and it turns out that someone has already asked how to do this very thing: https://math.stackexchange.com/questions/182662
notes about debugging
Based on your debug trace, I am inclined to think that the part that you are interested in was compiled into a non-traceable library and skipped over when you ran step. In other words, that mysterious blank line after your first step represents the part where it actually ran the inv() function. After that it is assigning the result to temp2 and destructing temporary objects. So your debug trace doesn't tell us anything about what is happening inside of inv().
165 temp2 = temp1.inv();
(gdb) step
cv::Mat::operator= (this=0xbffff294, e=...) at /usr/include/opencv2/core/mat.hpp:1373
1373 e.op->assign(e, *this);
I ran a debugger on this myself and was able to trace through the inner call to invert() and watch it decide to fail based on an internal analysis of the matrix (determining that it was not invertible) -- and therefore return a matrix filled with zeros, matching what you have reported.
The invert() function is defined in cxlapack.cpp, in case you are interested in taking a look at the source code.

For a random matrix R, the product R^T*R may be singular. In that case any kind of LU decomposition will stop prematurely, resulting in zero output.
To overcome this, one may invert the matrix R^T*R+alpha*I instead. Here I is the identity matrix and alpha is some positive number. If alpha is close to zero and R^T*R is not singular, the inverse of R^T*R+alpha*I is close to the inverse of R^T*R. For details, see Tikhonov regularization.
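A minimal sketch of the regularization idea, in plain Python on a 2x2 example rather than with OpenCV. The helper names (gram, det2, inv2, regularize, matmul) and the value alpha = 0.5 are made up for illustration. It follows the question's R * R^T ordering; the argument is the same for R^T*R. The Gram matrix of a rank-deficient R has a determinant of exactly zero, while adding alpha*I makes it invertible:

```python
def gram(R):
    """Return R * R^T for a matrix stored as a list of rows."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in R] for ri in R]

def det2(A):
    """Determinant of a 2x2 matrix."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def inv2(A):
    """Inverse of a 2x2 matrix; raises if the determinant is exactly zero."""
    d = det2(A)
    if d == 0:
        raise ZeroDivisionError("matrix is singular")
    return [[A[1][1] / d, -A[0][1] / d],
            [-A[1][0] / d, A[0][0] / d]]

def regularize(A, alpha):
    """Return A + alpha * I (the Tikhonov-regularized matrix)."""
    n = len(A)
    return [[A[i][j] + (alpha if i == j else 0.0) for j in range(n)]
            for i in range(n)]

def matmul(A, B):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# The second row of R is twice the first, so R * R^T is singular.
R = [[1.0, 2.0],
     [2.0, 4.0]]
G = gram(R)                  # [[5, 10], [10, 20]], determinant 0
G_reg = regularize(G, 0.5)   # [[5.5, 10], [10, 20.5]], determinant 12.75
G_reg_inv = inv2(G_reg)      # exists, unlike inv2(G)
```

Note that G_reg_inv is the inverse of the regularized matrix, not of G itself; how small alpha can be made in practice depends on the floating-point precision in use.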
Another case is the matrix R^T*R being non-singular but ill-conditioned. The condition number of a large unstructured matrix may be tremendous, resulting in weird behavior of matrix inversion (LU decomposition works correctly only for relatively small condition numbers).
About normalizing
Normalizing the matrix will improve inversion behavior because it decreases the condition number.

I came to the same conclusion as @berak. I hope the following experiments will help you.
I tried your code with a matrix filled with random values (normal distribution centered on 0 with a standard deviation of sigma). As sigma increases, the values of the final matrix decrease.
Here is the code I used:
cv::Mat R(960,960, CV_32FC1);
double sigma[] = { 1.0, 10.0, 100.0, 1000.0, 10000.0, 100000.0, 1000000.0, 10000000.0 };
cv::Scalar mean[_countof(sigma)] = {0};
cv::Scalar stdv[_countof(sigma)] = {0};
for (int i = 0; i < _countof(sigma); i++)
{
cv::randn(R, cv::Scalar::all(0.0), cv::Scalar::all(sigma[i]));
cv::Mat temp2 = (R * R.t()).inv();
cv::meanStdDev(temp2, mean[i], stdv[i]);
}
Here are the mean and standard dev of the output matrix for increasing sigmas:
sigma         mean         stddev
1.0           3.94e-004    1.32
10.0          1.25e-004    3.82e-002
100.0         3.32e-007    1.09e-004
1000.0        2.40e-009    2.23e-006
10000.0       9.82e-012    1.05e-008
100000.0      2.23e-013    1.73e-010
1000000.0     1.44e-015    2.88e-012
10000000.0    9.61e-017    2.77e-014
So, a solution for you would be to normalize your input matrix so that all the values fit into [0;1] or [-1;1].
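One identity that may help in reading the table above (a worked equation, not part of the experiment itself): multiplying R by a non-zero scalar s multiplies R R^T by s^2, so whenever R R^T is invertible its inverse is divided by s^2:

```latex
\left( (sR)(sR)^{T} \right)^{-1}
  = \left( s^{2}\, R R^{T} \right)^{-1}
  = \frac{1}{s^{2}} \left( R R^{T} \right)^{-1}
```

This is consistent with the entries of the inverse shrinking as sigma grows.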

Related

matrix manipulation when debugging Fortran with gdb

I want to use some intrinsic functions and matrix operations in debugging, but failed. It looks like this:
(gdb) p tau
$12 = (( 0, 0, 0) ( -0.23499999999999999, 0.23499999999999999, 0.25) )
(gdb) p alat
$13 = 10.2631
(gdb) p tau * alat
Argument to arithmetic operation not a number or boolean.
(gdb) p tau + 1.0
Argument to arithmetic operation not a number or boolean.
(gdb) p SUM(tau)
No symbol "SUM" in current context.
According to this post it seems there is no general way to use intrinsic functions, but there may be hacks for some specific ones.
Any suggestions on how to use SUM or do matrix operations in Fortran way? Many thanks.

combined Scharr derivatives in opencv

I have a few questions regarding Scharr derivatives and their OpenCV implementation.
I am interested in second-order image derivatives with 3x3 kernels.
I started with the Sobel second derivative, which failed to find some thin lines in the images. After reading the Sobel and Scharr comparison at the bottom of this page, I decided to try Scharr instead by changing this line:
Sobel(gray, grad, ddepth, 2, 2, 3, scale, delta, BORDER_DEFAULT);
to this line:
Scharr(img, gray, ddepth, 2, 2, scale, delta, BORDER_DEFAULT );
My problem is that cv::Scharr seems to allow only a first-order partial derivative along one axis at a time, so I get the following error:
error: (-215) dx >= 0 && dy >= 0 && dx+dy == 1 in function getScharrKernels
(see assertion line here)
Following this restriction, I have a few questions regarding Scharr derivatives:
Is it considered bad-practice to use high order Scharr derivatives? Why did OpenCV choose to assert dx+dy == 1?
If I am to call Scharr twice for each axis, What is the correct way to combine the results?
I am currently using:
addWeighted( abs_grad_x, 0.5, abs_grad_y, 0.5, 0, grad );
but I am not sure that this is how the Sobel function combines the two axes, or in what order it should be done for all 4 derivatives.
If I am to compute the (dx=2,dy=2) derivative by using 4 different kernels, I would like to reduce processing time by unifying all 4 kernels into 1 before applying it to the image (I assume that this is what cv::Sobel does). Is there a reasonable way to create such a combined Scharr kernel and convolve it with my image?
Thanks!
I've never read the original Scharr paper (the dissertation is in German) so I don't know the answer to why the Scharr() function doesn't allow higher order derivatives. Maybe because of the first point I make in #3 below?
The Scharr function is supposed to be a derivative. And the total differential of a multivariable function f(x) = f(x0, ..., xN) is
df = (df/dx0)*dx0 + ... + (df/dxN)*dxN
That is, the sum of the partials each multiplied by the change. In the case of images of course, the change dx in the input is a single pixel, so it's equivalent to 1. In other words, just sum the partials; not weighting them by half. You can use addWeighted() with 1s as the weights, or you can just sum them, but to make sure you won't saturate your image you'll need to convert to a float or 16-bit image first. However, it's also pretty common to compute the Euclidean magnitude of the derivatives, too, if you're trying to get the gradient instead of the derivative.
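As a rough illustration of the two ways of combining the partials mentioned above, here is a plain-Python sketch on three made-up pixel values. The function names and the numbers are hypothetical, and saturate_u8 only imitates what happens when a result is stored in an 8-bit image:

```python
import math

def combine_sum(grad_x, grad_y):
    """Sum of the two partial derivatives, element by element (weights of 1)."""
    return [gx + gy for gx, gy in zip(grad_x, grad_y)]

def combine_magnitude(grad_x, grad_y):
    """Euclidean magnitude of the gradient, element by element."""
    return [math.hypot(gx, gy) for gx, gy in zip(grad_x, grad_y)]

def saturate_u8(value):
    """Clamp to the 8-bit range, imitating storage in an unsigned 8-bit image."""
    return max(0, min(255, int(round(value))))

# Hypothetical derivative responses at three pixels, kept as floats.
grad_x = [3.0, 200.0, -30.0]
grad_y = [4.0, 100.0, 40.0]

total = combine_sum(grad_x, grad_y)            # second entry is 300.0
magnitude = combine_magnitude(grad_x, grad_y)  # always non-negative
clipped = [saturate_u8(v) for v in total]      # 300.0 no longer fits in 8 bits
```

The second pixel shows why the conversion to float (or 16-bit) matters: the sum exceeds 255 and would be clipped in an 8-bit image.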
However, that's just for the first-order derivative. For higher orders, you need to apply some chain ruling. See here for the details of combining a second order.
Note that an optimized kernel for first-order derivatives is not necessarily the optimal kernel for second-order derivatives by applying it twice. Scharr himself has a paper on optimizing second-order derivative kernels, you can read it here.
With that said, filters are split into x and y directions to make linear separable filters, which basically turn your 2d convolution problem into two 1d convolutions with smaller kernels. Think of the Sobel and Scharr kernels: for the x direction, they both just have the single column on either side with the same values (except one is negative). When you slide the kernel across the image, at the first location, you're multiplying the first column and the third column by the values in your kernel. And then two steps later, you're multiplying the third and the fifth. But the third was already computed, so that's wasteful. Instead, since both sides are the same, just multiply each column by the vector since you know you need those values, and then you can just look up the values for the results in column 1 and 3 and subtract them.
In short, I don't think you can combine them with built-in separable filter functions, because certain values are positive sometimes, and negative otherwise; and the only way to know when applying a filter linearly is to do them separately. However, we can examine the result of applying both filters and see how they affect a single pixel, construct the 2D kernel, and then convolve with OpenCV.
Suppose we have a 3x3 image:
image
=====
a b c
d e f
g h i
And we have the Scharr kernels:
kernel_x
========
-3 0 3
-10 0 10
-3 0 3
kernel_y
========
-3 -10 -3
0 0 0
3 10 3
The result of applying each kernel to this image gives us:
image * kernel_x
================
-3a +0b +3c
-10d +0e +10f
-3g +0h +3i
image * kernel_y
================
-3a -10b -3c
+0d +0e +0f
+3g +10h +3i
These values are summed and placed into pixel e. Since the sum of both of these is the total derivative, we sum all these values into pixel e at the end of the day.
image * kernel_x + image * kernel_y
===================================
-3a -10b -3c +3g +10h +3i
-3a +3c -10d +10f -3g +3i
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
-6a -10b +0c -10d +10f +0g +10h +6i
And this is the same result we'd have gotten if we multiplied by the kernel
kernel_xy
=============
-6 -10 0
-10 0 10
0 10 6
So there's a 2D kernel that does a first-order derivative. Notice anything interesting? It's just the addition of the two kernels. Is that surprising? Not really, as x(a+b) = ax + bx. Now we can pass that into filter2D() to compute the addition of the derivatives. Does that actually give the same result?
import cv2
import numpy as np
img = cv2.imread('cameraman.png', 0).astype(np.float32)
kernel = np.array([[-6, -10, 0],
[-10, 0, 10],
[0, 10, 6]])
total_first_derivative = cv2.filter2D(img, -1, kernel)
scharr_x = cv2.Scharr(img, -1, 1, 0)
scharr_y = cv2.Scharr(img, -1, 0, 1)
print((total_first_derivative == (scharr_x + scharr_y)).all())
True
Yep. Now I guess you can just do it twice.

C++ Eigen lib - eig(A,B) function computation time

I just wanted to ask if someone could tell me the reason behind why my eig function computation time is different for the same size of matrices?
I am performing a lot of eig(A,B) operations on 4x4, 4x4 matrices to get eigenvalues and eigen vectors.
It happens 557 * 4 * 267 times.
std::pair<MatrixXcd, VectorXd> eig(const MatrixXcd& A, const MatrixXcd& B)
{
Eigen::GeneralizedSelfAdjointEigenSolver<MatrixXcd> solver(A, B);
MatrixXcd V = solver.eigenvectors();
VectorXd D = solver.eigenvalues();
return std::make_pair(V, D);
}
(...)
for (int t = 0; t < 557; t++)
{
(...)
for(int sig = 0; sig < 4; sig++)
{
for(int f = 0; f < 267; f++)
{
// calculation of Xi and Xfs
mRt = mXfs * mXfs.adjoint();
mRj = mXfi * mXfi.adjoint();
eValVec = eig(mRt, mRj);
(...)
}
(...)
}
}
For t<=7 eig computes 4 * 267 iterations in around 3 sec.
For t>7 it suddenly becomes really fast and computes 4 * 267 iterations in ~0.1 sec.
I made sure that it's eig that slows down my program by commenting out the other parts of the code and leaving only that function. I always use 4x4 matrices.
An important detail might be that all values of the Rj matrix for t<=7 equal 0. So it's like:
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
while Rx always has values. For t>7, Rj also gains non-zero values and the computation becomes faster. So might the eig function have a problem dealing with zeros? I don't change anything related to the eig function; only the numbers in the 4x4 matrices I pass to it change. Still, the computation time differs, and this lag never happens afterwards.
Does anyone have an idea how I could fix it, debug it, or make it more stable? I am doing processing in real time, and 3 sec is far too long for me.

Advice on CUDA algorithm to sum columns of a matrix [duplicate]

Windows 7, NVidia GeForce 425M.
I wrote a simple CUDA code which calculates the row sums of a matrix.
The matrix has uni-dimensional representation (pointer to a float).
The serial version of code is below (it has 2 loops, as expected):
void serial_rowSum (float* m, float* output, int nrow, int ncol) {
float sum;
for (int i = 0 ; i < nrow ; i++) {
sum = 0;
for (int j = 0 ; j < ncol ; j++)
sum += m[i*ncol+j];
output[i] = sum;
}
}
Inside the CUDA code, I call the kernel function sweeping the matrix by rows. Below, the kernel call snippet:
dim3 threadsPerBlock((unsigned int) nThreadsPerBlock); // has to be multiple of 32
dim3 blocksPerGrid((unsigned int) ceil(nrow/(float) nThreadsPerBlock));
kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);
and the kernel function which performs the parallel sum of the rows (still has 1 loop):
__global__ void kernel_rowSum(float *m, float *s, int nrow, int ncol) {
int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
if (rowIdx < nrow) {
float sum=0;
for (int k = 0 ; k < ncol ; k++)
sum+=m[rowIdx*ncol+k];
s[rowIdx] = sum;
}
}
So far so good. The serial and parallel (CUDA) results are equal.
The whole point is that the CUDA version takes almost twice the time of the serial one to compute, even if I change the nThreadsPerBlock parameter: I tested nThreadsPerBlock from 32 to 1024 (maximum number of threads per block allowed for my card).
IMO, the matrix dimension is big enough to justify parallelization: 90,000 x 1,000.
Below, I report the time elapsed for the serial and parallel versions using different nThreadsPerBlock. Time reported in msec over an average of 100 samples:
Matrix: nrow = 90000 x ncol = 1000
Serial: Average Time Elapsed per Sample in msec (100 samples): 289.18.
CUDA (32 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 497.11.
CUDA (1024 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 699.66.
Just in case, the version with 32/1024 nThreadsPerBlock is the fastest/slowest one.
I understand that there is a kind of overhead when copying from Host to Device and the other way around, but maybe the slowness is because I am not implementing the fastest code.
Since I am far from being a CUDA expert:
Am I coding the fastest version for this task? How could I improve my code?
Can I get rid of the loop in the kernel function?
Any thoughts appreciated.
EDIT 1
Although I describe a standard rowSum, I am interested in the AND/OR operation of rows which have {0;1} values, like rowAND/rowOR. That said, it doesn't allow me to exploit the cuBLAS trick of multiplying by a column vector of 1's, as suggested by some commenters.
EDIT 2
As suggested by other users and endorsed here:
FORGET ABOUT TRYING TO WRITE YOUR OWN FUNCTIONS, use Thrust library instead and the magic comes.
Since you mentioned that you need a general reduction algorithm rather than sum only, I will try to give 3 approaches here. The kernel approach may have the highest performance. The Thrust approach is the easiest to implement. The cuBLAS approach works only with sum and has good performance.
Kernel Approach
Here's a very good doc introducing how to optimize standard parallel reduction. Standard reduction can be divided into 2 stages.
Multiple thread blocks each reduce one part of the data;
One thread block reduces the results of stage 1 to the final 1 element.
For your multi-reduction (reduce rows of a matrix) problem, stage 1 alone is enough. The idea is to reduce 1 row per thread block. For further considerations like multiple rows per thread block or 1 row per multiple thread blocks, you can refer to the paper provided by @Novak. This may improve the performance further, especially for matrices with a bad shape.
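To make the "one row per thread block" idea concrete, here is a small sequential simulation in plain Python (not CUDA). The names block_reduce and reduce_rows are made up for this sketch, and the list buf merely stands in for shared memory. The operator is a parameter, so the same structure covers sum as well as the rowAND/rowOR case from the question, assuming the operator is associative and commutative:

```python
from operator import add, and_, or_

def block_reduce(row, op):
    """Pairwise (tree) reduction of one non-empty row, as one thread block would do it.

    `op` must be associative and commutative (sum, AND, OR, min, max, ...).
    """
    buf = list(row)            # stands in for the block's shared memory
    active = len(buf)
    while active > 1:
        half = (active + 1) // 2
        # On a GPU every `tid` below would be its own thread, with a
        # barrier between passes; here the passes simply run in sequence.
        for tid in range(active // 2):
            buf[tid] = op(buf[tid], buf[tid + half])
        active = half
    return buf[0]

def reduce_rows(m, nrow, ncol, op):
    """Reduce every row of a row-major matrix stored as a flat list."""
    return [block_reduce(m[r * ncol:(r + 1) * ncol], op) for r in range(nrow)]

# A 3 x 5 matrix of 0/1 values in row-major order.
m = [1, 1, 1, 0, 1,
     0, 0, 0, 0, 0,
     1, 1, 1, 1, 1]
row_sum = reduce_rows(m, 3, 5, add)
row_and = reduce_rows(m, 3, 5, and_)
row_or = reduce_rows(m, 3, 5, or_)
```

Each pass halves the number of active elements, so a row of ncol elements needs about log2(ncol) passes instead of the ncol steps of the per-thread loop.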
Thrust Approach
General multi-reduction can be done with thrust::reduce_by_key in a few minutes. You can find some discussion in Determining the least element and its position in each matrix column with CUDA Thrust.
However, thrust::reduce_by_key does not assume that each row has the same length, so you will pay a performance penalty. Another post, How to normalize matrix columns in CUDA with max performance?, gives a profiling comparison between thrust::reduce_by_key and the cuBLAS approach on the sum of rows. It may give you a basic understanding of the performance.
cuBLAS Approach
The sum of rows/cols of a matrix A can be seen as a matrix-vector multiplication where the elements of the vector are all ones. It can be represented by the following MATLAB code.
y = A * ones(size(A,2),1);
where y is the sum of rows of A.
The cuBLAS library provides a high-performance matrix-vector multiplication function cublas<t>gemv() for this operation.
Timing results show that this routine is only 10~50% slower than simply reading all the elements of A once, which can be seen as the theoretical upper limit of the performance for this operation.
Reducing the rows of a matrix can be solved by using CUDA Thrust in three ways (they may not be the only ones, but addressing this point is out of scope). As also recognized by the same OP, using CUDA Thrust is preferable for such a kind of problem. Also, an approach using cuBLAS is possible.
APPROACH #1 - reduce_by_key
This is the approach suggested at this Thrust example page. It includes a variant using make_discard_iterator.
APPROACH #2 - transform
This is the approach suggested by Robert Crovella at CUDA Thrust: reduce_by_key on only some values in an array, based off values in a “key” array.
APPROACH #3 - inclusive_scan_by_key
This is the approach suggested by Eric at How to normalize matrix columns in CUDA with max performance?.
APPROACH #4 - cublas<t>gemv
It uses cuBLAS gemv to multiply the relevant matrix by a column of 1's.
THE FULL CODE
Here is the code condensing the four approaches. The Utilities.cu and Utilities.cuh files are maintained here and omitted. The TimingGPU.cu and TimingGPU.cuh files are maintained here and are omitted as well.
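#include <cublas_v2.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <thrust/sequence.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <stdio.h>
#include <iostream>
#include "Utilities.cuh"
#include "TimingGPU.cuh"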
// --- Required for approach #2
__device__ float *vals;
/**************************************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX - NEEDED FOR APPROACH #1 */
/**************************************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {
T Ncols; // --- Number of columns
__host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}
__host__ __device__ T operator()(T i) { return i / Ncols; }
};
/******************************************/
/* ROW_REDUCTION - NEEDED FOR APPROACH #2 */
/******************************************/
struct row_reduction {
const int Ncols; // --- Number of columns
row_reduction(int _Ncols) : Ncols(_Ncols) {}
__device__ float operator()(float& x, int& y ) {
float temp = 0.f;
for (int i = 0; i<Ncols; i++)
temp += vals[i + (y*Ncols)];
return temp;
}
};
/**************************/
/* NEEDED FOR APPROACH #3 */
/**************************/
template<typename T>
struct MulC: public thrust::unary_function<T, T>
{
T C;
__host__ __device__ MulC(T c) : C(c) { }
__host__ __device__ T operator()(T x) { return x * C; }
};
/********/
/* MAIN */
/********/
int main()
{
const int Nrows = 5; // --- Number of rows
const int Ncols = 8; // --- Number of columns
// --- Random uniform integer distribution between 10 and 99
thrust::default_random_engine rng;
thrust::uniform_int_distribution<int> dist(10, 99);
// --- Matrix allocation and initialization
thrust::device_vector<float> d_matrix(Nrows * Ncols);
for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist(rng);
TimingGPU timerGPU;
/***************/
/* APPROACH #1 */
/***************/
timerGPU.StartCounter();
// --- Allocate space for row sums and indices
thrust::device_vector<float> d_row_sums(Nrows);
thrust::device_vector<int> d_row_indices(Nrows);
// --- Compute row sums by summing values with equal row indices
//thrust::reduce_by_key(thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)),
// thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
// d_matrix.begin(),
// d_row_indices.begin(),
// d_row_sums.begin(),
// thrust::equal_to<int>(),
// thrust::plus<float>());
thrust::reduce_by_key(
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
d_matrix.begin(),
thrust::make_discard_iterator(),
d_row_sums.begin());
printf("Timing for approach #1 = %f\n", timerGPU.GetCounter());
// --- Print result
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "] = " << d_row_sums[i] << "\n";
}
/***************/
/* APPROACH #2 */
/***************/
timerGPU.StartCounter();
thrust::device_vector<float> d_row_sums_2(Nrows, 0);
float *s_vals = thrust::raw_pointer_cast(&d_matrix[0]);
gpuErrchk(cudaMemcpyToSymbol(vals, &s_vals, sizeof(float *)));
thrust::transform(d_row_sums_2.begin(), d_row_sums_2.end(), thrust::counting_iterator<int>(0), d_row_sums_2.begin(), row_reduction(Ncols));
printf("Timing for approach #2 = %f\n", timerGPU.GetCounter());
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "] = " << d_row_sums_2[i] << "\n";
}
/***************/
/* APPROACH #3 */
/***************/
timerGPU.StartCounter();
thrust::device_vector<float> d_row_sums_3(Nrows, 0);
thrust::device_vector<float> d_temp(Nrows * Ncols);
thrust::inclusive_scan_by_key(
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
d_matrix.begin(),
d_temp.begin());
thrust::copy(
thrust::make_permutation_iterator(
d_temp.begin() + Ncols - 1,
thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))),
thrust::make_permutation_iterator(
d_temp.begin() + Ncols - 1,
thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))) + Nrows,
d_row_sums_3.begin());
printf("Timing for approach #3 = %f\n", timerGPU.GetCounter());
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "] = " << d_row_sums_3[i] << "\n";
}
/***************/
/* APPROACH #4 */
/***************/
cublasHandle_t handle;
timerGPU.StartCounter();
cublasSafeCall(cublasCreate(&handle));
thrust::device_vector<float> d_row_sums_4(Nrows);
thrust::device_vector<float> d_ones(Ncols, 1.f);
float alpha = 1.f;
float beta = 0.f;
cublasSafeCall(cublasSgemv(handle, CUBLAS_OP_T, Ncols, Nrows, &alpha, thrust::raw_pointer_cast(d_matrix.data()), Ncols,
thrust::raw_pointer_cast(d_ones.data()), 1, &beta, thrust::raw_pointer_cast(d_row_sums_4.data()), 1));
printf("Timing for approach #4 = %f\n", timerGPU.GetCounter());
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "] = " << d_row_sums_4[i] << "\n";
}
return 0;
}
TIMING RESULTS (tested on a Kepler K20c)
Matrix size    #1      #1-v2   #2      #3      #4      #4 (no plan)
100 x 100      0.63    1.00    0.10    0.18    139.4   0.098
1000 x 1000    1.25    1.12    3.25    1.04    101.3   0.12
5000 x 5000    8.38    15.3    16.05   13.8    111.3   1.14
100 x 5000     1.25    1.52    2.92    1.75    101.2   0.40
5000 x 100     1.35    1.99    0.37    1.74    139.2   0.14
It seems that approaches #1 and #3 outperform approach #2, except in the cases of small numbers of columns. The best approach, however, is approach #4, which is significantly more convenient than the others, provided that the time needed to create the plan can be amortized during the computation.
If this is the extent (summing the rows) of the operations you need to do with this data, I wouldn't expect a sizable benefit from the GPU. You have exactly one arithmetic operation per data element, and for that you are paying the cost of transferring that data element to the GPU. And beyond a certain problem size (whatever it takes to keep the machine busy) you get no added benefit from larger problem sizes, because the arithmetic intensity is O(n).
So this isn't a particularly exciting problem to solve on the GPU.
But as talonmies has indicated, you have a coalescing problem in the way you have crafted it, which will further slow things down. Let's take a look at a small example:
     C1  C2  C3  C4
R1   11  12  13  14
R2   21  22  23  24
R3   31  32  33  34
R4   41  42  43  44
Above is a simple pictorial example of a small portion of your matrix. The machine data storage is such that elements (11), (12), (13), and (14) are stored in adjacent memory locations.
For coalesced access, we want an access pattern such that adjacent memory locations are requested from the same instruction, executed across the warp.
We need to think about execution of your code from the standpoint of a warp, that is 32 threads executing in lock-step. What is your code doing? Which elements is it retrieving (asking for) at each step/instruction? Let's take a look at this line of code:
sum+=m[rowIdx*ncol+k];
Adjacent threads in the warp have adjacent (i.e. consecutive) values for rowIdx as you have created that variable. So when k = 0, which data element is being asked for by each thread when we try to retrieve the value m[rowIdx*ncol+k] ?
In block 0, thread 0 has a rowIdx of 0. Thread 1 has a rowIdx of 1, etc. So the values being asked for by each thread at this instruction are:
Thread: Memory Location: Matrix Element:
0 m[0] (11)
1 m[ncol] (21)
2 m[2*ncol] (31)
3 m[3*ncol] (41)
But this is not coalesced access! Elements (11), (21), etc. are not adjacent in memory. For coalesced access, we would like that Matrix Element row to read like this:
Thread: Memory Location: Matrix Element:
0 m[?] (11)
1 m[?] (12)
2 m[?] (13)
3 m[?] (14)
If you then work backwards to determine what the value of ? should be, you will come up with an instruction something like this:
sum+=m[k*ncol+rowIdx];
This will give coalesced access, but it will not give you the correct answer, because we are now summing matrix columns instead of matrix rows. We can fix this by re-organizing your data storage to be in column-major order rather than row-major order. (You should be able to google that for ideas, right?) Conceptually, this is equivalent to transposing your matrix m. Whether this is convenient for you to do or not is outside the scope of your question, as I see it, and not really a CUDA issue. It may be a simple thing for you to do as you are creating the matrix on the host or transferring the matrix from host to device. But in summary, I don't know of a way to sum the matrix rows with 100% coalesced access, if the matrix is stored in row-major order. (You could resort to a sequence of row-reductions but that looks painful to me.)
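The access patterns discussed above can be checked with a small plain-Python model (not CUDA). All names here are invented for the sketch, and the warp is shortened to 4 threads to match the 4x4 picture. For the re-organized storage the sketch assumes element (r, c) sits at index c*nrow + r, which coincides with the k*ncol+rowIdx form above when the matrix is square:

```python
def warp_addresses_row_major(k, ncol, warp_size):
    """Flat indices read at loop step k when m[rowIdx*ncol + k] is used."""
    return [row_idx * ncol + k for row_idx in range(warp_size)]

def warp_addresses_col_major(k, nrow, warp_size):
    """Flat indices read at loop step k when m_t[k*nrow + rowIdx] is used."""
    return [k * nrow + row_idx for row_idx in range(warp_size)]

def is_coalesced(addresses):
    """True when consecutive threads touch consecutive memory locations."""
    return all(b - a == 1 for a, b in zip(addresses, addresses[1:]))

def to_col_major(m, nrow, ncol):
    """Re-store a row-major matrix so that element (r, c) sits at c*nrow + r."""
    return [m[r * ncol + c] for c in range(ncol) for r in range(nrow)]

def row_sums_col_major(m_t, nrow, ncol):
    """Row sums computed from column-major storage, one 'thread' per row."""
    return [sum(m_t[k * nrow + r] for k in range(ncol)) for r in range(nrow)]

nrow, ncol = 4, 4
m = [11, 12, 13, 14,
     21, 22, 23, 24,
     31, 32, 33, 34,
     41, 42, 43, 44]
m_t = to_col_major(m, nrow, ncol)

strided = warp_addresses_row_major(0, ncol, 4)     # indices ncol apart
contiguous = warp_addresses_col_major(0, nrow, 4)  # adjacent indices
sums = row_sums_col_major(m_t, nrow, ncol)         # still the row sums of m
```

With the re-organized storage each loop step reads adjacent locations across the warp, and the result is still the per-row sum of the original matrix.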
It's not uncommon, when we are thinking about ways to accelerate code on the GPU, to consider re-organizing our data storage to facilitate the GPU. This is one example.
And, yes, what I'm outlining here still retains a loop in the kernel.
As an additional comment, I would suggest timing the data copy portions, and kernel (compute) portions separately. I can't tell from your question whether you are timing just the kernel or the entire (GPU) operation, including the data copies. If you time the data copies separately, you may discover that just the data copy time exceeds your CPU time. Any effort put into optimizing your CUDA code will not affect the data copy time. This might be a useful data point before you spend much time on this.

2D Discrete laplacian (del2) in C++

I am trying to figure out how to port the del2() function in matlab to C++.
I have a couple of masks that I am working with that are ones and zeros, so I wrote code like this:
for(size_t i = 1 ; i < nmax-1 ; i++)
{
for(size_t j = 1 ; j < nmax-1 ; j++)
{
transmask[i*nmax+j] = .25*(posmask[(i+1)*nmax + j]+posmask[(i-1)*nmax+j]+posmask[i*nmax+(j+1)]+posmask[i*nmax+(j-1)]);
}
}
to compute the interior points of the Laplacian. According to some info in "doc del2" in MATLAB, I think the border conditions just use the available info to compute, right? So I guess I just need to write cases for the border conditions at i,j = 0 and nmax.
However, I would think the values from the code I have posted here would be correct for the interior points as is, but the del2 results are different!
I dug through the del2 source, and I guess I am not enough of a MATLAB wizard to figure out what is going on with some of the code for the interior computation.
You can see the code of del2 by edit del2 or type del2.
Note that del2 does cubic interpolation on the boundaries.
The problem is that the line you have there:
transmask[i*nmax+j] = .25*(posmask[(i+1)*nmax + j]+posmask[(i-1)*nmax+j]+posmask[i*nmax+(j+1)]+posmask[i*nmax+(j-1)]);
isn't the discrete Laplacian at all.
What you have is (I(i+1,j) + I(i-1,j) + I(i,j+1) + I(i,j-1) ) / 4
I don't know what this mask is, but the discrete Laplacian (assuming the spacing between each pixel in each dimension is 1) is:
(-4 * I(i,j) + I(i+1,j) + I(i-1,j) + I(i,j+1) + I(i,j-1) )
So basically, you missed a term, and you don't need to divide by 4. I suggest going back and rederiving the discrete Laplacian from its definition, which is the second x derivative of the image plus the second y derivative of the image.
Edit: I see where you got the /4 from, as Matlab uses this definition for some reason (even though this isn't standard mathematically).
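For reference, a minimal sketch of the interior computation with the missing centre term. It is written in plain Python rather than C++, uses the same flat row-major indexing as the question, and leaves the border at zero instead of reproducing del2's boundary handling. The function name and the scale parameter are made up for this sketch; scale=0.25 corresponds to the divide-by-4 convention mentioned in the edit above, as I understand the del2 documentation:

```python
def laplacian_interior(u, n, scale=1.0):
    """Five-point discrete Laplacian on the interior of an n x n row-major grid.

    Border entries are left at 0.0. With scale=0.25 the interior values
    follow the divide-by-4 convention for 2D input.
    """
    out = [0.0] * (n * n)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            out[i * n + j] = scale * (
                u[(i + 1) * n + j] + u[(i - 1) * n + j]
                + u[i * n + (j + 1)] + u[i * n + (j - 1)]
                - 4.0 * u[i * n + j]   # the term missing from the question's loop
            )
    return out

# Test field u(i, j) = i^2 + j^2, whose discrete Laplacian is 4 everywhere.
n = 5
u = [float(i * i + j * j) for i in range(n) for j in range(n)]
lap = laplacian_interior(u, n)               # interior entries are 4.0
del2_like = laplacian_interior(u, n, 0.25)   # interior entries are 1.0
```

A quadratic test field like this one is a convenient check because the five-point stencil is exact for it.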
I think that with the MATLAB compiler you can convert the m code into C code. Have you tried that?
I found this link where another method to convert to C is explained.
http://www.kluid.com/mlib/viewtopic.php?t=337
Good luck.