In-Place CUDA Kernel for Rectangular Matrix Transpose - C++

I have searched around for a while, but was unable to find a proper answer to this:
Is there an implementation of an in-place rectangular matrix transpose in CUDA?
I am aware of cublas geam, but that requires allocating a second matrix. I tried a naive in-place implementation from: CUDA In-place Transpose Error
However, that only works for square matrices. Can someone explain to me why exactly this logic does not work for rectangular (non-square) matrices? The 'naive' out-of-place approach to transposition does work, but it is not in place.

After looking around for a while, I found the following GitHub page, which has code accompanying the NVIDIA research paper on in-place transposition:
https://github.com/BryanCatanzaro/inplace
This seems to be the correct way to solve this question.

Have a look at the following paper: A Decomposition for In-place Matrix Transposition
A sequential algorithm for in-place matrix transposition is as follows (more than O(n*m) running time, since the cycle-following do-while may revisit elements several times):
#include <utility> // std::swap

// in:  n rows, m cols
// out: n cols, m rows
void matrix_transpose(int *a, int n, int m) {
    for (int k = 0; k < n * m; k++) {
        int idx = k;
        do { // calculate the index in the original array
            idx = (idx % n) * m + (idx / n);
        } while (idx < k); // make sure we don't swap elements twice
        std::swap(a[k], a[idx]);
    }
}
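As a quick sanity check of the routine above (my own example, not from the linked answer), transposing a 2x3 row-major matrix in place:

#include <cstdio>

int main() {
    // 2 rows, 3 cols, row-major: [1 2 3; 4 5 6]
    int a[6] = {1, 2, 3, 4, 5, 6};
    matrix_transpose(a, 2, 3);
    // now 3 rows, 2 cols: [1 4; 2 5; 3 6]
    for (int i = 0; i < 6; i++)
        printf("%d ", a[i]); // prints: 1 4 2 5 3 6
    return 0;
}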

Related

Is Eigen library matrix/vector manipulation faster than .NET if the matrix is dense and unsymmetrical?

I have some matrix operations, mostly running over all the rows and columns of the matrix and performing multiplications like a*mat[i,j]*mat[ii,j]:
public double[] MaxSumFunction()
{
    var maxSum = new double[matrix.GetLength(1)];
    for (int j = 0; j < matrix.GetLength(1); j++)
    {
        for (int i = 0; i < matrix.GetLength(0); i++)
        {
            for (int ii = 0; ii < matrix.GetLength(0); ii++)
            {
                double wi = Math.Sqrt(vector[i]);
                double wii = Math.Sqrt(vector[ii]);
                maxSum[j] += SomePowerFunctions(wi, wii) * matrix[i, j] * matrix[ii, j];
            }
        }
    }
    return maxSum;
}
private double SomePowerFunctions(double wi, double wj)
{
    var betaij = wi / wj;
    var numerator = 8 * Math.Sqrt(wi * wj) * Math.Pow(betaij, 3.0 / 2)
        * (wi + betaij * wj);
    var denominator = Math.Pow(1 - betaij * betaij, 2) +
        4 * wi * wj * betaij * (1 + Math.Pow(betaij, 2)) +
        4 * (wi * wi + wj * wj) * Math.Pow(betaij, 2);
    if (wi == 0 && wj == 0)
    {
        if (Math.Abs(betaij - 1) < 1.0e-8)
            return 1;
        else
            return 0;
    }
    return numerator / denominator;
}
I found such loops to be particularly slow when the matrix is large.
I want the code to be fast, so I am thinking about re-implementing these algorithms using the Eigen library.
My matrix is not symmetrical, not sparse and contains no regularity that any solver can exploit reliably.
I read that Eigen solver can be fast because of:
Compiler optimization
Vectorization
Multi-thread support
But I wonder whether those advantages really apply, given my matrix's characteristics?
Note: I could have just run a sample or two to find out, but I believe that asking the question here and having it documented on the Internet will help others as well.
Before thinking about low-level optimizations, look at your code and observe that many quantities are recomputed many times. For instance, f(wi,wii) does not depend on j, so it could either be precomputed once (see below), or you can rewrite your loops to make the loop on j the innermost one. Then the inner loop is simply a coefficient-wise product between a constant scalar and two rows or columns of your matrix (I don't know .NET and assume j indexes columns). If the storage is column-major, then this operation should be fully vectorized by your compiler (again, I don't know .NET, but any C++ compiler will do, and if you use Eigen, it will be vectorized explicitly). This alone should be enough to get a huge performance boost, as sketched below.
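A minimal C++ sketch of that reordering (hypothetical names: f stands in for SomePowerFunctions, and the matrix is stored row-major as a flat array):

#include <cmath>
#include <functional>
#include <vector>

// f(wi, wii) is hoisted out of the j loop, so it is computed n*n times
// instead of n*n*m times; the inner j loop is then a contiguous
// multiply-add that the compiler can vectorize.
std::vector<double> maxSumReordered(const std::vector<double>& v,   // length n
                                    const std::vector<double>& mat, // n*m, row-major
                                    int n, int m,
                                    const std::function<double(double, double)>& f) {
    std::vector<double> maxSum(m, 0.0);
    for (int i = 0; i < n; ++i)
        for (int ii = 0; ii < n; ++ii) {
            const double fij = f(std::sqrt(v[i]), std::sqrt(v[ii]));
            for (int j = 0; j < m; ++j)
                maxSum[j] += fij * mat[i * m + j] * mat[ii * m + j];
        }
    return maxSum;
}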
Depending on the matrix sizes, you might also try to leverage an optimized matrix-matrix product by precomputing f(wi,wii) into a MatrixXd F (using Eigen's language), and then observing that the whole computation amounts to:
VectorXd v = your_vector;
MatrixXd F = MatrixXd::NullaryExpr(n, n, [&](Index i, Index j) {
    return SomePowerFunctions(sqrt(v(i)), sqrt(v(j)));
});
MatrixXd M = your_matrix;
MatrixXd FM = F * M;
VectorXd maxSum = (M.array() * FM.array()).colwise().sum().transpose();
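The payoff of this formulation is that the whole triple loop collapses into one dense product F * M, which Eigen evaluates with its cache-blocked, explicitly vectorized matrix-matrix kernel (multi-threaded if compiled with OpenMP), rather than n*n*m scalar operations.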

Min of array rows in CUDA

Given an n-by-m matrix, I would like to build an n-sized vector containing the minimum of each matrix row, in CUDA.
So far I've come through this:
__global__ void OnMin(float *Mins, const float *Matrix, const int n, const int m) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        Mins[i] = Matrix[m * i];
        for (int j = 1; j < m; ++j) {
            if (Matrix[m * i + j] < Mins[i])
                Mins[i] = Matrix[m * i + j];
        }
    }
}
called in:
OnMin<<<(n + TPB - 1) / TPB, TPB>>>(Mins, Matrix, n, m);
However, I think something more optimized must exist.
I tried invoking cublasIsamin in a loop, but it is slower.
I also tried launching a child kernel (__global__) from within the OnMin kernel (dynamic parallelism), without success: building for sm_35/compute_35 raises compile errors, and my GTX 670 is only compute capability 3.0 anyway.
Any ideas?
Thanks!
Finding the min of the rows of a row-major matrix is a parallel reduction question that has been discussed many times on Stack Overflow. For example, this one:
Reduce matrix rows with CUDA
The basic idea is to use n blocks in a grid. Each block contains a fixed number of threads, typically 256, and each block of threads does a parallel reduction over the m elements of one row to find the minimum collaboratively; a sketch follows.
For a large enough matrix, where the GPU can be fully utilized, the performance upper bound is half the time of copying the matrix once (a copy both reads and writes every element, while the reduction only reads them).
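A minimal sketch of that idea (my own, untuned; assumes a fixed block size of 256, a power of two):

#include <cfloat> // FLT_MAX

// One block per row; 256 threads cooperatively reduce the row minimum
// in shared memory.
__global__ void RowMin(float *Mins, const float *Matrix, int n, int m) {
    __shared__ float smin[256];
    int row = blockIdx.x; // the grid has n blocks, one per row
    float v = FLT_MAX;
    // each thread folds a strided slice of the row
    for (int j = threadIdx.x; j < m; j += blockDim.x)
        v = fminf(v, Matrix[row * m + j]);
    smin[threadIdx.x] = v;
    __syncthreads();
    // shared-memory tree reduction
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smin[threadIdx.x] = fminf(smin[threadIdx.x], smin[threadIdx.x + s]);
        __syncthreads();
    }
    if (threadIdx.x == 0)
        Mins[row] = smin[0];
}

// launched as: RowMin<<<n, 256>>>(Mins, Matrix, n, m);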

Eigen: random binary vector with t 1s

I want to compute K*es, where K is an Eigen matrix (dimension pxp) and es is a px1 random binary vector with t ones.
For example if p=5 and t=2 a possible es is [1,0,1,0,0]' or [0,0,1,1,0]' and so on...
How do I easily generate es with Eigen?
I came up with an even better solution, which is a combination of std::vector, Eigen::Map and std::shuffle.
#include <algorithm> // std::fill_n, std::shuffle
#include <random>    // std::random_device, std::mt19937
#include <vector>

std::vector<int> esv(p, 0);
std::fill_n(esv.begin(), t, 1);
Eigen::Map<Eigen::VectorXi> es(esv.data(), esv.size());
std::random_device rd;
std::mt19937 g(rd());
std::shuffle(std::begin(esv), std::end(esv), g);
This solution is memory efficient (since Eigen::Map doesn't copy esv) and has the big advantage that if we want to permute es several times (as in this case), we just need to repeat the std::shuffle call.
Maybe I'm wrong, but this solution seems more elegant and efficient than the previous ones.
So you're using Eigen. I'm not sure what matrix type you're using, but I'll go with the class Eigen::MatrixXd.
What you need to do is:
Create a 1xp matrix that's all 0.
Choose t random positions in [0, p) to flip from 0 to 1, making sure each position is unique.
The following code should do the trick, although you could implement it other ways.
// Your p and t
int p = 5;
int t = 2;
// 1xp matrix
MatrixXd es(1, p);
// Initialize the whole 1xp matrix to zero (Eigen indices are 0-based)
for (int i = 0; i < p; ++i)
    es(0, i) = 0;
// Pick t distinct random positions in [0, p) and set them to 1
for (int i = 0; i < t; ++i)
{
    int randPos = rand() % p;
    // If the position is already a 1, draw a different random position
    while (es(0, randPos) == 1)
        randPos = rand() % p;
    // Flip the chosen position from 0 to 1
    es(0, randPos) = 1;
}
When t is close to p, Ryan's method needs to generate many more than t random numbers. To avoid this performance degradation, you can solve your original problem,
find t different numbers from [0, p) that are uniformly distributed
by the following steps
generate t uniformly distributed random numbers idx[0], ..., idx[t-1] from [0, p-t+1)
sort these numbers
idx[i]+i, for i = 0, ..., t-1, are the result
The code:
VectorXi idx(t);
VectorXd es(p);
es.setConstant(0);
for(int i = 0; i < t; ++i) {
idx(i) = int(double(rand()) / RAND_MAX * (p-t+1));
}
std::sort(idx.data(), idx.data() + idx.size());
for(int i = 0; i < t; ++i) {
es(idx(i)+i) = 1.0;
}
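The same algorithm with <random> instead of rand() avoids the RAND_MAX edge case entirely (a sketch of mine, not the answerer's code):

#include <algorithm>
#include <random>
#include <Eigen/Dense>

Eigen::VectorXd randomBinaryVector(int p, int t, std::mt19937 &g) {
    Eigen::VectorXi idx(t);
    std::uniform_int_distribution<int> dist(0, p - t); // closed range [0, p-t]
    for (int i = 0; i < t; ++i)
        idx(i) = dist(g);
    std::sort(idx.data(), idx.data() + idx.size());
    Eigen::VectorXd es = Eigen::VectorXd::Zero(p);
    for (int i = 0; i < t; ++i)
        es(idx(i) + i) = 1.0; // sorted offsets guarantee distinct positions
    return es;
}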

How can I most efficiently map a kernel range for a hermitian (symmetric) matrix in OpenCL?

I'm working on an OpenCL project to generate very large hermitian (symmetric) matrices, and I am trying to determine the best way to generate the work IDs.
A hermitian matrix is conjugate-symmetric across the diagonal, so that M(i,j) = M*(j,i).
In the brute force way, the for loop looks like:
for (int i = 0; i < N; i++)
{
    for (int j = 0; j < N; j++)
    {
        complex<float> result = doSomeCalculation();
        M(i,j) = result;
    }
}
However, taking advantage of the hermitian property, the loop can be made to be twice as efficient by only calculating the upper triangular part of the matrix and duplicating the result in the lower triangular part:
for (int i = 0; i < N; i++)
{
    for (int j = i; j < N; j++)
    {
        complex<float> result = doSomeCalculation();
        M(i,j) = result;
        M(j,i) = conj(result);
    }
}
In both loops, doSomeCalculation() is an expensive operation, and each entry in the matrix is completely independent of every other entry (i.e. the problem is embarrassingly parallel).
My question is this:
How can I implement the second loop with doSomeCalculation as an OpenCL kernel so that the thread IDs are most efficiently used (i.e. so that the thread calculates both M(i,j) and M(j,i) without having to call doSomeCalculation() twice)?
You need to use a linear index. For example, you can index every element of the upper triangle of your matrix in this way:
0   1    2   ...   N-1
*   N   N+1  ...  2N-2
*   *  2N-1  ...  3N-4
            ...
*   *    *   ...  N(N+1)/2 - 1
That is, the index k is given by:
k = i*N - i*(i+1)/2 + j
where N is the size of the matrix and (i,j) are, respectively, the 0-based indices of the row and the column.
This relationship can be inverted; see the answer to this question, which I report here for completeness:
i = floor( (2*N + 1 - sqrt((2*N+1)*(2*N+1) - 8*k)) / 2 );
j = k - N*i + i*(i+1)/2;
So you need to enqueue a 1D kernel with N(N+1)/2 work items, and you can decide by yourself the size of the workgroup (usually 64 items per work group is a good choice).
Then, in the OpenCL code, you can retrieve the index k with:
int k = get_group_id(0)*64 + get_local_id(0);
Then use the two relationships above to recover the indices (i,j) of the matrix element you need to compute.
Moreover, notice that you can also save space by representing your hermitian matrix as a linear vector with N(N+1)/2 elements.
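Before wiring this into a kernel, the forward and inverse mappings are cheap to sanity-check on the host; a minimal C++ check of my own (not from the answer):

#include <cassert>
#include <cmath>

int main() {
    const int N = 100;
    for (int i = 0; i < N; ++i) {
        for (int j = i; j < N; ++j) {
            int k = i * N - i * (i + 1) / 2 + j;     // forward mapping
            int ri = (int)std::floor((2.0 * N + 1
                   - std::sqrt((2.0 * N + 1) * (2.0 * N + 1) - 8.0 * k)) / 2.0);
            int rj = k - N * ri + ri * (ri + 1) / 2; // inverse mapping
            assert(ri == i && rj == j);              // the round trip must agree
        }
    }
    return 0;
}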
If your matrices are really big, then you can dice your NxN matrix up into (N/k)x(N/k) tiles, each of size kxk. Since you only need half of the data, you create a 1D NDRange of size roughly local_group_size * (N/k)*(N/k)/2.
Every tile of the matrix is processed by one work-group (the work-group size is your choice). The idea is that you create an array on the host side which contains the position of every work-group in the matrix. The kernel stub should look as follows:
__kernel void myKernel(
    __global int* coords,
    ....)
{
    int2 WorkGroupPositionInMatrix = vload2(get_group_id(0), coords);
    ...
    DoCalculation();
    ...
    WriteResultTwice();
    ...
    return;
}
What you need to handle by hand are those work-groups that land on the matrix diagonal. If the matrix size is big, then the overhead of the work-groups placed on the diagonal is negligible.
A right triangle can be cut in half vertically and the smaller portion rotated to fit against the larger portion, forming a rectangle of equal area. It is therefore easy to turn your triangular global work area into a rectangular one, which fits OpenCL.
See my answer here: OpenCL efficient way to group a lower triangular matrix
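For illustration, one concrete triangle-to-rectangle mapping of that kind (my own sketch, assuming N is even; it folds an (N/2) x (N+1) rectangle onto the lower triangle, diagonal included):

#include <cassert>

// Map cell (r, c) of an (N/2) x (N+1) rectangle onto the lower triangle
// (i >= j) of an N x N matrix; each triangle cell is hit exactly once.
void rectToTriangle(int N, int r, int c, int *i, int *j) {
    if (c <= r) {       // left part of the rectangle: used as-is
        *i = r;
        *j = c;
    } else {            // right part: rotated onto the bottom rows
        *i = N - 1 - r;
        *j = c - r - 1;
    }
}

int main() {
    const int N = 8; // must be even
    int count = 0;
    for (int r = 0; r < N / 2; ++r)
        for (int c = 0; c <= N; ++c) {
            int i, j;
            rectToTriangle(N, r, c, &i, &j);
            assert(j <= i && i < N); // lands inside the lower triangle
            ++count;
        }
    // rectangle area equals triangle size, and the mapping is one-to-one,
    // so the triangle is covered exactly once
    assert(count == N * (N + 1) / 2);
    return 0;
}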

How do you multiply a matrix by itself?

This is what I have so far, but I do not think it is right.
for (int i = 0; i < 5; i++)
{
    for (int j = 0; j < 5; j++)
    {
        matrix[i][j] += matrix[i][j] * matrix[i][j];
    }
}
Suggestion: if it's not homework, don't write your own linear algebra routines; use any of the many peer-reviewed libraries that are out there.
Now, about your code: if you want to do a term-by-term product, you're doing it wrong. What you're doing is assigning to each value its square plus the original value (n*n + n, or (1+n)*n, whichever you like best).
But if you want an authentic matrix multiplication in the algebraic sense, remember that you have to take the scalar product of the first matrix's rows with the second matrix's columns (or the other way around, I'm not quite sure now)... something like:
for i in rows:
    for j in cols:
        result(i,j) = m(i,:) · m(:,j)
where the scalar product "·" is
v·w = sum(v(i)*w(i)) for all i in the range of the indices.
Of course, with this method you cannot do the product in place, because you would need the values that you are overwriting in later steps.
Also, to expand a little on Tyler McHenry's comment: as a consequence of having to multiply rows by columns, the "inner dimensions" of the matrices must match (if A is m x n and B is n x o, then A*B is m x o), so in your case a matrix can be squared only if it is square (he he he).
And if you just want to play a little bit with matrices, then you can try Octave, for example; squaring a matrix is as easy as M*M or M**2.
I don't think you can multiply a matrix by itself in-place.
for (int i = 0; i < 5; i++) {
    for (int j = 0; j < 5; j++) {
        product[i][j] = 0;
        for (int k = 0; k < 5; k++) {
            product[i][j] += matrix[i][k] * matrix[k][j];
        }
    }
}
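If the result is wanted back in matrix (a hypothetical continuation; product is assumed to be a separate 5x5 array), a copy afterwards gives the in-place effect:

// safe to copy back now that product is fully computed
for (int i = 0; i < 5; i++)
    for (int j = 0; j < 5; j++)
        matrix[i][j] = product[i][j];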
Even if you use a less naïve matrix multiplication (i.e. something other than this O(n^3) algorithm), you still need extra storage.
That's not any matrix multiplication definition I've ever seen. The standard definition is
for (i = 1 to m)
    for (j = 1 to n)
        result(i, j) = 0
        for (k = 1 to s)
            result(i, j) += a(i, k) * b(k, j)
to give the algorithm in a sort of pseudocode. In this case, a is an m x s matrix, b is s x n, the result is m x n, and subscripts begin with 1.
Note that multiplying a matrix in place is going to get the wrong answer, since you're going to be overwriting values before using them.
It's been too long since I've done matrix math (and I only ever did a little of it), but the += operator takes the value of matrix[i][j] and adds to it the value of matrix[i][j] * matrix[i][j], which I don't think is what you want to do.
Well, it looks like what it's doing is squaring each element, then adding that square to the element. Is that what you want it to do? If not, then change it.