Understanding matrix multiplication in CUDA - C++

I am trying to learn CUDA. I started with matrix multiplication on the GPU, following this article.
My main problem is that I am unable to understand how to access a 2D array in a kernel, since it is done differently from the conventional method (matrix[i][j]).
This is the part where I am stuck:
for (int i = 0; i < N; i++) {
    tmpSum += A[ROW * N + i] * B[i * N + COL];
}
C[ROW * N + COL] = tmpSum;
I could understand how ROW and COLUMN were derived.
int ROW = blockIdx.y*blockDim.y+threadIdx.y;
int COL = blockIdx.x*blockDim.x+threadIdx.x;
Any explanation with an example is highly appreciated. Thanks!

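Matrices are stored contiguously in row-major order, i.e. one row after the other at consecutive memory locations. What you see here is called flat addressing, i.e. turning the two-element index (ROW, COL) into an offset from the first element: element (ROW, COL) of an N-wide matrix sits at offset ROW * N + COL. So A[ROW * N + i] is A[ROW][i], and B[i * N + COL] is B[i][COL].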
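As a rough illustration of flat addressing, here is a CPU-only C++ sketch of the same multiplication. The names flat_index and matmul_flat are made up for this example and are not from the article; the two outer loops stand in for the ROW and COL values that each CUDA thread would compute for itself.

```cpp
#include <cstddef>
#include <vector>

// Offset of element (row, col) in a row-major n x n matrix stored as one flat array.
inline std::size_t flat_index(std::size_t row, std::size_t col, std::size_t n) {
    return row * n + col;
}

// CPU reference: the two outer loops play the role of the ROW/COL thread indices.
std::vector<float> matmul_flat(const std::vector<float>& a,
                               const std::vector<float>& b,
                               std::size_t n) {
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            float tmp_sum = 0.0f;
            for (std::size_t i = 0; i < n; ++i) {
                // A[row][i] * B[i][col], written with flat offsets
                tmp_sum += a[flat_index(row, i, n)] * b[flat_index(i, col, n)];
            }
            c[flat_index(row, col, n)] = tmp_sum;
        }
    }
    return c;
}
```

For a 4-wide matrix, for instance, element (1, 2) lives at offset 1 * 4 + 2 = 6.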

Related

Min of array rows in CUDA

Given a n-by-m matrix, I would like to build a n-sized vector containing the minimums of each matrix row, in CUDA.
So far I've come through this:
__global__ void OnMin(float * Mins, const float * Matrix, const int n, const int m) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        Mins[i] = Matrix[m * i];
        for (int j = 1; j < m; ++j){
            if (Matrix[m * i + j] < Mins[i])
                Mins[i] = Matrix[m * i + j];
        }
    }
}
called in:
OnMin<<<(n + TPB - 1) / TPB, TPB>>>(Mins, Matrix, n, m);
However I think that something more optimized could exist.
I tried invoking cublasIsamin in a loop, but it is slower.
I also tried launching a kernel (global) from OnMin kernel without success... (sm_35, compute_35 raises compile errors... I have a GTX670)
Any ideas ?
Thanks!
Finding the minimum of each row in a row-major matrix is a parallel reduction problem that has been discussed many times on Stack Overflow. For example, this one:
Reduce matrix rows with CUDA
The basic idea is to use n blocks in a grid. Each block contains a fixed number of threads, typically 256. The threads of one block cooperatively perform a parallel reduction over one row of m elements to find its minimum.
For a large enough matrix where the GPU can be fully utilized, the performance upper bound is half the time of copying the matrix once.
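To make the block-per-row idea concrete, here is a CPU-only C++ simulation of the reduction pattern. It is a sketch, not a tuned kernel: row_min_block_style and row_mins are hypothetical names, `threads` stands in for blockDim.x and is assumed to be a power of two, and m is assumed to be at least 1. On a GPU the inner loops over t would run in parallel, with a barrier between reduction passes.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Simulates one "block" reducing one row of m elements.
float row_min_block_style(const float* row, std::size_t m, std::size_t threads) {
    std::vector<float> partial(threads, std::numeric_limits<float>::infinity());
    // Step 1: "thread" t scans elements t, t + threads, t + 2*threads, ...
    for (std::size_t t = 0; t < threads; ++t)
        for (std::size_t j = t; j < m; j += threads)
            partial[t] = std::min(partial[t], row[j]);
    // Step 2: tree reduction, halving the number of active "threads" each pass.
    for (std::size_t stride = threads / 2; stride > 0; stride /= 2)
        for (std::size_t t = 0; t < stride; ++t)
            partial[t] = std::min(partial[t], partial[t + stride]);
    return partial[0];
}

// One "block" per row of the row-major n x m matrix.
std::vector<float> row_mins(const std::vector<float>& matrix,
                            std::size_t n, std::size_t m,
                            std::size_t threads = 256) {
    std::vector<float> mins(n);
    for (std::size_t i = 0; i < n; ++i)
        mins[i] = row_min_block_style(matrix.data() + i * m, m, threads);
    return mins;
}
```

The strided scan in step 1 mirrors how neighbouring GPU threads would read neighbouring elements of the row.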

How can I most efficiently map a kernel range for a hermitian (symmetric) matrix in OpenCL?

I'm working on an OpenCL project to generate very large hermitian (symmetric) matrices, and I am trying to determine the best way to generate the work IDs.
A hermitian matrix is symmetric along the diagonal, so that M(i,j) = M*(j,i).
In the brute force way, the for loop looks like:
for(int i = 0; i < N; i++)
{
    for(int j = 0; j < N; j++)
    {
        complex<float> result = doSomeCalculation();
        M(i,j) = result;
    }
}
However, taking advantage of the hermitian property, the loop can be made to be twice as efficient by only calculating the upper triangular part of the matrix and duplicating the result in the lower triangular part:
for(int i = 0; i < N; i++)
{
    for(int j = i; j < N; j++)
    {
        complex<float> result = doSomeCalculation();
        M(i,j) = result;
        M(j,i) = conj(result);
    }
}
In both loops, doSomeCalculation() is an expensive operation, and each entry in the matrix is completely uncorrelated from every other entry (i.e. the problem is stupidly parallel).
My question is this:
How can I implement the second loop with doSomeCalculation as an OpenCL kernel so that the thread IDs are most efficiently used (i.e. so that the thread calculates both M(i,j) and M(j,i) without having to call doSomeCalculation() twice)?
You need to use a linear index, for example you can index every element of your matrix in this way:
0   1   2     ...  N-1
*   N   N+1   ...  2N-2
*   *   2N-1  ...  3N-4
...
*   *   *     ...  N(N+1)/2-1
That is, the index k is given by:
k = i*N - i*(i+1)/2 + j
Where N is the size of the matrix and (i,j) are respectively the 0-based indices of the row and the column.
This relationship can be inverted; see the answer of this question, which I report here for completeness:
i = floor( ( 2*N+1 - sqrt( (2*N+1)*(2*N+1) - 8*k ) ) / 2 );
j = k - N*i + i*(i+1)/2;
So you need to enqueue a 1D kernel with N(N+1)/2 work items, and you can decide by yourself the size of the workgroup (usually 64 items per work group is a good choice).
Then in the OpenCL code you can retrieve the index k by using:
int k = get_group_id(0)*64 + get_local_id(0);
And then use the two relationships above to recover the indices (i, j) of the matrix element you need to compute.
Moreover, notice that you can also save space by representing your hermitian matrix as a linear vector with N(N+1)/2 elements.
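The two mappings can be checked on the host with a small C++ sketch. The function names tri_index and tri_coords are invented for this example. The two while loops are an extra safeguard against floating-point rounding in sqrt and are not part of the formula quoted above.

```cpp
#include <cmath>
#include <cstdint>
#include <utility>

// Linear index of element (i, j), with i <= j, in the packed upper triangle
// of an n x n matrix.
inline std::int64_t tri_index(std::int64_t i, std::int64_t j, std::int64_t n) {
    return i * n - i * (i + 1) / 2 + j;
}

// Inverse mapping: recover (i, j) from the linear index k.
inline std::pair<std::int64_t, std::int64_t> tri_coords(std::int64_t k, std::int64_t n) {
    const double b = 2.0 * static_cast<double>(n) + 1.0;
    const double root = std::sqrt(b * b - 8.0 * static_cast<double>(k));
    auto i = static_cast<std::int64_t>(std::floor((b - root) / 2.0));
    // Safeguard (an assumption of this sketch): nudge i until k lies in row i.
    while (i > 0 && tri_index(i, i, n) > k) --i;
    while (i + 1 < n && tri_index(i + 1, i + 1, n) <= k) ++i;
    const std::int64_t j = k - n * i + i * (i + 1) / 2;
    return {i, j};
}
```

In a kernel, the work item with index k would call the inverse mapping once, compute the result, and write it to both M(i,j) and M(j,i).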
If your matrices are really big, then you can dice up your NxN matrix into (N/k)x(N/k) tiles, each of size kxk. Since you need only half of the data, you create a 1D NDRange of size roughly local_group_size * (N/k)x(N/k)/2.
Every tile of the matrix is processed by one work group (the work group size is your choice). The idea is that you create an array on the host side which contains the position of every work group in the matrix. The kernel stub should look as follows:
void __kernel myKernel(
    __global int* coords,
    ....)
{
    int2 WorkGroupPositionInMatrix = vload2(get_group_id(0), coords);
    ...
    DoCalculation();
    ...
    WriteResultTwice();
    ...
    return;
}
What you need to handle by hand is the work groups that are placed on the matrix diagonal. If the matrix is big, then the overhead for the work groups placed on the diagonal is negligible.
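A possible way to build the coords array on the host is sketched below in C++. The helper name upper_tile_coords is hypothetical, and the sketch assumes the tiles on the diagonal are included, which gives T(T+1)/2 work groups for T = N/k tiles per side. The pairs are stored flat as [r0, c0, r1, c1, ...], which is the layout vload2(g, coords) reads for group g.

```cpp
#include <cstddef>
#include <vector>

// Lists (tile_row, tile_col) for every tile on or above the diagonal,
// flattened as [r0, c0, r1, c1, ...].
std::vector<int> upper_tile_coords(int tiles_per_side) {
    std::vector<int> coords;
    coords.reserve(static_cast<std::size_t>(tiles_per_side) * (tiles_per_side + 1));
    for (int r = 0; r < tiles_per_side; ++r) {
        for (int c = r; c < tiles_per_side; ++c) {
            coords.push_back(r);  // tile row
            coords.push_back(c);  // tile column
        }
    }
    return coords;
}
```

The number of work groups to launch is then coords.size() / 2, and a work group whose tile row equals its tile column knows it sits on the diagonal.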
A right triangle can be cut in half vertically and the smaller portion rotated to fit with the larger portion to form a rectangle of equal area. Therefore it is easy to make your triangular global work area into one that is rectangular, which fits OpenCL.
See my answer here: OpenCL efficient way to group a lower triangular matrix

Put a multidimensional array into a one-dimensional array

I've got a question. I'm writing a simple application in C++ and I have the following problem:
I want to use a two-dimensional array to specify the position of an object (x and y coordinates). But when I created such an array, I got many access violations when I accessed it. I'm not quite sure where those violations came from, but I think my stack is not big enough and I should use pointers. But when I searched for a way to put a multidimensional array on the heap and point to it, the solutions were too complicated for me.
So I remembered there's a way to use a "normal" one-dimensional array as a multidimensional array. But I do not remember exactly how to access it the right way. I declared it this way:
char array [SCREEN_HEIGHT * SCREEN_WIDTH];
Then I tried to fill it this way:
for(int y = 0; y < SCREEN_HEIGHT; y++) {
    for(int x = 0; x < SCREEN_WIDTH; x++) {
        array [y + x * y] = ' ';
    }
}
But this is not right, because the char that is at position y + x * y is not exactly specified (because y + y * x points to the same position)
But I am pretty sure, there was a way to do this. Maybe I am wrong, so tell it to me :D
In this case, a solution to use multidimensional array would be great!
You don't want y + x*y, you want y * SCREEN_WIDTH + x. That said, a 2D array declared as:
char array[SCREEN_HEIGHT][SCREEN_WIDTH];
Has exactly the same memory layout, and you could just access it directly the way you want:
array[y][x] = ' ';
char array2D[ROW_COUNT][COL_COUNT] = { {...} };
char array1D[ROW_COUNT * COL_COUNT];
for (int row = 0; row < ROW_COUNT; row++)
{
    for (int col = 0; col < COL_COUNT; col++)
    {
        array1D[row * COL_COUNT + col] = array2D[row][col];
    }
}
You access the correct element for your 1D array by taking "current row * total columns + current column," or vice-versa if you're looping through columns first.
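The row-major formula and its inverse can be wrapped in small helpers, as in the following C++ sketch. The helper names and the 80 x 25 screen size are assumptions for illustration only. A std::vector keeps the buffer on the heap, which avoids the stack-size concern raised in the question.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t SCREEN_WIDTH = 80;   // assumed value for illustration
constexpr std::size_t SCREEN_HEIGHT = 25;  // assumed value for illustration

// (x, y) -> flat index: current row * total columns + current column
inline std::size_t to_index(std::size_t x, std::size_t y) {
    return y * SCREEN_WIDTH + x;
}

// Flat index -> column and row
inline std::size_t index_to_x(std::size_t idx) { return idx % SCREEN_WIDTH; }
inline std::size_t index_to_y(std::size_t idx) { return idx / SCREEN_WIDTH; }

// Heap-allocated screen buffer, filled with spaces
std::vector<char> make_screen() {
    return std::vector<char>(SCREEN_WIDTH * SCREEN_HEIGHT, ' ');
}
```

Writing a character at (x, y) then reads as screen[to_index(x, y)] = '#';.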

Calculate squared Euclidean distance matrix on GPU

Let p be a matrix of first set of locations where each row gives the coordinates of a particular point. Similarly, let q be a matrix of second set of locations where each row gives the coordinates of a particular point.
Then formula for pairwise squared Euclidean distance is:
k(i,j) = (p(i,:) - q(j,:))*(p(i,:) - q(j,:))',
where p(i,:) denotes i-th row of matrix p, and p' denotes the transpose of p.
I would like to compute matrix k on CUDA-enabled GPU (NVidia Tesla) in C++. I have OpenCV v.2.4.1 with GPU support but I'm open to other alternatives, like Thrust library. However, I'm not too familiar with GPU programming. Can you suggest an efficient way to accomplish this task? What C++ libraries should I use?
The problem looks simple enough to make a library overkill.
Without knowing the range of i and j, I'd suggest you partition k into blocks of a multiple of 32 threads each and in each block, compute
// d (the number of dimensions) must be a compile-time constant here
float sum, temp, myp[d];
int i = blockIdx.x*blockDim.x + threadIdx.x;
// cache row i of p in per-thread storage
for ( int kk = 0 ; kk < d ; kk++ )
    myp[kk] = p(i,kk);
for ( int j = blockIdx.y*blockDim.y ; j < (blockIdx.y+1)*blockDim.y ; j++ ) {
    sum = 0.0f;
    #pragma unroll
    for ( int kk = 0 ; kk < d ; kk++ ) {
        temp = myp[kk] - q(j,kk);
        sum += temp*temp;
    }
    k(i,j) = sum;
}
where I am assuming that your data has d dimensions and writing p(i,k), q(j,k) and k(i,j) to mean an access to a two-dimensional array. I also took the liberty in assuming that your data is of type float.
Note that depending on how k is stored, e.g. row-major or column-major, you may want to loop over i per thread instead to get coalesced writes to k.
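For checking GPU results, a plain CPU reference can be handy. The C++ sketch below assumes p (np x d), q (nq x d) and the output k (np x nq) are all stored row-major in flat arrays; the function name squared_distances and the parameter names are made up for this example.

```cpp
#include <cstddef>
#include <vector>

// k(i, j) = sum over kk of (p(i, kk) - q(j, kk))^2, all matrices row-major.
std::vector<float> squared_distances(const std::vector<float>& p, std::size_t np,
                                     const std::vector<float>& q, std::size_t nq,
                                     std::size_t d) {
    std::vector<float> k(np * nq);
    for (std::size_t i = 0; i < np; ++i) {
        for (std::size_t j = 0; j < nq; ++j) {
            float sum = 0.0f;
            for (std::size_t kk = 0; kk < d; ++kk) {
                const float diff = p[i * d + kk] - q[j * d + kk];
                sum += diff * diff;
            }
            k[i * nq + j] = sum;  // row-major write into k
        }
    }
    return k;
}
```

In the GPU version, each (i, j) pair of the two outer loops is handled by a thread instead of a loop iteration.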

Iterate through 2D Array block by block in C++

I'm working on a homework assignment for an image shrinking program in C++. My picture is represented by a 2D array of pixels; each pixel is an object with members "red", "green" and "blue." To solve the problem I am trying to access the 2D array one block at a time and then call a function which finds the average RGB value of each block and adds a new pixel to a smaller image array. The size of each block (or scale factor) is input by the user.
As an example, imagine a 100-item 2D array like myArray[10][10]. If the user input a shrink factor of 3, I would need to break out mini 2D arrays of size 3 by 3. I do not have to account for overflow, so in this example I can ignore the last row and the last column.
I have most of the program written, including the function to find the average color. I am confused about how to traverse the 2D array. I know how to cycle through a 2D array sequentially (one row at a time), but I'm not sure how to get little squares within an array.
Any help would be greatly appreciated!
Something like this should work:
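// Stop before any partial block at the right/bottom edge.
for(size_t bx = 0; bx + block_width <= width; bx += block_width)
    for(size_t by = 0; by + block_height <= height; by += block_height) {
        float sum = 0;
        for(size_t x = 0; x < block_width; ++x)
            for(size_t y = 0; y < block_height; ++y) {
                sum += array[bx + x][by + y];
            }
        float average = sum / (block_width * block_height);
        // Index the output by block number, not by pixel offset.
        new_array[bx / block_width][by / block_height] = average;
    }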
Here width and height are the dimensions of the whole image, and block_width and block_height are the dimensions of one block (the blue squares in your diagram).
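A self-contained variant of the same idea is sketched below in C++. It makes a few simplifying assumptions that are not in the question: the image is a single float channel stored row-major in a flat vector, blocks are square, and the function name shrink is invented. For RGB pixels the same loops would be run per colour member.

```cpp
#include <cstddef>
#include <vector>

// Shrinks a row-major height x width image by averaging block x block squares.
// Leftover rows and columns that do not fill a whole block are ignored.
std::vector<float> shrink(const std::vector<float>& img,
                          std::size_t height, std::size_t width,
                          std::size_t block) {
    const std::size_t out_h = height / block;
    const std::size_t out_w = width / block;
    std::vector<float> out(out_h * out_w);
    for (std::size_t by = 0; by < out_h; ++by) {
        for (std::size_t bx = 0; bx < out_w; ++bx) {
            float sum = 0.0f;
            // Walk the pixels of block (by, bx) in the source image.
            for (std::size_t y = 0; y < block; ++y)
                for (std::size_t x = 0; x < block; ++x)
                    sum += img[(by * block + y) * width + (bx * block + x)];
            out[by * out_w + bx] = sum / static_cast<float>(block * block);
        }
    }
    return out;
}
```

With a 10 x 10 image and a shrink factor of 3, this yields a 3 x 3 result and skips the last row and column, as the question allows.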
This is how you traverse an array in C++:
for(int i = 0; i < m; i++) {
    for(int j = 0; j < n; j++) {
        // do something with myArray[i][j] where i represents the row and j the column
    }
}
I'll leave figuring out how to cycle through the array in different ways as an exercise to the reader.
You could use two nested loops, one for x and one for y, and move the start point of those loops across the image. As this is homework I won't put any code up, but you should be able to work it out.