Gram matrix using VexCL - c++

I have a pretty large dataset (it does not fit into GPU memory) containing many vectors, where each vector is several MBs.
I'd like to calculate the Gram matrix using a Gaussian kernel, using multiple GPU devices.
In other words, for every pair of vectors x, y I need to calculate the norm of x - y. So if I have N vectors, I have (N^2+N)/2 such pairs. I don't care about saving space or time by taking advantage of the symmetry; it can do the whole N^2.
How can I do it with VexCL? As far as I know it's the only library supporting multiple GPUs, and my attempts to do this with plain OpenCL have had no success so far.
Please note that the dataset won't even fit into the machine's RAM; I'm reading blocks of vectors from a memory-mapped file.
Thanks much!!

You will obviously need to split your vectors into groups of, say, m, load the groups one by one (or, rather, two by two) onto your GPUs and do the computations. Here is a complete program that does the computation (as I understood it) for the two currently loaded chunks:
#include <vexcl/vexcl.hpp>

int main() {
    const size_t n = 1024; // Each vector size.
    const size_t m = 4;    // Number of vectors in a chunk.

    vex::Context ctx( vex::Filter::Count(1) );

    // The input vectors...
    vex::vector<double> chunk1(ctx, m * n);
    vex::vector<double> chunk2(ctx, m * n);

    // ... with some data.
    chunk1 = vex::element_index();
    chunk2 = vex::element_index();

    vex::vector<double> gram(ctx, m * m); // The current chunk of Gram matrix to fill.

    /*
     * chunk1 and chunk2 both have dimensions [m][n].
     * We want to take each of chunk2 m rows, subtract those from each of
     * chunk1 rows, and reduce the result along the dimension n.
     *
     * In order to do this, we create two virtual 3D matrices (x and y below,
     * those are just expressions and are never instantiated) sized [m][m][n],
     * where
     *
     *     x[i][j][k] = chunk1[i][k] for each j, and
     *     y[i][j][k] = chunk2[j][k] for each i.
     *
     * Then what we need to compute is
     *
     *     gram[i][j] = sum_k( (x[i][j][k] - y[i][j][k])^2 );
     *
     * Here it goes:
     */
    using vex::extents;

    auto x = vex::reshape(chunk1, extents[m][m][n], extents[0][2]);
    auto y = vex::reshape(chunk2, extents[m][m][n], extents[1][2]);

    // The single OpenCL kernel is generated and launched here:
    gram = vex::reduce<vex::SUM>(
            extents[m][m][n], // The dimensions of the expression to reduce.
            pow(x - y, 2.0),  // The expression to reduce.
            2                 // The dimension to reduce along.
            );

    // Copy the result to host, spread it across your complete gram matrix.
    // I am lazy though, so let's just dump it to std::cout:
    std::cout << gram << std::endl;
}
I suggest you load chunk1 once, then load all chunk2 variants in sequence and do the computations, then load the next chunk1, and so on. Note that slicing, reshaping, and multidimensional reduction operations are only supported for a context with a single compute device in it. So what is left is how to spread the computations across all of your compute devices. The easiest way to do this is probably to create a single VexCL context that grabs all available GPUs, and then create vectors of command queues out of it:
vex::Context ctx( vex::Filter::Any );

std::vector<std::vector<vex::command_queue>> q;
for(size_t d = 0; d < ctx.size(); ++d)
    q.push_back({ctx.queue(d)});

//...

// In a std::thread perhaps:
vex::vector<double> chunk1(q[d], m * n);
vex::vector<double> chunk2(q[d], m * n);
// ...
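For completeness, here is a rough sketch of the per-device chunk loop (my own illustration, not from the answer above). num_chunks, load_chunk() and store_block() are hypothetical helpers for reading a group of m vectors from the memory-mapped file and for writing a finished m x m block of the Gram matrix back out; the reshape/reduce expression is the same as in the program above.

// Hypothetical sketch of the chunk-pair loop for one device d.
// load_chunk(i) is assumed to return a std::vector<double> holding the
// i-th group of m vectors (m * n doubles) read from the mapped file.
using vex::extents;

for (size_t i = 0; i < num_chunks; ++i) {
    vex::vector<double> chunk1(q[d], load_chunk(i)); // copies host data to device d

    for (size_t j = 0; j < num_chunks; ++j) {
        vex::vector<double> chunk2(q[d], load_chunk(j));
        vex::vector<double> gram(q[d], m * m);

        auto x = vex::reshape(chunk1, extents[m][m][n], extents[0][2]);
        auto y = vex::reshape(chunk2, extents[m][m][n], extents[1][2]);
        gram = vex::reduce<vex::SUM>(extents[m][m][n], pow(x - y, 2.0), 2);

        // Bring the m x m block back to the host and scatter it into
        // block (i, j) of the full Gram matrix (probably stored on disk).
        std::vector<double> block(m * m);
        vex::copy(gram, block);
        store_block(i, j, block);
    }
}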
I hope this is enough to get you started.

Related

Combine neural network layer kernels into one kernel CUDA

I am working on a CUDA implementation of a neural network and I'm wondering how the calculations within a fully connected layer can be optimized more.
My current CUDA kernel for a fully connected layer in a neural network consists of the following steps:
Set the output neuron accumulators (input) to 0
Multiply the output data from the previous layer (in) with the weights of the current layer and sum the result in the accumulator
Calculate the output of the current layer (out) by applying an activation function to the accumulated data
These are the general steps in a single layer of a neural network, but they are currently (see below) implemented as separate kernels. For small output sizes (outSizeX equals 10, for example), the first and third steps are relatively slow, especially combined with the overhead of launching the three kernels.
Thus, my question is: how can I combine these three kernels into one kernel which performs all of the three above mentioned steps?
// Step 1
__global__ void set_to_zero_cuda(float *__restrict__ input, int outSizeX)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= outSizeX)
        return;
    input[i] = 0;
}

// Step 2
__global__ void activate_cuda_fc(const float *__restrict__ in, float *__restrict__ input, const float *__restrict__ weights,
                                 int totalInSize, int outSizeX, int weightSizeX)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int nx = blockDim.x * gridDim.x;
    int ny = blockDim.y * gridDim.y;

    for (int n = x; n < outSizeX; n += nx)
    {
        for (int i = y; i < totalInSize; i += ny)
        {
            atomicAdd(&input[n], in[i] * weights[i + n * weightSizeX]);
        }
    }
}

// Step 3
__global__ void perform_activation_function_cuda_fc(float *__restrict__ out, float *input,
                                                    int outSizeX)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= outSizeX)
        return;
    out[i] = activator_function_cuda(input[i]);
}
For reference, a profiler timeline of these kernel launches was attached here (screenshot not reproduced).
Thus, my question is: how can I combine these three kernels into one kernel which performs all of the three above mentioned steps?
Unless you are using a linear activation function, you can't "collapse" a sequence of fully connected layers like this.
Applying the weights and biases to the inputs of each layer is exactly the kind of trivially parallelizable linear algebra operation that is the bread and butter of GPUs. However, for that to work efficiently, you need to have all inputs of a layer ready before you launch it. Anything that precludes doing that operation in bulk will hurt performance immediately.
At the same time, since most activation functions introduce nonlinearity, they cannot be embedded directly into a linear algebra process, so you don't have much choice but to perform them separately.
However, there are still a lot of gains to be made in the code you posted. As I said, applying the weights and biases is the bread and butter of GPUs. In fact, it's effectively the exact same thing as transforming a vector by a matrix, but you are going about it in a rather roundabout way. Using a ready-made matrix-vector product such as cublasSgemv() would most likely give you some immediate benefits.
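As an illustration only (this sketch is mine, not part of the original answer), the cuBLAS route could look roughly like the following, assuming the same layout as in the posted kernels (the weights of output neuron n start at weights[n * weightSizeX], with weightSizeX >= totalInSize) and keeping the activation as a separate elementwise kernel:

// Hypothetical sketch: steps 1 and 2 collapse into a single cublasSgemv() call.
// Because beta == 0, the explicit zeroing kernel (step 1) is no longer needed.
#include <cublas_v2.h>

void fc_forward(cublasHandle_t handle,
                const float *in,      // device pointer, totalInSize elements
                const float *weights, // device pointer, row stride weightSizeX
                float *input,         // device accumulators, outSizeX elements
                float *out,           // device outputs, outSizeX elements
                int totalInSize, int outSizeX, int weightSizeX)
{
    const float alpha = 1.0f, beta = 0.0f;

    // cuBLAS is column-major, so viewing `weights` with lda == weightSizeX makes
    // column n hold the weights of output neuron n; the transposed product gives
    // input[n] = sum_i in[i] * weights[i + n * weightSizeX], as in step 2.
    cublasSgemv(handle, CUBLAS_OP_T,
                totalInSize, outSizeX,
                &alpha, weights, weightSizeX,
                in, 1,
                &beta, input, 1);

    // Step 3 stays as a small elementwise kernel, e.g. the one from the question:
    // perform_activation_function_cuda_fc<<<blocks, threads>>>(out, input, outSizeX);
}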
Addendum:
If you are using a linear activation function, then you are effectively computing y = A3 * L3 * A2 * L2 * A1 * L1 * x, where Ln is the matrix associated with a layer and the activation functions An are just scalars. You can premultiply all the A's and L's together ahead of time and treat the result as one big matrix multiplication.

Matrix multiplication very slow in Eigen

I have implemented a Gauss-Newton optimization process which involves calculating the increment by solving a linearized system Hx = b. The H matrix is calculated by H = J.transpose() * W * J and b is calculated from b = J.transpose() * (W * e), where e is the error vector. The Jacobian here is an n-by-6 matrix where n is in the thousands and stays unchanged across iterations, and W is an n-by-n diagonal weight matrix which will change across iterations (some diagonal elements will be set to zero). However, I encountered a speed issue.
When I do not add the weight matrix W, namely H = J.transpose()*J and b = J.transpose()*e, my Gauss-Newton process runs very fast, at 0.02 sec for 30 iterations. However, when I add the W matrix, which is defined outside the iteration loop, it becomes very slow (0.3~0.7 sec for 30 iterations) and I don't understand whether it is a problem with my code or whether it normally takes this long.
Everything here is Eigen matrices and vectors.
I defined my W matrix using the .asDiagonal() function from the Eigen library on a vector of inverse variances, and then just used it in the calculation of H and b. Then it gets very slow. I would like some hints about the potential reasons for this huge slowdown.
EDIT:
There are only two matrices. The Jacobian is definitely dense. The weight matrix is generated from a vector by the function vec.asDiagonal(), which comes from the dense library, so I assume it is also dense.
The code is really simple and the only difference that's causing the time change is the addition of the weight matrix. Here is a code snippet:
for (int iter = 0; iter < max_iter; ++iter) {
    // obtain error vector
    error = ...

    // calculate H and b - the fast one
    Eigen::MatrixXf H = J.transpose() * J;
    Eigen::VectorXf b = J.transpose() * error;

    // calculate H and b - the slow one
    Eigen::MatrixXf H = J.transpose() * weight_ * J;
    Eigen::VectorXf b = J.transpose() * (weight_ * error);

    // obtain delta and update state
    del = H.ldlt().solve(b);
    T <- T(del) // this is pseudo code, meaning update T with del
}
It is in a function in a class, and weight matrix now for debug purposes is defined as a class variable that can be accessed by the function and is defined before the function is called.
I guess that weight_ is declared as a dense MatrixXf? If so, then replace it with w.asDiagonal() everywhere you use weight_, or make the latter an alias to the asDiagonal expression:
auto weight = w.asDiagonal();
This way Eigen knows that weight is a diagonal matrix and the computations will be optimized as expected.
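To make that concrete, here is a small self-contained sketch of my own (sizes and data made up) showing the weighted normal equations written with the diagonal expression directly:

#include <Eigen/Dense>
#include <iostream>

int main() {
    const int n = 5000;
    Eigen::MatrixXf J = Eigen::MatrixXf::Random(n, 6);
    Eigen::VectorXf error = Eigen::VectorXf::Random(n);
    Eigen::VectorXf w = Eigen::VectorXf::Ones(n); // inverse variances

    // Keeping the weights as a vector and using asDiagonal() in the products
    // lets Eigen treat W as diagonal instead of as a dense n-by-n matrix.
    Eigen::MatrixXf H = J.transpose() * w.asDiagonal() * J;        // 6 x 6
    Eigen::VectorXf b = J.transpose() * (w.asDiagonal() * error);  // 6 x 1

    Eigen::VectorXf del = H.ldlt().solve(b);
    std::cout << del.transpose() << std::endl;
}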
Because the weight matrix is diagonal, you can replace the matrix product with a coefficient-wise multiplication, like so:
MatrixXd m;
VectorXd w;
w.setLinSpaced(5, 2, 6);
m.setOnes(5,5);
std::cout << (m.array().rowwise() * w.array().transpose()).matrix() << "\n";
Likewise, the matrix vector product can be written as:
(w.array() * error.array()).matrix()
This avoids the zero elements in the matrix. Without an MCVE for me to base this on, YMMV...

A better way to access n-d array element with a 1-d index array in C++?

Recently I have been working with C++ pointers, and this question came up when I wanted to access elements of a multi-dimensional array through a 1-dimensional array containing the indices.
Say I have an array arr, which is a 4-dimensional array with all elements set to 0 except that arr[1][2][3][4] is 1, and an array idx which contains the index for every dimension of arr. I can access this element by using arr[idx[0]][idx[1]][idx[2]][idx[3]], or by using *(*(*(*(arr + idx[0]) + idx[1]) + idx[2]) + idx[3]).
The problem is that when the number of dimensions is large this becomes unwieldy, so I wonder if there is a better way to handle multi-dimensional access?
#include <bits/stdc++.h>
using namespace std;

#define N 10

int main()
{
    int arr[N][N][N][N] = {0};
    int idx[4] = {1, 2, 3, 4};
    arr[1][2][3][4] = 1;
    cout << "Expected: " << arr[1][2][3][4] << " at " << &arr[1][2][3][4] << endl;
    cout << "Got with ****: ";
    cout << *(*(*(*(arr + idx[0]) + idx[1]) + idx[2]) + idx[3]) << endl;
    return 0;
}
output
Expected: 1 at 0x7fff54c61f28
Got with ****: 1
The way you construct your algorithm for indexing a multi-dimensional array will vary depending on the language of choice; you have tagged this question with both C and C++. I will stick with the latter since my answer pertains to C++. For a little while now I've been working on something similar but different, so this becomes an interesting question, as I was building a multipurpose multidimensional matrix class template.
What I have discovered about higher levels of multi dimensional vectors and matrices is that the order of 3 repetitiously works miracles in understanding the nature of higher dimensions. Think of this in the geometrical perspective before considering the algorithmic software implementation side of it.
Mathematically speaking, let's consider the lowest dimension, 0, with the first shape that is a 0-dimensional object. This happens to be any arbitrary point, where this point can have an infinite number of coordinate location properties: points such as p0(0), p1(1), p2(2,2), p3(3,3,3), ... pn(n,n,...,n), where each of these objects points to a specific locale with the defined number of dimensional attributes. This means that there is no linear distance such as length, width, or height, and conversely no magnitude in any direction or dimension; a shape or object that has no bounds of magnitude does not define any area, volume, or higher dimension of volume. Also, with these 0-dimensional points there is no awareness of direction, which also implies that there is no angle of rotation that defines magnitude. Another thing to consider is that any arbitrary point is also the zero vector. Another thing that helps in understanding this is the use of algebraic polynomials: f(x) = mx+b, which is linear, is a one-dimensional equation, shape (in this case a line) or graph; f(x) = x^2 is two-dimensional; f(x) = x^3 is three-dimensional; f(x) = x^4 is four-dimensional; and so on up to f(x) = x^n, which would be n-dimensional. Length or magnitude, direction or angle of rotation, area, volume, and the rest cannot be defined until you relate two distinct points to give you at least one line segment or vector with a specified direction. Once you have an implied direction you then have slope.
When looking at operations in mathematics, the simplest is addition, and it is nothing more than a linear translation. Once you introduce addition you also introduce all other operations such as subtraction, multiplication, division, powers, and radicals; once you have multiplication and division you define rotation, angles of rotation, area, volume, rates of change, and slope (also the tangent function), which thus defines geometry and trigonometry, which then also leads into integration and derivatives. Yes, we have all had our math lessons, but I think this is important to understanding how to construct the relationships of one order of magnitude to another, which will then help us work through higher-dimensional orders with ease once we know how to construct them. Once you understand that even the higher orders of operations are nothing more than expansions of addition and subtraction, you will begin to see that their continued operations are still linear in nature; it is just that they expand into multiple dimensions.
Early I stated that the order of 3 repetitiously works miracles so let me explain my meaning. Since we perceive things on a daily basis in the perspective of 3D; we can only visualize 3 distinct vectors that are orthogonal to each other giving you our natural 3 Dimensions of Space such as Left & Right, Forward & Backward giving you the Horizontal axis and planes and Up & Down giving you the Vertical axis and planes. We can not visualize anything higher so dimensions of the order of x^4, x^5, x^6 etc... we can not visualize but yet they do exist. If we begin to look at the graphs of the mathematical polynomials we can begin to see a pattern between odd and even functions where x^4, x^6, x^8 are similar where they are nothing more than expansions of x^2 and functions of x^5, x^7 & x^9 are nothing more than expansions of x^3. So I consider the first few dimensions as normal: Zero - Point, 1st - Linear, 2nd - Area, and 3rd - Volume and as for the 4th and higher dimensions I call all of them Volumetric.
So if you see me use Volume, it relates directly to the 3rd dimension, whereas if I refer to Volumetric, it relates to any dimension higher than the 3rd. Now let's consider a matrix such as you have seen in regular algebra, where the common matrices are defined by MxN. This is a 2D flat matrix that has M * N elements, and this matrix also has an area of M * N. Let's expand to a higher-dimensional matrix such as MxNxO: this is a 3D matrix with M * N * O elements and now has a volume of M * N * O. When you visualize this, think of the MxN 2D part as being a page of a book and the O component as representing each page of the book, or each slice of a box. The elements of these matrices can be anything from a simple value, to an applied operation, to an equation, a system of equations, sets, or just an arbitrary object as in a storage container. Now, when we have a matrix of the 4th order such as MxNxOxP, it has a 4th-dimensional aspect, but the easiest way to visualize this is as a 1-dimensional array or vector where each of its P elements is a 3D matrix with a volume of MxNxO. When you have a matrix of MxNxOxPxQ, you have a 2D area matrix of PxQ where each of those elements is an MxNxO volume matrix. Then again, if you have an MxNxOxPxQxR, you now have a 6th-dimensional matrix, and this time you have a 3D volume matrix where each of the PxQxR elements is in fact a 3D matrix of MxNxO. As you go higher and higher this pattern repeats and merges again. So the order of how arbitrary matrices behave is that these dimensionalities repeat: 1D gives linear vectors or matrices, 2D gives area or planar matrices, and 3D gives volume matrices, and anything higher repeats this process, compressing the previous step of volumes, hence the terminology of Volumetric matrices. Take a look at this table:
// Order of Magnitude And groupings
-----------------------------------
  Linear     Area      Volume
   x^1        x^2        x^3
   x^4        x^5        x^6
   x^7        x^8        x^9
   x^10       x^11       x^12
   ...        ...        ...
-----------------------------------
Now it is just a matter of using a little bit of calculus to know which order of magnitude to index into which higher level of dimensionality. Once you know a specific dimension, it is simple to take multiple derivatives to give you a linear expression, then traverse the space, then integrate to the same orders as the multiple derivatives to give the results. This should eliminate a good amount of intermediate work by at first ignoring the least significant lower dimensions in a high-dimensional order. If you are working with something that has 12 dimensions, you can assume that the first 3 dimensions that define the first set of volume are packed tight, being an element of another 3D volumetric matrix, and then once again that second order of volumetric matrix is itself an element of another 3D volumetric matrix. Thus we have a repeating pattern, and now it is just a matter of applying this to construct an algorithm; once you have an algorithm, it should be quite easy to implement the methods in any programming language. So you may have to have a 3-case switch to determine which algorithmic approach to use, knowing the overall dimensionality of your matrix or n-d array, where one case handles orders of linearity, another handles area, and the final one handles volumes; if they are of 4th order or higher, the overall process becomes recursive in nature.
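A concrete way to express that repeating stride structure in code (my own sketch, not part of the answer above) is to precompute the stride of each dimension and fold the index array into a single flat offset:

#include <cstddef>
#include <iostream>
#include <vector>

// Compute the flat offset of an n-d index into a row-major array with the
// given extents: offset = sum_i idx[i] * (product of the extents to its right).
std::size_t flat_index(const std::vector<std::size_t>& extents,
                       const std::vector<std::size_t>& idx)
{
    std::size_t offset = 0;
    std::size_t stride = 1;
    for (std::size_t i = extents.size(); i-- > 0; ) {
        offset += idx[i] * stride;
        stride *= extents[i];
    }
    return offset;
}

int main()
{
    int arr[10][10][10][10] = {0};
    arr[1][2][3][4] = 1;

    std::size_t off = flat_index({10, 10, 10, 10}, {1, 2, 3, 4});
    std::cout << *(&arr[0][0][0][0] + off) << std::endl; // prints 1
}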
I figured out a way to solve this myself.
The idea is to use void * pointers. We know that every memory cell holds a value or an address of a memory cell, so we can directly compute the offset of the target from the base address.
In this case, we use void *p = arr to get the base address of the n-d array, and then loop over the array idx to calculate the offset.
For arr[10][10][10][10], the offset between arr[0] and arr[1] is 10 * 10 * 10 * sizeof(int): since arr is 4-d, arr[0] and arr[1] are 3-d, so there are 10 * 10 * 10 = 1000 elements between arr[0] and arr[1]. After that, note that the step between two adjacent void * addresses is 1 byte, so we have to multiply by sizeof(int) to get the correct offset. With this, we finally get the exact address of the memory cell we want to access.
Finally, we have to cast the void * pointer to an int * pointer and dereference it to get the correct int value, that's it!
With void *(not so good)
#include <bits/stdc++.h>
using namespace std;

#define N 10

int main()
{
    int arr[N][N][N][N] = {0};
    int idx[4] = {1, 2, 3, 4};
    arr[1][2][3][4] = 1;
    cout << "Expected: " << arr[1][2][3][4] << " at " << &arr[1][2][3][4] << endl;
    cout << "Got with ****: ";
    cout << *(*(*(*(arr + idx[0]) + idx[1]) + idx[2]) + idx[3]) << endl;
    void *p = arr;
    for(int i = 0; i < 4; i++)
        p += idx[i] * int(pow(10, 3-i)) * sizeof(int);
    cout << "Got with void *:";
    cout << *((int*)p) << " at " << p << endl;
    return 0;
}
Output
Expected: 1 at 0x7fff5e3a3f18
Got with ****: 1
Got with void *:1 at 0x7fff5e3a3f18
Notice:
There is a warning when compiling it, but I choose to ignore it.
test.cpp: In function 'int main()':
test.cpp:23:53: warning: pointer of type 'void *' used in arithmetic [-Wpointer-arith]
p += idx[i] * int(pow(10, 3-i)) * sizeof(int);
Use char * instead of void *(better)
Since we want to manipulate pointer byte by byte, it would be better to use char * to replace void *.
#include <bits/stdc++.h>
using namespace std;

#define N 10

int main()
{
    int arr[N][N][N][N] = {0};
    int idx[4] = {1, 2, 3, 4};
    arr[1][2][3][4] = 1;
    cout << "Expected: " << arr[1][2][3][4] << " at " << &arr[1][2][3][4] << endl;
    char *p = (char *)arr;
    for(int i = 0; i < 4; i++)
        p += idx[i] * int(pow(10, 3-i)) * sizeof(int);
    cout << "Got with char *:";
    cout << *((int*)p) << " at " << (void *)p << endl;
    return 0;
}
Output
Expected: 1 at 0x7fff4ffd7f18
Got with char *:1 at 0x7fff4ffd7f18
With int *(In this specific case)
I have been told that using void * in arithmetic is not good practice; it would be better to use int *, so I cast arr to an int * pointer and also replaced pow.
#include <bits/stdc++.h>
using namespace std;

#define N 10

int main()
{
    int arr[N][N][N][N] = {0};
    int idx[4] = {1, 2, 3, 4};
    arr[1][2][3][4] = 1;
    cout << "Expected: " << arr[1][2][3][4] << " at " << &arr[1][2][3][4] << endl;
    cout << "Got with ****: ";
    cout << *(*(*(*(arr + idx[0]) + idx[1]) + idx[2]) + idx[3]) << endl;
    int *p = (int *)arr;
    int offset = 1e3;
    for(int i = 0; i < 4; i++)
    {
        p += idx[i] * offset;
        offset /= 10;
    }
    cout << "Got with int *:";
    cout << *p << " at " << p << endl;
    return 0;
}
Output
Expected: 1 at 0x7fff5eaf9f08
Got with ****: 1
Got with int *:1 at 0x7fff5eaf9f08

CUDA estimating threads per blocks and block numbers for 2D grid data

Let me start by saying that I've read carefully all similar questions on SO:
Determining threads per block and block per grid
Threads per SM, threads per block
CUDA Blocks and Threads
Warps and optimal number of blocks
My intention is to try to calculate the launch parameters dynamically (rather than hardcoding the values) for a feed-forward neural net library I am developing.
My data is not a square lattice (a matrix), as is often the case in most examples I've seen; it is instead two vectors producing a matrix, with unequal rows and columns:
float x[6] {1.f, 1.f, 0.f, 1.f, 1.f, 0.f};
thrust::device_vector<float> in_vec( x, x+6 );
float y[9] {1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f};
thrust::device_vector<float> w_vec( y, y+9 );
thrust::device_vector<float> o_wec(9);
thrust::device_vector<float> mtx_vec( 9 * 6 );
float * i_ptr = thrust::raw_pointer_cast( in_vec.data() );
float * w_ptr = thrust::raw_pointer_cast( w_vec.data() );
float * out_ptr = thrust::raw_pointer_cast( mtx_vec.data() );
dim3 threadsPerBlock(9,6);
dim3 numBlocks(1,1);
prop_mtx<<<numBlocks,threadsPerBlock>>>( w_ptr, i_ptr, out_ptr, 6 );
and the kernel:
__global__ void prop_mtx( float * w, float * i, float * o, int s )
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    o[y + x * s] = w[x] * i[y];
}
The reason why I've taken this approach is because it makes sense in ANN computation, when it comes to vector/matrix calculations.
I'd like to keep this consistent, and AFAIK using a 2D grid for Weight * Input calculations is reasonable.
I have to compute my threads per block as a 2D with unequal numbers of threads in the grid.
I am using a GTX 660, which has:
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2047 MBytes
( 5) Multiprocessors, (192) CUDA Cores/MP: 960 CUDA Cores
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
I am trying to understand how I can deduce/compute the grid size, threads per block, and number of blocks.
Let us assume I have a weight vector of 800 items, and an input vector of 6500 items.
Does this imply that what I really need is a 2D grid of 800 x 6500? As far as I understand, anything else will provide incorrect results?
I know my maximum threads per block is 1024, but because it's a 2D grid, it would more likely be:
dim3 threadPerBlock(X,Y);
Because my grid is not a square matrix, do I need to calculate the X, Y threads per block in a different way?
Or do I need to deduce the number of blocks needed first?
Finally, since my thread warp size is 32,
Does the minimum grid size, regardless of all other parameters need to be at least 32, or a multiple of 32? Do I need at least 32 threads per block, or a grid size where the smallest number is 32?
Any pseudo-code, or explanation of how I should go about this, would be greatly appreciated.
What I have tried is to calculate my 2D grid size by dividing my data by the warp size of 32.
Then I considered calculating the grid threads by using the available SMs. For example
800 weights / 5 SM, = 160 x's per SM
6500 inputs / 5 SM, = 1300 y's per SM
But I didn't know what to do from there on.
Finally, I considered finding the input-weight ratio first:
6500/800 = 8.125
Implying that, if I use the minimum size of 32 for X, Y would have to be 8.125 * 32 = 260.
Hence, my threadsPerBlock would be:
dim3 threadsPerBlock(32,260);
That is of course, 8320 threads per block, which far exceeds the 1024 per block.
So this is my issue: how do I not exceed the 1024 threads per block, whilst retaining the correct grid size of my data?
PS: My question is not about optimising the code, but understanding how to distribute the threads and grid data over the device.
One approach to categorizing computation problems is to discuss transformations and reductions.
A reduction is a category of problem which takes a large input data set size, and produces a small output data set size. For example, taking an image and finding the maximum pixel value would be a reduction. For this discussion, we will ignore reductions.
A transformation is a category of computation where the output data set size (number of elements) is either "large" or "approximately the same" as the input data set size. For example, taking an image and producing a blurred image would be a transformation.
For transformations, a common approach ("thread strategy") to writing a cuda kernel (the thread code) will be to make one unique thread responsible for each point in the output array. Therefore, the total minimum number of threads that I must have is equal to the size of my output array. The thread code is just the set of computations needed on the input data, in order to produce one output data point. Roughly speaking then, your problem, and simplified kernel, fit this definition; it is a transformation.
Following the above thread strategy, we will need a total number of threads in our grid equal to the total number of output points I need to create. For 2D problems, it is often convenient to think about these two-dimensionally, and CUDA provides 2D (or 3D) threadblock organization and 2D (or 3D) grid organization, for this purpose.
Choice of CUDA threadblock dimensions is often somewhat arbitrary. Generally speaking, we typically want to aim for threadblocks in the 128 - 512 threads per block range (for reasons that are covered elsewhere) and we want threadblocks that are whole-number multiples of 32 (the warp size) for efficiency when the threadblock gets subdivided into warps, which are the actual unit of CUDA execution. On currently supported GPUs, threadblocks are limited to 1024 threads per block (total - i.e. the product of the dimensions). However, for many problems, threadblock choices within this range (e.g. 256 threads vs. 512 threads) often have relatively little impact on performance. In the interest of getting something working, we don't sweat the details at this point. (When you're coming back for optimization, you may revisit this choice.)
So far we've learned that for this problem type, we need a total number of threads to cover our problem space, and we will have a somewhat arbitrary threadblock dimension choice. So let's choose (32,16) (x,y) to start with, for a total of 512 threads. There are no rules that state that theadblocks need be "square", or that grids need be "square", or that there should even be any sort of ratiometric parity between threadblock dimensions and problem size (or grid dimensions.)
Now that we have a threadblock choice of (32,16) in mind, we must ask ourselves "how many of these do I need?". This problem is 2D and so we've chosen a 2D threadblock for simplicity of index generation in the thread code. Let's choose a 2D grid as well - it makes sense for a 2D problem, and again for 2D simplicity of index generation. So we can consider the two dimensions independently.
So, how many blocks do I need in the x-direction? I need at least as many as (my problem size in x)/(my threadblock size in x). Since we are dealing with all integers here, this begs the question "what if my problem size is not evenly divisible by my threadblock size?" The canonical solution is to launch more than enough threads to cover the space, or enough blocks to cover the space. But in the non-evenly-divisible case, this will result in "extra threads". We'll discuss and deal with these shortly. Therefore, if I have a dim3 variable like this for threadblock dimensions:
#define BX 32
#define BY 16
...
dim3 block(BX,BY);
then I might construct my dim3 grid variable like this:
#define DX 800
#define DY 6500
...
dim3 grid((DX+block.x-1)/block.x, (DY+block.y-1)/block.y);
If you work through this arithmetic, you will see that this causes us to launch enough blocks in the x and y direction, so that we will have at least enough threads to cover our problem space of (DX,DY), one thread per output point.
Hopefully it is clear that the Y dimension is treated separately and independently from the x-dimension.
The above calculations will usually result in the generation of "too many" threads in my grid. I will have some "extra threads" beyond the end of my problem space (DX, DY) that I need to handle. We want these threads to "do nothing". The canonical way to handle this, is to pass the problem space dimensions to my kernel, create an appropriate globally unique thread index in my kernel, then compare that index to the maximum index in my problem space. If it exceeds it, we simply have that thread skip all remaining thread code.
Using your kernel as an example, it might look like this:
__global__ void prop_mtx( float * w, float * i, float * o, int s, const size_t d_size_x, const size_t d_size_y )
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if ((x < d_size_x) && (y < d_size_y)) // thread check
        o[y + x * s] = w[x] * i[y];
}
Note that such a thread check will create threads (in some blocks) that are "not participating" in the subsequent code. A point to be aware of here is that the usage of __syncthreads() depends on all threads in a block participating. Therefore, we should not use __syncthreads() directly in such a case. Instead, we have to condition threadblock behavior appropriately:
__global__ void prop_mtx( float * w, float * i, float * o, int s, const size_t d_size_x, const size_t d_size_y )
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if ((x < d_size_x) && (y < d_size_y)) // thread check
    {
        o[y + x * s] = w[x] * i[y];
        // and other code not dependent on __syncthreads()
    }

    // now it is safe to use since all threads are participating
    __syncthreads();

    if ((x < d_size_x) && (y < d_size_y)) // thread check
    {
        // rest of kernel code
    }
}
Note that it is possible to have a smaller number of threads perform the necessary computations for a larger number of output data points. The 1:1 correspondence between threads and output data is an easy way to think about and write the cuda kernel code, but it's not the only way. One other possible method would be to use some form of a grid-striding loop, so that a smaller grid can cover a larger problem space. Discussion of those strategies is outside the scope of this answer, and the basic methodology discussed in this answer should be understood before tackling other approaches.
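Purely as a pointer for later (this sketch is mine and, as noted above, outside the scope of the original answer), a grid-stride version of the same kernel would let a fixed-size grid cover an arbitrary (d_size_x, d_size_y) problem space:

// Hypothetical sketch of a grid-stride variant: each thread walks the output
// space with strides equal to the total grid width and height.
__global__ void prop_mtx_strided( float * w, float * i, float * o, int s,
                                  const size_t d_size_x, const size_t d_size_y )
{
    for (size_t x = blockIdx.x * blockDim.x + threadIdx.x; x < d_size_x;
         x += (size_t)blockDim.x * gridDim.x)
    {
        for (size_t y = blockIdx.y * blockDim.y + threadIdx.y; y < d_size_y;
             y += (size_t)blockDim.y * gridDim.y)
        {
            o[y + x * s] = w[x] * i[y];
        }
    }
}

With this form the grid dimensions become a tuning knob rather than a correctness requirement: a launch such as prop_mtx_strided<<<dim3(25, 100), dim3(32, 16)>>>(...) would still cover the full 800 x 6500 space.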

CUDA Thread IDs

I'm new to CUDA programming and I have the following problem.
If I use the following code to perform matrix multiplication, since CUDA uses Cartesian indexing for thread indexing and C/C++ use row major indexing for matrices, wouldn't it influence the accuracy of the calculation?
__global__ void gpuMM(float *A, float *B, float *C, int N)
{
    // Matrix multiplication for NxN matrices C = A * B
    // Each thread computes a single element of C
    int col = blockIdx.y*blockDim.y + threadIdx.y;
    int row = blockIdx.x*blockDim.x + threadIdx.x;

    float sum = 0.f;
    for (int n = 0; n < N; ++n)
        sum += A[row*N+n]*B[n*N+col];

    C[row*N+col] = sum;
}
CUDA doesn't imply any memory storage structure. You can say CUDA C is row-major for matrix storage, but that is due to C, not CUDA. (CUDA Fortran would be column-major.) Thread indexing dimensions are arbitrary. They do not imply a data storage order in memory.
Implications about data storage order in memory of course arise as you write your code. From a correctness standpoint, it does not matter if we assign row indices based on x thread dimensions or on y thread dimensions. You can write correct code for this matrix multiply example using either approach (either row based on x, or else row based on y).
However, from a coalescing standpoint, we generally want adjacent executing threads to read or write adjacent cells in memory. Adjacent threads (for execution) typically are grouped in x first. Therefore this is preferable (for your kernel code):
int row = blockIdx.y*blockDim.y + threadIdx.y;
int col = blockIdx.x*blockDim.x + threadIdx.x;
because it will allow the read of B[] and the write of C[] to coalesce.
This is easy to prove to yourself. Try it both ways, and measure the execution time of the kernel. The results are correct (match the results produced using a host-based matrix multiply) either way, but one formulation runs significantly faster than the other.
This is especially easy to try, since your kernel code implies square matrices.
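As an illustration only (this harness is mine, not part of the original answer; kernel names and sizes are made up), a self-contained way to run that comparison with CUDA events might look like this:

#include <cstdio>
#include <cuda_runtime.h>

// Same indexing as the question's kernel: row from x, col from y.
__global__ void gpuMM_rowX(const float *A, const float *B, float *C, int N)
{
    int col = blockIdx.y*blockDim.y + threadIdx.y;
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    float sum = 0.f;
    for (int n = 0; n < N; ++n)
        sum += A[row*N+n]*B[n*N+col];
    C[row*N+col] = sum;
}

// The ordering suggested above: row from y, col from x (coalesced B reads and C writes).
__global__ void gpuMM_rowY(const float *A, const float *B, float *C, int N)
{
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    float sum = 0.f;
    for (int n = 0; n < N; ++n)
        sum += A[row*N+n]*B[n*N+col];
    C[row*N+col] = sum;
}

int main()
{
    const int N = 1024; // divisible by the block dimensions below, so no bounds check needed
    float *A, *B, *C;
    cudaMalloc(&A, N*N*sizeof(float));
    cudaMalloc(&B, N*N*sizeof(float));
    cudaMalloc(&C, N*N*sizeof(float));
    cudaMemset(A, 0, N*N*sizeof(float));
    cudaMemset(B, 0, N*N*sizeof(float));

    dim3 block(16, 16);
    dim3 grid(N/block.x, N/block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.f;

    cudaEventRecord(start);
    gpuMM_rowX<<<grid, block>>>(A, B, C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("row from x: %f ms\n", ms);

    cudaEventRecord(start);
    gpuMM_rowY<<<grid, block>>>(A, B, C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("row from y: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}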