Is something like this possible in CUDA - c++

Let's say I have a matrix with values of 0 or 1. Is it possible in CUDA to do something like this:
__global__ void kernel(float *matrix, float *count) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int column = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= MATRIXSIZE || column >= MATRIXSIZE)
        return;
    if (matrix[MATRIXSIZE * row + column] == 1)
        count[0]++; // unsynchronized increment, which is what the question is about
}
So in the end I get the number of ones in the matrix. I know this is a very simple example, but if this is possible, then so are other variants ...

There are highly optimized libraries for CUDA that perform these types of operations, called reductions. Look into CUDA Thrust or CUB. In Thrust, you can use reduce to sum up all the values, or count to count the number of instances of a particular value.
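To illustrate what those two operations compute, here is a rough host-side C++ sketch using the standard library algorithms std::count and std::accumulate. thrust::count and thrust::reduce follow the same iterator-based pattern on device data. The helper names count_ones and sum_all and the parameter n are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: counts entries equal to 1 in an n-by-n matrix,
// the host-side analogue of counting a particular value.
long count_ones(const std::vector<float>& matrix, int n) {
    assert(matrix.size() == static_cast<std::size_t>(n) * n);
    return static_cast<long>(std::count(matrix.begin(), matrix.end(), 1.0f));
}

// Hypothetical helper: sums all entries, the host-side analogue of a sum reduction.
// For a matrix that only holds 0 and 1, the sum equals the number of ones.
float sum_all(const std::vector<float>& matrix, int n) {
    assert(matrix.size() == static_cast<std::size_t>(n) * n);
    return std::accumulate(matrix.begin(), matrix.end(), 0.0f);
}
```

For a 0/1 matrix both helpers give the same answer, which is why either reduce or count works for this question.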

If you really want to do this yourself, you should use an atomic add, atomicAdd: atomicAdd(&count[0], 1.0f)
But this may cause performance issues, because every thread that finds a 1 contends for the same memory location.


Combine neural network layer kernels into one kernel CUDA

I am working on a CUDA implementation of a neural network and I'm wondering how the calculations within a fully connected layer can be optimized more.
My current CUDA kernel for a fully connected layer in a neural network consists of the following steps:
1. Set the output neuron accumulators (input) to 0
2. Multiply the output data from the previous layer (in) with the weights of the current layer and sum the result in the accumulator
3. Calculate the output of the current layer (out) by applying an activation function to the accumulated data
These are the general steps in a single layer of a neural network, but they are currently (see below) implemented as separate kernels. For small output sizes (outSizeX equals 10, for example), the first and third steps are relatively slow, especially combined with launching the three kernels.
Thus, my question is: how can I combine these three kernels into one kernel which performs all of the three above mentioned steps?
// Step 1
__global__ void set_to_zero_cuda(float *__restrict__ input, int outSizeX) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= outSizeX)
        return;
    input[i] = 0;
}
// Step 2
__global__ void activate_cuda_fc(const float *__restrict__ in, float *__restrict__ input, const float *__restrict__ weights,
                                 int totalInSize, int outSizeX, int weightSizeX) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int nx = blockDim.x * gridDim.x;
    int ny = blockDim.y * gridDim.y;
    for (int n = x; n < outSizeX; n += nx)
        for (int i = y; i < totalInSize; i += ny)
            atomicAdd(&input[n], in[i] * weights[i + n * weightSizeX]);
}
// Step 3
__global__ void perform_activation_function_cuda_fc(float *__restrict__ out, float *input,
                                                    int outSizeX) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= outSizeX)
        return;
    out[i] = activator_function_cuda(input[i]);
}
For reference, the current profile looks like this:
Unless you are using a linear activation function, you can't "collapse" a sequence of fully connected layers like this.
Applying the weights and biases to the inputs of each layer is exactly the kind of trivially parallelizable linear algebra operation that is the bread and butter of GPUs. However, for that to work efficiently, you need to have all inputs of a layer ready before you launch it. Anything that precludes doing that operation in bulk will hurt performance immediately.
At the same time, since most activation functions introduce nonlinearity, they cannot be embedded directly into a linear algebra process, so you don't have much choice but to perform them separately.
However, there are still a lot of gains to be made in the code you posted. As I said, applying the weights and biases is the bread and butter of GPUs. In fact, it's effectively the exact same thing as transforming a vector by a matrix, but you are going about that in a rather roundabout way. Using a ready-made M*V function such as cublasSgemv() would most likely give you some immediate benefits.
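To make the matrix-vector view concrete, here is a host-side C++ reference sketch (not the asker's kernels, and not cuBLAS itself) of what such a routine computes for this layer, followed by a separate elementwise activation pass. The weight layout weights[i + n * weightSizeX] is taken from the question. The function names and the ReLU used in apply_activation are assumptions standing in for activator_function_cuda.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference for acc = W * in, i.e. one dot product per output neuron.
// Layout assumption (from the question): weights[i + n * weightSizeX].
std::vector<float> fc_forward(const std::vector<float>& in,
                              const std::vector<float>& weights,
                              int outSizeX, int weightSizeX) {
    const int totalInSize = static_cast<int>(in.size());
    assert(totalInSize <= weightSizeX);
    assert(weights.size() >= static_cast<std::size_t>(outSizeX) * weightSizeX);
    std::vector<float> acc(outSizeX);
    for (int n = 0; n < outSizeX; ++n) {
        float sum = 0.0f;  // local accumulator: no zeroing pass and no atomics needed
        for (int i = 0; i < totalInSize; ++i)
            sum += in[i] * weights[i + n * weightSizeX];
        acc[n] = sum;
    }
    return acc;
}

// Separate elementwise activation step; ReLU is only an assumed placeholder.
void apply_activation(std::vector<float>& v) {
    for (float& x : v)
        x = x > 0.0f ? x : 0.0f;
}
```

A GEMV routine computes y = alpha * A * x + beta * y, so calling it with beta = 0 overwrites the output and makes the separate set-to-zero kernel unnecessary.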
If you are using a linear activation function, then you are effectively doing y = A3 * L3 * A2 * L2 * A1 * L1 * x, where Ln is the matrix associated with layer n and the activation functions An are just scalars. You can premultiply all the A's and L's together ahead of time and treat it as one big matrix multiplication.
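As a small sanity check of that premultiplication claim, here is a host-side C++ sketch for two layers with scalar activations a1 and a2. It compares the layer-by-layer result with the result of one collapsed matrix a2 * a1 * L2 * L1. All names (Matrix, mat_vec, mat_mul, forward_layered, collapse) are made up for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vector = std::vector<float>;
using Matrix = std::vector<Vector>;  // row-major: m[row][col]

Vector mat_vec(const Matrix& m, const Vector& x) {
    Vector y(m.size(), 0.0f);
    for (std::size_t r = 0; r < m.size(); ++r) {
        assert(m[r].size() == x.size());
        for (std::size_t c = 0; c < x.size(); ++c)
            y[r] += m[r][c] * x[c];
    }
    return y;
}

Matrix mat_mul(const Matrix& a, const Matrix& b) {
    assert(!a.empty() && !b.empty() && a[0].size() == b.size());
    Matrix out(a.size(), Vector(b[0].size(), 0.0f));
    for (std::size_t r = 0; r < a.size(); ++r)
        for (std::size_t c = 0; c < b[0].size(); ++c)
            for (std::size_t k = 0; k < b.size(); ++k)
                out[r][c] += a[r][k] * b[k][c];
    return out;
}

// Layer by layer: y = a2 * L2 * (a1 * L1 * x)
Vector forward_layered(const Matrix& L1, float a1, const Matrix& L2, float a2,
                       const Vector& x) {
    Vector h = mat_vec(L1, x);
    for (float& v : h) v *= a1;
    Vector y = mat_vec(L2, h);
    for (float& v : y) v *= a2;
    return y;
}

// Precomputed once: M = (a2 * a1) * L2 * L1, so that y = M * x
Matrix collapse(const Matrix& L1, float a1, const Matrix& L2, float a2) {
    Matrix m = mat_mul(L2, L1);
    for (Vector& row : m)
        for (float& v : row) v *= a2 * a1;
    return m;
}
```

Note the order: the later layer's matrix goes on the left, because it is applied last.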

Need help understanding how to work with 2D/3D glyphs

Here's the code snippet I'd like help understanding
for (i = 0; i < samplesX; i++)
    for (j = 0; j < samplesY; j++)
    {
        newI = DIM * i / samplesX;
        newJ = DIM * j / samplesY;
        idx = (round(newJ) * DIM) + round(newI);
        if (color_dir == 1 && draw_vecs == 1) {
            direction_to_color(vx[idx], vy[idx], color_dir);
        }
        if (color_dir == 1 && draw_vecs == 2) {
            direction_to_color(fx[idx], fy[idx], color_dir);
        }
        else if (color_dir == 2) {
            scalar = rho[idx];
            set_colormap(scalar, min, max, clampLow, clampHigh);
        }
        else if (color_dir == 3) {
            scalar = sqrt(vx[idx] * vx[idx] + vy[idx] * vy[idx]);
            set_colormap(scalar, min, max, clampLow, clampHigh);
        }
        else if (color_dir == 4) {
            scalar = sqrt(fx[idx] * fx[idx] + fy[idx] * fy[idx]);
            set_colormap(scalar, min, max, clampLow, clampHigh);
        }
        /*if (draw_vecs == 1) {
            glVertex2f(wn + (fftw_real)newI * wn, hn + (fftw_real)newJ * hn);
            glVertex2f((wn + (fftw_real)newI * wn) + vec_scale * vx[idx], (hn + (fftw_real)newJ * hn) + vec_scale * vy[idx]);
        }
        else if (draw_vecs == 2) {
            glVertex2f(wn + (fftw_real)newI * wn, hn + (fftw_real)newJ * hn);
            glVertex2f((wn + (fftw_real)newI * wn) + vec_scale * fx[idx], (hn + (fftw_real)newJ * hn) + vec_scale * fy[idx]);
        }*/
        if (draw_vecs == 1) {
            glVertex2f(wn + (fftw_real)i * wn, hn + (fftw_real)j * hn);
            glVertex2f((wn + (fftw_real)i * wn) + vec_scale * vx[idx], (hn + (fftw_real)j * hn) + vec_scale * vy[idx]);
        }
        else if (draw_vecs == 2) {
            glVertex2f(wn + (fftw_real)i * wn, hn + (fftw_real)j * hn);
            glVertex2f((wn + (fftw_real)i * wn) + vec_scale * fx[idx], (hn + (fftw_real)j * hn) + vec_scale * fy[idx]);
        }
    }
What this currently does, as far as my understanding goes, is display these two-dimensional lines/arrows (hedgehogs) that visualize force/velocity in 2D as can be seen in the picture below.
Sadly, my understanding of linear algebra, calculus and computer graphics in general only goes so far and I'm having trouble dissecting this piece.
Ideally I'd like to understand this and also understand how I can take this pre-existing code and also add in functionality that can display two other glyph types that show a vector and/or scalar field such as
three-dimensional cones
three-dimensional ellipsoids
If I'm missing anything here, please let me know!
Some of the variables included in the above snippet:
const int DIM = 50; //size of simulation grid
int color_dir = 0; //use direction color-coding or not
float scalar;
int newI, newJ;
float temp;
float vec_scale = 1000; //scaling of hedgehogs
int draw_vecs = 1; //draw the vector field or not
The code snippet you have there could have been written more simply (also, it takes some educated guessing to work out what some of the variables and functions mean).
Let's break it down.
The first two lines are easy to understand; they're the standard stanza for iterating over a 2D array:
for (i = 0; i < samplesX; i++)
for (j = 0; j < samplesY; j++)
i and j are running indices that iterate over every discrete coordinate tuple (i, j) ∈ [0, samplesX) × [0, samplesY). The next two lines remap the 2D indices into a new value range, specifically [0, samplesX) × [0, samplesY) → [0, DIM) × [0, DIM). A missing piece of information is the type of DIM. It would make sense for it to be some floating point type.
newI = DIM * i / samplesX;
newJ = DIM * j / samplesY;
The next line is bug-prone. It translates newI and newJ into a running index into a 1D array that is conceptually addressed by i and j.
Why is this problematic? Because in the conversion to DIM-space, information may have been lost. This kind of information loss may lead to security bugs(!). As a matter of fact, Skia, the rendering library used by Google Chrome, Android and other projects, had exactly this kind of bug recently; the writeup is a worthwhile read.
The correct way to implement this is to have DIM be an integer and perform fixed point arithmetic on it, eventually truncating the fractional digits. But I digress. The next block is essentially performing a poor man's lookup table lookup. vx, vy and fx, fy are some flattened 2D arrays, accessed through a 1D index, and direction_to_color presumably maps either pair of values to a call of glColor; the same probably also goes for set_colormap. This is a bad use of OpenGL.
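To make the remapping and flattening discussed above concrete, here is a minimal C++ sketch that uses plain integer arithmetic with truncation, the simplest form of the integer-based approach. The helper names remap_index and flat_index are hypothetical, and the row-major layout is inferred from the idx computation in the question.

```cpp
#include <cassert>

// Maps a sample index s in [0, samples) to a grid index in [0, dim).
// Integer division truncates, and because s < samples the result is always < dim,
// so no rounding step can push the index out of range.
int remap_index(int s, int samples, int dim) {
    assert(samples > 0 && dim > 0);
    assert(s >= 0 && s < samples);
    return static_cast<int>(static_cast<long long>(dim) * s / samples);
}

// Flattens a 2D grid coordinate into an index into a row-major dim*dim array.
int flat_index(int gridI, int gridJ, int dim) {
    assert(gridI >= 0 && gridI < dim);
    assert(gridJ >= 0 && gridJ < dim);
    return gridJ * dim + gridI;
}
```

With samples = 20 and dim = 50, the last sample (s = 19) lands on grid index 47, which is still a valid index for vx, vy, fx, fy and rho.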
The whole remapping from i and j to DIM and then the lookups are just a poor implementation of a texture lookup. OpenGL already has textures. Just load it as a texture coordinate array and enable texturing.
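Finally, for each spine, two calls of glVertex are made: one with the starting point, which lies on the sample grid at (wn + i * wn, hn + j * hn), and one with that point offset by the scaled vector, vec_scale * (vx[idx], vy[idx]) or vec_scale * (fx[idx], fy[idx]).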
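The two endpoints of one hedgehog line can be written down as a small C++ sketch. The arithmetic mirrors the glVertex2f calls in the question. The Segment struct and the hedgehog_segment function are hypothetical names, and (vecX, vecY) stands for either vx[idx], vy[idx] or fx[idx], fy[idx].

```cpp
#include <cassert>

// One line glyph: from (x0, y0) on the sample grid to (x1, y1).
struct Segment {
    float x0, y0, x1, y1;
};

// wn and hn are the cell width and height used in the question's drawing code.
Segment hedgehog_segment(int i, int j, float wn, float hn, float vec_scale,
                         float vecX, float vecY) {
    assert(i >= 0 && j >= 0);
    Segment s;
    s.x0 = wn + static_cast<float>(i) * wn;  // base point of the glyph
    s.y0 = hn + static_cast<float>(j) * hn;
    s.x1 = s.x0 + vec_scale * vecX;          // tip: base plus scaled vector
    s.y1 = s.y0 + vec_scale * vecY;
    return s;
}
```

The base point and the direction from base to tip are presumably also the inputs that a cone or ellipsoid glyph would need for its placement and orientation.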
My verdict on that code: Utter garbage! All of this could have been done far more elegantly, even back in 1994 with OpenGL-1.0, which this code seems to have been written for. If you want to implement your own vector field plot, don't use this as a starting point.
These days we have programmable GPUs with shaders. All of that bulk up there can be done in a few lines of shader code.

Karatsuba - polynomials multiplication with CUDA

I'm using CUDA for the iterative Karatsuba algorithm and I would like to ask why one line always computes a different result.
First, I implemented this function, which computed the result always correctly:
__global__ void kernel_res_main(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if( i > 0 && i < resultSize - 1){
        TYPE start = (i >= size) ? (i % size ) + 1 : 0;
        TYPE end = (i + 1) / 2;
        for(TYPE inner = start; inner < end; inner++){
            result[i] += ( A[inner] + A[i - inner] ) * ( B[inner] + B[i - inner] );
            result[i] -= ( D[inner] + D[i-inner] );
        }
    }
}
Now I would like to use the 2D grid and use CUDA for the for-loop, so I changed my function to this:
__global__ void kernel_res_nested(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    TYPE rtmp = result[i];
    if( i > 0 && i < resultSize - 1){
        TYPE start = (i >= size) ? (i % size ) + 1 : 0;
        TYPE end = (i + 1) >> 1;
        if(j >= start && j <= end ){
            rtmp += ( A[j] + A[i - j] ) * ( B[j] + B[i - j] ) - ( D[j] + D[i - j] );
        }
    }
    result[i] = rtmp;
}
I am calling this function like this:
dim3 block( 32, 8 );
dim3 grid( (resultSize+1/32) , (resultSize+7/8) );
kernel_res_nested <<<grid, block>>> (devA, devB, devD, devResult, size, resultSize);
And the result is always wrong and always different. I can't understand why the second implementation is wrong and always computes wrong results. I can't see any logical problem connected with data dependency. Does anyone know how I can solve this problem?
For questions like this, you are supposed to provide a MCVE. (See item 1 here) For example, I don't know what type is indicated by TYPE, and it does matter for the correctness of the solution I will propose.
In your first kernel, only one thread in your entire grid was reading and writing location result[i]. But in your second kernel, you now have multiple threads writing to the result[i] location. They are conflicting with each other. CUDA doesn't specify the order in which threads will run, and some may run before, after, or at the same time as, others. In this case, some threads may read result[i] at the same time as others. Then, when the threads write their results, they will be inconsistent. And it may vary from run-to-run. You have a race condition there (execution order dependency, not data dependency).
The canonical method to sort this out would be to employ a reduction technique.
However for simplicity, I will suggest that atomics could help you sort it out. This is easier to implement based on what you have shown, and will help confirm the race condition. After that, if you want to try a reduction method, there are plenty of tutorials for that (one is linked above) and plenty of questions here on the cuda tag about it.
You could modify your kernel to something like this, to sort out the race condition:
__global__ void kernel_res_nested(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    if( i > 0 && i < resultSize - 1){
        TYPE start = (i >= size) ? (i % size ) + 1 : 0;
        TYPE end = (i + 1) >> 1;
        if(j >= start && j < end ){ // see note below
            atomicAdd(result+i, (( A[j] + A[i - j] ) * ( B[j] + B[i - j] ) - ( D[j] + D[i - j] )));
        }
    }
}
Note that depending on your GPU type, and the actual type of TYPE you are using, this may not work (may not compile) as-is. But since you had previously used TYPE as a loop variable, I am assuming it is an integer type, and the necessary atomicAdd for those should be available.
A few other comments:
This may not be giving you the grid size you expect:
dim3 grid( (resultSize+1/32) , (resultSize+7/8) );
I think the usual calculations there would be:
dim3 grid( (resultSize+31)/32, (resultSize+7)/8 );
I always recommend proper CUDA error checking and running your codes with cuda-memcheck, any time you are having trouble with a CUDA code, to make sure there are no runtime errors.
It also looks to me like this:
if(j >= start && j <= end ){
should be this:
if(j >= start && j < end ){
to match your for-loop range. I am also making an assumption that size is less than resultSize (again, a MCVE would help).
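To see why the grid-size expression from the question does not do what was probably intended, here is a small C++ sketch of the arithmetic. ceil_div is the usual round-up division. The two as_written helpers (hypothetical names) reproduce the question's expressions, where 1/32 and 7/8 are integer divisions that both evaluate to 0.

```cpp
#include <cassert>

// Number of blocks needed to cover n elements with blockSize threads per block.
int ceil_div(int n, int blockSize) {
    assert(n >= 0 && blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// The expressions as written in the question: the division binds tighter than
// the addition, so 1 / 32 == 0 and 7 / 8 == 0 and nothing is divided at all.
int as_written_x(int resultSize) { return (resultSize + 1 / 32); }
int as_written_y(int resultSize) { return (resultSize + 7 / 8); }
```

For resultSize = 100, the expressions as written launch a 100 x 100 grid of blocks, while the corrected form gives 4 x 13. The oversized grid is still covered by the range checks on i and j in the atomicAdd version of the kernel, but it launches far more threads than needed.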

Min of array rows in CUDA

Given an n-by-m matrix, I would like to build an n-sized vector containing the minimums of each matrix row, in CUDA.
So far I've come up with this:
__global__ void OnMin(float * Mins, const float * Matrix, const int n, const int m) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        Mins[i] = Matrix[m * i];
        for (int j = 1; j < m; ++j){
            if (Matrix[m * i + j] < Mins[i])
                Mins[i] = Matrix[m * i + j];
        }
    }
}
called in:
OnMin<<<(n + TPB - 1) / TPB, TPB>>>(Mins, Matrix, n, m);
However I think that something more optimized could exist.
I tried invoking cublasIsamin in a loop, but it is slower.
I also tried launching a kernel (__global__) from the OnMin kernel without success... (sm_35, compute_35 raises compile errors... I have a GTX670)
Any ideas?
Finding the min of array rows in a row-major matrix is a parallel reduction question that has been discussed many times on Stack Overflow. For example, this one:
Reduce matrix rows with CUDA
The basic idea is to use n blocks in a grid. Each block contains a fixed number of threads, typically 256. Each block of threads will do the parallel reduction on a row of the m elements to find the minimum collaboratively.
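As a rough illustration of that scheme, here is a sequential host-side C++ emulation: one "block" per row, a strided first pass in which each "thread" keeps a private minimum, then a tree reduction that halves the number of active threads each step. This is only a sketch of the access pattern, not a tuned kernel. The names row_min_block and row_mins are made up, and a power-of-two thread count is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Emulates one block of `threads` cooperating threads reducing one row of m elements.
float row_min_block(const float* row, int m, int threads) {
    assert(m > 0);
    assert(threads > 0 && (threads & (threads - 1)) == 0);  // power of two
    // Stands in for the block's shared memory, one slot per thread.
    std::vector<float> shared(threads, std::numeric_limits<float>::max());
    // Phase 1: thread t visits elements t, t + threads, t + 2*threads, ...
    for (int t = 0; t < threads; ++t)
        for (int j = t; j < m; j += threads)
            shared[t] = std::min(shared[t], row[j]);
    // Phase 2: tree reduction. On a GPU each step would be separated by a
    // block-wide barrier (__syncthreads()).
    for (int stride = threads / 2; stride > 0; stride /= 2)
        for (int t = 0; t < stride; ++t)
            shared[t] = std::min(shared[t], shared[t + stride]);
    return shared[0];
}

// One block per row of a row-major n-by-m matrix.
std::vector<float> row_mins(const std::vector<float>& matrix, int n, int m,
                            int threads = 256) {
    assert(matrix.size() == static_cast<std::size_t>(n) * m);
    std::vector<float> mins(n);
    for (int i = 0; i < n; ++i)
        mins[i] = row_min_block(matrix.data() + static_cast<std::size_t>(i) * m, m, threads);
    return mins;
}
```

In phase 1, neighbouring threads touch neighbouring elements of the row, which is the access pattern that coalesces well on the GPU.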
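For a large enough matrix where the GPU can be fully utilized, the kernel is memory-bound. It only has to read each element once, whereas a copy both reads and writes every element, so the best achievable time is about half the time of copying the matrix once.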

CUDA Thread IDs

I'm new to CUDA programming and I have the following problem.
If I use the following code to perform matrix multiplication, since CUDA uses Cartesian indexing for thread indexing and C/C++ use row major indexing for matrices, wouldn't it influence the accuracy of the calculation?
__global__ void gpuMM(float *A, float *B, float *C, int N)
{
    // Matrix multiplication for NxN matrices C=A*B
    // Each thread computes a single element of C
    int col = blockIdx.y*blockDim.y + threadIdx.y;
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    float sum = 0.f;
    for (int n = 0; n < N; ++n)
        sum += A[row*N+n]*B[n*N+col];
    C[row*N+col] = sum;
}
CUDA doesn't imply any memory storage structure. You can say CUDA C is row-major for matrix storage, but that is due to C, not CUDA. (CUDA Fortran would be column-major.) Thread indexing dimensions are arbitrary. They do not imply a data storage order in memory.
Implications about data storage order in memory of course arise as you write your code. From a correctness standpoint, it does not matter if we assign row indices based on x thread dimensions or on y thread dimensions. You can write correct code for this matrix multiply example using either approach (either row based on x, or else row based on y).
However, from a coalescing standpoint, we generally want adjacent executing threads to read or write adjacent cells in memory. Adjacent threads (for execution) typically are grouped in x first. Therefore this is preferable (for your kernel code):
int row = blockIdx.y*blockDim.y + threadIdx.y;
int col = blockIdx.x*blockDim.x + threadIdx.x;
because it will allow the read of B[] and the write of C[] to coalesce.
This is easy to prove to yourself. Try it both ways, and measure the execution time of the kernel. The results are correct (match the results produced using a host-based matrix multiply) either way, but one formulation runs significantly faster than the other.
This is especially easy to try, since your kernel code implies square matrices.
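Before timing anything, the difference can also be seen on paper. The following host-side C++ sketch computes which element of C a thread writes under each mapping, and the distance in memory between two threads that are neighbours in x. It only looks at a single block (blockIdx taken as 0), and all function names are made up for the illustration.

```cpp
#include <cassert>

// Mapping from the question: row taken from x, col taken from y.
int c_offset_row_from_x(int tx, int ty, int N) {
    int row = tx, col = ty;
    return row * N + col;
}

// Preferred mapping: row taken from y, col taken from x.
int c_offset_row_from_y(int tx, int ty, int N) {
    int row = ty, col = tx;
    return row * N + col;
}

// Distance (in elements) between the C entries written by threads (0,0) and (1,0),
// i.e. two threads that are adjacent in the x dimension.
int stride_between_x_neighbours(int (*offset)(int, int, int), int N) {
    assert(N > 0);
    return offset(1, 0, N) - offset(0, 0, N);
}
```

With row from x, x-neighbours are N elements apart, so their accesses cannot be combined. With row from y, they are 1 element apart, which is the adjacent-cell pattern that coalesces.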