Combine neural network layer kernels into one kernel CUDA - c++

I am working on a CUDA implementation of a neural network and I'm wondering how the calculations within a fully connected layer can be optimized more.
My current CUDA kernel for a fully connected layer in a neural network consists of the following steps:
Set the output neuron accumulators (input) to 0
Multiply the output data from the previous layer (in) with the weights of the current layer and sum the result in the accumulator
Calculate the output of the current layer (out) by applying an activation function to the accumulated data
These are the general steps in a single layer of a neural network, but they are currently (see below) implemented as separate kernels. For small output sizes (outSizeX equal to 10, for example), the first and third steps are relatively slow, especially combined with the overhead of launching three kernels.
Thus, my question is: how can I combine these three kernels into one kernel which performs all of the three above mentioned steps?
// Step 1
__global__ void set_to_zero_cuda(float *__restrict__ input, int outSizeX)
{
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i >= outSizeX)
return;
input[i] = 0;
}
// Step 2
__global__ void activate_cuda_fc(const float *__restrict__ in, float *__restrict__ input, const float *__restrict__ weights,
int totalInSize, int outSizeX, int weightSizeX)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int nx = blockDim.x * gridDim.x;
int ny = blockDim.y * gridDim.y;
for (int n = x; n < outSizeX; n += nx)
{
for (int i = y; i < totalInSize; i += ny)
{
atomicAdd(&input[n], in[i] * weights[i + n * weightSizeX]);
}
}
}
// Step 3
__global__ void perform_activation_function_cuda_fc(float *__restrict__ out, float *input,
int outSizeX)
{
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i >= outSizeX)
return;
out[i] = activator_function_cuda(input[i]);
}
For reference, the current profile looks like this: (profiler screenshot not preserved)

Thus, my question is: how can I combine these three kernels into one kernel which performs all of the three above mentioned steps?
Unless you are using a linear activation function, you can't "collapse" a sequence of fully connected layers like this.
Applying the weights and biases to the inputs of each layer is exactly the kind of trivially parallelizable linear algebra operation that is the bread and butter of GPUs. However, for that to work efficiently, you need to have all inputs of a layer ready before you launch it. Anything that precludes doing that operation in bulk will hurt performance immediately.
At the same time, since most activation functions introduce nonlinearity, they cannot be embedded directly into a linear algebra process, so you don't have much choice but to perform them separately.
However, there are still a lot of gains to be made in the code you posted. As I said, applying the weights and biases is the bread and butter of GPUs. In fact, it's effectively the exact same thing as transforming a vector by a matrix, but you are going about that in a rather roundabout way. Using a ready-made matrix-vector function such as cublasSgemv() would most likely give you some immediate benefits.
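To make the matrix-vector equivalence concrete, here is a minimal CPU reference in C++ for steps 1 and 2 of the question, written as one row-major matrix-vector product. The function name fc_forward_reference is hypothetical; the indexing weights[i + n * weightSizeX] is taken from the question's kernel. It is only a sketch for checking GPU results on small inputs, not a tuned implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference (hypothetical helper):
//   acc[n] = sum_i in[i] * weights[i + n * weightSizeX]
// Same indexing as the question's step 2 kernel. Accumulating in a local
// variable removes the need for a separate zeroing pass (step 1) and atomics.
std::vector<float> fc_forward_reference(const std::vector<float>& in,
                                        const std::vector<float>& weights,
                                        std::size_t outSizeX,
                                        std::size_t weightSizeX)
{
    assert(weightSizeX >= in.size());
    assert(weights.size() >= outSizeX * weightSizeX);
    std::vector<float> acc(outSizeX);
    for (std::size_t n = 0; n < outSizeX; ++n) {
        float sum = 0.0f;  // each output starts from zero
        for (std::size_t i = 0; i < in.size(); ++i)
            sum += in[i] * weights[i + n * weightSizeX];
        acc[n] = sum;
    }
    return acc;
}
```

Keep in mind that cuBLAS assumes column-major storage, so a row-major weight matrix like this one would normally be passed to cublasSgemv() as a transposed matrix.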
Addendum:
If you are using a linear activation function, then you are effectively doing y = A3 * L3 * A2 * L2 * A1 * L1 * x, where Ln is the matrix associated with layer n and the activation functions An are just scalars. You can premultiply all the A's and L's together ahead of time and treat the whole thing as one big matrix multiplication.
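A minimal C++ sketch of that collapse for two layers, assuming each activation is a pure scalar multiple. The Mat type and the helper names are hypothetical and only serve the illustration; biases are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major dense matrix; a hypothetical minimal type for this sketch.
struct Mat {
    std::size_t rows, cols;
    std::vector<float> a;  // element (r, c) is a[r * cols + c]
};

// C = s * (A * B); the scalar s plays the role of the linear activations.
Mat scaled_product(float s, const Mat& A, const Mat& B)
{
    assert(A.cols == B.rows);
    Mat C{A.rows, B.cols, std::vector<float>(A.rows * B.cols, 0.0f)};
    for (std::size_t r = 0; r < A.rows; ++r)
        for (std::size_t k = 0; k < A.cols; ++k)
            for (std::size_t c = 0; c < B.cols; ++c)
                C.a[r * C.cols + c] += s * A.a[r * A.cols + k] * B.a[k * B.cols + c];
    return C;
}

// y = M * x
std::vector<float> mat_vec(const Mat& M, const std::vector<float>& x)
{
    assert(M.cols == x.size());
    std::vector<float> y(M.rows, 0.0f);
    for (std::size_t r = 0; r < M.rows; ++r)
        for (std::size_t c = 0; c < M.cols; ++c)
            y[r] += M.a[r * M.cols + c] * x[c];
    return y;
}

// Collapse y = a2 * L2 * (a1 * L1 * x) into one matrix (a2 * a1) * L2 * L1,
// computed once ahead of time.
Mat collapse_two_layers(float a1, const Mat& L1, float a2, const Mat& L2)
{
    return scaled_product(a1 * a2, L2, L1);
}
```

After the collapse, inference is a single mat_vec call per input, regardless of how many linear layers were folded in.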

Related

Faster Method for Multiple Bilinear Interpolation?

I am writing a program in C++ to reconstruct a 3D object from a set of projected 2D images, the most computation-intensive part of which involves magnifying and shifting each image via bilinear interpolation. I currently have a pair of functions for this task; "blnSetup" defines a handful of parameters outside the loop, then "bilinear" applies the interpolation point-by-point within the loop:
(NOTE: 'I' is a 1D array containing ordered rows of image data)
//Pre-definition structure (in header)
struct blnData{
float* X;
float* Y;
int* I;
float X0;
float Y0;
float delX;
float delY;
};
//Pre-definition function (outside the FOR loop)
extern inline blnData blnSetup(float* X, float* Y, int* I)
{
blnData bln;
//Create pointers to X, Y, and I vectors
bln.X = X;
bln.Y = Y;
bln.I = I;
//Store offset and step values for X and Y
bln.X0 = X[0];
bln.delX = X[1] - X[0];
bln.Y0 = Y[0];
bln.delY = Y[1] - Y[0];
return bln;
}
//Main interpolation function (inside the FOR loop)
extern inline float bilinear(float x, float y, blnData bln)
{
float Ixy;
//Return 0 if the target point is outside the image matrix
if (x < bln.X[0] || x > bln.X[-1] || y < bln.Y[0] || y > bln.Y[-1])
Ixy = 0;
//Otherwise, apply bilinear interpolation
else
{
//Define known image width
int W = 200;
//Find nearest indices for interpolation
int i = floor((x - bln.X0) / bln.delX);
int j = floor((y - bln.Y0) / bln.delY);
//Interpolate I at (xi, yj)
Ixy = 1 / ((bln.X[i + 1] - bln.X[i])*(bln.Y[j + 1] - bln.Y[j])) *
(
bln.I[W*j + i] * (bln.X[i + 1] - x) * (bln.Y[j + 1] - y) +
bln.I[W*j + i + 1] * (x - bln.X[i]) * (bln.Y[j + 1] - y) +
bln.I[W*(j + 1) + i] * (bln.X[i + 1] - x) * (y - bln.Y[j]) +
bln.I[W*(j + 1) + i + 1] * (x - bln.X[i]) * (y - bln.Y[j])
);
}
return Ixy;
}
EDIT: The function calls are below. 'flat.imgdata' is a std::vector containing the input image data and 'proj.imgdata' is a std::vector containing the transformed image.
int Xs = flat.dim[0];
int Ys = flat.dim[1];
int* Iarr = flat.imgdata.data();
float II, x, y;
bln = blnSetup(X, Y, Iarr);
for (int j = 0; j < flat.imgdata.size(); j++)
{
x = 1.2*X[j % Xs];
y = 1.2*Y[j / Xs];
II = bilinear(x, y, bln);
proj.imgdata[j] = (int)II;
}
Since I started optimizing, I have been able to reduce computation time by ~50x (!) by switching from std::vectors to C arrays within the interpolation function, and another 2x or so by cleaning up redundant computations/typecasting/etc, but assuming O(n) with n being the total number of processed pixels, the full reconstruction (~7e10 pixels) should still take 40min or so--about an order of magnitude longer than my goal of <5min.
According to Visual Studio's performance profiler, the interpolation function call ("II = bilinear(x, y, bln);") is unsurprisingly still the majority of my computation load. I haven't been able to find any linear algebraic methods for fast multiple interpolation, so my question is: is this basically as fast as my code will get, short of applying more or faster CPUs to the task? Or is there a different approach that might speed things up?
P.S. I've also only been coding in C++ for about a month now, so feel free to point out any beginner mistakes I might be making.
I wrote up a long answer suggesting looking at OpenCV (opencv.org), or using Halide (http://halide-lang.org/), and getting into how image warping is optimized, but I think a shorter answer might serve better. If you are really just scaling and translating entire images, OpenCV has code to do that and we have an example for resizing in Halide as well (https://github.com/halide/Halide/blob/master/apps/resize/resize.cpp).
If you really have an algorithm that needs to index an image using floating-point coordinates which result from a computation that cannot be turned into a moderately simple function on integer coordinates, then you really want to be using filtered texture sampling on a GPU. Most techniques for optimizing on the CPU rely on exploiting some regular pattern of access in the algorithm and removing float to integer conversion from the addressing. (For resizing, one uses two integer variables, one which indexes the pixel coordinate of the image and the other which is the fractional part of the coordinate and it indexes a kernel of weights.) If this is not possible, the speedups are somewhat limited on CPUs. OpenCV does provide fairly general remapping support, but it likely isn't all that fast.
Two optimizations that may be applicable here are moving the boundary condition out of the loop and using a two-pass approach in which the horizontal and vertical dimensions are processed separately. The latter may or may not win and will require tiling the data to fit in cache if the images are very large. Tiling in general is pretty important for large images, but it isn't clear it is the first-order performance problem here, and depending on the values in the inputs, the cache behavior may not be regular enough anyway.
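For the pure scale-and-shift case, the integer-index-plus-fraction idea can be sketched in C++ as follows. This assumes a unit-spaced source grid and a float image, which differs from the question's X/Y coordinate arrays and int pixels; Tap, make_taps and resample are hypothetical names. The floor and the range check are done once per row and once per column instead of once per pixel.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Per-output-coordinate lookup entry for one axis.
struct Tap {
    int   idx;   // lower neighbour in the source, or -1 if outside the source
    float frac;  // weight of the upper neighbour, in [0, 1]
};

// Output coordinate k samples source position scale * k + shift.
std::vector<Tap> make_taps(int outSize, int srcSize, float scale, float shift)
{
    assert(outSize >= 0 && srcSize >= 2);
    std::vector<Tap> taps(outSize);
    for (int k = 0; k < outSize; ++k) {
        float pos = scale * k + shift;
        if (pos < 0.0f || pos > float(srcSize - 1)) {
            taps[k] = Tap{-1, 0.0f};  // boundary check happens here, once
            continue;
        }
        int i = std::min(int(std::floor(pos)), srcSize - 2);  // keep i + 1 valid
        taps[k] = Tap{i, pos - float(i)};
    }
    return taps;
}

// Bilinear resample; pixels that fall outside the source are left at 0.
std::vector<float> resample(const std::vector<float>& src, int srcW, int srcH,
                            int outW, int outH,
                            float scale, float shiftX, float shiftY)
{
    assert(src.size() == std::size_t(srcW) * srcH);
    std::vector<Tap> tx = make_taps(outW, srcW, scale, shiftX);
    std::vector<Tap> ty = make_taps(outH, srcH, scale, shiftY);
    std::vector<float> out(std::size_t(outW) * outH, 0.0f);
    for (int r = 0; r < outH; ++r) {
        if (ty[r].idx < 0) continue;
        const float* row0 = &src[std::size_t(ty[r].idx) * srcW];
        const float* row1 = row0 + srcW;
        const float fy = ty[r].frac;
        for (int c = 0; c < outW; ++c) {
            if (tx[c].idx < 0) continue;
            const int i = tx[c].idx;
            const float fx = tx[c].frac;
            float top = row0[i] + fx * (row0[i + 1] - row0[i]);
            float bot = row1[i] + fx * (row1[i + 1] - row1[i]);
            out[std::size_t(r) * outW + c] = top + fy * (bot - top);
        }
    }
    return out;
}
```

Whether this beats a library resize depends on image size and cache behaviour, so it is worth timing against OpenCV before committing to it.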
"vector 50x slower than array". That's a dead giveaway you're in debug mode, where vector::operator[] is not inlined. You will probably get the necessary speedup, and a lot more, simply by switching to release mode.
As a bonus, vector has a .back() method, so you have a proper replacement for that [-1]. A pointer to the beginning of an array doesn't carry the array size, so you can't find the back of an array that way.
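A small sketch of the range check rewritten with .front() and .back(), assuming X and Y are kept as std::vector<float> and sorted in ascending order. The helper name inside_grid is hypothetical.

```cpp
#include <cassert>
#include <vector>

// True when (x, y) lies inside the grid spanned by the coordinate vectors.
// Hypothetical helper; X and Y are assumed non-empty and sorted ascending.
bool inside_grid(const std::vector<float>& X, const std::vector<float>& Y,
                 float x, float y)
{
    assert(!X.empty() && !Y.empty());
    return x >= X.front() && x <= X.back() &&
           y >= Y.front() && y <= Y.back();
}
```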

Min of array rows in CUDA

Given an n-by-m matrix, I would like to build an n-sized vector containing the minimum of each matrix row, in CUDA.
So far I have come up with this:
__global__ void OnMin(float * Mins, const float * Matrix, const int n, const int m) {
int i = threadIdx.x + blockDim.x * blockIdx.x;
if (i < n) {
Mins[i] = Matrix[m * i];
for (int j = 1; j < m; ++j){
if (Matrix[m * i + j] < Mins[i])
Mins[i] = Matrix[m * i + j];
}
}
}
called in:
OnMin<<<(n + TPB - 1) / TPB, TPB>>>(Mins, Matrix, n, m);
However I think that something more optimized could exist.
I tried invoking cublasIsamin in a loop, but it is slower.
I also tried launching a kernel (global) from OnMin kernel without success... (sm_35, compute_35 raises compile errors... I have a GTX670)
Any ideas ?
Thanks!
Finding the min of array rows in a row-major matrix is a parallel reduction question that has been discussed many times on Stack Overflow. For example, see "Reduce matrix rows with CUDA".
The basic idea is to use n blocks in a grid. Each block contains a fixed number of threads, typically 256. Each block of threads will do the parallel reduction on a row of the m elements to find the minimum collaboratively.
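The shape of that per-block reduction can be shown with a sequential C++ emulation. Here the loop over t stands in for the threads of one block, scratch stands in for shared memory, and each level of the tree would be separated by a barrier on the GPU. The names row_min_block and row_mins are hypothetical, and this is a CPU sketch of the access pattern, not a CUDA kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Emulates one "block" reducing one row of m elements.
// threads stands in for blockDim.x and must be a power of two.
float row_min_block(const float* row, std::size_t m, std::size_t threads = 256)
{
    assert(m > 0 && threads > 0 && (threads & (threads - 1)) == 0);
    std::vector<float> scratch(threads, std::numeric_limits<float>::infinity());
    // Phase 1: "thread" t strides over the row and keeps a private minimum.
    for (std::size_t t = 0; t < threads; ++t)
        for (std::size_t j = t; j < m; j += threads)
            scratch[t] = std::min(scratch[t], row[j]);
    // Phase 2: tree reduction, halving the number of active "threads" per level.
    for (std::size_t s = threads / 2; s > 0; s /= 2)
        for (std::size_t t = 0; t < s; ++t)
            scratch[t] = std::min(scratch[t], scratch[t + s]);
    return scratch[0];
}

// One "block" per row: mins[i] is the minimum of row i of an n-by-m matrix.
std::vector<float> row_mins(const std::vector<float>& matrix,
                            std::size_t n, std::size_t m)
{
    assert(matrix.size() == n * m);
    std::vector<float> mins(n);
    for (std::size_t i = 0; i < n; ++i)
        mins[i] = row_min_block(&matrix[i * m], m);
    return mins;
}
```

In phase 1, consecutive "threads" touch consecutive elements of the row, which is the access pattern that coalesces well on the GPU.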
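For a large enough matrix where the GPU can be fully utilized, the best achievable time is roughly half the time of copying the matrix once, since the reduction reads every element once and writes almost nothing, whereas a copy both reads and writes it.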

Is something like this possible in CUDA

Let's say I have a matrix with values of 0 or 1. Is it possible in CUDA to do something like this:
__global__ void kernel(float *matrix, float *count)
{
int row = blockIdx.y * blockDim.y + threadIdx.y;
int column = blockIdx.x * blockDim.x + threadIdx.x;
if (row >= MATRIXSIZE || column >= MATRIXSIZE)
{
return;
}
if (matrix[MATRIXSIZE * row + column] == 1)
{
count[0]++;
}
}
so that in the end I get the number of ones in the matrix? I know this is a very simple example, but if this is possible, then other variants would be too ...
There are highly optimized libraries for CUDA that perform these types of operations, called reductions. Look into CUDA Thrust or CUB. In Thrust, you can use reduce to sum up all the values or count to count number of instances of a particular value.
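Thrust's algorithms are modelled on the C++ standard library, so the host-side equivalents below give a feel for the two options. This is a plain C++ sketch with hypothetical helper names; on the GPU the corresponding calls would be thrust::count and thrust::reduce over device data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Count the entries equal to 1 (host-side analogue of a count reduction).
std::size_t count_ones(const std::vector<float>& matrix)
{
    return static_cast<std::size_t>(
        std::count(matrix.begin(), matrix.end(), 1.0f));
}

// For a strict 0/1 matrix, the sum of all entries is the same number
// (host-side analogue of a sum reduction).
float sum_all(const std::vector<float>& matrix)
{
    return std::accumulate(matrix.begin(), matrix.end(), 0.0f);
}
```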
If you really want to do it this way, you should use an atomic add: atomicAdd(&count[0], 1.0f).
But this may cause performance issues, since every thread that finds a 1 contends for the same address.

Gram matrix using VexCL

I have a pretty large data set (it does not fit into GPU memory) containing many vectors, where each vector is several MBs.
I'd like to calculate, using multiple GPU devices, the Gram matrix using a gaussian kernel.
In other words, for every pair of vectors x,y, I need to calculate the norm of x-y. So if I have N vectors, I have (N^2+N)/2 such pairs. I don't care about saving space or time by taking advantage of the symmetry, it can do the whole N^2.
How can I do it with VexCL? As far as I know it's the only library supporting multiple GPUs, and I have tried doing this efficiently with plain OpenCL, with no success so far.
Please note that the dataset won't even fit the machine's RAM, I'm reading blocks of vectors from a memory mapped file.
Thanks much!!
You will obviously need to split your vectors into groups of, say, m, load the groups one by one (or, rather, two by two) onto your GPUs and do the computations. Here is a complete program that does the computation (as I understood it) for the two currently loaded chunks:
#include <iostream>
#include <vexcl/vexcl.hpp>
int main() {
const size_t n = 1024; // Each vector size.
const size_t m = 4; // Number of vectors in a chunk.
vex::Context ctx( vex::Filter::Count(1) );
// The input vectors...
vex::vector<double> chunk1(ctx, m * n);
vex::vector<double> chunk2(ctx, m * n);
// ... with some data.
chunk1 = vex::element_index();
chunk2 = vex::element_index();
vex::vector<double> gram(ctx, m * m); // The current chunk of Gram matrix to fill.
/*
* chunk1 and chunk2 both have dimensions [m][n].
* We want to take each of chunk2 m rows, subtract those from each of
* chunk1 rows, and reduce the result along the dimension n.
*
* In order to do this, we create two virtual 3D matrices (x and y below,
* those are just expressions and are never instantiated) sized [m][m][n],
* where
*
* x[i][j][k] = chunk1[i][k] for each j, and
* y[i][j][k] = chunk2[j][k] for each i.
*
* Then what we need to compute is
*
* gram[i][j] = sum_k( (x[i][j][k] - y[i][j][k])^2 );
*
* Here it goes:
*/
using vex::extents;
auto x = vex::reshape(chunk1, extents[m][m][n], extents[0][2]);
auto y = vex::reshape(chunk2, extents[m][m][n], extents[1][2]);
// The single OpenCL kernel is generated and launched here:
gram = vex::reduce<vex::SUM>(
extents[m][m][n], // The dimensions of the expression to reduce.
pow(x - y, 2.0), // The expression to reduce.
2 // The dimension to reduce along.
);
// Copy the result to host, spread it across your complete gram matrix.
// I am lazy though, so let's just dump it to std::cout:
std::cout << gram << std::endl;
}
I suggest you load chunk1 once, then load all chunk2 variants in sequence and do the computations, then load the next chunk1, and so on. Note that slicing, reshaping, and multidimensional reduction operations are only supported for a context with a single compute device in it. So what is left is how to spread the computations across all of your compute devices. The easiest way to do this is probably to create a single VexCL context that would grab all available GPUs, and then create vectors of command queues out of it:
vex::Context ctx( vex::Filter::Any );
std::vector<std::vector<vex::command_queue>> q;
for(size_t d = 0; d < ctx.size(); ++d)
q.push_back({ctx.queue(d)});
//...
// In a std::thread perhaps:
vex::vector<double> chunk1(q[d], m * n);
vex::vector<double> chunk2(q[d], m * n);
// ...
I hope this is enough to get you started.
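When validating the device result on small inputs, a plain C++ reference for one tile of the computation can help. The sketch below assumes the same [m][n] row-major chunk layout as the program above; the function name gram_tile_reference is hypothetical. It returns squared distances only; applying the Gaussian kernel to them is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference (hypothetical helper) for one tile of pairwise squared distances:
//   gram[i * m + j] = sum_k (chunk1[i * n + k] - chunk2[j * n + k])^2
std::vector<double> gram_tile_reference(const std::vector<double>& chunk1,
                                        const std::vector<double>& chunk2,
                                        std::size_t m, std::size_t n)
{
    assert(chunk1.size() == m * n && chunk2.size() == m * n);
    std::vector<double> gram(m * m, 0.0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                const double d = chunk1[i * n + k] - chunk2[j * n + k];
                sum += d * d;
            }
            gram[i * m + j] = sum;
        }
    }
    return gram;
}
```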

CUDA Thread IDs

I'm new to CUDA programming and I have the following problem.
If I use the following code to perform matrix multiplication, since CUDA uses Cartesian indexing for thread indexing and C/C++ use row major indexing for matrices, wouldn't it influence the accuracy of the calculation?
__global__ void gpuMM(float *A, float *B, float *C, int N)
{
// Matrix multiplication for NxN matrices C=A*B
// Each thread computes a single element of C
int col = blockIdx.y*blockDim.y + threadIdx.y;
int row = blockIdx.x*blockDim.x + threadIdx.x;
float sum = 0.f;
for (int n = 0; n < N; ++n)
sum += A[row*N+n]*B[n*N+col];
C[row*N+col] = sum;
}
CUDA doesn't imply any memory storage structure. You can say CUDA C is row-major for matrix storage, but that is due to C, not CUDA. (CUDA Fortran would be column-major.) Thread indexing dimensions are arbitrary. They do not imply a data storage order in memory.
Implications about data storage order in memory of course arise as you write your code. From a correctness standpoint, it does not matter if we assign row indices based on x thread dimensions or on y thread dimensions. You can write correct code for this matrix multiply example using either approach (either row based on x, or else row based on y).
However, from a coalescing standpoint, we generally want adjacent executing threads to read or write adjacent cells in memory. Adjacent threads (for execution) typically are grouped in x first. Therefore this is preferable (for your kernel code):
int row = blockIdx.y*blockDim.y + threadIdx.y;
int col = blockIdx.x*blockDim.x + threadIdx.x;
because it will allow the read of B[] and the write of C[] to coalesce.
This is easy to prove to yourself. Try it both ways, and measure the execution time of the kernel. The results are correct (match the results produced using a host-based matrix multiply) either way, but one formulation runs significantly faster than the other.
This is especially easy to try, since your kernel code implies square matrices.
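The difference between the two mappings can also be seen on paper by computing which element of C each thread writes. The C++ sketch below does that arithmetic on the host, with block indices folded into the thread indices for brevity; all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Offset into row-major N x N matrix C written by the thread with global
// indices (tx, ty), using the mapping from the question: row from x, col from y.
std::size_t offset_row_from_x(std::size_t tx, std::size_t ty, std::size_t N)
{
    const std::size_t row = tx, col = ty;
    return row * N + col;
}

// Same, using the mapping suggested in the answer: row from y, col from x.
std::size_t offset_row_from_y(std::size_t tx, std::size_t ty, std::size_t N)
{
    const std::size_t row = ty, col = tx;
    return row * N + col;
}

// Distance, in elements, between the writes of two threads adjacent in x.
std::size_t stride_between_x_neighbours(
    std::size_t (*offset)(std::size_t, std::size_t, std::size_t), std::size_t N)
{
    return offset(1, 0, N) - offset(0, 0, N);
}
```

With N = 1024, the question's mapping puts x-adjacent threads 1024 elements apart, while the suggested mapping puts them 1 element apart, which is what allows the accesses to coalesce.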