cublasSdot is working slower than cublasSgemm - c++

In my toy example I first multiply two 32x32 matrices 100,000 times, and then compute the scalar product of two vectors of length 1024, again 100,000 times. I use cublasSgemm for the first and cublasSdot for the second.
The first calculation takes 530 msec and the second takes 10,000 msec. However, multiplying the matrices requires 32^3 multiply-add operations, whereas the scalar product requires only 1024 = 32^2.
So why am I getting this result? Here is the code:
__device__ float res;
void randomInit(float *data, int size)
{
for (int i = 0; i < size; ++i)
data[i] = rand() / (float)RAND_MAX;
}
int main(){
cublasHandle_t handle;
float out;
cudaError_t cudaerr;
cudaEvent_t start1, stop1,start2,stop2;
cublasStatus_t stat;
int size = 32;
int num = 100000;
float *h_A = new float[size*size];
float *h_B = new float[size*size];
float *h_C = new float[size*size];
float *d_A, *d_B, *d_C;
const float alpha = 1.0f;
const float beta = 0.0f;
randomInit(h_A, size*size);
randomInit(h_B, size*size);
cudaMalloc((void **)&d_A, size *size *sizeof(float));
cudaMalloc((void **)&d_B, size *size * sizeof(float));
cudaMalloc((void **)&d_C, size *size * sizeof(float));
stat = cublasCreate(&handle);
cudaEventCreate(&start1);
cudaEventCreate(&stop1);
cudaEventCreate(&start2);
cudaEventCreate(&stop2);
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, size, size, size, &alpha, d_A, size,
d_B, size, &beta, d_C, size);
cudaEventRecord(start1, NULL);
cudaMemcpy(d_A, h_A, size *size * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size *size * sizeof(float), cudaMemcpyHostToDevice);
for (int i = 0; i < num; i++){
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, size, size, size, &alpha, d_A,
size, d_B, size, &beta, d_C, size);
}
cudaMemcpy(h_C, d_C, size*size*sizeof(float), cudaMemcpyDeviceToHost);
cudaEventRecord(stop1, NULL);
cudaEventSynchronize(stop1);
float msecTotal1 = 0.0f;
cudaEventElapsedTime(&msecTotal1, start1, stop1);
std::cout <<"total time for MAtMul:" << msecTotal1 << "\n";
cudaEventRecord(start2, NULL);
cudaMemcpy(d_A, h_A, size *size * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size *size * sizeof(float), cudaMemcpyHostToDevice);
for (int i = 0; i < num; i++){
cublasSdot(handle, 1024, d_A , 1, d_B , 1, &res);
}
cudaEventRecord(stop2, NULL);
cudaEventSynchronize(stop2);
float msecTotal2 = 0.0f;
cudaEventElapsedTime(&msecTotal2, start2, stop2);
std::cout << "total time for dotVec:" << msecTotal2 << "\n";
cublasDestroy(handle);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
delete[] h_A;
delete[] h_B;
delete[] h_C;
return 1;
}
Update: I also tried computing the dot product with cublasSgemm by treating each vector as a 1 by 1024 matrix. That takes 3550 msec, which is better, but still about 7 times longer than the first calculation.

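One problem is that you're not handling the pointer mode correctly for the call to cublasSdot. By default cuBLAS uses CUBLAS_POINTER_MODE_HOST, in which the result argument must be a host pointer and the call blocks until the result has been copied back to the host.
You'll want to read this section of the manual.
Furthermore this:
cublasSdot(handle, 1024, d_A , 1, d_B , 1, &res);
                                           ^^^^
is illegal under any circumstances. It is not legal in CUDA to take the address of a __device__ variable in host code. It will compile, but the resulting pointer is not a usable device address, so the results are garbage.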
When I modify your code as follows:
cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
float *dres;
cudaMalloc(&dres, sizeof(float));
cudaEventRecord(start2, NULL);
cudaMemcpy(d_A, h_A, size *size * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size *size * sizeof(float), cudaMemcpyHostToDevice);
for (int i = 0; i < num; i++){
if(cublasSdot(handle, 1024, d_A , 1, d_B , 1, dres) != CUBLAS_STATUS_SUCCESS) {std::cout << ".";}
}
I get about a 2:1 ratio of execution time for cublasSdot to cublasSgemm, which may be plausible, particularly for these sizes. Under the hood, the dot operation implies a parallel reduction. 1024 threads can compute the partial results, but then a 1024-thread-wide parallel reduction is required. The gemm does not need a parallel reduction, and so may be quicker: 1024 threads can each produce one of the 1024 results on their own. For a memory-bound algorithm, the difference between 32^2 and 32^3 operations may not be that significant, but the parallel reduction implies significant additional operations.
When I then change size in your program from 32 to 128, I see the ratio reverse, and the matrix multiply does indeed become 3x longer than the dot product.
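To make the cost of the reduction concrete, here is a minimal host-only C++ sketch of a pairwise tree reduction. It is an illustration of the general technique, not cuBLAS's actual implementation (which the answer does not describe), and the names `tree_dot` and `DotResult` are hypothetical. For 1024 elements it needs 10 dependent halving passes after the elementwise multiplies, and on a GPU each pass would be separated by a barrier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Result of the simulated reduction: the dot product and the number of
// dependent halving passes (tree levels) that were needed.
struct DotResult {
    float value;
    int steps;
};

// Host-side simulation of a parallel tree reduction.
// Assumes x.size() == y.size() and that the size is a power of two.
DotResult tree_dot(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    assert(n > 0 && (n & (n - 1)) == 0);

    // Pass 0: every "thread" i computes one partial product.
    std::vector<float> partial(n);
    for (std::size_t i = 0; i < n; ++i) partial[i] = x[i] * y[i];

    // Each pass halves the number of active "threads". The passes depend on
    // each other, so they cannot overlap.
    int steps = 0;
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        for (std::size_t i = 0; i < stride; ++i) partial[i] += partial[i + stride];
        ++steps;
    }
    return {partial[0], steps};
}
```

Calling `tree_dot` on two vectors of length 1024 reports 10 passes. By contrast, each of the 1024 outputs of a 32x32 matrix product can be accumulated by a single thread in a plain loop, with no communication between threads.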

Related

Convert Eigen::SparseMatrix to cuSparse and vice versa

I am having trouble figuring out how to convert Eigen::SparseMatrix to cuSparse because there is so little documentation and so few examples online. For dense matrices, converting from Eigen to CUDA for cublas is fairly straightforward:
Eigen::MatrixXd A = Eigen::MatrixXd::Identity(3,3);
double *d_A;
cudaMalloc(reinterpret_cast<void **>(&d_A), 3 * 3 * sizeof(double));
cudaMemcpy(d_A, A.data(), sizeof(double) * 3 * 3, cudaMemcpyHostToDevice);
// do cublas operations on d_A
How do I do the equivalent for sparse matrices?
std::vector<Eigen::Triplet<double>> trip;
trip.emplace_back(0, 0, 1);
trip.emplace_back(1, 1, 1);
trip.emplace_back(2, 2, 1);
Eigen::SparseMatrix<double> A(3, 3);
A.setFromTriplets(trip.begin(), trip.end());
double *d_A;
// cudaMalloc?
// cudaMemcpy? some conversion?
// do cusparse operations
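Just in case people are interested, I figured it out. The tricky part is that Eigen's sparse matrix is stored in CSC format by default, whereas cuSparse works with CSR format. Fortunately, the CSC arrays of a matrix are exactly the CSR arrays of its transpose, so the conversion amounts to copying the three arrays and treating the result as the transposed matrix in CSR form.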
void EigenSparseToCuSparseTranspose(
const Eigen::SparseMatrix<double> &mat, int *row, int *col, double *val)
{
const int num_non0 = mat.nonZeros();
const int num_outer = mat.cols() + 1;
cudaMemcpy(row,
mat.outerIndexPtr(),
sizeof(int) * num_outer,
cudaMemcpyHostToDevice);
cudaMemcpy(
col, mat.innerIndexPtr(), sizeof(int) * num_non0, cudaMemcpyHostToDevice);
cudaMemcpy(
val, mat.valuePtr(), sizeof(double) * num_non0, cudaMemcpyHostToDevice);
}
void CuSparseTransposeToEigenSparse(
const int *row,
const int *col,
const double *val,
const int num_non0,
const int mat_row,
const int mat_col,
Eigen::SparseMatrix<double> &mat)
{
std::vector<int> outer(mat_col + 1);
std::vector<int> inner(num_non0);
std::vector<double> value(num_non0);
cudaMemcpy(
outer.data(), row, sizeof(int) * (mat_col + 1), cudaMemcpyDeviceToHost);
cudaMemcpy(inner.data(), col, sizeof(int) * num_non0, cudaMemcpyDeviceToHost);
cudaMemcpy(
value.data(), val, sizeof(double) * num_non0, cudaMemcpyDeviceToHost);
Eigen::Map<Eigen::SparseMatrix<double>> mat_map(
mat_row, mat_col, num_non0, outer.data(), inner.data(), value.data());
mat = mat_map.eval();
}
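The claim that the CSC arrays of a matrix equal the CSR arrays of its transpose can be checked without Eigen or CUDA. The following host-only C++ sketch builds both forms from a small dense matrix; the `Compressed` struct and the `compress` and `transpose` helpers are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <vector>

// Compressed storage: `outer` has one entry per outer slice plus one,
// `inner` holds the inner index of each stored value.
struct Compressed {
    std::vector<int> outer;
    std::vector<int> inner;
    std::vector<double> values;
};

// Build CSR (by_rows = true) or CSC (by_rows = false) from a dense
// row-major matrix with rows x cols entries.
Compressed compress(const std::vector<double>& dense, int rows, int cols, bool by_rows) {
    assert(static_cast<int>(dense.size()) == rows * cols);
    const int n_outer = by_rows ? rows : cols;
    const int n_inner = by_rows ? cols : rows;
    Compressed out;
    out.outer.push_back(0);
    for (int o = 0; o < n_outer; ++o) {
        for (int i = 0; i < n_inner; ++i) {
            const int r = by_rows ? o : i;
            const int c = by_rows ? i : o;
            const double v = dense[r * cols + c];
            if (v != 0.0) {
                out.inner.push_back(i);
                out.values.push_back(v);
            }
        }
        out.outer.push_back(static_cast<int>(out.values.size()));
    }
    return out;
}

// Dense transpose of a row-major rows x cols matrix.
std::vector<double> transpose(const std::vector<double>& dense, int rows, int cols) {
    std::vector<double> t(dense.size());
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            t[c * rows + r] = dense[r * cols + c];
    return t;
}
```

For a 2x3 matrix A, `compress(A, 2, 3, false)` and `compress(transpose(A, 2, 3), 3, 2, true)` produce identical arrays. The practical consequence for the functions above is that the device arrays describe the transpose of the Eigen matrix in CSR form, so unless the matrix is symmetric the caller has to account for that transpose in later operations. Also note that Eigen's `outerIndexPtr()` only has this layout when the matrix is in compressed mode, which is the case after `setFromTriplets`.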

How can i make GPU process much faster than CPU process with CUDA 10.0 in Visual Studio 2017?

Smart developer!
I am a beginner at CUDA programming and I have a big problem with my code.
The following code is a sample from Nvidia that I changed slightly to show that the GPU runs much faster than the CPU. However, after compiling and running it, I got an unexpected result: the CPU is much faster than the GPU.
This is my laptop gpu info.
This is my cuda code for Visual Studio 2017.
#include <windows.h>      // LARGE_INTEGER, QueryPerformanceCounter
#include <cstdio>         // printf
#include <iostream>       // cout
#include <cuda_runtime.h>
using namespace std;
#define N 10
This is the add2() function, which runs on the GPU:
__global__ void add2(int *a, int *b, int *c) {
// Index by block (one element per block). With a small N the GPU result is slow compared with the CPU, but with a big N it is much faster than the CPU.
//int tid = blockIdx.x;
// Index by thread within a single block: same result as blockIdx.x
//int tid = threadIdx.x;
// Global index across all blocks and threads: same result as above
int tid = threadIdx.x + blockIdx.x*blockDim.x;
if (tid < N) {
c[tid] = a[tid] + b[tid];
}
}
This is the add() function, which runs on the CPU:
void add(int *a, int *b, int *c) {
int tid = 0;
while (tid < N) {
c[tid] = a[tid] + b[tid];
tid += 1;
}
}
This is Main function()
int main() {
// Values for time duration
LARGE_INTEGER tFreq, tStart, tEnd;
cudaEvent_t start, stop;
float tms, ms;
int a[N], b[N], c[N]; // CPU values
int *dev_a, *dev_b, *dev_c; // GPU values----------------------------------------------
// Creating alloc for GPU--------------------------------------------------------------
cudaMalloc((void**)&dev_a, N * sizeof(int));
cudaMalloc((void**)&dev_b, N * sizeof(int));
cudaMalloc((void**)&dev_c, N * sizeof(int));
// Fill 'a' and 'b' from CPU
for (int i = 0; i < N; i++) {
a[i] = -i;
b[i] = i * i;
}
// Copy values of CPU to GPU values----------------------------------------------------
cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
//////////////////////////////////////
QueryPerformanceFrequency(&tFreq); // Frequency set
QueryPerformanceCounter(&tStart); // Time count Start
// CPU operation
add(a, b, c);
//////////////////////////////////////
QueryPerformanceCounter(&tEnd); // Time count End
tms = ((tEnd.QuadPart - tStart.QuadPart) / (float)tFreq.QuadPart) * 1000;
//////////////////////////////////////
// show result of CPU
cout << fixed;
cout.precision(10);
cout << "CPU Time=" << tms << endl << endl;
for (int i = 0; i < N; i++) {
printf("CPU calculate = %d + %d = %d\n", a[i], b[i], c[i]);
}
cout << endl;
///////////////////////////////////////
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
// GPU operation---------------------------------------------------------------------
//add2<<<N, 1>>>(dev_a, dev_b, dev_c); // N blocks of 1 thread
//add2<<<1, N>>>(dev_a, dev_b, dev_c); // 1 block of N threads
add2<<<(N + 31) / 32, 32>>>(dev_a, dev_b, dev_c); // enough 32-thread blocks to cover N
///////////////////////////////////////
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
///////////////////////////////////////
// show result of GPU
cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
cout << fixed;
cout.precision(10);
cout << "GPU Time=" << ms << endl << endl;
for (int i = 0; i < N; i++) {
printf("GPU calculate = %d + %d = %d\n", a[i], b[i], c[i]);
}
//Free GPU values
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
return 0;
}
This is result of compiling this code.
I want to make GPU process much faster than CPU process.
The GPU is generally slower than the CPU at running a single operation. Additionally, it takes time to send data to the GPU and read it back again.
The advantage of the GPU is that it can execute many operations in parallel.
Because you have defined N as 10, it probably takes longer to upload and download the data than to do the whole computation on the CPU. To see the advantage of the GPU, increase your problem size to something much larger. Ideally you want to execute at least a few operations on each GPU core before you start seeing any benefit. For example, with your GPU's 1280 cores you would want to execute something like 4000 operations or more at once to get the benefit of the GPU.
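Once N grows, the launch configuration has to grow with it. As a small host-only C++ sketch, the usual way to size the grid is a ceiling division; `blocks_for` and `blocks_question` are hypothetical helper names, the second one reproducing the `N/32+1` expression from the question for comparison.

```cpp
#include <cassert>

// Number of blocks needed so that blocks * threads_per_block >= n.
// Assumes n >= 0 and threads_per_block > 0.
int blocks_for(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// The expression used in the question. It still covers every element
// (the kernel guards with `tid < N`), but it launches one idle block
// whenever n is an exact multiple of threads_per_block.
int blocks_question(int n, int threads_per_block) {
    return n / threads_per_block + 1;
}
```

Both give 1 block for N = 10 with 32 threads per block. For N = 64 the ceiling division gives 2 blocks while `N/32+1` gives 3. The extra block is harmless here because of the bounds check in the kernel.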

Numerical error in cuda/cublas simple kernel using particular input

I am working with cuda and cublas and I was trying to implement simple operations like matrix element-wise multiplication/division. I am using only float for my experiments. I know the most obvious way to do it is to write a kernel like this one:
__global__ void mul_elementwise(const unsigned int n, float* source, float* dest, const float value)
{
const unsigned int offset = blockIdx.x * blockDim.x + threadIdx.x;
const unsigned int stride = blockDim.x * gridDim.x;
for (unsigned int i = offset; i < n; i += stride)
{
dest[i] = source[i] * value;
}
}
This kernel can work both for multiplication and division (just using 1/x as value). But this can be achieved using cublas library too: suppose we have a matrix A m x n stored in column-major style and a scalar x, then setting alpha = x or alpha = 1/x and d_ones as a vector of m*n 1s, we can invoke and obtain the same result
cublasSaxpy(cublas_handle, m * n, &alpha, d_ones, 1, A_dev, 1);
Both methods work just fine, but I am facing a few problems with one particular matrix, for which neither method works. I isolated this big matrix and built an MCVE available here (you can compile it with nvcc mcve.cu -lcublas). As you can see, the results in both cases are totally wrong: the host result is totally different. I am trying to figure out what's going on. I do not see any error in the code, but maybe I should try double instead of float and see what happens.
Any opinions about this situation? Thanks in advance!
EDIT #1 I tried using doubles, but nothing changes if I use cublasDaxpy, whereas it works perfectly with the custom kernel. I think the values are too small, so single-precision floating point is not enough.
Interesting MCVE. Wouldn't it have been possible to shrink your vector down to just a few elements? Isn't it possible to show the calculation discrepancy based on just 1 vector element?
Anyway I see several problems.
Your kernel implements the following function: y=alpha*x. But SAXPY implements y=alpha*x+y. Now, if y started out as (all) zero, then these two would be the same. But that's not what you have:
         CUBLAS    Your Kernel
----------------------------------
alpha:   alpha     alpha
x:       1         ahost   (ahost is your huge data array)
y:       ahost     -
So your kernel is computing y=alpha * ahost, but your CUBLAS call is computing y = alpha*1 + ahost. I wouldn't expect the same result from these, in general.
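The difference is easy to reproduce on the host. The sketch below is plain C++ written for illustration (it is not cuBLAS code, and `saxpy_ref` and `scale_ref` are hypothetical names): it implements the axpy definition y = alpha*x + y next to the scaling that the custom kernel performs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference semantics of BLAS axpy: y[i] = alpha * x[i] + y[i].
void saxpy_ref(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = alpha * x[i] + y[i];
}

// What the custom kernel computes: dest[i] = alpha * source[i].
void scale_ref(float alpha, const std::vector<float>& source, std::vector<float>& dest) {
    assert(source.size() == dest.size());
    for (std::size_t i = 0; i < source.size(); ++i) dest[i] = alpha * source[i];
}
```

With x set to the data and y set to zeros, `saxpy_ref` matches `scale_ref`. With x set to ones and y set to the data, as in the question's call, it returns alpha + data instead. cuBLAS also provides `cublasSscal` for scaling a vector in place, which would avoid the extra vector entirely.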
Your analysis of error seems flawed in a few ways. First, you are computing the absolute error in a float variable (a number which will always be positive, since it's the absolute value), but then you're comparing it against a negative number:
float diff = abs(host[i]-dev[i]);
...
if (diff > (-1e12))
Won't that if test always be true? Perhaps you meant 1e-12, although that would still be flawed. A fixed error threshold is not appropriate for a floating-point comparison; the threshold should be scaled to the size of the numbers being compared. float quantities only contain about 6-7 accurate decimal digits. (And summing these errors is also troublesome.)
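One common way to scale the threshold is a relative comparison. This is a minimal C++ sketch under assumed tolerances; `nearly_equal` and its default `rel_tol` and `abs_tol` values are hypothetical choices for illustration, not values taken from the question or from the corrected code that follows.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// True when a and b agree to within rel_tol relative to the larger magnitude,
// or to within abs_tol when both are extremely close to zero.
bool nearly_equal(float a, float b, float rel_tol = 1e-5f, float abs_tol = 1e-30f) {
    const float diff = std::fabs(a - b);
    if (diff <= abs_tol) return true;
    const float scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= rel_tol * scale;
}
```

For values around 1e-13, a fixed threshold of 1e-12 would accept 1e-13 and 2e-13 as equal even though they differ by a factor of two, whereas the relative test rejects them.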
Here is a complete code that has the above issues fixed, and produces zero sum error for all the comparisons (host<->kernel and host<->cublas):
static float array[] = {0x00000000,
0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0x00000000,0xB58DA1CF,0xB50D2FEC,0x34A48536,0xB4A1D5BC,0x358E1345,0x35943AAC,0xB5983F40,0xB43628BB,0xB4A95348,0xB4DB751C,0xB50C8D1A,0xB3EFCBB5,0x3552B8CD,0x3538A167,0x358FDE0D,0xB4D54CE9,0xB5D29BB7,0xB4A234EE,0x346EF2F4,0x35B5D9F2,0xB40F1487,0x3554BC20,0x33FD9466,0xB536D37D,0xB3C2E594,0xB59DA581,0x3584FC87,0x34438F09,0x35D293CB,0xB4FBB002,0xB59F41E9};
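#include <iostream>
#include <stdio.h>
#include <stdlib.h>   // malloc, free
#include <string.h>   // memcpy
#include <math.h>     // floating-point math helpers such as isnan
#include <cublas_v2.h>
#include <assert.h>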
#define TOL 0.0001
typedef unsigned int u32;
#define GET_STRIDE() u32(blockDim.x * gridDim.x)
#define GET_OFFSET() u32(blockIdx.x * blockDim.x + threadIdx.x)
inline
cudaError_t checkCuda(cudaError_t result)
{
#if defined(DEBUG) || defined(_DEBUG)
if (result != cudaSuccess) {
fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
assert(result == cudaSuccess);
}
#endif
return result;
}
__global__ void div_elementwise(const u32 n, float* source, float* dest, const float value)
{
for (u32 i = GET_OFFSET(); i < n; i += GET_STRIDE())
{
dest[i] = source[i] * value;
}
}
float check_eq(float* dev, float* host, u32 len)
{
float sum = 0.0f;
for (u32 i = 0; i < len; ++i)
{
if (dev[i]!=host[i])
{
//printf("diff %d %f %f\n", i, dev[i], host[i]);
//break;
float diff = fabsf((host[i]-dev[i])/host[i]); // relative error, explicitly single precision
sum += diff;
if (diff > (TOL))
printf("diff %d %f\n", i, diff);
}
}
printf("%f\n", sum);
return sum;
}
void div_host(float* a, float v, u32 len)
{
for (u32 i = 0; i < len; ++i)
{
a[i]=a[i]*v;
}
}
int main()
{
u32 len = sizeof(array)/sizeof(float);
printf("array len = %d\n", len);
for (int i =0; i < len; i++) if (isnan(array[i])) {printf("nan value at %d\n",i); return -1;}
float* adev, *adevcublas, *d_zero;
float* ahost = (float*) malloc(len * sizeof(float));
checkCuda(cudaMalloc(&adev, len * sizeof(float)));
checkCuda(cudaMalloc(&adevcublas, len * sizeof(float)));
checkCuda(cudaMalloc(&d_zero, len * sizeof(float)));
memcpy(ahost, &array[0], len * sizeof(float));
checkCuda(cudaMemcpy(adev, ahost, len * sizeof(float), cudaMemcpyHostToDevice));
checkCuda(cudaMemcpy(adevcublas, ahost, len * sizeof(float), cudaMemcpyHostToDevice));
checkCuda(cudaMemset(d_zero, 0, len*sizeof(float)));
float alpha = 1/2494.f;
printf("%f\n", alpha);
div_host(ahost, alpha, len);
u32 tb = 256;
div_elementwise<<<((len + tb - 1) / tb),tb>>>(len, adev, adev, alpha);
float* r = (float*) malloc(len * sizeof(float));
checkCuda(cudaMemcpy(r, adev, len * sizeof(float), cudaMemcpyDeviceToHost));
check_eq(r,ahost,len);
cublasHandle_t ch;
cublasCreate(&ch);
float* r0 = (float*) malloc(len * sizeof(float));
cublasStatus_t stat = cublasSaxpy(ch, len, &alpha, adevcublas, 1, d_zero, 1);
if (stat != CUBLAS_STATUS_SUCCESS) {std::cout << "CUBLAS error: " << (int)stat << std::endl; return 1;}
checkCuda(cudaMemcpy(r0, d_zero, len * sizeof(float), cudaMemcpyDeviceToHost));
check_eq(r0,ahost,len);
free(r);
free(r0);
free(ahost);
cublasDestroy(ch);
cudaFree(adev);
cudaFree(adevcublas);
cudaFree(d_zero);
return 0;
}

Example of increasing the work per thread in CUDA

Algorithm :
I'm writing a program with CUDA and the problem is the following:
Two matrices A (n * 128) and B (m * 128)
I take the first row of A, and I compute the distance between that vector and all the rows of B, one by one.
I write the result of each distance on a row of a matrix C, so the element C(i,j) of C contains the distance between row i of A and row j of B.
and I proceed with the next row of A.
I've implemented it this way: I've got a grid made by ( n * m ) blocks, and 128 threads per block. ( 1 * 128 ).
QUESTION: The program runs successfully with the expected results, but it executes only around 5 to 10 times faster than the single-threaded CPU version. So I would like to know how to increase the work per thread before the reduction in order to improve performance.
Kernel code (original : Not optimized)
__global__ void EuclideanDistances( float *A, float *B , float *C , int n , int m)
{
// SIZE is equal to 128
__shared__ float accumResult[SIZE];
float sA;
float sB;
// MAPPING
int bx = blockIdx.x; // n
int by = blockIdx.y; // m
int ty = threadIdx.y; // 128
int tx = threadIdx.x; // 1
sA = A [bx * SIZE + ty];
sB = B [by * SIZE + ty];
__syncthreads();
accumResult[ty] = (sA - sB) * (sA - sB);
__syncthreads();
// Parallel tree-reduction
for (int stride = SIZE/2 ; stride > 0 ; stride >>= 1)
if (ty < stride)
{
accumResult[ty] += accumResult [stride + ty];
__syncthreads();
}
// Writing results to output matrix
if ((threadIdx.y == 0))
C [bx * m + by] = accumResult[ty];
__syncthreads();
}
UPDATE
Now, I'm using another mapping : Instead of taking a grid of n by m blocks and a block of 128 threads, I'm increasing the number of threads within a block in order to decrease the number of blocks.
New mapping:
Block of 128 by 8 threads (total of 1024 threads, which is the max size)
Grid of n/8 by m/8 blocks
Unfortunately, it gives wrong results.
Optimized kernel code (to be updated)
__global__ void EuclideanDistances( float *A, float *B , float *C, int n , int m)
{
__shared__ float accumResult[SIZE][8];
__shared__ float sA[SIZE][8];
__shared__ float sB[SIZE][8];
int bx = blockIdx.x; // n / 8
int by = blockIdx.y; // m / 8
int tx = threadIdx.x; // 8
int ty = threadIdx.y; // 128
int i = bx * tx * SIZE + ty;
int j = by * tx * SIZE + ty;
sA[ty][tx] = A [i];
sB[ty][tx] = B[j];
__syncthreads();
accumResult[ty][tx] = (sA[ty][tx] - sB[ty][tx]) * (sA[ty][tx] - sB[ty][tx]);
__syncthreads();
// Reduction
for (int stride = SIZE/2 ; stride > 0 ; stride>>=1)
if (ty < stride)
{
accumResult[ty][tx] += accumResult [stride + ty][tx];
__syncthreads();
}
C[bx * m + by] = accumResult[0][tx];
}
HOST CODE (allocations + kernel calls)
int main()
{
int m = 20000; // matrixB size: m * SIZE
int n = 4000; // matrixA size: n * SIZE
srand((unsigned)time(0));
// Host Allocations
float *matrixA = (float *) malloc (n * SIZE * sizeof(float));
for(int i=0; i < n * SIZE; i++)
matrixA[i] = (float) (rand()%100)+1;
float *matrixB = (float *) malloc (m * SIZE * sizeof(float));
for(int i=0; i < m * SIZE; i++)
matrixB[i] = (float) (rand()%100)+1;
float *results_kernel1 = (float *) malloc (n * m * sizeof(float));
float *results_kernel2 = (float *) malloc (n * m * sizeof(float));
//Device Allocation
float *d_matrixA;
float *d_matrixB;
cudaMalloc((void **)&d_matrixA, n * SIZE * sizeof(float));
cudaMalloc((void **)&d_matrixB, m * SIZE * sizeof(float));
cudaMemcpy(d_matrixA , matrixA , n * SIZE * sizeof(float) , cudaMemcpyHostToDevice);
cudaMemcpy(d_matrixB , matrixB , m * SIZE * sizeof(float) , cudaMemcpyHostToDevice);
float *d_results_kernel1;
float *d_results_kernel2;
cudaMalloc((void **)&d_results_kernel1 , n * m * sizeof(float));
cudaMalloc((void **)&d_results_kernel2 , n * m * sizeof(float));
dim3 threads1 (1 , 128);
dim3 blocks1 (n , m);
EuclideanDistances1 <<<blocks1 , threads1>>> (d_matrixA , d_matrixB , d_results_kernel1 , n , m);
cudaDeviceSynchronize();
cudaMemcpy(results_kernel1 , d_results_kernel1 , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
cudaFree(d_results_kernel1);
dim3 threads2 (8 , 128); // 1024 threads per block (maximum)
dim3 blocks2 (ceil((float)n/8) , ceil((float)m/8));
EuclideanDistances2 <<<blocks2 , threads2>>> (d_matrixA , d_matrixB , d_results_kernel2 , n , m);
cudaDeviceSynchronize();
cudaMemcpy(results_kernel2 , d_results_kernel2 , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
cudaFree(d_results_kernel2);
// Visualising and comparing results
for (int i = 0 ; i < 50 ; i++)
std::cout << "kernel1 : " << results_kernel1[i] << " | kernel2 : " << results_kernel2[i] << std::endl;
free(matrixA);
free(matrixB);
free(results_kernel1);
free(results_kernel2);
return 0;
}
PS: I have CUDA 6.0 with a NVIDIA GTX 650 (compute capability 3.0)
It seems your question has 2 components:
1. Why isn't my second kernel working?
2. How do I make my code run faster?
Why isn't my second kernel working?
You had several issues:
1. Indexing problems in the initial calculation of i, j as well as the index for storing the C value.
2. Violation of usage of __syncthreads() inside a conditional block.
Item 1 was the key element to get the code working.
How do I make my code run faster?
This is more involved. First of all, your attempt at "increasing work per thread" didn't do anything of the kind, it was merely an increase in the number of threads per block (from 128 to 8*128). Each thread was doing approximately the same amount of work. Furthermore, in the process of going to a 2D threadblock for this attempt, I believe a couple of bad things happened:
various coalescing and shared-memory-bank-conflict load and store patterns were broken.
effective occupancy went down, due the amount of shared memory required per block.
The net effect of the second kernel was to approximately double the execution time. So that is not what we want.
However, increasing work per thread may be a good idea, along with using shared memory, as well as trying to preserve good (global, shared) memory access patterns, as well as allowing for increased occupancy.
What follows is a work-in-progress along those lines. The following code has your second kernel fixed, along with timing infrastructure, full data verification, and 2 new kernels. The first new kernel (#3) is what I would call a "naive" kernel. It simply allocates one thread per output point, and each thread loops through the necessary vectors, computing its individual result. No usage of shared memory, or even much attention to coalescing or any other optimization. However, with a tweak to the threadblock configuration, (16,16) -> (8,32) threads, which I observed from @talonmies' answer (now deleted), this kernel performs significantly (3x) faster than your "fast" kernel. After further thought about the (8,32) observation, I concluded that the next attempt at optimization should focus on:
1. Elimination of the usage of a parallel reduction to compute the vector distance (i.e. allow adjacent threads to use a straight for-loop to loop through the vectors).
2. Maximization of benefit from the cache.
3. Efficient usage of shared memory.
4. Perfect global coalescing/perfect usage of shared memory for all reads and writes.
Item 4 prompted the question in the comments "may I transpose the matrices?" With this permission, it's possible to re-organize the data to facilitate item 4 above. Item 2 above is addressed in my "fast" kernel (#4) by loading the B vector into shared memory, while allowing the cache to mostly focus on caching the A vectors, hopefully reducing cache-thrashing (A is the smaller of the 2 vector arrays, at about 2MB; Fermi L2 is 768K, Kepler L2 is 1.5MB). By delivering A in transposed form, and effectively "transposing" B on-chip from shared memory, it's possible to use a straight for-loop to compute the vector distance, while allowing adjacent threads to have perfectly coalesced reads and writes, as well as "efficient" use of shared memory (i.e. non-bank-conflicted loads, and broadcast reads).
For my particular timing, (Quadro5000 cc2.0 GPU, CUDA 6, RHEL 5.5) I see that your "fast" kernel requires about 2 seconds, my "naive" kernel requires about 0.7 seconds, and my "fast" kernel requires about 0.2 seconds, albeit with transposed (A,C) data.
EDIT: I've made one additional optimization, that is to have each block compute multiple (CHKSIZE) B vectors at one time. You can set CHKSIZE to 1 to see the previous result (~0.2sec). I found CHKSIZE of 4 gave good improvement. This is an attack at attempting to exploit the data re-use of A. With this additional optimization at CHKSIZE of 4, the kernel time for kernel 4 drops to about 0.1 second.
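The indexing for the transposed layout is easy to get wrong, so here is a host-only C++ sketch of the same data arrangement: A stored transposed, B row-major, the output stored as C(m, n), and several B vectors handled per pass so that each element of A is read once and reused. It is a CPU reference written for illustration, not the GPU kernel, and `distances_chunked` and its parameter names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Squared Euclidean distances between n vectors of A and m vectors of B,
// each of length `size`.
//   A_T : A stored transposed, element k of vector a is A_T[k * n + a]
//   B   : row-major, element k of vector b is B[b * size + k]
//   C   : output stored as C(m, n), i.e. C[b * n + a]
// `chk` B vectors are processed per pass, so each A element that is read
// is reused chk times. Assumes m is divisible by chk.
std::vector<float> distances_chunked(const std::vector<float>& A_T,
                                     const std::vector<float>& B,
                                     int n, int m, int size, int chk) {
    assert(chk > 0 && m % chk == 0);
    std::vector<float> C(static_cast<std::size_t>(n) * m, 0.0f);
    std::vector<float> acc(chk);
    for (int b0 = 0; b0 < m; b0 += chk) {          // one "block" per chunk of B
        for (int a = 0; a < n; ++a) {              // one "thread" per A vector
            for (int j = 0; j < chk; ++j) acc[j] = 0.0f;
            for (int k = 0; k < size; ++k) {
                const float a_k = A_T[k * n + a];  // read once, used chk times
                for (int j = 0; j < chk; ++j) {
                    const float d = a_k - B[(b0 + j) * size + k];
                    acc[j] += d * d;
                }
            }
            for (int j = 0; j < chk; ++j) C[(b0 + j) * n + a] = acc[j];
        }
    }
    return C;
}
```

In this layout, neighbouring values of `a` touch neighbouring addresses in both `A_T` and `C`, which is the access pattern that lets adjacent GPU threads coalesce their reads and writes. Changing `chk` changes only how much each A read is reused, not the result.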
Following is the code and a sample run:
$ cat t460.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>     // time(), used to seed srand
#include <iostream>
// both M and N must be evenly divisible by SIZE, M must be evenly divisible by CHKSIZE
#define SIZE 128
#define N 4000
#define M 20000
#define CHKSIZE 4
__global__ void EuclideanDistances1( float *A, float *B , float *C , int n , int m)
{
// SIZE is equal to 128
__shared__ float accumResult[SIZE];
float sA;
float sB;
// MAPPING
int bx = blockIdx.x; // n
int by = blockIdx.y; // m
int ty = threadIdx.y; // 128
//int tx = threadIdx.x; // 1
sA = A [bx * SIZE + ty];
sB = B [by * SIZE + ty];
__syncthreads();
accumResult[ty] = (sA - sB) * (sA - sB);
__syncthreads();
// Parallel tree-reduction
for (int stride = SIZE/2 ; stride > 0 ; stride >>= 1){
if (ty < stride)
{
accumResult[ty] += accumResult [stride + ty];
}
__syncthreads();
}
// Writing results to output matrix
if ((ty == 0))
C [bx * m + by] = accumResult[ty];
__syncthreads();
}
__global__ void EuclideanDistances2( float *A, float *B , float *C, int n , int m)
{
__shared__ float accumResult[SIZE][8];
__shared__ float sA[SIZE][8];
__shared__ float sB[SIZE][8];
int bx = blockIdx.x; // n / 8
int by = blockIdx.y; // m
int tx = threadIdx.x; // 8
int ty = threadIdx.y; // 128
int i = ((bx*8) + tx) * SIZE + ty;
int j = by * SIZE + ty;
sA[ty][tx] = A[i];
sB[ty][tx] = B[j];
__syncthreads();
accumResult[ty][tx] = (sA[ty][tx] - sB[ty][tx]) * (sA[ty][tx] - sB[ty][tx]);
__syncthreads();
// Reduction
for (int stride = SIZE/2 ; stride > 0 ; stride>>=1){
if (ty < stride)
{
accumResult[ty][tx] += accumResult [stride + ty][tx];
}
__syncthreads();
}
if (ty == 0)
C[((bx*8)+tx) * m + by] = accumResult[0][tx];
}
//naive kernel
__global__ void EuclideanDistances3( float *A, float *B , float *C, int n , int m){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int idy = threadIdx.y+blockDim.y*blockIdx.y;
float result = 0.0f;
if ((idx < n) && (idy < m)){
for (int i = 0; i < SIZE; i++){
float temp = A[(idx*SIZE)+i] - B[(idy*SIZE)+i];
result += temp * temp;}
C[(idx*m) + idy] = result;
}
}
//optimized kernel
__global__ void EuclideanDistances4( const float *A, const float *B , float *C, const int n , const int m){
// n, A, 4000 this kernel assumes A is column-major A(SIZE, n)
// m, B, 20000 this kernel assumes B is row-major B(m, SIZE)
// this kernel assumes C is column-major C(m,n)
// this kernel assumes number of threads per threadblock == SIZE
// CHKSIZE is the number of B vectors that will be computed per block
__shared__ float my_sB[CHKSIZE*SIZE]; // enough shared storage for CHKSIZE vectors of B
int bx = blockIdx.x; // one block per CHKSIZE rows of B (the larger input matrix)
while ((bx*CHKSIZE) < m){ // not used, this while loop could be used to extend a block to multiple chunks
int tx = threadIdx.x;
for (int i = 0; i < CHKSIZE; i++) // load vectors of B into shared memory
my_sB[(i*SIZE)+tx] = B[(((bx*CHKSIZE)+i)*SIZE)+tx];
__syncthreads();
while (tx < n){ //loop across all vectors in A
float result[CHKSIZE];
for (int i = 0; i < CHKSIZE; i++)
result[i] = 0.0f;
for (int i = 0; i < SIZE; i++){
float Atemp = A[(n*i)+tx];
for (int j = 0; j < CHKSIZE; j++){ // compute all CHKSIZE B vectors with read of A
float temp = Atemp - my_sB[i + (j*SIZE)];
result[j] += temp * temp;}}
for (int i = 0; i < CHKSIZE; i++) // store CHKSIZE results
C[((i+(bx*CHKSIZE))*n)+ tx] = result[i];
tx += blockDim.x; } // continue looping across vectors in A
__syncthreads(); // necessary to prevent warps from racing ahead, if block looping is used
bx += gridDim.x;}
}
float comp_euclid_sq(const float *rA, const float *rB, const int size){
float result = 0.0f;
float temp;
for (int i = 0; i < size; i++){
temp = (rA[i] - rB[i]);
result += temp * temp;}
return result;
}
int main()
{
float et1=0.0f, et2=0.0f, et3=0.0f, et4=0.0f;
cudaEvent_t start1, start2, start3,start4, stop1, stop2, stop3, stop4;
cudaEventCreate(&start1);
cudaEventCreate(&start2);
cudaEventCreate(&start3);
cudaEventCreate(&start4);
cudaEventCreate(&stop1);
cudaEventCreate(&stop2);
cudaEventCreate(&stop3);
cudaEventCreate(&stop4);
int n = N; //MatrixA size : n * SIZE
int m = M; //MatrixB size : m * SIZE
srand((unsigned)time(0));
// Host Allocations
float *matrixA = (float *) malloc (n * SIZE * sizeof(float));
for(int i=0; i < n * SIZE; i++)
matrixA[i] = (float) (rand()%100)+1;
float *matrixB = (float *) malloc (m * SIZE * sizeof(float));
for(int i=0; i < m * SIZE; i++)
matrixB[i] = (float) (rand()%100)+1;
float *results_kernel = (float *) malloc (n * m * sizeof(float));
float *cpu_results_kernel = (float *) malloc (n * m * sizeof(float));
for (int i = 0; i< n*m; i++)
cpu_results_kernel[i] = comp_euclid_sq(matrixA + ((i/m)*SIZE), matrixB + (i%m)*SIZE, SIZE);
//Device Allocation
float *d_matrixA;
float *d_matrixB;
cudaMalloc((void **)&d_matrixA, n * SIZE * sizeof(float));
cudaMalloc((void **)&d_matrixB, m * SIZE * sizeof(float));
cudaMemcpy(d_matrixA , matrixA , n * SIZE * sizeof(float) , cudaMemcpyHostToDevice);
cudaMemcpy(d_matrixB , matrixB , m * SIZE * sizeof(float) , cudaMemcpyHostToDevice);
float *d_results_kernel;
cudaMalloc((void **)&d_results_kernel , n * m * sizeof(float));
dim3 threads1 (1 , SIZE);
dim3 blocks1 (n , m);
cudaEventRecord(start1);
EuclideanDistances1 <<<blocks1 , threads1>>> (d_matrixA , d_matrixB , d_results_kernel , n , m);
cudaEventRecord(stop1);
cudaMemcpy(results_kernel , d_results_kernel , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
for (int i = 0; i< n*m; i++) {
if (results_kernel[i] != cpu_results_kernel[i]) {printf("cpu/kernel1 mismatch at %d, cpu: %f, kernel1: %f\n", i, cpu_results_kernel[i], results_kernel[i]); return 1;}}
cudaMemset(d_results_kernel, 0, n*m*sizeof(float));
cudaEventSynchronize(stop1);
cudaEventElapsedTime(&et1, start1, stop1);
dim3 threads2 (8 , SIZE); // 1024 threads per block (maximum)
dim3 blocks2 (n/8 , m); // assumes n evenly divisible by 8
cudaEventRecord(start2);
EuclideanDistances2 <<<blocks2 , threads2>>> (d_matrixA , d_matrixB , d_results_kernel , n , m);
cudaEventRecord(stop2);
cudaMemcpy(results_kernel , d_results_kernel , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
for (int i = 0; i< n*m; i++) {
if (results_kernel[i] != cpu_results_kernel[i]) {printf("cpu/kernel2 mismatch at %d, cpu: %f, kernel2: %f\n", i, cpu_results_kernel[i], results_kernel[i]); return 1;}}
cudaMemset(d_results_kernel, 0, n*m*sizeof(float));
cudaEventSynchronize(stop2);
cudaEventElapsedTime(&et2, start2, stop2);
cudaFuncSetCacheConfig(EuclideanDistances3, cudaFuncCachePreferL1);
dim3 threads3 (8, 32); // 1024 threads per block (maximum)
dim3 blocks3 (n/threads3.x , m/threads3.y); // assumes evenly divisible
cudaEventRecord(start3);
EuclideanDistances3 <<<blocks3 , threads3>>> (d_matrixA , d_matrixB , d_results_kernel , n , m);
cudaEventRecord(stop3);
cudaMemcpy(results_kernel , d_results_kernel , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
for (int i = 0; i< n*m; i++) {
if (results_kernel[i] != cpu_results_kernel[i]) {printf("cpu/kernel3 mismatch at %d, cpu: %f, kernel3: %f\n", i, cpu_results_kernel[i], results_kernel[i]); return 1;}}
cudaMemset(d_results_kernel, 0, n*m*sizeof(float));
cudaEventSynchronize(stop3);
cudaEventElapsedTime(&et3, start3, stop3);
// transpose matrix A
float *matrixA_T = (float *) malloc (n * SIZE * sizeof(float));
for (int i = 0; i < n; i++)
for (int j = 0; j < SIZE; j++)
matrixA_T[(j*n)+i] = matrixA[(i*SIZE)+j];
cudaMemcpy(d_matrixA , matrixA_T , n * SIZE * sizeof(float) , cudaMemcpyHostToDevice);
cudaFuncSetCacheConfig(EuclideanDistances4, cudaFuncCachePreferL1);
dim3 threads4(SIZE); // one thread per vector element
dim3 blocks4(m/CHKSIZE);
cudaEventRecord(start4);
EuclideanDistances4 <<<blocks4 , threads4>>> (d_matrixA , d_matrixB , d_results_kernel , n , m);
cudaEventRecord(stop4);
cudaMemcpy(results_kernel , d_results_kernel , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
// test for correct transposed result C(m,n)
for (int i = 0; i< n; i++)
for (int j = 0; j < m; j++)
if (results_kernel[(j*n)+i] != cpu_results_kernel[(i*m)+j]) {printf("cpu/kernel4 mismatch at %d,%d, cpu: %f, kernel4: %f\n", i,j, cpu_results_kernel[(i*m)+j], results_kernel[(j*n)+i]); return 1;}
cudaEventSynchronize(stop4);
cudaEventElapsedTime(&et4, start4, stop4);
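// release device buffers
cudaFree(d_results_kernel);
cudaFree(d_matrixA);
cudaFree(d_matrixB);
printf("Success!\n");
printf("kernel1 : %.fms, kernel2 : %.fms, kernel3 : %.fms, kernel4 : %.fms\n", et1, et2, et3, et4);
// release host buffers
free(matrixA);
free(matrixA_T);
free(matrixB);
free(results_kernel);
free(cpu_results_kernel);
return 0;
}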
$ nvcc -O3 -arch=sm_20 -o t460 t460.cu
$ ./t460
Success!
kernel1 : 2213ms, kernel2 : 4660ms, kernel3 : 691ms, kernel4 : 99ms
$
Hopefully that gives you some more ideas of things to work on. Your timings will of course differ on your cc3.0 device.
Are further optimizations possible? Probably. The first target I would look at is how to take advantage of the data-reuse opportunities on vector A. Data reuse of vector B is already handled in kernel 4 by loading it into shared memory; there may be ways to use some shared memory to store portions of A as well, to make the code run even faster.
I should also mention that, following the lead of the code you provided, this code computes the square of the euclidean distance. A trivial modification to the kernels makes it compute the actual euclidean distance instead (C[...] = sqrtf(...);). The validation I have included, however, assumes the results are "in-range" for exact storage of an integer quantity in a float. Your test case satisfies this requirement, but if sqrtf were used the validation code would need to be modified.
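If the kernels are changed to store sqrtf(...), the exact `!=` comparison against the CPU result is no longer a safe test, since the two sides may round differently. Below is a minimal sketch of a tolerance-based check in host-side C++. The helper names `nearly_equal` and `first_mismatch` and the tolerance values are assumptions made for illustration; they are not part of the code above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: true when a and b agree to within a relative
// tolerance, with an absolute floor so values near zero still compare equal.
bool nearly_equal(float a, float b, float rel_tol = 1e-5f, float abs_tol = 1e-6f)
{
    float diff  = std::fabs(a - b);
    float scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(abs_tol, rel_tol * scale);
}

// Hypothetical helper: index of the first element where the GPU and CPU
// results disagree beyond the tolerance, or -1 if all n elements agree.
int first_mismatch(const float *gpu, const float *cpu, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        if (!nearly_equal(gpu[i], cpu[i]))
            return i;
    return -1;
}
```

In the validation loops, a call such as `first_mismatch(results_kernel, cpu_results_kernel, n*m)` would stand in for the element-by-element `!=` test. The transposed output of kernel 4 would still need its own index mapping.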

CUDA Matrix Addition Seg faults

I have a question about a CUDA program I wrote. It lets me enter the size of the matrix (columns and rows). If I enter ~1124, it computes fine. If I enter 1149, it seg faults AFTER computing on the device (I think it's seg faulting during the copy back). If I enter 2000, it seg faults BEFORE computing on the device (I think it seg faults during the copy over). I think my issue is entirely with memory management. If you could point me in the right direction I'd appreciate it.
I updated the code to show how it is called. The new edit (at the bottom) contains: sumMatrix (a blank matrix of size eleCount1, which is the size of the entire matrix), matrixOne (the first matrix), matrixTwo (the second matrix, allocated the same way as matrixOne), and eleCount1 (the entire size of the matrix). Both matrixOne and matrixTwo are read in from a file.
In case it is useful, here are some details about my GPU:
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
The code is:
__global__ void addKernel(float *c, float *a, float *b)
{
int i = threadIdx.x;
int idx = blockDim.x * blockIdx.x + threadIdx.x;
c[idx] = a[idx] + b[idx];
}
cudaError_t addWithCuda(float *c, float *a, float *b, size_t size)
{
float *dev_a = 0;
float *dev_b = 0;
float *dev_c = 0;
cudaError_t cudaStatus;
blocksNeeded=(size/MAXTHREADS)+1;
int threadsPerBlock = MAXTHREADS/blocksNeeded+1;
cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(float));
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed!");
goto Error;
}
cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(float));
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed!");
goto Error;
}
cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(float));
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed!");
goto Error;
}
cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(float), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed!");
goto Error;
}
cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(float), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed!");
goto Error;
}
addKernel<<<blocksNeeded, size>>>(dev_c, dev_a, dev_b);
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
goto Error;
}
cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(float), cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed!");
goto Error;
}
Error:
cudaFree(dev_c);
cudaFree(dev_a);
cudaFree(dev_b);
return cudaStatus;
}
//edit: added how the matrix are allocated
float* matrixOne = (float*)malloc(sizeof(float)*file1size);
int matrixIndex = 0;
readFromFile(fd,byte, matrixOneWidth, matrixOneHeight, matrixOne);
//matrixOneHeight--;
eleCount1 = matrixOneHeight*matrixOneWidth;
matrixOne= (float*)realloc(matrixOne,eleCount1*sizeof(float));
//Edit: Added how the addWithCuda is called.
cudaStatus = addWithCuda(sumMatrix, matrixOne,matrixTwo,eleCount1);
//sumMatrix is created after we know how large the matrices are.
float sumMatrix[eleCount1];
You are not testing the bounds of your computation inside the kernel. If the total amount of work is not evenly divisible by the block size, some threads will try to write to indices outside the output array. I suggest you also pass the size as a parameter to the kernel and introduce the check:
__global__ void addKernel(float *c, float *a, float *b, int size)
{
int i = threadIdx.x;
int idx = blockDim.x * blockIdx.x + threadIdx.x;
if(idx < size) c[idx] = a[idx] + b[idx];
}
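The guard matters because the grid is normally rounded up to a whole number of blocks, so the last block usually contains surplus threads. Below is a small host-side sketch of that arithmetic in plain C++. `LaunchConfig`, `make_launch_config` and `surplus_threads` are names assumed here for illustration, and the default of 1024 threads per block is taken from the device limits listed in the question.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical container for a 1-D launch configuration.
struct LaunchConfig {
    unsigned int blocks;          // grid size (x dimension)
    unsigned int threadsPerBlock; // block size (x dimension)
};

// Round the grid up so that blocks * threadsPerBlock >= size.
LaunchConfig make_launch_config(std::size_t size, unsigned int maxThreads = 1024)
{
    assert(size > 0 && maxThreads > 0);
    LaunchConfig cfg;
    cfg.threadsPerBlock = maxThreads;
    cfg.blocks = static_cast<unsigned int>((size + maxThreads - 1) / maxThreads);
    return cfg;
}

// Number of launched threads with idx >= size; these are the threads
// that the if(idx < size) check in the kernel has to mask off.
std::size_t surplus_threads(const LaunchConfig &cfg, std::size_t size)
{
    return static_cast<std::size_t>(cfg.blocks) * cfg.threadsPerBlock - size;
}
```

Assuming an input of 1149 means a 1149 x 1149 matrix (1,320,201 elements), this gives 1290 blocks of 1024 threads, which is 1,320,960 threads in total, so 759 of them fall outside the arrays and depend on the check. The kernel would then be launched as `addKernel<<<cfg.blocks, cfg.threadsPerBlock>>>(dev_c, dev_a, dev_b, size)`. Note that the launch in the question passes `size` as the block size, which is above the 1024-threads-per-block limit reported for that GPU whenever the matrix has more than 1024 elements.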
I see that you are indexing into arrays a, b and c in your kernel, but you do not check to make sure that the index is within the array bounds. You are therefore writing into memory that you do not own, causing seg faults in random places.