I tried to use cudaMallocPitch and cudaMemcpy2D, but when I call cudaMemcpy2D with a large array, I run into a problem:
Segmentation fault
Here is the runnable source code; with the sizes below it completes with no error.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cstdio>
#include <iostream>
#include <random>
#define ROW_SIZE 32
#define COL_SIZE 1024
int main()
{
float ** pfTest;
pfTest = (float**)malloc(ROW_SIZE * sizeof(float*));
for (int i = 0; i < ROW_SIZE; i++) {
pfTest[i] = (float*)malloc(COL_SIZE * sizeof(float));
}
std::default_random_engine generator;
std::uniform_real_distribution<float> distribution;
for (int y = 0; y < ROW_SIZE; y++) {
for (int x = 0; x < COL_SIZE; x++) {
pfTest[y][x] = distribution(generator);
}
}
float *dev_Test;
size_t pitch;
cudaMallocPitch(&dev_Test, &pitch, COL_SIZE * sizeof(float), ROW_SIZE);
cudaMemcpy2D(dev_Test, pitch, pfTest, COL_SIZE * sizeof(float), COL_SIZE * sizeof(float), ROW_SIZE, cudaMemcpyHostToDevice);
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
return 0;
}
As you can see, there's no problem at all.
But when I increase COL_SIZE to around 500,000 (exactly 524,288), it crashes with a segmentation fault.
Any help as to the source of the problem?
cudaMemcpy2D can only be used for copying pitched linear memory. Your source array is not pitched linear memory; it is an array of pointers to separately allocated rows. That is not supported and is the source of the segfault. The small case only appears to work because cudaMemcpy2D treats pfTest itself as the start of a pitched buffer and reads ROW_SIZE rows of COL_SIZE * sizeof(float) bytes from it, i.e. the pointer array plus whatever host memory happens to follow it. With small sizes that read most likely stays within mapped memory (so you silently copy garbage); with COL_SIZE = 524288 it runs far past the end of the allocation into unmapped memory and the host segfaults.
Try something like this:
float* buffer;
float** pfTest;
const size_t buffer_pitch = size_t(COL_SIZE) * sizeof(float);
buffer = (float*)malloc(size_t(ROW_SIZE) * buffer_pitch);
pfTest = (float**)malloc(ROW_SIZE * sizeof(float*));
for (size_t i = 0; i < ROW_SIZE; i++) {
pfTest[i] = buffer + i * size_t(COL_SIZE);
}
// ...
cudaMallocPitch(&dev_Test, &pitch, buffer_pitch, ROW_SIZE);
cudaMemcpy2D(dev_Test, pitch, buffer, buffer_pitch,
buffer_pitch, ROW_SIZE, cudaMemcpyHostToDevice);
[Note: written in browser, never tested or compiled, use at own risk]
i.e. store the data to be copied in a single contiguous allocation, which can then act as a pitched linear source for cudaMemcpy2D. If you insist on using [][] style indexing on the host, then you pay the penalty of storing an additional array of row pointers alongside the data. Note that the pointer array isn't actually necessary: you could just index directly into buffer and achieve the same result, while saving memory at the same time.
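For illustration, a minimal, untested sketch of the direct-indexing variant (reusing generator, distribution, dev_Test and pitch from the question; buffer[y * COL_SIZE + x] is the element at row y, column x):
float* buffer = (float*)malloc(size_t(ROW_SIZE) * size_t(COL_SIZE) * sizeof(float));
for (size_t y = 0; y < ROW_SIZE; y++) {
    for (size_t x = 0; x < COL_SIZE; x++) {
        buffer[y * COL_SIZE + x] = distribution(generator);   // fill row-major
    }
}
cudaMallocPitch(&dev_Test, &pitch, size_t(COL_SIZE) * sizeof(float), ROW_SIZE);
cudaMemcpy2D(dev_Test, pitch, buffer, size_t(COL_SIZE) * sizeof(float),
             size_t(COL_SIZE) * sizeof(float), ROW_SIZE, cudaMemcpyHostToDevice);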
So I need to create a matrix with different row lengths; this is what it looks like in plain C/C++:
int** MpesosT = (int**)malloc(N * sizeof(int*));
for (int i = 0; i < N; i++)
{
MpesosT[i] = (int*)malloc(vecinosT[i] * sizeof(int));
}
However, I don't know how to do this using the CUDA function to allocate memory:
int* Vector; cudaMallocManaged(&Vector, VectorSize* sizeof(int));
I can't just use a vector of size N*N or something, because every row has a different size, so how could I do that?
Took a couple of hours, but I found the way to do it. In case anyone has the same problem:
double** Matrix;
cudaMallocManaged((double***)&Matrix, N * sizeof(double*));
for (int i = 0; i < N; i++)
{
cudaMallocManaged((double**)&Matrix[i], rowlength[i] * sizeof(double));
}
This way, every row has a different length
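Since managed memory is visible from both the host and the device, such a jagged matrix can be filled on the CPU and read inside a kernel with ordinary [][] indexing. A minimal, untested sketch (the kernel, its name and the scaling factor are made up for illustration; rowlength must itself be device-visible, e.g. also managed):
// hypothetical kernel: each thread scales one row of the jagged matrix
__global__ void scaleRows(double** Matrix, int* rowlength, int N, double f)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N) {
        for (int j = 0; j < rowlength[row]; j++)
            Matrix[row][j] *= f;
    }
}

// host side:
int* rowlength;
cudaMallocManaged(&rowlength, N * sizeof(int));
// ... fill rowlength, allocate Matrix as above, fill its rows ...
scaleRows<<<(N + 255) / 256, 256>>>(Matrix, rowlength, N, 2.0);
cudaDeviceSynchronize(); // required before the host touches the data again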
I am attempting to load a .mat file containing a tensor of known dimensions (144x192x256) in C++.
I have adjusted the linear index for the read operation to be column major, as in MATLAB, but I am still getting memory access errors.
void FeatureLoader::readMat(const std::string &fname, Image< std::vector<float> > *out) {
//Read MAT file.
const char *mode = "r";
MATFile *matFile = matOpen(fname.c_str(), mode);
if (matFile == NULL) {
throw std::runtime_error("Cannot read MAT file.");
}
//Copy the data from column major to row major storage.
float *newData = out->GetData();
const mxArray *arr = matGetVariable(matFile, "map");
if (arr == NULL) {
throw std::runtime_error("Cannot read variable.");
}
double *arrData = (double*)mxGetPr(arr);
#pragma omp parallel for
for (int i = 0; i < 144; i++) {
#pragma omp parallel for
for (int j = 0; j < 192; j++) {
for (int k = 0; k < 256; k++) {
int rowMajIdx = (i * 192 + j) * 256 + k;
int colMajIdx = (j * 144 + i) * 256 + k;
newData[rowMajIdx] = static_cast<float>(arrData[colMajIdx]);
}
}
}
}
In the above snippet, am I right to be accessing the data linearly, as with a flattened 3D array in C++? For example:
idx_row_major = (x*WIDTH + y)*DEPTH + z
idx_col_major = (y*HEIGHT + x)*DEPTH + z
Is this the underlying representation that MATLAB uses?
You have some errors in the indexing of the row-major and column-major idx. Additionally, naively accessing the data can lead to very slow code due to random memory access (memory latency is key!).
The best way to pass data from MATLAB to C++ (from 3D to 1D) is shown in the example below.
In this example we illustrate how to take a double real-type 3D matrix from MATLAB and pass it to a C double* array.
The main objectives of this example are to show how to obtain data from MATLAB MEX arrays and to highlight some small details of matrix storage and handling.
matrixIn.cpp
#include "mex.h"
void mexFunction(int nlhs , mxArray *plhs[],
int nrhs, mxArray const *prhs[]){
// check amount of inputs
if (nrhs!=1) {
mexErrMsgIdAndTxt("matrixIn:InvalidInput", "Invalid number of inputs to MEX file.");
}
// check type of input
if( !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0])){
mexErrMsgIdAndTxt("matrixIn:InvalidType", "Input matrix must be a double, non-complex array.");
}
// extract the data
double const * const matrixAux= static_cast<double const *>(mxGetData(prhs[0]));
// Get matrix size
const mwSize *sizeInputMatrix= mxGetDimensions(prhs[0]);
// allocate array in C. Note: its 1D array, not 3D even if our input is 3D
double* matrixInC= (double*)malloc(sizeInputMatrix[0] *sizeInputMatrix[1] *sizeInputMatrix[2]* sizeof(double));
// MATLAB is column major, not row major (as C). We need to reorder the numbers
// Basically permutes dimensions
// NOTE: the ordering of the loops is optimized for fastest memory access!
// This improves the speed by about 300%
const int size0 = sizeInputMatrix[0]; // Const makes compiler optimization kick in
const int size1 = sizeInputMatrix[1];
const int size2 = sizeInputMatrix[2];
for (int j = 0; j < size2; j++)
{
int jOffset = j*size0*size1; // this saves re-computation time
for (int k = 0; k < size0; k++)
{
int kOffset = k*size1; // this saves re-computation time
for (int i = 0; i < size1; i++)
{
int iOffset = i*size0;
matrixInC[i + jOffset + kOffset] = matrixAux[iOffset + jOffset + k];
}
}
}
// we are done!
// Use your C matrix here
// free memory
free(matrixInC);
return;
}
The relevant concepts to be aware of:
MATLAB matrices are all 1D in memory, no matter how many dimensions they have when used in MATLAB. This is also true for most (if not all) of the main matrix representations in C/C++ libraries, as it allows optimization and faster execution.
You need to explicitly copy matrices from MATLAB to C in a loop.
MATLAB matrices are stored in column-major order, as in Fortran, but C/C++ and most modern languages are row-major. It is important to permute the input matrix, or else the data will look completely different.
The relevant functions in this example are:
mxIsDouble checks if the input is of double type.
mxIsComplex checks whether the input is complex.
mxGetData returns a pointer to the real data in the input array, or NULL if there is no real data.
mxGetDimensions returns a pointer to a mwSize array with the size of each dimension.
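As a standalone illustration of the two index formulas (outside of any MEX code, with made-up dimensions and values), a minimal sketch of the column-major to row-major permutation:
#include <cstdio>

int main()
{
    const int d0 = 2, d1 = 3, d2 = 2;   // MATLAB-style size: [d0 d1 d2]
    double colMajor[d0 * d1 * d2];
    double rowMajor[d0 * d1 * d2];

    // Column-major (MATLAB): element (x,y,z) lives at x + y*d0 + z*d0*d1.
    for (int z = 0; z < d2; z++)
        for (int y = 0; y < d1; y++)
            for (int x = 0; x < d0; x++)
                colMajor[x + y * d0 + z * d0 * d1] = 100 * x + 10 * y + z;

    // Row-major (C/C++): element (x,y,z) lives at (x*d1 + y)*d2 + z.
    for (int x = 0; x < d0; x++)
        for (int y = 0; y < d1; y++)
            for (int z = 0; z < d2; z++)
                rowMajor[(x * d1 + y) * d2 + z] = colMajor[x + y * d0 + z * d0 * d1];

    // Both expressions read element (1,2,1) and print 121.
    printf("%g %g\n", colMajor[1 + 2 * d0 + 1 * d0 * d1], rowMajor[(1 * d1 + 2) * d2 + 1]);
    return 0;
}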
I'm trying to use the GCC vector extensions (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html) to speed up matrix multiplication. The idea is to use SIMD instructions to multiply and add four floats at once. A minimal working example is listed below. The example works fine when multiplying an (M=10, K=12) matrix by a (K=12, N=12) matrix. When I change the parameters (say N=9), however, I get a segmentation fault.
I suspect this is due to memory alignment issues. In my understanding, when using SIMD on a 16-byte vector (in this case float4), the target memory address should be a multiple of 16. There are already discussions of memory alignment issues with SIMD instructions (e.g. Relationship between SSE vectorization and Memory alignment). In the example below, &b(0,0) is 0x810e10 while &b(1,0) is 0x810e34, which is not a multiple of 16.
My questions are,
Is it true that I'm getting the segfault for the memory alignment issues?
Can anyone tell me how to fix the problem easily? I've thought of using a two-dimensional array instead of one array, but I don't want to do this so as not to change the rest of the code.
Minimal Working Example
#include <iostream>
#include <cstdlib>
#include <stdio.h>
#include <cstring>
#include <assert.h>
#include <algorithm>
using namespace std;
typedef float float4 __attribute__((vector_size (16)));
static inline void * alloc64(size_t sz) {
void * a = 0;
if (posix_memalign(&a, 64, sz) != 0) {
perror("posix_memalign");
exit(1);
}
return a;
}
struct Mat {
size_t m,n;
float * a;
Mat(size_t m_, size_t n_, float f) {
m = m_;
n = n_;
a = (float*) malloc(sizeof(float) * m * n);
fill(a,a + m * n,f);
}
/* a(i,j) */
float& operator()(long i, long j) {
return a[i * n + j];
}
};
Mat operator* (Mat a, Mat b) {
Mat c(a.m, b.n,0);
assert(a.n == b.m);
for (long i = 0; i < a.m; i++) {
for(long k = 0; k < a.n; k++){
float aa = a(i,k);
float4 a4 = {aa,aa,aa,aa};
long j;
for (j = 0; j <= b.n-4; j+=4) {
*((float4 *)&c(i,j)) = *((float4 *)&c(i,j)) + a4 * (*(float4 *)&b(k,j));
}
while(j < b.n){
c(i,j) += aa * b(k,j);
j++;
}
}
}
return c;
}
const int M = 10;
const int K = 12;
const int N = 12;
int main(){
Mat a(M,K,1);
Mat b(K,N,1);
Mat c = a * b;
for(int i = 0; i < M; i++){
for(int j = 0; j < N; j++)
cout << c(i,j) << " ";
cout << endl;
}
cout << endl;
}
In my understanding, when using SIMD on a 16-byte vector (in this case float4), the target memory address should be a multiple of 16.
That is incorrect on x64 processors. There are instructions that require alignment, but you can perfectly well write and read SIMD registers from unaligned memory locations without penalty and with absolute safety using the right instructions.
Is it true that I'm getting the segfault for the memory alignment issues?
Yes.
But it is not related to SIMD instructions. In C/C++, it is undefined behavior to write *((float4 *)&c) = ... the way you do, and can certainly crash, but you can reproduce the problem without vectorization... Given the right circumstances, the following basic code will crash...
char * c = ...
*(int *) c = 1;
Can anyone tell me how to fix the problem easily? I've thought of using a two-dimensional array instead of one array, but I don't want to do this so as not to change the rest of the code.
The typical workaround is to use memcpy. Let us look at a code example...
#include <string.h>
typedef float float4 __attribute__((vector_size (16)));
void writeover(float * x, float4 y) {
*(float4 * ) x = y;
}
void writeover2(float * x, float4 y) {
memcpy(x,&y,sizeof(y));
}
With, say, clang++, these two functions get compiled to vmovaps and vmovups respectively. These are equivalent instructions, but the first one will crash if your pointer is not aligned on sizeof(float4). Both are very fast on recent hardware.
The point is that you can often rely on memcpy to generate code that is nearly optimally fast. Of course, the amount of overhead you get (if any) will depend on the compiler you are using.
If you do get performance problems, then you can use Intel intrinsics or assembly instead... but chances are good that memcpy will serve you well.
A different fix is to work only in terms of float4* pointers. This forces all your matrices to have dimensions divisible by four, but if you pad the leftovers with zeroes you will probably get simple and really fast code.
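Going back to the memcpy route: as a concrete, untested sketch, the inner loops of operator* from the question could be rewritten like this (same Mat struct, same surrounding k-loop, <string.h> assumed to be included; note that j + 4 <= b.n also avoids the unsigned wrap-around of b.n - 4 when b.n < 4):
long j;
for (j = 0; j + 4 <= (long)b.n; j += 4) {
    float4 b4, c4;
    memcpy(&b4, &b(k, j), sizeof(b4));   // safe unaligned load
    memcpy(&c4, &c(i, j), sizeof(c4));
    c4 = c4 + a4 * b4;
    memcpy(&c(i, j), &c4, sizeof(c4));   // safe unaligned store
}
for (; j < (long)b.n; j++)
    c(i, j) += aa * b(k, j);             // scalar remainder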
I am trying to implement the k-means algorithm in CUDA, using a Tesla card on an external Unix machine. I read the input file and store the coordinates of all data points in the dataX and dataY arrays. The next step is to select every centreInterval-th point and store it in another array allocated in GPU memory. However, I have no idea how I can even check what the problem is, since all I get is 'Segmentation fault', and for obvious reasons I can't print any kind of output from the kernel.
EDIT 2: I simplified this example down to the shortest possible reproduction. I found the solution along the way, but decided to post the not-yet-solved version of the code in this question to make it clearer what caused the problem.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <math.h>
#include <time.h>
#include <unistd.h>
#define BLOCK_SIZE 16
// My kernel - Selects some centres at the beginning of algorithm and stores it at appropriate place
__global__ void kMeansSelectInitialCentres(float* d_dataX, float* d_dataY, float* d_centresX, float* d_centresY, int centreInterval) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int idx = i * centreInterval;
d_centresX[i] = d_dataX[idx];
d_centresY[i] = d_dataY[idx];
}
// Simplified example
int main(int argn, char ** argc) {
// My data - let's say it is 32 floats in each
int dataSize = 32;
float* dataX = new float[dataSize];
float* dataY = new float[dataSize];
// Fill arrays with numbers
for (int i = 0; i < dataSize; i++) {
dataX[i] = i;
dataY[i] = i;
}
// Interval - we select first number, then 1 + N * centreInterval
int centreInterval = 2;
// There I will store my results in program
int centreSize = dataSize / centreInterval;
float* centresX = new float[centreSize];
float* centresY = new float[centreSize];
// Pointers to the arrays stored in GPU memory
float* d_dataX;
float* d_dataY;
float* d_centresX;
float* d_centresY;
// Allocate memory for those arrays
// Calculate how much space in memory do we need for this
size_t d_centreSize = sizeof(float) * centreSize;
size_t d_dataSize = sizeof(float) * dataSize;
// Memory for raw data
cudaMalloc((void**)&d_dataX, d_dataSize);
cudaMalloc((void**)&d_dataY, d_dataSize);
// Copy raw data to the device memory so we can operate on it freely
cudaMemcpy(d_dataY, dataY, d_dataSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_dataX, dataX, d_dataSize, cudaMemcpyHostToDevice);
// Memory for centre results
cudaMalloc((void**)&d_centresX, d_dataSize);
cudaMalloc((void**)&d_centresY, d_dataSize);
// Call kernel
dim3 dimBlock(BLOCK_SIZE);
dim3 dimGridK((centreSize + dimBlock.x) / dimBlock.x);
kMeansSelectInitialCentres <<<dimGridK, dimBlock>>> (d_dataX, d_dataY, d_centresX, d_centresY, centreInterval);
// Check results - we get every n-th point
float* check_x = new float[centreSize];
float* check_y = new float[centreSize];
cudaMemcpy(check_x, d_centresX, d_dataSize, cudaMemcpyDeviceToHost);
cudaMemcpy(check_y, d_centresY, d_dataSize, cudaMemcpyDeviceToHost);
printf("X: ");
for (int i = 0; i < centreSize; i++)
printf("%.2f ", check_x[i]);
printf("\nY: ");
for (int i = 0; i < centreSize; i++)
printf("%.2f ", check_y[i]);
printf("\n");
}
Main question: What is wrong with this kernel / with the readback of the data?
Side question: Is there any fair way to debug program kernels in such situations?
So, here's the solution I came up with after simplifying my case. There was a problem with memory usage: I tried to store / read a different amount of data than I had claimed when allocating it. I hope it will be helpful for anyone in the future:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <math.h>
#include <time.h>
#include <unistd.h>
#define BLOCK_SIZE 16
// My kernel - Selects some centres at the beginning of algorithm and stores it at appropriate place
__global__ void kMeansSelectInitialCentres(float* d_dataX, float* d_dataY, float* d_centresX, float* d_centresY, int centreInterval) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int idx = i * centreInterval;
d_centresX[i] = d_dataX[idx];
d_centresY[i] = d_dataY[idx];
}
// Simplified example
int main(int argn, char ** argc) {
// My data - let's say it is 32 floats in each
int dataSize = 32;
float* dataX = new float[dataSize];
float* dataY = new float[dataSize];
// Fill arrays with numbers
for (int i = 0; i < dataSize; i++) {
dataX[i] = i;
dataY[i] = i;
}
// Interval - we select first number, then 1 + N * centreInterval
int centreInterval = 2;
// There I will store my results in program
int centreSize = dataSize / centreInterval;
float* centresX = new float[centreSize];
float* centresY = new float[centreSize];
// Pointers to the arrays stored in GPU memory
float* d_dataX;
float* d_dataY;
float* d_centresX;
float* d_centresY;
// Allocate memory for those arrays
// Calculate how much space in memory do we need for this
size_t d_centreSize = sizeof(float) * centreSize;
size_t d_dataSize = sizeof(float) * dataSize;
// Memory for raw data
cudaMalloc((void**)&d_dataX, d_dataSize);
cudaMalloc((void**)&d_dataY, d_dataSize);
// Copy raw data to the device memory so we can operate on it freely
cudaMemcpy(d_dataY, dataY, d_dataSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_dataX, dataX, d_dataSize, cudaMemcpyHostToDevice);
// Memory for centre results
cudaMalloc((void**)&d_centresX, d_centreSize);
cudaMalloc((void**)&d_centresY, d_centreSize);
// Call kernel
dim3 dimBlock(BLOCK_SIZE);
dim3 dimGridK((centreSize + dimBlock.x) / dimBlock.x);
kMeansSelectInitialCentres <<<dimGridK, dimBlock>>> (d_dataX, d_dataY, d_centresX, d_centresY, centreInterval);
// Check results - we get every n-th point
float* check_x = new float[centreSize];
float* check_y = new float[centreSize];
cudaMemcpy(check_x, d_centresX, d_centreSize, cudaMemcpyDeviceToHost);
cudaMemcpy(check_y, d_centresY, d_centreSize, cudaMemcpyDeviceToHost);
printf("X: ");
for (int i = 0; i < centreSize; i++)
printf("%.2f ", check_x[i]);
printf("\nY: ");
for (int i = 0; i < centreSize; i++)
printf("%.2f ", check_y[i]);
printf("\n");
}
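Regarding the side question: one fair, if low-tech, approach is to check the return value of every CUDA call and to query the error state around the kernel launch; cuda-memcheck (now compute-sanitizer) can additionally localize out-of-bounds accesses inside kernels. A hedged sketch using only standard runtime API calls (cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString), with the variable names taken from the code above:
// Sketch: a small helper macro for checking CUDA API calls and kernel launches
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// usage around the kernel launch:
kMeansSelectInitialCentres <<<dimGridK, dimBlock>>> (d_dataX, d_dataY, d_centresX, d_centresY, centreInterval);
CUDA_CHECK(cudaGetLastError());        // launch-configuration errors
CUDA_CHECK(cudaDeviceSynchronize());   // errors raised during kernel execution
CUDA_CHECK(cudaMemcpy(check_x, d_centresX, d_centreSize, cudaMemcpyDeviceToHost));
This would not have caught the oversized host-side copy directly (the runtime cannot know how big check_x is), but a host-side tool such as valgrind, together with compute-sanitizer on the device side, usually pinpoints exactly this kind of size mismatch.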
I'm new to CUDA, and I've been trying to figure out what I'm doing wrong here. CUDA is taking longer than just using the CPU to multiply a matrix. If I'm doing something wrong, please let me know.
Here is my code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <cstdlib>
#include <assert.h>
#include <time.h>
#define size 100 // Matrix size
#define cols size // Matrix width
#define rows size // Matrix height
void checkCUDAError(const char *msg)
{
cudaError_t err = cudaGetLastError();
if( cudaSuccess != err)
{
fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
exit(EXIT_FAILURE);
}
}
__global__ void matrixMul( int *A, int *B, int *C)
{
int bx = blockIdx.x; // Block index
int tx = threadIdx.x; // Thread index
int ts = blockDim.x; // number of threads
// Declaration of the shared memory C element
extern __shared__ int c_element_sum[];
c_element_sum[tx] = A[tx+((bx/ts)*ts)] * B[(bx%ts)+(tx*ts)];
//Block until all threads in the block have written their data to shared mem
__syncthreads();
int sum;
for(int i=0; i<ts; i++){
if(i==0){
sum=c_element_sum[i];
}
else{
sum+=c_element_sum[i];
}
}
C[bx] = sum;
}
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
int main(int argc, char** argv)
{
//create timer.
clock_t t1, t2;
//start timer
t1=clock();
//allocate host memory for matrices
unsigned int size_A = cols * rows;
unsigned int mem_size_A = sizeof(int) * size_A;
int* mA = (int*) malloc(mem_size_A);
unsigned int size_B = cols * rows;
unsigned int mem_size_B = sizeof(int) * size_B;
int* mB = (int*) malloc(mem_size_B);
unsigned int size_C = cols * rows;
unsigned int mem_size_C = sizeof(int) * size_C;
int* mC = (int*) malloc(mem_size_C);
//initialize host memory
for (int i = 0; i < size_A; ++i){
mA[i] = 1;
mB[i] = 1;
mC[i] = 0;
}
// allocate device memory
int* d_mA;
int* d_mB;
int* d_mC;
cudaMalloc((void**) &d_mA, mem_size_A);
cudaMalloc((void**) &d_mB, mem_size_B);
cudaMalloc((void**) &d_mC, mem_size_C);
//copy host memory to device (A and B)
cudaMemcpy(d_mA, mA, mem_size_A, cudaMemcpyHostToDevice);
cudaMemcpy(d_mB, mB, mem_size_B, cudaMemcpyHostToDevice);
cudaMemcpy(d_mC, mC, mem_size_C, cudaMemcpyHostToDevice);
// setup execution parameters
int numThreadsPerBlock = cols;
int numBlocks = (cols * rows);
int sharedMemSize = numThreadsPerBlock * sizeof(int);
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
// execute the kernel
matrixMul <<< dimGrid, dimBlock, sharedMemSize >>>(d_mA, d_mB, d_mC);
//Block until device has completed
cudaThreadSynchronize();
// check if kernel execution generated an error
// Check for any CUDA errors
checkCUDAError("kernel invocation");
//copy result from device to host
cudaMemcpy(mC, d_mC, mem_size_C, cudaMemcpyDeviceToHost);
// Check for any CUDA errors
checkCUDAError("memcpy");
//stop timer
t2 = clock();
//check results
for (int i = 0; i < size_C; ++i){
assert(mC[i] == cols);
}
//clean up memory
free(mA);
free(mB);
free(mC);
cudaFree(d_mA);
cudaFree(d_mB);
cudaFree(d_mC);
printf("WITH CUDA - clocks: %ld \n\n", (long)(t2 - t1));
//////////////////////////////
///////// CPU ONLY //////////
/////////////////////////////
//create timer.
clock_t cpu_t1, cpu_t2;
//start timer
cpu_t1=clock();
//allocate host memory for matrices
unsigned int cpu_size_A = cols * rows;
unsigned int cpu_mem_size_A = sizeof(int) * cpu_size_A;
int* cpu_mA = (int*) malloc(cpu_mem_size_A);
unsigned int cpu_size_B = cols * rows;
unsigned int cpu_mem_size_B = sizeof(int) * cpu_size_B;
int* cpu_mB = (int*) malloc(cpu_mem_size_B);
unsigned int cpu_size_C = cols * rows;
unsigned int cpu_mem_size_C = sizeof(int) * cpu_size_C;
int* cpu_mC = (int*) malloc(cpu_mem_size_C);
//initialize host memory
for (int i = 0; i < cpu_size_A; ++i){
cpu_mA[i] = 1;
cpu_mB[i] = 1;
cpu_mC[i] = 0;
}
int ts = cols;
for(int bx=0; bx<(cols*rows);bx++){
int sum = 0;
for(int tx=0; tx<cols; tx++){
sum += cpu_mA[tx+((bx/ts)*ts)] * cpu_mB[(bx%ts)+(tx*ts)];
}
cpu_mC[bx]=sum;
}
//stop timer
cpu_t2 = clock();
//check results
for (int i = 0; i < cpu_size_C; ++i){
assert(cpu_mC[i] == cols);
}
//clean up memory
free(cpu_mA);
free(cpu_mB);
free(cpu_mC);
printf("CPU ONLY - clocks: %ld \n\n", (long)(cpu_t2 - cpu_t1));
return 0;
}
Based on your program, this is expected. Your timer looks like it clocks the entire execution of the program, which includes copying to the device, the computation itself, and copying the results back. Given the rather small workload you've provided (100x100 matrices), the overhead of the memory copies far outweighs any computational benefit you get from doing the work in the kernel. Your kernel itself is also not the most efficient implementation. (A sketch of timing the kernel on its own is given at the end of this answer.)
I don't think you're doing anything wrong; it's just that you haven't provided a large enough chunk of work for the GPU, and you could potentially further optimize your kernel. Note that simply scaling up the problem size may not significantly improve performance relative to the CPU, since you would also be scaling up the memory-management time. While it is relatively simple to write a first implementation of a program in CUDA, it is significantly more difficult to get good performance out of it. The most effective way to use CUDA is to have a high ratio of compute to memory transactions, for example a pipeline of several compute-intensive kernels operating successively on a chunk of data, with host-device copies only at the beginning and end.
If this is just a program to help you learn to code for CUDA, this is a great step, and getting a deep understanding of how to optimize matrix multiplication kernels will serve you well in many other cases. If you are writing this kernel for use in production software, I would recommend you use the highly optimized linear algebra library CUBLAS: http://developer.nvidia.com/cublas (or some other library where the hard work has been done for you already).
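As a follow-up to the timing point, a hedged sketch of measuring the kernel separately from the transfers using CUDA events (cudaEventCreate / cudaEventRecord / cudaEventElapsedTime are standard runtime API calls; variable names match the question's code):
// Untested sketch: time only the kernel, excluding host<->device copies
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
matrixMul <<< dimGrid, dimBlock, sharedMemSize >>>(d_mA, d_mB, d_mC);
cudaEventRecord(stop);
cudaEventSynchronize(stop);                    // wait for the kernel to finish

float kernelMs = 0.0f;
cudaEventElapsedTime(&kernelMs, start, stop);  // elapsed time in milliseconds
printf("kernel only: %f ms\n", kernelMs);

cudaEventDestroy(start);
cudaEventDestroy(stop);
Comparing this number against a separately timed copy step usually makes it obvious where the time goes for a 100x100 problem.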