Copying array from RAM to GPU and from GPU to RAM


I'm trying to introduce some CUDA optimizations in one of my projects, but I think I'm doing something wrong. I want to implement a simple matrix-vector multiplication (result = matrix * vector). When I copy the result back to the host, I get an error (cudaErrorLaunchFailure). Is there an error in my kernel (matrixVectorMultiplicationKernel), or did I call cudaMemcpy incorrectly? I found no helpful documentation for this kind of error state. I think it completely destroys the state of the GPU, because after the first occurrence I cannot call any CUDA kernel without getting this error again.
edit#1: Updated code, following leftaroundabout's advice.
// code
...
Eigen::MatrixXf matrix(M, N); // matrix.data() usually should return a float array
Eigen::VectorXf vector(N); // same here for vector.data()
Eigen::VectorXf result(M);
... // fill matrix and vector
float* matrixOnDevice = copyMatrixToDevice(matrix.data(), matrix.rows(), matrix.cols());
matrixVectorMultiplication(matrixOnDevice, vector.data(), result.data(), matrix.rows(), matrix.cols());
... // clean up
// helper functions
float* copyMatrixToDevice(const float* matrix, int mRows, int mCols)
{
float* matrixOnDevice;
const int length = mRows*mCols;
const int size = length * sizeof(float);
handleCUDAError(cudaMalloc((void**)&matrixOnDevice, size));
handleCUDAError(cudaMemcpy(matrixOnDevice, matrix, size, cudaMemcpyHostToDevice));
return matrixOnDevice;
}
void matrixVectorMultiplication(const float* matrixOnDevice, const float* vector, float* result, int mRows, int mCols)
{
const int vectorSize = mCols*sizeof(float);
const int resultSize = mRows*sizeof(float);
const int matrixLength = mRows*mCols;
float* deviceVector;
float* deviceResult;
handleCUDAError(cudaMalloc((void**)&deviceVector, vectorSize));
handleCUDAError(cudaMalloc((void**)&deviceResult, resultSize));
handleCUDAError(cudaMemset(deviceResult, 0, resultSize));
handleCUDAError(cudaMemcpy(deviceVector, vector, vectorSize, cudaMemcpyHostToDevice));
int threadsPerBlock = 256;
int blocksPerGrid = (mRows + threadsPerBlock - 1) / threadsPerBlock;
matrixVectorMultiplicationKernel<<<blocksPerGrid, threadsPerBlock>>>(matrixOnDevice, vector, result, mRows, mCols, matrixLength);
// --- no errors yet ---
handleCUDAError(cudaMemcpy(result, deviceResult, resultSize, cudaMemcpyDeviceToHost)); // cudaErrorLaunchFailure
handleCUDAError(cudaFree(deviceVector)); // cudaErrorLaunchFailure
handleCUDAError(cudaFree(deviceResult)); // cudaErrorLaunchFailure
}
__global__ void matrixVectorMultiplicationKernel(const float* matrix, const float* vector, float* result, int mRows, int mCols, int length)
{
int row = blockDim.x * blockIdx.x + threadIdx.x;
if(row < mRows)
{
for(int col = 0, mIdx = row*mCols; col < mCols; col++, mIdx++)
result[row] += matrix[mIdx] * vector[col];
}
}

Your problem is that void copyMatrixToDevice(..., float* matrixOnDevice, ...) takes this pointer by value, i.e. it can't "output" the device matrix. You can do it with void copyMatrixToDevice(..., float** matrixOnDevice, ...), called by
copyMatrixToDevice(matrix.data(), &matrixOnDevice, matrix.rows(), matrix.cols());
There is the same problem with result in matrixVectorMultiplication.
In the long term, in C++ you should put a proper class abstraction layer around all of this.
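The by-value problem described above can be reproduced on the host alone, without any CUDA calls. The following is a minimal sketch; allocateByValue and allocateByPointer are hypothetical names used only for illustration, and std::malloc stands in for the device allocation. cudaMalloc takes a void** for the same reason the second function takes a float**.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Takes the pointer by value: the assignment changes only the local copy,
// so the caller's pointer is left untouched.
void allocateByValue(float* buffer, std::size_t count)
{
    buffer = static_cast<float*>(std::malloc(count * sizeof(float)));
    std::free(buffer); // released here, because the caller can never see it
}

// Takes the address of the caller's pointer, so the allocation is visible
// to the caller after the function returns.
void allocateByPointer(float** buffer, std::size_t count)
{
    assert(buffer != nullptr);
    *buffer = static_cast<float*>(std::malloc(count * sizeof(float)));
}
```

Calling allocateByValue(p, 16) with p initialised to nullptr leaves p as nullptr, whereas allocateByPointer(&p, 16) sets p to the new allocation, which the caller then has to free. Returning the pointer from the function, as the updated question code does, is another way to get the same effect.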

Related

cudaMemcpy2D error with large array

I tried to use cudaMallocPitch and cudaMemcpy2D, but when I tried to use cudaMemcpy2D with large array, I encountered a problem:
Segmentation fault
Here is the runnable source code, with no error.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
#include <random>
#define ROW_SIZE 32
#define COL_SIZE 1024
int main()
{
float ** pfTest;
pfTest = (float**)malloc(ROW_SIZE * sizeof(float*));
for (int i = 0; i < ROW_SIZE; i++) {
pfTest[i] = (float*)malloc(COL_SIZE * sizeof(float));
}
std::default_random_engine generator;
std::uniform_real_distribution<float> distribution;
for (int y = 0; y < ROW_SIZE; y++) {
for (int x = 0; x < COL_SIZE; x++) {
pfTest[y][x] = distribution(generator);
}
}
float *dev_Test;
size_t pitch;
cudaMallocPitch(&dev_Test, &pitch, COL_SIZE * sizeof(float), ROW_SIZE);
cudaMemcpy2D(dev_Test, pitch, pfTest, COL_SIZE * sizeof(float), COL_SIZE * sizeof(float), ROW_SIZE, cudaMemcpyHostToDevice);
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
return 0;
}
As you can see, there's no problem at all.
But when I extend COL_SIZE to around 500,000 (exactly 524288), it crashes with a segmentation fault.
Any help as to the source of the problem?

cudaMemcpy2D can only be used for copying pitched linear memory. Your source array is not pitched linear memory, it is an array of pointers. This is not supported and is the source of the segfault.
Try something like this:
float* buffer;
float** pfTest;
const size_t buffer_pitch = size_t(COL_SIZE) * sizeof(float);
buffer = (float*)malloc(size_t(ROW_SIZE) * buffer_pitch);
pfTest = (float**)malloc(ROW_SIZE * sizeof(float*));
for (size_t i = 0; i < ROW_SIZE; i++) {
pfTest[i] = buffer + i * size_t(COL_SIZE);
}
// ...
cudaMallocPitch(&dev_Test, &pitch, buffer_pitch, ROW_SIZE);
cudaMemcpy2D(dev_Test, pitch, buffer, buffer_pitch,
buffer_pitch, ROW_SIZE, cudaMemcpyHostToDevice);
[Note: written in browser, never tested or compiled, use at own risk]
i.e. store the data to be copied in a single contiguous memory allocation which can act as a pitched linear source for cudaMemcpy2D. If you insist on using [][] style indexing on the host, then you have to pay the penalty of having an additional array of pointers to store alongside the data. Note that this isn't actually necessary: you could index directly into buffer and achieve the same result, while saving memory at the same time.
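To make the "pitched linear" requirement concrete, here is a host-only sketch of what a pitched 2D copy assumes about its source. copyPitched and makeRowPointers are hypothetical helpers written for this illustration, not CUDA functions; the first mimics the addressing rule (row r starts r * pitch bytes after the base pointer), and the second builds the optional row-pointer array over one contiguous buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-only stand-in for a pitched 2D copy: row r of the source is assumed
// to start exactly r * srcPitch bytes after src. An array of separately
// allocated row pointers does not satisfy that assumption.
void copyPitched(void* dst, std::size_t dstPitch,
                 const void* src, std::size_t srcPitch,
                 std::size_t widthBytes, std::size_t height)
{
    assert(widthBytes <= dstPitch && widthBytes <= srcPitch);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t r = 0; r < height; ++r)
        std::memcpy(d + r * dstPitch, s + r * srcPitch, widthBytes);
}

// Builds row pointers into one contiguous buffer, so that rowPtrs[y][x] and
// buffer[y * cols + x] name the same element.
std::vector<float*> makeRowPointers(std::vector<float>& buffer,
                                    std::size_t rows, std::size_t cols)
{
    assert(buffer.size() == rows * cols);
    std::vector<float*> rowPtrs(rows);
    for (std::size_t y = 0; y < rows; ++y)
        rowPtrs[y] = buffer.data() + y * cols;
    return rowPtrs;
}
```

With this layout, the copy is given buffer.data() as its source and cols * sizeof(float) as the source pitch, while the row pointers remain usable for [][] indexing on the host. Passing the row-pointer array itself as the source would make the copy read pointer values, and whatever follows them in memory, as if they were float data.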

CUDA Pinned memory implementation error cannot set when device is active in this process

I want to use the pinned memory feature of the GPU in my code. To do that, I wrote my code like this:
bool addVectorGPU(float* M, float* N, float* P, int size)
{
// Error return value
cudaError_t status;
cudaSetDeviceFlags(cudaDeviceMapHost);
// Number of bytes in the matrix.
int bytes = size * sizeof(float);
// Pointers to the device arrays
float *Md, *Nd, *Pd;
// Allocate memory on the device to store each matrix
cudaHostAlloc((void**)&M, bytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&N, bytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&P, bytes, cudaHostAllocMapped);
// Copy the host input data to the device
cudaHostGetDevicePointer((void**)&Md, M, 0);
cudaHostGetDevicePointer((void**)&Nd, N, 0);
cudaHostGetDevicePointer((void**)&Pd, P, 0);
// Specify the size of the grid and the size of the block
dim3 dimBlock(TILE_SIZE); // Matrix is contained in a block
dim3 dimGrid((int)ceil((float)size / (float)TILE_SIZE));
// Launch the kernel on a size-by-size block of threads
addVectorKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, size);
// Wait for completion
cudaThreadSynchronize();
cudaDeviceSynchronize();
// Check for errors
status = cudaGetLastError();
if (status != cudaSuccess) {
std::cout << "Kernel failed: " << cudaGetErrorString(status) <<
std::endl;
cudaFreeHost(M);
cudaFreeHost(N);
cudaFreeHost(P);
return false;
}
// Retrieve the result matrix
//cudaHostGetDevicePointer((void**)&Pd, P, 0);
// Free device memory
cudaFreeHost(M);
cudaFreeHost(N);
cudaFreeHost(P);
cudaFree(Md);
cudaFree(Nd);
cudaFree(Pd);
// Success
return true;
}
To evaluate performance on my device, I call this function 1000 times and then compute the average time it takes to run:
int main(){
// Timing data
float tcpuadd, tcpusub, tcpuscale, tgpuadd, tgpusub, tgpuscale, sum, delta, L2norm;
clock_t start, end;
bool success;
//Allocate the four vectors of SIZE floats
float* M = new float[SIZE];
float* N = new float[SIZE];
float* Pcpu = new float[SIZE];
float* Pgpu = new float[SIZE];
//Initialize M and N to random integers
for (int i = 0; i < SIZE; i ++){
M[i] = (float) rand()/(RAND_MAX);
N[i] = (float) rand()/(RAND_MAX);
}
printf("Operating on a vector of length %d\n", SIZE);
//Add two vectors and compute timing in CPU
start = clock();
for (int i = 0; i < ITERS; i++) {
addVectorCPU(M, N, Pcpu, SIZE);
}
end = clock();
tcpuadd = (float)(end - start) * 1000 / (float)CLOCKS_PER_SEC / ITERS;
printf( "CPU Addition took %f ms\n", tcpuadd);
//Add two vectors and compute timing in GPU
success = addVectorGPU(M, N ,Pgpu , SIZE);
if(!success)
{
printf("Device Error!\n");
return 1;
}
//compute GPU timing
start = clock();
for (int i = 0; i < ITERS; i++) {
addVectorGPU(M, N, Pgpu, SIZE);
}
end = clock();
tgpuadd = (float)(end - start) * 1000 / (float)CLOCKS_PER_SEC / ITERS;
printf("GPU Addition took %f ms\n", tgpuadd);
The problem is that the first call to this function works without any errors, but the second call fails with this error:
cannot set when device is active in this process
So does anyone know what it is all about?

If you do a better job of CUDA error checking by checking the return value of each runtime API call, you'll discover that this error is returned the second time you call this:
cudaSetDeviceFlags(cudaDeviceMapHost);
Note the description of this runtime API call:
If the current device has been set and that device has already been initialized then this call will fail with the error cudaErrorSetOnActiveProcess.
The solution is to call the function only once, at the beginning of your application, not every time you call the addVectorGPU function. Take that call out of the addVectorGPU function, and put it in your main routine, prior to the first call of addVectorGPU.
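The value of checking every call can be shown with a small host-only sketch. Everything here is a stand-in invented for illustration: Status is not a CUDA type, and setFlagsStub only mimics the behaviour quoted above (success on the first call, failure on every later one). In real code the same pattern would wrap the cudaError_t returned by each runtime call.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a runtime status code; not a CUDA type.
enum class Status { Success, SetOnActiveProcess };

// Hypothetical stand-in for a "configure once" call: succeeds the first
// time and fails on every later call, mimicking the quoted behaviour.
Status setFlagsStub()
{
    static bool alreadySet = false;
    if (alreadySet) return Status::SetOnActiveProcess;
    alreadySet = true;
    return Status::Success;
}

// Checks one status immediately and names the call that failed.
void check(Status status, const std::string& what)
{
    if (status != Status::Success)
        throw std::runtime_error(what + " failed");
}
```

Wrapping each call as check(setFlagsStub(), "setFlagsStub") reports the failure at the second call itself, instead of letting it surface later as a misleading "Kernel failed" message from a single check at the end of the routine.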
Based on the question below, there are various other issues with the code:
1. I would suggest implementing proper CUDA error checking on all kernel calls and all CUDA API calls, rather than once at the end of the routine.
2. The usage of cudaHostAlloc is incorrect. The intent of the program appears to be to pass host pointers to host-resident data to the GPU routine, and then add that data using a zero-copy technique. This is technically feasible (although it will be very slow), but the correct approach would involve the use of cudaHostRegister, not cudaHostAlloc. cudaHostAlloc creates a new allocation, so the existing data passed to the function would not be used or referenced that way.
Here's a worked example, based on what you have shown. Note that I personally would not benchmark things this way, but I am providing this to show that the process can work in an error-free way:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <iostream>
#define TILE_SIZE 512
#define SIZE 1048576
#define ITERS 10
bool addVectorCPU(float *M, float *N, float *P, int size){
for (int i=0; i< size; i++) P[i] = M[i]+N[i];
return true;
}
__global__ void addVectorKernel(float *M, float *N, float *P,int size){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < size)
P[idx] = M[idx]+N[idx];
}
bool addVectorGPU(float* M, float* N, float* P, int size)
{
// Error return value
cudaError_t status;
// Number of bytes in the matrix.
int bytes = size * sizeof(float);
// Pointers to the device arrays
float *Md, *Nd, *Pd;
// Pin the existing host buffers and map them into the device address space
cudaHostRegister(M, bytes, cudaHostRegisterMapped);
cudaHostRegister(N, bytes, cudaHostRegisterMapped);
cudaHostRegister(P, bytes, cudaHostRegisterMapped);
// Get device pointers that alias the pinned host buffers (zero-copy access)
cudaHostGetDevicePointer((void**)&Md, M, 0);
cudaHostGetDevicePointer((void**)&Nd, N, 0);
cudaHostGetDevicePointer((void**)&Pd, P, 0);
// Specify the size of the grid and the size of the block
dim3 dimBlock(TILE_SIZE); // TILE_SIZE threads per block
dim3 dimGrid((int)ceil((float)size / (float)TILE_SIZE)); // enough blocks to cover size elements
// Launch one thread per vector element
addVectorKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, size);
// Wait for completion
cudaDeviceSynchronize();
bool res = true;
// Check for errors
status = cudaGetLastError();
if (status != cudaSuccess) {
std::cout << "Kernel failed: " << cudaGetErrorString(status) << std::endl;
res = false;
}
// No copy back is needed: the kernel wrote the result into P through Pd
// Unpin the host buffers
cudaHostUnregister(M);
cudaHostUnregister(N);
cudaHostUnregister(P);
// Report success or failure
return res;
}
int main(){
// Timing data
float tcpuadd, tgpuadd;
clock_t start, end;
bool success;
//Allocate the four vectors of SIZE floats
float* M = new float[SIZE];
float* N = new float[SIZE];
float* Pcpu = new float[SIZE];
float* Pgpu = new float[SIZE];
//Initialize M and N to random floats in [0, 1]
for (int i = 0; i < SIZE; i ++){
M[i] = rand()/(float)(RAND_MAX);
N[i] = rand()/(float)(RAND_MAX);
}
printf("Operating on a vector of length %d\n", SIZE);
//Add two vectors and compute timing in CPU
start = clock();
for (int i = 0; i < ITERS; i++) {
addVectorCPU(M, N, Pcpu, SIZE);
}
end = clock();
tcpuadd = (float)(end - start) * 1000 / (float)CLOCKS_PER_SEC / ITERS;
printf( "CPU Addition took %f ms\n", tcpuadd);
//Add two vectors and compute timing in GPU
cudaSetDeviceFlags(cudaDeviceMapHost);
success = addVectorGPU(M, N ,Pgpu , SIZE);
if(!success)
{
printf("Device Error!\n");
return 1;
}
//compute GPU timing
start = clock();
for (int i = 0; i < ITERS; i++) {
addVectorGPU(M, N, Pgpu, SIZE);
}
end = clock();
tgpuadd = (float)(end - start) * 1000 / (float)CLOCKS_PER_SEC / ITERS;
printf("GPU Addition took %f ms\n", tgpuadd);
}
Note I've made a few other changes, as well. For example cudaThreadSynchronize() is deprecated, and it's not necessary to use both cudaThreadSynchronize() and cudaDeviceSynchronize(); they are redundant.

CUDA: Group every n-th point of array passed to GPU

I am trying to implement the k-means algorithm in CUDA using a Tesla card on an external Unix machine. I read the input file and store the coordinates of all data points in the dataX and dataY arrays. The next step is to select every centreInterval-th point and store it in another array allocated in GPU memory. However, I have no idea how to even find out what the problem is when all I get is 'Segmentation error' and, for obvious reasons, I can't print any kind of output from the kernel.
EDIT 2: I simplified this example to the shortest possible version. I found my solution in the process, but decided to keep the unsolved version in this question to make it clearer what caused the problem.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <math.h>
#include <time.h>
#include <unistd.h>
#define BLOCK_SIZE 16
// My kernel - Selects some centres at the beginning of algorithm and stores it at appropriate place
__global__ void kMeansSelectInitialCentres(float* d_dataX, float* d_dataY, float* d_centresX, float* d_centresY, int centreInterval) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int idx = i * centreInterval;
d_centresX[i] = d_dataX[idx];
d_centresY[i] = d_dataY[idx];
}
// Simplified example
int main(int argn, char ** argc) {
// My data - let's say it is 32 floats in each
int dataSize = 32;
float* dataX = new float[dataSize];
float* dataY = new float[dataSize];
// Fill arrays with numbers
for (int i = 0; i < dataSize; i++) {
dataX[i] = i;
dataY[i] = i;
}
// Interval - we select first number, then 1 + N * centreInterval
int centreInterval = 2;
// There I will store my results in program
int centreSize = dataSize / centreInterval;
float* centresX = new float[centreSize];
float* centresY = new float[centreSize];
// Pointers to the arrays stored in GPU memory
float* d_dataX;
float* d_dataY;
float* d_centresX;
float* d_centresY;
// Allocate memory for those arrays
// Calculate how much space in memory do we need for this
size_t d_centreSize = sizeof(float) * centreSize;
size_t d_dataSize = sizeof(float) * dataSize;
// Memory for raw data
cudaMalloc((void**)&d_dataX, d_dataSize);
cudaMalloc((void**)&d_dataY, d_dataSize);
// Copy raw data to the device memory so we can operate on it freely
cudaMemcpy(d_dataY, dataY, d_dataSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_dataX, dataX, d_dataSize, cudaMemcpyHostToDevice);
// Memory for centre results
cudaMalloc((void**)&d_centresX, d_dataSize);
cudaMalloc((void**)&d_centresY, d_dataSize);
// Call kernel
dim3 dimBlock(BLOCK_SIZE);
dim3 dimGridK((centreSize + dimBlock.x) / dimBlock.x);
kMeansSelectInitialCentres <<<dimGridK, dimBlock>>> (d_dataX, d_dataY, d_centresX, d_centresY, centreInterval);
// Check results - we get every n-th point
float* check_x = new float[centreSize];
float* check_y = new float[centreSize];
cudaMemcpy(check_x, d_centresX, d_dataSize, cudaMemcpyDeviceToHost);
cudaMemcpy(check_y, d_centresY, d_dataSize, cudaMemcpyDeviceToHost);
printf("X: ");
for (int i = 0; i < centreSize; i++)
printf("%.2f ", check_x[i]);
printf("\nY: ");
for (int i = 0; i < centreSize; i++)
printf("%.2f ", check_y[i]);
printf("\n");
}
Main question: What is wrong with this kernel / check-out of data?
Side question: Is there any fair way to debug program kernels in such situations?

So, here's the solution I came up with after simplifying my case. There was a problem with memory usage: I tried to store and read a different amount of data than I had allocated. I hope it will be helpful for someone in the future:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <math.h>
#include <time.h>
#include <unistd.h>
#define BLOCK_SIZE 16
// My kernel - Selects some centres at the beginning of algorithm and stores it at appropriate place
__global__ void kMeansSelectInitialCentres(float* d_dataX, float* d_dataY, float* d_centresX, float* d_centresY, int centreInterval) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int idx = i * centreInterval;
d_centresX[i] = d_dataX[idx];
d_centresY[i] = d_dataY[idx];
}
// Simplified example
int main(int argn, char ** argc) {
// My data - let's say it is 32 floats in each
int dataSize = 32;
float* dataX = new float[dataSize];
float* dataY = new float[dataSize];
// Fill arrays with numbers
for (int i = 0; i < dataSize; i++) {
dataX[i] = i;
dataY[i] = i;
}
// Interval - we select first number, then 1 + N * centreInterval
int centreInterval = 2;
// There I will store my results in program
int centreSize = dataSize / centreInterval;
float* centresX = new float[centreSize];
float* centresY = new float[centreSize];
// Pointers to the arrays stored in GPU memory
float* d_dataX;
float* d_dataY;
float* d_centresX;
float* d_centresY;
// Allocate memory for those arrays
// Calculate how much space in memory do we need for this
size_t d_centreSize = sizeof(float) * centreSize;
size_t d_dataSize = sizeof(float) * dataSize;
// Memory for raw data
cudaMalloc((void**)&d_dataX, d_dataSize);
cudaMalloc((void**)&d_dataY, d_dataSize);
// Copy raw data to the device memory so we can operate on it freely
cudaMemcpy(d_dataY, dataY, d_dataSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_dataX, dataX, d_dataSize, cudaMemcpyHostToDevice);
// Memory for centre results
cudaMalloc((void**)&d_centresX, d_centreSize);
cudaMalloc((void**)&d_centresY, d_centreSize);
// Call kernel
dim3 dimBlock(BLOCK_SIZE);
dim3 dimGridK((centreSize + dimBlock.x - 1) / dimBlock.x); // ceiling division: no surplus block when centreSize is a multiple of BLOCK_SIZE
kMeansSelectInitialCentres <<<dimGridK, dimBlock>>> (d_dataX, d_dataY, d_centresX, d_centresY, centreInterval);
// Check results - we get every n-th point
float* check_x = new float[centreSize];
float* check_y = new float[centreSize];
cudaMemcpy(check_x, d_centresX, d_centreSize, cudaMemcpyDeviceToHost);
cudaMemcpy(check_y, d_centresY, d_centreSize, cudaMemcpyDeviceToHost);
printf("X: ");
for (int i = 0; i < centreSize; i++)
printf("%.2f ", check_x[i]);
printf("\nY: ");
for (int i = 0; i < centreSize; i++)
printf("%.2f ", check_y[i]);
printf("\n");
}
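The kernel above writes d_centresX[i] for every launched thread, so it relies on the launch covering exactly centreSize threads. Whenever the grid is rounded up to whole blocks, the surplus threads index past the end of the arrays unless the kernel checks i first. The following host-only sketch emulates the launch with ordinary loops to show the usual remedy: a ceiling-division grid size plus a bounds guard. blocksFor and selectCentresEmulated are hypothetical names, and passing the centre count into the kernel is an assumption that the original signature does not make.

```cpp
#include <cassert>
#include <vector>

// Ceiling division: the smallest number of blocks that covers `count` items.
int blocksFor(int count, int blockSize)
{
    assert(count >= 0 && blockSize > 0);
    return (count + blockSize - 1) / blockSize;
}

// Host-side emulation of the launch: every (block, thread) pair computes the
// same global index the kernel would, and the guard skips surplus threads.
// Returns the number of emulated threads that did real work.
int selectCentresEmulated(const std::vector<float>& data,
                          std::vector<float>& centres,
                          int centreInterval, int gridSize, int blockSize)
{
    const int centreSize = static_cast<int>(centres.size());
    int active = 0;
    for (int block = 0; block < gridSize; ++block) {
        for (int thread = 0; thread < blockSize; ++thread) {
            const int i = block * blockSize + thread;
            if (i >= centreSize) continue; // guard: surplus thread, do nothing
            const int idx = i * centreInterval;
            assert(idx < static_cast<int>(data.size()));
            centres[i] = data[idx];
            ++active;
        }
    }
    return active;
}
```

With 32 data points, an interval of 2 and a block size of 16, even a launch of two blocks (32 emulated threads) touches only the 16 valid centres. In the real kernel the equivalent guard would be an if (i < centreSize) around the two assignments, with centreSize passed as an extra argument.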

Implement Multiply and adding 2 matrix by avx programming

I want to implement multiplication and addition of two matrices in Visual C++ 2012 using AVX. I enabled AVX (Advanced Vector Extensions, /arch:AVX) in Visual Studio. But for matrix addition, the running time is the same whether this option is enabled or disabled.
For 100000 iterations on 4*4 matrices, the clock time is 9. Enabling or disabling Advanced Vector Extensions (/arch:AVX) did not change this time.
My other question is: how can I implement the multiplication of two matrices with AVX?
void AVXadd(
double* pArray1, // [in] first source array
double* pArray2, // [in] second source array
double* pResult, // [out] result array
int nSize) // [in] size of all arrays
{
int nLoop = (nSize*nSize)/ 4;
//
__m256d* pSrc1 = (__m256d*) pArray1;
__m256d* pSrc2 = (__m256d*) pArray2;
__m256d* pDest = (__m256d*) pResult;
for ( int i = 0; i < nLoop; i++ )
{
*pDest = _mm256_add_pd(*pSrc1,*pSrc2); //add input arrays
pSrc1++;
pSrc2++;
pDest++;
}
}
//get data from a random matrix and test add
void AVX(vector<vector<double>> randn,int size,int itt){
double *A,*B,*C;
int ARRAY_SIZE=size*size;
//32-byte aligned arrays:
A = (double*) _aligned_malloc(ARRAY_SIZE * sizeof(double), 32);
B = (double*) _aligned_malloc(ARRAY_SIZE * sizeof(double), 32);
C = (double*) _aligned_malloc(ARRAY_SIZE * sizeof(double), 32);
//fill by random vector numbers
for(int i=0;i<size;i++)
for(int j=0;j<size;j++)
A[i*size+j]=B[i*size+j]=randn[i][j];//matrix to array
clock_t t1, t2;
t1=clock();
//add
for(int i=0;i<itt;i++)
AVXadd(A,B,C,size);
t2=clock();
vector<vector<double>> c(size,vector<double>(size));
for(int i=0;i<size;i++)
for(int j=0;j<size;j++)
c[i][j]=C[i*size+j];//C is result
cout<<"\t\t"<<t2-t1;
}

Matrix Multiplication with CUDA, long execution time

I'm new to CUDA, and I've been trying to figure out what I'm doing wrong here. CUDA is taking longer than just using the CPU to multiply a matrix. If I'm doing something wrong, please let me know.
Here is my code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <cstdlib>
#include <assert.h>
#include <time.h>
#define size 100 // Matrix size
#define cols size // Matrix width
#define rows size // Matrix height
void checkCUDAError(const char *msg)
{
cudaError_t err = cudaGetLastError();
if( cudaSuccess != err)
{
fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
exit(EXIT_FAILURE);
}
}
__global__ void matrixMul( int *A, int *B, int *C)
{
int bx = blockIdx.x; // Block index
int tx = threadIdx.x; // Thread index
int ts = blockDim.x; // number of threads
// Declaration of the shared memory C element
extern __shared__ int c_element_sum[];
c_element_sum[tx] = A[tx+((bx/ts)*ts)] * B[(bx%ts)+(tx*ts)];
//Block until all threads in the block have written their data to shared mem
__syncthreads();
int sum;
for(int i=0; i<ts; i++){
if(i==0){
sum=c_element_sum[i];
}
else{
sum+=c_element_sum[i];
}
}
C[bx] = sum;
}
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
int main(int argc, char** argv)
{
//create timer.
clock_t t1, t2;
//start timer
t1=clock();
//allocate host memory for matrices
unsigned int size_A = cols * rows;
unsigned int mem_size_A = sizeof(int) * size_A;
int* mA = (int*) malloc(mem_size_A);
unsigned int size_B = cols * rows;
unsigned int mem_size_B = sizeof(int) * size_B;
int* mB = (int*) malloc(mem_size_B);
unsigned int size_C = cols * rows;
unsigned int mem_size_C = sizeof(int) * size_C;
int* mC = (int*) malloc(mem_size_C);
//initialize host memory
for (int i = 0; i < size_A; ++i){
mA[i] = 1;
mB[i] = 1;
mC[i] = 0;
}
// allocate device memory
int* d_mA;
int* d_mB;
int* d_mC;
cudaMalloc((void**) &d_mA, mem_size_A);
cudaMalloc((void**) &d_mB, mem_size_B);
cudaMalloc((void**) &d_mC, mem_size_C);
//copy host memory to device (A and B)
cudaMemcpy(d_mA, mA, mem_size_A, cudaMemcpyHostToDevice);
cudaMemcpy(d_mB, mB, mem_size_B, cudaMemcpyHostToDevice);
cudaMemcpy(d_mC, mC, mem_size_C, cudaMemcpyHostToDevice);
// setup execution parameters
int numThreadsPerBlock = cols;
int numBlocks = (cols * rows);
int sharedMemSize = numThreadsPerBlock * sizeof(int);
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
// execute the kernel
matrixMul <<< dimGrid, dimBlock, sharedMemSize >>>(d_mA, d_mB, d_mC);
//Block until device has completed
cudaThreadSynchronize();
// check if kernel execution generated an error
// Check for any CUDA errors
checkCUDAError("kernel invocation");
//copy result from device to host
cudaMemcpy(mC, d_mC, mem_size_C, cudaMemcpyDeviceToHost);
// Check for any CUDA errors
checkCUDAError("memcpy");
//stop timer
t2 = clock();
//check results
for (int i = 0; i < size_C; ++i){
assert(mC[i] == cols);
}
//clean up memory
free(mA);
free(mB);
free(mC);
cudaFree(d_mA);
cudaFree(d_mB);
cudaFree(d_mC);
printf("WITH CUDA - clocks: %ld \n\n", (long)(t2-t1));
//////////////////////////////
///////// CPU ONLY //////////
/////////////////////////////
//create timer.
clock_t cpu_t1, cpu_t2;
//start timer
cpu_t1=clock();
//allocate host memory for matrices
unsigned int cpu_size_A = cols * rows;
unsigned int cpu_mem_size_A = sizeof(int) * cpu_size_A;
int* cpu_mA = (int*) malloc(cpu_mem_size_A);
unsigned int cpu_size_B = cols * rows;
unsigned int cpu_mem_size_B = sizeof(int) * cpu_size_B;
int* cpu_mB = (int*) malloc(cpu_mem_size_B);
unsigned int cpu_size_C = cols * rows;
unsigned int cpu_mem_size_C = sizeof(int) * cpu_size_C;
int* cpu_mC = (int*) malloc(cpu_mem_size_C);
//initialize host memory
for (int i = 0; i < cpu_size_A; ++i){
cpu_mA[i] = 1;
cpu_mB[i] = 1;
cpu_mC[i] = 0;
}
int ts = cols;
for(int bx=0; bx<(cols*rows);bx++){
int sum = 0;
for(int tx=0; tx<cols; tx++){
sum += cpu_mA[tx+((bx/ts)*ts)] * cpu_mB[(bx%ts)+(tx*ts)];
}
cpu_mC[bx]=sum;
}
//stop timer
cpu_t2 = clock();
//check results
for (int i = 0; i < cpu_size_C; ++i){
assert(cpu_mC[i] == cols);
}
//clean up memory
free(cpu_mA);
free(cpu_mB);
free(cpu_mC);
printf("CPU ONLY - clocks: %ld \n\n", (long)(cpu_t2-cpu_t1));
return 0;
}

Based on your program, this is expected. Your timer looks like it clocks the entire execution of the program, which would include copying to the device, computation time, and copying the results back. Given the rather small workload you've provided for the program (100x100 matrices), the overhead of the memory copies far outweighs any computational benefit you get when doing the computation with the kernel. Your kernel itself is also not the most efficient implementation.
I don't think you're doing anything wrong, it's just that you haven't provided a large enough chunk of work for the GPU and you could potentially further optimize your kernel. Note that simply scaling up the size of the chunk may not significantly improve the performance with respect to the CPU, since you would also be scaling up the memory management time. While it is relatively simple to write a first implementation of a program on CUDA, it is significantly more difficult to get good performance out of it. The most effective way to use CUDA is to have a high ratio of compute to memory transactions. For example, having a pipeline of several compute-intensive kernels to operate successively on a chunk of data, only needing host-device copying at the beginning and end.
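As a rough illustration of the compute-to-memory ratio mentioned above, the sketch below counts arithmetic operations against bytes moved for a dense n x n product. The function names are hypothetical, and the model is deliberately simple: it assumes the minimum traffic (A and B copied in, C copied out) and ignores fixed costs such as allocation, context creation, and kernel launch, all of which the timer in the question also measures.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Arithmetic operations in a dense n x n by n x n product:
// n^3 multiplications plus n^3 additions.
double operationCount(std::size_t n)
{
    const double d = static_cast<double>(n);
    return 2.0 * d * d * d;
}

// Bytes that cross the host-device link when A and B are copied in and C is
// copied back, for elements of elementBytes each.
double bytesTransferred(std::size_t n, std::size_t elementBytes)
{
    const double d = static_cast<double>(n);
    return 3.0 * d * d * static_cast<double>(elementBytes);
}

// Operations per transferred byte; simplifies to 2n / (3 * elementBytes).
double operationsPerByte(std::size_t n, std::size_t elementBytes)
{
    assert(n > 0 && elementBytes > 0);
    return operationCount(n) / bytesTransferred(n, elementBytes);
}
```

For the 100x100 int matrices in the question this gives 2,000,000 operations against 120,000 bytes, about 17 operations per byte. That is little work to set against the fixed overheads, so treat the figure as a guide to the order of magnitude only, not as a performance prediction.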
If this is just a program to help you learn to code for CUDA, this is a great step and getting a deep understanding of how to optimize matrix multiplication kernels will serve you well in many other cases. If you are writing this kernel for use in a production piece of software, I would recommend you use the highly-optimized linear algebra library CUBLAS: http://developer.nvidia.com/cublas (or some other library where the hard work has been done for you already).
