CUDA: Group every n-th point of array passed to GPU - c++

I am trying to implement the k-means algorithm in CUDA, using a Tesla card on an external Unix machine. I read the input file and store the coordinates of all data points in the dataX and dataY arrays. The next step is to select every centreInterval-th point and store it in another array allocated in GPU memory. However, I have no idea how to even find out what the problem is when all I get is 'Segmentation error' and, for obvious reasons, I can't print any output from the kernel.
EDIT 2: I simplified this example to the shortest possible version. I found the solution along the way, but decided to keep the unsolved version in this question to make it clearer what caused the problem.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <math.h>
#include <time.h>
#include <unistd.h>
#define BLOCK_SIZE 16
// My kernel - Selects some centres at the beginning of algorithm and stores it at appropriate place
__global__ void kMeansSelectInitialCentres(float* d_dataX, float* d_dataY, float* d_centresX, float* d_centresY, int centreInterval) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int idx = i * centreInterval;
d_centresX[i] = d_dataX[idx];
d_centresY[i] = d_dataY[idx];
}
// Simplified example
int main(int argn, char ** argc) {
// My data - let's say it is 32 floats in each
int dataSize = 32;
float* dataX = new float[dataSize];
float* dataY = new float[dataSize];
// Fill arrays with numbers
for (int i = 0; i < dataSize; i++) {
dataX[i] = i;
dataY[i] = i;
}
// Interval - we select indices 0, centreInterval, 2 * centreInterval, ...
int centreInterval = 2;
// There I will store my results in program
int centreSize = dataSize / centreInterval;
float* centresX = new float[centreSize];
float* centresY = new float[centreSize];
// Pointers to the arrays stored in GPU memory
float* d_dataX;
float* d_dataY;
float* d_centresX;
float* d_centresY;
// Allocate memory for those arrays
// Calculate how much space in memory do we need for this
size_t d_centreSize = sizeof(float) * centreSize;
size_t d_dataSize = sizeof(float) * dataSize;
// Memory for raw data
cudaMalloc((void**)&d_dataX, d_dataSize);
cudaMalloc((void**)&d_dataY, d_dataSize);
// Copy raw data to the device memory so we can operate on it freely
cudaMemcpy(d_dataY, dataY, d_dataSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_dataX, dataX, d_dataSize, cudaMemcpyHostToDevice);
// Memory for centre results
cudaMalloc((void**)&d_centresX, d_dataSize);
cudaMalloc((void**)&d_centresY, d_dataSize);
// Call kernel
dim3 dimBlock(BLOCK_SIZE);
dim3 dimGridK((centreSize + dimBlock.x) / dimBlock.x);
kMeansSelectInitialCentres <<<dimGridK, dimBlock>>> (d_dataX, d_dataY, d_centresX, d_centresY, centreInterval);
// Check results - we get every n-th point
float* check_x = new float[centreSize];
float* check_y = new float[centreSize];
cudaMemcpy(check_x, d_centresX, d_dataSize, cudaMemcpyDeviceToHost);
cudaMemcpy(check_y, d_centresY, d_dataSize, cudaMemcpyDeviceToHost);
printf("X: ");
for (int i = 0; i < centreSize; i++)
printf("%.2f ", check_x[i]);
printf("\nY: ");
for (int i = 0; i < centreSize; i++)
printf("%.2f ", check_y[i]);
printf("\n");
}
Main question: What is wrong with this kernel / check-out of data?
Side question: Is there any fair way to debug program kernels in such situations?

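So, here's the solution I came up with after simplifying my case. The problem was with memory sizes: I allocated the centre arrays and copied them back using d_dataSize, although the host buffers check_x and check_y hold only centreSize floats, so the copy back to the host wrote past the end of those buffers. The fix is to use d_centreSize in those calls. I hope it will be helpful for anyone in the future: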
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <math.h>
#include <time.h>
#include <unistd.h>
#define BLOCK_SIZE 16
// My kernel - Selects some centres at the beginning of algorithm and stores it at appropriate place
__global__ void kMeansSelectInitialCentres(float* d_dataX, float* d_dataY, float* d_centresX, float* d_centresY, int centreInterval) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int idx = i * centreInterval;
d_centresX[i] = d_dataX[idx];
d_centresY[i] = d_dataY[idx];
}
// Simplified example
int main(int argn, char ** argc) {
// My data - let's say it is 32 floats in each
int dataSize = 32;
float* dataX = new float[dataSize];
float* dataY = new float[dataSize];
// Fill arrays with numbers
for (int i = 0; i < dataSize; i++) {
dataX[i] = i;
dataY[i] = i;
}
// Interval - we select indices 0, centreInterval, 2 * centreInterval, ...
int centreInterval = 2;
// There I will store my results in program
int centreSize = dataSize / centreInterval;
float* centresX = new float[centreSize];
float* centresY = new float[centreSize];
// Pointers to the arrays stored in GPU memory
float* d_dataX;
float* d_dataY;
float* d_centresX;
float* d_centresY;
// Allocate memory for those arrays
// Calculate how much space in memory do we need for this
size_t d_centreSize = sizeof(float) * centreSize;
size_t d_dataSize = sizeof(float) * dataSize;
// Memory for raw data
cudaMalloc((void**)&d_dataX, d_dataSize);
cudaMalloc((void**)&d_dataY, d_dataSize);
// Copy raw data to the device memory so we can operate on it freely
cudaMemcpy(d_dataY, dataY, d_dataSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_dataX, dataX, d_dataSize, cudaMemcpyHostToDevice);
// Memory for centre results
cudaMalloc((void**)&d_centresX, d_centreSize);
cudaMalloc((void**)&d_centresY, d_centreSize);
// Call kernel
dim3 dimBlock(BLOCK_SIZE);
// Round up so the grid covers centreSize threads without launching a spare block
dim3 dimGridK((centreSize + dimBlock.x - 1) / dimBlock.x);
kMeansSelectInitialCentres <<<dimGridK, dimBlock>>> (d_dataX, d_dataY, d_centresX, d_centresY, centreInterval);
// Check results - we get every n-th point
float* check_x = new float[centreSize];
float* check_y = new float[centreSize];
cudaMemcpy(check_x, d_centresX, d_centreSize, cudaMemcpyDeviceToHost);
cudaMemcpy(check_y, d_centresY, d_centreSize, cudaMemcpyDeviceToHost);
printf("X: ");
for (int i = 0; i < centreSize; i++)
printf("%.2f ", check_x[i]);
printf("\nY: ");
for (int i = 0; i < centreSize; i++)
printf("%.2f ", check_y[i]);
printf("\n");
}
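A side note on the launch configuration: the grid size used in the question, (centreSize + dimBlock.x) / dimBlock.x, gives two blocks when centreSize is 16 and the block size is 16, and the kernel has no bounds check, so any thread with i >= centreSize indexes past the end of the arrays. The sketch below is a host-only C++ simulation of the usual remedy (ceiling division plus a guard), not the code from the post; blocksFor and selectEveryNth are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed so that blocks * blockSize >= n (ceiling division).
int blocksFor(int n, int blockSize) {
    assert(blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// Host-side stand-in for the kernel (hypothetical helper): each loop
// iteration plays the role of one thread of the launch, and the guard
// skips "threads" that fall past the last centre.
std::vector<float> selectEveryNth(const std::vector<float>& data, int interval) {
    assert(interval > 0);
    const int centreSize = static_cast<int>(data.size()) / interval;
    const int blockSize = 16;  // same value as BLOCK_SIZE in the post
    const int totalThreads = blocksFor(centreSize, blockSize) * blockSize;
    std::vector<float> centres(centreSize);
    for (int i = 0; i < totalThreads; ++i) {
        if (i < centreSize) {  // the bounds check a real kernel would need
            centres[i] = data[i * interval];
        }
    }
    return centres;
}
```

In a real kernel the same guard would presumably take the form of an extra centreSize parameter and an if (i < centreSize) around the two assignments.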

Related

cudaMemcpy2D error with large array

I tried to use cudaMallocPitch and cudaMemcpy2D, but when I used cudaMemcpy2D with a large array, I ran into a problem:
Segmentation fault
Here is source code that runs without error.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
#include <random>
#define ROW_SIZE 32
#define COL_SIZE 1024
int main()
{
float ** pfTest;
pfTest = (float**)malloc(ROW_SIZE * sizeof(float*));
for (int i = 0; i < ROW_SIZE; i++) {
pfTest[i] = (float*)malloc(COL_SIZE * sizeof(float));
}
std::default_random_engine generator;
std::uniform_real_distribution<float> distribution;
for (int y = 0; y < ROW_SIZE; y++) {
for (int x = 0; x < COL_SIZE; x++) {
pfTest[y][x] = distribution(generator);
}
}
float *dev_Test;
size_t pitch;
cudaMallocPitch(&dev_Test, &pitch, COL_SIZE * sizeof(float), ROW_SIZE);
cudaMemcpy2D(dev_Test, pitch, pfTest, COL_SIZE * sizeof(float), COL_SIZE * sizeof(float), ROW_SIZE, cudaMemcpyHostToDevice);
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
return 0;
}
As you can see, there's no problem at all.
But when I increase COL_SIZE to around 500,000 (524288, to be exact), it crashes with a segmentation fault.
Any help as to the source of the problem?
cudaMemcpy2D can only be used for copying pitched linear memory. Your source array is not pitched linear memory, it is an array of pointers. This is not supported and is the source of the segfault.
Try something like this:
float* buffer;
float** pfTest;
const size_t buffer_pitch = size_t(COL_SIZE) * sizeof(float);
buffer = (float*)malloc(size_t(ROW_SIZE) * buffer_pitch);
pfTest = (float**)malloc(ROW_SIZE * sizeof(float*));
for (size_t i = 0; i < ROW_SIZE; i++) {
pfTest[i] = buffer + i * size_t(COL_SIZE);
}
// ...
cudaMallocPitch(&dev_Test, &pitch, buffer_pitch, ROW_SIZE);
cudaMemcpy2D(dev_Test, pitch, buffer, buffer_pitch,
buffer_pitch, ROW_SIZE, cudaMemcpyHostToDevice);
[Note: written in browser, never tested or compiled, use at own risk]
i.e. store the data to be copied in a single contiguous memory allocation which can act as a pitched linear source for cudaMemcpy2D. If you insist on using [][] style indexing on the host, then you have to pay the penalty of having an additional array of pointers to store alongside the data. Note that this isn't actually necessary: you could index directly into buffer and achieve the same result, while saving memory at the same time.
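The layout described above can be sketched in host-only C++ as follows. This is an illustration under my own assumptions rather than the answerer's code: HostMatrix, row and pitchBytes are hypothetical names, and std::vector replaces the malloc calls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A rows x cols matrix stored in one contiguous allocation, plus optional
// row pointers so that host code can keep using m.row[y][x] indexing.
struct HostMatrix {
    std::size_t rows, cols;
    std::vector<float> buffer;  // contiguous storage, row after row
    std::vector<float*> row;    // row[y] points at the start of row y

    HostMatrix(std::size_t r, std::size_t c)
        : rows(r), cols(c), buffer(r * c, 0.0f), row(r) {
        assert(r > 0 && c > 0);
        for (std::size_t y = 0; y < rows; ++y) {
            row[y] = buffer.data() + y * cols;
        }
    }

    // Copying would leave row[] pointing into the old buffer, so forbid it.
    HostMatrix(const HostMatrix&) = delete;
    HostMatrix& operator=(const HostMatrix&) = delete;

    // Distance in bytes between the starts of two consecutive rows.
    std::size_t pitchBytes() const { return cols * sizeof(float); }
};
```

With such a layout, buffer.data() and pitchBytes() would be the values to pass as the source pointer and source pitch of cudaMemcpy2D; the row pointers exist only for host-side convenience.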

Cuda passing char** to kernel

I am having a spot of bother with this basic CUDA code.
I have a char** which is a flat 2D array of passwords. My current implementation simply has CUDA iterate through this list and display the passwords. However, when I go to display them I just get "(NULL)". I'm not quite sure why this is. Can someone explain what is happening?
Main:
char ** pwdAry;
pwdAry = new char *[numberOfPwd];
//pwdAry given some values (flat 2d array layout)
const int pwdArySize = sizeof(pwdAry);
dim3 grid(gridSize,gridSize);
dim3 block(blockSize,blockSize);
searchKeywordKernel<<<grid, block>>>(pwdAry);
return EXIT_SUCCESS;
Cuda:
__global__ void searchKeywordKernel(char **passwordList)
{
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int pitch = blockDim.x * gridDim.x;
int idx = x + y * pitch;
int tidy = idx / pitch;
int tidx = idx - (pitch * tidy);
int bidx = tidx / blockDim.x;
int bidy = tidy / blockDim.y;
int currentThread = threadIdx.x + blockDim.x * threadIdx.y;
printf("hi, i am thread: %i, and my block x: %i, and y: %i\n", currentThread, bidx, bidy);
printf("My password is: %s\n", passwordList[currentThread]);
}
Based on discussion in the comments, here is an example code that roughly follows the code in the question, using 3 different methods:
1. Use a "flattened" array. This is the traditional advice for beginners who are asking about how to handle a double pointer array (char **, or any other type), or any data structure that contains embedded pointers. The basic idea is to create a single pointer array of the same type (e.g. char *), and copy all the data to that array, end-to-end. In this case, since the array elements are of variable length, we also need to pass an array containing the starting index of each string.
2. Use a direct double-pointer method. I consider this code difficult to write. It may also have performance implications. The canonical example is here, and a stepwise description of what is required algorithmically is here and/or here is a 3D (i.e. triple-pointer) worked example with method description (yuck!). This is fundamentally doing a deep-copy in CUDA, and I consider it somewhat more difficult than typical CUDA coding.
3. Use the managed memory subsystem, which is available on CUDA platforms that support it. Coding-wise, this is probably simpler than either of the above 2 approaches.
Here is a worked example of all 3 methods:
$ cat t1035.cu
#include <stdio.h>
#include <string.h>
#define nTPB 256
__global__ void kern_1D(char *data, unsigned *indices, unsigned num_strings){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < num_strings)
printf("Hello from thread %d, my string is %s\n", idx, data+indices[idx]);
}
__global__ void kern_2D(char **data, unsigned num_strings){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < num_strings)
printf("Hello from thread %d, my string is %s\n", idx, data[idx]);
}
int main(){
const int num_strings = 3;
const char s0[] = "s1\0";
const char s1[] = "s2\0";
const char s2[] = "s3\0";
int ds[num_strings];
ds[0] = sizeof(s0)/sizeof(char);
ds[1] = sizeof(s1)/sizeof(char);
ds[2] = sizeof(s2)/sizeof(char);
// pretend we have a dynamically allocated char** array
char **data;
data = (char **)malloc(num_strings*sizeof(char *));
data[0] = (char *)malloc(ds[0]*sizeof(char));
data[1] = (char *)malloc(ds[1]*sizeof(char));
data[2] = (char *)malloc(ds[2]*sizeof(char));
// initialize said array
strcpy(data[0], s0);
strcpy(data[1], s1);
strcpy(data[2], s2);
// method 1: "flattening"
char *fdata = (char *)malloc((ds[0]+ds[1]+ds[2])*sizeof(char));
unsigned *ind = (unsigned *)malloc(num_strings*sizeof(unsigned));
unsigned next = 0;
for (int i = 0; i < num_strings; i++){
strcpy(fdata+next, data[i]);
ind[i] = next;
next += ds[i];}
//copy to device
char *d_fdata;
unsigned *d_ind;
cudaMalloc(&d_fdata, next*sizeof(char));
cudaMalloc(&d_ind, num_strings*sizeof(unsigned));
cudaMemcpy(d_fdata, fdata, next*sizeof(char), cudaMemcpyHostToDevice);
cudaMemcpy(d_ind, ind, num_strings*sizeof(unsigned), cudaMemcpyHostToDevice);
printf("method 1:\n");
kern_1D<<<(num_strings+nTPB-1)/nTPB, nTPB>>>(d_fdata, d_ind, num_strings);
cudaDeviceSynchronize();
//method 2: "2D" (pointer-to-pointer) array
char **d_data;
cudaMalloc(&d_data, num_strings*sizeof(char *));
char **d_temp_data;
d_temp_data = (char **)malloc(num_strings*sizeof(char *));
for (int i = 0; i < num_strings; i++){
cudaMalloc(&(d_temp_data[i]), ds[i]*sizeof(char));
cudaMemcpy(d_temp_data[i], data[i], ds[i]*sizeof(char), cudaMemcpyHostToDevice);
cudaMemcpy(d_data+i, &(d_temp_data[i]), sizeof(char *), cudaMemcpyHostToDevice);}
printf("method 2:\n");
kern_2D<<<(num_strings+nTPB-1)/nTPB, nTPB>>>(d_data, num_strings);
cudaDeviceSynchronize();
// method 3: managed allocations
// start over with a managed char** array
char **m_data;
cudaMallocManaged(&m_data, num_strings*sizeof(char *));
cudaMallocManaged(&(m_data[0]), ds[0]*sizeof(char));
cudaMallocManaged(&(m_data[1]), ds[1]*sizeof(char));
cudaMallocManaged(&(m_data[2]), ds[2]*sizeof(char));
// initialize said array
strcpy(m_data[0], s0);
strcpy(m_data[1], s1);
strcpy(m_data[2], s2);
// call kernel directly on managed data
printf("method 3:\n");
kern_2D<<<(num_strings+nTPB-1)/nTPB, nTPB>>>(m_data, num_strings);
cudaDeviceSynchronize();
return 0;
}
$ nvcc -arch=sm_35 -o t1035 t1035.cu
$ cuda-memcheck ./t1035
========= CUDA-MEMCHECK
method 1:
Hello from thread 0, my string is s1
Hello from thread 1, my string is s2
Hello from thread 2, my string is s3
method 2:
Hello from thread 0, my string is s1
Hello from thread 1, my string is s2
Hello from thread 2, my string is s3
method 3:
Hello from thread 0, my string is s1
Hello from thread 1, my string is s2
Hello from thread 2, my string is s3
========= ERROR SUMMARY: 0 errors
$
Notes:
I suggest running this code with cuda-memcheck if you are just testing it out for the first time. I have omitted proper cuda error checking for brevity of presentation, but I recommend it any time you are having trouble with a CUDA code. Proper execution of this code depends on having a managed memory subsystem available (read the doc links I have provided). If your platform does not support it, running this code as-is will probably result in a seg fault, because I have not included proper error checking.
Copying a double-pointer array from device to host, although not explicitly covered in this example, is essentially the reverse of the steps for each of the 3 methods. For method 1, a single cudaMemcpy call can do it. For method 2, it requires a for-loop that reverses the steps to copy to the device (including the use of the temp pointers). For method 3, nothing at all is required, other than proper adherence to managed memory coding practices, such as use of cudaDeviceSynchronize() after a kernel call, before attempting to access the device from host code again.
I don't wish to argue about whether or not methods 1 and 3 explicitly adhere to the letter of the question in terms of providing a method to pass a char ** array to a CUDA kernel. If your focus is that narrow, then please use method 2, or else disregard this answer entirely.
EDIT: Based on a question in the comments below, here is the above code modified with a different initialization sequence for the host-side strings (at line 42). There are now compilation warnings, but those warnings arise from the code specifically requested to be used by OP:
$ cat t1036.cu
#include <stdio.h>
#include <string.h>
#define nTPB 256
__global__ void kern_1D(char *data, unsigned *indices, unsigned num_strings){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < num_strings)
printf("Hello from thread %d, my string is %s\n", idx, data+indices[idx]);
}
__global__ void kern_2D(char **data, unsigned num_strings){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < num_strings)
printf("Hello from thread %d, my string is %s\n", idx, data[idx]);
}
int main(){
const int num_strings = 3;
#if 0
const char s0[] = "s1\0";
const char s1[] = "s2\0";
const char s2[] = "s3\0";
int ds[num_strings];
ds[0] = sizeof(s0)/sizeof(char);
ds[1] = sizeof(s1)/sizeof(char);
ds[2] = sizeof(s2)/sizeof(char);
// pretend we have a dynamically allocated char** array
char **data;
data = (char **)malloc(num_strings*sizeof(char *));
data[0] = (char *)malloc(ds[0]*sizeof(char));
data[1] = (char *)malloc(ds[1]*sizeof(char));
data[2] = (char *)malloc(ds[2]*sizeof(char));
// initialize said array
strcpy(data[0], s0);
strcpy(data[1], s1);
strcpy(data[2], s2);
#endif
char ** pwdAry; pwdAry = new char *[num_strings]; for (int a = 0; a < num_strings; a++) { pwdAry[a] = new char[1024]; } for (int a = 0; a < 3; a++) { pwdAry[a] = "hello\0"; }
// method 1: "flattening"
char *fdata = (char *)malloc((1024*num_strings)*sizeof(char));
unsigned *ind = (unsigned *)malloc(num_strings*sizeof(unsigned));
unsigned next = 0;
for (int i = 0; i < num_strings; i++){
memcpy(fdata+next, pwdAry[i], 1024);
ind[i] = next;
next += 1024;}
//copy to device
char *d_fdata;
unsigned *d_ind;
cudaMalloc(&d_fdata, next*sizeof(char));
cudaMalloc(&d_ind, num_strings*sizeof(unsigned));
cudaMemcpy(d_fdata, fdata, next*sizeof(char), cudaMemcpyHostToDevice);
cudaMemcpy(d_ind, ind, num_strings*sizeof(unsigned), cudaMemcpyHostToDevice);
printf("method 1:\n");
kern_1D<<<(num_strings+nTPB-1)/nTPB, nTPB>>>(d_fdata, d_ind, num_strings);
cudaDeviceSynchronize();
//method 2: "2D" (pointer-to-pointer) array
char **d_data;
cudaMalloc(&d_data, num_strings*sizeof(char *));
char **d_temp_data;
d_temp_data = (char **)malloc(num_strings*sizeof(char *));
for (int i = 0; i < num_strings; i++){
cudaMalloc(&(d_temp_data[i]), 1024*sizeof(char));
cudaMemcpy(d_temp_data[i], pwdAry[i], 1024*sizeof(char), cudaMemcpyHostToDevice);
cudaMemcpy(d_data+i, &(d_temp_data[i]), sizeof(char *), cudaMemcpyHostToDevice);}
printf("method 2:\n");
kern_2D<<<(num_strings+nTPB-1)/nTPB, nTPB>>>(d_data, num_strings);
cudaDeviceSynchronize();
// method 3: managed allocations
// start over with a managed char** array
char **m_data;
cudaMallocManaged(&m_data, num_strings*sizeof(char *));
cudaMallocManaged(&(m_data[0]), 1024*sizeof(char));
cudaMallocManaged(&(m_data[1]), 1024*sizeof(char));
cudaMallocManaged(&(m_data[2]), 1024*sizeof(char));
// initialize said array
for (int i = 0; i < num_strings; i++)
memcpy(m_data[i], pwdAry[i], 1024);
// call kernel directly on managed data
printf("method 3:\n");
kern_2D<<<(num_strings+nTPB-1)/nTPB, nTPB>>>(m_data, num_strings);
cudaDeviceSynchronize();
return 0;
}
$ nvcc -arch=sm_35 -o t1036 t1036.cu
t1036.cu(42): warning: conversion from a string literal to "char *" is deprecated
t1036.cu(42): warning: conversion from a string literal to "char *" is deprecated
$ cuda-memcheck ./t1036
========= CUDA-MEMCHECK
method 1:
Hello from thread 0, my string is hello
Hello from thread 1, my string is hello
Hello from thread 2, my string is hello
method 2:
Hello from thread 0, my string is hello
Hello from thread 1, my string is hello
Hello from thread 2, my string is hello
method 3:
Hello from thread 0, my string is hello
Hello from thread 1, my string is hello
Hello from thread 2, my string is hello
========= ERROR SUMMARY: 0 errors
$
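For readers who only need the flattening step of method 1, here is a host-only C++ sketch of it. It is an illustration, not the answerer's code: FlatStrings, flatten and stringAt are hypothetical names, and std::string and std::vector stand in for the malloc and strcpy calls above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct FlatStrings {
    std::vector<char> data;         // all strings back to back, each NUL-terminated
    std::vector<unsigned> offsets;  // offsets[i] = start of string i within data
};

// Pack variable-length strings end-to-end and record where each one starts.
FlatStrings flatten(const std::vector<std::string>& strings) {
    FlatStrings flat;
    for (const std::string& s : strings) {
        flat.offsets.push_back(static_cast<unsigned>(flat.data.size()));
        // c_str() is NUL-terminated, so copy size() + 1 bytes.
        flat.data.insert(flat.data.end(), s.c_str(), s.c_str() + s.size() + 1);
    }
    return flat;
}

// Host-side equivalent of data + indices[idx] in kern_1D.
const char* stringAt(const FlatStrings& flat, std::size_t i) {
    assert(i < flat.offsets.size());
    return flat.data.data() + flat.offsets[i];
}
```

The two vectors correspond to fdata and ind in the listing above; their data() pointers and sizes are what would be handed to cudaMemcpy.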

CUDA Pinned memory implementation error cannot set when device is active in this process

I want to use the GPU's pinned memory feature in my code. To do that, I wrote my code like this:
bool addVectorGPU(float* M, float* N, float* P, int size)
{
// Error return value
cudaError_t status;
cudaSetDeviceFlags(cudaDeviceMapHost);
// Number of bytes in the matrix.
int bytes = size * sizeof(float);
// Pointers to the device arrays
float *Md, *Nd, *Pd;
// Allocate memory on the device to store each matrix
cudaHostAlloc((void**)&M, bytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&N, bytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&P, bytes, cudaHostAllocMapped);
// Copy the host input data to the device
cudaHostGetDevicePointer((void**)&Md, M, 0);
cudaHostGetDevicePointer((void**)&Nd, N, 0);
cudaHostGetDevicePointer((void**)&Pd, P, 0);
// Specify the size of the grid and the size of the block
dim3 dimBlock(TILE_SIZE); // Matrix is contained in a block
dim3 dimGrid((int)ceil((float)size / (float)TILE_SIZE));
// Launch the kernel on a size-by-size block of threads
addVectorKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, size);
// Wait for completion
cudaThreadSynchronize();
cudaDeviceSynchronize();
// Check for errors
status = cudaGetLastError();
if (status != cudaSuccess) {
std::cout << "Kernel failed: " << cudaGetErrorString(status) <<
std::endl;
cudaFreeHost(M);
cudaFreeHost(N);
cudaFreeHost(P);
return false;
}
// Retrieve the result matrix
//cudaHostGetDevicePointer((void**)&Pd, P, 0);
// Free device memory
cudaFreeHost(M);
cudaFreeHost(N);
cudaFreeHost(P);
cudaFree(Md);
cudaFree(Nd);
cudaFree(Pd);
// Success
return true;
}
To evaluate performance on my device, I call this function 1000 times and then compute the average time it takes to run:
int main(){
// Timing data
float tcpuadd, tcpusub, tcpuscale, tgpuadd, tgpusub, tgpuscale, sum, delta, L2norm;
clock_t start, end;
bool success;
//Allocate the four vectors of SIZE floats
float* M = new float[SIZE];
float* N = new float[SIZE];
float* Pcpu = new float[SIZE];
float* Pgpu = new float[SIZE];
//Initialize M and N to random integers
for (int i = 0; i < SIZE; i ++){
M[i] = (float) rand()/(RAND_MAX);
N[i] = (float) rand()/(RAND_MAX);
}
printf("Operating on a vector of length %d\n", SIZE);
//Add two vectors and compute timing in CPU
start = clock();
for (int i = 0; i < ITERS; i++) {
addVectorCPU(M, N, Pcpu, SIZE);
}
end = clock();
tcpuadd = (float)(end - start) * 1000 / (float)CLOCKS_PER_SEC / ITERS;
printf( "CPU Addition took %f ms\n", tcpuadd);
//Add two vectors and compute timing in GPU
success = addVectorGPU(M, N ,Pgpu , SIZE);
if(!success)
{
printf("Device Error!\n");
return 1;
}
//compute GPU timing
start = clock();
for (int i = 0; i < ITERS; i++) {
addVectorGPU(M, N, Pgpu, SIZE);
}
end = clock();
tgpuadd = (float)(end - start) * 1000 / (float)CLOCKS_PER_SEC / ITERS;
printf("GPU Addition took %f ms\n", tgpuadd);
The problem is that the first call to this function works without any errors, but the second time I call it, I get this error:
cannot set when device is active in this process
So does anyone know what it is all about?
If you do a better job of cuda error checking by checking the return value of each runtime API call, you'll discover that this error is returned the second time you call this:
cudaSetDeviceFlags(cudaDeviceMapHost);
Note the description of this runtime API call:
If the current device has been set and that device has already been initialized then this call will fail with the error cudaErrorSetOnActiveProcess.
The solution is to call the function only once, at the beginning of your application, not every time you call the addVectorGPU function. Take that call out of the addVectorGPU function, and put it in your main routine, prior to the first call of addVectorGPU.
Based on the question below, there are various other issues with the code:
I would suggest implementing proper cuda error checking on all kernel calls and all CUDA API calls, rather than once at the end of the routine.
The usage of cudaHostAlloc is incorrect. The intent of the program appears to be to pass host pointers to host-resident data to the GPU routine, and then add that data using a zero-copy technique. This is technically feasible (although it will be very slow), but the correct approach would involve the use of cudaHostRegister, not cudaHostAlloc. cudaHostAlloc creates a new allocation, so the existing data passed to the function would not be used or referenced that way.
Here's a worked example, based on what you have shown. Note that I personally would not benchmark things this way, but I am providing this to show that the process can work in an error-free way:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <iostream>
#define TILE_SIZE 512
#define SIZE 1048576
#define ITERS 10
bool addVectorCPU(float *M, float *N, float *P, int size){
for (int i=0; i< size; i++) P[i] = M[i]+N[i];
return true;
}
__global__ void addVectorKernel(float *M, float *N, float *P,int size){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < size)
P[idx] = M[idx]+N[idx];
}
bool addVectorGPU(float* M, float* N, float* P, int size)
{
// Error return value
cudaError_t status;
// Number of bytes in the matrix.
int bytes = size * sizeof(float);
// Pointers to the device arrays
float *Md, *Nd, *Pd;
// Pin (page-lock) the existing host buffers and map them into the device address space
cudaHostRegister(M, bytes, cudaHostRegisterMapped);
cudaHostRegister(N, bytes, cudaHostRegisterMapped);
cudaHostRegister(P, bytes, cudaHostRegisterMapped);
// Get device pointers that refer to the mapped host buffers (zero-copy, no data is copied)
cudaHostGetDevicePointer((void**)&Md, M, 0);
cudaHostGetDevicePointer((void**)&Nd, N, 0);
cudaHostGetDevicePointer((void**)&Pd, P, 0);
// Specify the size of the grid and the size of the block
dim3 dimBlock(TILE_SIZE); // Matrix is contained in a block
dim3 dimGrid((int)ceil((float)size / (float)TILE_SIZE));
// Launch the kernel on a size-by-size block of threads
addVectorKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, size);
// Wait for completion
cudaDeviceSynchronize();
bool res = true;
// Check for errors
status = cudaGetLastError();
if (status != cudaSuccess) {
std::cout << "Kernel failed: " << cudaGetErrorString(status) << std::endl;
res = false;
}
// Retrieve the result matrix
//cudaHostGetDevicePointer((void**)&Pd, P, 0);
// Unpin the host buffers
cudaHostUnregister(M);
cudaHostUnregister(N);
cudaHostUnregister(P);
// Success
return res;
}
int main(){
// Timing data
float tcpuadd, tgpuadd;
clock_t start, end;
bool success;
//Allocate the four vectors of SIZE floats
float* M = new float[SIZE];
float* N = new float[SIZE];
float* Pcpu = new float[SIZE];
float* Pgpu = new float[SIZE];
//Initialize M and N to random integers
for (int i = 0; i < SIZE; i ++){
M[i] = rand()/(float)(RAND_MAX);
N[i] = rand()/(float)(RAND_MAX);
}
printf("Operating on a vector of length %d\n", SIZE);
//Add two vectors and compute timing in CPU
start = clock();
for (int i = 0; i < ITERS; i++) {
addVectorCPU(M, N, Pcpu, SIZE);
}
end = clock();
tcpuadd = (float)(end - start) * 1000 / (float)CLOCKS_PER_SEC / ITERS;
printf( "CPU Addition took %f ms\n", tcpuadd);
//Add two vectors and compute timing in GPU
cudaSetDeviceFlags(cudaDeviceMapHost);
success = addVectorGPU(M, N ,Pgpu , SIZE);
if(!success)
{
printf("Device Error!\n");
return 1;
}
//compute GPU timing
start = clock();
for (int i = 0; i < ITERS; i++) {
addVectorGPU(M, N, Pgpu, SIZE);
}
end = clock();
tgpuadd = (float)(end - start) * 1000 / (float)CLOCKS_PER_SEC / ITERS;
printf("GPU Addition took %f ms\n", tgpuadd);
}
Note I've made a few other changes, as well. For example cudaThreadSynchronize() is deprecated, and it's not necessary to use both cudaThreadSynchronize() and cudaDeviceSynchronize(); they are redundant.
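The answer's fix is to move the cudaSetDeviceFlags call into main. If the call has to stay next to the code that needs it, one alternative (my own suggestion, not something from the answer) is to guard it with a function-local static so that it runs only once. The host-only C++ sketch below shows the pattern; setDeviceFlagsStub is a hypothetical stand-in that fails on the second call, as described in the post, and ensureDeviceFlags, addVector and g_flagCalls are likewise names invented for illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for cudaSetDeviceFlags: succeeds on the first call
// and reports failure on every later call, like the behaviour in the post.
int g_flagCalls = 0;
bool setDeviceFlagsStub() {
    ++g_flagCalls;
    return g_flagCalls == 1;
}

// One-time setup: a function-local static is initialised on the first call
// only, so the stub runs once no matter how often this is called.
bool ensureDeviceFlags() {
    static const bool ok = setDeviceFlagsStub();
    return ok;
}

// Stand-in for addVectorGPU: setup is requested on every call but only
// performed the first time.
bool addVector(const float* m, const float* n, float* p, int size) {
    assert(size >= 0);
    if (!ensureDeviceFlags()) return false;
    for (int i = 0; i < size; ++i) p[i] = m[i] + n[i];
    return true;
}
```

In real code the stub would be replaced by the actual runtime call together with a check of its return value.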

Copying array from RAM to GPU and from GPU to RAM

I'm trying to introduce some CUDA optimizations in one of my projects, but I think I'm doing something wrong. I want to implement a simple matrix-vector multiplication (result = matrix * vector). When I copy the result back to the host, an error occurs (cudaErrorLaunchFailure). Is there an error in my kernel (matrixVectorMultiplicationKernel), or did I call cudaMemcpy incorrectly? I found no helpful documentation for this kind of error state. I think it completely corrupts the state of the GPU, because after the first occurrence I cannot call any CUDA kernel without getting this error again.
edit#1: Updated code, following leftaroundabout's advice.
// code
...
Eigen::MatrixXf matrix(M, N); // matrix.data() usually should return a float array
Eigen::VectorXf vector(N); // same here for vector.data()
Eigen::VectorXf result(M);
... // fill matrix and vector
float* matrixOnDevice = copyMatrixToDevice(matrix.data(), matrix.rows(), matrix.cols());
matrixVectorMultiplication(matrixOnDevice, vector.data(), result.data(), matrix.rows(), cm.cols());
... // clean up
// helper functions
float* copyMatrixToDevice(const float* matrix, int mRows, int mCols)
{
float* matrixOnDevice;
const int length = mRows*mCols;
const int size = length * sizeof(float);
handleCUDAError(cudaMalloc((void**)&matrixOnDevice, size));
handleCUDAError(cudaMemcpy(matrixOnDevice, matrix, size, cudaMemcpyHostToDevice));
return matrixOnDevice;
}
void matrixVectorMultiplication(const float* matrixOnDevice, const float* vector, float* result, int mRows, int mCols)
{
const int vectorSize = mCols*sizeof(float);
const int resultSize = mRows*sizeof(float);
const int matrixLength = mRows*mCols;
float* deviceVector;
float* deviceResult;
handleCUDAError(cudaMalloc((void**)&deviceVector, vectorSize));
handleCUDAError(cudaMalloc((void**)&deviceResult, resultSize));
handleCUDAError(cudaMemset(deviceResult, 0, resultSize));
handleCUDAError(cudaMemcpy(deviceVector, vector, vectorSize, cudaMemcpyHostToDevice));
int threadsPerBlock = 256;
int blocksPerGrid = (mRows + threadsPerBlock - 1) / threadsPerBlock;
matrixVectorMultiplicationKernel<<<blocksPerGrid, threadsPerBlock>>>(matrixOnDevice, vector, result, mRows, mCols, matrixLength);
// --- no errors yet ---
handleCUDAError(cudaMemcpy(result, deviceResult, resultSize, cudaMemcpyDeviceToHost)); // cudaErrorLaunchFailure
handleCUDAError(cudaFree(deviceVector)); // cudaErrorLaunchFailure
handleCUDAError(cudaFree(deviceResult)); // cudaErrorLaunchFailure
}
__global__ void matrixVectorMultiplicationKernel(const float* matrix, const float* vector, float* result, int mRows, int mCols, int length)
{
int row = blockDim.x * blockIdx.x + threadIdx.x;
if(row < mRows)
{
for(int col = 0, mIdx = row*mCols; col < mCols; col++, mIdx++)
result[row] += matrix[mIdx] * vector[col];
}
}
Your problem is that void copyMatrixToDevice(..., float* matrixOnDevice, ...) takes this pointer by value, i.e. it can't "output" the device matrix. You can do it with void copyMatrixToDevice(..., float** matrixOnDevice, ...), called by
copyMatrixToDevice(matrix.data(), &matrixOnDevice, matrix.rows(), matrix.cols());
There is the same problem with result in matrixVectorMultiplication.
In the long term, in C++ you should put a proper class abstraction layer around all of this.
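The by-value pitfall described in this answer can be reproduced without a GPU. In the host-only C++ sketch below, std::malloc stands in for cudaMalloc, and allocByValue and allocByPointer are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdlib>

// Takes the pointer by value: the assignment changes only the local copy,
// so the caller's pointer is unchanged. The block is freed here because
// nobody outside this function could ever reach it.
void allocByValue(float* out, int n) {
    assert(n > 0);
    out = static_cast<float*>(std::malloc(n * sizeof(float)));
    std::free(out);
}

// Takes the address of the caller's pointer, so the new allocation is
// visible to the caller, who then owns it and must free it.
void allocByPointer(float** out, int n) {
    assert(out != nullptr && n > 0);
    *out = static_cast<float*>(std::malloc(n * sizeof(float)));
}
```

Returning the pointer from the function, as the updated copyMatrixToDevice in the question does, is another way to get the allocation back to the caller.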

Matrix Multiplication with CUDA, long execution time

I'm new to CUDA, and I've been trying to figure out what I'm doing wrong here. CUDA is taking longer than just using the CPU to multiply a matrix. If I'm doing something wrong, please let me know.
Here is my code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <cstdlib>
#include <assert.h>
#include <time.h>
#define size 100 // Matrix size
#define cols size // Matrix width
#define rows size // Matrix height
void checkCUDAError(const char *msg)
{
cudaError_t err = cudaGetLastError();
if( cudaSuccess != err)
{
fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
exit(EXIT_FAILURE);
}
}
__global__ void matrixMul( int *A, int *B, int *C)
{
int bx = blockIdx.x; // Block index
int tx = threadIdx.x; // Thread index
int ts = blockDim.x; // number of threads
// Declaration of the shared memory C element
extern __shared__ int c_element_sum[];
c_element_sum[tx] = A[tx+((bx/ts)*ts)] * B[(bx%ts)+(tx*ts)];
//Block until all threads in the block have written their data to shared mem
__syncthreads();
int sum;
for(int i=0; i<ts; i++){
if(i==0){
sum=c_element_sum[i];
}
else{
sum+=c_element_sum[i];
}
}
C[bx] = sum;
}
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
int main(int argc, char** argv)
{
//create timer.
clock_t t1, t2;
//start timer
t1=clock();
//allocate host memory for matrices
unsigned int size_A = cols * rows;
unsigned int mem_size_A = sizeof(int) * size_A;
int* mA = (int*) malloc(mem_size_A);
unsigned int size_B = cols * rows;
unsigned int mem_size_B = sizeof(int) * size_B;
int* mB = (int*) malloc(mem_size_B);
unsigned int size_C = cols * rows;
unsigned int mem_size_C = sizeof(int) * size_C;
int* mC = (int*) malloc(mem_size_C);
//initialize host memory
for (int i = 0; i < size_A; ++i){
mA[i] = 1;
mB[i] = 1;
mC[i] = 0;
}
// allocate device memory
int* d_mA;
int* d_mB;
int* d_mC;
cudaMalloc((void**) &d_mA, mem_size_A);
cudaMalloc((void**) &d_mB, mem_size_B);
cudaMalloc((void**) &d_mC, mem_size_C);
//copy host memory to device (A and B)
cudaMemcpy(d_mA, mA, mem_size_A, cudaMemcpyHostToDevice);
cudaMemcpy(d_mB, mB, mem_size_B, cudaMemcpyHostToDevice);
cudaMemcpy(d_mC, mC, mem_size_C, cudaMemcpyHostToDevice);
// setup execution parameters
int numThreadsPerBlock = cols;
int numBlocks = (cols * rows);
int sharedMemSize = numThreadsPerBlock * sizeof(int);
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
// execute the kernel
matrixMul <<< dimGrid, dimBlock, sharedMemSize >>>(d_mA, d_mB, d_mC);
//Block until device has completed
cudaDeviceSynchronize(); // cudaThreadSynchronize() is deprecated
// check if kernel execution generated an error
// Check for any CUDA errors
checkCUDAError("kernel invocation");
//copy result from device to host
cudaMemcpy(mC, d_mC, mem_size_C, cudaMemcpyDeviceToHost);
// Check for any CUDA errors
checkCUDAError("memcpy");
//stop timer
t2 = clock();
//check results
for (int i = 0; i < size_C; ++i){
assert(mC[i] == cols);
}
//clean up memory
free(mA);
free(mB);
free(mC);
cudaFree(d_mA);
cudaFree(d_mB);
cudaFree(d_mC);
printf("WITH CUDA - clocks: %ld \n\n", (long)(t2-t1)); // clock_t is not guaranteed to be int
//////////////////////////////
///////// CPU ONLY //////////
/////////////////////////////
//create timer.
clock_t cpu_t1, cpu_t2;
//start timer
cpu_t1=clock();
//allocate host memory for matrices
unsigned int cpu_size_A = cols * rows;
unsigned int cpu_mem_size_A = sizeof(int) * cpu_size_A;
int* cpu_mA = (int*) malloc(cpu_mem_size_A);
unsigned int cpu_size_B = cols * rows;
unsigned int cpu_mem_size_B = sizeof(int) * cpu_size_B;
int* cpu_mB = (int*) malloc(cpu_mem_size_B);
unsigned int cpu_size_C = cols * rows;
unsigned int cpu_mem_size_C = sizeof(int) * cpu_size_C;
int* cpu_mC = (int*) malloc(cpu_mem_size_C);
//initialize host memory
for (int i = 0; i < cpu_size_A; ++i){
cpu_mA[i] = 1;
cpu_mB[i] = 1;
cpu_mC[i] = 0;
}
int ts = cols;
for(int bx=0; bx<(cols*rows);bx++){
int sum = 0;
for(int tx=0; tx<cols; tx++){
sum += cpu_mA[tx+((bx/ts)*ts)] * cpu_mB[(bx%ts)+(tx*ts)];
}
cpu_mC[bx]=sum;
}
//stop timer
cpu_t2 = clock();
//check results
for (int i = 0; i < cpu_size_C; ++i){
assert(cpu_mC[i] == cols);
}
//clean up memory
free(cpu_mA);
free(cpu_mB);
free(cpu_mC);
printf("CPU ONLY - clocks: %ld \n\n", (long)(cpu_t2-cpu_t1)); // clock_t is not guaranteed to be int
return 0;
}
Based on your program, this is expected. Your timer covers the entire GPU path: host allocation and initialization, device allocation, copying to the device, the kernel itself, and copying the results back. Given the rather small workload you've provided (100x100 matrices), the overhead of the memory copies far outweighs any computational benefit you get from running the computation in the kernel. Your kernel is also not the most efficient implementation: it launches one block per output element, and every thread in a block then repeats the same serial sum over shared memory.
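One way to see where the time goes is to time each phase separately instead of the whole run. The helper below is a minimal sketch; elapsedMs is a hypothetical name, and it measures wall-clock time with std::chrono::steady_clock rather than the processor time that clock() reports.

```cpp
#include <chrono>

// Runs 'phase' once and returns the elapsed wall-clock time in milliseconds.
template <typename Phase>
double elapsedMs(Phase&& phase) {
    const auto start = std::chrono::steady_clock::now();
    phase();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Each phase (allocation, host-to-device copies, kernel, device-to-host copy) would be wrapped in its own lambda. Kernel launches are asynchronous, so the lambda that times the kernel should also contain the synchronization call; otherwise only the launch overhead is measured.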
I don't think you're doing anything wrong. You simply haven't given the GPU a large enough chunk of work, and you could potentially optimize your kernel further. Note that simply scaling up the size of the chunk may not significantly improve performance relative to the CPU, since you would also be scaling up the memory management time. While it is relatively simple to write a first implementation of a program in CUDA, it is significantly more difficult to get good performance out of it. The most effective way to use CUDA is to have a high ratio of compute to memory transactions. For example, you could have a pipeline of several compute-intensive kernels operate successively on a chunk of data, so that host-device copying is only needed at the beginning and end.
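The compute-to-transfer ratio can be estimated with simple arithmetic. The sketch below is a rough model under stated assumptions: a naive n x n product costs about 2n^3 operations, and four full matrices cross the bus (the question's code uploads A, B and C and downloads C). The function names are illustrative only.

```cpp
#include <cstddef>

// Rough operation count for a naive n x n matrix product:
// n*n output elements, each needing n multiplies and about n additions.
double matmulOps(std::size_t n) {
    return 2.0 * n * n * n;
}

// Bytes crossing the host-device bus when 'copies' full n x n matrices
// of 'elemSize'-byte elements are transferred.
double transferBytes(std::size_t n, std::size_t elemSize, int copies) {
    return static_cast<double>(copies) * n * n * elemSize;
}

// Arithmetic operations performed per byte transferred.
double opsPerByte(std::size_t n, std::size_t elemSize, int copies) {
    return matmulOps(n) / transferBytes(n, elemSize, copies);
}
```

With 4-byte elements and four transfers this works out to n/8 operations per byte, so opsPerByte(100, 4, 4) is 12.5 and opsPerByte(1000, 4, 4) is 125. The ratio grows linearly with n, which is why larger matrices, or several kernels run back to back on data that stays on the device, amortize the copies better.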
If this is just a program to help you learn to code for CUDA, this is a great step, and a deep understanding of how to optimize matrix multiplication kernels will serve you well in many other cases. If you are writing this kernel for use in a production piece of software, I would recommend the highly optimized linear algebra library cuBLAS: http://developer.nvidia.com/cublas (or some other library where the hard work has already been done for you).