Matrix Multiplication with CUDA, long execution time - C++

I'm new to CUDA, and I've been trying to figure out what I'm doing wrong here. CUDA is taking longer than just using the CPU to multiply two matrices. If I'm doing something wrong, please let me know.
Here is my code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <cstdlib>
#include <assert.h>
#include <time.h>
#define size 100 // Matrix size
#define cols size // Matrix width
#define rows size // Matrix height
void checkCUDAError(const char *msg)
{
cudaError_t err = cudaGetLastError();
if( cudaSuccess != err)
{
fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
exit(EXIT_FAILURE);
}
}
__global__ void matrixMul( int *A, int *B, int *C)
{
int bx = blockIdx.x; // Block index
int tx = threadIdx.x; // Thread index
int ts = blockDim.x; // number of threads
// Declaration of the shared memory C element
extern __shared__ int c_element_sum[];
c_element_sum[tx] = A[tx+((bx/ts)*ts)] * B[(bx%ts)+(tx*ts)];
//Block until all threads in the block have written their data to shared mem
__syncthreads();
int sum;
for(int i=0; i<ts; i++){
if(i==0){
sum=c_element_sum[i];
}
else{
sum+=c_element_sum[i];
}
}
C[bx] = sum;
}
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
int main(int argc, char** argv)
{
//create timer.
clock_t t1, t2;
//start timer
t1=clock();
//allocate host memory for matrices
unsigned int size_A = cols * rows;
unsigned int mem_size_A = sizeof(int) * size_A;
int* mA = (int*) malloc(mem_size_A);
unsigned int size_B = cols * rows;
unsigned int mem_size_B = sizeof(int) * size_B;
int* mB = (int*) malloc(mem_size_B);
unsigned int size_C = cols * rows;
unsigned int mem_size_C = sizeof(int) * size_C;
int* mC = (int*) malloc(mem_size_C);
//initialize host memory
for (int i = 0; i < size_A; ++i){
mA[i] = 1;
mB[i] = 1;
mC[i] = 0;
}
// allocate device memory
int* d_mA;
int* d_mB;
int* d_mC;
cudaMalloc((void**) &d_mA, mem_size_A);
cudaMalloc((void**) &d_mB, mem_size_B);
cudaMalloc((void**) &d_mC, mem_size_C);
//copy host memory to device (A and B)
cudaMemcpy(d_mA, mA, mem_size_A, cudaMemcpyHostToDevice);
cudaMemcpy(d_mB, mB, mem_size_B, cudaMemcpyHostToDevice);
cudaMemcpy(d_mC, mC, mem_size_C, cudaMemcpyHostToDevice);
// setup execution parameters
int numThreadsPerBlock = cols;
int numBlocks = (cols * rows);
int sharedMemSize = numThreadsPerBlock * sizeof(int);
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
// execute the kernel
matrixMul <<< dimGrid, dimBlock, sharedMemSize >>>(d_mA, d_mB, d_mC);
//Block until device has completed
cudaThreadSynchronize();
// check if kernel execution generated an error
// Check for any CUDA errors
checkCUDAError("kernel invocation");
//copy result from device to host
cudaMemcpy(mC, d_mC, mem_size_C, cudaMemcpyDeviceToHost);
// Check for any CUDA errors
checkCUDAError("memcpy");
//stop timer
t2 = clock();
//check results
for (int i = 0; i < size_C; ++i){
assert(mC[i] == cols);
}
//clean up memory
free(mA);
free(mB);
free(mC);
cudaFree(d_mA);
cudaFree(d_mB);
cudaFree(d_mC);
printf("WITH CUDA - clocks: %d \n\n", t2-t1);
//////////////////////////////
///////// CPU ONLY //////////
/////////////////////////////
//create timer.
clock_t cpu_t1, cpu_t2;
//start timer
cpu_t1=clock();
//allocate host memory for matrices
unsigned int cpu_size_A = cols * rows;
unsigned int cpu_mem_size_A = sizeof(int) * cpu_size_A;
int* cpu_mA = (int*) malloc(cpu_mem_size_A);
unsigned int cpu_size_B = cols * rows;
unsigned int cpu_mem_size_B = sizeof(int) * cpu_size_B;
int* cpu_mB = (int*) malloc(cpu_mem_size_B);
unsigned int cpu_size_C = cols * rows;
unsigned int cpu_mem_size_C = sizeof(int) * cpu_size_C;
int* cpu_mC = (int*) malloc(cpu_mem_size_C);
//initialize host memory
for (int i = 0; i < cpu_size_A; ++i){
cpu_mA[i] = 1;
cpu_mB[i] = 1;
cpu_mC[i] = 0;
}
int ts = cols;
for(int bx=0; bx<(cols*rows);bx++){
int sum = 0;
for(int tx=0; tx<cols; tx++){
sum += cpu_mA[tx+((bx/ts)*ts)] * cpu_mB[(bx%ts)+(tx*ts)];
}
cpu_mC[bx]=sum;
}
//stop timer
cpu_t2 = clock();
//check results
for (int i = 0; i < cpu_size_C; ++i){
assert(cpu_mC[i] == cols);
}
//clean up memory
free(cpu_mA);
free(cpu_mB);
free(cpu_mC);
printf("CPU ONLY - clocks: %d \n\n", cpu_t2-cpu_t1);
return 0;
}

Based on your program, this is expected. Your timer looks like it clocks the entire execution of the program, which would include copying to the device, computation time, and copying the results back. Given the rather small workload you've provided for the program (100x100 matrices), the overhead of the memory copies far outweighs any computational benefit you get when doing the computation with the kernel. Your kernel itself is also not the most efficient implementation.
I don't think you're doing anything wrong; it's just that you haven't provided a large enough chunk of work for the GPU, and you could potentially further optimize your kernel. Note that simply scaling up the size of the chunk may not significantly improve the performance with respect to the CPU, since you would also be scaling up the memory management time. While it is relatively simple to write a first implementation of a program on CUDA, it is significantly more difficult to get good performance out of it. The most effective way to use CUDA is to have a high ratio of compute to memory transactions. For example, you could have a pipeline of several compute-intensive kernels operating successively on a chunk of data, with host-device copying needed only at the beginning and end.
If this is just a program to help you learn to code for CUDA, this is a great step and getting a deep understanding of how to optimize matrix multiplication kernels will serve you well in many other cases. If you are writing this kernel for use in a production piece of software, I would recommend you use the highly-optimized linear algebra library CUBLAS: http://developer.nvidia.com/cublas (or some other library where the hard work has been done for you already).
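To see where the time actually goes, one option (a sketch reusing the names from the code above, not part of the original answer) is to time the kernel alone with CUDA events and compare that against the clock() interval that also spans the allocations and copies:
// Sketch: measure only the kernel, in milliseconds, using CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
matrixMul <<< dimGrid, dimBlock, sharedMemSize >>>(d_mA, d_mB, d_mC);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float kernel_ms = 0.0f;
cudaEventElapsedTime(&kernel_ms, start, stop);
printf("kernel only: %f ms\n", kernel_ms);
cudaEventDestroy(start);
cudaEventDestroy(stop);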

Related

CUDA C++ - how to define an array of unknown size in a CUDA kernel (without __shared__)?

I need to dynamically allocate some arrays inside the kernel function. How can I do that?
My code is something like this:
__global__ void func(float *grid_d, int n, int nn){
    int i, j;
    float x[n], y[nn];
    // Do some really cool and heavy computations here that take hours.
}
But that will not work. If this were inside host code I could use malloc. cudaMalloc needs a pointer on the host and another on the device, and inside the kernel function I don't have the host pointer.
So, what should I do?
If it takes a while (a few seconds) to allocate all the arrays (I need about 4 of size n and 5 of size nn), that won't be a problem, since the kernel will probably run for at least 20 minutes.
Dynamic memory allocation is only supported on compute capability 2.x and newer hardware. You can use either the C++ new keyword or malloc in the kernel, so your example could become:
__global__ void func(float *grid_d, int n, int nn){
    int i, j;
    float *x = new float[n], *y = new float[nn];
}
This allocates memory on a local memory runtime heap which has the lifetime of the context, so make sure you free the memory after the kernel finishes running if your intention is not to use the memory again. You should also note that runtime heap memory cannot be accessed directly from the host APIs, so you cannot pass a pointer allocated inside a kernel as an argument to cudaMemcpy, for example.
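As a sketch of that advice (mine, not part of the original answer), reusing the func/n/nn names from the question; heap_bytes is a placeholder you would size yourself:
// Free the in-kernel allocations before the kernel returns.
__global__ void func(float *grid_d, int n, int nn){
    float *x = new float[n], *y = new float[nn];
    if (x == NULL || y == NULL) return; // device-side new returns NULL when the heap is exhausted
    // ... heavy computations using x and y ...
    delete [] x;
    delete [] y;
}
// Host side, before the launch: enlarge the device malloc heap (only 8 MB by default).
cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes);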
@talonmies answered your question on how to dynamically allocate memory within a kernel. This is intended as a supplemental answer, addressing the performance of __device__ malloc() and an alternative you might want to consider.
Allocating memory dynamically in the kernel can be tempting because it allows GPU code to look more like CPU code. But it can seriously affect performance. I wrote a self contained test and have included it below. The test launches some 2.6 million threads. Each thread populates 16 integers of global memory with some values derived from the thread index, then sums up the values and returns the sum.
The test implements two approaches. The first approach uses __device__ malloc() and the second approach uses memory that is allocated before the kernel runs.
On my 2.0 device, the kernel runs in 1500ms when using __device__ malloc() and 27ms when using pre-allocated memory. In other words, the test takes 56x longer to run when memory is allocated dynamically within the kernel. The time includes the outer loop cudaMalloc() / cudaFree(), which is not part of the kernel. If the same kernel is launched many times with the same number of threads, as is often the case, the cost of the cudaMalloc() / cudaFree() is amortized over all the kernel launches. That brings the difference even higher, to around 60x.
Speculating, I think that the performance hit is in part caused by implicit serialization. The GPU must probably serialize all simultaneous calls to __device__ malloc() in order to provide separate chunks of memory to each caller.
The version that does not use __device__ malloc() allocates all the GPU memory before running the kernel. A pointer to the memory is passed to the kernel. Each thread calculates an index into the previously allocated memory instead of using a __device__ malloc().
The potential issue with allocating memory up front is that, if only some threads need to allocate memory, and it is not known which threads those are, it will be necessary to allocate memory for all the threads. If there is not enough memory for that, it might be more efficient to reduce the number of threads per kernel call than to use __device__ malloc(). Other workarounds would probably end up reimplementing what __device__ malloc() is doing in the background, and would see a similar performance hit.
Test the performance of __device__ malloc():
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
const int N_ITEMS(16);
#define USE_DYNAMIC_MALLOC
__global__ void test_malloc(int* totals)
{
int tx(blockIdx.x * blockDim.x + threadIdx.x);
int* s(new int[N_ITEMS]);
for (int i(0); i < N_ITEMS; ++i) {
s[i] = tx * i;
}
int total(0);
for (int i(0); i < N_ITEMS; ++i) {
total += s[i];
}
totals[tx] = total;
delete[] s;
}
__global__ void test_malloc_2(int* items, int* totals)
{
int tx(blockIdx.x * blockDim.x + threadIdx.x);
int* s(items + tx * N_ITEMS);
for (int i(0); i < N_ITEMS; ++i) {
s[i] = tx * i;
}
int total(0);
for (int i(0); i < N_ITEMS; ++i) {
total += s[i];
}
totals[tx] = total;
}
int main()
{
cudaError_t cuda_status;
cudaSetDevice(0);
int blocks_per_launch(1024 * 10);
int threads_per_block(256);
int threads_per_launch(blocks_per_launch * threads_per_block);
int* totals_d;
cudaMalloc((void**)&totals_d, threads_per_launch * sizeof(int));
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaDeviceSynchronize();
cudaEventRecord(start, 0);
#ifdef USE_DYNAMIC_MALLOC
cudaDeviceSetLimit(cudaLimitMallocHeapSize, threads_per_launch * N_ITEMS * sizeof(int));
test_malloc<<<blocks_per_launch, threads_per_block>>>(totals_d);
#else
int* items_d;
cudaMalloc((void**)&items_d, threads_per_launch * sizeof(int) * N_ITEMS);
test_malloc_2<<<blocks_per_launch, threads_per_block>>>(items_d, totals_d);
cudaFree(items_d);
#endif
cuda_status = cudaDeviceSynchronize();
if (cuda_status != cudaSuccess) {
printf("Error: %d\n", cuda_status);
exit(1);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
printf("Elapsed: %f\n", elapsedTime);
int* totals_h(new int[threads_per_launch]);
cuda_status = cudaMemcpy(totals_h, totals_d, threads_per_launch * sizeof(int), cudaMemcpyDeviceToHost);
if (cuda_status != cudaSuccess) {
printf("Error: %d\n", cuda_status);
exit(1);
}
for (int i(0); i < 10; ++i) {
printf("%d ", totals_h[i]);
}
printf("\n");
cudaFree(totals_d);
delete[] totals_h;
return cuda_status;
}
Output:
C:\rd\projects\test_cuda_malloc\Release>test_cuda_malloc.exe
Elapsed: 27.311169
0 120 240 360 480 600 720 840 960 1080
C:\rd\projects\test_cuda_malloc\Release>test_cuda_malloc.exe
Elapsed: 1516.711914
0 120 240 360 480 600 720 840 960 1080
If the value of n and nn were known before the kernel is called, then why not cudaMalloc the memory on host side and pass in the device memory pointer to the kernel?
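To illustrate that comment with a sketch (total_threads, num_blocks, threads_per_block and the extra kernel parameters are hypothetical names, not from the question):
// Allocate once on the host, sized for all threads, then pass the pointers in.
float *x_all, *y_all;
cudaMalloc((void**)&x_all, (size_t)n  * total_threads * sizeof(float));
cudaMalloc((void**)&y_all, (size_t)nn * total_threads * sizeof(float));
func<<<num_blocks, threads_per_block>>>(grid_d, n, nn, x_all, y_all);
// Inside the kernel, each thread slices out its own piece:
//   int tid = blockIdx.x * blockDim.x + threadIdx.x;
//   float *x = x_all + (size_t)tid * n;
//   float *y = y_all + (size_t)tid * nn;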
I ran an experiment based on the concepts in @rogerdahl's post. Assumptions:
4MB of memory allocated in 64B chunks.
1 GPU block with 32 threads (one warp) in that block
Run on a P100
The malloc+free calls local to the GPU seemed to be much faster than the cudaMalloc + cudaFree calls. The program's output:
Starting timer for cuda malloc timer
Stopping timer for cuda malloc timer
timer for cuda malloc timer took 1.169631s
Starting timer for device malloc timer
Stopping timer for device malloc timer
timer for device malloc timer took 0.029794s
I'm leaving out the code for timer.h and timer.cpp, but here's the code for the test itself:
#include "cuda_runtime.h"
#include <stdio.h>
#include <iostream>
#include <thrust/system/cuda/error.h>
#include "timer.h"
static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)
const int BLOCK_COUNT = 1;
const int THREADS_PER_BLOCK = 32;
const int ITERATIONS = 1 << 12;
const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);
const int ARRAY_SIZE = 64;
void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err) {
if (err == cudaSuccess)
return;
std::cerr << statement<<" returned " << cudaGetErrorString(err) << "("<<err<< ") at "<<file<<":"<<line << std::endl;
exit (1);
}
__global__ void mallocai() {
for (int i = 0; i < ITERATIONS_PER_BLOCKTHREAD; ++i) {
int * foo;
foo = (int *) malloc(sizeof(int) * ARRAY_SIZE);
free(foo);
}
}
int main() {
Timer cuda_malloc_timer("cuda malloc timer");
for (int i = 0; i < ITERATIONS; ++ i) {
if (i == 1) cuda_malloc_timer.start(); // let it warm up one cycle
int * foo;
cudaMalloc(&foo, sizeof(int) * ARRAY_SIZE);
cudaFree(foo);
}
cuda_malloc_timer.stop_and_report();
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
Timer device_malloc_timer("device malloc timer");
device_malloc_timer.start();
mallocai<<<BLOCK_COUNT, THREADS_PER_BLOCK>>>();
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
device_malloc_timer.stop_and_report();
}
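Since timer.h is omitted, here is a minimal stand-in consistent with how Timer is used above and with the printed output (my sketch, not the author's original):
// Hypothetical replacement for the omitted timer.h.
#include <chrono>
#include <cstdio>
#include <string>
class Timer {
public:
    explicit Timer(const std::string &name) : name_(name) {}
    void start() {
        std::printf("Starting timer for %s\n", name_.c_str());
        begin_ = std::chrono::steady_clock::now();
    }
    void stop_and_report() {
        const auto end = std::chrono::steady_clock::now();
        const double secs = std::chrono::duration<double>(end - begin_).count();
        std::printf("Stopping timer for %s\n", name_.c_str());
        std::printf("timer for %s took %fs\n", name_.c_str(), secs);
    }
private:
    std::string name_;
    std::chrono::steady_clock::time_point begin_;
};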
If you find mistakes, please let me know in the comments, and I'll try to fix them.
And I ran them again with larger everything:
const int BLOCK_COUNT = 56;
const int THREADS_PER_BLOCK = 1024;
const int ITERATIONS = 1 << 18;
const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);
const int ARRAY_SIZE = 1024;
And cudaMalloc was still slower by a lot:
Starting timer for cuda malloc timer
Stopping timer for cuda malloc timer
timer for cuda malloc timer took 74.878016s
Starting timer for device malloc timer
Stopping timer for device malloc timer
timer for device malloc timer took 0.167331s
Maybe you should test
cudaMalloc(&foo, sizeof(int) * ARRAY_SIZE * ITERATIONS);
cudaFree(foo);
instead of
for (int i = 0; i < ITERATIONS; ++ i) {
if (i == 1) cuda_malloc_timer.start(); // let it warm up one cycle
int * foo;
cudaMalloc(&foo, sizeof(int) * ARRAY_SIZE);
cudaFree(foo);
}

cudaMemcpy2D error with large array

I tried to use cudaMallocPitch and cudaMemcpy2D, but when I tried to use cudaMemcpy2D with a large array, I encountered a problem:
Segmentation fault
Here is the runnable source code, which produces no error as written.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
#include <random>
#define ROW_SIZE 32
#define COL_SIZE 1024
int main()
{
float ** pfTest;
pfTest = (float**)malloc(ROW_SIZE * sizeof(float*));
for (int i = 0; i < ROW_SIZE; i++) {
pfTest[i] = (float*)malloc(COL_SIZE * sizeof(float));
}
std::default_random_engine generator;
std::uniform_real_distribution<float> distribution;
for (int y = 0; y < ROW_SIZE; y++) {
for (int x = 0; x < COL_SIZE; x++) {
pfTest[y][x] = distribution(generator);
}
}
float *dev_Test;
size_t pitch;
cudaMallocPitch(&dev_Test, &pitch, COL_SIZE * sizeof(float), ROW_SIZE);
cudaMemcpy2D(dev_Test, pitch, pfTest, COL_SIZE * sizeof(float), COL_SIZE * sizeof(float), ROW_SIZE, cudaMemcpyHostToDevice);
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
return 0;
}
As you can see, there's no problem at all.
But when I try to extend COL_SIZE to around 500,000 (exactly 524,288), it crashes with a segmentation fault.
Any help as to the source of the problem?
cudaMemcpy2D can only be used for copying pitched linear memory. Your source array is not pitched linear memory, it is an array of pointers. This is not supported and is the source of the segfault.
Try something like this:
float* buffer;
float** pfTest;
const size_t buffer_pitch = size_t(COL_SIZE) * sizeof(float);
buffer = (float*)malloc(size_t(ROW_SIZE) * buffer_pitch);
pfTest = (float**)malloc(ROW_SIZE * sizeof(float*));
for (size_t i = 0; i < ROW_SIZE; i++) {
    pfTest[i] = buffer + i * size_t(COL_SIZE);
}
// ...
cudaMallocPitch(&dev_Test, &pitch, buffer_pitch, ROW_SIZE);
cudaMemcpy2D(dev_Test, pitch, buffer, buffer_pitch,
             buffer_pitch, ROW_SIZE, cudaMemcpyHostToDevice);
[Note: written in browser, never tested or compiled, use at own risk]
i.e. store the data to be copied in a single contiguous memory allocation which can act as a pitched linear source for cudaMemcpy2D. If you insist on using [][] style indexing on the host, then you have to pay the penalty of storing an additional array of pointers alongside the data. Note that this isn't actually necessary: you could just index directly into buffer and achieve the same result, while saving memory at the same time.
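For example, a sketch of filling the flat allocation directly, reusing the names from the question:
// Index the contiguous buffer directly instead of going through pfTest[y][x].
for (int y = 0; y < ROW_SIZE; y++) {
    for (int x = 0; x < COL_SIZE; x++) {
        buffer[size_t(y) * COL_SIZE + x] = distribution(generator);
    }
}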

CUDA pinned memory implementation error: cannot set when device is active in this process

I want to implement the pinned-memory feature of the GPU in my code. To do that, I wrote my code like this:
bool addVectorGPU(float* M, float* N, float* P, int size)
{
// Error return value
cudaError_t status;
cudaSetDeviceFlags(cudaDeviceMapHost);
// Number of bytes in the matrix.
int bytes = size * sizeof(float);
// Pointers to the device arrays
float *Md, *Nd, *Pd;
// Allocate memory on the device to store each matrix
cudaHostAlloc((void**)&M, bytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&N, bytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&P, bytes, cudaHostAllocMapped);
// Copy the host input data to the device
cudaHostGetDevicePointer((void**)&Md, M, 0);
cudaHostGetDevicePointer((void**)&Nd, N, 0);
cudaHostGetDevicePointer((void**)&Pd, P, 0);
// Specify the size of the grid and the size of the block
dim3 dimBlock(TILE_SIZE); // Matrix is contained in a block
dim3 dimGrid((int)ceil((float)size / (float)TILE_SIZE));
// Launch the kernel on a size-by-size block of threads
addVectorKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, size);
// Wait for completion
cudaThreadSynchronize();
cudaDeviceSynchronize();
// Check for errors
status = cudaGetLastError();
if (status != cudaSuccess) {
std::cout << "Kernel failed: " << cudaGetErrorString(status) <<
std::endl;
cudaFreeHost(M);
cudaFreeHost(N);
cudaFreeHost(P);
return false;
}
// Retrieve the result matrix
//cudaHostGetDevicePointer((void**)&Pd, P, 0);
// Free device memory
cudaFreeHost(M);
cudaFreeHost(N);
cudaFreeHost(P);
cudaFree(Md);
cudaFree(Nd);
cudaFree(Pd);
// Success
return true;
}
Now, to evaluate performance on my device, I call this function 1000 times and then compute the average time it takes to run:
int main(){
// Timing data
float tcpuadd, tcpusub, tcpuscale, tgpuadd, tgpusub, tgpuscale, sum, delta, L2norm;
clock_t start, end;
bool success;
//Allocate the four vectors of SIZE floats
float* M = new float[SIZE];
float* N = new float[SIZE];
float* Pcpu = new float[SIZE];
float* Pgpu = new float[SIZE];
//Initialize M and N to random integers
for (int i = 0; i < SIZE; i ++){
M[i] = (float) rand()/(RAND_MAX);
N[i] = (float) rand()/(RAND_MAX);
}
printf("Operating on a vector of length %d\n", SIZE);
//Add two vectors and compute timing in CPU
start = clock();
for (int i = 0; i < ITERS; i++) {
addVectorCPU(M, N, Pcpu, SIZE);
}
end = clock();
tcpuadd = (float)(end - start) * 1000 / (float)CLOCKS_PER_SEC / ITERS;
printf( "CPU Addition took %f ms\n", tcpuadd);
//Add two vectors and compute timing in GPU
success = addVectorGPU(M, N ,Pgpu , SIZE);
if(!success)
{
printf("Device Error!\n");
return 1;
}
//compute GPU timing
start = clock();
for (int i = 0; i < ITERS; i++) {
addVectorGPU(M, N, Pgpu, SIZE);
}
end = clock();
tgpuadd = (float)(end - start) * 1000 / (float)CLOCKS_PER_SEC / ITERS;
printf("GPU Addition took %f ms\n", tgpuadd);
The problem is that the first time this function is called, it works without any errors. But the second time I call it, I get this error:
cannot set when device is active in this process
So does anyone know what this is about?
If you do a better job of CUDA error checking, by checking the return value of each runtime API call, you'll discover that this error is returned by the second call to this:
cudaSetDeviceFlags(cudaDeviceMapHost);
Note the description of this runtime API call:
If the current device has been set and that device has already been initialized then this call will fail with the error cudaErrorSetOnActiveProcess.
The solution is to call the function only once, at the beginning of your application, not every time you call the addVectorGPU function. Take that call out of the addVectorGPU function, and put it in your main routine, prior to the first call of addVectorGPU.
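For reference, a minimal sketch (not from the original answer) of the per-call check that surfaces this error:
// Check the return value of the call that actually fails.
cudaError_t err = cudaSetDeviceFlags(cudaDeviceMapHost);
if (err != cudaSuccess) {
    printf("cudaSetDeviceFlags failed: %s\n", cudaGetErrorString(err));
}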
Based on the question below, there are various other issues with the code:
I would suggest implementing proper cuda error checking on all kernel calls and all CUDA API calls, rather than once at the end of the routine.
The usage of cudaHostAlloc is incorrect. The intent of the program appears to be to pass host pointers to host-resident data to the GPU routine, and then add that data using a zero-copy technique. This is technically feasible (although it will be very slow), but the correct approach would involve the use of cudaHostRegister, not cudaHostAlloc. cudaHostAlloc creates a new allocation, so the existing data passed to the function would not be used or referenced that way.
Here's a worked example, based on what you have shown. Note that I personally would not benchmark things this way, but I am providing this to show that the process can work in an error-free way:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <iostream>
#define TILE_SIZE 512
#define SIZE 1048576
#define ITERS 10
bool addVectorCPU(float *M, float *N, float *P, int size){
for (int i=0; i< size; i++) P[i] = M[i]+N[i];
return true;
}
__global__ void addVectorKernel(float *M, float *N, float *P,int size){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < size)
P[idx] = M[idx]+N[idx];
}
bool addVectorGPU(float* M, float* N, float* P, int size)
{
// Error return value
cudaError_t status;
// Number of bytes in the matrix.
int bytes = size * sizeof(float);
// Pointers to the device arrays
float *Md, *Nd, *Pd;
// Allocate memory on the device to store each matrix
cudaHostRegister(M, bytes, cudaHostRegisterMapped);
cudaHostRegister(N, bytes, cudaHostRegisterMapped);
cudaHostRegister(P, bytes, cudaHostRegisterMapped);
// Copy the host input data to the device
cudaHostGetDevicePointer((void**)&Md, M, 0);
cudaHostGetDevicePointer((void**)&Nd, N, 0);
cudaHostGetDevicePointer((void**)&Pd, P, 0);
// Specify the size of the grid and the size of the block
dim3 dimBlock(TILE_SIZE); // Matrix is contained in a block
dim3 dimGrid((int)ceil((float)size / (float)TILE_SIZE));
// Launch the kernel on a size-by-size block of threads
addVectorKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, size);
// Wait for completion
cudaDeviceSynchronize();
bool res = true;
// Check for errors
status = cudaGetLastError();
if (status != cudaSuccess) {
std::cout << "Kernel failed: " << cudaGetErrorString(status) << std::endl;
res = false;
}
// Retrieve the result matrix
//cudaHostGetDevicePointer((void**)&Pd, P, 0);
// Free device memory
cudaHostUnregister(M);
cudaHostUnregister(N);
cudaHostUnregister(P);
// Success
return res;
}
int main(){
// Timing data
float tcpuadd, tgpuadd;
clock_t start, end;
bool success;
//Allocate the four vectors of SIZE floats
float* M = new float[SIZE];
float* N = new float[SIZE];
float* Pcpu = new float[SIZE];
float* Pgpu = new float[SIZE];
//Initialize M and N to random integers
for (int i = 0; i < SIZE; i ++){
M[i] = rand()/(float)(RAND_MAX);
N[i] = rand()/(float)(RAND_MAX);
}
printf("Operating on a vector of length %d\n", SIZE);
//Add two vectors and compute timing in CPU
start = clock();
for (int i = 0; i < ITERS; i++) {
addVectorCPU(M, N, Pcpu, SIZE);
}
end = clock();
tcpuadd = (float)(end - start) * 1000 / (float)CLOCKS_PER_SEC / ITERS;
printf( "CPU Addition took %f ms\n", tcpuadd);
//Add two vectors and compute timing in GPU
cudaSetDeviceFlags(cudaDeviceMapHost);
success = addVectorGPU(M, N ,Pgpu , SIZE);
if(!success)
{
printf("Device Error!\n");
return 1;
}
//compute GPU timing
start = clock();
for (int i = 0; i < ITERS; i++) {
addVectorGPU(M, N, Pgpu, SIZE);
}
end = clock();
tgpuadd = (float)(end - start) * 1000 / (float)CLOCKS_PER_SEC / ITERS;
printf("GPU Addition took %f ms\n", tgpuadd);
}
Note that I've made a few other changes as well. For example, cudaThreadSynchronize() is deprecated, and it's not necessary to use both cudaThreadSynchronize() and cudaDeviceSynchronize(); they are redundant.

CUDA: Group every n-th point of array passed to GPU

I am trying to implement the k-means algorithm in CUDA using a Tesla card on an external Unix machine. I read the input file and store the coordinates of all data points in the dataX and dataY arrays. The next step is to select every centreInterval-th point and store it in another array allocated in GPU memory. However, I have no idea how I can even check what the problem is when all I get is "Segmentation fault", and for obvious reasons I can't print any kind of output from the kernel.
EDIT 2: I simplified this example down to the shortest possible version. I found my solution in the process, but decided to keep the not-yet-solved version in this question to make it clearer what caused the problem.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <math.h>
#include <time.h>
#include <unistd.h>
#define BLOCK_SIZE 16
// My kernel - Selects some centres at the beginning of algorithm and stores it at appropriate place
__global__ void kMeansSelectInitialCentres(float* d_dataX, float* d_dataY, float* d_centresX, float* d_centresY, int centreInterval) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int idx = i * centreInterval;
d_centresX[i] = d_dataX[idx];
d_centresY[i] = d_dataY[idx];
}
// Simplified example
int main(int argn, char ** argc) {
// My data - let's say it is 32 floats in each
int dataSize = 32;
float* dataX = new float[dataSize];
float* dataY = new float[dataSize];
// Fill arrays with numbers
for (int i = 0; i < dataSize; i++) {
dataX[i] = i;
dataY[i] = i;
}
// Interval - we select first number, then 1 + N * centreInterval
int centreInterval = 2;
// There I will store my results in program
int centreSize = dataSize / centreInterval;
float* centresX = new float[centreSize];
float* centresY = new float[centreSize];
// Pointers to the arrays stored in GPU memory
float* d_dataX;
float* d_dataY;
float* d_centresX;
float* d_centresY;
// Allocate memory for those arrays
// Calculate how much space in memory do we need for this
size_t d_centreSize = sizeof(float) * centreSize;
size_t d_dataSize = sizeof(float) * dataSize;
// Memory for raw data
cudaMalloc((void**)&d_dataX, d_dataSize);
cudaMalloc((void**)&d_dataY, d_dataSize);
// Copy raw data to the device memory so we can operate on it freely
cudaMemcpy(d_dataY, dataY, d_dataSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_dataX, dataX, d_dataSize, cudaMemcpyHostToDevice);
// Memory for centre results
cudaMalloc((void**)&d_centresX, d_dataSize);
cudaMalloc((void**)&d_centresY, d_dataSize);
// Call kernel
dim3 dimBlock(BLOCK_SIZE);
dim3 dimGridK((centreSize + dimBlock.x) / dimBlock.x);
kMeansSelectInitialCentres <<<dimGridK, dimBlock>>> (d_dataX, d_dataY, d_centresX, d_centresY, centreInterval);
// Check results - we get every n-th point
float* check_x = new float[centreSize];
float* check_y = new float[centreSize];
cudaMemcpy(check_x, d_centresX, d_dataSize, cudaMemcpyDeviceToHost);
cudaMemcpy(check_y, d_centresY, d_dataSize, cudaMemcpyDeviceToHost);
printf("X: ");
for (int i = 0; i < centreSize; i++)
printf("%.2f ", check_x[i]);
printf("\nY: ");
for (int i = 0; i < centreSize; i++)
printf("%.2f ", check_y[i]);
printf("\n");
}
Main question: What is wrong with this kernel / the read-back of the data?
Side question: Is there any fair way to debug program kernels in such situations?
So, here's the solution I came up with after simplifying my case. There was a problem with memory usage: I tried to store and read a different amount of data than I had claimed when allocating it. I hope it will be helpful to anyone in the future:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <math.h>
#include <time.h>
#include <unistd.h>
#define BLOCK_SIZE 16
// My kernel - Selects some centres at the beginning of algorithm and stores it at appropriate place
__global__ void kMeansSelectInitialCentres(float* d_dataX, float* d_dataY, float* d_centresX, float* d_centresY, int centreInterval) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int idx = i * centreInterval;
d_centresX[i] = d_dataX[idx];
d_centresY[i] = d_dataY[idx];
}
// Simplified example
int main(int argn, char ** argc) {
// My data - let's say it is 32 floats in each
int dataSize = 32;
float* dataX = new float[dataSize];
float* dataY = new float[dataSize];
// Fill arrays with numbers
for (int i = 0; i < dataSize; i++) {
dataX[i] = i;
dataY[i] = i;
}
// Interval - we select first number, then 1 + N * centreInterval
int centreInterval = 2;
// There I will store my results in program
int centreSize = dataSize / centreInterval;
float* centresX = new float[centreSize];
float* centresY = new float[centreSize];
// Pointers to the arrays stored in GPU memory
float* d_dataX;
float* d_dataY;
float* d_centresX;
float* d_centresY;
// Allocate memory for those arrays
// Calculate how much space in memory do we need for this
size_t d_centreSize = sizeof(float) * centreSize;
size_t d_dataSize = sizeof(float) * dataSize;
// Memory for raw data
cudaMalloc((void**)&d_dataX, d_dataSize);
cudaMalloc((void**)&d_dataY, d_dataSize);
// Copy raw data to the device memory so we can operate on it freely
cudaMemcpy(d_dataY, dataY, d_dataSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_dataX, dataX, d_dataSize, cudaMemcpyHostToDevice);
// Memory for centre results
cudaMalloc((void**)&d_centresX, d_centreSize);
cudaMalloc((void**)&d_centresY, d_centreSize);
// Call kernel
dim3 dimBlock(BLOCK_SIZE);
dim3 dimGridK((centreSize + dimBlock.x - 1) / dimBlock.x);
kMeansSelectInitialCentres <<<dimGridK, dimBlock>>> (d_dataX, d_dataY, d_centresX, d_centresY, centreInterval);
// Check results - we get every n-th point
float* check_x = new float[centreSize];
float* check_y = new float[centreSize];
cudaMemcpy(check_x, d_centresX, d_centreSize, cudaMemcpyDeviceToHost);
cudaMemcpy(check_y, d_centresY, d_centreSize, cudaMemcpyDeviceToHost);
printf("X: ");
for (int i = 0; i < centreSize; i++)
printf("%.2f ", check_x[i]);
printf("\nY: ");
for (int i = 0; i < centreSize; i++)
printf("%.2f ", check_y[i]);
printf("\n");
}

Copying one array into another using OpenCL on CPU is much slower than C++ code

I compared the performance of OpenCL code running on the CPU, which simply copies data from one 2D array into another, with pure C++ code that does the same thing. I used a single work-group in the OpenCL code to make a fair comparison. I used Intel's OpenCL drivers and the Intel compiler. The OpenCL code is about 5 times slower than the C++ code. The compiler gives the following message for the copy loop:
loop was transformed to memset or memcpy.
Any suggestions on how to get the OpenCL code up to speed with the C++ code?
Thanks
OpenCL host code:
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <cmath>
#include <ctime>
#include <CL/cl.hpp>
int main(int argc, char **argv)
{
// Create the two input vectors
const int N = 8192;
double *in = new double[N*N];
double *out = new double[N*N];
for(int i = 0; i < N; i++)
for (int j=0; j < N; j++) {
in[i*N + j] = i + j;
out[i*N + j] = 0.;
}
double time;
std::clock_t start;
int niter = 100;
cl_int cl_err;
std::vector<cl::Platform> platforms;
cl_err = cl::Platform::get(&platforms);
std::vector<cl::Device> devices;
cl_err = platforms.at(1).getDevices(CL_DEVICE_TYPE_CPU,
&devices);
cl_context_properties context_properties[3] = {CL_CONTEXT_PLATFORM,
(cl_context_properties)(platforms.at(1)()),
0};
cl::Context context = cl::Context(devices,
context_properties,
NULL, NULL, &cl_err);
cl::Buffer buffer_in = cl::Buffer(context,
CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY,
N*N*sizeof(double),
in, &cl_err);
cl::Buffer buffer_out = cl::Buffer(context,
CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY,
N*N*sizeof(double),
out, &cl_err);
cl::CommandQueue queue = cl::CommandQueue(context, devices.at(0), 0, &cl_err);
std::ifstream sourceFile("vector_copy.cl");
std::string sourceCode((std::istreambuf_iterator<char>(sourceFile)),
std::istreambuf_iterator<char>());
cl::Program::Sources source(1, std::make_pair(sourceCode.c_str(),
sourceCode.length()+1));
cl::Program program(context, source, &cl_err);
cl_err = program.build(devices, NULL, NULL, NULL);
cl::Kernel kernel(program, "vector_copy", &cl_err);
cl_err = kernel.setArg(0, buffer_in);
cl_err = kernel.setArg(1, buffer_out);
cl_err = kernel.setArg(2, N);
cl::NDRange global(N);
cl::NDRange local(N);
start = std::clock();
for (int n=0; n < niter; n++) {
cl_err = queue.enqueueNDRangeKernel(kernel,
cl::NullRange,
global,
local,
NULL, NULL);
cl_err = queue.finish();
}
time = (std::clock() - start)/(double)CLOCKS_PER_SEC;
std::cout << "Time/iteration OpenCL (s) = " << time/(double)niter << std::endl;
return(0);
}
OpenCL kernel code:
__kernel void vector_copy(__global const double* restrict in,
__global double* restrict out,
const int N)
{
int i = get_global_id(0);
int j;
for (j=0; j<N; j++) {
out[j + N*i] = in[j + N*i];
}
}
C++ code:
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <cmath>
#include <ctime>
const int N = 8192;
int main(int argc, char **argv)
{
double *in = new double[N*N];
double *out = new double[N*N];
// Create the two input vectors
for(int i = 0; i < N; i++)
for (int j=0; j < N; j++) {
in[j + N*i] = i + j;
out[j + N*i] = 0.;
}
std::clock_t start;
int niter = 100;
start = std::clock();
for (int n=0; n < niter; n++) {
for (int i=0; i<N; i++)
for (int j=0; j<N; j++) {
out[j + N*i] = in[j + N*i];
}
}
double time = (std::clock() - start)/(double)CLOCKS_PER_SEC;
std::cout << "Time/iteration C = " << time/(double)niter << std::endl;
return(0);
}
The Intel OpenCL compiler is able to vectorize across work-items. Basically, a single function runs, for example, 8 work-items at the same time in different SSE registers.
Your particular kernel does not do that, but it doesn't really matter. I tested your program using Visual Studio 2010 and the latest Intel OpenCL for Applications. I was forced to reduce N from 8192 to 4096 because the integrated GPU I have reduces the maximum OpenCL buffer size to 128 MB, even if just the CPU is used.
My results: your OpenCL kernel gave me around 6956 MB/s of bandwidth. A trivially changed kernel (called with N*N as the global size and NULL as the local size, because if we don't care about local memory at all, then for CPUs we should leave it undefined):
__kernel void vector_copy2(__global const double* restrict in,
__global double* restrict out)
{
int i = get_global_id(0);
out[i] = in[i];
}
Gave about the same result (7006 MB/s). This kernel was actually vectorized across work-items, as can be verified using the Intel OpenCL kernel compiler. It produces one kernel for some multiple of work-items (like 4) and one kernel for a single work-item, then runs the vectorized kernel until it has to fall back to the single-item kernel for the last few work-items.
The C++ code gave 6494 MB/s, so it's quite in line. I don't think it would even be possible for ICC to make it 5x faster.
I noticed in your code you had platforms.at(1); what was platform 0 on your computer?
Remember that if you don't care about local memory at all (you don't call get_local_id in your kernels), you should treat the local size for enqueueNDRangeKernel as a simple magic parameter: either leave it as NULL or try to find a value that produces the fastest results.
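As a sketch, the enqueue call for the simplified kernel above might look like this (assuming kernel has been rebuilt from vector_copy2; the other names are from the question's host code):
// Global size of N*N work-items; local size left to the runtime.
cl_err = queue.enqueueNDRangeKernel(kernel,
                                    cl::NullRange,      // offset
                                    cl::NDRange(N * N), // global size
                                    cl::NullRange,      // local size: let the runtime decide
                                    NULL, NULL);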
The OpenCL code, even if optimized, will still perform the copy work-item by work-item, because the OpenCL compiler is only allowed to optimize on a per-work-item basis. The C++ case, on the other hand, will probably be optimized by the compiler into a memcpy() call (as the compiler is telling you).
If you disable the compiler optimizations it will perform much faster in the GPU.
BTW, is there a reason for this? You have memcpy() in C++ and clEnqueueCopyBuffer() in OpenCL for exactly this purpose. I think the latter is what you should use.
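A sketch of what that could look like with the C++ bindings already used in the host code (queue, buffer_in, buffer_out and N are from the question):
// Copy buffer_in into buffer_out directly, no kernel involved.
cl_err = queue.enqueueCopyBuffer(buffer_in, buffer_out,
                                 0, 0,                    // source and destination offsets
                                 N * N * sizeof(double)); // bytes to copy
cl_err = queue.finish();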