Why can a CUDA kernel access host memory? - c++

I directly access host memory (allocated with cudaMallocHost) in a CUDA kernel and get no error. Why is this?
I tried to get smarter from the documentation for cudaMallocHost:
Allocates size bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy*(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc().
Why do many CUDA programs add cudaMemcpy after cudaMallocHost?
#include <stdio.h>
#include <assert.h>

#define N 64

// cuda kernel access host mem a/b
__global__ void gpu(int *a, int *b, int *c_gpu) {
    int r = blockDim.x * blockIdx.x + threadIdx.x;
    int c = blockDim.y * blockIdx.y + threadIdx.y;
    if (r < N && c < N) {
        c_gpu[r * N + c] = a[r * N + c] + b[r * N + c];
    }
}

// cpu function
void cpu(int *a, int *b, int *c_cpu) {
    for (int r = 0; r < N; r++) {
        for (int c = 0; c < N; c++) {
            c_cpu[r * N + c] = a[r * N + c] + b[r * N + c];
        }
    }
}

int main() {
    int *a, *b, *c_cpu, *c_gpu, *c_gpu_cpu;
    size_t size = N * N * sizeof(int);

    cudaMallocHost(&a, size);
    cudaMallocHost(&b, size);
    cudaMallocHost(&c_cpu, size);
    cudaMallocHost(&c_gpu_cpu, size);
    cudaMalloc(&c_gpu, size);

    for (int r = 0; r < N; r++) {
        for (int c = 0; c < N; c++) {
            a[r * N + c] = r;
            b[r * N + c] = c;
            c_gpu_cpu[r * N + c] = 0;
            c_cpu[r * N + c] = 0;
        }
    }

    cpu(a, b, c_cpu);

    dim3 threads(16, 16, 1);
    dim3 blocks((N + threads.x - 1) / threads.x, (N + threads.y - 1) / threads.y, 1);
    gpu<<<blocks, threads>>>(a, b, c_gpu); // access cpu host mem
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("Error: %s\n", cudaGetErrorString(err));
    }
    cudaDeviceSynchronize();

    cudaFreeHost(a);
    cudaFreeHost(b);
    cudaFreeHost(c_cpu);
    cudaFreeHost(c_gpu_cpu);
    cudaFree(c_gpu);
}

Why do many CUDA programs add cudaMemcpy after cudaMallocHost?
Because many CUDA programs were written before unified memory appeared, and at that time cudaMallocHost simply allocated page-locked memory. That page-locked memory still requires an API call to copy it to the device. "...accessed directly by the device" means that the GPU can use DMA to read and write that memory across the PCI Express bus without the host memory manager needing to do anything, which is considerably faster than conventional pageable host memory.
As GPU and host architectures and operating systems evolved, it became possible for the GPU to directly access host memory on some systems through GPU hardware and driver support. That isn't universal, though. The code you have posted won't run correctly on every CUDA system ever made, even if it works correctly on yours (and your error checking is deficient, so it might not actually be working at all).
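A practical follow-up to that last point: before relying on direct device access to pinned host memory, it is worth querying the device's capabilities and checking errors both at launch time and after synchronization. Below is a minimal sketch, not the poster's code; it assumes a single device, and CHECK is a hypothetical helper macro:
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a CUDA error and bail out of main().
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            printf("CUDA error '%s' at %s:%d\n",                      \
                   cudaGetErrorString(e), __FILE__, __LINE__);        \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main() {
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    // Dereferencing a cudaMallocHost/cudaHostAlloc pointer inside a kernel
    // relies on mapped pinned memory and unified virtual addressing.
    printf("canMapHostMemory=%d unifiedAddressing=%d\n",
           prop.canMapHostMemory, prop.unifiedAddressing);

    int *h = nullptr, *d = nullptr;
    // On older runtimes cudaSetDeviceFlags(cudaDeviceMapHost) may be needed
    // before the first allocation; with UVA the mapping is usually implicit.
    CHECK(cudaHostAlloc((void **)&h, 64 * sizeof(int), cudaHostAllocMapped));
    // Device-side alias of the pinned allocation (equal to h under UVA).
    CHECK(cudaHostGetDevicePointer((void **)&d, h, 0));

    // ... launch a kernel with d here ...
    CHECK(cudaGetLastError());       // catches launch/configuration errors
    CHECK(cudaDeviceSynchronize());  // catches errors raised during execution

    CHECK(cudaFreeHost(h));
    return 0;
}
Checking the return value of cudaDeviceSynchronize() matters here, because a failed device-side access to the host pointer would only surface asynchronously, after the kernel has launched.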

Related

How to pass a vector's data to a CUDA kernel?

So I have this assignment about convolution, where I have to apply a .wav filter to another .wav file. I have to do this using CUDA. This is my CUDA kernel:
__global__ void MyConvolveCUDA(const double* A, const double* B, double* C, int n, int m) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    int min, max;

    if (i >= m - 1) min = i - m + 1; else min = 0;
    if (i < n - 1) max = i; else max = n - 1;

    if (j <= min) j = min;
    else if (j >= max) j = max;

    C[i] = A[i] * B[j - i];
}
and this is the function where I call it. I used a custom library for reading the audio files (they are read correctly and everything), so I'll simplify the audio-file part of the code:
void MyConvolveCUDA_Run() {
    // Let's say that 'filter' is the filter I want to apply to the 'audio' file. 'output' is the file I
    // want to export in the end. The '.samples' member accesses the samples of the audio file,
    // and the 'save' function saves the file using the given name.
    int n = audio.samples.size(),
        m = filter.samples.size();

    // These are the device copies of the data I want to process.
    double* audioCUDA = nullptr;
    double* filterCUDA = nullptr;
    double* outputCUDA = nullptr;

    cudaMalloc((void **)&audioCUDA, n * sizeof(double));
    cudaMalloc((void **)&filterCUDA, n * sizeof(double));
    cudaMalloc((void **)&outputCUDA, (n + m - 1) * sizeof(double));

    cudaMemcpy(audioCUDA, audio.samples[0].data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(filterCUDA, filter.samples[0].data(), m * sizeof(double), cudaMemcpyHostToDevice);

    MyConvolveCUDA<<<32, 32>>>(audioCUDA, filterCUDA, outputCUDA, n, m);
    cudaDeviceSynchronize();

    cudaMemcpy(output.samples[0].data(), outputCUDA, (n + m - 1) * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(audioCUDA); cudaFree(filterCUDA); cudaFree(outputCUDA);

    output.save("CUDA_output.wav");
}
Can you see what's going wrong? I want to check the arrays I pass to MyConvolveCUDA, but every time I try I get an access violation error.
Thanks in advance!
You are launching the CUDA kernel MyConvolveCUDA as MyConvolveCUDA<<<32,32>>>, which means you are launching 32 blocks, each with 32 threads (1,024 threads in total). In the kernel you use 2D thread indexing, but you launched a 1D grid of 1D blocks.
MyConvolveCUDA<<<M,N>>> is interpreted as
MyConvolveCUDA<<<dim3(M,1,1), dim3(N,1,1)>>>
where M is the number of blocks and N is the number of threads per block, i.e., threads are launched only in the x direction. For such a launch, threadIdx.y and blockIdx.y are always 0.
If you want to launch it in two dimensions, you should call the kernel with 2D grid and block dimensions, for example MyConvolveCUDA<<<dim3(M,N), dim3(M,N)>>>.
To check the arrays inside the kernel, you can print them like this:
int i = threadIdx.x + blockIdx.x * blockDim.x;
#if __CUDA_ARCH__ >= 200
    if (i == 0) {
        for (int iter = 0; iter < n; iter++)
            printf("%f ", A[iter]);
        for (int iter = 0; iter < m; iter++)
            printf("%f ", B[iter]);
    }
#endif
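On the launch configuration itself: since the output has n + m - 1 samples, a simple option is a 1D launch with one thread per output sample and a bounds check in the kernel. The sketch below is illustrative only, not the poster's kernel; it also accumulates over the whole overlap of the two signals, which the original kernel (a single product per output) would need as well:
__global__ void MyConvolveCUDA_1D(const double* A, const double* B, double* C,
                                  int n, int m)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // output sample index
    if (i >= n + m - 1) return;                     // bounds check

    // full convolution: C[i] = sum over the overlap of A and the reversed B
    int jmin = (i >= m - 1) ? (i - m + 1) : 0;
    int jmax = (i < n - 1) ? i : (n - 1);
    double acc = 0.0;
    for (int j = jmin; j <= jmax; ++j)
        acc += A[j] * B[i - j];
    C[i] = acc;
}

// possible launch covering all n + m - 1 outputs:
// int total = n + m - 1;
// MyConvolveCUDA_1D<<<(total + 255) / 256, 256>>>(audioCUDA, filterCUDA, outputCUDA, n, m);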

CUDA Vector Reduction to handle vectors of length less than 512?

I'm working through the parallel vector_reduction algorithm tutorial from NVIDIA, implementing the algorithm with the CUDA C++ API. I have implemented the algorithm, but it only works for vectors whose length is fixed at 512. I can't figure out how to get it working for vectors of fewer than 512 elements. I want it to work for arbitrary sizes, e.g., 324, 123, or 23.
#include <stdio.h>

#define NUM_ELEMENTS 512

__global__ void reduction(float *g_data, int n)
{
    __shared__ float partialSum[NUM_ELEMENTS];

    int tx = threadIdx.x;
    int i = tx + blockIdx.x * blockDim.x;
    if (i < n) {
        partialSum[tx] = g_data[i];
    }

    int stride;
    for (stride = blockDim.x/2; stride > 0; stride >>= 1) {
        __syncthreads();
        if (tx < stride) {
            partialSum[tx] += partialSum[tx + stride];
        }
    }

    if (tx == 0) {
        g_data[blockIdx.x] = partialSum[tx];
    }
}

float computeOnDevice(float* h_data, int num_elements)
{
    float* d_data = NULL;
    float result;

    // Memory allocation on device side
    cudaMalloc((void**)&d_data, sizeof(float)*num_elements);
    // Copy from host memory to device memory
    cudaMemcpy(d_data, h_data, num_elements * sizeof(float), cudaMemcpyHostToDevice );

    dim3 blockSize, gridSize;
    // Number of threads in each thread block
    blockSize = dim3(num_elements, 1, 1);
    // Number of thread blocks in grid
    gridSize = dim3(1, 1, 1);

    // Invoke the kernel
    reduction<<<gridSize, blockSize>>>(d_data, num_elements);

    // Copy from device memory back to host memory
    cudaMemcpy(&result, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaDeviceReset();
    return result;
}

int main() {
    float *data = new float[NUM_ELEMENTS];
    for (int i = 0; i < NUM_ELEMENTS; i++) data[i] = 1;

    float r = computeOnDevice(data, NUM_ELEMENTS);
    printf(" result = %f\n", r);
}
Your code is essentially correct; the problem is that your bit shifts don't account for the last part of the array. You can easily fix this by artificially extending the array to the next power of 2: the entire array is then reduced, and the extra "elements" (which don't actually exist) are simply ignored.
#include <math.h>

__global__ void reduction(float *g_data, int n)
{
    // figure out exponent of next larger power of 2
    int exponent = ceilf(log2f(n));
    // calculate next larger power of 2
    int size = (int)powf(2, exponent);

    __shared__ float partialSum[NUM_ELEMENTS];

    int tx = threadIdx.x;
    int i = tx + blockIdx.x * blockDim.x;
    if (i < n) {
        partialSum[tx] = g_data[i];
    }

    for (int stride = size / 2; stride > 0; stride >>= 1) {
        __syncthreads();
        if (tx < stride) {
            // all threads that run out of bounds do nothing
            // equivalent to adding 0
            if ((tx + stride) < n)
                partialSum[tx] += partialSum[tx + stride];
        }
    }

    if (tx == 0) {
        g_data[blockIdx.x] = partialSum[tx];
    }
}
Edit
Regarding your comment, this method of reduction will never work for an array that is reduced across multiple blocks. So for compute capability 1.0-1.3 the largest array you can reduce is 512 elements, and for compute capability >1.3 it is 1024 elements; that is the maximum number of threads per block.
This is because __shared__ memory is shared among the threads of a block, not across blocks. So to reduce an array scattered over multiple blocks, you would need to partition the array so that each block reduces a chunk, and then use __global__ memory to combine the values from all blocks. However, __global__ memory is roughly 10-20 times slower than the (on-chip) __shared__ memory, so once you start using many blocks this becomes very inefficient.
The alternative would be to have each thread process multiple indices; however, eventually your partialSum array won't fit into shared memory anymore and would spill into global memory anyway. That approach would also mean you can never use more than 512 (or 1024) threads, which defeats the purpose of CUDA: it depends on running a very large number of threads to hide latency and to make the expensive host-to-device memory transfer worthwhile.
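For reference, the usual way to scale past a single block is to give each block a grid-stride chunk, reduce that chunk in shared memory, and then combine the per-block partial sums with an atomic add. A rough sketch of that idea (not the tutorial's code; atomicAdd on float requires compute capability 2.0 or later, and *g_out must be zeroed before the launch):
#define THREADS 256  // threads per block, assumed to be a power of 2

__global__ void reduceArbitrary(const float *g_in, float *g_out, int n)
{
    __shared__ float partial[THREADS];
    int tx = threadIdx.x;

    // grid-stride loop: works for any n and any number of blocks
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tx; i < n; i += gridDim.x * blockDim.x)
        sum += g_in[i];
    partial[tx] = sum;
    __syncthreads();

    // standard in-block tree reduction
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tx < stride)
            partial[tx] += partial[tx + stride];
        __syncthreads();
    }

    // one atomic per block combines the partial sums
    if (tx == 0)
        atomicAdd(g_out, partial[0]);
}
The host then needs only a single cudaMemcpy of one float to read back the total.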

CUDA - Parallel Reduction Sum

I am trying to implement a parallel reduction sum in CUDA 7.5. I have been trying to follow the NVIDIA PDF that walks you through the initial algorithm and then steadily more optimised versions. I am currently filling every position of an array with the value 1 so that I can check that the output is correct, but I am getting a value of -842159451 for an array of size 64. I expect the kernel code to be correct, as I followed the exact code from NVIDIA. Here is my kernel:
__global__ void reduce0(int *input, int *output) {
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = input[i];
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) output[blockIdx.x] = sdata[0];
}
Here is my code calling the kernel, which is where I expect my problem to be:
int main()
{
    int numThreadsPerBlock = 1024;

    int *hostInput;
    int *hostOutput;
    int *deviceInput;
    int *deviceOutput;

    int numInputElements = 64;
    int numOutputElements; // number of elements in the output list, initialised below
    numOutputElements = numInputElements / (numThreadsPerBlock / 2);
    if (numInputElements % (numThreadsPerBlock / 2)) {
        numOutputElements++;
    }

    hostInput = (int *)malloc(numInputElements * sizeof(int));
    hostOutput = (int *)malloc(numOutputElements * sizeof(int));

    for (int i = 0; i < numInputElements; ++i) {
        hostInput[i] = 1;
    }

    const dim3 blockSize(numThreadsPerBlock, 1, 1);
    const dim3 gridSize(numOutputElements, 1, 1);

    cudaMalloc((void **)&deviceInput, numInputElements * sizeof(int));
    cudaMalloc((void **)&deviceOutput, numOutputElements * sizeof(int));

    cudaMemcpy(deviceInput, hostInput, numInputElements * sizeof(int), cudaMemcpyHostToDevice);

    reduce0<<<gridSize, blockSize>>>(deviceInput, deviceOutput);

    cudaMemcpy(hostOutput, deviceOutput, numOutputElements * sizeof(int), cudaMemcpyDeviceToHost);

    for (int ii = 1; ii < numOutputElements; ii++) {
        hostOutput[0] += hostOutput[ii]; // accumulates the sum in the first element
    }

    int sumGPU = hostOutput[0];
    printf("GPU Result: %d\n", sumGPU);

    std::string wait;
    std::cin >> wait;
    return 0;
}
I have also tried bigger and smaller array sizes for the input and I get the same result of a very large negative value no matter the size of the array.
Seems you are using a dynamically allocated shared array:
extern __shared__ int sdata[];
but you are not allocating it in the kernel invocation:
reduce0<<<gridSize, blockSize>>>(deviceInput, deviceOutput);
You have two options:
Option 1
Allocate the shared memory statically in the kernel, e.g.
constexpr int threadsPerBlock = 1024;
__shared__ int sdata[threadsPerBlock];
More often than not I find this the cleanest approach, as it works without a problem when you have multiple arrays in shared memory. The drawback is that while the size usually depends on the number of threads in the block, you need the size to be known at compile-time.
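One way around that compile-time restriction, sketched here with made-up names, is to pass the block size as a template parameter, so the static shared array still matches the number of threads while remaining a compile-time constant:
// Sketch: block size as a template parameter (hypothetical variant of reduce0).
template <int BLOCK_SIZE>
__global__ void reduce0_t(const int *input, int *output)
{
    __shared__ int sdata[BLOCK_SIZE];   // size known at compile time
    unsigned int tid = threadIdx.x;
    sdata[tid] = input[blockIdx.x * BLOCK_SIZE + tid];
    __syncthreads();
    for (int s = 1; s < BLOCK_SIZE; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) output[blockIdx.x] = sdata[0];
}

// launch with a block size fixed at compile time:
// reduce0_t<1024><<<gridSize, 1024>>>(deviceInput, deviceOutput);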
Option 2
Specify the amount of dynamically allocated shared memory in the kernel invocation.
reduce0<<<gridSize, blockSize, numThreadsPerBlock * sizeof(int)>>>(deviceInput, deviceOutput);
This will work for any value of numThreadsPerBlock (provided it is within the allowed range, of course). The drawback is that if you have multiple extern shared arrays, you need to figure out how to lay them out in that memory yourself, so that one does not overwrite another.
Note that there may be other problems in your code; I didn't test it. This is just something I spotted immediately upon glancing over it.
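To illustrate that drawback, here is one hedged way (the kernel name and sizes are made up for this sketch) to carve two logical arrays out of a single dynamic shared-memory allocation, with the combined size passed as the third launch parameter:
// Sketch: two logical arrays sharing one dynamic shared-memory block.
// Assumes the launch supplies blockDim.x * (sizeof(int) + sizeof(float)) bytes.
__global__ void twoSharedArrays()
{
    extern __shared__ unsigned char smem[];
    int   *ints   = reinterpret_cast<int *>(smem);
    float *floats = reinterpret_cast<float *>(smem + blockDim.x * sizeof(int));

    ints[threadIdx.x]   = threadIdx.x;        // placeholder use
    floats[threadIdx.x] = 0.5f * threadIdx.x; // placeholder use
    // ... rest of the kernel ...
}

// launch: twoSharedArrays<<<grid, block, block.x * (sizeof(int) + sizeof(float))>>>();
If the two arrays have different alignment requirements (say double and int), placing the more strictly aligned type first avoids misaligned accesses.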

Why is my CUDA implementation only as fast as my CPU implementation?

I created some code to do a 2D convolution of a 1300x1300 grayscale image with a 15x15 kernel, in standard C++ and in CUDA. Both versions:
CPU:
#include <iostream>
#include <exception>

#define N 1300
#define K 15
#define K2 ((K - 1) / 2)

template<int mx, int my>
inline int index(int x, int y)
{
    return x*my + y;
}

int main() {
    double *image = new double[N * N];
    double *kernel = new double[K * K];
    double *result = new double[N * N];

    for (int x=0; x<N; ++x)
        for (int y=0; y<N; ++y)
        {
            double r = 0;
            for(int i=0; i<K; ++i)
                for(int j=0; j<K; ++j)
                {
                    if (x + i - K2 >= 0 and
                        x + i - K2 < N and
                        y + j - K2 >= 0 and
                        y + j - K2 < N)
                    {
                        r += kernel[index<K,K>(i,j)] * image[index<N,N>(x+i-K2, y+j-K2)];
                    }
                }
            result[index<N,N>(x, y)] = r;
        }

    delete[] image;
    delete[] kernel;
    delete[] result;
}
GPU:
#include <iostream>
#include <exception>

// ignore, just for error handling
struct ErrorHandler {
    int d_line;
    char const *d_file;
    ErrorHandler(int line, char const *file) : d_line(line), d_file(file) {};
};

#define EH ErrorHandler(__LINE__, __FILE__)

ErrorHandler operator<<(ErrorHandler eh, cudaError_t err)
{
    if (err != cudaSuccess)
    {
        std::cerr << cudaGetErrorString( err ) << " in " << eh.d_file << " at line " << eh.d_line << '\n';
        throw std::exception();
    }
    return eh;
}
// end.

#define N 1300
#define K 15
#define K2 ((K - 1) / 2)

template<int mx, int my>
__device__ inline int index(int x, int y)
{
    return x*my + y;
}

__global__ void kernelkernel(double *image, double *kernel, double *result)
{
    int x = blockIdx.x;
    int y = blockIdx.y; // becomes: int y = threadIdx.x;

    double r = 0;
    for(int i=0; i<K; ++i)
        for(int j=0; j<K; ++j)
        {
            if (x + i - K2 >= 0 and
                x + i - K2 < N and
                y + j - K2 >= 0 and
                y + j - K2 < N)
            {
                r += kernel[index<K,K>(i,j)] * image[index<N,N>(x+i-K2, y+j-K2)];
            }
        }
    result[index<N,N>(x, y)] = r;
}

int main() {
    double *image = new double[N * N];
    double *kernel = new double[K * K];
    double *result = new double[N * N];

    double *image_cuda;
    double *kernel_cuda;
    double *result_cuda;
    EH << cudaMalloc((void **) &image_cuda, N*N*sizeof(double));
    EH << cudaMalloc((void **) &kernel_cuda, K*K*sizeof(double));
    EH << cudaMalloc((void **) &result_cuda, N*N*sizeof(double));

    EH << cudaMemcpy(image_cuda, image, N*N*sizeof(double), cudaMemcpyHostToDevice);
    EH << cudaMemcpy(kernel_cuda, kernel, K*K*sizeof(double), cudaMemcpyHostToDevice);

    dim3 grid ( N, N );
    kernelkernel<<<grid, 1>>>(image_cuda, kernel_cuda, result_cuda);
    // replace previous 2 statements with:
    // kernelkernel<<<N, N>>>(image_cuda, kernel_cuda, result_cuda);

    EH << cudaMemcpy(result, result_cuda, N*N*sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree( image_cuda );
    cudaFree( kernel_cuda );
    cudaFree( result_cuda );

    delete[] image;
    delete[] kernel;
    delete[] result;
}
I would expect the CUDA code to be a lot faster; however:
$ nvprof ./gpuversion
==17806== NVPROF is profiling process 17806, command: ./gpuversion
==17806== Profiling application: ./gpuversion
==17806== Profiling result:
Time(%) Time Calls Avg Min Max Name
99.89% 3.83149s 1 3.83149s 3.83149s 3.83149s kernelkernel(double*, double*, double*)
0.07% 2.6420ms 1 2.6420ms 2.6420ms 2.6420ms [CUDA memcpy DtoH]
0.04% 1.5111ms 2 755.54us 736ns 1.5103ms [CUDA memcpy HtoD]
And:
$ time ./cpuversion
real 0m3.382s
user 0m3.371s
sys 0m0.012s
The difference is statistically insignificant. The CUDA kernel takes approximately 3-4 seconds; why isn't it a lot faster? Is my code even running in parallel?
PS: I'm new to CUDA, so I could be missing something trivial.
SOLUTION
What I found out is that CUDA does not let you access memory willy-nilly from any block. I guess the general strategy of CUDA programming is:
allocate device memory and copy data from RAM to the GPU using cudaMalloc and cudaMemcpy
divide the workload among blocks and threads in such a way that the memory accessed by different blocks doesn't overlap much.
If there is overlap between the memory used by blocks, start each block by copying that memory into a shared array. Notice that:
the size of this array must be known at compile time
its size is limited
this memory is shared by the threads of ONE block, so __shared__ double foo[10] allocates 10 doubles for each BLOCK.
copy the memory needed by one block into the shared variables inside the kernel. Of course, you use the different threads to do this 'efficiently'
sync the threads, so that all data is there before it is used.
process the data and write the result to the output array of the kernel
sync again; I'm not sure why, but everyone on the internet is doing it :S
copy the GPU memory back to RAM
clean up the GPU memory.
This gives the following code. It is MEX code for MATLAB, computing the structural similarity (SSIM), which also works via a sliding kernel, but over two images and with a different aggregate than the dot product.
// author: Herbert Kruitbosch, CC: be nice, include my name in documentation/papers/publications when used
#include <matrix.h>
#include <mex.h>
#include <cmath>
#include <iostream>
#include <fstream>
#include <iostream>
#include <stdio.h>
static void HandleError(
cudaError_t err,
const char *file,
int line )
{
if (err != cudaSuccess)
{
printf( "%s in %s at line %d\n", cudaGetErrorString( err ), file, line );
exit( EXIT_FAILURE );
}
}
#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))
#define TILE_WIDTH 31
__device__ inline double sim(double v0, double v1, double c)
{
return (c + 2*v0*v1) / (c + v1*v1 + v0*v0);
}
__device__ inline int index(int rows, int cols, int row, int col)
{
return row + col*rows;
}
__global__ void ssimkernel(double *test, double *reference, const double * __restrict__ kernel, double *ssim, int k, int rows, int cols, int tile_batches_needed)
{
int radius = k / 2;
int block_width = TILE_WIDTH - k + 1;
__shared__ double tile_test [TILE_WIDTH][TILE_WIDTH];
__shared__ double tile_reference[TILE_WIDTH][TILE_WIDTH];
for(int offset=0; offset < tile_batches_needed; ++offset)
{
int dest = block_width*block_width*offset + threadIdx.y * block_width + threadIdx.x;
int destRow = dest / TILE_WIDTH;
int destCol = dest % TILE_WIDTH;
int srcRow = blockIdx.y * block_width + destRow - radius;
int srcCol = blockIdx.x * block_width + destCol - radius;
int src = srcCol * rows + srcRow;
if (destRow < TILE_WIDTH)
{
if (srcRow >= 0 and srcRow < rows and
srcCol >= 0 and srcCol < cols)
{
tile_test [destRow][destCol] = test [src];
tile_reference[destRow][destCol] = reference[src];
}
else
{
tile_test [destRow][destCol] = 0;
tile_reference[destRow][destCol] = 0;
}
}
}
__syncthreads();
double mean_test = 0;
double mean_reference = 0;
for(int i=0; i<k; ++i)
for(int j=0; j<k; ++j)
{
double w = kernel[i * k + j];
mean_test += w * tile_test [threadIdx.y+i][threadIdx.x+j];
mean_reference += w * tile_reference[threadIdx.y+i][threadIdx.x+j];
}
double var_test = 0;
double var_reference = 0;
double correlation = 0;
for(int i=0; i<k; ++i)
for(int j=0; j<k; ++j)
{
double w = kernel[i * k + j];
double a = (tile_test [threadIdx.y+i][threadIdx.x+j] - mean_test );
double b = (tile_reference[threadIdx.y+i][threadIdx.x+j] - mean_reference);
var_test += w * a * a;
var_reference += w * b * b;
correlation += w * a * b;
}
int destRow = blockIdx.y * block_width + threadIdx.y;
int destCol = blockIdx.x * block_width + threadIdx.x;
if (destRow < rows and destCol < cols)
ssim[destCol * rows + destRow] = sim(mean_test, mean_reference, 0.01) * (0.03 + 2*correlation) / (0.03 + var_test + var_reference);
__syncthreads();
}
template<typename T>
inline T sim(T v0, T v1, T c)
{
return (c + 2*v0*v1) / (c + v1*v1 + v0*v0);
}
inline int upperdiv(int a, int b) {
return (a + b - 1) / b;
}
void mexFunction(int nargout, mxArray *argout[], int nargin, const mxArray *argin[])
{
mwSize rows = mxGetDimensions(argin[0])[0];
mwSize cols = mxGetDimensions(argin[0])[1];
mwSize k = mxGetDimensions(argin[2])[0];
mwSize channels = mxGetNumberOfDimensions(argin[0]) <= 2 ? 1 : mxGetDimensions(argin[0])[2];
int dims[] = {rows, cols, channels};
argout[0] = mxCreateNumericArray(3, dims, mxDOUBLE_CLASS, mxREAL);
double *test = (double *)mxGetData(argin[0]);
double *reference = (double *)mxGetData(argin[1]);
double *gaussian = (double *)mxGetData(argin[2]);
double *ssim = (double *)mxGetData(argout[0]);
double *test_cuda;
double *reference_cuda;
double *gaussian_cuda;
double *ssim_cuda;
HANDLE_ERROR( cudaMalloc((void **) &test_cuda, rows*cols*sizeof(double)) );
HANDLE_ERROR( cudaMalloc((void **) &reference_cuda, rows*cols*sizeof(double)) );
HANDLE_ERROR( cudaMalloc((void **) &gaussian_cuda, k*k*sizeof(double)) );
HANDLE_ERROR( cudaMalloc((void **) &ssim_cuda, rows*cols*sizeof(double)) );
HANDLE_ERROR( cudaMemcpy(gaussian_cuda, gaussian, k*k*sizeof(double), cudaMemcpyHostToDevice) );
int block_width = TILE_WIDTH - k + 1;
int tile_batches_needed = upperdiv(TILE_WIDTH*TILE_WIDTH, block_width*block_width);
for(int c=0; c<channels; ++c)
{
HANDLE_ERROR( cudaMemcpy(test_cuda, test + rows*cols*c, rows*cols*sizeof(double), cudaMemcpyHostToDevice) );
HANDLE_ERROR( cudaMemcpy(reference_cuda, reference + rows*cols*c, rows*cols*sizeof(double), cudaMemcpyHostToDevice) );
dim3 dimGrid(upperdiv(cols, block_width), upperdiv(rows, block_width), 1);
dim3 dimBlock(block_width, block_width, 1);
ssimkernel<<<dimGrid, dimBlock>>>(test_cuda, reference_cuda, gaussian_cuda, ssim_cuda, k, rows, cols, tile_batches_needed);
HANDLE_ERROR( cudaMemcpy(ssim + rows*cols*c, ssim_cuda, rows*cols*sizeof(double), cudaMemcpyDeviceToHost) );
}
cudaFree( test_cuda );
cudaFree( reference_cuda );
cudaFree( gaussian_cuda );
cudaFree( ssim_cuda );
}
kernelkernel<<<grid, 1>>>
This is a significant issue; threads on NVIDIA GPUs work in warps of 32 threads. However, you've assigned only a single thread to each block, which means 31 of those 32 lanes sit idle while one thread does the work. And for kernels where you have the flexibility, you'll usually want several warps per block rather than just one.
You could get an immediate speedup by using N blocks and N threads per block, rather than N^2 blocks with one thread each.
Actually, N might be too big, since there's an upper limit on the number of threads per block. In that case you can choose a suitable M so that you use N/M threads per block and N * M blocks.
In fact, you'll probably get the best results by picking some M (I'm guessing 256 will be near optimal) and launching with L = ceiling(N*N/M) blocks and M threads per block. Each thread then reconstructs an index in [0, M*L) from its block and thread ID, and those whose index falls in [0, N*N) split that index into an x and y coordinate and do the work.
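A rough sketch of that flattened launch (a hypothetical variant of the question's kernel, reusing its N, K and K2 macros; one thread per output pixel):
__global__ void kernelkernel_flat(const double *image, const double *kernel,
                                  double *result)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // index in [0, M*L)
    if (idx >= N * N) return;                        // padding threads do nothing

    int x = idx / N;
    int y = idx % N;
    double r = 0;
    for (int i = 0; i < K; ++i)
        for (int j = 0; j < K; ++j)
            if (x + i - K2 >= 0 && x + i - K2 < N &&
                y + j - K2 >= 0 && y + j - K2 < N)
                r += kernel[i * K + j] * image[(x + i - K2) * N + (y + j - K2)];
    result[x * N + y] = r;
}

// launch, e.g. with M = 256 threads per block:
// kernelkernel_flat<<<(N * N + 255) / 256, 256>>>(image_cuda, kernel_cuda, result_cuda);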
Accessing global memory in a kernel is costly, because of its latency. A global memory request (both reading and writing) takes hundreds of clock cycles to complete. You want to minimise the amount of times global memory is accessed, and access it in contiguous blocks.
If each piece of data is accessed exactly once, there's nothing to do about the latency, but that's seldom the case. And definitely not the case in your code, where the kernel array is accessed by all threads in the same pattern, and a lot of image is accessed by multiple threads as well.
The solution is to start the kernel by fetching the data from high-latency global memory into low-latency shared memory. Shared memory is a block of memory on the multiprocessor, and its latency is comparable to that of registers. So most simple kernels follow a structure like the following (a bare-bones skeleton is sketched after the list):
Each thread fetches data from global memory to shared memory. You want to fetch data in contiguous sequences if possible, as global memory is accessed through transactions. If there's not enough data for all threads to fetch, leave some of them idle.
Threads operate on the data in shared memory.
Data is written from shared memory back to global memory in the same pattern as it was fetched in step 1.
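Here is the skeleton referred to above, with made-up names and a placeholder computation, just to show where the fetch, the barrier, the work and the write-back go (assumes a launch with TILE threads per block):
#define TILE 256

__global__ void stagedKernel(const double *g_in, double *g_out, int n)
{
    __shared__ double tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. fetch from global into shared memory with contiguous (coalesced) accesses
    tile[threadIdx.x] = (i < n) ? g_in[i] : 0.0;
    __syncthreads();

    // 2. operate on the data in shared memory (placeholder computation)
    double v = 2.0 * tile[threadIdx.x];

    // 3. write the result back to global memory in the same pattern
    if (i < n)
        g_out[i] = v;
}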
Shared memory is shared by all threads within a thread block, which leads us to the second big issue in your code: you're not using thread blocks at all. Threads in one block run on one multiprocessor, share shared memory, can be synchronised with each other, etc. You need to organise threads into blocks well to get the most out of them.
The grid of blocks is just a mechanism to be able to run more blocks at one invocation. All the goodies of parallel instruction execution and shared memory access are within a block. The grid of blocks is just "yeah, sorry, my data's so big a single block won't do, just run many of them."
You're doing the exact opposite: your blocks have one thread each, which means that in each step, only one thread from each warp runs on the multiprocessor (based on your device's compute capability and the number of warp schedulers available, this means something like 2–4 threads on one multiprocessor at most).
You'll have to re-structure your threads to mirror the data access patterns, and prefetch data into shared memory. This will give you the performance boost you expect.
The above is just a short summary. Refer to the CUDA programming guide for details on block organisation, shared memory, and global memory transactions.
If you use global memory in CUDA like this, the data accesses end up serialized in something like a queue, and you get an almost sequential solution, not a parallel one.
Also, transferring a large dataset from RAM to GPU memory takes a lot of time (the bus bandwidth is limited).
So I think you have to somehow distribute your data across the computation units of your GPU (partition it into shared memory).
Check this for a solution showing how to improve GPU memory usage in a case similar to yours.

count3's in cuda is very slow

I have written a small program in CUDA that counts how many 3's there are in a C array and prints the count.
#include <stdio.h>
#include <assert.h>
#include <cuda.h>
#include <cstdlib>

__global__ void incrementArrayOnDevice(int *a, int N, int *count)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    //__shared__ int s_a[512]; // one for each thread
    //s_a[threadIdx.x] = a[id];

    if( id < N )
    {
        //if( s_a[threadIdx.x] == 3 )
        if( a[id] == 3 )
        {
            atomicAdd(count, 1);
        }
    }
}

int main(void)
{
    int *a_h;   // host memory
    int *a_d;   // device memory
    int N = 16777216;

    // allocate array on host
    a_h = (int*)malloc(sizeof(int) * N);
    for(int i = 0; i < N; ++i)
        a_h[i] = (i % 3 == 0 ? 3 : 1);

    // allocate arrays on device
    cudaMalloc(&a_d, sizeof(int) * N);

    // copy data from host to device
    cudaMemcpy(a_d, a_h, sizeof(int) * N, cudaMemcpyHostToDevice);

    // do calculation on device
    int blockSize = 512;
    int nBlocks = N / blockSize + (N % blockSize == 0 ? 0 : 1);
    printf("number of blocks: %d\n", nBlocks);

    int count;
    int *devCount;
    cudaMalloc(&devCount, sizeof(int));
    cudaMemset(devCount, 0, sizeof(int));

    incrementArrayOnDevice<<<nBlocks, blockSize>>> (a_d, N, devCount);

    // retrieve result from device
    cudaMemcpy(&count, devCount, sizeof(int), cudaMemcpyDeviceToHost);

    printf("%d\n", count);

    free(a_h);
    cudaFree(a_d);
    cudaFree(devCount);
}
The result I get is:
real 0m3.025s
user 0m2.989s
sys 0m0.029s
When I run it on the CPU with 4 threads I get:
real 0m0.101s
user 0m0.100s
sys 0m0.024s
Note that the GPU is an old one - I don't know the exact model because I do not have root access to it, but the OpenGL version it runs is 1.2 using the MESA driver.
Am I doing something wrong? What can I do to make it run faster?
Note: I have tried using buckets for each block (so the atomicAdd()s would be reduced for each one) but I get exactly the same performance.
I have also tried copying the 512 integers assigned to each block into a shared block of memory (you can see it in the comments), and the time is the same again.
This is in response to your question "What can I do to make it run faster?" As I mentioned in the comments, there are issues (probably) with the timing methodology, and the main suggestion I have for speed improvement is to use a "classical parallel reduction" algorithm. The following code implements a better (in my opinion) timing measurement, and also converts your kernel to a reduction style kernel:
#include <stdio.h>
#include <assert.h>
#include <cstdlib>

#define N (1<<24)
#define nTPB 512
#define NBLOCKS 32

__global__ void incrementArrayOnDevice(int *a, int n, int *count)
{
    __shared__ int lcnt[nTPB];
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    int lcount = 0;
    while (id < n) {
        if (a[id] == 3) lcount++;
        id += gridDim.x * blockDim.x;
    }
    lcnt[threadIdx.x] = lcount;
    __syncthreads();
    int stride = blockDim.x;
    while(stride > 1) {
        // assume blockDim.x is a power of 2
        stride >>= 1;
        if (threadIdx.x < stride) lcnt[threadIdx.x] += lcnt[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(count, lcnt[0]);
}

int main(void)
{
    int *a_h;   // host memory
    int *a_d;   // device memory
    cudaEvent_t gstart1, gstart2, gstop1, gstop2, cstart, cstop;
    float etg1, etg2, etc;

    cudaEventCreate(&gstart1);
    cudaEventCreate(&gstart2);
    cudaEventCreate(&gstop1);
    cudaEventCreate(&gstop2);
    cudaEventCreate(&cstart);
    cudaEventCreate(&cstop);

    // allocate array on host
    a_h = (int*)malloc(sizeof(int) * N);
    for(int i = 0; i < N; ++i)
        a_h[i] = (i % 3 == 0 ? 3 : 1);

    // allocate arrays on device
    cudaMalloc(&a_d, sizeof(int) * N);

    int blockSize = nTPB;
    int nBlocks = NBLOCKS;
    printf("number of blocks: %d\n", nBlocks);

    int count;
    int *devCount;
    cudaMalloc(&devCount, sizeof(int));
    cudaMemset(devCount, 0, sizeof(int));

    // copy data from host to device
    cudaEventRecord(gstart1);
    cudaMemcpy(a_d, a_h, sizeof(int) * N, cudaMemcpyHostToDevice);
    cudaMemset(devCount, 0, sizeof(int));
    cudaEventRecord(gstart2);

    // do calculation on device
    incrementArrayOnDevice<<<nBlocks, blockSize>>> (a_d, N, devCount);
    cudaEventRecord(gstop2);

    // retrieve result from device
    cudaMemcpy(&count, devCount, sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(gstop1);
    printf("GPU count = %d\n", count);

    int hostCount = 0;
    cudaEventRecord(cstart);
    for (int i=0; i < N; i++)
        if (a_h[i] == 3) hostCount++;
    cudaEventRecord(cstop);
    printf("CPU count = %d\n", hostCount);

    cudaEventSynchronize(cstop);
    cudaEventElapsedTime(&etg1, gstart1, gstop1);
    cudaEventElapsedTime(&etg2, gstart2, gstop2);
    cudaEventElapsedTime(&etc, cstart, cstop);

    printf("GPU total time = %fs\n", (etg1/(float)1000) );
    printf("GPU compute time = %fs\n", (etg2/(float)1000));
    printf("CPU time = %fs\n", (etc/(float)1000));

    free(a_h);
    cudaFree(a_d);
    cudaFree(devCount);
}
When I run this on a reasonably fast GPU (a Quadro 5000, a little slower than a Tesla M2050) I get the following:
number of blocks: 32
GPU count = 5592406
CPU count = 5592406
GPU total time = 0.025714s
GPU compute time = 0.000793s
CPU time = 0.017332s
We see that the GPU is substantially faster than this (naive, single-threaded) CPU implementation for the compute portion. When we add in the cost to transfer the data, the GPU version is slower but is not 30x slower.
By way of comparison, when I timed your original algorithm, I got numbers like this:
GPU total time = 0.118131s
GPU compute time = 0.093213s
My system config for this was Xeon X5560 CPU, RHEL 5.5, CUDA 5.0, Quadro5000 GPU.