C++ class DLL with CUDA member?

I have a C++ class-based DLL, and I'd like to convert some of the class member functions to CUDA-based operations.
I am using VS2012, Windows 7, CUDA 6.5, sm_20.
Say the original SuperProjector.h file looks like this:
class __declspec(dllexport) SuperProjector
{
public:
    SuperProjector() {};
    ~SuperProjector() {};
    void sumVectors(float* c, float* a, float* b, int N);
};
and the original sumVectors() function in SuperProjector.cpp:
void SuperProjector::sumVectors(float* c, float* a, float* b, int N)
{
    for (int n = 0; n < N; n++)
        c[n] = a[n] + b[n];
}
I am stuck on how I should convert sumVectors() to CUDA. Specifically:
I read some posts saying that adding the __global__ or __device__ keywords in front of class members will work, but does that mean I need to change the extension of the .cpp file to .cu?
I also tried creating a CUDA project from scratch, but it seems VS2012 does not give me the option of creating a DLL once I choose to create a CUDA project.
I am very confused about the best way to convert some of the members of this C++ class-based DLL into CUDA kernel functions. I would appreciate any ideas, or better yet, some very simple examples.

Create a CUDA project, let's call it cudaSuperProjector, and add two files: cudaSuperProjector.cu and cudaSuperProjector.h.
cudaSuperProjector.h
class __declspec(dllexport) cudaSuperProjector {
public:
    cudaSuperProjector() { }
    ~cudaSuperProjector() { }
    void sumVectors(float* c, float* a, float* b, int N);
};
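(A hedged aside, not part of the original answer: this header always declares the class with dllexport, and client code compiles the same header. A common pattern is an export/import switch; CUDASP_EXPORTS and CUDASP_API here are hypothetical names, with the macro defined only in the DLL project:
// Hypothetical export/import switch; define CUDASP_EXPORTS only when building the DLL.
#ifdef CUDASP_EXPORTS
#define CUDASP_API __declspec(dllexport)
#else
#define CUDASP_API __declspec(dllimport)
#endif
class CUDASP_API cudaSuperProjector { /* ... as above ... */ };
In practice the plain dllexport version also links fine in this walkthrough, since the client only needs the import library; the macro form is just the conventional way to share one header.)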
cudaSuperProjector.cu
#include <stdio.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "cudaSuperProjector.h"
__global__ void addKernel(float *c, const float *a, const float *b) {
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}
// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(float *c, const float *a, const float *b, unsigned int size) {
    float *dev_a = 0;
    float *dev_b = 0;
    float *dev_c = 0;
    cudaError_t cudaStatus;
    // Choose which GPU to run on; change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    // Allocate GPU buffers for three vectors (two input, one output).
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(float));
    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(float));
    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(float));
    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(float), cudaMemcpyHostToDevice);
    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);
    // Check for any errors launching the kernel.
    cudaStatus = cudaGetLastError();
    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(float), cudaMemcpyDeviceToHost);
    // Free the device buffers before returning so repeated calls don't leak GPU memory.
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return cudaStatus;
}
void cudaSuperProjector::sumVectors(float* c, float* a, float* b, int N) {
    cudaError_t cudaStatus = addWithCuda(c, a, b, N);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSuperProjector::sumVectors failed!");
    }
}
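One caveat about the sketch above: addWithCuda assigns cudaStatus after every call, but only the last value is ever returned, so an early failure is silently overwritten. A minimal checking helper (my addition, with a hypothetical name checkCuda, not part of the original answer) could look like this:
// Minimal error-checking helper (an assumption, not from the original answer):
// report a failure where it happens so the caller can bail out early.
static bool checkCuda(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}
// Example use inside addWithCuda:
//   if (!checkCuda(cudaMalloc((void**)&dev_c, size * sizeof(float)), "cudaMalloc dev_c"))
//       return cudaErrorMemoryAllocation;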
Note: in the properties of the file cudaSuperProjector.cu, Item Type should be set to CUDA C/C++.
Go to the project properties and, under General, set Configuration Type to Dynamic Library (.dll). Now everything needed to create the library is ready. Compile this project, and in the output folder you will find cudaSuperProjector.dll and cudaSuperProjector.lib. Create a directory cudaSuperProjector\lib and copy cudaSuperProjector.dll and cudaSuperProjector.lib there. Also create cudaSuperProjector\include and copy cudaSuperProjector.h into it.
Create another Visual C++ project, let's call it SuperProjector, and add the file SuperProjector.cpp to it.
SuperProjector.cpp
#include <stdio.h>
#include "cudaSuperProjector/cudaSuperProjector.h"
int main(int argc, char** argv) {
    float a[6] = { 0, 1, 2, 3, 4, 5 };
    float b[6] = { 1, 2, 3, 4, 5, 6 };
    float c[6] = { };
    cudaSuperProjector csp;
    csp.sumVectors(c, a, b, 6);
    printf("c = {%f, %f, %f, %f, %f, %f}\n",
           c[0], c[1], c[2], c[3], c[4], c[5]);
    return 0;
}
In the project properties, add the path to the .dll and .lib files to VC++ Directories -> Library Directories (for example D:\cudaSuperProjector\lib;), and in VC++ Directories -> Include Directories add the path to the header (for example D:\cudaSuperProjector\include;). Then go to Linker -> Input and add cudaSuperProjector.lib;.
Now your project should compile fine, but when you run it, it will show the error:
The program can't start because cudaSuperProjector.dll is missing from
your computer. Try reinstalling the program to fix this problem.
You need to copy cudaSuperProjector.dll to the output folder of the project, so it will be under the same folder as SuperProjector.exe. You can do it manually or add
copy D:\cudaSuperProjector\lib\cudaSuperProjector.dll $(SolutionDir)$(Configuration)\
in Build Events -> Post-Build Events -> Command Line,
where $(SolutionDir)$(Configuration)\ is output path for solution (see Configuration Properties -> General -> Output Directory).

Related

nvcc cuda from command prompt not using gpu

I am trying to run a CUDA program from the command prompt using nvcc, but it seems the GPU code is not running as expected. The exact same code runs successfully in Visual Studio and produces the expected output.
nvcc -arch=sm_60 -std=c++11 -o test.exe test.cu
test.exe
Environment:
Windows 10,
NVIDIA Quadro K4200,
CUDA 10.2
Source Code
#include <stdio.h>
#include <stdlib.h>
#include <vector>
#include <iostream>
/* this is the vector addition kernel.
   :inputs: n -> size of vector, integer
            a -> constant multiple, float
            x -> input 'vector', constant pointer to float
            y -> input and output 'vector', pointer to float */
__global__ void saxpy(int n, float a, const float x[], float y[])
{
    int id = threadIdx.x + blockDim.x*blockIdx.x; /* this thread index replaces the for loop */
    // make sure id does not run past the end of the arrays
    if (id < n) {
        y[id] += a*x[id];
    }
}
int main()
{
    int N = 256;
    // create device pointers
    float *d_x, *d_y;
    const float a = 2.0f;
    // allocate and initialize memory on host
    std::vector<float> x(N, 1.f);
    std::vector<float> y(N, 1.f);
    // allocate our memory on GPU
    cudaMalloc(&d_x, N*sizeof(float));
    cudaMalloc(&d_y, N*sizeof(float));
    // memory transfer!
    cudaMemcpy(d_x, x.data(), N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), N*sizeof(float), cudaMemcpyHostToDevice);
    // Launch the kernel! In this configuration there is 1 block with 256 threads.
    // In general, use gridDim = (N + 255)/256 so all N elements are covered.
    saxpy<<<1, 256>>>(N, a, d_x, d_y);
    // transferring memory back!
    cudaMemcpy(y.data(), d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
    std::cout << y[0] << std::endl;
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
Output
1
Expected Output
3
Things I tried
When I first tried to compile with nvcc, it failed with the same error discussed here:
Cuda compilation error: class template has already been defined
So I tried the suggested solution there, pointing the build at the MSVC host compiler directory (D:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.22.27905\bin\Hostx64\x64), and now it compiles and runs, but the output is not as expected.
"Also, -arch=sm_60 is an incorrect arch specification for a Quadro K4200. It should be -arch=sm_30." (Robert Crovella)

How do you allocate GPU memory in a separate CUDA function?

I'm new to CUDA and sure that I'm doing something that's simple enough to fix, but I'm also not sure exactly what to search for to find an answer. I've tried looking around, but to no avail.
I have a few functions in my code that I want to perform matrix operations with, so instead of writing the allocation code multiple times, I want to use a function to do that for me. My issue is that the memory location is not being passed back to the function calling my MatrixInitCUDA function.
If I directly allocate the memory in my matrix functions it works as expected, but the issue I'm running into is that my pointer to device memory is only being assigned to the pointer inside of the MatrixInitCUDA function.
Initially I thought that there might have been some kind of type conversion of the arguments, so I included the typeinfo header and printed out the type of the device argument before and after cudaMalloc (no change, not surprising). I've tried passing in double pointers for the device matrix arguments, but that doesn't seem to work either, although I'm not sure I did it properly.
// Compile using nvcc <file> -lcublas -o <output>
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <typeinfo>
// Define block size for thread allocation
#define BLOCK_DIM 32
#define N 10
typedef struct _matrixSize // Optional command-line multiplier for matrix sizes
{
    unsigned int A_height, A_width, B_height, B_width, C_height, C_width;
} MatrixSize;

void SetMatrixSize(MatrixSize *matrixSize,
                   unsigned int widthA, unsigned int heightA,
                   unsigned int widthB, unsigned int heightB,
                   unsigned int widthC, unsigned int heightC)
{
    matrixSize->A_height = heightA;
    matrixSize->A_width = widthA;
    matrixSize->B_height = heightB;
    matrixSize->B_width = widthB;
    matrixSize->C_height = heightC;
    matrixSize->C_width = widthC;
}
void MatrixInitCUDA(int argc, char **argv, int &devID, MatrixSize *matrixSize,
                    float *host_matrixA, float *host_matrixB, float *host_matrixC,
                    float *dev_matrixA, float *dev_matrixB, float *dev_matrixC)
{
    // Assign CUDA variables
    devID = 0;
    cudaGetDevice(&devID);
    cudaError_t err;
    // Assign size variables
    size_t matrixA_size = matrixSize->A_height * matrixSize->A_width * sizeof(float);
    printf("Allocation size: %d\tMatrix Size: %d\n", (int) matrixA_size, matrixSize->A_height * matrixSize->A_width);
    size_t matrixB_size = matrixSize->B_height * matrixSize->B_width * sizeof(float);
    size_t matrixC_size = matrixSize->C_height * matrixSize->C_width * sizeof(float);
    printf("PRE ALLOC TYPE: %s\n", typeid(typeof(dev_matrixA)).name());
    // Allocate memory on GPU
    err = cudaMalloc((void **) &dev_matrixA, matrixA_size);
    printf("POST ALLOC TYPE: %s\n", typeid(typeof(dev_matrixA)).name());
    printf("DEV A POST ALLOC: %p\n", dev_matrixA);
    if (err != cudaSuccess) printf("Allocate matrix A: %s\n", cudaGetErrorString(err));
    err = cudaMalloc((void **) &dev_matrixB, matrixB_size);
    if (err != cudaSuccess) printf("Allocate matrix B: %s\n", cudaGetErrorString(err));
    err = cudaMalloc((void **) &dev_matrixC, matrixC_size);
    if (err != cudaSuccess) printf("Allocate matrix C: %s\n", cudaGetErrorString(err));
    // Copy data from host PC to GPU
    err = cudaMemcpy(dev_matrixA, host_matrixA, matrixA_size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) printf("Copy matrix A to GPU: %s\n", cudaGetErrorString(err));
    err = cudaMemcpy(dev_matrixB, host_matrixB, matrixB_size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) printf("Copy matrix B to GPU: %s\n", cudaGetErrorString(err));
    err = cudaMemcpy(dev_matrixC, host_matrixC, matrixC_size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) printf("Copy matrix C to GPU: %s\n", cudaGetErrorString(err));
}
int main(int argc, char **argv)
{
    // Create memory for Layer 1, Layer 2, Layer 3 vectors
    // float *layer1 = malloc(784*sizeof(floats)))
    // Create memory for Weight 1->2, Weight 2->3 matrices
    // Layer 1 will read from file for input (X) values
    // Layer 2 and 3 will be calculated
    int devID = 0;
    cudaGetDevice(&devID);
    // Testing hadamard product, init function, and set matrix size function
    float *host_A, *host_B, *host_C, *dev_A = NULL, *dev_B = NULL, *dev_C = NULL;
    MatrixSize *mallocTest = (MatrixSize *) calloc(sizeof(MatrixSize), 1);
    size_t calcSize = N * N * sizeof(float);
    host_A = (float *) calloc(calcSize, 1);
    host_B = (float *) calloc(calcSize, 1);
    host_C = (float *) calloc(calcSize, 1);
    SetMatrixSize(mallocTest, N, N, N, N, N, N);
    printf("DEV A PRE ALLOC: %p\n", dev_A);
    // Initialize memory on GPU
    MatrixInitCUDA(argc, argv, devID, mallocTest,
                   host_A, host_B, host_C,
                   dev_A, dev_B, dev_C);
    printf("DEV A POST INIT: %p\n", dev_A);
    return 0;
}
Here's the output I get if I compile and run this code:
DEV A PRE ALLOC: (nil)
Allocation size: 400 Matrix Size: 100
PRE ALLOC TYPE: Pf
POST ALLOC TYPE: Pf
DEV A POST ALLOC: 0x10208400000
DEV A POST INIT: (nil)
There are multiple ways to achieve the desired behavior.
Method 1
One way is to modify MatrixInitCUDA to accept double pointers (**) for the device pointers and change the code as follows:
Modify the function signature:
void MatrixInitCUDA(int argc, char **argv, int &devID, MatrixSize *matrixSize,
                    float *host_matrixA, float *host_matrixB, float *host_matrixC,
                    float **dev_matrixA, float **dev_matrixB, float **dev_matrixC)
{
}
Allocate device memory as follows inside MatrixInitCUDA (note that the later cudaMemcpy calls must then dereference the double pointers, e.g. cudaMemcpy(*dev_matrixA, host_matrixA, matrixA_size, cudaMemcpyHostToDevice)):
err = cudaMalloc((void **) dev_matrixA, matrixA_size);
Call MatrixInitCUDA from main like this:
MatrixInitCUDA(argc, argv, devID, mallocTest,
               host_A, host_B, host_C,
               &dev_A, &dev_B, &dev_C);
Method 2
My personal favorite: don't do any of the above, and instead modify the function signature to accept references to the device pointers, as follows:
void MatrixInitCUDA(int argc, char **argv, int &devID, MatrixSize *matrixSize,
                    float *host_matrixA, float *host_matrixB, float *host_matrixC,
                    float *&dev_matrixA, float *&dev_matrixB, float *&dev_matrixC)
{
}
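For completeness, a minimal self-contained sketch of Method 2 (my illustration with shortened, hypothetical names, not the asker's code):
#include <cstdio>
#include <cuda_runtime.h>

// The reference parameter lets cudaMalloc's result propagate back to the caller.
void initDevice(float *&dev_A, size_t bytes) {
    cudaMalloc((void **)&dev_A, bytes);
}

int main() {
    float *dev_A = NULL;
    initDevice(dev_A, 100 * sizeof(float));
    printf("dev_A after init: %p\n", (void *)dev_A); // now a non-NULL device address
    cudaFree(dev_A);
    return 0;
}
Nothing else inside the function body needs to change with this method, which is why it is the least intrusive fix.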

Cuda "invalid argument" 2d array - Cellular automata

I'm trying to calculate a 2D cellular-automata redistribution using CUDA. I'm completely new to it, so I have no idea what I'm doing wrong. I've tried many solutions that I've seen here, but all give "invalid argument" when I call the kernel.
Here is a simplified version of the kernel:
//kernel definition
__global__ void stepCalc(float B[51][51], int L, int flag, float m, float en)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    float g = B[i][j] - 0.25*(B[i+1][j] + B[i-1][j] + B[i][j+1] + B[i][j-1]);
    flag = 0;
    if (i < L-2 && j < L-2 && i > 2 && j > 2 && abs(g) > m)
    {
        flag = 1;
        en += -16*g*g + 8*B[i][j]*abs(g);
        B[i][j] += -4*f*g;
        B[i+1][j] += f*g;
        B[i-1][j] += f*g;
        B[i][j+1] += f*g;
        B[i][j-1] += f*g;
    }
}
The main function looks like this:
#define L 50
float B[L+1][L+1];
//initialize B[i][j]
float g=0;
int flag = 1;
float m=0.1;
float en = 0;
while (flag==1)
{
float (*dB)[L+1];
int *dFlag=NULL;
float *dEn=NULL;
cudaMalloc((void **)&dFlag,sizeof(int));
cudaMalloc((void **)&dEn,sizeof(float));
cudaMalloc((void **)&dB, ((L+1)*(L+1))*sizeof(float));
cudaMemcpy(dB, B, sizeB, cudaMemcpyHostToDevice);
cudaMemcpy(dFlag, &flag, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(dEn, &en, sizeof(float), cudaMemcpyDeviceToHost);
dim3 threadsPerBlock(16,16);
dim3 numBlocks((L+1)/threadsPerBlock.x,(L+1)/threadsPerBlock.y);
stepCalc<<<numBlocks, threadsPerBlock>>>(dB, L, dflag, m, dEn);
GPUerrchk(cudaPeekAtLastError()); //gives "invalid argument" at this line
cudaMemcpy(B, (dB), sizeB, cudaMemcpyDeviceToHost);
cudaMemcpy(&flag, dFlag, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(&en, dEn, sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(dB);
cudaFree(dFlag);
cudaFree(dEn);
}
I need to extract the new array B, the flag value and the sum 'en' over all threads. Am I even close to how a solution should look? Is it even possible? I've also tried making the host array B as float** B with no luck.
There are various problems with your code.
You may be overlooking the difference between passing a value to a kernel and passing a pointer:
__global__ void stepCalc(float B[51][51], int L, int flag, float m, float en)
                               ^                      ^
                               |                      |
                           a pointer              a value
We'll come back to B in a moment, but for values like flag and en, passing these by value to a kernel has similar implications to passing by value to a C function: it is a one-way communication path. Since it's evident from your code that you want to use the values modified by the kernel later in host code, you will need to pass pointers instead. In a few cases you have already allocated pointers for this purpose, so you have an additional type of error in that, in some cases (dFlag), you are passing a pointer whereas the kernel definition expects a value.
Regarding B, passing a 2D array from host to device can be more difficult than you might initially expect, due to the deep copy problem. Without covering all that ground here, search on "CUDA 2D array" in the upper right hand corner of this page, and you'll get a lot of information about it and various ways to deal with it. Since you seem to be willing to consider an array of fixed width (known at compile-time), we can simplify the handling of a 2D array by leveraging the compiler to help us with a particular typedef.
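In outline, the typedef trick looks like this (a sketch only; the complete reworked code below uses the same approach, with kern as a stand-in kernel name):
// A row type of fixed, compile-time width lets B[i][j] index a flat allocation.
typedef float farray[51];          // one 51-element row
__global__ void kern(farray *B);   // B[i][j] now compiles to flat i*51 + j indexing
// host side: farray *dB; cudaMalloc(&dB, 51 * sizeof(farray)); one contiguous block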
When you're having trouble with a CUDA code, it's good practice to do rigorous CUDA error checking throughout your code, not just in one place. One reason for this is that a CUDA error incurred in a particular place will often be returned at any subsequent place in the code. This makes it confusing if you don't check every CUDA API call, as a particular "invalid argument" error might not be due to the kernel itself, but to some API call that occurred previously.
You typically don't want cudaMalloc operations in a data-processing while loop. These are normally operations you do once, at the beginning of your code. Doing the cudaMalloc at each iteration of the while loop has several negative effects: you may eventually run out of memory (although you have cudaFree statements, so perhaps not), you are effectively throwing away your data at each iteration, and it will negatively impact your performance.
You have some of your cudaMemcpy transfer directions wrong, like here:
cudaMemcpy(dFlag, &flag, sizeof(int), cudaMemcpyDeviceToHost);
Setting flag to zero in your kernel code will be problematic. Warps can execute in any order, and after some warps have already set flag to 1 later in the kernel, other warps could begin executing and set flag to zero again. This is probably not what you want. One possible fix is to set flag to zero before executing the kernel (i.e. in host code, and copy it to the device).
Your kernel will generate out-of-bounds indexing here:
float g=B[i][j]-0.25*(B[i+1][j]+B[i-1][j]+B[i][j+1]+B[i][j-1]);
(just ask yourself what happens when i=0 and j=0). The fix for this is to move this line of code inside the if-check you have for bounds checking right after it.
Your kernel uses a variable f which is defined nowhere that I can see, for example here:
B[i+1][j]+=f*g;
The following code is my attempt to rework your code, create a complete example, and remove the above issues. It doesn't do anything useful, but it compiles without errors and runs without errors for me. I haven't provided any data, so it's just a proof-of-concept at this point. I'm sure it still contains data processing errors.
#include <stdio.h>
#define my_L 50
typedef float farray[my_L+1];
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                    msg, cudaGetErrorString(__err), \
                    __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)
//kernel definition
__global__ void stepCalc(farray B[], int L, int *flag, float m, float *en)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    //float g=B[i][j]-0.25*(B[i+1][j]+B[i-1][j]+B[i][j+1]+B[i][j-1]);
    // flag = 0;
    float f = 1.0f;
    if (i < L-2 && j < L-2 && i > 2 && j > 2){
        float g = B[i][j] - 0.25*(B[i+1][j] + B[i-1][j] + B[i][j+1] + B[i][j-1]);
        if (abs(g) > m)
        {
            *flag = 1;
            *en += -16*g*g + 8*B[i][j]*abs(g);
            B[i][j] += -4*f*g;
            B[i+1][j] += f*g;
            B[i-1][j] += f*g;
            B[i][j+1] += f*g;
            B[i][j-1] += f*g;
        }
    }
}
int main(){
    farray B[my_L+1];
    //initialize B[i][j]
    farray *dB;
    int flag = 1;
    float m = 0.1;
    float en = 0;
    int *dFlag = NULL;
    float *dEn = NULL;
    cudaMalloc((void **)&dFlag, sizeof(int));
    cudaCheckErrors("1");
    cudaMalloc((void **)&dEn, sizeof(float));
    cudaCheckErrors("2");
    size_t sizeB = (my_L+1)*sizeof(farray);
    cudaMalloc((void **)&dB, sizeB);
    cudaCheckErrors("3");
    cudaMemcpy(dB, B, sizeB, cudaMemcpyHostToDevice);
    cudaCheckErrors("4");
    cudaMemcpy(dEn, &en, sizeof(float), cudaMemcpyHostToDevice);
    cudaCheckErrors("5");
    dim3 threadsPerBlock(16,16);
    dim3 numBlocks((my_L+1)/threadsPerBlock.x, (my_L+1)/threadsPerBlock.y);
    while (flag == 1)
    {
        flag = 0;
        cudaMemcpy(dFlag, &flag, sizeof(int), cudaMemcpyHostToDevice);
        cudaCheckErrors("6");
        stepCalc<<<numBlocks, threadsPerBlock>>>(dB, my_L, dFlag, m, dEn);
        cudaDeviceSynchronize();
        cudaCheckErrors("7");
        cudaMemcpy(&flag, dFlag, sizeof(int), cudaMemcpyDeviceToHost);
        cudaCheckErrors("8");
    }
    cudaMemcpy(B, (dB), sizeB, cudaMemcpyDeviceToHost);
    cudaCheckErrors("9");
    cudaMemcpy(&en, dEn, sizeof(float), cudaMemcpyDeviceToHost);
    cudaCheckErrors("10");
    // process B
    cudaFree(dB);
    cudaFree(dFlag);
    cudaFree(dEn);
}
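One residual hazard worth flagging (my observation, in the same spirit as the "data processing errors" caveat above): neighbouring threads update overlapping B cells, and every passing thread does *en += ..., which is a data race. A hedged fix is to make those updates atomic (atomicAdd on float requires compute capability 2.0 or higher):
// Inside the if-block, replace the plain updates with atomics:
*flag = 1;  // benign as-is: every writer stores the same value
atomicAdd(en, -16*g*g + 8*B[i][j]*abs(g));
atomicAdd(&B[i][j],   -4*f*g);
atomicAdd(&B[i+1][j],  f*g);
atomicAdd(&B[i-1][j],  f*g);
atomicAdd(&B[i][j+1],  f*g);
atomicAdd(&B[i][j-1],  f*g);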

Copying a dynamically allocated 2D array from host to device in CUDA

I want to copy a dynamically allocated 2D array from host to device to compute its discrete Fourier transform.
I'm using the code below to copy the array to the device:
cudaMalloc((void**)&array_d, sizeof(cufftComplex)*NX*(NY/2+1));
cudaMemcpy(array_d, array_h, sizeof(float)*NX*NY, cudaMemcpyHostToDevice);
This works fine with static arrays; I get the intended output from my FFT. But it doesn't work with dynamic arrays. After a little bit of searching I learnt that I cannot copy dynamic arrays like this from host to device, so I found this solution:
cudaMalloc((void**)&array_d, sizeof(cufftComplex)*NX*(NY/2+1));
for(int i=0; i<NX; ++i){
    cudaMemcpy(array_d + i*NY, array_h[i], sizeof(float)*NY, cudaMemcpyHostToDevice);
}
But it's also not doing the task properly, since I get wrong values from my FFT.
Given below is my fft code.
cufftPlanMany(&plan, NRANK, n, NULL, 1, 0, NULL, 1, 0, CUFFT_R2C, BATCH);
cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE);
cufftExecR2C(plan, (cufftReal*)data, data);
cudaThreadSynchronize();
cudaMemcpy(c, data, sizeof(float)*NX*NY, cudaMemcpyDeviceToHost);
How can I overcome this problem ?
EDIT
Given below is the full code:
#define NX 4
#define NY 5
#define NRANK 2
#define BATCH 10
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cufft.h>
#include <stdio.h>
#include <iostream>
int check();
int main()
{
    // static array
    float b[NX][NY] = {
        {0.7943 , 0.6020 , 0.7482 , 0.9133 , 0.9961},
        {0.3112 , 0.2630 , 0.4505 , 0.1524 , 0.0782},
        {0.5285 , 0.6541 , 0.0838 , 0.8258 , 0.4427},
        {0.1656 , 0.6892 , 0.2290 , 0.5383 , 0.1067}
    };
    // dynamic array
    float **a = new float*[NX];
    for (int r = 0; r < NX; ++r)
    {
        a[r] = new float[NY];
        for (int c = 0; c < NY; ++c)
        {
            a[r][c] = b[r][c];
        }
    }
    // array to store the results - host side
    float c[NX][NY] = { 0 };
    cufftHandle plan;
    cufftComplex *data;
    int n[NRANK] = {NX, NY};
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*(NY/2+1));
    cudaMemcpy(data, b, sizeof(float)*NX*NY, cudaMemcpyHostToDevice);
    /* Create a 2D FFT plan. */
    cufftPlanMany(&plan, NRANK, n, NULL, 1, 0, NULL, 1, 0, CUFFT_R2C, BATCH);
    cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE);
    cufftExecR2C(plan, (cufftReal*)data, data);
    cudaThreadSynchronize();
    cudaMemcpy(c, data, sizeof(float)*NX*NY, cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
data is of type cufftComplex*, where cufftComplex is a series of typedefs eventually resulting in float2. That means data + n will advance data by n objects of type float2, or by 2 * n objects of type float. This makes your "dynamic array" copying incorrect; you have to halve the increment of data.
EDIT
Looking at the parameter types of cufftExecR2C(), I think this should work:
for(int i=0; i<NX; ++i){
    cudaMemcpy(reinterpret_cast<float*>(data) + i*NY, a[i], sizeof(float)*NY, cudaMemcpyHostToDevice);
}
Side note: you don't actually have a dynamic 2D array (that would be new float[NX * NY]). What you have is a dynamic array of pointers to dynamic arrays of floats. I believe it would make more sense for you to use a true 2D array instead, which would allow you to keep the static-case copy code as well.
And since you've tagged this C++, you should seriously consider using std::vector instead of managing your dynamic memory manually. That is, change a like this:
std::vector<float> a(NX * NY);
And while you're at it, I'd suggest turning NX, NY etc. from macros to constants:
const size_t NX = 4;
const size_t NY = 5;
etc.
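Putting those two suggestions together, a minimal sketch of the contiguous layout (host-side copy only, with hypothetical names like d_a):
#include <vector>
#include <cuda_runtime.h>

const size_t NX = 4;
const size_t NY = 5;

int main() {
    std::vector<float> a(NX * NY);   // one contiguous block; element (r, c) is a[r*NY + c]
    // ... fill a ...
    float *d_a = NULL;
    cudaMalloc((void **)&d_a, NX * NY * sizeof(float));
    // a single memcpy now suffices, exactly as in the static-array case
    cudaMemcpy(d_a, a.data(), NX * NY * sizeof(float), cudaMemcpyHostToDevice);
    cudaFree(d_a);
    return 0;
}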

Installing CUDA C++ library?

Apologies for sounding like a complete n00b, but I have learnt that I can call CUDA extension functions from C++ and have the GPU do the calculation. However, I can't seem to find instructions on how to download the library (nor which library I need to download). Strangely enough, I have a great example, but I don't know how to get the libraries!
Just so my post is more useful, this is the example I wish to implement:
#include <stdio.h>
#include <stdlib.h>
#define N 512

// Minimal definitions of the add() kernel and random_ints() helper that
// this example assumes, filled in so the program is self-contained.
__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

void random_ints(int *p, int n) {
    for (int i = 0; i < n; ++i)
        p[i] = rand();
}

int main(void) {
    int *a, *b, *c;        // host copies of a, b, c
    int *d_a, *d_b, *d_c;  // device copies of a, b, c
    int size = N * sizeof(int);
    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    // Alloc space for host copies of a, b, c and set up input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);
    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
You can find the CUDA SDK here: Cuda SDK
It wasn't very hard to find, to be honest... However, if you face this kind of problem in the future, you will usually find the libraries by searching for their name (here, CUDA) followed by "SDK" on Google. They should always be among the first results.
If you want to get started, NVIDIA provides very nice documentation in my opinion, as well as a getting-started section including an introduction to parallel programming: Getting Started