CUDA Constant Memory Error - c++

I am trying to write a sample program that uses constant memory with CUDA 5.5. I have two constant arrays of size 3000 each, and another global array X of size N.
I want to compute
Y[tid] = X[tid]*A[tid%3000] + B[tid%3000]
Here is the code.
#include <iostream>
#include <stdio.h>
using namespace std;
#include <cuda.h>
__device__ __constant__ int A[3000];
__device__ __constant__ int B[3000];
__global__ void kernel( int *dc_A, int *dc_B, int *X, int *out, int N)
{
int tid = threadIdx.x + blockIdx.x*blockDim.x;
if( tid<N )
{
out[tid] = dc_A[tid%3000]*X[tid] + dc_B[tid%3000];
}
}
int main()
{
int N=100000;
// set affine constants on host
int *h_A, *h_B ; //host vectors
h_A = (int*) malloc( 3000*sizeof(int) );
h_B = (int*) malloc( 3000*sizeof(int) );
for( int i=0 ; i<3000 ; i++ )
{
h_A[i] = (int) (drand48() * 10);
h_B[i] = (int) (drand48() * 10);
}
//set X and Y on host
int * h_X = (int*) malloc( N*sizeof(int) );
int * h_out = (int *) malloc( N*sizeof(int) );
//set the vector
for( int i=0 ; i<N ; i++ )
{
h_X[i] = i;
h_out[i] = 0;
}
// copy, A,B,X,Y to device
int * d_X, *d_out;
cudaMemcpyToSymbol( A, h_A, 3000 * sizeof(int) ) ;
cudaMemcpyToSymbol( B, h_B, 3000 * sizeof(int) ) ;
cudaMalloc( (void**)&d_X, N*sizeof(int) ) ;
cudaMemcpy( d_X, h_X, N*sizeof(int), cudaMemcpyHostToDevice ) ;
cudaMalloc( (void**)&d_out, N*sizeof(int) ) ;
//call kernel for vector addition
kernel<<< (N+1024)/1024,1024 >>>(A,B, d_X, d_out, N);
cudaPeekAtLastError() ;
cudaDeviceSynchronize() ;
// D --> H
cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost ) ;
free(h_A);
free(h_B);
return 0;
}
I am trying to run the debugger over this code to analyze it. It turns out that on the line which copies to constant memory I get the following error from the debugger:
Coalescing of the CUDA commands output is off.
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5c5b700 (LWP 31200)]
Can somebody please help me out with constant memory?

There are several problems here. It is probably easier to start by showing the "correct" way to use those two constant arrays, then explain why what you did doesn't work. So the kernel should look like this:
__global__ void kernel(int *X, int *out, int N)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    if( tid<N )
    {
        out[tid] = A[tid%3000]*X[tid] + B[tid%3000];
    }
}
i.e. don't try passing A and B to the kernel. The reasons are as follows:
Somewhat confusingly, A and B in host code are not valid device memory addresses. They are host symbols which provide hooks into a runtime device symbol lookup. It is illegal to pass them to a kernel. If you want their device memory address, you must use cudaGetSymbolAddress to retrieve it at runtime.
Even if you did call cudaGetSymbolAddress and retrieve the symbols' device addresses in constant memory, you shouldn't pass them to a kernel as arguments, because doing so would not yield uniform memory access in the running kernel. Correct use of constant memory requires the compiler to emit special PTX instructions, and the compiler will only do that when it knows that a particular global memory location is in constant memory. If you pass a constant memory address by value as an argument, the __constant__ property is lost and the compiler can't know to produce the correct load instructions.
Once you get this working, you will find it is terribly slow, and if you profile it you will find a very high degree of instruction replay and serialization. The whole idea of using constant memory is that you can exploit the constant cache's broadcast mechanism when every thread in a warp accesses the same value in constant memory. Your example is the complete opposite of that: every thread accesses a different value. Regular global memory will be faster in such a use case. Also be aware that the performance of the modulo operator on current GPUs is poor, and you should avoid it wherever possible.
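For reference, here is a minimal end-to-end sketch of the corrected program (written for illustration and not compiled or tested; the drand48 initialisation is replaced by a simple deterministic fill and error checking is omitted for brevity):
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__device__ __constant__ int A[3000];
__device__ __constant__ int B[3000];

// A and B are read directly through the constant cache; they are not kernel arguments
__global__ void kernel( int *X, int *out, int N )
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    if( tid<N )
    {
        out[tid] = A[tid%3000]*X[tid] + B[tid%3000];
    }
}

int main()
{
    const int N = 100000;
    int *h_A   = (int*) malloc( 3000*sizeof(int) );
    int *h_B   = (int*) malloc( 3000*sizeof(int) );
    int *h_X   = (int*) malloc( N*sizeof(int) );
    int *h_out = (int*) malloc( N*sizeof(int) );
    for( int i=0 ; i<3000 ; i++ ) { h_A[i] = i%10; h_B[i] = (i+3)%10; }
    for( int i=0 ; i<N ; i++ )    { h_X[i] = i; h_out[i] = 0; }

    // the host-side names A and B are symbols, so they are filled with cudaMemcpyToSymbol
    cudaMemcpyToSymbol( A, h_A, 3000*sizeof(int) );
    cudaMemcpyToSymbol( B, h_B, 3000*sizeof(int) );

    int *d_X, *d_out;
    cudaMalloc( (void**)&d_X,   N*sizeof(int) );
    cudaMalloc( (void**)&d_out, N*sizeof(int) );
    cudaMemcpy( d_X, h_X, N*sizeof(int), cudaMemcpyHostToDevice );

    kernel<<< (N+1023)/1024, 1024 >>>( d_X, d_out, N );   // A and B are no longer passed
    cudaDeviceSynchronize();

    cudaMemcpy( h_out, d_out, N*sizeof(int), cudaMemcpyDeviceToHost );
    printf( "out[4242] = %d\n", h_out[4242] );

    cudaFree(d_X); cudaFree(d_out);
    free(h_A); free(h_B); free(h_X); free(h_out);
    return 0;
}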

Related

Memcopy multiple gpus in cuda programming [duplicate]

How can I use two devices in order to improve, for example, the performance of the following code (sum of vectors)?
Is it possible to use more devices "at the same time"?
If yes, how can I manage the allocations of the vectors on the global memory of the different devices?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <cuda.h>
#define NB 32
#define NT 500
#define N NB*NT
__global__ void add( double *a, double *b, double *c);
//===========================================
__global__ void add( double *a, double *b, double *c){
int tid = threadIdx.x + blockIdx.x * blockDim.x;
while(tid < N){
c[tid] = a[tid] + b[tid];
tid += blockDim.x * gridDim.x;
}
}
//============================================
//BEGIN
//===========================================
int main( void ) {
double *a, *b, *c;
double *dev_a, *dev_b, *dev_c;
// allocate the memory on the CPU
a=(double *)malloc(N*sizeof(double));
b=(double *)malloc(N*sizeof(double));
c=(double *)malloc(N*sizeof(double));
// allocate the memory on the GPU
cudaMalloc( (void**)&dev_a, N * sizeof(double) );
cudaMalloc( (void**)&dev_b, N * sizeof(double) );
cudaMalloc( (void**)&dev_c, N * sizeof(double) );
// fill the arrays 'a' and 'b' on the CPU
for (int i=0; i<N; i++) {
a[i] = (double)i;
b[i] = (double)i*2;
}
// copy the arrays 'a' and 'b' to the GPU
cudaMemcpy( dev_a, a, N * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy( dev_b, b, N * sizeof(double), cudaMemcpyHostToDevice);
for(int i=0;i<10000;++i)
add<<<NB,NT>>>( dev_a, dev_b, dev_c );
// copy the array 'c' back from the GPU to the CPU
cudaMemcpy( c, dev_c, N * sizeof(double), cudaMemcpyDeviceToHost);
// display the results
// for (int i=0; i<N; i++) {
// printf( "%g + %g = %g\n", a[i], b[i], c[i] );
// }
printf("\nGPU done\n");
// free the memory allocated on the GPU
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
// free the memory allocated on the CPU
free( a );
free( b );
free( c );
return 0;
}
Thank you in advance.
Michele
Since CUDA 4.0 was released, multi-GPU computations of the type you are asking about are relatively easy. Prior to that, you would have needed to use a multi-threaded host application with one host thread per GPU and some sort of inter-thread communication system in order to use multiple GPUs inside the same host application.
Now it is possible to do something like this for the memory allocation part of your host code:
double *dev_a[2], *dev_b[2], *dev_c[2];
const int Ns[2] = {N/2, N-(N/2)};
// allocate the memory on the GPUs
for(int dev=0; dev<2; dev++) {
    cudaSetDevice(dev);
    cudaMalloc( (void**)&dev_a[dev], Ns[dev] * sizeof(double) );
    cudaMalloc( (void**)&dev_b[dev], Ns[dev] * sizeof(double) );
    cudaMalloc( (void**)&dev_c[dev], Ns[dev] * sizeof(double) );
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
The basic idea here is that you use cudaSetDevice to select between devices when you are performing operations on a device. So in the above snippet, I have assumed two GPUs and allocated memory on each [(N/2) doubles on the first device and N-(N/2) on the second].
The transfer of data from the host to device could be as simple as:
// copy the arrays 'a' and 'b' to the GPUs
for(int dev=0, pos=0; dev<2; pos+=Ns[dev], dev++) {
    cudaSetDevice(dev);
    cudaMemcpy( dev_a[dev], a+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy( dev_b[dev], b+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
The kernel launching section of your code could then look something like:
for(int i=0; i<10000; ++i) {
    for(int dev=0; dev<2; dev++) {
        cudaSetDevice(dev);
        add<<<NB,NT>>>( dev_a[dev], dev_b[dev], dev_c[dev], Ns[dev] );
    }
}
(disclaimer: written in browser, never compiled, never tested, use at own risk).
Note that I have added an extra argument to your kernel call, because each instance of the kernel may be called with a different number of array elements to process. I will leave it to you to work out the modifications required.
But, again, the basic idea is the same: use cudaSetDevice to select a given GPU, then run kernels on it in the normal way, with each kernel getting its own unique arguments.
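The device-to-host copy of the results can follow the same pattern; as a sketch (again written in browser, never compiled, never tested, use at own risk):
// copy the two halves of 'c' back from the GPUs
for(int dev=0, pos=0; dev<2; pos+=Ns[dev], dev++) {
    cudaSetDevice(dev);
    cudaMemcpy( c+pos, dev_c[dev], Ns[dev] * sizeof(double), cudaMemcpyDeviceToHost );
}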
You should be able to put these parts together to produce a simple multi-GPU application. There are a lot of other features which can be used in recent CUDA versions and hardware to assist multi-GPU applications (like unified addressing, the peer-to-peer facilities, and more), but this should be enough to get you started. There is also a simple multi-GPU application in the CUDA SDK you can look at for more ideas.

Passing a Constant Integer in a CUDA Kernel [duplicate]

I am having a problem with the following code. In the global kernel, loop_d, M has an integer value of 84. When I try to create a shared array, temp, and use M as the size of the array, I get the following error:
error: expression must have a constant value
I am not sure why that is. I know that if I declare M as a global variable, then it works, but the problem is that I get the value of M by calling the function d_two in a different Fortran program, so I am not sure how to get around that. I know that if I replace temp[M] with temp[84], then my program runs perfectly, but that is not very practical, since different problems might have different values of M. Thank you for your help!
The program
// Parallelized 2D Three-Point Gaussian Quadrature Numerical Integration Method
// The following program is one of two linked programs, the other being Integral_2D_Cuda.f.
// This is a CUDA kernel that could be called in the Integral_2D_Cuda.f Fortran code to compute
// the integral of a given 2D function
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda.h>
#include <cuda_runtime.h>
// The following is a definition for the atomicAddd function that is called in the loop_d kernel
// This is needed because the "regular" atomicAdd function only works for floats and integers
__device__ double atomicAddd(double* address, double val)
{
unsigned long long int* address_as_ull = (unsigned long long int*)address;
unsigned long long int old = *address_as_ull, assumed;
do {
assumed = old;
old = atomicCAS(address_as_ull, assumed,
__double_as_longlong(val + __longlong_as_double(assumed)));
} while (assumed != old);
return __longlong_as_double(old);
}
// GPU kernel that computes the function of interest. This is good for a two dimensional problem.
__global__ void loop_d(double *a_sx, double *b_swx, double *c_sy, double *d_swy, double *e_ans0, int N, int M)
{
// Declaring a shared array that threads of the same block have access to
__shared__ double temp[M];
int idxX = blockIdx.x * blockDim.x + threadIdx.x; // Thread indices responsible for the swx and sx arrays
int idxY = threadIdx.y; // Thread indices responsible for the swy and sy arrays
// Computing the multiplication of elements
if (idxX < N && idxY < M)
{
temp[idxY] = a_sx[idxX] * b_swx[idxX] * c_sy[idxY] * d_swy[idxY];
}
// synchronizing all threads before summing all the multiplied elements in the temp array
__syncthreads();
// Allowing the 0th thread of y to do the summation of the multiplied elements in the temp array of one block
if (0 == idxY)
{
double sum = 0.00;
for(int k = 0; k < M; k++)
{
sum = sum + temp[k];
}
// Adding the result of this instance of calculation to the final answer, ans0
atomicAddd(e_ans0, sum);
}
}
extern "C" void d_two_(double *sx, double *swx, int *nptx, double *sy, double *swy, int *npty, double *ans0)
{
// Assigning GPU pointers
double *sx_d, *swx_d;
int N = *nptx;
double *sy_d, *swy_d;
int M = *npty;
double *ans0_d;
dim3 threadsPerBlock(1,M); // Creating a two dimensional block with 1 thread in the x dimension and M threads in the y dimension
dim3 numBlocks(N); // specifying the number of blocks to use of dimension 1xM
// Allocating GPU Memory
cudaMalloc( (void **)&sx_d, sizeof(double) * N);
cudaMalloc( (void **)&swx_d, sizeof(double) * N);
cudaMalloc( (void **)&sy_d, sizeof(double) * M);
cudaMalloc( (void **)&swy_d, sizeof(double) * M);
cudaMalloc( (void **)&ans0_d, sizeof(double) );
// Copying information from CPU to GPU
cudaMemcpy( sx_d, sx, sizeof(double) * N, cudaMemcpyHostToDevice );
cudaMemcpy( swx_d, swx, sizeof(double) * N, cudaMemcpyHostToDevice );
cudaMemcpy( sy_d, sy, sizeof(double) * M, cudaMemcpyHostToDevice );
cudaMemcpy( swy_d, swy, sizeof(double) * M, cudaMemcpyHostToDevice );
cudaMemcpy( ans0_d, ans0, sizeof(double), cudaMemcpyHostToDevice );
// Calling the function on the GPU
loop_d<<< numBlocks, threadsPerBlock >>>(sx_d, swx_d, sy_d, swy_d, ans0_d, N, M);
// Copying from GPU to CPU
cudaMemcpy( ans0, ans0_d, sizeof(double), cudaMemcpyDeviceToHost );
// freeing GPU memory
cudaFree(sx_d);
cudaFree(swx_d);
cudaFree(sy_d);
cudaFree(swy_d);
cudaFree(ans0_d);
return;
}
The compiler needs M to be a compile-time constant. At compile time it cannot determine what M is actually going to be (it doesn't know you will just pass it 84 eventually).
When you want to use shared memory whose size you only know at runtime, you use dynamic shared memory.
See this example here on the site or Using Shared Memory in CUDA on the Parallel4All blog.
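As an illustration, the dynamic-shared-memory version of your kernel and its launch might look something like this (a sketch based on your code, not compiled or tested; it still relies on the atomicAddd helper defined above):
// shared array declared with no size; the size is supplied at kernel launch time
__global__ void loop_d(double *a_sx, double *b_swx, double *c_sy, double *d_swy, double *e_ans0, int N, int M)
{
    extern __shared__ double temp[];   // dynamic shared memory
    int idxX = blockIdx.x * blockDim.x + threadIdx.x;
    int idxY = threadIdx.y;
    if (idxX < N && idxY < M)
    {
        temp[idxY] = a_sx[idxX] * b_swx[idxX] * c_sy[idxY] * d_swy[idxY];
    }
    __syncthreads();
    if (0 == idxY)
    {
        double sum = 0.0;
        for (int k = 0; k < M; k++)
            sum += temp[k];
        atomicAddd(e_ans0, sum);
    }
}

// host side: the third launch configuration parameter is the shared memory size in bytes
loop_d<<< numBlocks, threadsPerBlock, M * sizeof(double) >>>(sx_d, swx_d, sy_d, swy_d, ans0_d, N, M);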

Why is my CUDA code not working properly for zero filling a large matrix?

It is a simple CUDA code for initializing a big matrix (filling in zeros).
I output the first 1*3 matrix, if the code works. It should be all zeros.
If I set the matrix size to be small, then the program works properly. But when I make the size larger (> 43200 * 2400), what is inside the matrix is all garbage.
I already have cudaDeviceSynchronize() appended after each CUDA call.
I am using NVIDIA Quadro K4200, Xeon E5-2630 with Ubuntu 14.04.
Thanks to anyone helping me here.
Attached below is my full code.
#include <stdio.h>
#include <math.h>
#include <iostream>
#include <cuComplex.h>
#define BLOCK_SIZE 16 // change it to 16 to get maximum performance
// populate the matrix using first row
__global__ void RepmatKernel (cuComplex *Mat, const unsigned int N, const unsigned int Cols)
{
unsigned int i = (unsigned int)blockIdx.x * (unsigned int)blockDim.x + (unsigned int)threadIdx.x;
if (i < N)
{
Mat[i].x = 0;
Mat[i].y = 0;
}
}
// main routine
int main ()
{
const unsigned int Rows = 43200;
const unsigned int Cols = 2400;
const unsigned int Num_thrd = 256; // max threads per block
unsigned int Mat_size = Rows * Cols; // size of array
cuComplex *vec; // supposedly the input
cuComplex *mat_debug; // for debug
vec = new cuComplex [Cols];
mat_debug = new cuComplex [Rows*Cols];
cuComplex *mat_in_d; // device array
//input in host array
for(unsigned int i = 0; i < Cols; i++)
{
vec[i].x = 3*i+4;
vec[i].y = 0.2*i+1;
}
const unsigned int size_mat_d = Rows * Cols * sizeof(cuComplex);
//create device array cudaMalloc ( (void **)&array_name, sizeofmatrixinbytes) ;
if (cudaMalloc((void **) &mat_in_d , size_mat_d) != cudaSuccess) std::cout<<"Error allocating GPU";
cudaDeviceSynchronize() ;
//copy host array to device array; cudaMemcpy ( dest , source , WIDTH , direction )
cudaMemcpy ( mat_in_d , vec , Cols , cudaMemcpyHostToDevice ) ;
cudaDeviceSynchronize() ;
// ========================================================================
cudaMemcpy(mat_debug , mat_in_d , size_mat_d , cudaMemcpyDeviceToHost) ;
cudaDeviceSynchronize() ;
std::cout<<"before repmat="<<std::endl;
std::cout<<"[";
for(unsigned int i = 0; i < 3; i++)
{
std::cout<< mat_debug[i * Cols].x <<"+"<<mat_debug[i * Cols].y <<"i, ";
std::cout<<";"<<std::endl;
}
std::cout<<"]"<<std::endl;
// ==========================================================================
RepmatKernel<<<(unsigned int)ceil((float)(Mat_size)/(float)(Num_thrd)),
(Num_thrd)>>>(mat_in_d,
Mat_size,
Cols);
cudaDeviceSynchronize();
// ========================================================================
cudaMemcpy(mat_debug , mat_in_d , size_mat_d , cudaMemcpyDeviceToHost) ;
cudaDeviceSynchronize() ;
std::cout<<"after repmat="<<std::endl;
std::cout<<"[";
for(unsigned int i = 0; i < 3; i++)
{
std::cout<< mat_debug[i * Cols].x <<"+"<<mat_debug[i * Cols].y <<"i, ";
std::cout<<";"<<std::endl;
}
std::cout<<"]"<<std::endl;
// ==========================================================================
cudaFree(mat_in_d);
delete [] vec;
delete [] mat_debug;
return 0;
}
Your check around cudaMalloc detects that there is a problem, but doesn't actually terminate the computation. You should put a
if (cudaMalloc((void **) &mat_in_d , size_mat_d) != cudaSuccess)
{
    std::cout<<"Error allocating GPU\n";
    return 1;
}
so that the computation actually stops when you overflow the memory, rather than attempting to work anyway with only a warning to std::cout. Even better would be to use an error handling macro, along the lines of the sketch below.
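A minimal sketch of such a macro (illustrative only; many CUDA codebases carry some variant of this):
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// wrap every CUDA runtime call: print the error string with file/line, then abort
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err__ = (call);                                         \
        if (err__ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                     \
                    cudaGetErrorString(err__), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// usage:
// CUDA_CHECK( cudaMalloc((void **)&mat_in_d, size_mat_d) );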
Another problem is here:
cudaMemcpy ( mat_in_d , vec , Cols , cudaMemcpyHostToDevice );
First, mat_in_d is size Rows * Cols * sizeof(cuComplex), but you are only copying Cols bytes into it. Even if you only wanted to copy vec into the first part of the mat_in_d vector, you'd need to change this to
cudaMemcpy ( mat_in_d , vec , Cols*sizeof(cuComplex) , cudaMemcpyHostToDevice );
At this point, you'd expect the first Cols entries of your matrix to be reasonable, and the rest to be garbage. (Making the suggested change shows that this is indeed the case; why you would want to do this is a better question.)
Next comes your kernel call, whose entire goal is to set the entries of Mat to zero. This should be done with cudaMemset, i.e., just use
cudaMemset(mat_in_d, 0, Mat_size*sizeof(cuComplex));
We could look more carefully at the execution configuration to see what went wrong with your kernel call, but for now this fixes your problem.
For debugging CUDA errors, I find a header from the samples, helper_cuda.h, quite convenient. I almost always include this header, which is located in the common directory of the samples, in my projects.
Then, wrapping all CUDA calls with checkCudaErrors(), like checkCudaErrors(cudaMalloc((void **) &mat_in_d , size_mat_d)); gives explicit error messages.
In my case, since mat_in_d alone is close to 1 GB and my GPU's memory is only 512 MB, it failed for sure and threw cudaErrorMemoryAllocation. However, an NVIDIA Quadro K4200 should not fail that easily!
Did you check the actual available memory using cudaMemGetInfo?
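For reference, a minimal sketch of such a check (illustrative only, using the checkCudaErrors wrapper mentioned above):
size_t free_bytes = 0, total_bytes = 0;
checkCudaErrors( cudaMemGetInfo(&free_bytes, &total_bytes) );
std::cout << "GPU memory: " << free_bytes << " bytes free of " << total_bytes << " bytes total" << std::endl;
// compare free_bytes against Rows * Cols * sizeof(cuComplex) before the big cudaMalloc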

How to allocate Array of Pointers and preserve them for multiple kernel calls in cuda

I am trying to implement an algorithm in CUDA and I need to allocate an array of pointers that point to an array of structs. My struct is, let's say:
typedef struct {
float x, y;
} point;
I know that if I want to preserve the arrays for multiple kernel calls I have to control them from the host, is that right? The initialization of the pointers must be done from within the kernel. To be more specific, the array of structs P will contain a random order of Cartesian points, while dev_S_x will be a version sorted by the x coordinate of the points in P.
I have tried with:
__global__ void test( point *dev_P, point **dev_S_x) {
unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
dev_P[tid].x = 3.141516;
dev_P[tid].y = 3.141516;
dev_S_x[tid] = &dev_P[tid];
...
}
and:
int main( void ) {
point *P, *dev_P, **S_x, *dev_S_x;
P = (point*) malloc (N * sizeof (point) );
S_x = (point**) malloc (N * sizeof (point*));
// allocate the memory on the GPU
cudaMalloc( (void**) &dev_P, N * sizeof(point) );
cudaMalloc( (void***) &dev_S_x, N * sizeof(point*));
// copy the array P to the GPU
cudaMemcpy( dev_P, P, N * sizeof(point), cudaMemcpyHostToDevice);
cudaMemcpy( dev_S_x,S_x,N * sizeof(point*), cudaMemcpyHostToDevice);
test <<<1, 1 >>>( dev_P, &dev_S_x);
...
return 0;
}
which leads to many
First-chance exception at 0x000007fefcc89e5d (KernelBase.dll) in Test_project_cuda.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0020f920..
Critical error detected c0000374
Am I doing something wrong in the cudaMalloc of the array of pointers, or is it something else? Is the usage of (void***) correct? I would like to use, for example, dev_S_x[tid]->x or dev_S_x[tid]->y from within the kernels, pointing to device memory addresses. Is that feasible?
Thanks in advance
dev_S_x should be declared as point ** and should be passed to the kernel as a value (i.e. test <<<1, 1 >>>(dev_P, dev_S_x);).
Putting that to one side, what you describe sounds like a natural fit for Thrust, which will give you a simpler memory management strategy and access to fast sort routines.
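Concretely, the relevant part of main could change along these lines (a sketch, untested, assuming N is defined as in your code):
point *P, *dev_P;
point **S_x, **dev_S_x;                       // dev_S_x is now point **
P   = (point*)  malloc( N * sizeof(point)  );
S_x = (point**) malloc( N * sizeof(point*) );
cudaMalloc( (void**) &dev_P,   N * sizeof(point)  );
cudaMalloc( (void**) &dev_S_x, N * sizeof(point*) );
cudaMemcpy( dev_P, P, N * sizeof(point), cudaMemcpyHostToDevice );
test<<<1, 1>>>( dev_P, dev_S_x );             // pass dev_S_x by value, not &dev_S_x
Note that the cudaMemcpy of S_x to the device is not needed here, since the kernel itself initializes the pointers in dev_S_x.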

Access vector of pointers to other vectors on a GPU

So this is a follow-up to a question I had. At the moment, in a CPU version of some code, I have many things that look like the following:
for(int i =0;i<N;i++){
dgemm(A[i], B[i],C[i], Size[i][0], Size[i][1], Size[i][2], Size[i][3], 'N','T');
}
where A[i] will be a 2D matrix of some size.
I would like to be able to do this on a GPU using CULA (I'm not just doing multiplies, so I need the linear algebra operations in CULA), so for example:
for(int i =0;i<N;i++){
status = culaDeviceDgemm('T', 'N', Size[i][0], Size[i][0], Size[i][0], alpha, GlobalMat_d[i], Size[i][0], NG_d[i], Size[i][0], beta, GG_d[i], Size[i][0]);
}
but I would like to store my B's on the GPU in advance at the start of the program, as they don't change, so I need to have a vector that contains pointers to the set of vectors that make up my B's.
I currently have the following code that compiles:
double **GlobalFVecs_d;
double **GlobalFPVecs_d;
extern "C" void copyFNFVecs_(double **FNFVecs, int numpulsars, int numcoeff){
cudaError_t err;
GlobalFPVecs_d = (double **)malloc(numpulsars * sizeof(double*));
err = cudaMalloc( (void ***)&GlobalFVecs_d, numpulsars*sizeof(double*) );
checkCudaError(err);
for(int i =0; i < numpulsars;i++){
err = cudaMalloc( (void **) &(GlobalFPVecs_d[i]), numcoeff*numcoeff*sizeof(double) );
checkCudaError(err);
err = cudaMemcpy( GlobalFPVecs_d[i], FNFVecs[i], sizeof(double)*numcoeff*numcoeff, cudaMemcpyHostToDevice );
checkCudaError(err);
}
err = cudaMemcpy( GlobalFVecs_d, GlobalFPVecs_d, sizeof(double*)*numpulsars, cudaMemcpyHostToDevice );
checkCudaError(err);
}
But if I now try to access it with:
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid;//((G + dimBlock.x - 1) / dimBlock.x,(N + dimBlock.y - 1) / dimBlock.y);
dimGrid.x=(numcoeff + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (numcoeff + dimBlock.y - 1)/dimBlock.y;
for(int i =0; i < numpulsars; i++){
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFVecs_d[i], numpulsars, numcoeff, i);
}
it seg faults here. Is this not how to get at the data?
The kernel function that I'm calling is just:
__global__ void CopyPPFNF(double *FNF_d, double *PPFNF_d, int numpulsars, int numcoeff, int thispulsar) {
// Each thread computes one element of C
// by accumulating results into Cvalue
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int subrow=row-thispulsar*numcoeff;
int subcol=row-thispulsar*numcoeff;
__syncthreads();
if(row >= (thispulsar+1)*numcoeff || col >= (thispulsar+1)*numcoeff) return;
if(row < thispulsar*numcoeff || col < thispulsar*numcoeff) return;
FNF_d[row * numpulsars*numcoeff + col] += PPFNF_d[subrow*numcoeff+subcol];
}
What am I not doing right? Note that eventually I would also like to do as in the first example, calling CULA functions on each GlobalFVecs_d[i], but for now not even this works.
Do you think this is the best way to go about doing this? If it were possible to just pass CULA functions a slice of a large contiguous vector I could do that too, but I don't know if it supports that.
Cheers
Lindley
change this:
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFVecs_d[i], numpulsars, numcoeff, i);
to this:
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFPVecs_d[i], numpulsars, numcoeff, i);
and I believe it will work.
Your methodology of handling pointers is mostly correct. However, when you put GlobalFVecs_d[i] in the parameter list, you are forcing the kernel setup code (running on the host) to take GlobalFVecs_d (a device pointer, created with cudaMalloc), add an appropriately scaled i to the pointer value, and then dereference the resultant pointer to retrieve the value to pass as a parameter to the kernel. But we are not allowed to dereference device pointers in host code.
However, because your methodology was mostly correct, you have a convenient parallel array of the same pointers that resides on the host. This array (GlobalFPVecs_d) is one we are allowed to index in host code to retrieve the resultant device pointer to pass to the kernel.
It's an interesting bug because normally kernels do not seg fault (although they may throw an error), so a seg fault on a kernel invocation line is unusual. But in this case, the seg fault is occurring in the kernel setup code, not the kernel itself.
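Reduced to its bare essentials, the pattern looks something like the following (a generic sketch, not your original code, with a trivial hypothetical scale kernel just for illustration):
#include <cuda_runtime.h>

__global__ void scale(double *v, int len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) v[i] *= 2.0;
}

int main()
{
    const int n = 4, len = 256;
    // h_ptrs: host-resident table of device pointers (safe to index on the host)
    // d_ptrs: device-resident copy of the same table (only usable inside kernels)
    double *h_ptrs[n];
    double **d_ptrs;
    for (int i = 0; i < n; i++)
        cudaMalloc((void**)&h_ptrs[i], len * sizeof(double));
    cudaMalloc((void**)&d_ptrs, n * sizeof(double*));
    cudaMemcpy(d_ptrs, h_ptrs, n * sizeof(double*), cudaMemcpyHostToDevice);

    // launching per-array kernels: index the HOST table, never the device one
    for (int i = 0; i < n; i++)
        scale<<<(len + 255)/256, 256>>>(h_ptrs[i], len);
    // d_ptrs would be the argument to pass when a single kernel needs the whole table
    cudaDeviceSynchronize();

    for (int i = 0; i < n; i++) cudaFree(h_ptrs[i]);
    cudaFree(d_ptrs);
    return 0;
}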