so this is a followup to a question i had, at the moment in a CPU version of some Code, i have many things that look like the following:
for(int i =0;i<N;i++){
dgemm(A[i], B[i],C[i], Size[i][0], Size[i][1], Size[i][2], Size[i][3], 'N','T');
}
where A[i] will be a 2D matrix of some size.
I would like to be able to do this on a GPU using CULA (I'm not just doing multiplies, so i need the Linear ALgebra operations in CULA), so for example:
for(int i =0;i<N;i++){
status = culaDeviceDgemm('T', 'N', Size[i][0], Size[i][0], Size[i][0], alpha, GlobalMat_d[i], Size[i][0], NG_d[i], Size[i][0], beta, GG_d[i], Size[i][0]);
}
but I would like to store my B's on the GPU in advance at the start of the program as they dont change, so I need to have a vector that contains pointers to the set of vectors that make up my B's.
i currently have the following code that compiles:
double **GlobalFVecs_d;
double **GlobalFPVecs_d;
extern "C" void copyFNFVecs_(double **FNFVecs, int numpulsars, int numcoeff){
cudaError_t err;
GlobalFPVecs_d = (double **)malloc(numpulsars * sizeof(double*));
err = cudaMalloc( (void ***)&GlobalFVecs_d, numpulsars*sizeof(double*) );
checkCudaError(err);
for(int i =0; i < numpulsars;i++){
err = cudaMalloc( (void **) &(GlobalFPVecs_d[i]), numcoeff*numcoeff*sizeof(double) );
checkCudaError(err);
err = cudaMemcpy( GlobalFPVecs_d[i], FNFVecs[i], sizeof(double)*numcoeff*numcoeff, cudaMemcpyHostToDevice );
checkCudaError(err);
}
err = cudaMemcpy( GlobalFVecs_d, GlobalFPVecs_d, sizeof(double*)*numpulsars, cudaMemcpyHostToDevice );
checkCudaError(err);
}
but if i now try and access it with:
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid;//((G + dimBlock.x - 1) / dimBlock.x,(N + dimBlock.y - 1) / dimBlock.y);
dimGrid.x=(numcoeff + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (numcoeff + dimBlock.y - 1)/dimBlock.y;
for(int i =0; i < numpulsars; i++){
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFVecs_d[i], numpulsars, numcoeff, i);
}
it seg faults here, is this not how to get at the data?
The kernal function that i'm calling is just:
__global__ void CopyPPFNF(double *FNF_d, double *PPFNF_d, int numpulsars, int numcoeff, int thispulsar) {
// Each thread computes one element of C
// by accumulating results into Cvalue
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int subrow=row-thispulsar*numcoeff;
int subcol=row-thispulsar*numcoeff;
__syncthreads();
if(row >= (thispulsar+1)*numcoeff || col >= (thispulsar+1)*numcoeff) return;
if(row < thispulsar*numcoeff || col < thispulsar*numcoeff) return;
FNF_d[row * numpulsars*numcoeff + col] += PPFNF_d[subrow*numcoeff+subcol];
}
What am i not doing right? Note eventually I would also like to do as the first example, calling cula functions on each GlobalFVecs_d[i], but for now not even this works.
Do you think this is the best way to go about doing this? If it were possible to just pass CULA functions a slice of a large continuous vector I could do that to, but i don't know if it supports that.
Cheers
Lindley
change this:
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFVecs_d[i], numpulsars, numcoeff, i);
to this:
CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFPVecs_d[i], numpulsars, numcoeff, i);
and I believe it will work.
Your methodology of handling pointers is mostly correct. However, when you put GlobalFVecs_d[i] in the parameter list, you are forcing the kernel setup code (running on the host) to take GlobalFVecs_d (a device pointer, created with cudaMalloc), add an appropriately scaled i to the pointer value, and then dereference the resultant pointer to retrieve the value to pass as a parameter to the kernel. But we are not allowed to dereference device pointers in host code.
However, because your methodology was mostly correct, you have a convenient parallel array of the same pointers that resides on the host. This array (GlobalFPVecs_d) is something that we are allowed to dereference into, in host code, to retrieve the resultant device pointer, to pass to the kernel.
It's an interesting bug because normally kernels do not seg fault (although they may throw an error), so a seg fault on a kernel invocation line is unusual. But in this case, the seg fault is occurring in the kernel setup code, not the kernel itself.
Related
I'm just starting with CUDA and this is my very first project. I've done a search for this issue and while I've noticed other people have had similar problems, none of the suggestions seemed relevant to my specific issue or have helped in my case.
As an exercise, I'm trying to write an n-body simulation using CUDA. At this stage I'm not interested whether my specific implementation is efficient or not, I'm just looking for something that works and I can refine it later. I'll also need to update the code later, once it's working, to work on my SLI configuration.
Here's a brief outline of the process:
Create X and Y position, velocity, acceleration vectors.
Create same vectors on GPU and copy values across
In a loop: (i) calculate acceleration for the iteration, (ii) apply acceleration to velocities and positions, and (iii) copy positions back to host for display.
(Display not implemented yet. I'll do this later)
Don't worry about the acceleration calculation function for now, here is the update function:
__global__ void apply_acc(double* pos_x, double* pos_y, double* vel_x, double* vel_y, double* acc_x, double* acc_y, int N)
{
int i = threadIdx.x;
if (i < N);
{
vel_x[i] += acc_x[i];
vel_y[i] += acc_y[i];
pos_x[i] += vel_x[i];
pos_y[i] += vel_y[i];
}
}
And here's some of the code in the main method:
cudaError t;
t = cudaMalloc(&d_pos_x, N * sizeof(double));
t = cudaMalloc(&d_pos_y, N * sizeof(double));
t = cudaMalloc(&d_vel_x, N * sizeof(double));
t = cudaMalloc(&d_vel_y, N * sizeof(double));
t = cudaMalloc(&d_acc_x, N * sizeof(double));
t = cudaMalloc(&d_acc_y, N * sizeof(double));
t = cudaMemcpy(d_pos_x, pos_x, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_pos_y, pos_y, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_vel_x, vel_x, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_vel_y, vel_y, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_acc_x, acc_x, N * sizeof(double), cudaMemcpyHostToDevice);
t = cudaMemcpy(d_acc_y, acc_y, N * sizeof(double), cudaMemcpyHostToDevice);
while (true)
{
calc_acc<<<1, N>>>(d_pos_x, d_pos_y, d_vel_x, d_vel_y, d_acc_x, d_acc_y, N);
apply_acc<<<1, N>>>(d_pos_x, d_pos_y, d_vel_x, d_vel_y, d_acc_x, d_acc_y, N);
t = cudaMemcpy(pos_x, d_pos_x, N * sizeof(double), cudaMemcpyDeviceToHost);
t = cudaMemcpy(pos_y, d_pos_y, N * sizeof(double), cudaMemcpyDeviceToHost);
std::cout << pos_x[0] << std::endl;
}
Every loop, cout writes the same value, whatever random value it was set to when the position arrays were original created. If I change the code in apply_acc to something like:
__global__ void apply_acc(double* pos_x, double* pos_y, double* vel_x, double* vel_y, double* acc_x, double* acc_y, int N)
{
int i = threadIdx.x;
if (i < N);
{
pos_x[i] += 1.0;
pos_y[i] += 1.0;
}
}
then it still gives the same value, so either apply_acc isn't being called or the cudaMemcpy isn't copying the data back.
All the cudaMalloc and cudaMemcpy calls return cudaScuccess.
Here's a PasteBin link to the complete code. It should be fairly simple to follow as there's a lot of repetition for the various arrays.
Like I said, I've never written CUDA code before, and I wrote this based on the #2 CUDA example video from NVidia where the guy writes the parallel array addition code. I'm not sure if it makes any difference, but I'm using 2x GTX970's with the latest NVidia drivers and CUDA 7.0 RC, and I chose not to install the bundled drivers when installing CUDA as they were older than what I had.
This won't work:
const int N = 100000;
...
calc_acc<<<1, N>>>(...);
apply_acc<<<1, N>>>(...);
The second parameter of a kernel launch config (<<<...>>>) is the threads per block parameter. It is limited to either 512 or 1024 depending on how you are compiling. These kernels will not launch, and the type of error this produces needs to be caught by using correct CUDA error checking. Simply looking at the return values of subsequent CUDA API functions will not indicate the presence of this type of error (which is why you are seeing cudaSuccess subsequently).
Regarding the concept itself, I suggest you learn more about CUDA thread and block hierarchy. To launch a large number of threads, you need to use both parameters (i.e. niether of the first two parameters should be 1) of the kernel launch config. This is usually advisable from a performance perspective as well.
I'm trying to calculate 2d cellular automata redistribution using Cuda. I'm completely new to it so I have no idea what I do wrong. I've tried many solutions that I've seen here but all give "invalid argument" when I call the kernel.
Here is a simplified version of the kernel:
//kernel definition
__global__ void stepCalc(float B[51][51], int L, int flag, float m, float en)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
int j = blockDim.y * blockIdx.y + threadIdx.y;
float g=B[i][j]-0.25*(B[i+1][j]+B[i-1][j]+B[i][j+1]+B[i][j-1]);
flag = 0;
if (i < L-2 && j < L-2 && i>2 && j>2 && abs(g)>m)
{
flag = 1;
en+=-16*g*g+8*B[i][j]*abs(g);
B[i][j]+=-4*f*g;
B[i+1][j]+=f*g;
B[i-1][j]+=f*g;
B[i][j+1]+=f*g;
B[i][j-1]+=f*g;
}
}
The main function looks like this:
#define L 50
float B[L+1][L+1];
//initialize B[i][j]
float g=0;
int flag = 1;
float m=0.1;
float en = 0;
while (flag==1)
{
float (*dB)[L+1];
int *dFlag=NULL;
float *dEn=NULL;
cudaMalloc((void **)&dFlag,sizeof(int));
cudaMalloc((void **)&dEn,sizeof(float));
cudaMalloc((void **)&dB, ((L+1)*(L+1))*sizeof(float));
cudaMemcpy(dB, B, sizeB, cudaMemcpyHostToDevice);
cudaMemcpy(dFlag, &flag, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(dEn, &en, sizeof(float), cudaMemcpyDeviceToHost);
dim3 threadsPerBlock(16,16);
dim3 numBlocks((L+1)/threadsPerBlock.x,(L+1)/threadsPerBlock.y);
stepCalc<<<numBlocks, threadsPerBlock>>>(dB, L, dflag, m, dEn);
GPUerrchk(cudaPeekAtLastError()); //gives "invalid argument" at this line
cudaMemcpy(B, (dB), sizeB, cudaMemcpyDeviceToHost);
cudaMemcpy(&flag, dFlag, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(&en, dEn, sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(dB);
cudaFree(dFlag);
cudaFree(dEn);
}
I need to extract the new array B, the flag value and the sum 'en' over all threads. Am I even close to how a solution should look? Is it even possible? I've also tried making the host array B as float** B with no luck.
There are various problems with your code.
You may be overlooking the difference between passing a value to a kernel and passing a pointer:
__global__ void stepCalc(float B[51][51], int L, int flag, float m, float en)
^ ^
| |
a pointer a value
we'll come back to B in a moment, but for values like flag and en, passing these by value to a kernel has similar implications to passing by value to a C function. It is a one-way communication path. Since it's evident from your code that you want to use these values modified by the kernel later in host code, you will need to pass pointers, instead. In a few cases, you have already allocated pointers for this purpose, so you have an additional type of error in that in some cases (dFlag) you are passing a pointer whereas the kernel definition expects a value.
Regarding B, passing a 2D array from host to device can be more difficult than you might initially expect, due to the deep copy problem. Without covering all that ground here, search on "CUDA 2D array" in the upper right hand corner of this page, and you'll get a lot of information about it and various ways to deal with it. Since you seem to be willing to consider an array of fixed width (known at compile-time), we can simplify the handling of a 2D array by leveraging the compiler to help us with a particular typedef.
When you're having trouble with a cuda code, it's good practice to do rigorous CUDA error checking throughout your code, not in just one place. One reason for this is that CUDA errors incurred in a particular place will often be returned at any subsequent place in the code. This makes it confusing if you don't check every CUDA API call, as a particular "invalid argument" error might not be due to the kernel itself, but some API call that occurred previously.
You typically don't want cudaMalloc operations in a data-processing while loop. These are normally operations you do once, at the beginning of your code. Doing the cudaMalloc at each iteration of the while-loop has several negative issues, one of which is that you will run out of memory (although you have cudaFree statements, so perhaps not), eventually, and you are effectively throwing away your data at each iteration. Also, it will negatively impact your performance.
You have some of your cudaMemcpy transfer directions wrong, like here:
cudaMemcpy(dFlag, &flag, sizeof(int), cudaMemcpyDeviceToHost);
Setting flag to zero in your kernel code will be problematic. Warps can execute in any order, and after some warps have already set flag to 1 later in the kernel, other warps could begin executing and set flag to zero again. This is probably not what you want. One possible fix is to set flag to zero before executing the kernel (i.e. in host code, and copy it to the device).
Your kernel will generate out-of-bounds indexing here:
float g=B[i][j]-0.25*(B[i+1][j]+B[i-1][j]+B[i][j+1]+B[i][j-1]);
(just ask yourself what happens when i=0 and j=0). The fix for this is to move this line of code inside the if-check you have for bounds checking right after it.
Your kernel uses a variable f which is defined nowhere that I can see, for example here:
B[i+1][j]+=f*g;
The following code is my attempt to rework your code, create a complete example, and remove the above issues. It doesn't do anything useful, but it compiles without errors and runs without errors for me. I haven't provided any data, so it's just a proof-of-concept at this point. I'm sure it still contains data processing errors.
#include <stdio.h>
#define my_L 50
typedef float farray[my_L+1];
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
//kernel definition
__global__ void stepCalc(farray B[], int L, int *flag, float m, float *en)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
int j = blockDim.y * blockIdx.y + threadIdx.y;
//float g=B[i][j]-0.25*(B[i+1][j]+B[i-1][j]+B[i][j+1]+B[i][j-1]);
// flag = 0;
float f = 1.0f;
if (i < L-2 && j < L-2 && i>2 && j>2){
float g=B[i][j]-0.25*(B[i+1][j]+B[i-1][j]+B[i][j+1]+B[i][j-1]);
if (abs(g)>m)
{
*flag = 1;
*en+=-16*g*g+8*B[i][j]*abs(g);
B[i][j]+=-4*f*g;
B[i+1][j]+=f*g;
B[i-1][j]+=f*g;
B[i][j+1]+=f*g;
B[i][j-1]+=f*g;
}
}
}
int main(){
farray B[my_L+1];
//initialize B[i][j]
farray *dB;
int flag = 1;
float m=0.1;
float en = 0;
int *dFlag=NULL;
float *dEn=NULL;
cudaMalloc((void **)&dFlag,sizeof(int));
cudaCheckErrors("1");
cudaMalloc((void **)&dEn,sizeof(float));
cudaCheckErrors("2");
size_t sizeB = (my_L+1)*sizeof(farray);
cudaMalloc((void **)&dB, sizeB);
cudaCheckErrors("3");
cudaMemcpy(dB, B, sizeB, cudaMemcpyHostToDevice);
cudaCheckErrors("4");
cudaMemcpy(dEn, &en, sizeof(float), cudaMemcpyHostToDevice);
cudaCheckErrors("5");
dim3 threadsPerBlock(16,16);
dim3 numBlocks((my_L+1)/threadsPerBlock.x,(my_L+1)/threadsPerBlock.y);
while (flag==1)
{
flag = 0;
cudaMemcpy(dFlag, &flag, sizeof(int), cudaMemcpyHostToDevice);
cudaCheckErrors("6");
stepCalc<<<numBlocks, threadsPerBlock>>>(dB, my_L, dFlag, m, dEn);
cudaDeviceSynchronize();
cudaCheckErrors("7");
cudaMemcpy(&flag, dFlag, sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("8");
}
cudaMemcpy(B, (dB), sizeB, cudaMemcpyDeviceToHost);
cudaCheckErrors("9");
cudaMemcpy(&en, dEn, sizeof(float), cudaMemcpyDeviceToHost);
cudaCheckErrors("10");
// process B
cudaFree(dB);
cudaFree(dFlag);
cudaFree(dEn);
}
I have a question-related to copying structure containing 2D pointer to the device from the host, my code is as follow
struct mymatrix
{
matrix m;
int x;
};
size_t pitch;
mymatrix m_h[5];
for(int i=0; i<5;i++){
m_h[i].m = (float**) malloc(4 * sizeof(float*));
for (int idx = 0; idx < 4; ++idx)
{
m_h[i].m[idx] = (float*)malloc(4 * sizeof(float));
}
}
mymatrix *m_hh = (mymatrix*)malloc(5*sizeof(mymatrix));
memcpy(m_hh,m_h,5*sizeof(mymatrix));
for(int i=0 ; i<5 ;i++)
{
cudaMallocPitch((void**)&(m_hh[i].m),&pitch,4*sizeof(float),4);
cudaMemcpy2D(m_hh[i].m, pitch, m_h[i].m, 4*sizeof(float), 4*sizeof(float),4,cudaMemcpyHostToDevice);
}
mymatrix *m_d;
cudaMalloc((void**)&m_d,5*sizeof(mymatrix));
cudaMemcpy(m_d,m_hh,5*sizeof(mymatrix),cudaMemcpyHostToDevice);
distance_calculation_begins<<<1,16>>>(m_d,pitch);
Problem
With this code I am unable to access 2D pointer elements of the structure, but I can access x from that structure in device. e.g. such as I have receive m_d with pointer mymatrix* m if I initialize
m[0].m[0][0] = 5;
and printing this value such as
cuPrintf("The value is %f",m[0].m[0][0]);
in the device, I get no output. Means I am unable to use 2D pointer, but if I try to access
m[0].x = 5;
then I am able to print this. I think my initializations are correct, but I am unable to figure out the problem. Help from anyone will be greatly appreciated.
In addition to the issues that #RobertCrovella noted on your code, also note:
You are only getting a shallow copy of your structure with the memcpy that copies m_h to m_hh.
You are assuming that pitch is the same in all calls to cudaMemcpy2D() (you overwrite the pitch and use only the latest copy at the end). I think that might be safe assumption for now but it could change in the future.
You are using cudaMemcpyHostToDevice() with cudaMemcpyHostToDevice to copy to m_hh, which is on the host, not the device.
Using many small buffers and tables of pointers is not efficient in CUDA. The small allocations and deallocations can end up taking a lot of time. Also, using tables of pointers cause extra memory transactions because the pointers must be retrieved from memory before they can be used as bases for indexing. So, if you consider a construct such as this:
a[10][20][30] = 3
The pointer at a[10] must first be retrieved from memory, causing your warp to be put on hold for a long time (up to around 600 cycles on Fermi). Then, the same thing happens for the second pointer, adding another 600 cycles. In addition, these requests are unlikely to be coalesced causing even more memory transactions.
As Robert mentioned, the solution is to flatten your memory structures. I've included an example for this, which you may be able to use as a basis for your program. As you can see, the code is overall much simpler. The part that does become a bit more complex is the index calculations. Also, this approach assumes that your matrixes are all of the same size.
I have added error checking as well. If you had added error checking in your code, you would have found at least a couple of the bugs without any extra effort.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
typedef float* mymatrix;
const int n_matrixes(5);
const int w(4);
const int h(4);
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void test(mymatrix m_d, size_t pitch_floats)
{
// Print the value at [2][3][4].
printf("%f ", m_d[3 + (2 * h + 4) * pitch_floats]);
}
int main()
{
mymatrix m_h;
gpuErrchk(cudaMallocHost(&m_h, n_matrixes * w * sizeof(float) * h));
// Set the value at [2][3][4].
m_h[2 * (w * h) + 3 + 4 * w] = 5.0f;
// Create a device copy of the matrix.
mymatrix m_d;
size_t pitch;
gpuErrchk(cudaMallocPitch((void**)&m_d, &pitch, w * sizeof(float), n_matrixes * h));
gpuErrchk(cudaMemcpy2D(m_d, pitch, m_h, w * sizeof(float), w * sizeof(float), n_matrixes * h, cudaMemcpyHostToDevice));
test<<<1,1>>>(m_d, pitch / sizeof(float));
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
}
Your matrix m class/struct member appears to be some sort of double pointer based on how you are initializing it on the host:
m_h[i].m = (float**) malloc(4 * sizeof(float*));
Copying an array of structures with embedded pointers between host and device is somewhat compilicated. Copying a data structure that is pointed to by a double pointer is also complicated.
For an array of structures with embedded pointers, refer to this posting.
For copying a 2D array (double pointer, i.e. **), refer to this posting. We don't use cudaMallocPitch/cudaMemcpy2D to accomplish this. (Note that cudaMemcpy2D takes single pointer * arguments, you are passing it double pointer ** arguments e.g. m_h[i].m)
Instead of the above approaches, it's recommended that you flatten your data so that it can all be referenced with single pointer referencing, with no embedded pointers.
I am trying to do a sample code with constant memory with CUDA 5.5. I have 2 constant arrays of size 3000 each. I have another global array X of size N.
I want to compute
Y[tid] = X[tid]*A[tid%3000] + B[tid%3000]
Here is the code.
#include <iostream>
#include <stdio.h>
using namespace std;
#include <cuda.h>
__device__ __constant__ int A[3000];
__device__ __constant__ int B[3000];
__global__ void kernel( int *dc_A, int *dc_B, int *X, int *out, int N)
{
int tid = threadIdx.x + blockIdx.x*blockDim.x;
if( tid<N )
{
out[tid] = dc_A[tid%3000]*X[tid] + dc_B[tid%3000];
}
}
int main()
{
int N=100000;
// set affine constants on host
int *h_A, *h_B ; //host vectors
h_A = (int*) malloc( 3000*sizeof(int) );
h_B = (int*) malloc( 3000*sizeof(int) );
for( int i=0 ; i<3000 ; i++ )
{
h_A[i] = (int) (drand48() * 10);
h_B[i] = (int) (drand48() * 10);
}
//set X and Y on host
int * h_X = (int*) malloc( N*sizeof(int) );
int * h_out = (int *) malloc( N*sizeof(int) );
//set the vector
for( int i=0 ; i<N ; i++ )
{
h_X[i] = i;
h_out[i] = 0;
}
// copy, A,B,X,Y to device
int * d_X, *d_out;
cudaMemcpyToSymbol( A, h_A, 3000 * sizeof(int) ) ;
cudaMemcpyToSymbol( B, h_B, 3000 * sizeof(int) ) ;
cudaMalloc( (void**)&d_X, N*sizeof(int) ) );
cudaMemcpy( d_X, h_X, N*sizeof(int), cudaMemcpyHostToDevice ) ;
cudaMalloc( (void**)&d_out, N*sizeof(int) ) ;
//call kernel for vector addition
kernel<<< (N+1024)/1024,1024 >>>(A,B, d_X, d_out, N);
cudaPeekAtLastError() ;
cudaDeviceSynchronize() ;
// D --> H
cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost ) ;
free(h_A);
free(h_B);
return 0;
}
I am trying to run the debugger over this code to analyze. Turns out that on the line which copies to constant memory I get the following error with debugger
Coalescing of the CUDA commands output is off.
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5c5b700 (LWP 31200)]
Can somebody please help me out with constant memory
There are several problems here. It is probably easier to start by showing the "correct" way to use those two constant arrays, then explain why what you did doesn't work. So the kernel should look like this:
__global__ void kernel(int *X, int *out, int N)
{
int tid = threadIdx.x + blockIdx.x*blockDim.x;
if( tid<N )
{
out[tid] = A[tid%3000]*X[tid] + B[tid%3000];
}
}
ie. don't try passing A and B to the kernel. The reasons are as follows:
Somewhat confusingly, A and B in host code are not valid device memory addresses. They are host symbols which provide hooks into a runtime device symbol lookup. It is illegal to pass them to a kernel- If you want their device memory address, you must use cudaGetSymbolAddress to retrieve it at runtime.
Even if you did call cudaGetSymbolAddress and retrieve the symbols device addresses in constant memory, you shouldn't pass them to a kernel as an argument, because doing do would not yield uniform memory access in the running kernel. Correct use of constant memory requires the compiler to emit special PTX instructions, and the compiler will only do that when it knows that a particular global memory location is in constant memory. If you pass a constant memory address by value as an argument, the __constant__ property is lost and the compiler can't know to produce the correct load instructions
Once you get this working, you will find it is terribly slow and if you profile it you will find that there is very high degrees of instruction replay and serialization. The whole idea of using constant memory is that you can exploit a constant cache broadcast mechanism in cases when every thread in a warp accesses the same value in constant memory. Your example is the complete opposite of that - every thread is accessing a different value. Regular global memory will be faster in such a use case. Also be aware that the performance of the modulo operator on current GPUs is poor, and you should avoid it wherever possible.
I am trying to implement an algorithm in cuda and I need to allocate an Array of Pointers that point to an Array of Structs. My struct is, lets say:
typedef struct {
float x, y;
} point;
I know that If I want to preserve the arrays for multiple kernel calls I have to control them from the host, is that right? The initialization of the pointers must be done from within the kernel. To be more specific, the Array of Struct P will contain random order of cartesian points while the dev_S_x will be a sorted version as to x coordinate of the points in P.
I have tried with:
__global__ void test( point *dev_P, point **dev_S_x) {
unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
dev_P[tid].x = 3.141516;
dev_P[tid].y = 3.141516;
dev_S_x[tid] = &dev_P[tid];
...
}
and:
int main( void ) {
point *P, *dev_P, **S_x, *dev_S_x;
P = (point*) malloc (N * sizeof (point) );
S_x = (point**) malloc (N * sizeof (point*));
// allocate the memory on the GPU
cudaMalloc( (void**) &dev_P, N * sizeof(point) );
cudaMalloc( (void***) &dev_S_x, N * sizeof(point*));
// copy the array P to the GPU
cudaMemcpy( dev_P, P, N * sizeof(point), cudaMemcpyHostToDevice);
cudaMemcpy( dev_S_x,S_x,N * sizeof(point*), cudaMemcpyHostToDevice);
test <<<1, 1 >>>( dev_P, &dev_S_x);
...
return 0;
}
which leads to many
First-chance exception at 0x000007fefcc89e5d (KernelBase.dll) in Test_project_cuda.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0020f920..
Critical error detected c0000374
Am I doing something wrong in the cudamalloc of the array of pointers or is it something else? Is the usage of (void***) correct? I would like to use for example dev_S_x[tid]->x or dev_S_x[tid]->y from within the kernels pointing to device memory addresses. Is that feasible?
Thanks in advance
dev_S_x should be declared as point ** and should be passed to the kernel as a value (i.e. test <<<1, 1 >>>(dev_P, dev_S_x);).
Putting that to one side, what you describe sounds like a natural fit for Thrust, which will give you a simpler memory management strategy and access to fast sort routines.