cudaMallocPitch and cudaMemcpy2D - c++

I get an error when transferring a C++ 2D array into a CUDA 1D array.
Let me show my source code:
int main(void)
{
    float h_arr[1024][256];
    float *d_arr;

    // --- Some code to populate h_arr

    // --- cudaMallocPitch
    size_t pitch;
    cudaMallocPitch((void**)&d_arr, &pitch, 256, 1024);

    // --- Copy array to device
    cudaMemcpy2D(d_arr, pitch, h_arr, 256, 256, 1024, cudaMemcpyHostToDevice);
}
When I run the code, it pops up an error.
How do I use cudaMallocPitch() and cudaMemcpy2D() properly?

Talonmies has already satisfactorily answered this question. Here is some further explanation that could be useful to the community.
When accessing 2D arrays in CUDA, memory transactions are much faster if each row is properly aligned.
CUDA provides the cudaMallocPitch function to "pad" 2D matrix rows with extra bytes so as to achieve the desired alignment. Please refer to the "CUDA C Programming Guide", Sections 3.2.2 and 5.3.2, for more information.
Assuming that we want to allocate a 2D padded array of floating point (single precision) elements, the syntax for cudaMallocPitch is the following:
cudaMallocPitch(&devPtr, &devPitch, Ncols * sizeof(float), Nrows);
where
devPtr is an output pointer to float (float *devPtr).
devPitch is a size_t output variable denoting the length, in bytes, of the padded row.
Nrows and Ncols are size_t input variables representing the matrix size.
Recalling that C/C++ and CUDA store 2D matrices by row, cudaMallocPitch will allocate a memory space of size, in bytes, equal to Nrows * pitch. However, only the first Ncols * sizeof(float) bytes of each row will contain the matrix data. Accordingly, cudaMallocPitch consumes more memory than strictly necessary for the 2D matrix storage, but this is repaid by more efficient memory accesses.
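As a concrete illustration, below is a minimal sketch (the 1024 x 256 dimensions are assumed from the question above) that allocates a pitched matrix and prints how much padding cudaMallocPitch added to each row:
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t Nrows = 1024, Ncols = 256;
    float *devPtr = nullptr;
    size_t pitch = 0;

    // Request Ncols floats per row; the driver may round each row up to 'pitch' bytes.
    cudaMallocPitch((void**)&devPtr, &pitch, Ncols * sizeof(float), Nrows);

    printf("requested row: %zu bytes, pitch: %zu bytes, padding: %zu bytes\n",
           Ncols * sizeof(float), pitch, pitch - Ncols * sizeof(float));
    printf("total allocation: %zu bytes\n", Nrows * pitch);

    cudaFree(devPtr);
    return 0;
}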
CUDA also provides the cudaMemcpy2D function to copy data from/to host memory space to/from device memory space allocated with cudaMallocPitch. Under the above hypotheses (single precision 2D matrix), the syntax is the following:
cudaMemcpy2D(devPtr, devPitch, hostPtr, hostPitch, Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice)
where
devPtr and hostPtr are input pointers to float (float *devPtr and float *hostPtr) pointing to the (destination) device and (source) host memory spaces, respectively;
devPitch and hostPitch are size_t input variables denoting the length, in bytes, of the padded rows for the device and host memory spaces, respectively;
Nrows and Ncols are size_t input variables representing the matrix size.
Note that cudaMemcpy2D also allows for pitched memory allocation on the host side. If the host memory has no pitch, then hostPitch = Ncols * sizeof(float). Furthermore, cudaMemcpy2D is bidirectional. In the above example, we are copying data from host to device. If we want to copy data from device to host, then the above line changes to
cudaMemcpy2D(hostPtr, hostPitch, devPtr, devPitch, Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost)
The access to elements of a 2D matrix allocated by cudaMallocPitch can be performed as in the following example:
int tidx = blockIdx.x*blockDim.x + threadIdx.x;
int tidy = blockIdx.y*blockDim.y + threadIdx.y;

if ((tidx < Ncols) && (tidy < Nrows))
{
    float *row_a = (float *)((char*)devPtr + tidy * pitch);
    row_a[tidx] = row_a[tidx] * tidx * tidy;
}
In this example, tidx and tidy are used as column and row indices, respectively (remember that, in CUDA, x-threads span the columns and y-threads span the rows to favor coalescing). The pointer to the first element of a row is calculated by offsetting the initial pointer devPtr by tidy * pitch bytes (char * is a pointer to bytes and sizeof(char) is 1 byte), where the length of each row is given by the pitch information.
Below, I'm providing a fully worked example to show these concepts.
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

#define BLOCKSIZE_x 16
#define BLOCKSIZE_y 16

#define Nrows 3
#define Ncols 5

/*****************/
/* CUDA MEMCHECK */
/*****************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

/*******************/
/* iDivUp FUNCTION */
/*******************/
int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); }

/******************/
/* TEST KERNEL 2D */
/******************/
__global__ void test_kernel_2D(float *devPtr, size_t pitch)
{
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;

    if ((tidx < Ncols) && (tidy < Nrows))
    {
        float *row_a = (float *)((char*)devPtr + tidy * pitch);
        row_a[tidx] = row_a[tidx] * tidx * tidy;
    }
}

/********/
/* MAIN */
/********/
int main()
{
    float hostPtr[Nrows][Ncols];
    float *devPtr;
    size_t pitch;

    for (int i = 0; i < Nrows; i++)
        for (int j = 0; j < Ncols; j++) {
            hostPtr[i][j] = 1.f;
            //printf("row %i column %i value %f \n", i, j, hostPtr[i][j]);
        }

    // --- 2D pitched allocation and host->device memcopy
    gpuErrchk(cudaMallocPitch((void**)&devPtr, &pitch, Ncols * sizeof(float), Nrows));
    gpuErrchk(cudaMemcpy2D(devPtr, pitch, hostPtr, Ncols * sizeof(float), Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice));

    dim3 gridSize(iDivUp(Ncols, BLOCKSIZE_x), iDivUp(Nrows, BLOCKSIZE_y));
    dim3 blockSize(BLOCKSIZE_x, BLOCKSIZE_y);

    test_kernel_2D<<<gridSize, blockSize>>>(devPtr, pitch);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy2D(hostPtr, Ncols * sizeof(float), devPtr, pitch, Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost));

    for (int i = 0; i < Nrows; i++)
        for (int j = 0; j < Ncols; j++)
            printf("row %i column %i value %f \n", i, j, hostPtr[i][j]);

    return 0;
}

The cudaMallocPitch call you have written looks ok, but this:
cudaMemcpy2D(d_arr, pitch, h_arr, 256, 256, 1024, cudaMemcpyHostToDevice);
is incorrect. Quoting from the documentation:
Copies a matrix (height rows of width bytes each) from the memory area
pointed to by src to the memory area pointed to by dst, where kind is
one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice,
cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the
direction of the copy. dpitch and spitch are the widths in memory in
bytes of the 2D arrays pointed to by dst and src, including any
padding added to the end of each row. The memory areas may not
overlap. width must not exceed either dpitch or spitch. Calling
cudaMemcpy2D() with dst and src pointers that do not match the
direction of the copy results in an undefined behavior. cudaMemcpy2D()
returns an error if dpitch or spitch exceeds the maximum allowed.
So the source pitch and width to copy must be specified in bytes. Your host matrix has a pitch of 256 * sizeof(float) bytes, and because the source pitch and the width of the source you will copy are the same, your cudaMemcpy2D call should look like:
cudaMemcpy2D(d_arr, pitch, h_arr, 256*sizeof(float),
256*sizeof(float), 1024, cudaMemcpyHostToDevice);
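Putting it together, a minimal corrected sketch of the original program might look like the following (the error checks are an addition, and the host array is made static here as a precaution against small default stack sizes; neither is in the original question):
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    static float h_arr[1024][256];   // static: ~1 MiB, too large for some stacks
    float *d_arr = nullptr;
    size_t pitch = 0;

    cudaError_t err = cudaMallocPitch((void**)&d_arr, &pitch,
                                      256 * sizeof(float), 1024);
    if (err != cudaSuccess) { printf("%s\n", cudaGetErrorString(err)); return 1; }

    // Both the source pitch and the copy width are given in bytes.
    err = cudaMemcpy2D(d_arr, pitch, h_arr, 256 * sizeof(float),
                       256 * sizeof(float), 1024, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) { printf("%s\n", cudaGetErrorString(err)); return 1; }

    cudaFree(d_arr);
    return 0;
}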

Related

CUDA C++ Pointer Typecasting

I was looking at the CUDA C++ documentation, but there is something I didn't get about pointer typecasting. Below are the host and device code.
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
                width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr,
                         size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
As you can see, devPtr is cast to char. But I don't get why it is cast to char rather than incremented as a float pointer.
This is to handle a pitched allocation (the type created by cudaMallocPitch()).
A pitched allocation "rounds up" the requested width of the allocation to a particular pitch, and this pitch is specified in bytes:
cudaMallocPitch(&devPtr, &pitch,
                         ^
                         |
                         this value is returned by the function as the row width, or "pitch", in bytes
Because the pitch is specified in bytes, to get proper pointer arithmetic:
((char*)devPtr + r * pitch);
^
|
pointer arithmetic
the pointer type must also be a byte type. The objective of that code snippet is to increment devPtr by a number of rows given by r, where each row consists of pitch bytes.
AFAIK, in CUDA, there is nothing that guarantees any particular granularity of pitch as returned by cudaMallocPitch. It is theoretically possible for it to be an odd number of bytes, or a prime number of bytes, for example. So playing tricks to pre-convert the pitch value to an equivalent (pointer arithmetic) offset in other type-widths would be frowned on.
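To avoid repeating the cast at every access, the byte arithmetic can be wrapped in a small device-side helper. This is a sketch of that idea, not code from the original answer (the helper name and the example kernel are made up here):
// Returns a pointer to the start of row 'row' of a pitched float allocation.
__device__ __forceinline__ float* pitched_row(float* base, size_t pitch, int row)
{
    // pitch is in bytes, so step through a char* and cast back to float*.
    return reinterpret_cast<float*>(reinterpret_cast<char*>(base) + row * pitch);
}

__global__ void scale_rows(float* devPtr, size_t pitch, int width, int height)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < height && c < width)
        pitched_row(devPtr, pitch, r)[c] *= 2.0f;   // example element access
}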

How to get the real and imaginary parts of a complex matrix separately in CUDA?

I'm trying to get the FFT of a 2D array. The input is an NxM real matrix, so the output is also an NxM matrix (the 2xNxM complex output is saved in an NxM matrix using the Hermitian-symmetry property).
So I want to know whether there is a method in CUDA to extract the real and imaginary matrices separately. In OpenCV the split function does this duty, so I'm looking for a similar function in CUDA, but I couldn't find one yet.
Given below is my complete code:
#define NRANK 2
#define BATCH 10

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cufft.h>
#include <stdio.h>
#include <iostream>
#include <vector>

using namespace std;

int main()
{
    const size_t NX = 4;
    const size_t NY = 5;

    // Input array - host side
    float b[NX][NY] = {
        {0.7943 , 0.6020 , 0.7482 , 0.9133 , 0.9961},
        {0.3112 , 0.2630 , 0.4505 , 0.1524 , 0.0782},
        {0.5285 , 0.6541 , 0.0838 , 0.8258 , 0.4427},
        {0.1656 , 0.6892 , 0.2290 , 0.5383 , 0.1067}
    };

    // Output array - host side
    float c[NX][NY] = { 0 };

    cufftHandle plan;
    cufftComplex *data; // Holds both the input and the output - device side
    int n[NRANK] = {NX, NY};

    // Allocate memory and copy from host to device
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*(NY/2+1));
    for (int i = 0; i < NX; ++i) {
        // Done row by row because my actual array is dynamically allocated;
        // here it has been replaced with a static 2D array to keep things simple.
        cudaMemcpy(reinterpret_cast<float*>(data) + i*NY, b[i], sizeof(float)*NY, cudaMemcpyHostToDevice);
    }

    // Perform the fft
    cufftPlanMany(&plan, NRANK, n, NULL, 1, 0, NULL, 1, 0, CUFFT_R2C, BATCH);
    cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE);
    cufftExecR2C(plan, (cufftReal*)data, data);
    cudaThreadSynchronize();
    cudaMemcpy(c, data, sizeof(float)*NX*NY, cudaMemcpyDeviceToHost);

    // Here c is an NxM matrix. I want to split it into 2 separate NxM matrices,
    // one holding the real component and one the imaginary component of the output.

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
EDIT
As suggested by JackOLantern, I modified the code as below, but the problem is still not solved.
float real_vec[NX][NY] = {0}; // host vector, real part
float imag_vec[NX][NY] = {0}; // host vector, imaginary part
cudaError cudaStat1 = cudaMemcpy2D (real_vec, sizeof(real_vec[0]), data, sizeof(data[0]),NY*sizeof(float2), NX, cudaMemcpyDeviceToHost);
cudaError cudaStat2 = cudaMemcpy2D (imag_vec, sizeof(imag_vec[0]),data + 1, sizeof(data[0]),NY*sizeof(float2), NX, cudaMemcpyDeviceToHost);
The error I get is 'invalid pitch argument'. But I can't understand why: for the destination I use a pitch of size 'float', while for the source I use a pitch of size 'float2'.
Your question and your code do not make much sense to me.
You are performing a batched FFT, but it seems you are not providing enough memory space for either the input or the output data;
The output of cufftExecR2C is an NX*(NY/2+1) float2 matrix, which can be interpreted as an NX*(NY+2) float matrix. Accordingly, you are not allocating enough space for c (which holds only NX*NY floats) for the last cudaMemcpy. You would still need one complex memory location for the DC (continuous) component of the output;
Your question does not really seem to be related to the cufftExecR2C command, but is much more general: how can I split a complex NX*NY matrix into 2 NX*NY real matrices containing the real and imaginary parts, respectively?
If I correctly interpret your question, then the solution proposed by @njuffa at
Copying data to "cufftComplex" data struct?
could be a good clue for you.
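Regarding the second point above, here is a minimal sketch (dimensions assumed from the question) of host and device allocations sized for the R2C output layout rather than for an NX x NY float matrix:
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const size_t NX = 4, NY = 5;

    // Device buffer sized for the R2C output: NX * (NY/2 + 1) complex values.
    cufftComplex *data;
    cudaMalloc((void**)&data, sizeof(cufftComplex) * NX * (NY/2 + 1));

    // Host buffer with the same complex layout, instead of a float c[NX][NY].
    float2 c[NX][NY/2 + 1];
    cudaMemcpy(c, data, sizeof(cufftComplex) * NX * (NY/2 + 1), cudaMemcpyDeviceToHost);

    cudaFree(data);
    return 0;
}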
EDIT
In the following is a small example of how to "assemble" and "disassemble" the real and imaginary parts of complex vectors when copying them between host and device. Please add your own CUDA error checking.
#include <stdio.h>

#define N 16

int main() {

    // Declaring, allocating and initializing a complex host vector
    float2* b = (float2*)malloc(N*sizeof(float2));
    printf("ORIGINAL DATA\n");
    for (int i = 0; i < N; i++) {
        b[i].x = (float)i;
        b[i].y = 2.f*(float)i;
        printf("%f %f\n", b[i].x, b[i].y);
    }
    printf("\n\n");

    // Declaring and allocating a complex device vector
    float2 *data;
    cudaMalloc((void**)&data, sizeof(float2)*N);

    // Copying the complex host vector to device
    cudaMemcpy(data, b, N*sizeof(float2), cudaMemcpyHostToDevice);

    // Declaring and allocating space on the host for the real and imaginary parts of the complex vector
    float* cr = (float*)malloc(N*sizeof(float));
    float* ci = (float*)malloc(N*sizeof(float));

    /*******************************************************************/
    /* DISASSEMBLING THE COMPLEX DATA WHEN COPYING FROM DEVICE TO HOST */
    /*******************************************************************/
    float* tmp_d = (float*)data;
    cudaMemcpy2D(cr, sizeof(float), tmp_d,     2*sizeof(float), sizeof(float), N, cudaMemcpyDeviceToHost);
    cudaMemcpy2D(ci, sizeof(float), tmp_d + 1, 2*sizeof(float), sizeof(float), N, cudaMemcpyDeviceToHost);

    printf("DISASSEMBLED REAL AND IMAGINARY PARTS\n");
    for (int i = 0; i < N; i++)
        printf("cr[%i] = %f; ci[%i] = %f\n", i, cr[i], i, ci[i]);
    printf("\n\n");

    /******************************************************************************/
    /* REASSEMBLING THE REAL AND IMAGINARY PARTS WHEN COPYING FROM HOST TO DEVICE */
    /******************************************************************************/
    cudaMemcpy2D(tmp_d,     2*sizeof(float), cr, sizeof(float), sizeof(float), N, cudaMemcpyHostToDevice);
    cudaMemcpy2D(tmp_d + 1, 2*sizeof(float), ci, sizeof(float), sizeof(float), N, cudaMemcpyHostToDevice);

    // Copying the complex device vector back to host
    cudaMemcpy(b, data, N*sizeof(float2), cudaMemcpyDeviceToHost);

    printf("REASSEMBLED DATA\n");
    for (int i = 0; i < N; i++)
        printf("%f %f\n", b[i].x, b[i].y);
    printf("\n\n");

    getchar();
    return 0;
}
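As an alternative to the two strided cudaMemcpy2D calls, the split can also be done on the device with a simple kernel; this is a sketch of my own, not part of the answer above (d_re and d_im are assumed device buffers of N floats each):
// Splits an interleaved complex array into separate real and imaginary arrays.
__global__ void split_complex(const float2* in, float* re, float* im, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        re[i] = in[i].x;   // real part
        im[i] = in[i].y;   // imaginary part
    }
}

// Example launch for N elements:
// split_complex<<<(N + 255) / 256, 256>>>(data, d_re, d_im, N);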

Copying a dynamically allocated 2D array from host to device in CUDA

I want to copy a dynamically allocated 2D array from host to device to get its Discrete Fourier Transform.
I'm using the code below to copy the array to the device:
cudaMalloc((void**)&array_d, sizeof(cufftComplex)*NX*(NY/2+1));
cudaMemcpy(array_d, array_h, sizeof(float)*NX*NY, cudaMemcpyHostToDevice);
This works fine with static arrays; I get the intended output from my FFT.
But it doesn't work with dynamic arrays. After a little bit of searching I learnt that I cannot copy dynamic arrays like this from host to device, so I found this solution:
cudaMalloc((void**)&array_d, sizeof(cufftComplex)*NX*(NY/2+1));
for (int i = 0; i < NX; ++i) {
    cudaMemcpy(array_d + i*NY, array_h[i], sizeof(float)*NY, cudaMemcpyHostToDevice);
}
But it's also not doing the task properly, since I get wrong values from my FFT.
Given below is my FFT code:
cufftPlanMany(&plan, NRANK, n, NULL, 1, 0, NULL, 1, 0, CUFFT_R2C, BATCH);
cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE);
cufftExecR2C(plan, (cufftReal*)data, data);
cudaThreadSynchronize();
cudaMemcpy(c, data, sizeof(float)*NX*NY, cudaMemcpyDeviceToHost);
How can I overcome this problem ?
EDIT
Given below is the code:
#define NX 4
#define NY 5
#define NRANK 2
#define BATCH 10

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cufft.h>
#include <stdio.h>
#include <iostream>

int check();

int main()
{
    // static array
    float b[NX][NY] = {
        {0.7943 , 0.6020 , 0.7482 , 0.9133 , 0.9961},
        {0.3112 , 0.2630 , 0.4505 , 0.1524 , 0.0782},
        {0.5285 , 0.6541 , 0.0838 , 0.8258 , 0.4427},
        {0.1656 , 0.6892 , 0.2290 , 0.5383 , 0.1067}
    };

    // dynamic array
    float **a = new float*[NX];
    for (int r = 0; r < NX; ++r)
    {
        a[r] = new float[NY];
        for (int c = 0; c < NY; ++c)
        {
            a[r][c] = b[r][c];
        }
    }

    // array to store the results - host side
    float c[NX][NY] = { 0 };

    cufftHandle plan;
    cufftComplex *data;
    int n[NRANK] = {NX, NY};

    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*(NY/2+1));
    cudaMemcpy(data, b, sizeof(float)*NX*NY, cudaMemcpyHostToDevice);

    /* Create a 2D FFT plan. */
    cufftPlanMany(&plan, NRANK, n, NULL, 1, 0, NULL, 1, 0, CUFFT_R2C, BATCH);
    cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE);
    cufftExecR2C(plan, (cufftReal*)data, data);
    cudaThreadSynchronize();
    cudaMemcpy(c, data, sizeof(float)*NX*NY, cudaMemcpyDeviceToHost);

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
data is of type cufftComplex, which is a series of typedefs eventually resulting in a float2. That means data + n will advance data by n objects of type float2, or by 2 * n objects of type float. This makes your "dynamic array" copying incorrect; you have to halve the increment of data.
EDIT
Looking at the parameter types of cufftExecR2C(), I think this should work:
for (int i = 0; i < NX; ++i) {
    cudaMemcpy(reinterpret_cast<float*>(data) + i*NY, a[i], sizeof(float)*NY, cudaMemcpyHostToDevice);
}
Side note: you don't actually have a dynamic 2D array (that would be new float[NX * NY]). What you have is a dynamic array of pointers to dynamic arrays of floats. I believe it would make more sense for you to use a true 2D array instead, which would allow you to keep the static-case copy code as well.
And since you've tagged this C++, you should seriously consider using std::vector instead of managing your dynamic memory manually. That is, change a like this:
std::vector<float> a(NX * NY);
And while you're at it, I'd suggest turning NX, NY etc. from macros to constants:
const size_t NX = 4;
const size_t NY = 5;
etc.
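Putting those suggestions together, here is a hedged sketch of the contiguous-storage approach (NX and NY are reused from the question; the per-row copy loop disappears because the data is now one flat block):
#include <vector>
#include <cuda_runtime.h>

const size_t NX = 4;
const size_t NY = 5;

int main()
{
    // One contiguous block; element (r, c) lives at index r * NY + c.
    std::vector<float> a(NX * NY);
    for (size_t r = 0; r < NX; ++r)
        for (size_t c = 0; c < NY; ++c)
            a[r * NY + c] = 1.0f;   // placeholder values

    float* array_d = nullptr;
    cudaMalloc((void**)&array_d, sizeof(float) * NX * NY);

    // A single plain copy, exactly like the static-array case.
    cudaMemcpy(array_d, a.data(), sizeof(float) * NX * NY, cudaMemcpyHostToDevice);

    cudaFree(array_d);
    return 0;
}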

CUDA - cudaMallocPitch and cudaMemcpy2D use, Error: InvalidValue, InvalidPitchValue

Okay, so I'm trying to get a 2D array for CUDA to work on, but it's becoming a pain. The errors are in the title and occur at the cudaMemcpy2D calls. I think the problem is obvious to trained eyes. Thank you in advance for any help; I've stepped ahead of my class, which is currently learning pointers.
#include <cuda_runtime.h>
#include <iostream>
#pragma comment (lib, "cudart")

/* Program purpose: pass a 10 x 10 matrix and multiply it by another 10x10 matrix */

float matrix1_host[100][100];
float matrix2_host[100][100];
float* matrix1_device;
float* matrix2_device;
size_t pitch;
cudaError_t err;

__global__ void addMatrix(float* matrix1_device, float* matrix2_device, size_t pitch) {
    // How this works:
    // first we cycle through the rows by using the thread's ID,
    // then we calculate the address of each row from a base address by adding the pitch
    // (size of each row in bytes) multiplied by the number of rows already completed;
    // with the address of the start of a row we can read the columns with a normal array access.
    int r = threadIdx.x;
    float* rowofMat1 = (float*)((char*)matrix1_device + r * pitch);
    float* rowofMat2 = (float*)((char*)matrix2_device + r * pitch);
    for (int c = 0; c < 100; ++c) {
        rowofMat1[c] += rowofMat2[c];
    }
}

void initCuda() {
    err = cudaMallocPitch((void**)matrix1_device, &pitch, 100 * sizeof(float), 100);
    err = cudaMallocPitch((void**)matrix2_device, &pitch, 100 * sizeof(float), 100);
    //err = cudaMemcpy(matrix1_device, matrix1_host, 100*100*sizeof(float), cudaMemcpyHostToDevice);
    //err = cudaMemcpy(matrix2_device, matrix2_host, 100*100*sizeof(float), cudaMemcpyHostToDevice);
    err = cudaMemcpy2D(matrix1_device, 100*sizeof(float), matrix1_host, pitch, 100*sizeof(float), 100, cudaMemcpyHostToDevice);
    err = cudaMemcpy2D(matrix2_device, 100*sizeof(float), matrix2_host, pitch, 100*sizeof(float), 100, cudaMemcpyHostToDevice);
}

void populateArrays() {
    for (int x = 0; x < 100; x++) {
        for (int y = 0; y < 100; y++) {
            matrix1_host[x][y] = (float) x + y;
            matrix2_host[y][x] = (float) x + y;
        }
    }
}

void runCuda() {
    dim3 dimBlock ( 100 );
    dim3 dimGrid ( 1 );
    addMatrix<<<dimGrid, dimBlock>>>(matrix1_device, matrix2_device, 100*sizeof(float));
    //err = cudaMemcpy(matrix1_host, matrix1_device, 100*100*sizeof(float), cudaMemcpyDeviceToHost);
    err = cudaMemcpy2D(matrix1_host, 100*sizeof(float), matrix1_device, pitch, 100*sizeof(float), 100, cudaMemcpyDeviceToHost);
    //cudaMemcpy(matrix1_host, matrix1_device, 100*100*sizeof(float), cudaMemcpyDeviceToHost);
}

void cleanCuda() {
    err = cudaFree(matrix1_device);
    err = cudaFree(matrix2_device);
    err = cudaDeviceReset();
}

int main() {
    populateArrays();
    initCuda();
    runCuda();
    cleanCuda();
    std::cout << cudaGetErrorString(cudaGetLastError());
    system("pause");
    return 0;
}
First of all, in general you should have a separate pitch variable for matrix1 and matrix2. In this case they will be the same value returned from the API call to cudaMallocPitch, but in the general case they may not be.
In your cudaMemcpy2D line, the second parameter to the call is the destination pitch. This is just the pitch value that was returned when you did the cudaMallocPitch call for this particular destination matrix (ie. the first parameter).
The fourth parameter is the source pitch. Since this was allocated with an ordinary host allocation, it has no pitch other than its width in bytes.
So you have your second and fourth parameters swapped.
So instead of this:
err = cudaMemcpy2D(matrix1_device, 100*sizeof(float), matrix1_host, pitch, 100*sizeof(float), 100, cudaMemcpyHostToDevice);
try this:
err = cudaMemcpy2D(matrix1_device, pitch, matrix1_host, 100*sizeof(float), 100*sizeof(float), 100, cudaMemcpyHostToDevice);
and similarly for the second call to cudaMemcpy2D. The third call is actually OK since it's going in the opposite direction; the source and destination matrices are swapped, so they line up with your pitch parameters correctly.
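For completeness, a corrected sketch of initCuda follows. Note one extra fix beyond the parameter swap discussed above: cudaMallocPitch needs the address of each device pointer, so (void**)matrix1_device in the original must become (void**)&matrix1_device (this point is my addition, not part of the answer above).
void initCuda() {
    // Pass the address of the device pointers so cudaMallocPitch can set them.
    err = cudaMallocPitch((void**)&matrix1_device, &pitch, 100 * sizeof(float), 100);
    err = cudaMallocPitch((void**)&matrix2_device, &pitch, 100 * sizeof(float), 100);

    // Destination (device) pitch first, then the host row width in bytes.
    err = cudaMemcpy2D(matrix1_device, pitch, matrix1_host, 100 * sizeof(float),
                       100 * sizeof(float), 100, cudaMemcpyHostToDevice);
    err = cudaMemcpy2D(matrix2_device, pitch, matrix2_host, 100 * sizeof(float),
                       100 * sizeof(float), 100, cudaMemcpyHostToDevice);
}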

Casting float* to char* while looping over a 2-D array in linear memory on device

On page 21 of the CUDA 4.0 programming guide there is an example (given below) that illustrates looping over the elements of a 2D array of floats in device memory. The dimensions of the 2D array are width * height:
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
                width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r)
    {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c)
        {
            float element = row[c];
        }
    }
}
Why has the devPtr device memory pointer been cast to a character pointer, char*, in the global kernel function? Can someone explain that line please? It looks a bit weird.
This is due to the way pointer arithmetic works in C. When you add an integer x to a pointer p, it doesn't always add x bytes. It adds x times sizeof(*p) (the size of the object to which p points).
float* row = (float*)((char*)devPtr + r * pitch);
By casting devPtr to a char*, the offset that is applied (r * pitch) is in 1-byte increments (because a char is one byte). Had the cast not been there, the offset applied to devPtr would be r * pitch times 4 bytes, as a float is four bytes.
For example, if we have:
float* devPtr = (float*)1000;  // assume an address of 1000, purely for illustration
int r = 4;
Now, let's leave out the cast:
float* result1 = (devPtr + r);
// result1 = devPtr + (r * sizeof(float)) = 1016;
Now, if we include the cast:
float* result2 = (float*)((char*)devPtr + r);
// result2 = devPtr + (r * sizeof(char)) = 1004;
The cast is just to make the pointer arithmetic work right:
(float*)((char*)devPtr + r * pitch);
moves r*pitch bytes forward while
(float*)(devPtr + r * pitch);
would move r*pitch floats forward (i.e., 4 times as many bytes).
*(devPtr + 1) will offset the pointer by 4 bytes (sizeof(float)) before the * dereferences it.
*((char*)devPtr + 1) will offset the pointer by 1 byte (sizeof(char)) before the * dereferences it.
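To see these two strides concretely, here is a tiny host-side demo (a sketch added for illustration, not from the original answers):
#include <cstdio>

int main()
{
    float buf[8] = {0};
    float* p = buf;

    // Pointer arithmetic is scaled by the pointee size:
    printf("float step: %td bytes\n", (char*)(p + 1) - (char*)p);   // prints 4 (sizeof(float))
    printf("char  step: %td bytes\n", ((char*)p + 1) - (char*)p);   // prints 1

    return 0;
}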