CUDA - cudaMallocPitch and cudaMemcpy2D use, Error: InvalidValue, InvalidPitchValue - c++

Okay, so I'm trying to get a 2D array for CUDA to work on, but it's becoming a pain. The errors are in the title and occur at the cudaMemcpy2D call. I think the problem will be obvious to trained eyes. Thanks in advance for any help; I've stepped ahead of my class, which is currently learning pointers.
#include <cuda_runtime.h>
#include <iostream>
#pragma comment (lib, "cudart")

/* Program purpose: pass a 10 x 10 matrix and multiply it by another 10x10 matrix */

float matrix1_host[100][100];
float matrix2_host[100][100];
float* matrix1_device;
float* matrix2_device;
size_t pitch;
cudaError_t err;

__global__ void addMatrix(float* matrix1_device, float* matrix2_device, size_t pitch){
    // How this works:
    // First we pick a row using the thread's ID. Then we compute the address of the start of
    // that row by offsetting the base pointer by the pitch (the size of each row in bytes)
    // times the number of rows already passed. With that row pointer we can read the columns
    // with a normal array index.
    int r = threadIdx.x;
    float* rowofMat1 = (float*)((char*)matrix1_device + r * pitch);
    float* rowofMat2 = (float*)((char*)matrix2_device + r * pitch);
    for (int c = 0; c < 100; ++c) {
        rowofMat1[c] += rowofMat2[c];
    }
}
void initCuda(){
    err = cudaMallocPitch((void**)&matrix1_device, &pitch, 100 * sizeof(float), 100);
    err = cudaMallocPitch((void**)&matrix2_device, &pitch, 100 * sizeof(float), 100);
    //err = cudaMemcpy(matrix1_device, matrix1_host, 100*100*sizeof(float), cudaMemcpyHostToDevice);
    //err = cudaMemcpy(matrix2_device, matrix2_host, 100*100*sizeof(float), cudaMemcpyHostToDevice);
    err = cudaMemcpy2D(matrix1_device, 100*sizeof(float), matrix1_host, pitch, 100*sizeof(float), 100, cudaMemcpyHostToDevice);
    err = cudaMemcpy2D(matrix2_device, 100*sizeof(float), matrix2_host, pitch, 100*sizeof(float), 100, cudaMemcpyHostToDevice);
}
void populateArrays(){
    for(int x = 0; x < 100; x++){
        for(int y = 0; y < 100; y++){
            matrix1_host[x][y] = (float) x + y;
            matrix2_host[y][x] = (float) x + y;
        }
    }
}
void runCuda(){
    dim3 dimBlock(100);
    dim3 dimGrid(1);
    addMatrix<<<dimGrid, dimBlock>>>(matrix1_device, matrix2_device, 100*sizeof(float));
    //err = cudaMemcpy(matrix1_host, matrix1_device, 100*100*sizeof(float), cudaMemcpyDeviceToHost);
    err = cudaMemcpy2D(matrix1_host, 100*sizeof(float), matrix1_device, pitch, 100*sizeof(float), 100, cudaMemcpyDeviceToHost);
    //cudaMemcpy(matrix1_host, matrix1_device, 100*100*sizeof(float), cudaMemcpyDeviceToHost);
}

void cleanCuda(){
    err = cudaFree(matrix1_device);
    err = cudaFree(matrix2_device);
    err = cudaDeviceReset();
}

int main(){
    populateArrays();
    initCuda();
    runCuda();
    cleanCuda();
    std::cout << cudaGetErrorString(cudaGetLastError());
    system("pause");
    return 0;
}

First of all, in general you should have a separate pitch variable for matrix1 and matrix2. In this case they will be the same value returned from the API call to cudaMallocPitch, but in the general case they may not be.
In your cudaMemcpy2D line, the second parameter to the call is the destination pitch. This is just the pitch value that was returned when you did the cudaMallocPitch call for this particular destination matrix (i.e., the first parameter).
The fourth parameter is the source pitch. Since this was allocated with an ordinary host allocation, it has no pitch other than its width in bytes.
So you have your second and fourth parameters swapped.
so instead of this:
err = cudaMemcpy2D(matrix1_device, 100*sizeof(float), matrix1_host, pitch, 100*sizeof(float), 100, cudaMemcpyHostToDevice);
try this:
err = cudaMemcpy2D(matrix1_device, pitch, matrix1_host, 100*sizeof(float), 100*sizeof(float), 100, cudaMemcpyHostToDevice);
and similarly for the second call to cudaMemcpy2D. The third call is actually OK: since it's going in the opposite direction, the source and destination matrices are swapped, so they line up with your pitch parameters correctly.
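Putting the pieces together, a minimal sketch of a corrected initCuda might look like this (the separate matrix2_pitch name is mine, for illustration; the actual fix is just the swap described above):

size_t matrix1_pitch, matrix2_pitch;    // one pitch per allocation

void initCuda(){
    // cudaMallocPitch reports the (possibly padded) row width of each allocation, in bytes
    err = cudaMallocPitch((void**)&matrix1_device, &matrix1_pitch, 100 * sizeof(float), 100);
    err = cudaMallocPitch((void**)&matrix2_device, &matrix2_pitch, 100 * sizeof(float), 100);

    // destination pitch is the second argument, source pitch the fourth;
    // the host arrays are densely packed, so their pitch is just 100 * sizeof(float)
    err = cudaMemcpy2D(matrix1_device, matrix1_pitch, matrix1_host, 100*sizeof(float),
                       100*sizeof(float), 100, cudaMemcpyHostToDevice);
    err = cudaMemcpy2D(matrix2_device, matrix2_pitch, matrix2_host, 100*sizeof(float),
                       100*sizeof(float), 100, cudaMemcpyHostToDevice);
}

For the same reason, the kernel launch in runCuda would then want to receive the real pitch value rather than 100*sizeof(float).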

Related

CUDA C++ Pointer Typecasting

I was looking at the CUDA C++ documentation, but there is something I didn't get about pointer typecasting. Below are the host and device code.
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
                width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr,
                         size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
As you can see, devPtr is cast to char*. But I didn't get why it is cast to char* rather than being incremented as a float pointer.
This is to handle a pitched allocation (the type created by cudaMallocPitch()).
A pitched allocation "rounds up" the requested width of the allocation to a particular pitch, and this pitch is specified in bytes:
cudaMallocPitch(&devPtr, &pitch,
                         ^
                         |
                         this value is indicated by the function as a row width or "pitch" in bytes
Because the pitch is specified in bytes, to get proper pointer arithmetic:
((char*)devPtr + r * pitch);
 ^
 |
 pointer arithmetic
the pointer type must also be a byte type. The objective of that code snippet is to increment devPtr by a number of rows specified by r, where each row consists of pitch bytes.
AFAIK, in CUDA, there is nothing that guarantees any particular granularity of pitch as returned by cudaMallocPitch. It is theoretically possible for it to be an odd number of bytes, or a prime number of bytes, for example. So playing tricks to pre-convert the pitch value to an equivalent (pointer arithmetic) offset in other type-widths would be frowned on.
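As an illustration only (this helper and its name are mine, not something from the CUDA toolkit), the byte-based arithmetic is often wrapped in a small device function:

// Hypothetical helper: address of element (r, c) in a pitched 2D float allocation.
// The row offset is computed in bytes (r * pitch); the column is then indexed as a float.
__device__ __forceinline__ float* pitched_element(float* base, size_t pitch, int r, int c)
{
    return (float*)((char*)base + r * pitch) + c;
}

// inside a kernel:
// float element = *pitched_element(devPtr, pitch, r, c);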

cudaMallocPitch and cudaMemcpy2D

I get an error when transferring a C++ 2D array into a CUDA 1D array.
Let me show my source code.
int main(void)
{
    float h_arr[1024][256];
    float *d_arr;

    // --- Some code to populate h_arr

    // --- cudaMallocPitch
    size_t pitch;
    cudaMallocPitch((void**)&d_arr, &pitch, 256, 1024);

    // --- Copy array to device
    cudaMemcpy2D(d_arr, pitch, h_arr, 256, 256, 1024, cudaMemcpyHostToDevice);
}
I tried to run the code, but it pops up an error.
How to use cudaMallocPitch() and cudaMemcpy2D() properly?
Talonmies has already satisfactorily answered this question. Here is some further explanation that could be useful to the community.
When accessing 2D arrays in CUDA, memory transactions are much faster if each row is properly aligned.
CUDA provides the cudaMallocPitch function to “pad” 2D matrix rows with extra bytes so as to achieve the desired alignment. Please refer to the “CUDA C Programming Guide”, Sections 3.2.2 and 5.3.2, for more information.
Assuming that we want to allocate a 2D padded array of floating point (single precision) elements, the syntax for cudaMallocPitch is the following:
cudaMallocPitch(&devPtr, &devPitch, Ncols * sizeof(float), Nrows);
where
devPtr is an output pointer to float (float *devPtr).
devPitch is a size_t output variable denoting the length, in bytes, of the padded row.
Nrows and Ncols are size_t input variables representing the matrix size.
Recalling that C/C++ and CUDA store 2D matrices by row, cudaMallocPitch will allocate a memory space of size, in bytes, equal to Nrows * pitch. However, only the first Ncols * sizeof(float) bytes of each row will contain the matrix data. Accordingly, cudaMallocPitch consumes more memory than strictly necessary for the 2D matrix storage, but this is repaid in more efficient memory accesses.
CUDA provides also the cudaMemcpy2D function to copy data from/to host memory space to/from device memory space allocated with cudaMallocPitch. Under the above hypotheses (single precision 2D matrix), the syntax is the following:
cudaMemcpy2D(devPtr, devPitch, hostPtr, hostPitch, Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice)
where
devPtr and hostPtr are input pointers to float (float *devPtr and float *hostPtr) pointing to the (destination) device and (source) host memory spaces, respectively;
devPitch and hostPitch are size_t input variables denoting the length, in bytes, of the padded rows for the device and host memory spaces, respectively;
Nrows and Ncols are size_t input variables representing the matrix size.
Note that cudaMemcpy2D also allows for pitched memory allocation on the host side. If the host memory has no pitch, then hostPitch = Ncols * sizeof(float). Furthermore, cudaMemcpy2D is bidirectional. In the above example we are copying data from host to device; if we want to copy data from device to host, then the above line changes to
cudaMemcpy2D(hostPtr, hostPitch, devPtr, devPitch, Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost)
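As a small illustration of the host-side pitch remark above (a sketch only; the stride value here is arbitrary), the host buffer itself may have padded rows, in which case hostPitch is simply that padded row width in bytes:

// host matrix stored with a padded row stride of hostStride floats (hostStride >= Ncols)
const size_t hostStride = 8;                       // arbitrary example value
float hostBuf[Nrows * hostStride];

cudaMemcpy2D(devPtr, devPitch,
             hostBuf, hostStride * sizeof(float),  // source pitch: padded host row width in bytes
             Ncols * sizeof(float), Nrows,         // copy only the Ncols useful floats of each row
             cudaMemcpyHostToDevice);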
The access to elements of a 2D matrix allocated by cudaMallocPitch can be performed as in the following example:
int tidx = blockIdx.x*blockDim.x + threadIdx.x;
int tidy = blockIdx.y*blockDim.y + threadIdx.y;

if ((tidx < Ncols) && (tidy < Nrows))
{
    float *row_a = (float *)((char*)devPtr + tidy * pitch);
    row_a[tidx] = row_a[tidx] * tidx * tidy;
}
In such an example, tidx and tidy are used as column and row indices, respectively (remember that, in CUDA, x-threads span the columns and y-threads span the rows to favor coalescence). The pointer to the first element of a row is calculated by offsetting the initial pointer devPtr by the row length tidy * pitch in bytes (char * is a pointer to bytes and sizeof(char) is 1 byte), where the length of each row is computed by using the pitch information.
Below, I'm providing a fully worked example to show these concepts.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <conio.h>

#define BLOCKSIZE_x 16
#define BLOCKSIZE_y 16

#define Nrows 3
#define Ncols 5

/*****************/
/* CUDA MEMCHECK */
/*****************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) { getch(); exit(code); }
    }
}

/*******************/
/* iDivUp FUNCTION */
/*******************/
int iDivUp(int hostPtr, int b){ return ((hostPtr % b) != 0) ? (hostPtr / b + 1) : (hostPtr / b); }

/******************/
/* TEST KERNEL 2D */
/******************/
__global__ void test_kernel_2D(float *devPtr, size_t pitch)
{
    int tidx = blockIdx.x*blockDim.x + threadIdx.x;
    int tidy = blockIdx.y*blockDim.y + threadIdx.y;

    if ((tidx < Ncols) && (tidy < Nrows))
    {
        float *row_a = (float *)((char*)devPtr + tidy * pitch);
        row_a[tidx] = row_a[tidx] * tidx * tidy;
    }
}

/********/
/* MAIN */
/********/
int main()
{
    float hostPtr[Nrows][Ncols];
    float *devPtr;
    size_t pitch;

    for (int i = 0; i < Nrows; i++)
        for (int j = 0; j < Ncols; j++) {
            hostPtr[i][j] = 1.f;
            //printf("row %i column %i value %f \n", i, j, hostPtr[i][j]);
        }

    // --- 2D pitched allocation and host->device memcopy
    gpuErrchk(cudaMallocPitch(&devPtr, &pitch, Ncols * sizeof(float), Nrows));
    gpuErrchk(cudaMemcpy2D(devPtr, pitch, hostPtr, Ncols*sizeof(float), Ncols*sizeof(float), Nrows, cudaMemcpyHostToDevice));

    dim3 gridSize(iDivUp(Ncols, BLOCKSIZE_x), iDivUp(Nrows, BLOCKSIZE_y));
    dim3 blockSize(BLOCKSIZE_y, BLOCKSIZE_x);

    test_kernel_2D<<<gridSize, blockSize>>>(devPtr, pitch);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy2D(hostPtr, Ncols * sizeof(float), devPtr, pitch, Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost));

    for (int i = 0; i < Nrows; i++)
        for (int j = 0; j < Ncols; j++)
            printf("row %i column %i value %f \n", i, j, hostPtr[i][j]);

    return 0;
}
The cudaMallocPitch call you have written looks ok, but this:
cudaMemcpy2D(d_arr, pitch, h_arr, 256, 256, 1024, cudaMemcpyHostToDevice);
is incorrect. Quoting from the documentation
Copies a matrix (height rows of width bytes each) from the memory area
pointed to by src to the memory area pointed to by dst, where kind is
one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice,
cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the
direction of the copy. dpitch and spitch are the widths in memory in
bytes of the 2D arrays pointed to by dst and src, including any
padding added to the end of each row. The memory areas may not
overlap. width must not exceed either dpitch or spitch. Calling
cudaMemcpy2D() with dst and src pointers that do not match the
direction of the copy results in an undefined behavior. cudaMemcpy2D()
returns an error if dpitch or spitch exceeds the maximum allowed.
So the source pitch and width to copy must be specified in bytes. Your host matrix has a pitch of sizeof(float) * 256 bytes, and because the source pitch and the width of the source you will copy are the same, your cudaMemcpy2D call should look like:
cudaMemcpy2D(d_arr, pitch, h_arr, 256*sizeof(float),
256*sizeof(float), 1024, cudaMemcpyHostToDevice);
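For completeness, here is a minimal sketch of the whole copy path, under the assumption that a 1024 x 256 matrix of float is intended (so the requested row width for cudaMallocPitch is also given in bytes, as in the syntax described further up):

float h_arr[1024][256];
float *d_arr;
size_t pitch;

// request 256 floats (in bytes) per row, for 1024 rows; pitch is returned in bytes
cudaMallocPitch((void**)&d_arr, &pitch, 256 * sizeof(float), 1024);

// host -> device: destination pitch is the value returned above,
// source pitch is the packed host row width in bytes
cudaMemcpy2D(d_arr, pitch, h_arr, 256 * sizeof(float),
             256 * sizeof(float), 1024, cudaMemcpyHostToDevice);

// device -> host: the pointers, and therefore the pitches, swap roles
cudaMemcpy2D(h_arr, 256 * sizeof(float), d_arr, pitch,
             256 * sizeof(float), 1024, cudaMemcpyDeviceToHost);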

Why is my CUDA kernel returning old values?

Kind of almost at the point of ripping my hair out over this issue.
I have a CUDA kernel that does some math on data stored in a 3D array. While testing this, I used to assign some non-zero values to the array and observe the results. I commented out those lines since then, but the result is still the same. It is as if it is completely ignoring the fact that I'm doing a memset to 0.
The code works correctly when I step through it in Debug... But not in Release! My guess is I have a memory leak from this matrix.
I allocate this array as:
cudaExtent m_extent = make_cudaExtent(sizeof(float)*matdim.x, matdim.y, matdim.z); // width, height, depth
cudaPitchedPtr m_device;
cudaMalloc3D(&m_device, m_extent);
cudaMemset3D(m_device, 0, m_extent);
I call the kernel in a loop like this:
for (int iter = 0; iter < gpu_iterations; iter++)
{
    PF_iteration_kernel<<<grids,threads>>>(m_device, m_extent, matdim);
    cudaDeviceSynchronize();
}
After which I release the m_device pitched pointer:
cudaFree(m_device.ptr);
matdim is just matrix dimensions held by a dim3.
Within the kernel I do the following (well, I commented everything functional out...):
__global__ void PF_iteration_kernel(cudaPitchedPtr mPtr, cudaExtent mExt, dim3 matrix_dimensions)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    // Find location within the pitched memory
    char *m = (char*)mPtr.ptr;
    int sof = sizeof(float);

    size_t pitch = mPtr.pitch;
    size_t slice_pitch = pitch*mExt.height;

    char* m_addroff = m + y * pitch + x * sof;
    printf("m(%d,%d) is %f \n", x, y, *m_addroff);    // display the slice
    *m_addroff = 0;    // WILL THIS RESET IT?!
    __syncthreads();
}
That should be just showing 0s, but it displays my old values (25, 26, 27, 28, etc).
I have cleaned and re-cleaned and re-built everything several times. I have relaunched the IDE.
My IDE is Visual Studio 2010 With NSight 4.6 (CUDA 7.0).
I am on Windows 7 x64
Consider this
char* m_addroff = m + y * pitch + x * sof;
printf("m(%d,%d) is %f \n", x, y, *m_addroff);
The compiler sees a char, which is promoted to int when pushed onto the stack, not a float promoted to double, which is what the %f format specifier requires.
The compiler does not convert arguments to fit the format spec, but some compilers will examine the format string and warn about mismatches.
I suggest you cast the argument. I risk guessing and failing, but try something like this:
printf("m(%d,%d) is %f \n", x, y, *(float*)m_addroff);
Here is a simple example.
#include <stdio.h>

int main()
{
    char car[4] = {0};
    char *cptr = car;
    printf("Hello %f\n", *(float*)cptr);
    return 0;
}
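Applied back to the kernel in the question (a sketch reusing the question's variable names), the same cast hands printf a real float; the same reasoning applies if the element is meant to be written as a float rather than as a single byte:

// read the pitched element as a float before handing it to printf
printf("m(%d,%d) is %f \n", x, y, *(float*)m_addroff);

// write the whole 4-byte element, not just its first byte
*(float*)m_addroff = 0.0f;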

fftw - Access violation error

I implemented an FFTW (fftw.org) example to use fast Fourier transforms...
This is the code....
I load an image that I convert from uint8_t to double (this code works fine...).
string bmpFileNameImage = "files/testDummyFFTWWithWisdom/onechannel_image.bmp";
BMPImage bmpImage(bmpFileNameImage);

vector<double> pixelColors;
vector<uint8_t> image = bmpImage.copyBits();
toDouble(image, pixelColors, 256, 256, 1);

int width = bmpImage.width();
int height = bmpImage.height();
I use wisdom files to improve the performance
FILE * file = fopen("wisdom.fftw", "r");
if (file) {
    fftw_import_wisdom_from_file(file);
    fclose(file);
}

/* fftw variables */
fftw_complex *out;
double *wisdomInput = (double *) fftw_malloc(sizeof(double) * width * 2 * (height/2 + 1));

const fftw_plan forward = fftw_plan_dft_r2c_2d(width, height, wisdomInput, reinterpret_cast<fftw_complex *>(wisdomInput), FFTW_PATIENT);
const fftw_plan inverse = fftw_plan_dft_c2r_2d(width, height, reinterpret_cast<fftw_complex *>(wisdomInput), wisdomInput, FFTW_PATIENT);

file = fopen("wisdom.fftw", "w");
if (file) {
    fftw_export_wisdom_to_file(file);
    fclose(file);
}
Finally, I execute the FFTW transforms... I receive an access violation error in the first
function (fftw_execute_dft_r2c) and I don't know why. I read this tutorial:
http://www.fftw.org/fftw3_doc/Multi_002dDimensional-DFTs-of-Real-Data.html#Multi_002dDimensional-DFTs-of-Real-Data.
I do a malloc with (ny/2+1), as it explains, but I don't understand why it is not working. I am testing different sizes...
out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * width * (height/2 + 1));
double *result = (double *) fftw_malloc(width * (height+2) * sizeof(double));

fftw_execute_dft_r2c(forward, &pixelColors[0], out);
fftw_execute_dft_c2r(inverse, out, result);
Regards.
This is the corrected code.
It had a few mistakes:
It was reading a wrong wisdom.fftw file (from some old test...). Now it always creates a new fftw_plan and a new file.
I misunderstood how the FFTW library works with in-place and out-of-place transforms. I had to change the mallocs to use the correct padding for "in-place" transforms (I added +2 to the row width in the malloc calls).
In order to restore the image, I had to divide by its size ((width+2) * height), as explained in this link.
/* load image */
string bmpFileNameImage = "files/polyp.bmp";
BMPImage bmpImage(bmpFileNameImage);
int width = bmpImage.width();
int height = bmpImage.height();

vector<double> pixelColors;
vector<uint8_t> image = bmpImage.copyBits();
// get one channel from the image
Uint8ToDouble(image, pixelColors, bmpImage.width(), bmpImage.height(), 1);

// We don't reuse an old wisdom.fftw... it can be corrupt
/*
FILE * file = fopen("wisdom.fftw", "r");
if (file) {
    fftw_import_wisdom_from_file(file);
    fclose(file);
} */

// in-place r2c transform: each row is padded to width+2 doubles
double *wisdomInput = (double *) fftw_malloc(sizeof(double) * height * (width+2));
const fftw_plan forward = fftw_plan_dft_r2c_2d(width, height, wisdomInput, reinterpret_cast<fftw_complex *>(wisdomInput), FFTW_PATIENT);
const fftw_plan inverse = fftw_plan_dft_c2r_2d(width, height, reinterpret_cast<fftw_complex *>(wisdomInput), wisdomInput, FFTW_PATIENT);

// (the allocation of out/result and the fftw_execute calls are omitted in the original post)

double *bitsColors = (double *) fftw_malloc(width * height * sizeof(double));

// copy the un-padded part of each transformed row back out, normalizing by the image size
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width+2; x++) {
        if (x < width) {
            int currentIndex = ((y * width) + (x));
            bitsColors[currentIndex] = (static_cast<double>(result[y * (width+2) + x])) / (height*width);
        }
    }
}

fftw_free(wisdomInput);
fftw_free(out);
fftw_free(result);
fftw_free(bitsColors);

fftw_destroy_plan(forward);
fftw_destroy_plan(inverse);
fftw_cleanup();
fftw_execute_dft_r2c(forward,&pixelColors[0],out);
What are you doing here? The array already has a pointer.
Change it to fftw_execute_dft_r2c(forward,pixelColors[0],out); it should work now.
Maybe the problem is here (http://www.fftw.org/doc/New_002darray-Execute-Functions.html):
[...] that the following conditions are met:
The input and output arrays are the same (in-place) or different (out-of-place) if the plan was originally created to be in-place or
out-of-place, respectively.
In the plan you are using in-place transformation parameters (with bad allocation, BTW, since:
double *wisdomInput = (double *) fftw_malloc(sizeof(double)*width*2*(height/2 +1 ));
should be:
double *wisdomInput = (double *) fftw_malloc(sizeof(fftw_complex)*width*2*(height/2 +1 ));
to be suitable for output too).
But you're calling the fftw_execute_dft_r2c function with out-of-place arrays.
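For comparison, a minimal sketch of an out-of-place plan pair that would match the later execute calls (this is only one way to set it up; the sizes follow the FFTW real-data layout, and the variable names mirror the question):

// out-of-place: a real array of width*height doubles and a separate
// complex array of width*(height/2 + 1) fftw_complex values
double       *in  = (double *)       fftw_malloc(sizeof(double)       * width * height);
fftw_complex *cpx = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * width * (height/2 + 1));

const fftw_plan forward = fftw_plan_dft_r2c_2d(width, height, in,  cpx, FFTW_PATIENT);
const fftw_plan inverse = fftw_plan_dft_c2r_2d(width, height, cpx, in,  FFTW_PATIENT);

// the new-array execute functions can then be used with other arrays of the
// same kind (out-of-place, same alignment and strides as the planning arrays)
fftw_execute_dft_r2c(forward, &pixelColors[0], out);
fftw_execute_dft_c2r(inverse, out, result);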

Casting float* to char* while looping over a 2-D array in linear memory on device

On Page 21 of the CUDA 4.0 programming guide there is an example (given below) to illustrate looping over the
elements of a 2D array of floats in device memory. The dimensions of the 2D array are width*height.
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
                width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r)
    {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c)
        {
            float element = row[c];
        }
    }
}
Why has the devPtr device memory pointer been cast to a character pointer, char*, in the global kernel function? Can someone explain that line please? It looks a bit weird.
This is due to the way pointer arithmetic works in C. When you add an integer x to a pointer p, it doesn't always add x bytes. It adds x times sizeof(*p) (the size of the object to which p points).
float* row = (float*)((char*)devPtr + r * pitch);
By casting devPtr to a char*, the offset that is applied (r * pitch) is in 1-byte increments (because a char is one byte). Had the cast not been there, the offset applied to devPtr would be r * pitch times 4 bytes, as a float is four bytes.
For example, if we have:
float* devPtr = (float*)1000;   // suppose devPtr holds the address 1000
int r = 4;
Now, let's leave out the cast:
float* result1 = devPtr + r;
// result1 = devPtr + (r * sizeof(float)) = 1016
Now, if we include the cast:
float* result2 = (float*)((char*)devPtr + r);
// result2 = devPtr + (r * sizeof(char)) = 1004
The cast is just to make the pointer arithmetic work right:
(float*)((char*)devPtr + r * pitch);
moves r*pitch bytes forward, while
(float*)(devPtr + r * pitch);
would move r*pitch floats forward (i.e. 4 times as many bytes).
*(devPtr + 1) will offset the pointer by 4 bytes (sizeof(float)) before the * dereferences it.
*((char*)devPtr + 1) will offset the pointer by 1 byte (sizeof(char)) before the * dereferences it.
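A tiny standalone host-side check (not from the original answers, just an illustration) makes the difference in step size visible:

#include <cstdio>

int main()
{
    float buf[16] = {0};
    float* p = buf;

    float* q1 = p + 1;                      // element arithmetic: one float forward
    float* q2 = (float*)((char*)p + 1);     // byte arithmetic: one byte forward

    std::printf("element step: %td bytes\n", (char*)q1 - (char*)p);   // prints 4
    std::printf("byte step:    %td bytes\n", (char*)q2 - (char*)p);   // prints 1
    return 0;
}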