Why does my data not fit into a CUDA Texture Object? - c++

I'm trying to fill a CUDA Texture Object with some data but the call to cudaCreateTextureObject fails with the following error (edit: on both a GTX 1080TI and a RTX 2080TI):
GPU ERROR! 'invalid argument' (err code 11)
It works if I put less data into my texture so my guess is that my computation about how much data I can fit into a texture is off.
My thought process is as follows:
(executable code follows below)
My data comes in the form of (76,76) images where each pixel is a float. What I would like to do is to store a column of images in a Texture Object; as I understand it, cudaMallocPitch is the way to do this.
When computing the number of images I can store in one texture I'm using the following formula to determine how much space a single image needs:
GTX_1080TI_MEM_PITCH * img_dim_y * sizeof(float)
Where the first argument should be the memory pitch on a GTX 1080TI card (512 bytes). The number of bytes that I can store in a 1D texture is given as 2^27 here. Dividing the latter by the former gives 862.3, which I assume is the number of images I can store in one Texture Object. However, when I try to store more than 855 images in my buffer, the program crashes with the error above.
Here's the code:
In the following the main function (a) sets up all the relevant parameters, (b) allocates the memory using cudaMallocPitch, and (c) configures and creates a CUDA Texture Object:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cassert>
#define GTX_1080TI_MEM_PITCH 512
#define GTX_1080TI_1DTEX_WIDTH 134217728 // 2^27
//=====================================================================[ util ]
// CUDA error checking for library functions
#define CUDA_ERR_CHK(func){ cuda_assert( (func), __FILE__, __LINE__ ); }
inline void cuda_assert( const cudaError_t cu_err, const char* file, int line ){
if( cu_err != cudaSuccess ){
fprintf( stderr, "\nGPU ERROR! \'%s\' (err code %d) in file %s, line %d.\n\n", cudaGetErrorString(cu_err), cu_err, file, line );
exit( EXIT_FAILURE );
}
}
// CUDA generic error checking (used after kernel calls)
#define GPU_ERR_CHK(){ gpu_assert(__FILE__, __LINE__); }
inline void gpu_assert( const char* file, const int line ){
cudaError cu_err = cudaGetLastError();
if( cu_err != cudaSuccess ){
fprintf( stderr, "\nGPU KERNEL ERROR! \'%s\' (err code %d) in file %s, line %d.\n\n", cudaGetErrorString(cu_err), cu_err, file, line );
exit(EXIT_FAILURE);
}
}
//=====================================================================[ main ]
int main(){
// setup
unsigned int img_dim_x = 76;
unsigned int img_dim_y = 76;
unsigned int img_num = 856; // <-- NOTE: set this to 855 and it should work - but we should be able to put 862 here?
unsigned int pitched_img_size = GTX_1080TI_MEM_PITCH * img_dim_y * sizeof(float);
unsigned int img_num_per_tex = GTX_1080TI_1DTEX_WIDTH / pitched_img_size;
fprintf( stderr, "We should be able to stuff %d images into one texture.\n", img_num_per_tex );
fprintf( stderr, "We use %d (more than 855 leads to a crash).\n", img_num );
// allocate pitched memory
size_t img_tex_pitch;
float* d_img_tex_data;
CUDA_ERR_CHK( cudaMallocPitch( &d_img_tex_data, &img_tex_pitch, img_dim_x*sizeof(float), img_dim_y*img_num ) );
assert( img_tex_pitch == GTX_1080TI_MEM_PITCH );
fprintf( stderr, "Asking for %zd bytes allocates %zd bytes using pitch %zd. Available: %zd/%d\n",
img_num*img_dim_x*img_dim_y*sizeof(float),
img_num*img_tex_pitch*img_dim_y*sizeof(float),
img_tex_pitch,
GTX_1080TI_1DTEX_WIDTH - img_num*img_tex_pitch*img_dim_y*sizeof(float),
GTX_1080TI_1DTEX_WIDTH );
// generic resource descriptor
cudaResourceDesc res_desc;
memset(&res_desc, 0, sizeof(res_desc));
res_desc.resType = cudaResourceTypePitch2D;
res_desc.res.pitch2D.desc = cudaCreateChannelDesc<float>();
res_desc.res.pitch2D.devPtr = d_img_tex_data;
res_desc.res.pitch2D.width = img_dim_x;
res_desc.res.pitch2D.height = img_dim_y*img_num;
res_desc.res.pitch2D.pitchInBytes = img_tex_pitch;
// texture descriptor
cudaTextureDesc tex_desc;
memset(&tex_desc, 0, sizeof(tex_desc));
tex_desc.addressMode[0] = cudaAddressModeClamp;
tex_desc.addressMode[1] = cudaAddressModeClamp;
tex_desc.filterMode = cudaFilterModeLinear; // for linear interpolation (NOTE: this breaks normal integer indexing!)
tex_desc.readMode = cudaReadModeElementType;
tex_desc.normalizedCoords = false; // we want to index using [0;img_dim] rather than [0;1]
// make sure there are no lingering errors
GPU_ERR_CHK();
fprintf(stderr, "No CUDA error until now..\n");
// create texture object
cudaTextureObject_t img_tex_obj;
CUDA_ERR_CHK( cudaCreateTextureObject(&img_tex_obj, &res_desc, &tex_desc, NULL) );
fprintf(stderr, "bluppi\n");
}
This should crash when cudaCreateTextureObject is called. If the img_num parameter (at the start of main) is changed from 856 to 855, however, the code should execute successfully. (edit: The expected behavior would be that the code runs through with a value of 862 but fails with a value of 863 since that actually requires more bytes than the documented buffer size offers.)
Any help would be appreciated!

Since you're working with a 2D texture, the limit on the size of a 1D texture (the "width") is not relevant.
2D textures may have different characteristics depending on the type of memory that provides the backing for the texture. Two examples are linear memory and CUDA Array. You have chosen to use a linear memory backing (that which is provided by cudaMalloc* operations other than cudaMallocArray).
The primary problem you are running into is the maximum texture height. To find out what it is, we can refer to Table 14 in the programming guide, which lists:
Maximum width and height for a 2D texture reference bound to linear memory 65000 x 65000
You are exceeding this 65000 number when going from 855 to 856 images, for an image height of 76 rows. 856*76 = 65056, 855*76 = 64980
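The arithmetic above can be checked with a small host-only sketch. The function names are hypothetical, and the 65000-row limit is passed in as a plain parameter rather than queried from a device:

```cpp
#include <cstddef>

// Number of whole images that fit when images are stacked vertically in one
// 2D texture whose height is capped at max_tex_height rows.
// (Hypothetical helper; the limit is supplied by the caller.)
constexpr std::size_t max_stacked_images(std::size_t max_tex_height,
                                         std::size_t img_height) {
    return img_height == 0 ? 0 : max_tex_height / img_height;
}

// True when img_num stacked images stay within the height limit.
constexpr bool fits_in_texture(std::size_t img_num, std::size_t img_height,
                               std::size_t max_tex_height) {
    return img_num * img_height <= max_tex_height;
}

// Compile-time checks for the numbers quoted in the answer.
static_assert(max_stacked_images(65000, 76) == 855, "65000 / 76 rounds down to 855");
static_assert(fits_in_texture(855, 76, 65000), "855 * 76 = 64980 rows");
static_assert(!fits_in_texture(856, 76, 65000), "856 * 76 = 65056 rows");
```

With 76-row images and a 65000-row limit, 855 images fit and 856 do not, which matches the observed failure point.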
"But wait" you say, that table 14 entry says texture reference, and I am using a texture object.
You are correct, and table 14 doesn't explicitly list the corresponding limit for texture objects. In that case, we have to refer to the device properties readable from the device at runtime, using cudaGetDeviceProperties(). If we review the data available there, we see this readable item:
maxTexture2DLinear[3] contains the maximum 2D texture dimensions for 2D textures bound to pitch linear memory.
(The array has three elements because the third one is the maximum pitch; we only need the first two values, width and height.)
This is the value we should check to be sure. If we modify your code to respect that limit, it runs without problems:
$ cat t382.cu
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cassert>
#define GTX_1080TI_MEM_PITCH 512
#define GTX_1080TI_1DTEX_WIDTH 134217728 // 2^27
//=====================================================================[ util ]
// CUDA error checking for library functions
#define CUDA_ERR_CHK(func){ cuda_assert( (func), __FILE__, __LINE__ ); }
inline void cuda_assert( const cudaError_t cu_err, const char* file, int line ){
if( cu_err != cudaSuccess ){
fprintf( stderr, "\nGPU ERROR! \'%s\' (err code %d) in file %s, line %d.\n\n", cudaGetErrorString(cu_err), cu_err, file, line );
exit( EXIT_FAILURE );
}
}
// CUDA generic error checking (used after kernel calls)
#define GPU_ERR_CHK(){ gpu_assert(__FILE__, __LINE__); }
inline void gpu_assert( const char* file, const int line ){
cudaError cu_err = cudaGetLastError();
if( cu_err != cudaSuccess ){
fprintf( stderr, "\nGPU KERNEL ERROR! \'%s\' (err code %d) in file %s, line %d.\n\n", cudaGetErrorString(cu_err), cu_err, file, line );
exit(EXIT_FAILURE);
}
}
//=====================================================================[ main ]
int main(){
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
size_t max2Dtexturelinearwidth = prop.maxTexture2DLinear[0]; // texture x dimension
size_t max2Dtexturelinearheight = prop.maxTexture2DLinear[1]; // texture y dimension
fprintf( stderr, "maximum 2D linear texture dimensions (width,height): %zu,%zu\n", max2Dtexturelinearwidth, max2Dtexturelinearheight); // %zu is the portable format for size_t
// setup
unsigned int img_dim_x = 76;
unsigned int img_dim_y = 76;
//unsigned int img_num = 856; // <-- NOTE: set this to 855 and it should work - but we should be able to put 862 here?
unsigned int img_num = max2Dtexturelinearheight/img_dim_y;
fprintf( stderr, "maximum number of images per texture: %u\n", img_num);
unsigned int pitched_img_size = GTX_1080TI_MEM_PITCH * img_dim_y * sizeof(float);
unsigned int img_num_per_tex = GTX_1080TI_1DTEX_WIDTH / pitched_img_size;
fprintf( stderr, "We should be able to stuff %d images into one texture.\n", img_num_per_tex );
fprintf( stderr, "We use %d (more than 855 leads to a crash).\n", img_num );
// allocate pitched memory
size_t img_tex_pitch;
float* d_img_tex_data;
CUDA_ERR_CHK( cudaMallocPitch( &d_img_tex_data, &img_tex_pitch, img_dim_x*sizeof(float), img_dim_y*img_num ) );
assert( img_tex_pitch == GTX_1080TI_MEM_PITCH );
fprintf( stderr, "Asking for %zd bytes allocates %zd bytes using pitch %zd. Available: %zd/%d\n",
img_num*img_dim_x*img_dim_y*sizeof(float),
img_num*img_tex_pitch*img_dim_y*sizeof(float),
img_tex_pitch,
GTX_1080TI_1DTEX_WIDTH - img_num*img_tex_pitch*img_dim_y*sizeof(float),
GTX_1080TI_1DTEX_WIDTH );
// generic resource descriptor
cudaResourceDesc res_desc;
memset(&res_desc, 0, sizeof(res_desc));
res_desc.resType = cudaResourceTypePitch2D;
res_desc.res.pitch2D.desc = cudaCreateChannelDesc<float>();
res_desc.res.pitch2D.devPtr = d_img_tex_data;
res_desc.res.pitch2D.width = img_dim_x;
res_desc.res.pitch2D.height = img_dim_y*img_num;
res_desc.res.pitch2D.pitchInBytes = img_tex_pitch;
// texture descriptor
cudaTextureDesc tex_desc;
memset(&tex_desc, 0, sizeof(tex_desc));
tex_desc.addressMode[0] = cudaAddressModeClamp;
tex_desc.addressMode[1] = cudaAddressModeClamp;
tex_desc.filterMode = cudaFilterModeLinear; // for linear interpolation (NOTE: this breaks normal integer indexing!)
tex_desc.readMode = cudaReadModeElementType;
tex_desc.normalizedCoords = false; // we want to index using [0;img_dim] rather than [0;1]
// make sure there are no lingering errors
GPU_ERR_CHK();
fprintf(stderr, "No CUDA error until now..\n");
// create texture object
cudaTextureObject_t img_tex_obj;
CUDA_ERR_CHK( cudaCreateTextureObject(&img_tex_obj, &res_desc, &tex_desc, NULL) );
fprintf(stderr, "bluppi\n");
}
$ nvcc -o t382 t382.cu
$ cuda-memcheck ./t382
========= CUDA-MEMCHECK
maximum 2D linear texture dimensions (width,height): 131072,65000
maximum number of images per texture: 855
We should be able to stuff 862 images into one texture.
We use 855 (more than 855 leads to a crash).
Asking for 19753920 bytes allocates 133079040 bytes using pitch 512. Available: 1138688/134217728
No CUDA error until now..
bluppi
========= ERROR SUMMARY: 0 errors
$

Related

Capturing YUYV in c++ using v4l2

I have a webcam connected to beaglebone via usb. I am coding in c++ and my goal is to capture raw UNCOMPRESSED picture from the webcam.
Firstly i checked what formats are supported via command v4l2-ctl --list-formats and the result was:
Index : 0
Type : Video Capture
Pixel Format: 'MJPG' (compressed)
Name : Motion-JPEG
Index : 1
Type : Video Capture
Pixel Format: 'YUYV'
Name : YUYV 4:2:2
So from this I assume it has to be possible to get an uncompressed picture if i try to use YUYV format.
Knowing this, I started writing a program in C++. I have successfully written a program to capture a compressed picture, but when I try to capture using the YUYV format it doesn't work, and I really need some help to get this done.
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <linux/videodev2.h>
#include <libv4l2.h>
template <typename typeXX>
void clear_memmory(typeXX* x) {
memset(x, 0, sizeof(*x));
}
void xioctl(int cd, int request, void *arg){
int response;
do{
//ensures we get the correct response.
response = v4l2_ioctl(cd, request, arg);
}
while (response == -1 && ((errno == EINTR) || (errno == EAGAIN)));
if (response == -1) {
fprintf(stderr, "error %d, %s\n", errno, strerror(errno));
exit(EXIT_FAILURE);
}
}
struct LMSBBB_buffer{
void* start;
size_t length;
};
int main(){
const char* dev_name = "/dev/video0";
int width=1920;
int height=1080;
int fd = v4l2_open(dev_name, O_RDWR | O_NONBLOCK, 0);
struct v4l2_format format = {0};
format.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
format.fmt.pix.width = width;
format.fmt.pix.height = height;
format.fmt.pix.pixelformat = V4L2_PIX_FMT_RGB24;//V4L2_PIX_FMT_YUYV //V4L2_PIX_FMT_RGB24
format.fmt.pix.field = V4L2_FIELD_NONE; //V4L2_FIELD_NONE
xioctl(fd, VIDIOC_S_FMT, &format);
printf("Device initialized.\n");
///request buffers
struct v4l2_requestbuffers req = {0};
req.count = 2;
req.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
req.memory = V4L2_MEMORY_MMAP;
xioctl(fd, VIDIOC_REQBUFS, &req);
printf("Buffers requested.\n");
///mapping buffers
struct v4l2_buffer buf;
LMSBBB_buffer* buffers;
unsigned int i;
buffers = (LMSBBB_buffer*) calloc(req.count, sizeof(*buffers));
for (i = 0; i < req.count; i++) {
clear_memmory(&(buf));
(buf).type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
(buf).memory = V4L2_MEMORY_MMAP;
(buf).index = i;
xioctl(fd, VIDIOC_QUERYBUF, &buf);
buffers[i].length = (buf).length;
printf("A buff has a len of: %zu\n",buffers[i].length); // length is a size_t
buffers[i].start = v4l2_mmap(NULL, (buf).length, PROT_READ | PROT_WRITE, MAP_SHARED,fd, (buf).m.offset);
if (MAP_FAILED == buffers[i].start) {
perror("Can not map the buffers.");
exit(EXIT_FAILURE);
}
}
printf("Buffers mapped.\n");
for (i = 0; i < req.count; i++) {
clear_memmory(&(buf));
(buf).type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
(buf).memory = V4L2_MEMORY_MMAP;
(buf).index = i;
ioctl(fd,VIDIOC_QBUF, &(buf));
}
enum v4l2_buf_type type;
type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
ioctl(fd,VIDIOC_STREAMON, &type);
printf("buffers queued and streaming.\n");
int pic_count=0;
///CAPTURE
fd_set fds;
struct timeval tv;
int r;
char out_name[256];
FILE* fout;
do {
FD_ZERO(&fds);
FD_SET(fd, &fds);
// Timeout.
tv.tv_sec = 2;
tv.tv_usec = 0;
r = select(fd + 1, &fds, NULL, NULL, &tv);
} while (r == -1 && errno == EINTR); // retry only when select() was interrupted (compare, do not assign)
if (r == -1) {
perror("select");
exit(EXIT_FAILURE);
}
clear_memmory(&(buf));
(buf).type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
(buf).memory = V4L2_MEMORY_MMAP;
xioctl(fd,VIDIOC_DQBUF, &(buf));
printf("Buff index: %i\n",(buf).index);
sprintf(out_name, "image%03d.ppm",pic_count);
fout = fopen(out_name, "w");
if (!fout) {
perror("Cannot open image");
exit(EXIT_FAILURE);
}
fprintf(fout, "P6\n%d %d 255\n",width, height);
fwrite(buffers[(buf).index].start, (buf).bytesused, 1, fout);
fclose(fout);
pic_count++;
clear_memmory(&(buf));
(buf).type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
(buf).memory = V4L2_MEMORY_MMAP;
xioctl(fd,VIDIOC_DQBUF, &(buf));
printf("Buff index: %i\n",(buf).index);
sprintf(out_name, "image%03d.ppm",pic_count);
fout = fopen(out_name, "w");
if (!fout) {
perror("Cannot open image");
exit(EXIT_FAILURE);
}
fprintf(fout, "P6\n%d %d 255\n",width, height);
fwrite(buffers[(buf).index].start, (buf).bytesused, 1, fout);
fclose(fout);
pic_count++;
///xioctl(fd,VIDIOC_QBUF, &(buf));
return 0;
}
In the line that sets format.fmt.pix.pixelformat, I can choose between V4L2_PIX_FMT_YUYV and V4L2_PIX_FMT_RGB24.
With V4L2_PIX_FMT_RGB24 I get the picture, but with V4L2_PIX_FMT_YUYV I get this error:
libv4l2: error dequeuing buf: Resource temporarily unavailable
libv4l2: error dequeuing buf: Resource temporarily unavailable
libv4l2: error dequeuing buf: Resource temporarily unavailable
libv4l2: error dequeuing buf: Resource temporarily unavailable
libv4l2: error dequeuing buf: Resource temporarily unavailable
The error lines repeat forever until I end the program manually.
Does anyone have an idea what to do? I have spent over 2 weeks on this and I can't get anywhere from here. I would really appreciate any advice.
From what I see, you are requesting a Full HD (1920x1080) buffer in YUYV format from the camera. You did not mention the camera type, model, or specs, but if it is generic USB-attached hardware, you will most likely not get a raw Full HD YUYV buffer as output. You will only get the MJPEG one (which you can decode to YUV if you hack around with libjpeg) or the decoded RGB buffer (which is pretty much the decoded MJPEG with a YUV->RGB conversion), which is not mmapped.
The exact list of formats, frame sizes, and frame rates can be requested with the command below. It will probably tell you that the camera does not provide 1920x1080 in YUYV, only something smaller, like 640x480:
v4l2-ctl --list-formats-ext
If you need video processing with "true" zero-copy access to raw YUYV camera frames, you need direct access to hardware and that specific hardware in the first place. Once you have the USB interface between your software and the camera, you get an extra indirection and that means the speed goes down. Think for a moment, the YUYV frame at 1920x1080 takes up approximately 4 Megabytes of memory. At 30 FPS this is 120 Megabytes (or 960 Megabits) per second of bus throughput. If you have a USB2.0 camera, there is just no bandwidth to support this (thus the need for MJPEG). Even at 15FPS this is 480 Megabits, not counting the USB latency and protocol overhead.
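The bandwidth estimate above can be reproduced with a short host-only sketch. The function and constant names are hypothetical, and the calculation counts only raw pixel payload, assuming 2 bytes per pixel for YUYV and the nominal 480 Mbit/s USB 2.0 signalling rate:

```cpp
#include <cstdint>

// YUYV (4:2:2) packs two pixels into four bytes, i.e. 2 bytes per pixel.
constexpr std::uint64_t yuyv_frame_bytes(std::uint64_t width, std::uint64_t height) {
    return width * height * 2;
}

// Raw payload bandwidth in bits per second, ignoring protocol overhead.
constexpr std::uint64_t stream_bits_per_second(std::uint64_t frame_bytes,
                                               std::uint64_t fps) {
    return frame_bytes * 8 * fps;
}

// Nominal USB 2.0 high-speed rate; usable throughput is lower in practice.
constexpr std::uint64_t kUsb2BitsPerSecond = 480000000ULL;

// True when the raw stream would fit under the nominal USB 2.0 rate.
constexpr bool fits_usb2(std::uint64_t width, std::uint64_t height, std::uint64_t fps) {
    return stream_bits_per_second(yuyv_frame_bytes(width, height), fps)
           <= kUsb2BitsPerSecond;
}
```

Without rounding the frame to 4 megabytes, a 1920x1080 YUYV frame is 4,147,200 bytes, so 30 FPS needs about 995 Mbit/s and 15 FPS about 498 Mbit/s. Both exceed the nominal rate, while 640x480 at 30 FPS (about 147 Mbit/s) fits.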
To provide some "actionable feedback", I would advise concentrating first on the algorithms that you want to apply to the image (you probably just don't want to lose processing speed at the very first step). Don't hesitate to use OpenCV for camera input and basic image processing; later you can switch to some hardware interface and hand-written algorithms.
The easier way of getting raw frames would be to use Android's camera interface and try to process the incoming frames with GLSL shaders using the GL_TEXTURE_EXTERNAL_OES extension, for which there is information and code samples available. There you can connect GL textures to AHardwareBuffer instances and then use the AHardwareBuffer_lock function to get raw pointers. The exact supported formats may also vary across hardware, so do not expect this to be super easy.
I've recently had a similar issue. In my case the camera driver needed the VIDIOC_S_PARM ioctl in order to set the frame rate and initialize the camera for the selected capture mode.
You can try to add this code after the VIDIOC_S_FMT and see if it works for you as well:
struct v4l2_streamparm streamparam;
memset(&streamparam, 0, sizeof(streamparam));
streamparam.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
xioctl(fd, VIDIOC_G_PARM, &streamparam);
streamparam.parm.capture.timeperframe.numerator = 1;
streamparam.parm.capture.timeperframe.denominator = 5;
xioctl(fd, VIDIOC_S_PARM, &streamparam);

cudaMemcpy throws InvalidValue error when copying from device to host

I've been trying to implement a one dimensional FFT using cuFFT. An InvalidValue error is thrown and no meaningful results are produced.
I've tried to ensure that each error is caught, and I believe that the cudaMemcpy from DeviceToHost causes the issue, though I am not sure why, nor how to fix it. The data size parameter in cudaMemcpy follows the same relation as supplied by the cuFFT documentation.
#include <iostream>
#include <fstream>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <cuda_runtime_api.h>
#include <cufft.h>
// cuda macros
#define NX 100 // number of points
#define BATCH 1 // number of ffts to perform
#define RANK 1 //
#define IDIST 1 // distance between 1st elements of batches
#define ISTRIDE 1 // do every ISTRIDEth index
#define ODIST 1 // distance between 1st elements of output
#define OSTRIDE 1 // distance between output elements
void fft1d() {
// create plan for performing fft
cufftHandle plan;
if (cufftPlan1d(&plan, NX, CUFFT_R2C, BATCH) != CUFFT_SUCCESS) {
printf("Failed to create 1D plan\n");
return;
}
// assemble data
double temp_data[] = {2.598076211353316, 3.2402830701637395, 3.8494572900049224, 4.419388724529261, 4.944267282795252, 5.41874215947433, 5.837976382931011, 6.197696125093141, 6.494234270429254, 6.724567799874842, 6.886348608602047, 6.97792744346504, 6.998370716093996, 6.9474700202387565, 6.8257442563389, 6.6344343416615565, 6.37549055993378, 6.051552679431957, 5.665923042211819, 5.222532898817316, 4.725902331664744, 4.181094175657916, 3.5936624057845576, 2.9695955178498603, 2.315255479544737, 1.6373128742041732, 0.9426788984240022, 0.23843490677753865, -0.46823977812093664, -1.1701410542749289, -1.8601134815746807, -2.531123226988873, -3.176329770049035, -3.7891556376344524, -4.363353457155562, -4.893069644570959, -5.3729040779788875, -5.797965148448726, -6.163919626883915, -6.467036838555256, -6.704226694973039, -6.873071195387157, -6.971849076777267, -6.999553361041935, -6.955901620504255, -6.84133885708361, -6.657032965782207, -6.404862828733319, -6.0873991611848375, -5.707878304681281, -5.270169234606201, -4.778734118422206, -4.23858282669252, -3.6552218606153755, -3.0345982167228436, -2.383038761007964, -1.707185730522749, -1.0139290199674, -0.31033594356630245, 0.39642081173600463, 1.0991363072871054, 1.7906468025248339, 2.463902784786862, 3.1120408346390414, 3.728453594100783, 4.306857124485735, 4.841354967187034, 5.326498254347925, 5.757341256627454, 6.129491801786784, 6.439156050110601, 6.683177170206378, 6.859067520906216, 6.965034011197066, 6.999996379650895, 6.963598207007518, 6.85621054964381, 6.678928156888352, 6.433558310743566, 6.122602401787424, 5.749230429076629, 5.317248684008804, 4.831060947586139, 4.295623596650021, 3.7163950767501706, 3.0992802567403803, 2.4505702323708074, 1.7768781925409076, 1.0850720020162676, 0.3822041878858906, -0.3245599564963766, -1.0280154171511335, -1.7209909100394047, -2.3964219877733033, -3.0474230571943477, -3.667357573646071, -4.249905696354359, -4.78912871521179, -5.279529592175676, -5.716109000098287};
cufftReal *idata;
cudaMalloc((void**) &idata, sizeof(cufftComplex)*NX);
if (cudaGetLastError() != cudaSuccess) {
printf("Failed to allocate memory space for input data.\n");
return;
}
cudaMemcpy(idata, temp_data, sizeof(temp_data)/sizeof(double), cudaMemcpyHostToDevice);
if (cudaGetLastError() != cudaSuccess) {
printf("Failed to load time data to memory.\n");
return;
}
// prepare memory for return data
cufftComplex *odata;
cudaMalloc((void**) &odata, sizeof(cufftComplex)*(NX/2 + 1));
if (cudaGetLastError() != cudaSuccess) {
printf("Failed to allocate memory for output data.\n");
}
// perform fft
if (cufftExecR2C(plan, idata, odata) != CUFFT_SUCCESS) {
printf("Failed to perform fft.\n");
return;
}
I think the error is thrown here, at the cudaMemcpy.
// grab data from graphics and print (memcpy waits until complete) cuda memcopy doesn't complete
// can return errors from previous cuda calls if they haven't been caught
cufftComplex *out_temp_data;
size_t num_bytes = (NX/2 + 1)*sizeof(cufftComplex);
cudaMemcpy(out_temp_data, odata, num_bytes, cudaMemcpyDeviceToHost);
int error_value = cudaGetLastError();
printf("cudaMemcpy from device state: %i\n", error_value);
if(error_value != cudaSuccess) {
printf("Failed to pull data from device.\n");
return;
}
for (size_t i = 0; i < (NX/2 + 1); i++) {
printf("%lu %f %f\n", i, out_temp_data[i].x, out_temp_data[i].y);
}
// clean up
cufftDestroy(plan);
cudaFree(idata);
}
int main() {
fft1d();
return 0;
}
out_temp_data is an uninitialized pointer: host memory must be allocated for it before cudaMemcpy can write the data. Thanks to generic-opto-guy for pointing this out.
In this case:
out_temp_data = new cufftComplex[NX/2 + 1];

OpenCL vs CUDA: Pinned memory

I have been porting my RabbitCT CUDA implementation to OpenCL and I'm running into issues with pinned memory.
For CUDA a host buffer is created that buffers the input images to be processed in pinned memory. This allows the host to catch the next batch of input images while the GPU processes the current batch. A simplified mockup of my CUDA implementation is as follows:
// globals
float** hostProjBuffer = new float*[BUFFER_SIZE];
float* devProjection[STREAMS_MAX];
cudaStream_t stream[STREAMS_MAX];
void initialize()
{
// initiate streams
for( uint s = 0; s < STREAMS_MAX; s++ ){
cudaStreamCreateWithFlags (&stream[s], cudaStreamNonBlocking);
cudaMalloc( (void**)&devProjection[s], imgSize);
}
// initiate buffers
for( uint b = 0; b < BUFFER_SIZE; b++ ){
cudaMallocHost((void **)&hostProjBuffer[b], imgSize);
}
}
// main function called for all input images
void backproject(imgdata* r)
{
uint projNr = r->imgnr % BUFFER_SIZE;
uint streamNr = r->imgnr % STREAMS_MAX;
// When buffer is filled, wait until work in current stream has finished
if(projNr == 0) {
cudaStreamSynchronize(stream[streamNr]);
}
// copy received image data to buffer (maps double precision to float)
std::copy(r->I_n, r->I_n+(imgSizeX * imgSizeY), hostProjBuffer[projNr]);
// copy image and matrix to device
cudaMemcpyAsync( devProjection[streamNr], hostProjBuffer[projNr], imgSize, cudaMemcpyHostToDevice, stream[streamNr] );
// call kernel
backproject<<<numBlocks, threadsPerBlock, 0 , stream[streamNr]>>>(devProjection[streamNr]);
}
So, for CUDA, I create a pinned host pointer for each buffer item and copy the data to the device before executing kernel of each stream.
For OpenCL I initially did something similar when following the Nvidia OpenCL Best Practices Guide. Here they recommend creating two buffers, one for copying the kernel data to and one for the pinned memory. However, this leads to the implementation using double the device memory as both the kernel and pinned memory buffers are allocated on the device.
To get around this memory issue, I created an implementation where only a mapping is made to the device as it is needed. This can be seen in the following implementation:
// globals
float** hostProjBuffer = new float* [BUFFER_SIZE];
cl_mem devProjection[STREAMS_MAX], devMatrix[STREAMS_MAX];
cl_command_queue queue[STREAMS_MAX];
// initiate streams
void initialize()
{
for( uint s = 0; s < STREAMS_MAX; s++ ){
queue[s] = clCreateCommandQueueWithProperties(context, device, NULL, &status);
devProjection[s] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, imgSize, NULL, &status);
}
}
// main function called for all input images
void backproject(imgdata* r)
{
const uint projNr = r->imgnr % BUFFER_SIZE;
const uint streamNr = r->imgnr % STREAMS_MAX;
// when buffer is filled, wait until work in current stream has finished
if(projNr == 0) {
status = clFinish(queue[streamNr]);
}
// map host memory region to device buffer
hostProjBuffer[projNr] = (float*) clEnqueueMapBuffer(queue[streamNr], devProjection[streamNr], CL_FALSE, CL_MAP_WRITE_INVALIDATE_REGION, 0, imgSize, 0, NULL, NULL, &status);
// copy received image data to hostbuffers
std::copy(imgPtr, imgPtr + (imgSizeX * imgSizeY), hostProjBuffer[projNr]);
// unmap the allocated pinned host memory
clEnqueueUnmapMemObject(queue[streamNr], devProjection[streamNr], hostProjBuffer[projNr], 0, NULL, NULL);
// set stream specific arguments
clSetKernelArg(kernel, 0, sizeof(devProjection[streamNr]), (void *) &devProjection[streamNr]);
// launch kernel
clEnqueueNDRangeKernel(queue[streamNr], kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, NULL);
clFlush(queue[streamNr]);
clFinish(queue[streamNr]); //should be removed!
}
This implementation does use a similar amount of device memory as the CUDA implementation. However, I have been unable to get this last code example working without a clFinish after each loop, which significantly hampers the performance of the application. This indicates data is lost as the host moves ahead of the kernel. I tried increasing my buffer size to the number of input images, but this did not work either. So somehow during execution, the hostBuffer data gets lost.
So, with the goal to write OpenCL code similar to CUDA, I have three questions:
What is the recommended implementation for OpenCL pinned memory?
Is my OpenCL implementation similar to how CUDA handles pinned memory?
What causes the wrong data to be used in the OpenCL example?
Thanks in advance!
Kind regards,
Remy
PS: Question initially asked at the Nvidia developer forums

cuBLAS matrix inverse much slower than MATLAB

In my current project, I am attempting to calculate the inverse of a large (n > 2000) matrix with cuBLAS. The inverse calculation is performed, but for some reason calculation times are significantly longer than in MATLAB.
I have attached a sample calculation performed on random matrices using my implementation in either language as well as performance results.
Any help or suggestions on what may be causing this slowdown would be greatly appreciated.
Thank you in advance.
Comparison
cuBLAS vs. MATLAB
N = 500 : cuBLAS ~ 0.130 sec, MATLAB ~ 0.066 sec -> ~1.97x slower
N = 1000 : cuBLAS ~ 0.898 sec, MATLAB ~ 0.311 sec -> ~2.89x slower
N = 2000 : cuBLAS ~ 6.667 sec, MATLAB ~ 0.659 sec -> ~10.12x slower
N = 4000 : cuBLAS ~ 51.860 sec, MATLAB ~ 4.296 sec -> ~12.07x slower
C++ Code
#include <cstdio>   // printf
#include <cstdlib>  // malloc, rand
#include <string>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <conio.h>
#define CUDA_CALL(res, str) { if (res != cudaSuccess) { printf("CUDA Error : %s : %s %d : ERR %s\n", str, __FILE__, __LINE__, cudaGetErrorName(res)); } }
#define CUBLAS_CALL(res, str) { if (res != CUBLAS_STATUS_SUCCESS) { printf("CUBLAS Error : %s : %s %d : ERR %d\n", str, __FILE__, __LINE__, int(res)); } }
static cudaEvent_t cu_TimerStart;
static cudaEvent_t cu_TimerStop;
void d_CUDATimerStart(void)
{
CUDA_CALL(cudaEventCreate(&cu_TimerStart), "Failed to create start event!");
CUDA_CALL(cudaEventCreate(&cu_TimerStop), "Failed to create stop event!");
CUDA_CALL(cudaEventRecord(cu_TimerStart), "Failed to record start event!");
}
float d_CUDATimerStop(void)
{
CUDA_CALL(cudaEventRecord(cu_TimerStop), "Failed to record stop event!");
CUDA_CALL(cudaEventSynchronize(cu_TimerStop), "Failed to synch stop event!");
float ms;
CUDA_CALL(cudaEventElapsedTime(&ms, cu_TimerStart, cu_TimerStop), "Failed to elapse events!");
CUDA_CALL(cudaEventDestroy(cu_TimerStart), "Failed to destroy start event!");
CUDA_CALL(cudaEventDestroy(cu_TimerStop), "Failed to destroy stop event!");
return ms;
}
float* d_GetInv(float* L, int n)
{
cublasHandle_t cu_cublasHandle;
CUBLAS_CALL(cublasCreate(&cu_cublasHandle), "Failed to initialize cuBLAS!");
float** adL;
float** adC;
float* dL;
float* dC;
int* dLUPivots;
int* dLUInfo;
size_t szA = n * n * sizeof(float);
CUDA_CALL(cudaMalloc(&adL, sizeof(float*)), "Failed to allocate adL!");
CUDA_CALL(cudaMalloc(&adC, sizeof(float*)), "Failed to allocate adC!");
CUDA_CALL(cudaMalloc(&dL, szA), "Failed to allocate dL!");
CUDA_CALL(cudaMalloc(&dC, szA), "Failed to allocate dC!");
CUDA_CALL(cudaMalloc(&dLUPivots, n * sizeof(int)), "Failed to allocate dLUPivots!");
CUDA_CALL(cudaMalloc(&dLUInfo, sizeof(int)), "Failed to allocate dLUInfo!");
CUDA_CALL(cudaMemcpy(dL, L, szA, cudaMemcpyHostToDevice), "Failed to copy to dL!");
CUDA_CALL(cudaMemcpy(adL, &dL, sizeof(float*), cudaMemcpyHostToDevice), "Failed to copy to adL!");
CUDA_CALL(cudaMemcpy(adC, &dC, sizeof(float*), cudaMemcpyHostToDevice), "Failed to copy to adC!");
d_CUDATimerStart();
CUBLAS_CALL(cublasSgetrfBatched(cu_cublasHandle, n, adL, n, dLUPivots, dLUInfo, 1), "Failed to perform LU decomp operation!");
CUDA_CALL(cudaDeviceSynchronize(), "Failed to synchronize after kernel call!");
CUBLAS_CALL(cublasSgetriBatched(cu_cublasHandle, n, (const float **)adL, n, dLUPivots, adC, n, dLUInfo, 1), "Failed to perform Inverse operation!");
CUDA_CALL(cudaDeviceSynchronize(), "Failed to synchronize after kernel call!");
float timed = d_CUDATimerStop();
printf("cublas inverse in: %.5f ms.\n", timed);
float* res = (float*)malloc(szA);
CUDA_CALL(cudaMemcpy(res, dC, szA, cudaMemcpyDeviceToHost), "Failed to copy to res!");
CUDA_CALL(cudaFree(adL), "Failed to free adL!");
CUDA_CALL(cudaFree(adC), "Failed to free adC!");
CUDA_CALL(cudaFree(dL), "Failed to free dL!");
CUDA_CALL(cudaFree(dC), "Failed to free dC!");
CUDA_CALL(cudaFree(dLUPivots), "Failed to free dLUPivots!");
CUDA_CALL(cudaFree(dLUInfo), "Failed to free dLUInfo!");
CUBLAS_CALL(cublasDestroy(cu_cublasHandle), "Failed to destroy cuBLAS!");
return res;
}
int main()
{
int n = 1000;
float* L = (float*)malloc(n * n * sizeof(float));
for(int i = 0; i < n * n; i++)
L[i] = ((float)rand()/(float)(RAND_MAX));
float* inv = d_GetInv(L, n);
printf("done.");
_getch();
return 0;
}
MATLAB Code
A = rand(1000);
tic
X = inv(A);
toc
System Info:
GPU: GTX 780 3 GB
CPU: i7-4790S @ 3.20 GHz
As @RobertCrovella said, you should not use the batched small-matrix APIs to invert a single large matrix.
You can keep the same overall approach as in your code, but use the non-batched versions of getrf() and getri() to maximize performance for a large matrix.
The documentation for getrf() is here:
http://docs.nvidia.com/cuda/cusolver/index.html#cuds-lt-t-gt-getrf
As for getri(), the CUDA toolkit does not provide a non-batched getri() to solve AX=I, where A has been LU-factored by getrf(). It does, however, provide getrs() to solve AX=B, so all you need to do is set B=I before calling getrs(); the solution X is then the inverse of A.
http://docs.nvidia.com/cuda/cusolver/index.html#cuds-lt-t-gt-getrs
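To make the "set B=I" idea concrete, here is a minimal CPU-only C++ sketch, not cuSOLVER code. The functions lu_factor(), lu_solve() and invert() are hypothetical names introduced for illustration; they play the roles of getrf(), getrs() and the combined inversion respectively, and they assume a column-major n x n matrix as LAPACK-style routines do. On the GPU the same two steps would be carried out by the cuSOLVER getrf and getrs routines linked above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Column-major n x n matrix: element (i, j) lives at index i + j * n.
using Matrix = std::vector<double>;

// Hypothetical CPU stand-in for getrf(): in-place LU factorization with
// partial pivoting. Returns false if a zero pivot is found (singular matrix).
bool lu_factor(Matrix& a, std::vector<int>& piv, int n) {
    piv.resize(n);
    for (int k = 0; k < n; ++k) {
        // Pick the row with the largest magnitude in column k as the pivot.
        int p = k;
        for (int i = k + 1; i < n; ++i)
            if (std::fabs(a[i + k * n]) > std::fabs(a[p + k * n])) p = i;
        piv[k] = p;
        if (a[p + k * n] == 0.0) return false;
        if (p != k)
            for (int j = 0; j < n; ++j) std::swap(a[k + j * n], a[p + j * n]);
        // Store the multipliers of L below the diagonal, update the trailing block.
        for (int i = k + 1; i < n; ++i) {
            a[i + k * n] /= a[k + k * n];
            for (int j = k + 1; j < n; ++j)
                a[i + j * n] -= a[i + k * n] * a[k + j * n];
        }
    }
    return true;
}

// Hypothetical CPU stand-in for getrs(): solves A X = B in place in b,
// using the factors and pivots produced by lu_factor().
void lu_solve(const Matrix& lu, const std::vector<int>& piv, Matrix& b,
              int n, int nrhs) {
    for (int c = 0; c < nrhs; ++c) {
        double* x = &b[static_cast<std::size_t>(c) * n];
        for (int k = 0; k < n; ++k) std::swap(x[k], x[piv[k]]);  // row swaps
        for (int i = 1; i < n; ++i)           // forward substitution (unit L)
            for (int j = 0; j < i; ++j) x[i] -= lu[i + j * n] * x[j];
        for (int i = n - 1; i >= 0; --i) {    // back substitution (U)
            for (int j = i + 1; j < n; ++j) x[i] -= lu[i + j * n] * x[j];
            x[i] /= lu[i + i * n];
        }
    }
}

// Inverse via A X = I: factor once, then solve with the identity as B.
bool invert(Matrix a, Matrix& inv, int n) {
    assert(n > 0 && a.size() == static_cast<std::size_t>(n) * n);
    std::vector<int> piv;
    if (!lu_factor(a, piv, n)) return false;
    inv.assign(static_cast<std::size_t>(n) * n, 0.0);
    for (int i = 0; i < n; ++i) inv[i + i * n] = 1.0;  // B = I
    lu_solve(a, piv, inv, n, n);                       // X = A^-1
    return true;
}
```

For example, calling invert() on the 2x2 matrix with rows (4, 7) and (2, 6) yields the matrix with rows (0.6, -0.7) and (-0.2, 0.4). The factorization is done once and reused for all n right-hand sides, which is the same reason the getrf()/getrs() pair is a reasonable replacement for getri().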

tex1Dfetch unexpectedly returning 0

I don't believe this is the same issue as reported here:
Bound CUDA texture reads zero
CUDA 1D texture fetch always return 0
In my CUDA application I noticed that tex1Dfetch does not return the expected value past a certain index in the buffer. My initial observation in the application was that the value at index 0 was read correctly, but at index 12705625 the value read was 0. I made a small test program to investigate this, given below, and the results are a little baffling to me. I'm trying to find the index at which values are no longer read correctly, but as arraySize changes, so does the "firstBadIndex". Even with arraySize = 2, the second value is read incorrectly! As arraySize gets bigger, the firstBadIndex gets bigger. This happens when binding to arrays of float, float2, or float4. If the data are read from the device buffer instead (swap the commented lines in FetchTextureData), everything is fine. This is using CUDA 6.5 on a Tesla C2075.
Thanks for any insights or advice you might have.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#define FLOATTYPE float4
texture<FLOATTYPE,cudaTextureType1D,cudaReadModeElementType> texture1D;
const unsigned int arraySize = 1000;
FLOATTYPE* host;
FLOATTYPE* device;
FLOATTYPE* dTemp;
FLOATTYPE hTemp[1];
__global__ void FetchTextureData(FLOATTYPE* data,FLOATTYPE* arr,int idx)
{
data[0] = tex1Dfetch(texture1D, idx);
//data[0] = arr[idx];
}
bool GetTextureValues(int idx){
FetchTextureData<<<1,1>>>(dTemp,device,idx);
// copy to the host
cudaError_t err = cudaMemcpy(hTemp,dTemp,sizeof(FLOATTYPE),cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
throw "cudaMemcpy failed!";
}
if (cudaDeviceSynchronize() != cudaSuccess) {
throw "cudaDeviceSynchronize failed!";
}
return hTemp[0].x == 1.0f;
}
int main()
{
try{
host = new FLOATTYPE[arraySize];
cudaError_t err = cudaMalloc((void**)&device,sizeof(FLOATTYPE) * arraySize);
cudaError_t err1 = cudaMalloc((void**)&dTemp,sizeof(FLOATTYPE));
if (err != cudaSuccess || err1 != cudaSuccess) {
throw "cudaMalloc failed!";
}
// make some host data
for(unsigned int i=0; i<arraySize; i++){
FLOATTYPE data = {1.0f, 0.0f, 0.0f, 0.0f};
host[i] = data;
}
// and copy it to the device
err = cudaMemcpy(device,host,sizeof(FLOATTYPE) * arraySize,cudaMemcpyHostToDevice);
if (err != cudaSuccess){
throw "cudaMemcpy failed!";
}
// set up the textures
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<FLOATTYPE>();
texture1D.addressMode[0] = cudaAddressModeClamp;
texture1D.filterMode = cudaFilterModePoint;
texture1D.normalized = false;
cudaBindTexture(NULL, texture1D, device, channelDesc, arraySize);
// do a texture fetch and find where the fetches stop working
int lastGoodValue = -1, firstBadValue = -1;
float4 badValue = {-1.0f,0.0f,0.0f,0.0f};
for(unsigned int i=0; i<arraySize; i++){
if(i % 100000 == 0) printf("%d\n",i);
bool isGood = GetTextureValues(i);
if(firstBadValue == -1 && !isGood)
firstBadValue = i;
if(isGood)
lastGoodValue = i;
else
badValue = hTemp[0];
}
printf("lastGoodValue %d, firstBadValue %d\n",lastGoodValue,firstBadValue);
printf("Bad value is (%.2f)\n",badValue.x);
}catch(const char* err){
printf("\nCaught an error : %s\n",err);
}
return 0;
}
The problem lies in the texture set up. This:
cudaBindTexture(NULL, texture1D, device, channelDesc, arraySize);
should be:
cudaBindTexture(NULL, texture1D, device, channelDesc,
arraySize * sizeof(FLOATTYPE));
As per the documentation, the size argument is the size of the memory area in bytes, not the number of elements. I would have expected the code to still work with the clamped addressing mode. With border mode, you should get a zero value, which looks like it would trigger your bad-value detection. I haven't actually run your code, so perhaps there is a subtlety I'm missing somewhere. For such a simple repro case, your code structure is rather convoluted and hard to follow (at least on the mobile phone screen I am reading it on).
EDIT: between the time I started writing this and the time I finished, @njuffa pointed out the same mistake in the comments.
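The size arithmetic behind the fix can be sketched in plain host-side C++. This is only an illustration: Float4 is a stand-in for CUDA's float4 so the snippet compiles without the CUDA headers, and bind_size_bytes() and whole_elements_in() are hypothetical helper names, not part of the CUDA API. The sketch shows how many bytes should be passed and how few whole elements are covered when an element count is passed by mistake. It does not predict the exact index at which fetches start failing, since that depends on how the runtime and hardware treat out-of-range fetches.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Stand-in for CUDA's float4 (four packed floats, 16 bytes).
struct Float4 { float x, y, z, w; };

// Byte count to pass as the size argument when binding num_elements
// elements of type T to a texture.
template <typename T>
std::size_t bind_size_bytes(std::size_t num_elements) {
    // Guard against overflow of the multiplication.
    assert(num_elements <= std::numeric_limits<std::size_t>::max() / sizeof(T));
    return num_elements * sizeof(T);
}

// Number of whole elements of type T that fit in a region of `bytes` bytes,
// i.e. what is covered if an element count is mistakenly passed as bytes.
template <typename T>
std::size_t whole_elements_in(std::size_t bytes) {
    return bytes / sizeof(T);
}
```

With arraySize = 1000 and float4 elements, bind_size_bytes<Float4>(1000) gives 16000 bytes, whereas passing 1000 as the size covers only whole_elements_in<Float4>(1000) = 62 whole elements. That is consistent with the first bad index growing as arraySize grows.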