CUDA randState initializer kernel: identifier "curandState_t" is undefined - c++

The problem
I am compiling a CUDA shared library that I've written. For this library I need the ability to randomly sample things, so, as specified by the docs, I am initializing an array of curandState_t:
/**
 * @brief Sets up random number generators for each thread
 *
 * @param state pointer to the array of curandState
 * @param seed  seed for the random number generator
 */
__global__ void rand_setup(curandState_t* state, unsigned long seed)
{
    /* 3D grid of 3D blocks id */
    size_t gid = getGlobalIdx_3D_3D();
    curand_init(seed, gid, 0, &state[gid]);
    printf("Initializing random number generator for thread %llu\n",
           (unsigned long long)gid);
}
The getGlobalIdx_3D_3D() call just retrieves the global ID through a bunch of tedious calculations:
/**
 * @brief Get the global identifier of the thread
 *
 * thanks to: https://cs.calvin.edu/courses/cs/374/CUDA/CUDA-Thread-Indexing-Cheatsheet.pdf
 * for the snippet
 *
 * @return size_t The global index of the thread
 */
__device__ size_t getGlobalIdx_3D_3D()
{
    size_t blockId = blockIdx.x + blockIdx.y * gridDim.x
                   + gridDim.x * gridDim.y * blockIdx.z;
    size_t threadId = blockId * (blockDim.x * blockDim.y * blockDim.z)
                    + (threadIdx.z * (blockDim.x * blockDim.y))
                    + (threadIdx.y * blockDim.x) + threadIdx.x;
    return threadId;
}
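(For context, a host-side launch for such a setup could look roughly like the sketch below; the grid/block sizes and variable names are illustrative assumptions, not part of the original code.)

// Sketch: allocate one curandState_t per thread and launch rand_setup
// with a 3D grid of 3D blocks, matching the indexing scheme above.
dim3 grid(4, 4, 2);
dim3 block(8, 8, 4);   // 8*8*4 = 256 threads per block
size_t n_threads = (size_t)grid.x * grid.y * grid.z
                 * (size_t)block.x * block.y * block.z;
curandState_t* d_states = nullptr;
cudaMalloc(&d_states, n_threads * sizeof(curandState_t));
rand_setup<<<grid, block>>>(d_states, 1234UL);
cudaDeviceSynchronize();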
The errors
I am getting a cascade of compilation errors (I have just completed a substantial amount of work), but many of them seem to stem from the fact that curandState_t is not recognized as a proper type, hence this annoying dump:
D:\Repos\MulticoreTree\shared\libkernel.cu(325): error: attribute "__global__" does not apply here
D:\Repos\MulticoreTree\shared\libkernel.cu(325): error: incomplete type is not allowed
D:\Repos\MulticoreTree\shared\libkernel.cu(325): error: identifier "curandState_t" is undefined
D:\Repos\MulticoreTree\shared\libkernel.cu(325): error: identifier "state" is undefined
D:\Repos\MulticoreTree\shared\libkernel.cu(325): error: type name is not allowed
D:\Repos\MulticoreTree\shared\libkernel.cu(325): error: expected a ")"
D:\Repos\MulticoreTree\shared\libkernel.cu(326): error: expected a ";"
D:\Repos\MulticoreTree\shared\libkernel.cu(407): warning #12-D: parsing restarts here after previous syntax error
Looking online, there doesn't seem to be any documentation that tells me to include a specific header. Also, I have #include <cuda_runtime.h> at the top of my file, so what could be wrong?
I personally think the type is not being recognized, but there might also be a problem with something else in the function's signature that causes compilation to fail.

As per the first sentence of the cuRAND device API documentation:
To use the device API, include the file curand_kernel.h in files that define kernels that use cuRAND device functions.
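In other words, the fix is one more include; cuda_runtime.h alone does not pull in the cuRAND device types. A minimal sketch:

#include <cuda_runtime.h>
#include <curand_kernel.h>   // defines curandState_t, curand_init, ...

__global__ void rand_setup(curandState_t* state, unsigned long seed)
{
    size_t gid = getGlobalIdx_3D_3D();
    curand_init(seed, gid, 0, &state[gid]);
}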

Related

For loop based kernel vs If statement Kernel - Cuda

I have seen CUDA kernels started in two separate ways:
1.
for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x; i < length; i += blockDim.x * gridDim.x)
{
    // do stuff
}
2.
uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < length)
{
    // do stuff
}
Both versions are launched with kernel<<<num_blocks, threads_per_block>>>, where the threads per block are maximized for our device (1024) and the number of blocks is 2 for a length of 1025, for example.
The obvious difference is that the for loop allows the kernel to loop when it is launched with fewer threads than elements; for example, with 512 threads, 2 blocks, and a length of 1025, it loops twice.
From previous research I've gathered that Nvidia suggests we not try to load balance ourselves (read: loop within the kernel like this), for instance by giving a kernel fewer threads or fewer blocks to reserve space for other kernels on the device, because the built-in load balancing is supposed to handle this in a more globally optimized way.
So my question is: why would we want to use the for-loop vs the if-statement form of kernel? Is there a benefit to either at run time?
Given my understanding of Nvidia's stance on load balancing, the only value I can see is the ability to debug synchronously with a 1-block, 1-thread launch (<<<1, 1>>>) in the for-loop version, and not having to precompute the number of blocks (and/or threads) needed.
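(For reference, the precomputation in question is just the usual ceiling division; a one-line sketch, with threads_per_block as an assumed variable:)

// Round up so the last, possibly partial, block still covers the tail of the data.
int num_blocks = (length + threads_per_block - 1) / threads_per_block;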
This is the test project I ran:
#include <cstdint>
#include <cstdio>

__global__ void kernel(int length)
{
    int counter = 0;
    for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x; i < length; i += blockDim.x * gridDim.x)
    {
        printf("%u: | i+: %u | tid: %u | counter: %d \n", i, blockDim.x * gridDim.x, threadIdx.x, counter++);
    }
}

__global__ void kernel2(int length)
{
    uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length)
        printf("%u: | i+: %u | tid: %u | \n", i, blockDim.x * gridDim.x, threadIdx.x);
}

int main()
{
    //kernel<<<2, 1024>>>(1025);
    kernel2<<<2, 1024>>>(1025);
    cudaDeviceSynchronize();
}
So my question is: why would we want to use the for-loop vs the if-statement form of kernel? Is there a benefit to either at run time?
Yes, there is. Every CUDA thread needs to:
Read all of its parameters from constant memory
Read grid and thread information from special registers: blockDim, blockIdx, threadIdx (or at least their .x components)
Do the arithmetic for computing its global index.
That takes a bit of time. It's not a lot, but if your kernel is very simple (e.g. something like adding up two arrays), then yes, it has a cost. And of course, if you perform your own preliminary computation that is used for all items in the sequence, each thread has to take the time to do that as well.
From previous research I've gathered that Nvidia suggests that we do not try and load balance ourselves (read as loop within the kernel like this)
I doubt that. The question of whether to iterate over a large sequence with a single CUDA thread per item, or with fewer threads each working on multiple items, depends on what is done for the individual items in the sequence.
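As an aside, when you do use the for-loop (grid-stride) form, a common way to size the launch is from the device's multiprocessor count rather than from the data length; a sketch (the 32-blocks-per-SM and 256-thread figures are common heuristics, not values from the question):

int device = 0, numSMs = 0;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
kernel<<<32 * numSMs, 256>>>(1025);   // the grid-stride loop covers any length
cudaDeviceSynchronize();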

cudaTextureObject_t texFetch1D doesn't compile

This code doesn't compile with CUDA toolkit 7.5, on a GTX 980 with compute capability set to 5.2, in Visual Studio 2013:
__global__ void a_kernel(cudaTextureObject_t texObj)
{
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    int something = tex1Dfetch(texObj, thread_id);
}
Here is the error:
error : more than one instance of overloaded function "tex1Dfetch" matches the argument list:
This code doesn't compile either:
__global__ void another_kernel(cudaTextureObject_t texObj)
{
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    float something = tex1Dfetch<float>(texObj, thread_id);
}
Here is that error:
error : type name is not allowed
Following this example and its comments, all of the above should work:
https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-kepler-texture-objects-improve-performance-and-flexibility/
Please let me know if you need additional info; I couldn't think of what else to provide.
Your first kernel doesn't compile because of a missing template type argument. This will compile:
__global__ void a_kernel(cudaTextureObject_t texObj)
{
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    int something = tex1Dfetch<int>(texObj, thread_id);
}
Your second kernel is correct, and it does compile for me using VS2012 with the CUDA 7.0 toolkit for every compute capability I tried (sm_30 through sm_52).
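For completeness, here is a host-side sketch of creating a cudaTextureObject_t over a linear device buffer so the kernels above have something to fetch from (d_buffer, N, blocks, and threads are assumptions for illustration):

// Bind an existing linear device allocation to a texture object.
cudaResourceDesc resDesc = {};
resDesc.resType = cudaResourceTypeLinear;
resDesc.res.linear.devPtr = d_buffer;                     // existing device buffer
resDesc.res.linear.desc = cudaCreateChannelDesc<int>();   // element type, matching tex1Dfetch<int>
resDesc.res.linear.sizeInBytes = N * sizeof(int);

cudaTextureDesc texDesc = {};
texDesc.readMode = cudaReadModeElementType;

cudaTextureObject_t texObj = 0;
cudaCreateTextureObject(&texObj, &resDesc, &texDesc, nullptr);
a_kernel<<<blocks, threads>>>(texObj);
cudaDestroyTextureObject(texObj);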
I reinstalled the CUDA toolkit and now the second piece of code (another_kernel) compiles. The first piece of code was incorrect in the first place, as per the answer above. Regarding the reinstall: I must have previously clobbered something in the SDK; I believe it was texture_indirect_functions.h.

cudaOccupancyMaxActiveBlocksPerMultiprocessor is undefined

I am trying to learn CUDA and use it efficiently. I found code on Nvidia's website which shows how to determine the block size to use for the most efficient usage of the device. The code is as follows:
#include <iostream>

// Device code
__global__ void MyKernel(int *d, int *a, int *b)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    d[idx] = a[idx] * b[idx];
}

// Host code
int main()
{
    int numBlocks;     // Occupancy in terms of active blocks
    int blockSize = 32;

    // These variables are used to convert occupancy to warps
    int device;
    cudaDeviceProp prop;
    int activeWarps;
    int maxWarps;

    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks,
        MyKernel,
        blockSize,
        0);

    activeWarps = numBlocks * blockSize / prop.warpSize;
    maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    std::cout << "Occupancy: " << (double)activeWarps / maxWarps * 100 << "%" << std::endl;

    return 0;
}
However, when I compile it, I get the following error:
Compile line:
nvcc ben_deneme2.cu -arch=sm_35 -rdc=true -lcublas -lcublas_device -lcudadevrt -o my
Error:
ben_deneme2.cu(25): error: identifier "cudaOccupancyMaxActiveBlocksPerMultiprocessor" is undefined
1 error detected in the compilation of "/tmp/tmpxft_0000623d_00000000-8_ben_deneme2.cpp1.ii".
Should I include a library for this? I could not find a library name for it on the internet. Or am I doing something else wrong?
Thanks in advance
The cudaOccupancyMaxActiveBlocksPerMultiprocessor function was introduced in CUDA 6.5. You do not have access to that function with an earlier version of CUDA installed; for example, it will not work with CUDA 5.5.
If you want to use that function, you must update your CUDA version to at least 6.5.
People using older versions usually use the CUDA Occupancy Calculator.
One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. -- CUDA Pro Tip: Occupancy API Simplifies Launch Configuration
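If the goal is simply to pick a good block size, CUDA 6.5 also added cudaOccupancyMaxPotentialBlockSize, which chooses one for you instead of testing candidates by hand; a sketch using MyKernel from the question (N, d, a, b are assumed to be set up elsewhere):

int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, 0);
int gridSize = (N + blockSize - 1) / blockSize;   // round up to cover N elements
MyKernel<<<gridSize, blockSize>>>(d, a, b);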

cudaErrorLaunchFailure while trying to run simple templated kernel on a 64bit data type

I have this simple kernel code:
template<typename T> __global__ void CalcHamming( const T* pData, const uint64_t u64Count, const T Arg, uint32_t* pu32Results )
{
    uint64_t gidx = blockDim.x * blockIdx.x + threadIdx.x;
    while ( gidx < u64Count )
    {
        pu32Results[ gidx ] += __popc( pData[gidx] ^ Arg );
        gidx += blockDim.x * gridDim.x;
    }
}
It works correctly unless I use it on a 64-bit unsigned int (uint64_t). In that case I get cudaErrorLaunchFailure. I figured that maybe the problem is in __popc(), which cannot handle 64-bit numbers, so I made a specialized function to solve this:
template<> __global__ void CalcHamming<uint64_t>( const uint64_t* pData, const uint64_t u64Count, const uint64_t Arg, uint32_t* pu32Results )
{
    uint64_t gidx = blockDim.x * blockIdx.x + threadIdx.x;
    while ( gidx < u64Count )
    {
        pu32Results[ gidx ] += __popcll( pData[gidx] ^ Arg );
        gidx += blockDim.x * gridDim.x;
    }
}
However the problem still remains. One thing to note is that my data are not in several arrays, like this:
Array1 (uint32_t): 100 items
Array2 (uint64_t): 200 items
But instead concatenated in one memory block:
Array: 100 items (uint32_t), 200 items (uint64_t)
And I am doing some pointer arithmetic to launch the kernel at the correct spot. I'm quite sure those calculations are correct. Also note that the above example is a simplified case; I have many more 'subarrays' of various integer types concatenated like this.
My guess is that this is behind the issue: CUDA somehow dislikes the alignment of the uint64_t array. However, fixing this requires quite a lot of effort, and I would like to be sure it will help before I do it. Or can I fix it just by modifying the kernel somehow? Will there be performance penalties?
uint64_t must be 8-byte aligned: see HERE.
So yes, CUDA "dislikes" misaligned types: it does not run at all with them.
However, I think you can avoid rearranging your data structure externally. It's enough to check for, and treat as uint32_t (or uint8_t for total generality!), the extremes of the array. That's quite common in optimized kernels, especially ones using vector types such as float4, int4, ...
For some alignment tips see HERE.
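A sketch of the kernel-side variant of that idea, for the case where the uint64_t sub-array starts on a 4-byte (but not 8-byte) boundary: load each element as two aligned 32-bit halves instead of one 64-bit load. This illustrates the suggestion above; it is not tested against the question's data layout:

// Load a uint64_t from a 4-byte-aligned address via two 32-bit loads.
// CUDA GPUs are little-endian, so the low word comes first.
__device__ uint64_t load_u64_from_u32_pair( const uint32_t* p )
{
    uint64_t lo = p[0];
    uint64_t hi = p[1];
    return lo | ( hi << 32 );
}

template<> __global__ void CalcHamming<uint64_t>( const uint64_t* pData, const uint64_t u64Count, const uint64_t Arg, uint32_t* pu32Results )
{
    const uint32_t* p32 = reinterpret_cast<const uint32_t*>( pData );
    uint64_t gidx = blockDim.x * blockIdx.x + threadIdx.x;
    while ( gidx < u64Count )
    {
        pu32Results[ gidx ] += __popcll( load_u64_from_u32_pair( p32 + 2 * gidx ) ^ Arg );
        gidx += blockDim.x * gridDim.x;
    }
}

The trade-off is two 32-bit loads per element instead of one 64-bit load, which costs some memory throughput but avoids rearranging the concatenated buffer.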

Segmentation fault when saving a JPEG file (from an array of RGB data)

I found the following code on the internet for reading and writing a JPEG file using the libjpeg library.
I changed the function void write_JPEG_file (char * filename, int quality) to the following:
void write_JPEG_vetor (JSAMPLE * image_data, int height, int width, int quality)
{
    printf("%s\n", "write_JPEG_vetor");
    /* This struct contains the JPEG compression parameters and pointers to
     * working space (which is allocated as needed by the JPEG library).
     * It is possible to have several such structures, representing multiple
     * compression/decompression processes, in existence at once. We refer
     * to any one struct (and its associated working data) as a "JPEG object".
     */
    struct jpeg_compress_struct cinfo;
    /* This struct represents a JPEG error handler. It is declared separately
     * because applications often want to supply a specialized error handler
     * (see the second half of this file for an example). But here we just
     * take the easy way out and use the standard error handler, which will
     * print a message on stderr and call exit() if compression fails.
     * Note that this struct must live as long as the main JPEG parameter
     * struct, to avoid dangling-pointer problems.
     */
    struct jpeg_error_mgr jerr;
    /* More stuff */
    FILE * outfile;          /* target file */
    JSAMPROW row_pointer[1]; /* pointer to JSAMPLE row[s] */
    int row_stride;          /* physical row width in image buffer */

    printf("%s\n", "/* Step 1: allocate and initialize JPEG compression object */");
    /* We have to set up the error handler first, in case the initialization
     * step fails. (Unlikely, but it could happen if you are out of memory.)
     * This routine fills in the contents of struct jerr, and returns jerr's
     * address which we place into the link field in cinfo.
     */
    cinfo.err = jpeg_std_error(&jerr);
    /* Now we can initialize the JPEG compression object. */
    jpeg_create_compress(&cinfo);

    printf("%s\n", "/* Step 2: specify data destination (eg, a file) */");
    /* Note: steps 2 and 3 can be done in either order. */
    /* Here we use the library-supplied code to send compressed data to a
     * stdio stream. You can also write your own code to do something else.
     * VERY IMPORTANT: use "b" option to fopen() if you are on a machine that
     * requires it in order to write binary files.
     */
    char * filename = {"novo_arquivo.jpeg"};
    if ((outfile = fopen(filename, "wb")) == NULL) {
        fprintf(stderr, "can't open %s\n", filename);
        exit(1);
    }
    jpeg_stdio_dest(&cinfo, outfile);

    printf("%s\n", "/* Step 3: set parameters for compression */");
    /* First we supply a description of the input image.
     * Four fields of the cinfo struct must be filled in:
     */
    cinfo.image_width = width;      /* image width and height, in pixels */
    cinfo.image_height = height;
    cinfo.input_components = 3;     /* # of color components per pixel */
    cinfo.in_color_space = JCS_RGB; /* colorspace of input image */
    /* Now use the library's routine to set default compression parameters.
     * (You must set at least cinfo.in_color_space before calling this,
     * since the defaults depend on the source color space.)
     */
    jpeg_set_defaults(&cinfo);
    /* Now you can set any non-default parameters you wish to.
     * Here we just illustrate the use of quality (quantization table) scaling:
     */
    jpeg_set_quality(&cinfo, quality, TRUE /* limit to baseline-JPEG values */);

    printf("%s\n", "/* Step 4: Start compressor */");
    /* TRUE ensures that we will write a complete interchange-JPEG file.
     * Pass TRUE unless you are very sure of what you're doing.
     */
    jpeg_start_compress(&cinfo, TRUE);

    printf("%s\n", "/* Step 5: while (scan lines remain to be written) */");
    /* jpeg_write_scanlines(...); */
    /* Here we use the library's state variable cinfo.next_scanline as the
     * loop counter, so that we don't have to keep track ourselves.
     * To keep things simple, we pass one scanline per call; you can pass
     * more if you wish, though.
     */
    row_stride = width * 3; /* JSAMPLEs per row in image_buffer */
    while (cinfo.next_scanline < cinfo.image_height) {
        printf("%s\n", "Loop WHILE");
        /* jpeg_write_scanlines expects an array of pointers to scanlines.
         * Here the array is only one element long, but you could pass
         * more than one scanline at a time if that's more convenient.
         */
        row_pointer[0] = &image_data[cinfo.next_scanline * row_stride];
        (void) jpeg_write_scanlines(&cinfo, row_pointer, row_stride);
    }

    printf("%s\n", "/* Step 6: Finish compression */");
    jpeg_finish_compress(&cinfo);
    /* After finish_compress, we can close the output file. */
    fclose(outfile);

    printf("%s\n", "/* Step 7: release JPEG compression object */");
    /* This is an important step since it will release a good deal of memory. */
    jpeg_destroy_compress(&cinfo);
    /* And we're done! */
}
Now, when I run the program (in a Linux environment), I get a Segmentation Fault error. Can someone tell me why this is happening? My main suspect is this code:
while (cinfo.next_scanline < cinfo.image_height) {
    printf("%s\n", "Loop WHILE");
    /* jpeg_write_scanlines expects an array of pointers to scanlines.
     * Here the array is only one element long, but you could pass
     * more than one scanline at a time if that's more convenient.
     */
    row_pointer[0] = &image_data[cinfo.next_scanline * row_stride];
    (void) jpeg_write_scanlines(&cinfo, row_pointer, row_stride);
}
but I'm not sure about that, and I can't find a solution, despite spending a good amount of time trying.
=== UPDATE ===
I included the following debugging code in that part of the code:
while (cinfo.next_scanline < cinfo.image_height) {
    printf("%s\n", "Loop WHILE");
    /* jpeg_write_scanlines expects an array of pointers to scanlines.
     * Here the array is only one element long, but you could pass
     * more than one scanline at a time if that's more convenient.
     */
    printf("%s\n", "parte 1.1");
    row_pointer[0] = &image_data[cinfo.next_scanline * row_stride];
    printf("%s\n", "parte 1.2");
    printf("%s\n", "parte 2.1");
    (void) jpeg_write_scanlines(&cinfo, row_pointer, 1);
    printf("%s\n", "parte 2.2");
}
And with that, the output when running the program is:
Loop WHILE
parte 1.1
parte 1.2
parte 2.1
=== UPDATE 2 ===
For the record, in my program, write_JPEG_vetor receives the return value of this function:
JSAMPLE * inverte_imagem()
{
    int tamanho = image_height * image_width * image_colors;
    int i;
    JSAMPLE * vetor = malloc(sizeof(JSAMPLE) * (image_height * image_width * image_colors));
    for (i = 0; i < tamanho; i++)
        vetor[i] = image_buffer[tamanho - (i + 1)];
    /* note: as posted there is no "return vetor;" here, so the function
     * falls off the end and the caller receives an indeterminate pointer */
}
This looks wrong:
(void) jpeg_write_scanlines(&cinfo, row_pointer, row_stride);
That last parameter is the number of lines to write, not the row length. You probably want:
(void) jpeg_write_scanlines(&cinfo, row_pointer, 1);
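As a side note, jpeg_write_scanlines can also consume several rows per call if you hand it an array of row pointers; here is a sketch (height, row_stride, and image_data as in the question's code; the rows array is introduced just for illustration):

/* Build one pointer per scanline, then let libjpeg take as many rows
 * per call as it wants; it advances cinfo.next_scanline itself. */
JSAMPROW *rows = (JSAMPROW *) malloc(height * sizeof(JSAMPROW));
int y;
for (y = 0; y < height; y++)
    rows[y] = &image_data[y * row_stride];
while (cinfo.next_scanline < cinfo.image_height)
    (void) jpeg_write_scanlines(&cinfo, rows + cinfo.next_scanline,
                                cinfo.image_height - cinfo.next_scanline);
free(rows);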
OK, I solved the problem by putting the call to write_JPEG_vetor inside the function inverte_imagem(). I don't know why, but when I made the call from my main function, a memory problem (segmentation fault on Linux) occurred. (Most likely because inverte_imagem, as posted, never actually returns vetor, so main was passing an indeterminate pointer to write_JPEG_vetor.)