Some child grids not being executed with CUDA Dynamic Parallelism - c++

I'm experimenting with the new Dynamic Parallelism feature in CUDA 5.0 (GK110). I'm seeing the strange behavior that my program does not return the expected result for some configurations; not only is the result unexpected, it is also different with each launch.
Now I think I found the source of my problem: it seems that some child grids (kernels launched by other kernels) are sometimes not executed when too many child grids are spawned at the same time.
I wrote a little test program to illustrate this behavior:
#include <stdio.h>
#include <stdlib.h>

__global__ void out_kernel(char* d_out, int index)
{
    d_out[index] = 1;
}

__global__ void kernel(char* d_out)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    out_kernel<<<1, 1>>>(d_out, index);
}

int main(int argc, char** argv) {
    int griddim = 10, blockdim = 210;
    // optional: read griddim and blockdim from command line
    if (argc > 1) griddim = atoi(argv[1]);
    if (argc > 2) blockdim = atoi(argv[2]);
    const int numLaunches = griddim * blockdim;
    const int memsize = numLaunches * sizeof(char);

    // allocate device memory, set to 0
    char* d_out; cudaMalloc(&d_out, memsize);
    cudaMemset(d_out, 0, memsize);

    // launch outer kernel
    kernel<<<griddim, blockdim>>>(d_out);
    cudaDeviceSynchronize();

    // download results
    char* h_out = new char[numLaunches];
    cudaMemcpy(h_out, d_out, memsize, cudaMemcpyDeviceToHost);

    // check results, reduce output to 10 errors
    int maxErrors = 10;
    for (int i = 0; i < numLaunches; ++i) {
        if (h_out[i] != 1) {
            printf("Value at index %d is %d, should be 1.\n", i, h_out[i]);
            if (maxErrors-- == 0) break;
        }
    }

    // clean up
    delete[] h_out;
    cudaFree(d_out);
    cudaDeviceReset();
    return maxErrors < 10 ? 1 : 0;
}
The program launches a kernel in a given number of blocks (1st parameter) with a given number of threads each (2nd parameter). Each thread in that kernel will then launch another kernel with a single thread. This child kernel will write a 1 in its portion of an output array (which was initialized with 0s).
At the end of execution all values in the output array should be 1. But strangely for some block- and grid-sizes some of the array values are still zero. This basically means that some of the child grids are not executed.
This only happens if many of the child grids are spawned at the same time. On my test system (a Tesla K20x) this is the case for 10 blocks containing 210 threads each. 10 blocks with 200 threads deliver the correct result, though. But also 3 blocks with 1024 threads each cause the error.
Strangely, no error is reported back by the runtime. The child grids simply seem to be ignored by the scheduler.
Does anyone else face the same problem? Is this behavior documented somewhere (I did not find anything), or is it really a bug in the device runtime?

You're doing no error checking of any kind that I can see. You can and should do similar error checking on device-side kernel launches. Refer to the documentation; these errors will not necessarily be bubbled up to the host:
Errors are recorded per-thread, so that each thread can identify the most recent error that it has generated.
You must trap them in the device. There are plenty of examples of this type of device error checking in the documentation.
If you were to do proper error checking, you would discover that in each case where a kernel failed to launch, the CUDA device runtime API was returning error 69, cudaErrorLaunchPendingCountExceeded.
If you scan the documentation for this error, you'll find this:
cudaLimitDevRuntimePendingLaunchCount
Controls the amount of memory set aside for buffering kernel launches which have not yet begun to execute, due either to unresolved dependencies or lack of execution resources. When the buffer is full, launches will set the thread’s last error to cudaErrorLaunchPendingCountExceeded. The default pending launch count is 2048 launches.
At 10 blocks * 200 threads, you are launching 2000 kernels, and things seem to work.
At 10 blocks * 210 threads, you are launching 2100 kernels, which exceeds the 2048 limit mentioned above.
Note that this is somewhat dynamic in nature; depending on how your application launches child kernels, you may launch in excess of 2048 kernels easily without hitting this limit. But since your application launches all kernels approximately simultaneously, you are hitting the limit.
Proper CUDA error checking is advisable any time your CUDA code is not behaving the way you expect.
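On the host side, a minimal checking pattern applied to the code above might look like this (a sketch; the CHECK_CUDA macro name is illustrative, not part of the CUDA API):
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// usage in the host code above:
CHECK_CUDA(cudaMalloc(&d_out, memsize));
kernel<<<griddim, blockdim>>>(d_out);
CHECK_CUDA(cudaGetLastError());      // catches launch-configuration errors
CHECK_CUDA(cudaDeviceSynchronize()); // catches asynchronous execution errors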
If you'd like to get some confirmation of the above, in your code you can modify your main kernel like this:
__global__ void kernel(char* d_out)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    out_kernel<<<1, 1>>>(d_out, index);
    // cudaDeviceSynchronize(); // not necessary since error 69 is returned immediately
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) d_out[index] = (char)err;
}
The pending launch count limit is modifiable. Refer to the documentation for cudaLimitDevRuntimePendingLaunchCount
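For example, raising the limit from the host before the parent launch (a sketch; 4096 is just a value large enough for this test case):
// must be set on the host, before launching the parent kernel
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);
kernel<<<griddim, blockdim>>>(d_out);
cudaDeviceSynchronize();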

Related

For loop based kernel vs If statement Kernel - Cuda

I have seen CUDA kernels started in two separate ways:
1.
for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x; i < length; i += blockDim.x * gridDim.x)
{
    // do stuff
}
2.
uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < length)
{
    // do stuff
}
Both versions are launched with kernel<<<num_blocks, threads_per_block>>>, where the threads per block are maximized for our device (1024) and the number of blocks is 2 for a length of 1025, for example.
The obvious difference is that the for loop allows the kernel to loop when it is launched with fewer threads; for example, with 512 threads in 2 blocks and a length of 1025, it loops twice.
From previous research I've gathered that Nvidia suggests we do not try to load balance ourselves (read: loop within the kernel like this), for instance by giving a kernel fewer threads or fewer blocks to reserve space for other kernels on the device, because the built-in load balancing is supposed to handle this in a more globally optimized way.
So my question is why would we want to use the for loop vs the if statement form of kernel? Is there a benefit to either at run time?
Given my understanding of Nvidia's stance on load balancing, the only value I can see is the ability to debug serially with a single thread and block by launching the kernel with <<<1, 1>>> in the for-loop version, or not having to precompute the number of blocks (and/or threads) needed.
This is the test project I ran:
#include <cstdint>
#include <cstdio>

__global__ void kernel(int length)
{
    int counter = 0;
    for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x; i < length; i += blockDim.x * gridDim.x)
    {
        printf("%u: | i+: %u | tid: %u | counter: %d \n", i, blockDim.x * gridDim.x, threadIdx.x, counter++);
    }
}

__global__ void kernel2(int length)
{
    uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length)
        printf("%u: | i+: %u | tid: %u | \n", i, blockDim.x * gridDim.x, threadIdx.x);
}

int main()
{
    //kernel<<<2, 1024>>>(1025);
    kernel2<<<2, 1024>>>(1025);
    cudaDeviceSynchronize();
}
So my question is why would we want to use the for loop vs the if statement form of kernel? Is there a benefit to either at run time?
Yes, there is. Every CUDA thread needs to:
Read all of its parameters from constant memory
Read grid and thread information from special registers: blockDim, blockIdx, threadIdx (or at least their .x components)
Do the arithmetic for computing its global index.
That takes a bit of time. It's not a lot; but if your kernel is very simple (e.g. something like adding up two arrays), then - yes, that has a cost. And of course, if you perform your own preliminary computation that is used with all items in the sequence - each thread has to take the time to do that as well.
From previous research I've gathered that Nvidia suggests that we do not try and load balance ourselves (read as loop within the kernel like this)
I doubt that. The question of whether to iterate a large sequence with a single "CUDA thread" per item or with fewer threads, each working on multiple items, depends on what is done for individual items in the sequence.

CUDA signal to host

Is there a way to signal (success/failure) to the host at the end of kernel execution?
I am looking at an iterative process where calculations are made in device and after each iteration, a boolean variable is passed to host that tells if the process has converged. Based on the variable, host decides to either stop iterating or go through another round of iteration.
Copying a single boolean variable at the end of every iteration nullifies the time gain obtained through parallelization. Hence, I would like to find a way to let the host know of the convergence status (success/failure) without having to do a cudaMemcpy every time.
Note: The time issue exists after using pinned memory to transfer data.
Alternatives that I have looked at:
1. asm("trap;") and assert()
These will trigger an unknown error and cudaErrorAssert, respectively, on the host. Unfortunately, they are "sticky" in that the error cannot be reset using cudaGetLastError; the only way is to reset the device using cudaDeviceReset().
2. Using cudaHostAllocMapped to avoid cudaMemcpy
This is of no use as it does not offer any time-based advantage over standard pinned memory allocation + cudaMemcpy (p. 460, Multicore and GPU Programming: An Integrated Approach, Morgan Kaufmann, 2014).
I would appreciate other ways to overcome this issue.
I suspect the real issue here is that your iteration kernel run time is very short (on the order of 100us or less), meaning the work per iteration is very small. The best solution might be to try to increase the work per iteration (refactor your code/algorithm, tackle a larger problem, etc.)
However, here are some possibilities:
1. Use mapped/pinned memory. Your claim in item 2 of your question is unsupported, IMO, without a lot more context than a page reference to a book that many of us probably don't have available to look at.
2. Use dynamic parallelism. Move your kernel launch process to a CUDA parent kernel that is issuing child kernels. Whatever boolean is set by the child kernel will be immediately discoverable in the parent kernel, without any need for a cudaMemcpy operation or mapped/pinned memory.
3. Use a pipelined algorithm, and overlap a speculative kernel launch with the device->host copy of the boolean, for each pipeline stage.
I consider the first two items above fairly obvious, so I'll provide a worked example for item 3. The basic idea is that we will ping-pong between two streams, launching the kernel alternately into one stream and then the other. We will have a 3rd stream so that we can overlap the device->host copy operations with the execution of the next launch. Due to the overlap of the D->H copy with kernel execution, there is effectively no "cost" for the copy operation; it is hidden by kernel execution work.
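As a side note, item 2 might look roughly like this sketch (names are illustrative, the work is elided, and it assumes the device-side cudaDeviceSynchronize() provided by the dynamic parallelism model of that era, compiled with -rdc=true for a cc 3.5+ device):
__global__ void do_iteration(float *state, bool *d_converged)
{
    // ... one iteration of the real work; set *d_converged when the process has converged ...
}

__global__ void parent(float *state, bool *d_converged, int max_iter)
{
    // launched as parent<<<1,1>>>(...): a single device thread drives the iteration loop
    for (int i = 0; i < max_iter; i++) {
        do_iteration<<<256, 256>>>(state, d_converged);
        cudaDeviceSynchronize();     // device-side sync: wait for the child grid to finish
        if (*d_converged) break;     // the flag is visible here with no host copy at all
    }
}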
Here's a fully worked example of item 3, plus an nvvp timeline:
$ cat t267.cu
#include <stdio.h>
const int stop_count = 5;
const long long tdelay = 1000000LL;
__global__ void test_kernel(int *icounter, bool *istop, int *ocounter, bool *ostop){
if (*istop) return;
long long start = clock64();
while (clock64() < tdelay+start);
int my_count = *icounter;
my_count++;
if (my_count >= stop_count) *ostop = true;
*ocounter = my_count;
}
int main(){
volatile bool *v_stop;
volatile int *v_counter;
bool *h_stop, *d_stop1, *d_stop2, *d_s1, *d_s2, *d_ss;
int *h_counter, *d_counter1, *d_counter2, *d_c1, *d_c2, *d_cs;
cudaStream_t s1, s2, s3, *sp1, *sp2, *sps;
cudaEvent_t e1, e2, *ep1, *ep2, *eps;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
cudaStreamCreate(&s3);
cudaEventCreate(&e1);
cudaEventCreate(&e2);
cudaMalloc(&d_counter1, sizeof(int));
cudaMalloc(&d_stop1, sizeof(bool));
cudaMalloc(&d_counter2, sizeof(int));
cudaMalloc(&d_stop2, sizeof(bool));
cudaHostAlloc(&h_stop, sizeof(bool), cudaHostAllocDefault);
cudaHostAlloc(&h_counter, sizeof(int), cudaHostAllocDefault);
v_stop = h_stop;
v_counter = h_counter;
int n_counter = 1;
h_stop[0] = false;
h_counter[0] = 0;
cudaMemcpy(d_stop1, h_stop, sizeof(bool), cudaMemcpyHostToDevice);
cudaMemcpy(d_stop2, h_stop, sizeof(bool), cudaMemcpyHostToDevice);
cudaMemcpy(d_counter1, h_counter, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_counter2, h_counter, sizeof(int), cudaMemcpyHostToDevice);
sp1 = &s1;
sp2 = &s2;
ep1 = &e1;
ep2 = &e2;
d_c1 = d_counter1;
d_c2 = d_counter2;
d_s1 = d_stop1;
d_s2 = d_stop2;
test_kernel<<<1,1, 0, *sp1>>>(d_c1, d_s1, d_c2, d_s2);
cudaEventRecord(*ep1, *sp1);
cudaStreamWaitEvent(s3, *ep1, 0);
cudaMemcpyAsync(h_stop, d_s2, sizeof(bool), cudaMemcpyDeviceToHost, s3);
cudaMemcpyAsync(h_counter, d_c2, sizeof(int), cudaMemcpyDeviceToHost, s3);
while (v_stop[0] == false){
cudaStreamWaitEvent(*sp2, *ep1, 0);
sps = sp1; // ping-pong
sp1 = sp2;
sp2 = sps;
eps = ep1;
ep1 = ep2;
ep2 = eps;
d_cs = d_c1;
d_c1 = d_c2;
d_c2 = d_cs;
d_ss = d_s1;
d_s1 = d_s2;
d_s2 = d_ss;
test_kernel<<<1,1, 0, *sp1>>>(d_c1, d_s1, d_c2, d_s2);
cudaEventRecord(*ep1, *sp1);
while (n_counter > v_counter[0]);
n_counter++;
if(v_stop[0] == false){
cudaStreamWaitEvent(s3, *ep1, 0);
cudaMemcpyAsync(h_stop, d_s2, sizeof(bool), cudaMemcpyDeviceToHost, s3);
cudaMemcpyAsync(h_counter, d_c2, sizeof(int), cudaMemcpyDeviceToHost, s3);
}
}
cudaDeviceSynchronize(); // optional
printf("terminated at counter = %d\n", v_counter[0]);
}
$ nvcc -arch=sm_52 -o t267 t267.cu
$ ./t267
terminated at counter = 5
$
In the nvvp timeline, we see that 5 kernel launches are evident (actually 6) and they are bouncing back and forth between two streams. (The 6th kernel launch, which we would expect from the code organization and pipelining, is a very short line at the end of stream15 in the timeline. This kernel launches but immediately sees that stop is true, so it exits.) The device -> host copies are in a 3rd stream. If we zoom in closely at the handoff from one kernel iteration to the next, we see that even these very short D->H memcpy operations are essentially overlapped with the next kernel execution. For reference, the gap between kernel executions is about 5us.
Note that this was entirely done on linux. If you attempt this on windows WDDM, it may be difficult to achieve anything similar, due to WDDM command batching. Windows TCC should approximately duplicate linux behavior, however.

In cuda, thread indexes are not fully shown in the kernel function

I am writing code, and recently I found an error. The simplified version is shown below.
#include <stdio.h>
#include <cuda.h>

#define DEBUG 1

inline void check_cuda_errors(const char *filename, const int line_number)
{
#ifdef DEBUG
    cudaThreadSynchronize();
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess)
    {
        printf("CUDA error at %s:%i: %s\n", filename, line_number, cudaGetErrorString(error));
        exit(-1);
    }
#endif
}

__global__ void make_input_matrix_zp()
{
    unsigned int row = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned int col = blockIdx.x*blockDim.x + threadIdx.x;
    printf("col: %d (%d*%d+%d) row: %d (%d*%d+%d) \n", col, blockIdx.x, blockDim.x, threadIdx.x, row, blockIdx.y, blockDim.y, threadIdx.y);
}

int main()
{
    dim3 blockDim(16, 16, 1);
    dim3 gridDim(6, 6, 1);
    make_input_matrix_zp<<<gridDim, blockDim>>>();
    //check_cuda_errors(__FILE__, __LINE__);
    return 0;
}
The first inline function is for checking errors in CUDA.
The kernel function simply calculates the current thread's index, stores it in 'row' and 'col', and prints these values. I guess there is no problem in the inline function since it comes from another, reliable source.
The problem is, when I run the program, it does not seem to execute the kernel function even though it is called in the main function. However, if I delete the comment notation '//' in front of
check_cuda_errors
the program seems to enter the kernel function and shows some values printed by the printf function. But it does not show the full combination of 'col' and 'row' indexes. In particular, the y block index (blockIdx.y) does not vary much: it only shows values of 4 and 5, but not 0, 1, 2, 3.
The first thing I do not understand:
As far as I know, 'gridDim' means the dimensions of the grid in blocks. That means the block indexes should cover the combinations (0,0)(0,1)(0,2)(0,3)(0,4)(0,5)(1,0)(1,1)(1,2)(1,3)... and so on. Also, the size of each block is 16 by 16. However, if you run this program, it does not show the full combination; it just shows several combinations and ends.
The second thing I do not understand:
Why is the kernel function dependent on the function named 'check_cuda_errors'? When this function is present, the program at least runs, although imperfectly. However, when this error checking call is commented out, the kernel function does not show any printed values.
This is very simple code, but I couldn't find the problem for several days. Is there anything that I missed, or do I misunderstand something?
My working environment is like this.
"GeForce GT 630"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 2.1
Ubuntu 14.04
The CUDA GPU printf subsystem relies on a FIFO buffer to store printed output. If your output exceeds the size of the buffer, some or all of the previous content of the FIFO buffer will be overwritten by subsequent output. This is what will be happening in this case.
You can query and change the size of the buffer using the runtime API with cudaDeviceGetLimit and cudaDeviceSetLimit. If your device has the resources available to expand the limit, you should be able to see all the output your code emits.
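For example (a sketch; the 8 MB value is just an illustrative choice, and the default FIFO size is 1 MB):
size_t fifo_size = 0;
cudaDeviceGetLimit(&fifo_size, cudaLimitPrintfFifoSize);        // query the current size
printf("printf FIFO size: %zu bytes\n", fifo_size);
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 8 * 1024 * 1024);   // enlarge it before launching the kernel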
As an aside, relying on the kernel printf feature for anything other than simple diagnostics or lightweight debugging, is a terrible idea, and you have probably just proven to yourself that you should be looking at other methods of verifying the correctness of your code.
Regarding your second question, the printf buffer is flushed to output only when the host synchronizes with the device. For example, with a call to cudaDeviceSynchronize, cudaThreadSynchronize, cudaMemcpy, and others (see B.17.2 Limitations of the Formatted Output appendix).
When check_cuda_errors is uncommented, calling cudaThreadSynchronize is what triggers the buffer to be printed. When it is commented, the main thread simply terminates before the kernel gets to run to completion, and nothing else happens.
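So even without the error-checking helper, adding an explicit synchronization at the end of main should make the output appear (a minimal sketch of the change to main):
make_input_matrix_zp<<<gridDim, blockDim>>>();
cudaDeviceSynchronize();   // wait for the kernel and flush the device printf buffer
return 0;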

Doubling buffering in CUDA so the CPU can operate on data produced by a persistent kernel

I have a Monte Carlo simulation in which the state of the system is a bit string (size N) with the bits being randomly flipped. In an effort to accelerate the simulation, the code was revised to use CUDA. However, because of the large number of statistics I need calculated from the system state (it goes as N^2), this part needs to be done on the CPU, where there is more memory. Currently the algorithm looks like this:
loop
CUDA kernel making 10s of Monte Carlo steps
Copy system state back to CPU
Calculate statistics
This is inefficient and I would like to have the kernel run persistently while the CPU occasionally queries the state of the system and calculates the statistics while the kernel continues to run.
Based on Tom's answer to this question I think the answer is double buffering, but I haven't been able to find an explanation or example of how to do this.
How does one set up the double buffering described in the third paragraph of Tom's answer for a CUDA/C++ code?
Here's a fully worked example of a "persistent" kernel, producer-consumer approach, with a double-buffered interface from device (producer) to host (consumer).
Persistent kernel design generally implies launching kernels with, at most, the number of blocks that can be simultaneously resident on the hardware (see item 1 on slide 16 here). For the most efficient usage of the machine, we'd generally like to maximize this, while still staying within the aforementioned limit. This involves an occupancy study for a specific kernel, and it will vary from kernel to kernel. Therefore I've chosen to take a shortcut here, and simply launch as many blocks as there are multiprocessors. Such an approach is always guaranteed to work (it could be considered a "lower bound" on the number of blocks to launch for a persistent kernel), but is (typically) not the most efficient usage of the machine. Nevertheless, I claim the occupancy study is beside the point of your question. Furthermore, it is arguable that proper "persistent kernel" design with guaranteed forward progress is actually quite tricky - requiring careful design of the CUDA thread code and placement of threadblocks (e.g. only use 1 threadblock per SM) to guarantee forward progress. However we don't need to delve to this level to address your question (I don't think) and the persistent kernel example I propose here only places 1 threadblock per SM.
I'm also assuming a proper UVA setup, so that I can skip the details of arranging for proper mapped memory allocations in a non-UVA setup.
The basic idea is that we will have 2 buffers on the device, along with 2 "mailboxes" in mapped memory, one for each buffer. The device kernel will fill a buffer with data, then set the "mailbox" to a value (2, in this case) that indicates the host may "consume" the buffer. The device then goes on to the other buffer and repeats the process in a ping-pong fashion between buffers. In order to make this work we must make sure that the device itself has not overrun the buffers (no thread is allowed to be more than one buffer ahead of any other thread) and that before a buffer is populated by the device, the host has consumed the previous contents.
On the host side, we simply wait for the mailbox to indicate "full", then copy the buffer from device to host, reset the mailbox, and perform the "processing" on it (the validate function). The host then goes on to the next buffer in a ping-pong fashion. The actual data "production" by the device is just to fill each buffer with the iteration number. The host then checks to see that the proper iteration number was received.
I've structured the code to call out the actual device "work" function (my_compute_function) which is where you would put whatever your Monte Carlo code is. If your code is nicely thread-independent, this should be straightforward. Thus the device side my_compute_function is the producer function, and the host side validate is the consumer function. If your device producer code is not simply thread independent, then you may need to restructure things slightly around the calling point to my_compute_function.
The net effect of this is that the device can "race ahead" and begin filling the next buffer, while the host is "consuming" the data in the previous buffer.
Because persistent kernel design imposes an upper bound on the number of blocks (and threads) in a kernel launch, I've chosen to implement the "work" producer function in a grid-striding loop, so that arbitrary size buffers can be handled by the given grid-width.
Here's a fully worked example:
$ cat t942.cu
#include <stdio.h>
#define ITERS 1000
#define DSIZE 65536
#define nTPB 256
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
__device__ volatile int blkcnt1 = 0;
__device__ volatile int blkcnt2 = 0;
__device__ volatile int itercnt = 0;
__device__ void my_compute_function(int *buf, int idx, int data){
buf[idx] = data; // put your work code here
}
__global__ void testkernel(int *buffer1, int *buffer2, volatile int *buffer1_ready, volatile int *buffer2_ready, const int buffersize, const int iterations){
// assumption of persistent block-limited kernel launch
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int iter_count = 0;
while (iter_count < iterations ){ // persistent until iterations complete
int *buf = (iter_count & 1)? buffer2:buffer1; // ping pong between buffers
volatile int *bufrdy = (iter_count & 1)?(buffer2_ready):(buffer1_ready);
volatile int *blkcnt = (iter_count & 1)?(&blkcnt2):(&blkcnt1);
int my_idx = idx;
while (iter_count - itercnt > 1); // don't overrun buffers on device
while (*bufrdy == 2); // wait for buffer to be consumed
while (my_idx < buffersize){ // perform the "work"
my_compute_function(buf, my_idx, iter_count);
my_idx += gridDim.x*blockDim.x; // grid-striding loop
}
__syncthreads(); // wait for my block to finish
__threadfence(); // make sure global buffer writes are "visible"
if (!threadIdx.x) atomicAdd((int *)blkcnt, 1); // mark my block done
if (!idx){ // am I the master block/thread?
while (*blkcnt < gridDim.x); // wait for all blocks to finish
*blkcnt = 0;
*bufrdy = 2; // indicate that buffer is ready
__threadfence_system(); // push it out to mapped memory
itercnt++;
}
iter_count++;
}
}
int validate(const int *data, const int dsize, const int val){
for (int i = 0; i < dsize; i++) if (data[i] != val) {printf("mismatch at %d, was: %d, should be: %d\n", i, data[i], val); return 0;}
return 1;
}
int main(){
int *h_buf1, *d_buf1, *h_buf2, *d_buf2;
volatile int *m_bufrdy1, *m_bufrdy2;
// buffer and "mailbox" setup
cudaHostAlloc(&h_buf1, DSIZE*sizeof(int), cudaHostAllocDefault);
cudaHostAlloc(&h_buf2, DSIZE*sizeof(int), cudaHostAllocDefault);
cudaHostAlloc(&m_bufrdy1, sizeof(int), cudaHostAllocMapped);
cudaHostAlloc(&m_bufrdy2, sizeof(int), cudaHostAllocMapped);
cudaCheckErrors("cudaHostAlloc fail");
cudaMalloc(&d_buf1, DSIZE*sizeof(int));
cudaMalloc(&d_buf2, DSIZE*sizeof(int));
cudaCheckErrors("cudaMalloc fail");
cudaStream_t streamk, streamc;
cudaStreamCreate(&streamk);
cudaStreamCreate(&streamc);
cudaCheckErrors("cudaStreamCreate fail");
*m_bufrdy1 = 0;
*m_bufrdy2 = 0;
cudaMemset(d_buf1, 0xFF, DSIZE*sizeof(int));
cudaMemset(d_buf2, 0xFF, DSIZE*sizeof(int));
cudaCheckErrors("cudaMemset fail");
// inefficient crutch for choosing number of blocks
int nblock = 0;
cudaDeviceGetAttribute(&nblock, cudaDevAttrMultiProcessorCount, 0);
cudaCheckErrors("get multiprocessor count fail");
testkernel<<<nblock, nTPB, 0, streamk>>>(d_buf1, d_buf2, m_bufrdy1, m_bufrdy2, DSIZE, ITERS);
cudaCheckErrors("kernel launch fail");
volatile int *bufrdy;
int *hbuf, *dbuf;
for (int i = 0; i < ITERS; i++){
if (i & 1){ // ping pong on the host side
bufrdy = m_bufrdy2;
hbuf = h_buf2;
dbuf = d_buf2;}
else {
bufrdy = m_bufrdy1;
hbuf = h_buf1;
dbuf = d_buf1;}
// int qq = 0; // add for failsafe - otherwise a machine failure can hang
while ((*bufrdy)!= 2); // use this for a failsafe: if (++qq > 1000000) {printf("bufrdy = %d\n", *bufrdy); return 0;} // wait for buffer to be full;
cudaMemcpyAsync(hbuf, dbuf, DSIZE*sizeof(int), cudaMemcpyDeviceToHost, streamc);
cudaStreamSynchronize(streamc);
cudaCheckErrors("cudaMemcpyAsync fail");
*bufrdy = 0; // release buffer back to device
if (!validate(hbuf, DSIZE, i)) {printf("validation failure at iter %d\n", i); exit(1);}
}
printf("Completed %d iterations successfully\n", ITERS);
}
$ nvcc -o t942 t942.cu
$ ./t942
Completed 1000 iterations successfully
$
I've tested the above code and it seems to work well on linux. I believe it should be OK on a windows TCC setup. On windows WDDM, however, I think there are issues that I am still investigating.
Note that the above kernel design attempts to do a grid-wide synchronization using a block-counting atomic strategy. CUDA now (9.0 and newer) has cooperative groups, and that is the recommended approach, rather than the above methodology, to create a grid-wide sync.
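For reference, a rough sketch of the cooperative-groups style of grid-wide sync (CUDA 9.0 and newer). It assumes a device that supports cooperative launch, compilation with -rdc=true, and omits the buffer/mailbox logic; the kernel must be launched with cudaLaunchCooperativeKernel:
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void testkernel_cg(int *buf, int n)
{
    cg::grid_group grid = cg::this_grid();
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    for (int i = idx; i < n; i += gridDim.x * blockDim.x)  // grid-striding "work" loop
        buf[i] = i;                                        // stand-in for the real per-iteration work
    grid.sync();   // grid-wide barrier: every block has finished this pass
    // at this point it is safe to set the "buffer ready" mailbox and move to the next buffer
}

// launch: kernel arguments are passed as an array of pointers, e.g.
//   int n = DSIZE;
//   void *args[] = { &d_buf1, &n };
//   cudaLaunchCooperativeKernel((void *)testkernel_cg, nblock, nTPB, args, 0, streamk);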
This isn't a direct answer to your question but it may be of help.
I am working with a CUDA producer-consumer code that appears to be similar in basic structure to yours. I was hoping to speed up the code by making the CPU and GPU run concurrently. I attempted this by restructuring the code this way:
Launch kernel
Copy data
Loop:
    Launch kernel
    CPU work
    Copy data
    CPU work
This way the CPU can work on the data from the last kernel run while the next set of data is being generated. This cut 30% off the runtime of my code. I am guessing it could get better if the GPU/CPU work can be balanced so they take roughly the same amount of time.
I am still launching the same kernel thousands of times. If the overhead of launching a kernel repeatedly is significant, then looking for a way to do what I have accomplished with a single launch would be worth it. Otherwise, this is probably the best (simplest) solution.
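In code form, the restructured loop might look roughly like this (the kernel and function names are placeholders for your Monte Carlo step and statistics code, not anything from the question):
// produce the first batch
mc_kernel<<<blocks, threads>>>(d_state);
cudaMemcpy(h_state, d_state, bytes, cudaMemcpyDeviceToHost);     // batch 0 now on the host

for (int it = 1; it < num_batches; ++it) {
    mc_kernel<<<blocks, threads>>>(d_state);                     // start producing batch 'it' (asynchronous)
    compute_statistics(h_state);                                 // CPU consumes batch 'it-1' while the GPU runs
    cudaMemcpy(h_state, d_state, bytes, cudaMemcpyDeviceToHost); // blocks until the kernel finishes, fetches batch 'it'
}
compute_statistics(h_state);                                     // consume the final batch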

CUDA Convex Hull program crashes on large input

I am trying to implement the quickHull algorithm (for convex hull) in parallel in CUDA. It works correctly for input_size <= 1 million. When I try 10 million points, the program crashes. My graphics card has 1982 MB of memory, and all the data structures in the algorithm collectively require not more than 600 MB for this input size, which is less than 50% of the available space.
By commenting out lines of my kernels, I found that the crash occurs when I try to access an array element, even though the index of the element I am trying to access is not out of bounds (double checked). The following is the kernel code where it crashes:
for (unsigned int i = old_setIndex; i < old_setIndex + old_setS[tid]; i++)
{
    int pI = old_set[i];
    if (pI <= -1 || pI > pts.size())
    {
        printf("Thread %d: i = %d, pI = %d\n", tid, i, pI);
        continue;
    }
    p = pts[pI];
    double d = distance(A, B, p);
    if (d > dist) {
        dist = d;
        furthestPoint = i;
        fpi = pI;
    }
}
//fpi = old_set[furthestPoint];
//printf("Thread %d: Furthestpoint = %d\n", tid, furthestPoint);
My code crashes when I uncomment the statements (array access and printf) after the for loop. I am unable to explain the error, as furthestPoint is always within the bounds of the old_set array. old_setS stores the sizes of the smaller arrays that each thread operates on. It crashes even if I just try to print the value of furthestPoint (last line) without the array access statement above it.
There's no problem with the above code for input size <= 1 million. Am I overflowing some buffer in the device in case of 10 million?
Please help me in finding the source of the crash.
There is no out of bounds memory access in your code (or at least not one which is causing the symptoms you are seeing).
What is happening is that your kernel is being killed by the display driver because it is taking too much time to execute on your display GPU. All CUDA platform display drivers include a time limit for any operation on the GPU. This exists to prevent the display from freezing for a sufficiently long time that either the OS kernel panics or the user panics and thinks the machine has crashed. On the windows platform you are using, the time limit is about 2 seconds.
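You can check whether this watchdog applies to a given GPU by querying the device (a small sketch):
int timeout_enabled = 0;
cudaDeviceGetAttribute(&timeout_enabled, cudaDevAttrKernelExecTimeout, 0);   // device 0
printf("kernel execution timeout is %s\n", timeout_enabled ? "enabled (display watchdog active)" : "disabled");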
What has partly misled you into thinking the source of the problem is array addressing is that commenting out code makes the problem disappear. But what really happens there is an artifact of compiler optimization. When you comment out a global memory write, the compiler recognizes that the calculations which lead to the value being stored are unused, and it removes all that code from the assembler code it emits (google "nvcc dead code removal" for more information). That has the effect of making the code run much faster and puts it under the display driver time limit.
For workarounds, see this recent Stack Overflow question and answer.