I am learning some basic CUDA programming. I am trying to initialize an array on the Host with host_a[i] = i. This array consists of N = 128 integers. I am launching a kernel with 1 block and 128 threads per block, in which I want to square the integer at index i.
My questions are:
How do I come to know whether the kernel gets launched or not? Can I use printf within the kernel?
The expected output for my program is a space-separated list of squares of integers -
1 4 9 16 ... .
What's wrong with my code, since it outputs 1 2 3 4 5 ...
Code:
#include <cstdio>   // printf
#include <iostream>
#include <numeric>
#include <stdlib.h>
#include <cuda.h>
const int N = 128;
__global__ void f(int *dev_a) {
    unsigned int tid = threadIdx.x;
    if (tid < N) {
        dev_a[tid] = tid * tid;
    }
}
int main(void) {
    int host_a[N];
    int *dev_a;
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    for (int i = 0; i < N; i++) {
        host_a[i] = i;
    }
    cudaMemcpy(dev_a, host_a, N * sizeof(int), cudaMemcpyHostToDevice);
    f<<<1, N>>>(dev_a);
    cudaMemcpy(host_a, dev_a, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) {
        printf("%d ", host_a[i]);
    }
    cudaFree(dev_a);
}
How do I come to know whether the kernel gets launched or not? Can I use printf within the kernel?
You can use printf in device code (as long as you #include <stdio.h>) on any compute capability 2.0 or higher GPU. Since CUDA 7 and CUDA 7.5 only support those types of GPUs, if you are using CUDA 7 or CUDA 7.5 (successfully) then you can use printf in device code.
What's wrong with my code?
As identified in the comments, there is nothing "wrong" with your code if it is run on a properly set up machine. To address your previous question "How do I come to know whether the kernel gets launched or not?", the best approach in my opinion is to use proper CUDA error checking, which has numerous benefits besides just telling you whether your kernel launched or not. In this case it would also give a clue that the failure is due to an improper CUDA setup on your machine. You can also run CUDA codes with cuda-memcheck as a quick test of whether any runtime errors are occurring.
I have seen the body of a CUDA kernel written in two different ways:
1.
for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x; i < length; i += blockDim.x * gridDim.x)
{
    // do stuff
}
2.
uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < length)
{
    // do stuff
}
Both versions are launched with kernel<<<num_blocks, threads_per_block>>>, where threads_per_block is the maximum for our device (1024) and num_blocks is 2 for a length of 1025, for example.
The obvious difference is that the for loop lets the kernel cover the whole range even when it is launched with fewer threads than elements. For example, with 2 blocks of 512 threads and a length of 1025, the thread that handles index 0 loops a second time to handle index 1024.
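To make the numbers above concrete, here is a minimal host-only C++ sketch (no GPU required) that models the index arithmetic of both forms. The helper names blocks_needed and grid_stride_iterations are hypothetical and not part of CUDA, and the sketch assumes a one-dimensional launch.

```cpp
#include <cassert>
#include <cstdint>

// Blocks required so that the "if (i < length)" form covers every element:
// the usual round-up division.
std::uint32_t blocks_needed(std::uint32_t length, std::uint32_t threads_per_block) {
    assert(threads_per_block > 0);
    std::uint64_t rounded = static_cast<std::uint64_t>(length) + threads_per_block - 1;
    return static_cast<std::uint32_t>(rounded / threads_per_block);
}

// Host-side model of a single thread's grid-stride loop: returns how many
// iterations the thread with the given global index performs.
// total_threads stands in for blockDim.x * gridDim.x.
std::uint32_t grid_stride_iterations(std::uint32_t global_tid,
                                     std::uint32_t total_threads,
                                     std::uint32_t length) {
    assert(total_threads > 0);
    std::uint32_t count = 0;
    // 64-bit loop variable so the increment cannot wrap around.
    for (std::uint64_t i = global_tid; i < length; i += total_threads) {
        ++count;
    }
    return count;
}
```

With 1024 threads in total and a length of 1025, only the thread with global index 0 iterates twice; every other thread iterates once. With 2 blocks of 1024 threads, no thread iterates more than once and the threads past index 1024 do nothing.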
From previous research I've gathered that Nvidia suggests that we do not try and load balance ourselves (read as loop within the kernel like this), for instance by giving a kernel fewer threads or fewer blocks to reserve space for other kernels on the device, because the built-in load balancing is supposed to handle this in a more globally optimized way.
So my question is why would we want to use the for loop vs the if statement form of kernel? Is there a benefit to either at run time?
Given my understanding of Nvidia's stance on load balancing, the only value I can see is the ability to debug synchronously via 1 thread and 1 block setting <<<1, 1>>> when we launch the kernel in the for loop version or not having to precompute the # of blocks needed (and/or threads).
This is the test project I ran:
#include <cstdint>
#include <cstdio>
__global__
inline void kernel(int length)
{
    int counter = 0;
    for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x; i < length; i += blockDim.x * gridDim.x)
    {
        printf("%u: | i+: %u | tid: %u | counter: %u \n", i, blockDim.x * gridDim.x, threadIdx.x, counter++);
    }
}
__global__
inline void kernel2(int length)
{
    uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length)
        printf("%u: | i+: %u | tid: %u | \n", i, blockDim.x * gridDim.x, threadIdx.x);
}
int main()
{
    //kernel<<<2, 1024>>>(1025);
    kernel2<<<2, 1024>>>(1025);
    cudaDeviceSynchronize();
}
So my question is why would we want to use the for loop vs the if statement form of kernel? Is there a benefit to either at run time?
Yes, there is. Every CUDA thread needs to:
Read all of its parameters from constant memory
Read grid and thread information from special registers: blockDim, blockIdx, threadIdx (or at least their .x components)
Do the arithmetic for computing its global index.
That takes a bit of time. It's not a lot; but if your kernel is very simple (e.g. something like adding up two arrays), then - yes, that has a cost. And of course, if you perform your own preliminary computation that is used with all items in the sequence - each thread has to take the time to do that as well.
From previous research I've gathered that Nvidia suggests that we do not try and load balance ourselves (read as loop within the kernel like this)
I doubt that. The question of whether to iterate over a large sequence with a single "CUDA thread" per item or with fewer threads, each working on multiple items, depends on what is done for individual items in the sequence.
I have an array of float values, namely life, and I want to count the number of entries with a value greater than 0 in CUDA.
On the CPU, the code would look like this:
int numParticles = 0;
for (int i = 0; i < MAX_PARTICLES; i++) {
    if (life[i] > 0) {
        numParticles++;
    }
}
Now in CUDA, I've tried something like this:
__global__ void update(float* life, int* numParticles){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (life[idx]>0){
        (*numParticles)++;
    }
}
//life is a filled device pointer
int launchCount(float* life)
{
    int numParticles = 0;
    int* numParticles_d = 0;
    cudaMalloc((void**)&numParticles_d, sizeof(int));
    update<<<MAX_PARTICLES/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(life, numParticles_d);
    cudaMemcpy(&numParticles, numParticles_d, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << "numParticles: " << numParticles << std::endl;
}
But for some reason the CUDA attempt always returns 0 for numParticles. How come?
This:
if (life[idx]>0){
    (*numParticles)++;
}
is a read-after-write hazard. Multiple threads will simultaneously attempt to read from and write to numParticles. The CUDA execution model does not guarantee anything about the order of simultaneous transactions.
You could make this work by using atomic memory transactions, for example:
if (life[idx]>0){
    atomicAdd(numParticles, 1);
}
This will serialize the memory transactions and make the calculation correct. It will also have a big negative effect on performance.
You might want to investigate having each block calculate a local sum using a reduction type calculation and then sum the block local sums atomically or on the host, or in a second kernel.
Your code is actually launching MAX_PARTICLES threads, and multiple thread blocks are executing (*numParticles)++; concurrently. It is a race condition. So you get the result 0 or, if you are lucky, sometimes a little more than 0.
Since you are trying to sum life[i]>0 ? 1 : 0 over all i, you could follow the CUDA parallel reduction approach to implement your kernel, or use a Thrust reduction to simplify your life.
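The "per-block partial sum, then combine" idea mentioned in the answers above can be sketched on the host. This is only a sequential C++ model of the data flow, not a GPU implementation: each simulated block counts the positive values in its own slice, and the partial counts are summed in a final step. The names block_partial_counts and total_count are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One entry per simulated block: how many values in that block's slice are > 0.
// On a GPU, each block would compute this with a shared-memory reduction.
std::vector<int> block_partial_counts(const std::vector<float>& life,
                                      std::size_t threads_per_block) {
    assert(threads_per_block > 0);
    std::vector<int> partial;
    for (std::size_t start = 0; start < life.size(); start += threads_per_block) {
        std::size_t end = start + threads_per_block;
        if (end > life.size()) end = life.size();  // last block may be partial
        int count = 0;
        for (std::size_t i = start; i < end; ++i) {
            if (life[i] > 0) ++count;
        }
        partial.push_back(count);
    }
    return partial;
}

// Final step: combine the per-block results. On a GPU this could be one
// atomicAdd per block, a sum on the host, or a second kernel.
int total_count(const std::vector<int>& partial) {
    int total = 0;
    for (int c : partial) total += c;
    return total;
}
```

The point of the design is that contention drops from one atomic update per positive element to at most one combining operation per block.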
I'm writing a CUDA program that is to be run on thousands of different GPUs. Those machines have different versions of the display driver installed, and I cannot force them to update to the latest driver. Most of the code actually runs fine on those 'old' machines, but it fails with some particular code:
Here's the problem:
#include <stdio.h>
#include <cuda.h>
#include <cuda_profiler_api.h>
__global__
void test()
{
    unsigned i = 64;
    unsigned j = 192;
    int k = 7;
    for (j = 1 << (k - 1); i & j; j >>= 1)
        i ^= j;
    i ^= j;
    printf("i,j,k: %d,%d,%d\n", i, j, k);
    // i,j,k: 32,32, 7 (correct)
    // i,j,k: 0, 64, 7 (wrong)
}
int main() {
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    test<<<1,1>>>();
    cudaDeviceSynchronize(); // wait for the kernel so its printf output is flushed
}
The code prints 32,32,7 on a GPU with the latest driver, which is the correct result. But with an old driver (lower than CUDA 6.5) it prints 0,64,7.
I'm looking for any workaround for this.
Environment:
Developing: Win7-32bit, VS2013, CUDA 6.5
Correct result on: WinXP-32bit (and Win7-32bit), GTX-650 (latest driver)
Wrong result on: WinXP-32bit + GTX-750-Ti (old driver), WinXP-32bit + GTX-750 (old driver)
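For comparison, the loop from test() can be run on the host to obtain a reference value that a device result could be checked against. This is a plain C++ sketch; the function name reference_ij is hypothetical, and it returns the final (i, j) pair instead of printing it.

```cpp
#include <cassert>
#include <utility>

// Host reference for the loop in test(): starting from bit k-1, clear every
// consecutive set bit of i from the top down, then set the first clear bit.
// Returns the final (i, j).
std::pair<unsigned, unsigned> reference_ij(unsigned i, int k) {
    assert(k >= 1 && k <= 32);
    unsigned j;
    for (j = 1u << (k - 1); i & j; j >>= 1)
        i ^= j;
    i ^= j;
    return {i, j};
}
```

For i = 64 and k = 7 the loop body runs once (64 ^ 64 = 0, j becomes 32), the loop stops because 0 & 32 is 0, and the final i ^= j gives 32. That matches the 32,32 result the question reports as correct.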
There is no workaround. The runtime API is versioned and the minimum driver version requirement is non-negotiable.
Your only two choices are to develop using the lowest common denominator toolkit version that supports the driver being used, or switch to the driver API.
I found a (very slow) workaround: force the variables into local memory instead of registers.
Just add the volatile keyword before i and j:
volatile unsigned i = 64;
volatile unsigned j = 192;
First I should say I'm quite new to programming in C++ (let alone CUDA), though it is what I first learned with many years ago. I'd say I'm a bit out of touch with memory allocation and datatype sizes, though I'm learning. Anyway, here goes:
I have a GPU with compute capability 3.0 (it's a GeForce GTX 660 with 2 GB of DRAM).
Going by ./deviceQuery found in the CUDA samples (and by other charts I've found online), my maximum grid size is listed:
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
At 2,147,483,647 (2^31-1) that x dimension is huge and kind of nice… YET, when I run my code, pushing beyond 65535 in the x dimension, things get... weird.
I used an example from an Udacity course, and modified it to test the extremes. I've kept the kernel code fairly simple to prove the point:
__global__ void referr(long int *d_out, long int *d_in){
    long int idx = blockIdx.x;
    d_out[idx] = idx;
}
Please note below the ARRAY_SIZE being the size of the grid, but also being the size of the array of integers on which to do operations. I am leaving the size of the blocks at 1x1x1. JUST for the sake of understanding the limitations, I KNOW that having this many operations with blocks of only 1 thread makes no sense, but I want to understand what's going on with the grid size limitations.
int main(int argc, char ** argv) {
    const long int ARRAY_SIZE = 522744;
    const long int ARRAY_BYTES = ARRAY_SIZE * sizeof(long int);
    // generate the input array on the host
    long int h_in[ARRAY_SIZE];
    for (long int i = 0; i < ARRAY_SIZE; i++) {
        h_in[i] = i;
    }
    long int h_out[ARRAY_SIZE];
    // declare GPU memory pointers
    long int *d_in;
    long int *d_out;
    // allocate GPU memory
    cudaMalloc((void**) &d_in, ARRAY_BYTES);
    cudaMalloc((void**) &d_out, ARRAY_BYTES);
    // transfer the array to the GPU
    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);
    // launch the kernel with ARRAY_SIZE blocks in the x dimension, with 1 thread each.
    referr<<<ARRAY_SIZE, 1>>>(d_out, d_in);
    // copy back the result array to the CPU
    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    // print out the resulting array
    for (long int i = 0; i < ARRAY_SIZE; i++) {
        printf("%li", h_out[i]);
        printf(((i % 4) != 3) ? "\t" : "\n");
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
This works as expected with an ARRAY_SIZE at MOST of 65535. The last few lines of the output below
65516 65517 65518 65519
65520 65521 65522 65523
65524 65525 65526 65527
65528 65529 65530 65531
65532 65533 65534
If I push the ARRAY_SIZE beyond this, the output gets really unpredictable, and eventually, if the number gets too high, I get a Segmentation fault (core dumped) message… whatever that even means. For example, with an ARRAY_SIZE of 65536:
65520 65521 65522 65523
65524 65525 65526 65527
65528 65529 65530 65531
65532 65533 65534 131071
Why is it now stating that the blockIdx.x for this last one is 131071?? That is 65535+65535+1. Weird.
Even weirder, when I set the ARRAY_SIZE to 65537 (65535+2) I get some seriously strange results for the last lines of the output.
65520 65521 65522 65523
65524 65525 65526 65527
65528 65529 65530 65531
65532 65533 65534 131071
131072 131073 131074 131075
131076 131077 131078 131079
131080 131081 131082 131083
131084 131085 131086 131087
131088 131089 131090 131091
131092 131093 131094 131095
131096 131097 131098 131099
131100 131101 131102 131103
131104 131105 131106 131107
131108 131109 131110 131111
131112 131113 131114 131115
131116 131117 131118 131119
131120 131121 131122 131123
131124 131125 131126 131127
131128 131129 131130 131131
131132 131133 131134 131135
131136 131137 131138 131139
131140 131141 131142 131143
131144 131145 131146 131147
131148 131149 131150 131151
131152 131153 131154 131155
131156 131157 131158 131159
131160 131161 131162 131163
131164 131165 131166 131167
131168 131169 131170 131171
131172 131173 131174 131175
131176 131177 131178 131179
131180 131181 131182 131183
131184 131185 131186 131187
131188 131189 131190 131191
131192 131193 131194 131195
131196 131197 131198 131199
131200
Isn't 65535 the limit for older GPUs? Why is my GPU "messing up" when I push past the 65535 barrier for the x grid dimension? Or is this by design? What in the world is going on?
Wow, sorry for the long question.
Any help to understand this would be greatly appreciated! Thanks!
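You should be using proper CUDA error checking. And you should be compiling for a compute capability 3.0 architecture by specifying -arch=sm_30 when you compile with nvcc. Without that flag, nvcc (at least in toolkit versions of that era) targets an older default architecture, for which the x grid dimension is limited to 65535. A larger launch then fails, the kernel never runs, and the program prints whatever happened to be in h_out.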
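The grid-size arithmetic behind this advice can be written down as a small host-only C++ check. It is a sketch under the assumption that the x grid limit is 65535 for targets below compute capability 3.0 and 2^31-1 from 3.0 onward, which matches the deviceQuery figure quoted in the question; the helper names max_grid_dim_x and launch_fits are hypothetical.

```cpp
#include <cassert>

// Maximum x-dimension of a grid for the architecture the code was built for.
// Assumed limits: 65535 below compute capability 3.0, 2^31-1 from 3.0 on.
long long max_grid_dim_x(int target_cc_major) {
    assert(target_cc_major >= 1);
    return target_cc_major < 3 ? 65535LL : 2147483647LL;
}

// True if a launch with blocks_x blocks in x is within the limit of the
// target architecture (not necessarily that of the physical GPU).
bool launch_fits(long long blocks_x, int target_cc_major) {
    return blocks_x >= 1 && blocks_x <= max_grid_dim_x(target_cc_major);
}
```

Under these assumptions a launch of 65536 blocks is rejected for a pre-3.0 build target even on 3.0 hardware, which is consistent with the output going wrong exactly at ARRAY_SIZE = 65536.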
I'm experimenting with the new Dynamic Parallelism feature in CUDA 5.0 (GK110). I am seeing strange behavior: my program does not return the expected result for some configurations, and the result is not only unexpected but also different with each launch.
Now I think I found the source of my problem: it seems that some child grids (kernels launched by other kernels) are sometimes not executed when too many child grids are spawned at the same time.
I wrote a little test program to illustrate this behavior:
#include <stdio.h>
#include <stdlib.h>  // atoi
__global__ void out_kernel(char* d_out, int index)
{
    d_out[index] = 1;
}
__global__ void kernel(char* d_out)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    out_kernel<<<1, 1>>>(d_out, index);
}
int main(int argc, char** argv) {
    int griddim = 10, blockdim = 210;
    // optional: read griddim and blockdim from command line
    if (argc > 1) griddim = atoi(argv[1]);
    if (argc > 2) blockdim = atoi(argv[2]);
    const int numLaunches = griddim * blockdim;
    const int memsize = numLaunches * sizeof(char);
    // allocate device memory, set to 0
    char* d_out; cudaMalloc(&d_out, memsize);
    cudaMemset(d_out, 0, memsize);
    // launch outer kernel
    kernel<<<griddim, blockdim>>>(d_out);
    cudaDeviceSynchronize();
    // download results
    char* h_out = new char[numLaunches];
    cudaMemcpy(h_out, d_out, memsize, cudaMemcpyDeviceToHost);
    // check results, reduce output to 10 errors
    int maxErrors = 10;
    for (int i = 0; i < numLaunches; ++i) {
        if (h_out[i] != 1) {
            printf("Value at index %d is %d, should be 1.\n", i, h_out[i]);
            if (maxErrors-- == 0) break;
        }
    }
    // clean up
    delete[] h_out;
    cudaFree(d_out);
    cudaDeviceReset();
    return maxErrors < 10 ? 1 : 0;
}
The program launches a kernel in a given number of blocks (1st parameter) with a given number of threads each (2nd parameter). Each thread in that kernel will then launch another kernel with a single thread. This child kernel will write a 1 in its portion of an output array (which was initialized with 0s).
At the end of execution all values in the output array should be 1. But strangely for some block- and grid-sizes some of the array values are still zero. This basically means that some of the child grids are not executed.
This only happens if many of the child grids are spawned at the same time. On my test system (a Tesla K20x) this is the case for 10 blocks containing 210 threads each. 10 blocks with 200 threads deliver the correct result, though. But also 3 blocks with 1024 threads each cause the error.
Strangely, no error is reported back by the runtime. The child grids simply seem to be ignored by the scheduler.
Does anyone else face the same problem? Is this behavior documented somewhere (I did not find anything), or is it really a bug in the device runtime?
You're doing no error checking of any kind that I can see. You can and should do similar error checking on device kernel launches. Refer to the documentation. These errors will not necessarily be bubbled up to the host:
Errors are recorded per-thread, so that each thread can identify the most recent error that it has generated.
You must trap them in the device. There are plenty of examples of this type of device error checking in the documentation.
If you were to do proper error checking, you would discover that in each case where a kernel failed to launch, the CUDA device runtime API was returning error 69, cudaErrorLaunchPendingCountExceeded.
If you scan the documentation for this error, you'll find this:
cudaLimitDevRuntimePendingLaunchCount
Controls the amount of memory set aside for buffering kernel launches which have not yet begun to execute, due either to unresolved dependencies or lack of execution resources. When the buffer is full, launches will set the thread’s last error to cudaErrorLaunchPendingCountExceeded. The default pending launch count is 2048 launches.
At 10 blocks * 200 threads, you are launching 2000 kernels, and things seem to work.
At 10 blocks * 210 threads, you are launching 2100 kernels, which exceeds the 2048 limit mentioned above.
Note that this is somewhat dynamic in nature; depending on how your application launches child kernels, you may launch in excess of 2048 kernels easily without hitting this limit. But since your application launches all kernels approximately simultaneously, you are hitting the limit.
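The arithmetic in the comparison above can be captured in a small host-only C++ helper. It is a worst-case sketch: it assumes every thread's child launch is pending at the same time, as in the test program, and it uses the default of 2048 from the quoted documentation. The names kDefaultPendingLaunchCount and may_exceed_pending_limit are hypothetical.

```cpp
#include <cassert>

// Default pending launch count, as stated in the quoted documentation;
// the real limit is adjustable via cudaLimitDevRuntimePendingLaunchCount.
const int kDefaultPendingLaunchCount = 2048;

// Worst-case estimate: one child launch per parent thread, all pending at once.
// Returns true if that count is larger than the pending-launch limit.
bool may_exceed_pending_limit(int griddim, int blockdim,
                              int limit = kDefaultPendingLaunchCount) {
    assert(griddim > 0 && blockdim > 0 && limit > 0);
    long long launches = static_cast<long long>(griddim) * blockdim;
    return launches > limit;
}
```

This reproduces the three configurations from the question: 10 x 200 (2000 launches) stays under the limit, while 10 x 210 (2100) and 3 x 1024 (3072) exceed it. A true result only means the limit can be hit, since launches that have already started no longer count as pending.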
Proper CUDA error checking is advisable any time your CUDA code is not behaving the way you expect.
If you'd like to get some confirmation of the above, in your code you can modify your main kernel like this:
__global__ void kernel(char* d_out)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    out_kernel<<<1, 1>>>(d_out, index);
    // cudaDeviceSynchronize(); // not necessary since error 69 is returned immediately
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) d_out[index] = (char)err;
}
The pending launch count limit is modifiable. Refer to the documentation for cudaLimitDevRuntimePendingLaunchCount.