For loop based kernel vs if statement kernel - CUDA / C++

I have seen CUDA kernels begin in two separate ways:
1.
for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x; i < length; i += blockDim.x * gridDim.x)
{
    // do stuff
}
2.
uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < length)
{
    // do stuff
}
Both versions are launched as kernel<<<num_blocks, threads_per_block>>>, with threads_per_block maxed out for our device (1024) and num_blocks chosen to cover the data; for example, 2 blocks for a length of 1025.
The obvious difference is that the for-loop version lets each thread keep looping when the kernel is launched with fewer threads than elements; for example, with 2 blocks of 512 threads and a length of 1025, the loop takes a second pass to cover the leftover element.
From previous research I've gathered that Nvidia suggests we not try to load balance ourselves (read: loop within the kernel like this), for instance by giving a kernel fewer threads or fewer blocks to reserve space for other kernels on the device, because the built-in load balancing is supposed to handle this in a more globally optimized way.
So my question is why would we want to use the for loop vs the if statement form of kernel? Is there a benefit to either at run time?
Given my understanding of Nvidia's stance on load balancing, the only value I can see in the for-loop version is the ability to debug serially with a single thread and single block, i.e. launching the kernel <<<1, 1>>>, and not having to precompute the number of blocks (and/or threads) needed.
This is the test project I ran:
#include <cstdint>
#include <cstdio>

__global__ void kernel(int length)
{
    int counter = 0;
    for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x; i < length; i += blockDim.x * gridDim.x)
    {
        printf("%u: | i+: %u | tid: %u | counter: %d \n", i, blockDim.x * gridDim.x, threadIdx.x, counter++);
    }
}

__global__ void kernel2(int length)
{
    uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length)
        printf("%u: | i+: %u | tid: %u | \n", i, blockDim.x * gridDim.x, threadIdx.x);
}

int main()
{
    //kernel<<<2, 1024>>>(1025);
    kernel2<<<2, 1024>>>(1025);
    cudaDeviceSynchronize();
}

So my question is why would we want to use the for loop vs the if statement form of kernel? Is there a benefit to either at run time?
Yes, there is. Every CUDA thread needs to:
Read all of its parameters from constant memory
Read grid and thread information from special registers: blockDim, blockIdx, threadIdx (or at least their .x components)
Do the arithmetic for computing its global index.
That takes a bit of time. It's not a lot, but if your kernel is very simple (e.g. something like adding two arrays), then yes, it has a cost. And of course, if you perform preliminary computation of your own that is shared by all items in the sequence, each thread has to take the time to do that as well.
From previous research I've gathered that Nvidia suggests that we do not try and load balance ourselves (read as loop within the kernel like this)
I doubt that. The question of whether to iterate over a large sequence with a single CUDA thread per item, or with fewer threads that each work on multiple items, depends on what is done for the individual items in the sequence.
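To make the trade-off concrete, here is a minimal sketch of the two patterns applied to a simple vector add (the kernel names add_strided and add_guarded are just illustrative, not from the question):

__global__ void add_strided(const float* a, const float* b, float* c, int n)
{
    // Grid-stride form: works for any grid size; each thread may handle several elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        c[i] = a[i] + b[i];
}

__global__ void add_guarded(const float* a, const float* b, float* c, int n)
{
    // If-guard form: assumes the launch covers all n elements; each thread handles at most one.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Typical launches:
//   add_strided<<<num_sms * 8, 256>>>(a, b, c, n);      // grid size decoupled from n
//   add_guarded<<<(n + 255) / 256, 256>>>(a, b, c, n);  // grid size derived from n

Either way, the index arithmetic and parameter reads happen once per thread; the grid-stride form simply amortizes that fixed cost over however many elements the thread processes.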

Related

How can I use shared memory here in my CUDA kernel?

I have the following CUDA kernel:
__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, double investment, double profitability) {
    // Use a grid-stride loop.
    // Reference: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < strategyCount;
         i += blockDim.x * gridDim.x) {
        strategies[i].backtest(data, investment, profitability);
    }
}
TL;DR I would like to find a way to store data in shared (__shared__) memory. What I don't understand is how to fill the shared variable using multiple threads.
I have seen examples like this one where data is copied to shared memory thread by thread (e.g. myblock[tid] = data[tid]), but I'm not sure how to do this in my situation. The issue is that each thread needs access to an entire "row" (flattened) of data with each iteration through the data set (see further below where the kernel is called).
I'm hoping for something like this:
__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, int propertyCount, double investment, double profitability) {
    __shared__ double sharedData[propertyCount];

    // Use a grid-stride loop.
    // Reference: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < strategyCount;
         i += blockDim.x * gridDim.x) {
        strategies[i].backtest(sharedData, investment, profitability);
    }
}
Here are more details (if more information is needed, please ask!):
strategies is a pointer to a list of Strategy objects, and data is a pointer to an allocated flattened data array.
In backtest() I access data like so:
data[0]
data[1]
data[2]
...
Unflattened, data is a fixed size 2D array similar to this:
[[87.6, 85.4, 88.2, 86.1],
 [84.1, 86.5, 86.7, 85.9],
 [86.7, 86.5, 86.2, 86.1],
 ...]
As for the kernel call, I iterate over the data items and call it n times for n data rows (about 3.5 million):
int dataCount = 3500000;
int propertyCount = 4;
for (int i = 0; i < dataCount; i++) {
    unsigned int dataPointerOffset = i * propertyCount;

    // Notice pointer arithmetic.
    optimizer_backtest<<<32, 1024>>>(devData + dataPointerOffset, devStrategies, strategyCount, investment, profitability);
}
As confirmed in your comment, you want to apply 20k strategies (this number is from your previous question) to every one of the 3.5m data rows and examine the 20k x 3.5m results.
Without shared memory, you have to read all the data 20k times, or all the strategies 3.5m times, from global memory.
Shared memory can speed up your program by reducing global memory access. Say you read 1k strategies and 1k data rows into shared memory each time, examine the 1k x 1k results, and then repeat until everything has been examined. That reduces global memory access to 20 passes over all the data and 3.5k passes over all the strategies. This situation is similar to a vector-vector cross product; you can find reference code for it for more detail.
However, each of your data rows is large (an 838-D vector), and the strategies may be large too. You may not be able to cache many of them in shared memory (only ~48 KB per block, depending on the device type), so the situation becomes more like matrix-matrix multiplication. For that, you may get some hints from the matrix multiplication code at the following link.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory
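The cooperative-loading pattern both of these approaches rely on looks roughly like this (a sketch only; TILE and the kernel name backtest_tiled are illustrative assumptions, not part of the original code):

#define TILE 1024  // number of doubles cached per block; must fit in shared memory

__global__ void backtest_tiled(const double* data, int dataCount)
{
    __shared__ double tile[TILE];

    // Every thread in the block helps fill the tile, striding by blockDim.x,
    // so the copy also works when TILE is larger than the block size.
    for (int j = threadIdx.x; j < TILE && j < dataCount; j += blockDim.x)
        tile[j] = data[j];
    __syncthreads();  // the whole tile must be written before any thread reads it

    // ... each thread can now reuse tile[] instead of re-reading global memory ...
}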
For people in the future in search of a similar answer, here is what I ended up with for my kernel function:
__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, double investment, double profitability) {
    // One copy of the current data row (838 properties) cached per block.
    __shared__ double sharedData[838];

    // The first 838 threads of the block each copy one element into shared memory.
    if (threadIdx.x < 838) {
        sharedData[threadIdx.x] = data[threadIdx.x];
    }
    __syncthreads();

    // Use a grid-stride loop.
    // Reference: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < strategyCount;
         i += blockDim.x * gridDim.x) {
        strategies[i].backtest(sharedData, investment, profitability);
    }
}
Note that I use both .cuh and .cu files in my application, and I put this in the .cu file. Also note that I use --device-c in my Makefile when compiling object files. I don't know if that's how things should be done, but that's what worked for me.

Count values from array CUDA

I have an array of float values, namely life, and I want to count the number of entries with a value greater than 0 in CUDA.
On the CPU, the code would look like this:
int numParticles = 0;
for (int i = 0; i < MAX_PARTICLES; i++) {
    if (life[i] > 0) {
        numParticles++;
    }
}
Now in CUDA, I've tried something like this:
__global__ void update(float* life, int* numParticles) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (life[idx] > 0) {
        (*numParticles)++;
    }
}

// life is a filled device pointer
void launchCount(float* life)
{
    int numParticles = 0;
    int* numParticles_d = 0;
    cudaMalloc((void**)&numParticles_d, sizeof(int));
    update<<<MAX_PARTICLES / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(life, numParticles_d);
    cudaMemcpy(&numParticles, numParticles_d, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << "numParticles: " << numParticles << std::endl;
}
But for some reason the CUDA attempt always returns 0 for numParticles. How come?
This:
if (life[idx] > 0) {
    (*numParticles)++;
}
is a read-after-write hazard. Multiple threads will simultaneously attempt to read from and write to numParticles, and the CUDA execution model does not guarantee anything about the order of simultaneous transactions.
You could make this work by using atomic memory transactions, for example:
if (life[idx] > 0) {
    atomicAdd(numParticles, 1);
}
This will serialize the memory transactions and make the calculation correct. It will also have a big negative effect on performance.
You might want to investigate having each block calculate a block-local sum using a reduction-type calculation, and then summing the block-local sums atomically, on the host, or in a second kernel, as sketched below.
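A minimal sketch of that block-local approach (assuming a power-of-two block size; the kernel name count_alive is illustrative):

__global__ void count_alive(const float* life, int n, int* numParticles)
{
    extern __shared__ int partial[];  // one slot per thread, sized at launch
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread records 1 or 0 for its element.
    partial[threadIdx.x] = (idx < n && life[idx] > 0.0f) ? 1 : 0;
    __syncthreads();

    // Standard shared-memory tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // One atomic per block instead of one per matching element.
    if (threadIdx.x == 0)
        atomicAdd(numParticles, partial[0]);
}

// Launch with dynamic shared memory sized to the block:
// count_alive<<<blocks, threads, threads * sizeof(int)>>>(life, MAX_PARTICLES, numParticles_d);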
Your code is actually launching MAX_PARTICLES threads, and multiple thread blocks execute (*numParticles)++; concurrently. It is a race condition, so you get the result 0 or, if you are lucky, sometimes a little bigger than 0.
Since you are trying to sum up life[i] > 0 ? 1 : 0 for all i, you could follow the CUDA parallel reduction sample to implement your kernel, or use a Thrust reduction to simplify your life.
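With Thrust the whole count collapses to a single call over the existing device pointer (a sketch; the functor name is_alive is just illustrative):

#include <thrust/count.h>
#include <thrust/execution_policy.h>

struct is_alive {
    __host__ __device__ bool operator()(float x) const { return x > 0.0f; }
};

// life is the same device pointer as in the question.
int numParticles = (int)thrust::count_if(thrust::device, life, life + MAX_PARTICLES, is_alive());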

Cuda kernel to compute squares of integers in an array

I am learning some basic CUDA programming. I am trying to initialize an array on the Host with host_a[i] = i. This array consists of N = 128 integers. I am launching a kernel with 1 block and 128 threads per block, in which I want to square the integer at index i.
My questions are:
How do I come to know whether the kernel gets launched or not? Can I use printf within the kernel?
The expected output for my program is a space-separated list of the squares of those integers:
1 4 9 16 ...
What's wrong with my code, given that it outputs 1 2 3 4 5 ...?
Code:
#include <iostream>
#include <numeric>
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

const int N = 128;

__global__ void f(int *dev_a) {
    unsigned int tid = threadIdx.x;
    if (tid < N) {
        dev_a[tid] = tid * tid;
    }
}

int main(void) {
    int host_a[N];
    int *dev_a;
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    for (int i = 0; i < N; i++) {
        host_a[i] = i;
    }
    cudaMemcpy(dev_a, host_a, N * sizeof(int), cudaMemcpyHostToDevice);
    f<<<1, N>>>(dev_a);
    cudaMemcpy(host_a, dev_a, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) {
        printf("%d ", host_a[i]);
    }
}
How do I come to know whether the kernel gets launched or not? Can I use printf within the kernel?
You can use printf in device code (as long as you #include <stdio.h>) on any compute capability 2.0 or higher GPU. Since CUDA 7 and CUDA 7.5 only support those types of GPUs, if you are using CUDA 7 or CUDA 7.5 (successfully) then you can use printf in device code.
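For example, dropping a printf into the kernel from the question makes it obvious whether the kernel runs at all (this is only a sketch of a modified version of the asker's kernel; the include and the print statement are the additions):

#include <stdio.h>  // required for device-side printf

__global__ void f(int *dev_a) {
    unsigned int tid = threadIdx.x;
    if (tid < N) {
        dev_a[tid] = tid * tid;
        if (tid == 0) printf("kernel f is running with blockDim.x = %d\n", blockDim.x);
    }
}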
What's wrong with my code?
As identified in the comments, there is nothing "wrong" with your code if it is run on a properly set up machine. To address your previous question, "How do I come to know whether the kernel gets launched or not?", the best approach in my opinion is to use proper CUDA error checking, which has numerous benefits besides telling you whether your kernel launched. In this case it would also give a clue that the failure is due to an improper CUDA setup on your machine. You can also run CUDA codes under cuda-memcheck as a quick test for runtime errors.
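The usual shape of that error checking is a small macro wrapped around every runtime call and checked after kernel launches (a sketch of the common pattern, not an official API):

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                    cudaGetErrorString(err), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Usage around the launch in the question:
//   f<<<1, N>>>(dev_a);
//   CUDA_CHECK(cudaGetLastError());       // did the launch itself fail?
//   CUDA_CHECK(cudaDeviceSynchronize());  // did the kernel fail while running?
//
// And from a shell, running the binary under cuda-memcheck reports runtime
// errors such as invalid memory accesses.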

Performance optimization with different blocks and threads in CUDA

I've written a program to compute a histogram, counting how many times each of the 256 possible byte values occurs in a buffer:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "..\..\common\book.h"
#include <stdio.h>
#include <cuda.h>
#include <conio.h>
#define SIZE (100*1024*1024)
__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo){
__shared__ unsigned int temp[256];
temp[threadIdx.x] = 0;
__syncthreads();
int i = threadIdx.x + blockIdx.x * blockDim.x;
int offset = blockDim.x * gridDim.x;
while (i < size) {
atomicAdd(&temp[buffer[i]], 1);
i += offset;}
__syncthreads();
atomicAdd(&(histo[threadIdx.x]), temp[threadIdx.x]);
}
int main()
{
unsigned char *buffer = (unsigned char*)big_random_block(SIZE);
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
unsigned char *dev_buffer;
unsigned int *dev_histo;
cudaMalloc((void**)&dev_buffer, SIZE);
cudaMemcpy(dev_buffer, buffer, SIZE, cudaMemcpyHostToDevice);
cudaMalloc((void**)&dev_histo, 256 * sizeof(long));
cudaMemset(dev_histo, 0, 256 * sizeof(int));
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int blocks = prop.multiProcessorCount;
histo_kernel << <blocks * 256 , 256>> >(dev_buffer, SIZE, dev_histo);
unsigned int histo[256];
cudaMemcpy(&histo, dev_histo, 256 * sizeof(int), cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsed_time;
cudaEventElapsedTime(&elapsed_time, start, stop);
printf("Time to generate: %f ms\n", elapsed_time);
long sum = 0;
for (int i = 0; i < 256; i++)
sum += histo[i];
printf("The sum is %ld", sum);
cudaFree(dev_buffer);
cudaFree(dev_histo);
free(buffer);
getch();
return 0;
}
I've read in the book CUDA by Example that launching the kernel with a number of blocks equal to twice the number of multiprocessors is empirically found to be optimal. Yet when I launch it with 8 times the number of multiprocessors, the running time is cut down further.
I've run the kernel with: (1) blocks equal to the number of multiprocessors, (2) blocks equal to twice the number of multiprocessors, (3) blocks equal to 4 times, and so on.
With (1), I got the running time to be 112ms
With (2) I got the running time to be 73ms
With (3) I got the running time to be 52ms
Curiously, once the number of blocks reached 8 times the number of multiprocessors, the running time stopped varying by any significant amount; it was the same with 8 times, 256 times, and 1024 times the number of multiprocessors.
How can this be explained?
This behavior is typical. The GPU is a latency-hiding machine. In order to hide latency, when it hits a stall, it needs additional new work available. You can maximize the amount of additional new work available by giving the GPU a large number of blocks and threads.
Once you have given it enough work to hide latency as best it can, giving it additional work does not help. The machine is saturated. However, having additional work available is generally/typically not much of a detriment either. There is little overhead associated with blocks and threads.
Whatever you read in CUDA by Example may have been true for a specific case, but it is certainly not generally true that the correct number of blocks to launch is equal to twice the number of multiprocessors. A better target (typically) would be 4-8 blocks per multiprocessor.
When it comes to blocks and threads, more is usually better, and it's rarely the case that having arbitrarily large numbers of blocks and threads will actually cause a significant degradation in performance. This is contrary to typical CPU thread programming, where having large numbers of OMP threads, for example, may lead to a significant reduction in performance, when you exceed the core count.
When you are tuning the code for the last 10% in performance, then you will see people limit the amount of blocks they launch, to some number that is typically 4-8 times the number of SMs, and construct their threadblocks to loop over the data set. But this normally only yields a few percent performance improvement, in most cases. As a reasonable CUDA programming starting point, aim for tens of thousands of threads, and hundreds of blocks, at least. A carefully tuned code may be able to saturate the machine with fewer blocks and threads, but it will become GPU-dependent at that point. And as I've stated already, there's rarely much of a performance detriment to having millions of threads and thousands of blocks.
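If you do want to size the grid relative to the hardware rather than the data, a launch along these lines is a reasonable sketch (the factor of 8 is just the heuristic from the answer, not a fixed rule):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

// histo_kernel assumes exactly 256 threads per block (one thread per histogram bin),
// so only the block count is tuned here.
int threads = 256;
int blocks  = prop.multiProcessorCount * 8;  // ~4-8 blocks per SM as a starting point

// Optional: ask the runtime how many such blocks actually fit per SM for this kernel.
int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, histo_kernel, threads, 0);

histo_kernel<<<blocks, threads>>>(dev_buffer, SIZE, dev_histo);

The grid-stride (while) loop inside histo_kernel already lets any block count cover all SIZE bytes, so the launch configuration only affects performance, not correctness.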

Some child grids not being executed with CUDA Dynamic Parallelism

I'm experimenting with the new Dynamic Parallelism feature in CUDA 5.0 (GK110). I'm seeing the strange behavior that my program does not return the expected result for some configurations; not only is the result unexpected, it is also different with each launch.
Now I think I have found the source of my problem: it seems that some child grids (kernels launched by other kernels) are sometimes not executed when too many child grids are spawned at the same time.
I wrote a little test program to illustrate this behavior:
#include <stdio.h>
#include <stdlib.h>

__global__ void out_kernel(char* d_out, int index)
{
    d_out[index] = 1;
}

__global__ void kernel(char* d_out)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    out_kernel<<<1, 1>>>(d_out, index);
}

int main(int argc, char** argv) {
    int griddim = 10, blockdim = 210;
    // optional: read griddim and blockdim from the command line
    if (argc > 1) griddim = atoi(argv[1]);
    if (argc > 2) blockdim = atoi(argv[2]);
    const int numLaunches = griddim * blockdim;
    const int memsize = numLaunches * sizeof(char);

    // allocate device memory, set to 0
    char* d_out;
    cudaMalloc(&d_out, memsize);
    cudaMemset(d_out, 0, memsize);

    // launch outer kernel
    kernel<<<griddim, blockdim>>>(d_out);
    cudaDeviceSynchronize();

    // download results
    char* h_out = new char[numLaunches];
    cudaMemcpy(h_out, d_out, memsize, cudaMemcpyDeviceToHost);

    // check results, reduce output to 10 errors
    int maxErrors = 10;
    for (int i = 0; i < numLaunches; ++i) {
        if (h_out[i] != 1) {
            printf("Value at index %d is %d, should be 1.\n", i, h_out[i]);
            if (maxErrors-- == 0) break;
        }
    }

    // clean up
    delete[] h_out;
    cudaFree(d_out);
    cudaDeviceReset();
    return maxErrors < 10 ? 1 : 0;
}
The program launches a kernel in a given number of blocks (1st parameter) with a given number of threads each (2nd parameter). Each thread in that kernel will then launch another kernel with a single thread. This child kernel will write a 1 in its portion of an output array (which was initialized with 0s).
At the end of execution all values in the output array should be 1. But strangely for some block- and grid-sizes some of the array values are still zero. This basically means that some of the child grids are not executed.
This only happens if many of the child grids are spawned at the same time. On my test system (a Tesla K20x) this is the case for 10 blocks containing 210 threads each. 10 blocks with 200 threads deliver the correct result, though. But also 3 blocks with 1024 threads each cause the error.
Strangely, no error is reported back by the runtime. The child grids simply seem to be ignored by the scheduler.
Does anyone else face the same problem? Is this behavior documented somewhere (I did not find anything), or is it really a bug in the device runtime?
You're not doing any error checking that I can see. You can and should do error checking on device-side kernel launches just as you would on host-side launches. Refer to the documentation. These errors will not necessarily be bubbled up to the host:
Errors are recorded per-thread, so that each thread can identify the most recent error that it has generated.
You must trap them in the device. There are plenty of examples of this type of device error checking in the documentation.
If you were to do proper error checking you would discover that in each case where a kernel failed to launch, the cuda device runtime API was returning error 69, cudaErrorLaunchPendingCountExceeded.
If you scan the documentation for this error, you'll find this:
cudaLimitDevRuntimePendingLaunchCount
Controls the amount of memory set aside for buffering kernel launches which have not yet begun to execute, due either to unresolved dependencies or lack of execution resources. When the buffer is full, launches will set the thread’s last error to cudaErrorLaunchPendingCountExceeded. The default pending launch count is 2048 launches.
At 10 blocks * 200 threads, you are launching 2000 kernels, and things seem to work.
At 10 blocks * 210 threads, you are launching 2100 kernels, which exceeds the 2048 limit mentioned above.
Note that this is somewhat dynamic in nature; depending on how your application launches child kernels, you may launch in excess of 2048 kernels easily without hitting this limit. But since your application launches all kernels approximately simultaneously, you are hitting the limit.
Proper cuda error checking is advisable any time your CUDA code is not behaving the way you expect.
If you'd like to get some confirmation of the above, in your code you can modify your main kernel like this:
__global__ void kernel(char* d_out)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    out_kernel<<<1, 1>>>(d_out, index);
    // cudaDeviceSynchronize();  // not necessary since error 69 is returned immediately
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) d_out[index] = (char)err;
}
The pending launch count limit is modifiable; refer to the documentation for cudaLimitDevRuntimePendingLaunchCount.
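For example, raising the limit from the host before the parent launch (the value 4096 is just an illustrative choice, large enough for the 10 x 210 = 2100 child launches above):

// Must be set before the parent kernel starts launching children.
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);

kernel<<<griddim, blockdim>>>(d_out);
cudaDeviceSynchronize();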