CUDA periodic execution time - c++

I just started learning CUDA and I have a trouble interpreting my experiment results. I wanted to compare CPU vs GPU in a simple program that adds two vectors together. The code is following:
__global__ void add(int *a, int *b, int *c, long long n) {
long long tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < n) {
c[tid] = a[tid] + b[tid];
}
}
void add_cpu(int* a, int* b, int* c, long long n) {
for (long long i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
void check_results(int* gpu, int* cpu, long long n) {
for (long long i = 0; i < n; i++) {
if (gpu[i] != cpu[i]) {
printf("Different results!\n");
return;
}
}
}
int main(int argc, char* argv[]) {
long long n = atoll(argv[1]);
int num_of_blocks = atoi(argv[2]);
int num_of_threads = atoi(argv[3]);
int* a = new int[n];
int* b = new int[n];
int* c = new int[n];
int* c_cpu = new int[n];
int *dev_a, *dev_b, *dev_c;
cudaMalloc((void **) &dev_a, n * sizeof(int));
cudaMalloc((void **) &dev_b, n * sizeof(int));
cudaMalloc((void **) &dev_c, n * sizeof(int));
for (long long i = 0; i < n; i++) {
a[i] = i;
b[i] = i * 2;
}
cudaMemcpy(dev_a, a, n * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, n * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_c, c, n * sizeof(int), cudaMemcpyHostToDevice);
StopWatchInterface *timer=NULL;
sdkCreateTimer(&timer);
sdkResetTimer(&timer);
sdkStartTimer(&timer);
add <<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
cudaDeviceSynchronize();
sdkStopTimer(&timer);
float time = sdkGetTimerValue(&timer);
sdkDeleteTimer(&timer);
cudaMemcpy(c, dev_c, n * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
clock_t start = clock();
add_cpu(a, b, c_cpu, n);
clock_t end = clock();
check_results(c, c_cpu, n);
printf("%f %f\n", (double)(end - start) * 1000 / CLOCKS_PER_SEC, time);
return 0;
}
I ran this code in a loop with a bash script:
for i in {1..2560}
do
n="$((1024 * i))"
out=`./vectors $n $i 1024`
echo "$i $out" >> "./vectors.txt"
done
Where 2560 is maximum number of blocks that my GPU supports, and 1024 is the maximum number of threads in block. So I just ran it for maximum block size to the maximum problem size my GPU can handle, with a step of 1 block (1024 ints in vector).
Here is my GPU info:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 2070 SUPER"
CUDA Driver Version / Runtime Version 11.3 / 11.0
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 8192 MBytes (8589934592 bytes)
(040) Multiprocessors, (064) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1785 MHz (1.78 GHz)
Memory Clock rate: 7001 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 65536 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.3, CUDA Runtime Version = 11.0, NumDevs = 1
Result = PASS
After running the experiment I gathered the results and plotted them:
So what bothers me is this 256 blocks-wide period in the GPU execution time. I have no clue why this happens. Why executing 512 blocks is much slower than executing 513 blocks of threads?
I also checked this with a constant number of blocks (2560) as well as with different block sizes and it always give this period of 256 * 1024 vector size (so for block size 512 its each 512 blocks, not each 256 blocks). So maybe this is something with memory, but I can't figure out what.
I would appreciate any ideas on why this is happening.

This is by no means a complete or precise answer. However I believe the periodic pattern you are observing is at least partly due to a 1-time or first-time kernel launch overhead. Good benchmarking practice usually is to do something other than what you are doing. For example, run the kernel multiple times and take an average. Or do some other kind of statistical measurement.
When I run your code using your script on a GTX 960 GPU, I get the following graph (only plotting the GPU data, vertical axis is in milliseconds):
When I modify your code as follows:
cudaMemcpy(dev_c, c, n * sizeof(int), cudaMemcpyHostToDevice);
// next two lines added:
add <<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
cudaDeviceSynchronize();
StopWatchInterface *timer=NULL;
sdkCreateTimer(&timer);
sdkResetTimer(&timer);
sdkStartTimer(&timer);
add <<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
cudaDeviceSynchronize();
Doing a "warm-up" run first, then timing the second run, I witness data like this:
So the data without the warm-up shows a periodicity. After the warm-up, the periodicity disappears. I conclude that the periodicity is due to some kind of 1-time or first-time behavior. Some typical things that might be in this category are caching effects and cuda "lazy" initialization effects (for example, the time taken to JIT-compile the GPU code, which is certainly happening in your case, or the time to load the GPU code into GPU memory). I won't be able to go farther with any explanation of what kind of first-time effect exactly is giving rise to the periodicity.
Another observation is that while my data shows an expected "average slope" to each graph, indicating that the kernel duration associated with 2560 blocks is approximately 5 times the kernel duration associated with 512 blocks, I don't see that kind of trend in your data. It ought to be there, however. Your GPU will "saturate" at about 40 blocks. Thereafter, the average kernel duration should increase in approximately a linear fashion, such that the kernel duration associated with 2560 blocks is 4-5x the kernel duration associated with 512 blocks. I can't explain your data in this respect at all, I suspect a graphing or data processing error, or else a characteristic in your environment (e.g. shared GPU with other users, broken CUDA install, etc.) that is not present in my environment, and which I'm unable to guess at.
Finally, my conclusion is that GPU "expected" behavior is more evident in the presence of good benchmarking techniques.

Related

Use a CUDA Thread Index as a number

I am new to CUDA and GPGPUs. I am trying to check properties of a large set of numbers (bigger than 32 bit) and I would like to try to do this using my Windows 7 64bit machine equipped with a nVidia GTX 1080:
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1080"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8192 MBytes (8589934592 bytes)
(20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1734 MHz (1.73 GHz)
Memory Clock rate: 5005 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
When I run the following code the value for "sum" is nonsensical (28, 20, etc.) even though I can see the threadId goes from 0 to 4095 :
#include <cuda.h>
#include <cuda_runtime.h>
#include "device_launch_parameters.h"
#include "stdio.h"
__global__ void Simple(unsigned long long int *sum)
{
unsigned long long int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;
unsigned long long int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z)
+ (threadIdx.z * (blockDim.x * blockDim.y))
+ (threadIdx.y * blockDim.x)
+ threadIdx.x;
printf("threadId = %llu.\n", threadId);
// Check threadId for property. Possibly introduce a grid stride for loop to give each thread a range to check.
sum[0]++;
}
int main(int argc, char **argv)
{
unsigned long long int sum[] = { 0 };
unsigned long long int *dev_sum;
cudaMalloc((void**)&dev_sum, sizeof(unsigned long long int));
cudaMemcpy(dev_sum, sum, sizeof(unsigned long long int), cudaMemcpyHostToDevice);
dim3 grid(2, 1, 1);
dim3 block(1024, 1, 1);
printf("--------- Start kernel ---------\n\n");
Simple <<< grid, block >>> (dev_sum);
cudaDeviceSynchronize();
cudaMemcpy(sum, dev_sum, sizeof(unsigned long long int), cudaMemcpyDeviceToHost);
printf("sum = %llu.\n", sum[0]);
cudaFree(dev_sum);
getchar();
return 0;
}
How would I modify this kernel call to get the maximum threads to operate (with my setup ) over a range of numbers say 0 to 10^12 by adding a grid stride loop?
dim3 grid(2, 1, 1);
dim3 block(1024, 1, 1);
Simple <<< grid, block >>> (dev_sum);
All threads are doing increment on the same place in memory, which results in race condition. That is why the outcome is incorrect. You should use atomic addition to have it right (there is a function for that in CUDA).

Why my cuda program became slower after using 128 threads on blocks?

I have a simple cuda application with the following code:
#include <stdio.h>
#include <sys/time.h>
#include <stdint.h>
__global__
void daxpy(int n, int a, int *x, int *y) {
int i = blockIdx.x*blockDim.x + threadIdx.x;
y[i] = x[i];
int j;
for(j = 0; j < 1024*10000; ++j) {
y[i] += j%10;
}
}
// debug time
void calc_time(struct timeval *start, const char *msg) {
struct timeval end;
gettimeofday(&end, NULL);
uint64_t us = end.tv_sec * 1000000 + end.tv_usec - (start->tv_sec * 1000000 + start->tv_usec);
printf("%s cost us = %llu\n", msg, us);
memcpy(start, &end, sizeof(struct timeval));
}
void do_test() {
unsigned long n = 1536;
int *x, *y, a, *dx, *dy;
a = 2.0;
x = (int*)malloc(sizeof(int)*n);
y = (int*)malloc(sizeof(int)*n);
for(i = 0; i < n; ++i) {
x[i] = i;
}
cudaMalloc((void**)&dx, n*sizeof(int));
cudaMalloc((void**)&dy, n*sizeof(int));
struct timeval start;
gettimeofday(&start, NULL);
cudaMemcpy(dx, x, n*sizeof(int), cudaMemcpyHostToDevice);
daxpy<<<1, 512>>>(n, a, dx, dy); // this line
cudaThreadSynchronize();
cudaMemcpy(y, dy, n*sizeof(int), cudaMemcpyDeviceToHost);
calc_time(&start, "do_test ");
cudaFree(dx);
cudaFree(dy);
free(x);
free(y);
}
int main() {
do_test();
return 0;
}
The gpu kernel call is daxpy<<<1, 512>>>(n, a, dx, dy) and I performed some tests using different block sizes:
daxpy<<<1, 32>>>(n, a, dx, dy)
daxpy<<<1, 64>>>(n, a, dx, dy)
daxpy<<<1, 128>>>(n, a, dx, dy)
daxpy<<<1, 129>>>(n, a, dx, dy)
daxpy<<<1, 512>>>(n, a, dx, dy)
... and made the following observations:
Execution time is the same for 32, 64, and 128 block sizes,
Execution time differs for block sizes 128 and 129, in particular:
For 128 the execution time is 280ms,
For 129 the execution time is 386ms.
I would like to ask what is causing the difference in execution time for block sizes 128 and 129.
My GPU is tesla K80:
CUDA Driver Version / Runtime Version 6.5 / 6.5
CUDA Capability Major/Minor version number: 3.7
Total amount of global memory: 11520 MBytes (12079136768 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Clock rate: 824 MHz (0.82 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 135 / 0
After providing us with the exact time differences in one of the comments, i.e.:
280ms for up to 128 threads,
386ms for 129+ threads,
I think it indirectly supports my theory of issue being related to warp scheduling. Look at the GK210 whitepaper, which is a chip used in K80:
K80 SMX features a quad warp scheduler, see section Quad Warp Scheduler,
It means that K80 SMX is able to schedule up to 128 threads at once (4 warps == 128 threads), these are then executed simultaneously,
Therefore, for 129 threads, scheduling cannot happen at once, because SMX has to schedule 5 warps, i.e. scheduling will happen in two steps.
If the above is true, then I would expect:
The execution time to be roughly the same for block sizes 1 - 128,
The execution time to be roughly the same for block sizes 129 - 192.
192 is the number of cores on the SMX, see whitepaper. As a reminder - entire blocks are always scheduled for one SMX and so obviously if you spawn more than 192 threads then those for sure won't be able to execute in parallel and execution time should be higher for 193+ number of threads.
You can verify the above thesis by simplifying your kernel code to the degree where it will do almost nothing so it should be more or less obvious whether the execution takes longer only due to scheduling (there will be no other limiting factors such as memory throughput).
Disclaimer: The above are just my assumptions as I don't have access to K80, nor any other GPU with quad warp scheduler so I cannot profile your code properly. But anyway, I believe that is the task for you - why not to use nvprof and profile your code yourself? Then you should be able to see where the time difference lies.

Measuring OpenCL kernel's memory throughput

I read about global memory optimization in OpenCL. In one of the slide-shows, a very simple kernel (below) has been used to demonstrate the importance of memory coalescing.
__kernel void measure(__global float* idata, __global float* odata, int offset) {
int xid = get_global_id(0) + offset;
odata[xid] = idata[xid];
}
Please see my code below which measures the running time of the kernel
ret = clFinish(command_queue);
size_t local_item_size = MAX_THREADS;
size_t global_item_size = INPUTSIZE;
struct timeval t0,t1;
gettimeofday(&t0, 0 );
//ret = clFinish(command_queue);
ret = clEnqueueNDRangeKernel(command_queue, measure, 1, NULL,
&global_item_size, &local_item_size, 0, NULL, NULL);
ret = clFlush(command_queue);
ret = clFinish(command_queue);
gettimeofday(&t1,0);
double elapsed = (t1.tv_sec-t0.tv_sec)*1000000 + (t1.tv_usec-t0.tv_usec);
printf("time taken = %lf microseconds\n", elapsed);
I transfer around 0.5 GB of data:
#define INPUTSIZE 1024 * 1024 * 128
int main (int argc, char *argv[])
{
int offset = atoi(argv[1]);
float* input = (float*) malloc(sizeof(float) * INPUTSIZE);
Now, the results are a bit random. With offset =0, I get times as low as 21 usecs. With offset = 1, I get times ranging between 53 usecs to 24400 usecs.
Can someone please tell me what is going on. I thought that offset=0 will be the fastest, because all the threads will access consecutive locations, hence the minimum number of memory transactions will take place.
Bandwidth is a measure of how fast data can be transferred, and is typically measured in bytes/second in these situations (usually GB/s for GPU memory bandwidth).
To compute the bandwidth of a compute kernel, you just need to know how much data the kernel reads/writes from/to memory, and then divide that by the time your kernel took to execute.
Your example kernel has each work-item (or CUDA thread) read a single float, and write a single float. If you launch this kernel to copy 2^10 floats, then you will be reading 2^10 * sizeof(float) bytes, and writing the same amount (so 8MB in total). If this kernel takes 1ms to execute, then you have achieved bandwidth of 8MB / 0.001s = 8GB/s.
Your new code snippet that shows your kernel timing approach indicates that you are only timing the kernel enqueue, not the amount of time it actually takes to run the kernel. This is why you are getting very low kernel timings (0.5GB / 0.007ms ~= 71TB/s!). You should add calls to clFinish() to obtain proper timing. I typically also take timings over several runs, to allow the device to warm-up, which usually gives more consistent timing:
// Warm-up run (not timed)
clEnqueueNDRangeKernel(command_queue, ...);
clFinish(command_queue);
// start timing
start = ...
for (int i = 0; i < NUM_RUNS; i++)
{
clEnqueueNDRangeKernel(command_queue, ...);
}
clFinish(command_queue);
// stop timing
end = ...
// Compute time taken, bandwidth etc
average_time = (end-start)/NUM_RUNS;
...
Question from comment:
Why does offset=0 perform better than offset=1,4 or 6?
On NVIDIA GPUs, work-items are grouped into 'warps' of size 32, which execute in lockstep (other devices have similar approaches, just with a different sizes). Memory transactions are aligned to multiples of the cacheline size (e.g. 64 bytes, 128 bytes etc). Consider what happens when each work-item in a warp attempts to read a single 4-byte value (assuming they are contiguous, as per your example), with a cacheline size of 64 bytes.
This warp is reading a total of 128 bytes of data. If the start of this 128-byte chunk is aligned to a 64-byte boundary (i.e. if offset=0), then this can serviced in two 64-byte transactions. However, if this chunk is not aligned to the a 64-byte boundary (offset=1,4,6,etc), then this will require three memory transactions to fetch all of the data. This is where your performance difference comes from.
If you set the offset to be a multiple of the cacheline size (e.g. 64), then you will likely get performance equivalent to offset=0.

Improving asynchronous execution in CUDA

I am currently writing a programme that performs large simulations on the GPU using the CUDA API. In order to accelerate the performance, I tried to run my kernels simultaneously and then asynchronously copy the result into the host memory again. The code looks roughly like this:
#define NSTREAMS 8
#define BLOCKDIMX 16
#define BLOCKDIMY 16
void domainUpdate(float* domain_cpu, // pointer to domain on host
float* domain_gpu, // pointer to domain on device
const unsigned int dimX,
const unsigned int dimY,
const unsigned int dimZ)
{
dim3 blocks((dimX + BLOCKDIMX - 1) / BLOCKDIMX, (dimY + BLOCKDIMY - 1) / BLOCKDIMY);
dim3 threads(BLOCKDIMX, BLOCKDIMY);
for (unsigned int ii = 0; ii < NSTREAMS; ++ii) {
updateDomain3D<<<blocks,threads, 0, streams[ii]>>>(domain_gpu,
dimX, 0, dimX - 1, // dimX, minX, maxX
dimY, 0, dimY - 1, // dimY, minY, maxY
dimZ, dimZ * ii / NSTREAMS, dimZ * (ii + 1) / NSTREAMS - 1); // dimZ, minZ, maxZ
unsigned int offset = dimX * dimY * dimZ * ii / NSTREAMS;
cudaMemcpyAsync(domain_cpu + offset ,
domain_gpu+ offset ,
sizeof(float) * dimX * dimY * dimZ / NSTREAMS,
cudaMemcpyDeviceToHost, streams[ii]);
}
cudaDeviceSynchronize();
}
All in all it is just a simple for-loop, looping over all streams (8 in this case) and dividing the work. This actually is a deal faster (up to 30% performance gain), although maybe less than I had hoped. I analysed a typical cycle in Nvidia's Compute Visual Profiler, and the execution looks like this:
As can be seen in the picture, the kernels do overlap, although never more than two kernels are running at the same time. I tried the same thing for different numbers of streams and different sizes of the simulation domain, but this is always the case.
So my question is: is there a way to encourage/force the GPU scheduler to run more than two things at the same time? Or is this a limitation dependent on the GPU device that cannot be represented in the code?
My system specifications are: 64-bit Windows 7, and a GeForce GTX 670 graphics card (that's Kepler architecture, compute capability 3.0).
Kernels overlap only if the GPU has resources left to run a second kernel. Once the GPU is fully loaded, there is no gain from running more kernels in parallel, so the driver does not do that.

CUDA - no blocks, just threads for undefined dimensions

I have some matrices with unknown sizes varying from 10-20.000 in both directions.
I designed a CUDA kernel with (x;y) blocks and (x;y) threads.
Since matrices width/height aren't multiple of my dimensions, it was a terrible pain to get things work and the code is becoming more and more complicated to get coalescence memory reads.
Besides all of that, the kernel is growing in size using more and more registers to check for correctness... so I think this is not the way I should adopt.
My question is: what if I totally eliminate blocks and just create a grid of x;y threads? Will a SM unit have problems without many blocks?
Can I eliminate blocks and use a large amount of threads or is the block subdivision necessary?
You can't really just make a "grid of threads", since you have to organize threads into blocks and you can have a maximum of 512 threads per block. However, you could effectively do this by using 1 thread per block, which will result in a X by Y grid of 1x1 blocks. However, this will result in pretty terrible performance due to several factors:
According to the CUDA Programming Guide, a SM can handle a maximum of 8 blocks at any time. This will limit you to 8 threads per SM, which isn't enough to fill even a single warp. If you have, say, 48 CUDA cores, you will only be able to handle 384 threads at any given time.
With only 8 threads available on a SM, there will be too few warps to hide memory latencies. The GPU will spend most of its time waiting for memory accesses to complete, rather than doing any computations.
You will be unable to coalesce memory reads and writes, resulting in poor memory bandwidth usage.
You will be effectively unable to leverage shared memory, as this is a shared resource between threads in a block.
While having to ensure correctness for threads in a block is annoying, your performance will be vastly better than your "grid of threads" idea.
Here's the code i use to divide a given task requiring num_threads into block and grid. Yes, you might end up launching to many blocks (but only very few) and you will probably end up having more actual threads than required, but it's easy and efficient this way. See the second code example below for my simple in-kernel boundary check.
PS: I always have block_size == 128 because it has been a good tradeoff between multicore occupancy, register usage, shared memory requirements and coalescent access for all of my kernels.
Code to calculate a good grid size (host):
#define GRID_SIZE 65535
//calculate grid size (store result in grid/block)
void kernelUtilCalcGridSize(unsigned int num_threads, unsigned int block_size, dim3* grid, dim3* block) {
//block
block->x = block_size;
block->y = 1;
block->z = 1;
//number of blocks
unsigned int num_blocks = kernelUtilCeilDiv(num_threads, block_size);
unsigned int total_threads = num_blocks * block_size;
assert(total_threads >= num_threads);
//calculate grid size
unsigned int gy = kernelUtilCeilDiv(num_blocks, GRID_SIZE);
unsigned int gx = kernelUtilCeilDiv(num_blocks, gy);
unsigned int total_blocks = gx * gy;
assert(total_blocks >= num_blocks);
//grid
grid->x = gx;
grid->y = gy;
grid->z = 1;
}
//ceil division (rounding up)
unsigned int kernelUtilCeilDiv(unsigned int numerator, unsigned int denominator) {
return (numerator + denominator - 1) / denominator;
}
Code to calculate the unique thread id and check boundaries (device):
//some kernel
__global__ void kernelFoo(unsigned int num_threads, ...) {
//calculate unique id
const unsigned int thread_id = threadIdx.x;
const unsigned int block_id = blockIdx.x + blockIdx.y * gridDim.x;
const unsigned int unique_id = thread_id + block_id * blockDim.x;
//check range
if (unique_id >= num_threads) return;
//do the actual work
...
}
I don't think that's a lot of effort/registers/lines-of-code to check for correctness.