I've written a program to compute a histogram, where each of the 256 values for a char byte is counted:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "..\..\common\book.h"
#include <stdio.h>
#include <cuda.h>
#include <conio.h>
#define SIZE (100*1024*1024)
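__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo) {
    // Per-block histogram in shared memory; assumes 256 threads per block.
    __shared__ unsigned int temp[256];
    temp[threadIdx.x] = 0;
    __syncthreads();  // every bin must be zeroed before any thread starts counting
    // Grid-stride loop: each thread handles elements i, i + offset, i + 2*offset, ...
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&temp[buffer[i]], 1);
        i += offset;
    }
    __syncthreads();  // every count must be in temp before it is merged
    atomicAdd(&(histo[threadIdx.x]), temp[threadIdx.x]);
}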
int main()
{
    unsigned char *buffer = (unsigned char*)big_random_block(SIZE);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);  // events must be created before they are recorded
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    unsigned char *dev_buffer;
    unsigned int *dev_histo;
    cudaMalloc((void**)&dev_buffer, SIZE);
    cudaMemcpy(dev_buffer, buffer, SIZE, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&dev_histo, 256 * sizeof(unsigned int));
    cudaMemset(dev_histo, 0, 256 * sizeof(unsigned int));
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blocks = prop.multiProcessorCount;
    histo_kernel<<<blocks * 256, 256>>>(dev_buffer, SIZE, dev_histo);
    unsigned int histo[256];
    cudaMemcpy(histo, dev_histo, 256 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait for the stop event before reading the time
    float elapsed_time;
    cudaEventElapsedTime(&elapsed_time, start, stop);
    printf("Time to generate: %f ms\n", elapsed_time);
    long sum = 0;
    for (int i = 0; i < 256; i++)
        sum += histo[i];
    printf("The sum is %ld", sum);
    return 0;
}
I've read in the book CUDA by Example that launching the kernel with twice as many blocks as multiprocessors was found empirically to be optimal. Yet when I launch it with 8 times as many blocks, the running time drops further.
I've run the kernel with: (1) as many blocks as multiprocessors, (2) twice as many blocks as multiprocessors, (3) 4 times as many, and so on.
With (1), the running time was 112 ms.
With (2), the running time was 73 ms.
With (3), the running time was 52 ms.
Interestingly, beyond 8 blocks per multiprocessor the running time no longer varied significantly: it was about the same for 8, 256, and 1024 times the number of multiprocessors.
How can this be explained?
This behavior is typical. The GPU is a latency-hiding machine. In order to hide latency, when it hits a stall, it needs additional new work available. You can maximize the amount of additional new work available by giving the GPU a large number of blocks and threads.
Once you have given it enough work to hide latency as best it can, giving it additional work does not help. The machine is saturated. However, having additional work available is generally/typically not much of a detriment either. There is little overhead associated with blocks and threads.
Whatever you read in CUDA by Example may have been true for a specific case, but it is certainly not generally true that the correct number of blocks to launch is equal to twice the number of multiprocessors. A better target (typically) would be 4-8 blocks per multiprocessor.
When it comes to blocks and threads, more is usually better, and it's rarely the case that having arbitrarily large numbers of blocks and threads will actually cause a significant degradation in performance. This is contrary to typical CPU thread programming, where having large numbers of OMP threads, for example, may lead to a significant reduction in performance, when you exceed the core count.
When you are tuning the code for the last 10% in performance, then you will see people limit the number of blocks they launch to some number that is typically 4-8 times the number of SMs, and construct their threadblocks to loop over the data set. But this normally yields only a few percent performance improvement in most cases. As a reasonable CUDA programming starting point, aim for tens of thousands of threads, and hundreds of blocks, at least. A carefully tuned code may be able to saturate the machine with fewer blocks and threads, but it will become GPU-dependent at that point. And as I've stated already, there's rarely much of a performance detriment to having millions of threads and thousands of blocks.
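To make the "loop over the data set" idea concrete, here is a host-side C++ emulation of the grid-stride histogram (plain C++, not CUDA, and sequential rather than parallel). The helper names emulate_histogram and max_items_per_thread are hypothetical and do not come from the question's program. The sketch only illustrates that the result does not depend on the block count, while the amount of work per thread does; it says nothing about timing.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Emulates the grid-stride loop of histo_kernel on the host: simulated thread
// `tid` visits elements tid, tid + stride, tid + 2*stride, ...
std::array<std::uint64_t, 256> emulate_histogram(const std::vector<unsigned char>& buffer,
                                                 unsigned blocks,
                                                 unsigned threads_per_block) {
    assert(blocks > 0 && threads_per_block > 0);
    std::array<std::uint64_t, 256> histo{};  // all bins start at zero
    const std::size_t stride = static_cast<std::size_t>(blocks) * threads_per_block;
    for (std::size_t tid = 0; tid < stride; ++tid) {
        for (std::size_t i = tid; i < buffer.size(); i += stride) {
            ++histo[buffer[i]];
        }
    }
    return histo;
}

// Loop iterations executed by the busiest simulated thread (thread 0).
std::size_t max_items_per_thread(std::size_t size, unsigned blocks,
                                 unsigned threads_per_block) {
    assert(blocks > 0 && threads_per_block > 0);
    const std::size_t stride = static_cast<std::size_t>(blocks) * threads_per_block;
    return (size + stride - 1) / stride;  // ceiling division
}
```

For the 100 MiB buffer in the question, max_items_per_thread(100 * 1024 * 1024, 8, 256) is 51200 and halves each time the block count doubles; the histogram itself is identical for every grid size.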
I have seen CUDA kernels written in two different ways:
for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x; i < length; i += blockDim.x * gridDim.x) {
    // do stuff
}
and
uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < length) {
    // do stuff
}
Both versions are launched with kernel<<<num_blocks, threads_per_block>>>, where the threads per block are maximized for our device (1024) and the number of blocks is 2 for a length of 1025, for example.
The obvious difference is that the for loop lets the kernel cover the whole range when it is launched with fewer threads: for example, with 2 blocks of 512 threads and a length of 1025, the loop body runs up to twice per thread.
From previous research I've gathered that Nvidia suggests that we do not try to load balance ourselves (read as: loop within the kernel like this), for instance by giving a kernel fewer threads or fewer blocks to reserve space for other kernels on the device, because the built-in load balancing is supposed to handle this in a more globally optimized way.
So my question is why would we want to use the for loop vs the if statement form of kernel? Is there a benefit to either at run time?
Given my understanding of Nvidia's stance on load balancing, the only value I can see is the ability to debug synchronously via 1 thread and 1 block setting <<<1, 1>>> when we launch the kernel in the for loop version or not having to precompute the # of blocks needed (and/or threads).
This is the test project I ran:
#include <cstdint>
#include <cstdio>
__global__ void kernel(int length) {
    int counter = 0;
    for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x; i < length; i += blockDim.x * gridDim.x) {
        printf("%u: | i+: %u | tid: %u | counter: %u \n", i, blockDim.x * gridDim.x, threadIdx.x, counter++);
    }
}
__global__ void kernel2(int length) {
    uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length) {
        printf("%u: | i+: %u | tid: %u | \n", i, blockDim.x * gridDim.x, threadIdx.x);
    }
}
int main() {
    //kernel<<<2, 1024>>>(1025);
    kernel2<<<2, 1024>>>(1025);
    cudaDeviceSynchronize();  // wait for the kernel so its printf output is flushed
}
So my question is why would we want to use the for loop vs the if statement form of kernel? Is there a benefit to either at run time?
Yes, there is. Every CUDA thread needs to:
Read all of its parameters from constant memory
Read grid and thread information from special registers: blockDim, blockIdx, threadIdx (or at least their .x components)
Do the arithmetic for computing its global index.
That takes a bit of time. It's not a lot; but if your kernel is very simple (e.g. something like adding up two arrays), then - yes, that has a cost. And of course, if you perform your own preliminary computation that is used with all items in the sequence - each thread has to take the time to do that as well.
From previous research I've gathered that Nvidia suggests that we do not try and load balance ourselves (read as loop within the kernel like this)
I doubt that. The question of whether to iterate a large sequence with a single "CUDA thread" per item or with fewer threads, each working on multiple items, depends on what is done for individual items in the sequence.
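As a rough illustration of the trade-off, the following host-side C++ sketch (not CUDA) counts how many elements each kernel form would process for a given launch configuration, and how many threads pay the per-thread setup cost listed above. The function names are hypothetical, and the sketch models only index coverage, not performance.

```cpp
#include <cassert>
#include <cstdint>

// Threads launched; each one pays the per-thread setup cost once.
std::uint64_t threads_launched(std::uint32_t blocks, std::uint32_t threads) {
    return static_cast<std::uint64_t>(blocks) * threads;
}

// Elements processed by the `if (i < length)` form: at most one per thread.
std::uint64_t covered_by_if(std::uint32_t blocks, std::uint32_t threads,
                            std::uint64_t length) {
    const std::uint64_t total = threads_launched(blocks, threads);
    return total < length ? total : length;
}

// Elements processed by the grid-stride `for` form: every simulated thread
// walks i, i + stride, i + 2*stride, ... until it passes `length`.
std::uint64_t covered_by_loop(std::uint32_t blocks, std::uint32_t threads,
                              std::uint64_t length) {
    assert(blocks > 0 && threads > 0);
    const std::uint64_t stride = threads_launched(blocks, threads);
    std::uint64_t count = 0;
    for (std::uint64_t tid = 0; tid < stride; ++tid) {
        for (std::uint64_t i = tid; i < length; i += stride) {
            ++count;
        }
    }
    return count;
}
```

With the numbers from the question, covered_by_if(2, 1024, 1025) and covered_by_loop(2, 1024, 1025) both give 1025. With 2 blocks of 512 threads, the if form covers only 1024 elements while the loop form still covers all 1025, using half as many threads.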
In the course of optimising an inner loop I have come across strange performance behaviour that I'm having trouble understanding and correcting.
A pared-down version of the code follows; roughly speaking there is one gigantic array which is divided up into 16 word chunks, and I simply add up the number of leading zeroes of the words in each chunk. (In reality I'm using the popcnt code from Dan Luu, but here I picked a simpler instruction with similar performance characteristics for "brevity". Dan Luu's code is based on an answer to this SO question which, while it has tantalisingly similar strange results, does not seem to answer my questions here.)
// -*- compile-command: "gcc -O3 -march=native -Wall -Wextra -std=c99 -o clz-timing clz-timing.c" -*-
#include <stdint.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#define ARRAY_LEN 16
// Return the sum of the leading zeros of each element of the ARRAY_LEN
// words starting at u.
static inline uint64_t clz_array(const uint64_t u[ARRAY_LEN]) {
    uint64_t c0 = 0;
    for (int i = 0; i < ARRAY_LEN; ++i) {
        uint64_t t0;
        __asm__ ("lzcnt %1, %0" : "=r"(t0) : "r"(u[i]));
        c0 += t0;
    }
    return c0;
}
// For each of the narrays blocks of ARRAY_LEN words starting at
// arrays, put the result of clz_array(arrays + i*ARRAY_LEN) in
// counts[i]. Return the time taken in milliseconds.
double clz_arrays(uint32_t *counts, const uint64_t *arrays, int narrays) {
    clock_t t = clock();
    for (int i = 0; i < narrays; ++i, arrays += ARRAY_LEN)
        counts[i] = clz_array(arrays);
    t = clock() - t;
    // Convert clock time to milliseconds
    return t * 1e3 / (double)CLOCKS_PER_SEC;
}
void print_stats(double t_ms, long n, double total_MiB) {
    double t_s = t_ms / 1e3, thru = (n/1e6) / t_s, band = total_MiB / t_s;
    printf("Time: %7.2f ms, %7.2f x 1e6 clz/s, %8.1f MiB/s\n", t_ms, thru, band);
}
int main(int argc, char *argv[]) {
    long n = 1 << 20;
    if (argc > 1)
        n = atol(argv[1]);
    long total_bytes = n * ARRAY_LEN * sizeof(uint64_t);
    uint64_t *buf = malloc(total_bytes);
    uint32_t *counts = malloc(sizeof(uint32_t) * n);
    double t_ms, total_MiB = total_bytes / (double)(1 << 20);
    printf("Total size: %.1f MiB\n", total_MiB);
    // Warm up
    t_ms = clz_arrays(counts, buf, n);
    //print_stats(t_ms, n, total_MiB);    // (1)
    // Run it
    t_ms = clz_arrays(counts, buf, n);    // (2)
    print_stats(t_ms, n, total_MiB);
    // Write something into buf
    for (long i = 0; i < n*ARRAY_LEN; ++i)
        buf[i] = i;
    // And again...
    (void) clz_arrays(counts, buf, n);    // (3)
    t_ms = clz_arrays(counts, buf, n);    // (4)
    print_stats(t_ms, n, total_MiB);
    return 0;
}
The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function it is on uninitialised memory.
Here is the result of a typical run (compiler command is at the beginning of the source):
$ ./clz-timing 10000000
Total size: 1220.7 MiB
Time: 47.78 ms, 209.30 x 1e6 clz/s, 25548.9 MiB/s
Time: 77.41 ms, 129.19 x 1e6 clz/s, 15769.7 MiB/s
The CPU on which this was run is an "Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz", which has a turbo boost of 3.5GHz. The latency of the lzcnt instruction is 3 cycles but it has a throughput of 1 operation per cycle (see Agner Fog's Skylake instruction tables), so with 8-byte words (using uint64_t) at 3.5GHz the peak bandwidth should be 3.5e9 cycles/sec x 8 bytes/cycle = 28.0 GB/s (about 26.1 GiB/s), which is pretty close to what we see in the first number. Even at 2.6GHz we should get close to 20.8 GB/s (about 19.4 GiB/s).
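The peak-bandwidth estimate is plain arithmetic and can be written out as a short C++ sketch. The function names are hypothetical; the inputs (one lzcnt per cycle, 8 bytes per operation, 3.5 GHz turbo and 2.6 GHz base clock) are the figures quoted above. The GiB conversion is included because the program reports MiB/s.

```cpp
#include <cassert>

// bytes/s = cycles/s * operations/cycle * bytes/operation
double peak_bytes_per_second(double clock_hz, double ops_per_cycle,
                             double bytes_per_op) {
    assert(clock_hz > 0 && ops_per_cycle > 0 && bytes_per_op > 0);
    return clock_hz * ops_per_cycle * bytes_per_op;
}

// Converts bytes/s to GiB/s (1 GiB = 2^30 bytes).
double to_gib_per_second(double bytes_per_second) {
    return bytes_per_second / (1024.0 * 1024.0 * 1024.0);
}
```

At 3.5 GHz this gives 2.8e10 bytes/s, roughly 26.1 GiB/s, so the first measurement of 25548.9 MiB/s (about 24.9 GiB/s) sits close to the turbo-clock bound. At 2.6 GHz the bound is 2.08e10 bytes/s, roughly 19.4 GiB/s.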
The main question I have is,
Why is the bandwidth of call (4) always so far below the optimal value(s) obtained in call (2) and what can I do to guarantee optimal performance under a majority of circumstances?
Some points regarding what I've found so far:
According to extensive analysis with perf, the problem seems to be caused by LLC cache load misses in the slow cases that don't appear in the fast case. My guess was that maybe the fact that the memory on which we're performing the calculation hadn't been initialised meant that the compiler didn't feel obliged to load any particular values into memory, but the output of objdump -d clearly shows that the same code is being run each time. It's as though the hardware prefetcher was active the first time but not the second time, but in every case this array should be the easiest thing in the world to prefetch reliably.
The "warm up" calls at (1) and (3) are consistently as slow as the second printed bandwidth corresponding to call (4).
I've obtained much the same results on my desktop machine ("Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz").
Results were essentially the same between GCC 4.9, 7.0 and Clang 4.0. All tests run on Debian testing, kernel 4.14.
All of these results and observations can also be obtained with clz_array replaced by builtin_popcnt_unrolled_errata_manual from the Dan Luu post, mutatis mutandis.
Any help would be most appreciated!
The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function it is on uninitialised memory
Uninitialized memory that malloc gets from the kernel with mmap is all initially copy-on-write mapped to the same physical page of all zeros.
So you get TLB misses but not cache misses. If it used a 4k page, then you get L1D hits. If it used a 2M hugepage, then you only get L3 (LLC) hits, but that's still significantly better bandwidth than DRAM.
Single-core memory bandwidth is often limited by max_concurrency / latency, and often can't saturate DRAM bandwidth. (See Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?, and the "latency-bound platforms" section of this answer for more about this; it's much worse on many-core Xeon chips than on quad-core desktop/laptops.)
Your first warm-up run will suffer from page faults as well as TLB misses. Also, on a kernel with Meltdown mitigation enabled, any system call will flush the whole TLB. If you were adding extra print_stats to show the warm-up run performance, that would have made the run after slower.
You might want to loop multiple times over the same memory inside a timing run, so you don't need so many page-walks from touching so much virtual address space.
clock() is not a great way to measure performance. It records time in seconds, not CPU core clock cycles. If you run your benchmark long enough, you don't need really high precision, but you would need to control for CPU frequency to get accurate results. Calling clock() probably results in a system call, which (with Meltdown and Spectre mitigation enabled) flushes TLBs and branch-prediction. It may be slow enough for Skylake to clock back down from max turbo. You don't do any warm-up work after that, and of course you can't because anything after the first clock() is inside the timed interval.
Something based on wall-clock time which can use RDTSC as a timesource instead of switching to kernel mode (like gettimeofday()) would be lower overhead, although then you'd be measuring wall-clock time instead of CPU time. That's basically equivalent if the machine is otherwise idle so your process doesn't get descheduled.
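A minimal sketch of both suggestions (several passes over the same memory inside one timed region, and a wall-clock time source) might look like this in C++. It uses std::chrono::steady_clock as the wall-clock source; whether that avoids a switch to kernel mode depends on the platform and is an assumption here, not something the sketch checks. The names TimedResult and time_repeated_sum are hypothetical, and a plain sum stands in for the lzcnt loop.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TimedResult {
    std::uint64_t checksum;  // returned so the work cannot be optimized away
    double seconds;          // wall-clock time for all passes together
};

// Sums `buf` `passes` times inside a single timed region, so that page faults
// and page walks from the first touch are amortized over the later passes.
TimedResult time_repeated_sum(const std::vector<std::uint64_t>& buf, int passes) {
    assert(passes > 0);
    std::uint64_t checksum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p) {
        for (std::uint64_t v : buf) {
            checksum += v;
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return {checksum, elapsed.count()};
}
```

Dividing the bytes touched (buf.size() * sizeof(std::uint64_t) * passes) by the returned seconds gives a bandwidth figure that is less dominated by first-touch effects than a single pass would be.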
For something that wasn't memory-bound, CPU performance counters to count core clock cycles can be very accurate, and without the inconvenience of having to control for CPU frequency. (Although these days you don't have to reboot to temporarily disable turbo and set the governor to performance.)
But with memory-bound stuff, changing core frequency changes the ratio of core to memory, making memory faster or slower relative to the CPU.
I have a Monte Carlo simulation in which the state of the system is a bit string (size N) with the bits being randomly flipped. In an effort to accelerate the simulation the code was revised to use CUDA. However because of the large number of statistics I need calculated from the system state (goes as N^2) this part needs to be done on the CPU where there is more memory. Currently the algorithm looks like this:
CUDA kernel making 10s of Monte Carlo steps
Copy system state back to CPU
Calculate statistics
This is inefficient and I would like to have the kernel run persistently while the CPU occasionally queries the state of the system and calculates the statistics while the kernel continues to run.
Based on Tom's answer to this question I think the answer is double buffering, but I haven't been able to find an explanation or example of how to do this.
How does one set up the double buffering described in the third paragraph of Tom's answer for a CUDA/C++ code?
Here's a fully worked example of a "persistent" kernel, producer-consumer approach, with a double-buffered interface from device (producer) to host (consumer).
Persistent kernel design generally implies launching kernels with, at most, the number of blocks that can be simultaneously resident on the hardware (see item 1 on slide 16 here). For the most efficient usage of the machine, we'd generally like to maximize this, while still staying within the aforementioned limit. This involves an occupancy study for a specific kernel, and it will vary from kernel to kernel. Therefore I've chosen to take a shortcut here, and simply launch as many blocks as there are multiprocessors. Such an approach is always guaranteed to work (it could be considered a "lower bound" on the number of blocks to launch for a persistent kernel), but is (typically) not the most efficient usage of the machine. Nevertheless, I claim the occupancy study is beside the point of your question. Furthermore, it is arguable that proper "persistent kernel" design with guaranteed forward progress is actually quite tricky - requiring careful design of the CUDA thread code and placement of threadblocks (e.g. only use 1 threadblock per SM) to guarantee forward progress. However we don't need to delve to this level to address your question (I don't think) and the persistent kernel example I propose here only places 1 threadblock per SM.
I'm also assuming a proper UVA setup, so that I can skip the details of arranging for proper mapped memory allocations in a non-UVA setup.
The basic idea is that we will have 2 buffers on the device, along with 2 "mailboxes" in mapped memory, one for each buffer. The device kernel will fill a buffer with data, then set the "mailbox" to a value (2, in this case) that indicates the host may "consume" the buffer. The device then goes on to the other buffer and repeats the process in a ping-pong fashion between buffers. In order to make this work we must make sure that the device itself has not overrun the buffers (no thread is allowed to be more than one buffer ahead of any other thread) and that before a buffer is populated by the device, the host has consumed the previous contents.
On the host side, the code simply waits for the mailbox to indicate "full", copies the buffer from device to host, resets the mailbox, and performs the "processing" on it (the validate function). It then goes on to the next buffer in a ping-pong fashion. The actual data "production" by the device is just to fill each buffer with the iteration number. The host then checks that the proper iteration number was received.
I've structured the code to call out the actual device "work" function (my_compute_function) which is where you would put whatever your Monte Carlo code is. If your code is nicely thread-independent, this should be straightforward. Thus the device side my_compute_function is the producer function, and the host side validate is the consumer function. If your device producer code is not simply thread independent, then you may need to restructure things slightly around the calling point to my_compute_function.
The net effect of this is that the device can "race ahead" and begin filling the next buffer, while the host is "consuming" the data in the previous buffer.
Because persistent kernel design imposes an upper bound on the number of blocks (and threads) in a kernel launch, I've chosen to implement the "work" producer function in a grid-striding loop, so that arbitrary size buffers can be handled by the given grid-width.
Here's a fully worked example:
$ cat t942.cu
#include <stdio.h>
#define ITERS 1000
#define DSIZE 65536
#define nTPB 256
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
__device__ volatile int blkcnt1 = 0;
__device__ volatile int blkcnt2 = 0;
__device__ volatile int itercnt = 0;
__device__ void my_compute_function(int *buf, int idx, int data){
  buf[idx] = data;  // put your work code here
}
__global__ void testkernel(int *buffer1, int *buffer2, volatile int *buffer1_ready, volatile int *buffer2_ready,  const int buffersize, const int iterations){
  // assumption of persistent block-limited kernel launch
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int iter_count = 0;
  while (iter_count < iterations ){ // persistent until iterations complete
    int *buf = (iter_count & 1)? buffer2:buffer1; // ping pong between buffers
    volatile int *bufrdy = (iter_count & 1)?(buffer2_ready):(buffer1_ready);
    volatile int *blkcnt = (iter_count & 1)?(&blkcnt2):(&blkcnt1);
    int my_idx = idx;
    while (iter_count - itercnt > 1); // don't overrun buffers on device
    while (*bufrdy == 2);  // wait for buffer to be consumed
    while (my_idx < buffersize){ // perform the "work"
      my_compute_function(buf, my_idx, iter_count);
      my_idx += gridDim.x*blockDim.x; // grid-striding loop
      }
    __syncthreads(); // wait for my block to finish
    __threadfence(); // make sure global buffer writes are "visible"
    if (!threadIdx.x) atomicAdd((int *)blkcnt, 1); // mark my block done
    if (!idx){ // am I the master block/thread?
      while (*blkcnt < gridDim.x);  // wait for all blocks to finish
      *blkcnt = 0;
      *bufrdy = 2;  // indicate that buffer is ready
      __threadfence_system(); // push it out to mapped memory
      itercnt++;  // record the completed iteration so other blocks may advance
      }
    iter_count++;
    }
}
int validate(const int *data, const int dsize, const int val){
  for (int i = 0; i < dsize; i++) if (data[i] != val) {printf("mismatch at %d, was: %d, should be: %d\n", i, data[i], val); return 0;}
  return 1;
}
int main(){
  int *h_buf1, *d_buf1, *h_buf2, *d_buf2;
  volatile int *m_bufrdy1, *m_bufrdy2;
  // buffer and "mailbox" setup
  cudaHostAlloc(&h_buf1, DSIZE*sizeof(int), cudaHostAllocDefault);
  cudaHostAlloc(&h_buf2, DSIZE*sizeof(int), cudaHostAllocDefault);
  cudaHostAlloc(&m_bufrdy1, sizeof(int), cudaHostAllocMapped);
  cudaHostAlloc(&m_bufrdy2, sizeof(int), cudaHostAllocMapped);
  cudaCheckErrors("cudaHostAlloc fail");
  cudaMalloc(&d_buf1, DSIZE*sizeof(int));
  cudaMalloc(&d_buf2, DSIZE*sizeof(int));
  cudaCheckErrors("cudaMalloc fail");
  cudaStream_t streamk, streamc;
  cudaStreamCreate(&streamk);  // stream for the persistent kernel
  cudaStreamCreate(&streamc);  // stream for the device-to-host copies
  cudaCheckErrors("cudaStreamCreate fail");
  *m_bufrdy1 = 0;
  *m_bufrdy2 = 0;
  cudaMemset(d_buf1, 0xFF, DSIZE*sizeof(int));
  cudaMemset(d_buf2, 0xFF, DSIZE*sizeof(int));
  cudaCheckErrors("cudaMemset fail");
  // inefficient crutch for choosing number of blocks
  int nblock = 0;
  cudaDeviceGetAttribute(&nblock, cudaDevAttrMultiProcessorCount, 0);
  cudaCheckErrors("get multiprocessor count fail");
  testkernel<<<nblock, nTPB, 0, streamk>>>(d_buf1, d_buf2, m_bufrdy1, m_bufrdy2, DSIZE, ITERS);
  cudaCheckErrors("kernel launch fail");
  volatile int *bufrdy;
  int *hbuf, *dbuf;
  for (int i = 0; i < ITERS; i++){
    if (i & 1){  // ping pong on the host side
      bufrdy = m_bufrdy2;
      hbuf = h_buf2;
      dbuf = d_buf2;}
    else {
      bufrdy = m_bufrdy1;
      hbuf = h_buf1;
      dbuf = d_buf1;}
    // int qq = 0; // add for failsafe - otherwise a machine failure can hang
    while ((*bufrdy)!= 2); // use this for a failsafe:  if (++qq > 1000000) {printf("bufrdy = %d\n", *bufrdy); return 0;} // wait for buffer to be full;
    cudaMemcpyAsync(hbuf, dbuf, DSIZE*sizeof(int), cudaMemcpyDeviceToHost, streamc);
    cudaStreamSynchronize(streamc);  // the copy must finish before the buffer is released
    cudaCheckErrors("cudaMemcpyAsync fail");
    *bufrdy = 0; // release buffer back to device
    if (!validate(hbuf, DSIZE, i)) {printf("validation failure at iter %d\n", i); exit(1);}
    }
  printf("Completed %d iterations successfully\n", ITERS);
}
$ nvcc -o t942 t942.cu
$ ./t942
Completed 1000 iterations successfully
I've tested the above code and it seems to work well on Linux. I believe it should be OK on a Windows TCC setup. On Windows WDDM, however, I think there are issues that I am still investigating.
Note that the above kernel design attempts to do a grid-wide synchronization using a block-counting atomic strategy. CUDA now (9.0 and newer) has cooperative groups, and that is the recommended approach, rather than the above methodology, to create a grid-wide sync.
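The mailbox handshake can also be tried without a GPU. The following is a host-only C++ sketch in which one std::thread stands in for the device (producer) and the calling thread is the host (consumer). It keeps the same mailbox convention (2 means "full", 0 means "free"), but everything else is an assumption made for illustration: std::atomic replaces mapped memory and the fence calls, a vector copy replaces cudaMemcpyAsync, and the name run_double_buffer is hypothetical. It does not model grid-wide synchronization.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr int kFree = 0;  // buffer may be overwritten by the producer
constexpr int kFull = 2;  // buffer may be consumed (same value as in the example)

// Runs `iterations` produce/consume rounds over two buffers and returns how
// many rounds the consumer validated successfully.
int run_double_buffer(int iterations, std::size_t buffer_size) {
    assert(iterations >= 0 && buffer_size > 0);
    std::array<std::vector<int>, 2> buffers{std::vector<int>(buffer_size),
                                            std::vector<int>(buffer_size)};
    std::array<std::atomic<int>, 2> mailbox;
    mailbox[0].store(kFree);
    mailbox[1].store(kFree);

    // Producer: fill a buffer with the iteration number, then mark it full.
    std::thread producer([&] {
        for (int it = 0; it < iterations; ++it) {
            const int b = it & 1;  // ping-pong between the two buffers
            while (mailbox[b].load(std::memory_order_acquire) == kFull)
                std::this_thread::yield();  // wait until the buffer was consumed
            for (int& x : buffers[b]) x = it;
            mailbox[b].store(kFull, std::memory_order_release);
        }
    });

    // Consumer: wait for "full", copy out, release the buffer, then validate.
    int validated = 0;
    std::vector<int> host_copy(buffer_size);
    for (int it = 0; it < iterations; ++it) {
        const int b = it & 1;
        while (mailbox[b].load(std::memory_order_acquire) != kFull)
            std::this_thread::yield();
        host_copy = buffers[b];                              // stand-in for the D2H copy
        mailbox[b].store(kFree, std::memory_order_release);  // hand the buffer back
        bool ok = true;
        for (int x : host_copy) ok = ok && (x == it);
        if (ok) ++validated;
    }
    producer.join();
    return validated;
}
```

Because the consumer releases the buffer before validating its private copy, the producer can already refill that buffer while validation runs, which is the "race ahead" effect described above.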
This isn't a direct answer to your question but it may be of help.
I am working with a CUDA producer-consumer code that appears to be similar in basic structure to yours. I was hoping to speed up the code by making the CPU and GPU run concurrently. I attempted this by restructuring the code this way:
Launch kernel
Copy data
Launch kernel
CPU work
Copy data
CPU work
This way the CPU can work on the data from the last kernel run while the next set of data is being generated. This cut 30% off the runtime of my code. I am guessing it could get better if the GPU/CPU work can be balanced so they take roughly the same amount of time.
I am still launching the same kernel 1000s of times. If the overhead of launching a kernel repeatedly is significant, then looking for a way to do what I have accomplished with a single launch would be worth it. Otherwise this is probably the best (simplest) solution.
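The ordering above can be sketched in host-only C++, with std::async standing in for "launch kernel + copy data". This is an illustration of the overlap pattern only: generate_batch, process_batch and run_pipelined are hypothetical names, and no GPU work is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for "launch kernel + copy data": produces the batch for `step`.
std::vector<int> generate_batch(int step, std::size_t n) {
    return std::vector<int>(n, step);
}

// Stand-in for the CPU work done on one batch.
long long process_batch(const std::vector<int>& batch) {
    return std::accumulate(batch.begin(), batch.end(), 0LL);
}

// Processes `steps` batches; batch k+1 is generated while batch k is processed.
long long run_pipelined(int steps, std::size_t n) {
    assert(steps > 0);
    long long total = 0;
    std::future<std::vector<int>> next =
        std::async(std::launch::async, generate_batch, 0, n);
    for (int step = 0; step < steps; ++step) {
        std::vector<int> current = next.get();  // wait for this step's batch
        if (step + 1 < steps) {
            // Start producing the next batch before doing the CPU work.
            next = std::async(std::launch::async, generate_batch, step + 1, n);
        }
        total += process_batch(current);  // overlaps with the generation above
    }
    return total;
}
```

The overlap pays off most when generating a batch and processing a batch take about the same time, which matches the balancing remark above.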
I am developing an application for which performance is a fundamental issue. In particular, I would like to organize a tree-like structure that needs to be traversed really quickly in blocks of the same size as my memory page size, so that it would reduce the number of cache misses needed to reach a leaf.
I am quite a novice in the art of memory optimization. As far as I understand, the process of accessing the main memory goes more or less as follows:
CPUs have several layers of caches of increasing size and decreasing speed.
Every time some data that I need is already in the cache, it is fetched from the cache (cache hit).
If it is not in the cache, it will be fetched from the main memory.
Anytime something is loaded from the main memory, the whole page (or pages) containing the data are loaded and stored in the cache. In this way, if I try to access locations in memory that are close to the ones I already fetched from the main memory, they will already be in my CPU cache.
However, if I organize my data in blocks of the same size as my memory page size, I thought that I would also need to align that data properly, so that whenever a new block of my data needs to be loaded, only one page of memory has to be fetched from the main memory (rather than the two pages containing the first half and the second half of my data block). In principle, shouldn't a correctly aligned data block mean only one access to the memory rather than two? Shouldn't that more or less double memory performance?
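The "one page rather than two" reasoning can be checked with a little arithmetic. The C++ sketch below counts how many aligned units of a given size a block touches when it starts at some byte offset. The helper name units_touched is hypothetical. The 4096-byte page size matches the shifts tested in the question; the 64-byte cache-line size is an assumption (common on x86-64, not taken from the question).

```cpp
#include <cassert>
#include <cstddef>

// Number of aligned units of `unit` bytes touched by a block of `size` bytes
// that starts `offset` bytes past an aligned boundary.
std::size_t units_touched(std::size_t offset, std::size_t size, std::size_t unit) {
    assert(size > 0 && unit > 0);
    const std::size_t first = offset / unit;             // unit holding the first byte
    const std::size_t last = (offset + size - 1) / unit; // unit holding the last byte
    return last - first + 1;
}
```

For a 4096-byte block, units_touched(0, 4096, 4096) is 1 and units_touched(100, 4096, 4096) is 2, which is the doubling the question expects. Counted in assumed 64-byte cache lines instead, the same block touches 64 lines when aligned and 65 when shifted by 100 bytes.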
I tried the following:
#include <iostream>
#include <unistd.h>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <stdlib.h>
using namespace std;
#define BLOCKS 262144
#define TESTS 131072
unsigned long int utime()
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return tp.tv_sec * 1000000 + tp.tv_usec;
}
unsigned long int pagesize = sysconf(_SC_PAGE_SIZE);
unsigned long int block_slots = pagesize / sizeof(unsigned int);
unsigned int t = 0;
unsigned int p = 0;
unsigned long int test(unsigned int * data)
{
    unsigned long int start = utime();
    for(unsigned int n=0; n<TESTS; n++)
    {
        for(unsigned int i=0; i<block_slots; i++)
            t += data[p * block_slots + i];
        p = t % BLOCKS;
    }
    unsigned long int end = utime();
    return end - start;
}
int main()
{
    srand((unsigned int) time(NULL));
    char * buffer = new char[(BLOCKS + 3) * pagesize];
    for(unsigned int i=0; i<(BLOCKS + 3) * pagesize; i++)
        buffer[i] = rand();
    for(unsigned int i=0; i<pagesize; i++)
        cout<<test((unsigned int *) (buffer + i))<<endl;
    delete [] buffer;
}
This code allocates roughly 1 GB and fills it with random numbers. Then the function test is called with all the possible shifts in a memory page (from a 0 shift to a 4096 shift). The test function interprets the pointer provided as a group of blocks of data and carries out some simple operation (sum) over those blocks. The order of access to the blocks is more or less random (it's determined by the partial sums), so that every time a new block is accessed it is nearly certain not to already be in the cache.
The function test is then timed. In all the shift configurations but one I should observe some timing, while in one particular shift configuration (the null shift, maybe?) I should observe some big improvement in terms of efficiency. This, however, does not happen at all: all the shift timings are perfectly compatible with each other.
Why does this happen and what does this mean? Can I just forget about memory alignment? Can I also forget about making my data blocks exactly as big as a memory page? (I was planning to use some padding in case they were smaller). Or maybe something in the cache management process is just unclear to me?
First I should say I'm quite new to programming in C++ (let alone CUDA), though it is what I first learned with about 184 years ago. I'd say I'm a bit out of touch with memory allocation, and datatype sizes, though I'm learning. Anyway here goes:
I have a GPU with compute capability 3.0 (It's a Geforce 660 GTX w/ 2GB of DRAM).
Going by ./deviceQuery found in the CUDA samples (and by other charts I've found online), my maximum grid size is listed:
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
At 2,147,483,647 (2^31-1) that x dimension is huge and kind of nice… YET, when I run my code, pushing beyond 65535 in the x dimension, things get... weird.
I used an example from an Udacity course, and modified it to test the extremes. I've kept the kernel code fairly simple to prove the point:
__global__ void referr(long int *d_out, long int *d_in){
    long int idx = blockIdx.x;
    d_out[idx] = idx;
}
Please note below the ARRAY_SIZE being the size of the grid, but also being the size of the array of integers on which to do operations. I am leaving the size of the blocks at 1x1x1. JUST for the sake of understanding the limitations, I KNOW that having this many operations with blocks of only 1 thread makes no sense, but I want to understand what's going on with the grid size limitations.
int main(int argc, char ** argv) {
    const long int ARRAY_SIZE = 522744;
    const long int ARRAY_BYTES = ARRAY_SIZE * sizeof(long int);
    // generate the input array on the host
    long int h_in[ARRAY_SIZE];
    for (long int i = 0; i < ARRAY_SIZE; i++) {
        h_in[i] = i;
    }
    long int h_out[ARRAY_SIZE];
    // declare GPU memory pointers
    long int *d_in;
    long int *d_out;
    // allocate GPU memory
    cudaMalloc((void**) &d_in, ARRAY_BYTES);
    cudaMalloc((void**) &d_out, ARRAY_BYTES);
    // transfer the array to the GPU
    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);
    // launch the kernel with ARRAY_SIZE blocks in the x dimension, with 1 thread each.
    referr<<<ARRAY_SIZE, 1>>>(d_out, d_in);
    // copy back the result array to the CPU
    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    // print out the resulting array
    for (long int i =0; i < ARRAY_SIZE; i++) {
        printf("%li", h_out[i]);
        printf(((i % 4) != 3) ? "\t" : "\n");
    }
    return 0;
}
This works as expected with an ARRAY_SIZE of at MOST 65535. The last few lines of the output are below:
65516 65517 65518 65519
65520 65521 65522 65523
65524 65525 65526 65527
65528 65529 65530 65531
65532 65533 65534
If I push the ARRAY_SIZE beyond this, the output gets really unpredictable, and eventually, if the number gets too high, I get a Segmentation fault (core dumped) message… whatever that even means. For example, with an ARRAY_SIZE of 65536:
65520 65521 65522 65523
65524 65525 65526 65527
65528 65529 65530 65531
65532 65533 65534 131071
Why is it now stating that the blockIdx.x for this last one is 131071?? That is 65535+65535+1. Weird.
Even weirder, when I set the ARRAY_SIZE to 65537 (65535+2) I get some seriously strange results for the last lines of the output.
65520 65521 65522 65523
65524 65525 65526 65527
65528 65529 65530 65531
65532 65533 65534 131071
131072 131073 131074 131075
131076 131077 131078 131079
131080 131081 131082 131083
131084 131085 131086 131087
131088 131089 131090 131091
131092 131093 131094 131095
131096 131097 131098 131099
131100 131101 131102 131103
131104 131105 131106 131107
131108 131109 131110 131111
131112 131113 131114 131115
131116 131117 131118 131119
131120 131121 131122 131123
131124 131125 131126 131127
131128 131129 131130 131131
131132 131133 131134 131135
131136 131137 131138 131139
131140 131141 131142 131143
131144 131145 131146 131147
131148 131149 131150 131151
131152 131153 131154 131155
131156 131157 131158 131159
131160 131161 131162 131163
131164 131165 131166 131167
131168 131169 131170 131171
131172 131173 131174 131175
131176 131177 131178 131179
131180 131181 131182 131183
131184 131185 131186 131187
131188 131189 131190 131191
131192 131193 131194 131195
131196 131197 131198 131199
Isn't 65535 the limit for older GPUs? Why is my GPU "messing up" when I push past the 65535 barrier for the x grid dimension? Or is this by design? What in the world is going on?
Wow, sorry for the long question.
Any help to understand this would be greatly appreciated! Thanks!
You should be using proper CUDA error checking. And you should be compiling for a compute 3.0 architecture by specifying -arch=sm_30 when you compile with nvcc.
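As background for the -arch advice: the documented maximum x-dimension of a grid is 65535 for compute capability below 3.0 and 2^31-1 from 3.0 onward, and, as I understand it, code built for an older target is held to the older limit even on a newer GPU. The host-side C++ sketch below simply encodes those two limits; the function names are hypothetical and it is not a substitute for checking the error codes returned by the CUDA runtime.

```cpp
#include <cassert>
#include <cstdint>

// Maximum grid size in x for the compute capability the code was compiled for.
std::uint64_t max_grid_dim_x(int compiled_cc_major) {
    assert(compiled_cc_major >= 1);
    return compiled_cc_major >= 3 ? 2147483647ull  // 2^31 - 1
                                  : 65535ull;      // limit before cc 3.0
}

// True if a one-dimensional launch with `grid_x` blocks is within the limit.
bool launch_config_valid(std::uint64_t grid_x, int compiled_cc_major) {
    return grid_x >= 1 && grid_x <= max_grid_dim_x(compiled_cc_major);
}
```

With these limits, launch_config_valid(65535, 2) is true, launch_config_valid(65536, 2) is false, and launch_config_valid(65536, 3) is true. A launch that exceeds the limit fails, and without error checking the failure goes unnoticed, so the output array is never written by the kernel.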