CUDA: pinned memory zero-copy problems - C++

I tried the code in this link: Is CUDA pinned memory zero-copy?
The person who asked claims the program worked fine for them, but it does not work the same way for me: the values do not change when I manipulate them in the kernel.
Basically my problem is that my GPU memory is not enough, but I want to do calculations which require more memory. I want my program to use RAM (host memory) and still be able to use CUDA for the calculations. The program in the link seemed to solve my problem, but the code does not give the output shown by the asker.
Any help or any working example of zero-copy memory would be useful.
Thank you.
__global__ void testPinnedMemory(double * mem)
{
    double currentValue = mem[threadIdx.x];
    printf("Thread id: %d, memory content: %f\n", threadIdx.x, currentValue);
    mem[threadIdx.x] = currentValue+10;
}
void test()
{
    const size_t THREADS = 8;
    double * pinnedHostPtr;
    cudaHostAlloc((void **)&pinnedHostPtr, THREADS, cudaHostAllocDefault);
    //set memory values
    for (size_t i = 0; i < THREADS; ++i)
        pinnedHostPtr[i] = i;
    //call kernel
    dim3 threadsPerBlock(THREADS);
    dim3 numBlocks(1);
    testPinnedMemory<<< numBlocks, threadsPerBlock>>>(pinnedHostPtr);
    //read output
    printf("Data after kernel execution: ");
    for (int i = 0; i < THREADS; ++i)
        printf("%f ", pinnedHostPtr[i]);
    printf("\n");
}

First of all, to allocate zero-copy memory you have to pass the cudaHostAllocMapped flag to cudaHostAlloc (also note that the allocation size should be THREADS * sizeof(double), not THREADS):
cudaHostAlloc((void **)&pinnedHostPtr, THREADS * sizeof(double), cudaHostAllocMapped);
Still, pinnedHostPtr can only be used to access the mapped memory from the host side. To access the same memory from the device, you have to get the device-side pointer to the memory like this:
double* dPtr;
cudaHostGetDevicePointer(&dPtr, pinnedHostPtr, 0);
Pass this pointer as the kernel argument:
testPinnedMemory<<< numBlocks, threadsPerBlock>>>(dPtr);
Also, you have to synchronize the kernel execution with the host to read the updated values. Just add cudaDeviceSynchronize after the kernel call.
The code in the linked question works for the person who asked it because they are running it on a 64-bit OS with a GPU of compute capability 2.0 and the TCC driver enabled. This configuration automatically enables the Unified Virtual Addressing (UVA) feature of the GPU, in which the device sees host and device memory as a single large address space instead of separate ones, so host pointers allocated with cudaHostAlloc can be passed directly to the kernel.
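If you want to check whether your own setup supports mapped memory and UVA, a minimal sketch (not part of the original answer) is to query the device properties; on configurations without UVA you would also have to call cudaSetDeviceFlags(cudaDeviceMapHost) before any allocation:
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// canMapHostMemory says whether zero-copy (mapped) allocations are supported,
// unifiedAddressing says whether UVA is active
printf("canMapHostMemory: %d, unifiedAddressing: %d\n",
       prop.canMapHostMemory, prop.unifiedAddressing);
// Only needed on non-UVA setups, and it must come before any call that creates a context:
// cudaSetDeviceFlags(cudaDeviceMapHost);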
In your case, the final code will look like this:
#include <cstdio>
__global__ void testPinnedMemory(double * mem)
{
    double currentValue = mem[threadIdx.x];
    printf("Thread id: %d, memory content: %f\n", threadIdx.x, currentValue);
    mem[threadIdx.x] = currentValue+10;
}
int main()
{
    const size_t THREADS = 8;
    double * pinnedHostPtr;
    cudaHostAlloc((void **)&pinnedHostPtr, THREADS * sizeof(double), cudaHostAllocMapped);
    //set memory values
    for (size_t i = 0; i < THREADS; ++i)
        pinnedHostPtr[i] = i;
    double* dPtr;
    cudaHostGetDevicePointer(&dPtr, pinnedHostPtr, 0);
    //call kernel
    dim3 threadsPerBlock(THREADS);
    dim3 numBlocks(1);
    testPinnedMemory<<< numBlocks, threadsPerBlock>>>(dPtr);
    cudaDeviceSynchronize();
    //read output
    printf("Data after kernel execution: ");
    for (int i = 0; i < THREADS; ++i)
        printf("%f ", pinnedHostPtr[i]);
    printf("\n");
    return 0;
}
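If you save this as, say, zerocopy.cu (the file name is arbitrary), it compiles and runs like the other examples in this post; each value should come back incremented by 10 if the kernel actually ran:
$ nvcc -o zerocopy zerocopy.cu
$ ./zerocopy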

Related

Two-threaded app is slower than single-threaded on C++ (VC++ 2010 Express). How to solve?

I have a program that allocates memory a lot. I hoped to boost its speed by splitting the task across threads, but it only made my program slower.
I made this minimal example, which has nothing to do with my real code aside from the fact that it allocates memory in different threads.
class ThreadStartInfo
{
public:
unsigned char *arr_of_5m_elems;
bool TaskDoneFlag;
ThreadStartInfo()
{
this->TaskDoneFlag = false;
this->arr_of_5m_elems = NULL;
}
~ThreadStartInfo()
{
if (this->arr_of_5m_elems)
free(this->arr_of_5m_elems);
}
};
unsigned long __stdcall CalcSomething(void *tsi_ptr)
{
ThreadStartInfo *tsi = (ThreadStartInfo*)tsi_ptr;
for (int i = 0; i < 5000000; i++)
{
double *test_ptr = (double*)malloc(tsi->arr_of_5m_elems[i] * sizeof(double));
memset(test_ptr, 0, tsi->arr_of_5m_elems[i] * sizeof(double));
free(test_ptr);
}
tsi->TaskDoneFlag = true;
return 0;
}
void main()
{
ThreadStartInfo *tsi1 = new ThreadStartInfo();
tsi1->arr_of_5m_elems = (unsigned char*)malloc(5000000 * sizeof(unsigned char));
ThreadStartInfo *tsi2 = new ThreadStartInfo();
tsi2->arr_of_5m_elems = (unsigned char*)malloc(5000000 * sizeof(unsigned char));
ThreadStartInfo **tsi_arr = (ThreadStartInfo**)malloc(2 * sizeof(ThreadStartInfo*));
tsi_arr[0] = tsi1;
tsi_arr[1] = tsi2;
time_t start_dt = time(NULL);
CalcSomething(tsi1);
CalcSomething(tsi2);
printf("Task done in %i seconds.\n", time(NULL) - start_dt);
//--
tsi1->TaskDoneFlag = false;
tsi2->TaskDoneFlag = false;
//--
start_dt = time(NULL);
unsigned long th1_id = 0;
void *th1h = CreateThread(NULL, 0, CalcSomething, tsi1, 0, &th1_id);
unsigned long th2_id = 0;
void *th2h = CreateThread(NULL, 0, CalcSomething, tsi2, 0, &th2_id);
retry:
for (int i = 0; i < 2; i++)
if (!tsi_arr[i]->TaskDoneFlag)
{
Sleep(100);
goto retry;
}
CloseHandle(th1h);
CloseHandle(th2h);
printf("MT Task done in %i seconds.\n", time(NULL) - start_dt);
}
It prints these results:
Task done in 16 seconds.
MT Task done in 19 seconds.
And... I didn't expect a slowdown. Is there any way to make memory allocations faster in multiple threads?
Apart from some undefined behavior due to lack of synchronization on TaskDoneFlag, all the threads are doing is calling malloc/free repeatedly.
The Visual C++ CRT heap is effectively single-threaded1: malloc/free delegate to HeapAlloc/HeapFree, which execute inside a critical section (only one thread at a time). Calling them from more than one thread at a time will never be faster than calling them from a single thread, and is often slower because of the lock-contention overhead.
Either reduce the number of allocations made in the threads (for example by hoisting the allocation out of the loop, as sketched below) or switch to another memory allocator, such as jemalloc or tcmalloc.
1 See this note for HeapAlloc:
Serialization ensures mutual exclusion when two or more threads attempt to simultaneously allocate or free blocks from the same heap. There is a small performance cost to serialization, but it must be used whenever multiple threads allocate and free memory from the same heap. Setting the HEAP_NO_SERIALIZE value eliminates mutual exclusion on the heap. Without serialization, two or more threads that use the same heap handle might attempt to allocate or free memory simultaneously, likely causing corruption in the heap.
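To illustrate the "reduce allocations in threads" option: since arr_of_5m_elems holds unsigned char values, no iteration ever needs more than 255 doubles, so each thread can allocate one worst-case scratch buffer up front and reuse it. This is only a sketch of the idea, not a drop-in replacement for the real code:
unsigned long __stdcall CalcSomething(void *tsi_ptr)
{
    ThreadStartInfo *tsi = (ThreadStartInfo*)tsi_ptr;
    // one allocation per thread instead of five million
    double *scratch = (double*)malloc(255 * sizeof(double));
    for (int i = 0; i < 5000000; i++)
    {
        // reuse the same buffer; only clear as much as this iteration needs
        memset(scratch, 0, tsi->arr_of_5m_elems[i] * sizeof(double));
        // ...do the real work with scratch here...
    }
    free(scratch);
    tsi->TaskDoneFlag = true;
    return 0;
}
This removes nearly all of the heap traffic, so the two threads no longer fight over the allocator lock.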

CUDA - separating CPU code from CUDA code

I was looking to use system functions (such as rand()) within the CUDA kernel. However, ideally these would just run on the CPU. Can I separate the files (.cu and .cpp) while still making use of GPU matrix addition? For example, something along these lines:
in main.cpp:
int main(){
std::vector<int> myVec;
srand(time(NULL));
for (int i = 0; i < 1024; i++){
myVec.push_back( rand()%26);
}
selfSquare(myVec, 1024);
}
and in cudaFuncs.cu:
__global__ void selfSquare_cu(int *arr, int n){
int i = threadIdx.x;
if (i < n){
arr[i] = arr[i] * arr[i];
}
}
void selfSquare(std::vector<int> arr, int n){
int *cuArr;
cudaMallocManaged(&cuArr, n * sizeof(int));
for (int i = 0; i < n; i++){
cuArr[i] = arr[i];
}
selfSquare_cu<<<1, n>>>(cuArr, n);
}
What are best practices surrounding situations like these? Would it be a better idea to use curand and write everything in the kernel? It looks to me like in the above example, there is an extra step in taking the vector and copying it to the shared cuda memory.
In this case the only thing you need is to have the array initialized with random values. Each value of the array can be initialized independently.
The CPU is involved in your code during the initialization and the transfer of the data to the device and back to the host.
In your case, do you really need the CPU to initialize the data, only to then move all those values to the GPU?
The best approach is to allocate some device memory and then initialize the values using a kernel (a sketch follows below).
This will save time because
The elements are initialized in parallel
No memory transfer is required from the host to the device
As a rule of thumb, always avoid communication between host and device if possible.
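For example, here is a minimal sketch of doing the initialization (and the squaring) entirely on the device with the cuRAND device API; the kernel name, seed, and launch configuration are my own choices, not taken from the question:
#include <curand_kernel.h>
// Each thread draws its own value in [0, 25] (the same range as rand() % 26 on the
// host) and squares it in place, so no host-to-device copy of the data is needed.
__global__ void initAndSquare(int *arr, int n, unsigned long long seed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        curandState state;
        curand_init(seed, i, 0, &state);   // independent subsequence per element
        int v = curand(&state) % 26;
        arr[i] = v * v;
    }
}
// host side, in the .cu file
void selfSquare(int *cuArr, int n)
{
    initAndSquare<<<(n + 255) / 256, 256>>>(cuArr, n, 1234ULL);
    cudaDeviceSynchronize();
}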

Double buffering in CUDA so the CPU can operate on data produced by a persistent kernel

I have a Monte Carlo simulation in which the state of the system is a bit string (size N) with the bits being randomly flipped. In an effort to accelerate the simulation, the code was revised to use CUDA. However, because of the large number of statistics I need to calculate from the system state (it goes as N^2), this part needs to be done on the CPU, where there is more memory. Currently the algorithm looks like this:
loop
CUDA kernel making 10s of Monte Carlo steps
Copy system state back to CPU
Calculate statistics
This is inefficient and I would like to have the kernel run persistently while the CPU occasionally queries the state of the system and calculates the statistics while the kernel continues to run.
Based on Tom's answer to this question I think the answer is double buffering, but I haven't been able to find an explanation or example of how to do this.
How does one set up the double buffering described in the third paragraph of Tom's answer for a CUDA/C++ code?
Here's a fully worked example of a "persistent" kernel, producer-consumer approach, with a double-buffered interface from device (producer) to host (consumer).
Persistent kernel design generally implies launching kernels with, at most, the number of blocks that can be simultaneously resident on the hardware (see item 1 on slide 16 here). For the most efficient usage of the machine, we'd generally like to maximize this, while still staying within the aforementioned limit. This involves an occupancy study for a specific kernel, and it will vary from kernel to kernel. Therefore I've chosen to take a shortcut here, and simply launch as many blocks as there are multiprocessors. Such an approach is always guaranteed to work (it could be considered a "lower bound" on the number of blocks to launch for a persistent kernel), but is (typically) not the most efficient usage of the machine. Nevertheless, I claim the occupancy study is beside the point of your question. Furthermore, it is arguable that proper "persistent kernel" design with guaranteed forward progress is actually quite tricky - requiring careful design of the CUDA thread code and placement of threadblocks (e.g. only use 1 threadblock per SM) to guarantee forward progress. However we don't need to delve to this level to address your question (I don't think) and the persistent kernel example I propose here only places 1 threadblock per SM.
I'm also assuming a proper UVA setup, so that I can skip the details of arranging for proper mapped memory allocations in an non-UVA setup.
The basic idea is that we will have 2 buffers on the device, along with 2 "mailboxes" in mapped memory, one for each buffer. The device kernel will fill a buffer with data, then set the "mailbox" to a value (2, in this case) that indicates the host may "consume" the buffer. The device then goes on to the other buffer and repeats the process in a ping-pong fashion between buffers. In order to make this work we must make sure that the device itself has not overrun the buffers (no thread is allowed to be more than one buffer ahead of any other thread) and that before a buffer is populated by the device, the host has consumed the previous contents.
On the host side, the code simply waits for the mailbox to indicate "full", then copies the buffer from device to host, resets the mailbox, and performs the "processing" on it (the validate function). It then goes on to the next buffer in a ping-pong fashion. The actual data "production" by the device is just to fill each buffer with the iteration number. The host then checks that the proper iteration number was received.
I've structured the code to call out the actual device "work" function (my_compute_function) which is where you would put whatever your Monte Carlo code is. If your code is nicely thread-independent, this should be straightforward. Thus the device side my_compute_function is the producer function, and the host side validate is the consumer function. If your device producer code is not simply thread independent, then you may need to restructure things slightly around the calling point to my_compute_function.
The net effect of this is that the device can "race ahead" and begin filling the next buffer, while the host is "consuming" the data in the previous buffer.
Because persistent kernel design imposes an upper bound on the number of blocks (and threads) in a kernel launch, I've chosen to implement the "work" producer function in a grid-striding loop, so that arbitrary size buffers can be handled by the given grid-width.
Here's a fully worked example:
$ cat t942.cu
#include <stdio.h>
#define ITERS 1000
#define DSIZE 65536
#define nTPB 256
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
__device__ volatile int blkcnt1 = 0;
__device__ volatile int blkcnt2 = 0;
__device__ volatile int itercnt = 0;
__device__ void my_compute_function(int *buf, int idx, int data){
buf[idx] = data; // put your work code here
}
__global__ void testkernel(int *buffer1, int *buffer2, volatile int *buffer1_ready, volatile int *buffer2_ready, const int buffersize, const int iterations){
// assumption of persistent block-limited kernel launch
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int iter_count = 0;
while (iter_count < iterations ){ // persistent until iterations complete
int *buf = (iter_count & 1)? buffer2:buffer1; // ping pong between buffers
volatile int *bufrdy = (iter_count & 1)?(buffer2_ready):(buffer1_ready);
volatile int *blkcnt = (iter_count & 1)?(&blkcnt2):(&blkcnt1);
int my_idx = idx;
while (iter_count - itercnt > 1); // don't overrun buffers on device
while (*bufrdy == 2); // wait for buffer to be consumed
while (my_idx < buffersize){ // perform the "work"
my_compute_function(buf, my_idx, iter_count);
my_idx += gridDim.x*blockDim.x; // grid-striding loop
}
__syncthreads(); // wait for my block to finish
__threadfence(); // make sure global buffer writes are "visible"
if (!threadIdx.x) atomicAdd((int *)blkcnt, 1); // mark my block done
if (!idx){ // am I the master block/thread?
while (*blkcnt < gridDim.x); // wait for all blocks to finish
*blkcnt = 0;
*bufrdy = 2; // indicate that buffer is ready
__threadfence_system(); // push it out to mapped memory
itercnt++;
}
iter_count++;
}
}
int validate(const int *data, const int dsize, const int val){
for (int i = 0; i < dsize; i++) if (data[i] != val) {printf("mismatch at %d, was: %d, should be: %d\n", i, data[i], val); return 0;}
return 1;
}
int main(){
int *h_buf1, *d_buf1, *h_buf2, *d_buf2;
volatile int *m_bufrdy1, *m_bufrdy2;
// buffer and "mailbox" setup
cudaHostAlloc(&h_buf1, DSIZE*sizeof(int), cudaHostAllocDefault);
cudaHostAlloc(&h_buf2, DSIZE*sizeof(int), cudaHostAllocDefault);
cudaHostAlloc(&m_bufrdy1, sizeof(int), cudaHostAllocMapped);
cudaHostAlloc(&m_bufrdy2, sizeof(int), cudaHostAllocMapped);
cudaCheckErrors("cudaHostAlloc fail");
cudaMalloc(&d_buf1, DSIZE*sizeof(int));
cudaMalloc(&d_buf2, DSIZE*sizeof(int));
cudaCheckErrors("cudaMalloc fail");
cudaStream_t streamk, streamc;
cudaStreamCreate(&streamk);
cudaStreamCreate(&streamc);
cudaCheckErrors("cudaStreamCreate fail");
*m_bufrdy1 = 0;
*m_bufrdy2 = 0;
cudaMemset(d_buf1, 0xFF, DSIZE*sizeof(int));
cudaMemset(d_buf2, 0xFF, DSIZE*sizeof(int));
cudaCheckErrors("cudaMemset fail");
// inefficient crutch for choosing number of blocks
int nblock = 0;
cudaDeviceGetAttribute(&nblock, cudaDevAttrMultiProcessorCount, 0);
cudaCheckErrors("get multiprocessor count fail");
testkernel<<<nblock, nTPB, 0, streamk>>>(d_buf1, d_buf2, m_bufrdy1, m_bufrdy2, DSIZE, ITERS);
cudaCheckErrors("kernel launch fail");
volatile int *bufrdy;
int *hbuf, *dbuf;
for (int i = 0; i < ITERS; i++){
if (i & 1){ // ping pong on the host side
bufrdy = m_bufrdy2;
hbuf = h_buf2;
dbuf = d_buf2;}
else {
bufrdy = m_bufrdy1;
hbuf = h_buf1;
dbuf = d_buf1;}
// int qq = 0; // add for failsafe - otherwise a machine failure can hang
while ((*bufrdy)!= 2); // use this for a failsafe: if (++qq > 1000000) {printf("bufrdy = %d\n", *bufrdy); return 0;} // wait for buffer to be full;
cudaMemcpyAsync(hbuf, dbuf, DSIZE*sizeof(int), cudaMemcpyDeviceToHost, streamc);
cudaStreamSynchronize(streamc);
cudaCheckErrors("cudaMemcpyAsync fail");
*bufrdy = 0; // release buffer back to device
if (!validate(hbuf, DSIZE, i)) {printf("validation failure at iter %d\n", i); exit(1);}
}
printf("Completed %d iterations successfully\n", ITERS);
}
$ nvcc -o t942 t942.cu
$ ./t942
Completed 1000 iterations successfully
$
I've tested the above code and it seems to work well on linux. I believe it should be OK on a windows TCC setup. On windows WDDM, however, I think there are issues that I am still investigating.
Note that the above kernel design attempts to do a grid-wide synchronization using a block-counting atomic strategy. CUDA now (9.0 and newer) has cooperative groups, and that is the recommended approach, rather than the above methodology, to create a grid-wide sync.
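For reference, a minimal sketch of what the cooperative-groups version of that grid-wide sync looks like (this assumes CUDA 9+, a device and driver that support cooperative launch, and a grid small enough to be co-resident; it is not part of the tested code above):
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void persistent_kernel(int *buf, int n)
{
    cg::grid_group grid = cg::this_grid();
    // ... fill the current buffer with a grid-striding loop ...
    grid.sync();   // grid-wide barrier, replacing the block-counting atomics above
    // ... flip buffers, signal the host, continue ...
}
// host side: a cooperative kernel must be launched through cudaLaunchCooperativeKernel,
// with no more blocks than can be simultaneously resident on the device
void launch(int *d_buf, int n, dim3 grid, dim3 block)
{
    void *args[] = { &d_buf, &n };
    cudaLaunchCooperativeKernel((void *)persistent_kernel, grid, block, args, 0, 0);
}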
This isn't a direct answer to your question but it may be of help.
I am working with a CUDA producer-consumer code that appears to be similar in basic structure to yours. I was hoping to speed up the code by making the CPU and GPU run concurrently. I attempted this by restructuring the code this way:
Launch kernel
Copy data
Loop
Launch kernel
CPU work
Copy data
CPU work
This way the CPU can work on the data from the last kernel run while the next set of data is being generated. This cut 30% off the runtime of my code. I am guessing it could get better if the GPU/CPU work could be balanced so they take roughly the same amount of time.
I am still launching the same kernel thousands of times. If the overhead of launching a kernel repeatedly is significant, then looking for a way to do what I have accomplished with a single launch would be worth it. Otherwise this is probably the best (simplest) solution. A rough sketch of the restructured loop is below.
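Here is that sketch, with one stream and two pinned host buffers (my own reconstruction of the scheme above; cpu_work, d_buf, and h_buf are hypothetical names), so the CPU processes batch i-1 while the GPU produces batch i:
kernel<<<grid, block, 0, stream>>>(d_buf);                               // produce batch 0
cudaMemcpyAsync(h_buf[0], d_buf, bytes, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);                                           // batch 0 is on the host
for (int i = 1; i < ITERS; ++i) {
    kernel<<<grid, block, 0, stream>>>(d_buf);                           // GPU produces batch i
    cudaMemcpyAsync(h_buf[i & 1], d_buf, bytes,
                    cudaMemcpyDeviceToHost, stream);                     // queued after the kernel
    cpu_work(h_buf[(i - 1) & 1]);                                        // CPU consumes batch i-1 meanwhile
    cudaStreamSynchronize(stream);                                       // batch i is on the host
}
cpu_work(h_buf[(ITERS - 1) & 1]);                                        // consume the final batch
The two host buffers should be allocated with cudaHostAlloc so the asynchronous copies can actually overlap with the CPU work.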

CUDA kernel function output variable isn't modified

I am trying to pass an object to a kernel. This object has basically two variables: one acts as the input and the other as the output of the kernel. But when I launch the kernel, the output variable does not change. When I add another variable to the kernel and assign the output value to this variable as well, it suddenly works for both of them.
I've read in another thread (While loop fails in CUDA kernel) that the compiler can treat a kernel as empty for optimization purposes if it doesn't produce any output.
So is it possible that this input/output object, which I'm passing as the only kernel argument, isn't somehow recognized by the compiler as an output? And if that's true, is there an elegant way (I would like to avoid adding another kernel argument), such as a compiler option, that can prevent this?
This is the class for this object.
class Replica
{
public :
signed char gA[1024];
int MA;
__device__ __host__ Replica(){
}
};
And this is the kernel that is basically a sum reduction.
__global__ void sumKerA(Replica* Rd)
{
int t = threadIdx.x;
int b = blockIdx.x;
__shared__ signed short gAs[1024];
gAs[t] = Rd[b].gA[t];
for (unsigned int stride = 1024 >> 1; stride > 0; stride >>= 1){
__syncthreads();
if (t < stride){
gAs[t] += gAs[t + stride];
}
}
__syncthreads();
if (t == 0){
Rd[b].MA = gAs[0];
}
}
And finally my host code.
int main ()
{
// replicas - array of objects
Replica R[128];
for (int i = 0; i < 128; ++i){
for (int j = 0; j < 1024; ++j){
R[i].gA[j] = 2*(rand() % 2) - 1;
}
R[i].MA = 0;
}
Replica* Rd;
cudaSetDevice(0);
cudaMalloc((void **)&Rd,128*sizeof(Replica));
cudaMemcpy(Rd,R,128*sizeof(Replica),cudaMemcpyHostToDevice);
dim3 DimBlock(1024,1,1);
dim3 DimGridA(128,1,1);
sumKerA <<< DimBlock, DimGridA >>> (Rd);
cudaThreadSynchronize();
cudaMemcpy(&R,Rd,128*sizeof(Replica),cudaMemcpyDeviceToHost);
// cudaMemcpy(&M,Md,128*sizeof(int),cudaMemcpyDeviceToHost);
for (int i = 0; i < 128; ++i){
cout << R[i].MA << " ";
}
cudaFree(Rd);
return 0;
}
Based on your reduction code, it appears that you intend to launch 1024 threads per block.
In that case, this is incorrect:
dim3 DimBlock(1024,1,1);
dim3 DimGridA(128,1,1);
sumKerA <<< DimBlock, DimGridA >>> (Rd);
The first kernel configuration parameter specifies the dimensions of the grid. The second parameter specifies the dimensions of the thread block. If you want 1024 threads per block while launching 128 blocks, your kernel launch should look like this:
sumKerA <<< DimGridA, DimBlock >>> (Rd);
If you add proper CUDA error checking to your code, I expect you would see a kernel launch failure: in your original configuration, using the block variable (blockIdx.x) to index into the 128-element Rd array would index beyond the end of the array.
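A minimal sketch of that error checking around the launch (standard CUDA runtime usage, not code from the question):
sumKerA <<< DimGridA, DimBlock >>> (Rd);
cudaError_t err = cudaGetLastError();      // reports configuration/launch errors
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();             // reports errors that occur during kernel execution
if (err != cudaSuccess)
    printf("execution error: %s\n", cudaGetErrorString(err));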
If you modify the Replica objects pointed to by Rd in your kernel, that is externally visible state, so any code that modifies those objects cannot be "optimized away" by the compiler.
Also note that cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize() (they have the same behavior).

While loop fails in CUDA kernel

I am using the GPU to do some calculations for processing words.
Initially, I used one block (with 500 threads) to process one word.
To process 100 words, I have to loop the kernel function 100 times in my main function.
for (int i=0; i<100; i++)
kernel <<< 1, 500 >>> (length_of_word);
My kernel function looks like this:
__global__ void kernel (int *dev_length)
{
int length = *dev_length;
while (length > 4)
{ //do something;
length -=4;
}
}
Now I want to process all 100 words at the same time.
Each block will still have 500 threads, and processes one word (per block).
dev_totalwordarray: stores all the characters of the words (one after another)
dev_length_array: stores the length of each word.
dev_accu_length: stores the cumulative length of the words (total characters of all previous words)
dev_salt_: an array of size 500, storing unsigned integers.
Hence, in my main function I have
kernel2 <<< 100, 500 >>> (dev_totalwordarray, dev_length_array, dev_accu_length, dev_salt_);
To populate the CPU array:
for (int i=0; i<wordnumber; i++)
{
int length=0;
while (word_list_ptr_array[i][length]!=0)
{
length++;
}
actualwordlength2[i] = length;
}
To copy from CPU -> GPU:
int* dev_array_of_word_length;
HANDLE_ERROR( cudaMalloc( (void**)&dev_array_of_word_length, 100 * sizeof(int) ) );
HANDLE_ERROR( cudaMemcpy( dev_array_of_word_length, actualwordlength2, 100 * sizeof(int), cudaMemcpyHostToDevice ) );
My kernel function now looks like this:
__global__ void kernel2 (char* dev_totalwordarray, int *dev_length_array, int* dev_accu_length, unsigned int* dev_salt_)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
unsigned int hash[N];
int length = dev_length_array[blockIdx.x];
while (tid < 50000)
{
const char* itr = &(dev_totalwordarray[dev_accu_length[blockIdx.x]]);
hash[tid] = dev_salt_[threadIdx.x];
unsigned int loop = 0;
while (length > 4)
{ const unsigned int& i1 = *(reinterpret_cast<const unsigned int*>(itr)); itr += sizeof(unsigned int);
const unsigned int& i2 = *(reinterpret_cast<const unsigned int*>(itr)); itr += sizeof(unsigned int);
hash[tid] ^= (hash[tid] << 7) ^ i1 * (hash[tid] >> 3) ^ (~((hash[tid] << 11) + (i2 ^ (hash[tid] >> 5))));
length -=4;
}
tid += blockDim.x * gridDim.x;
}
}
However, kernel2 doesn't seem to work at all.
It seems while (length > 4) causes this.
Does anyone know why? Thanks.
I am not sure if the while is the culprit, but I see a few things in your code that worry me:
Your kernel produces no output. The optimizer will most likely detect this and convert it to an empty kernel (see the sketch after this list).
In almost no situation do you want arrays allocated per thread. That will consume a lot of memory. Your hash[N] table will be allocated per thread and discarded at the end of the kernel. If N is big (and then multiplied by the total number of threads), you may run out of GPU memory. Not to mention that accessing the hash will be almost as slow as accessing global memory.
All threads in a block will have the same itr value. Is that intended?
Every thread initializes only a single field within its own copy of the hash table.
I see hash[tid] where tid is a global index. Be aware that even if hash were made global, you might hit concurrency problems. Not all blocks within a grid run at the same time. While one block initializes a portion of hash, another block might not even have started!
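To illustrate the first two points, here is a sketch of one possible fix (the dev_hash_out output array and the per-word layout are my assumptions, not part of the original code): keep the running hash in a register and write the final value to global memory, so the kernel has externally visible output and no per-thread array.
__global__ void kernel2(char *dev_totalwordarray, int *dev_length_array,
                        int *dev_accu_length, unsigned int *dev_salt_,
                        unsigned int *dev_hash_out)              // added output
{
    unsigned int h = dev_salt_[threadIdx.x];                     // scalar instead of hash[N]
    int length = dev_length_array[blockIdx.x];                   // one word per block
    const char *itr = &dev_totalwordarray[dev_accu_length[blockIdx.x]];
    while (length > 4)
    {
        unsigned int i1 = *reinterpret_cast<const unsigned int *>(itr); itr += sizeof(unsigned int);
        unsigned int i2 = *reinterpret_cast<const unsigned int *>(itr); itr += sizeof(unsigned int);
        h ^= (h << 7) ^ i1 * (h >> 3) ^ (~((h << 11) + (i2 ^ (h >> 5))));
        length -= 4;
    }
    // this write makes the result externally visible, so the kernel cannot be optimized away
    dev_hash_out[blockIdx.x * blockDim.x + threadIdx.x] = h;
}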