CUDA shared memory programming is not working

all:
I am learning how shared memory accelerates GPU programs. I am using the code below to calculate the squared value of each element plus the squared value of the average of its left and right neighbors.
The code runs, but the result is not what I expect.
The first 10 results printed are 0,1,2,3,4,5,6,7,8,9, while I am expecting 25,2,8,18,32,50,72,98,128,162.
The code is as follows, with reference to here:
Would you please tell me which part goes wrong? Your help is very much appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <cuda.h>
const int N=1024;
__global__ void compute_it(float *data)
{
    int tid = threadIdx.x;
    __shared__ float myblock[N];
    float tmp;
    // load the thread's data element into shared memory
    myblock[tid] = data[tid];
    // ensure that all threads have loaded their values into
    // shared memory; otherwise, one thread might be computing
    // on uninitialized data.
    __syncthreads();
    // compute the average of this thread's left and right neighbors
    tmp = (myblock[tid>0?tid-1:(N-1)] + myblock[tid<(N-1)?tid+1:0]) * 0.5f;
    // square the previous result and add my value, squared
    tmp = tmp*tmp + myblock[tid]*myblock[tid];
    // write the result back to global memory
    data[tid] = myblock[tid];
    __syncthreads();
}
int main (){
    char key;
    float *a;
    float *dev_a;
    a = (float*)malloc(N*sizeof(float));
    cudaMalloc((void**)&dev_a,N*sizeof(float));
    for (int i=0; i<N; i++){
        a [i] = i;
    }
    cudaMemcpy(dev_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
    compute_it<<<N,1>>>(dev_a);
    cudaMemcpy(a, dev_a, N*sizeof(float), cudaMemcpyDeviceToHost);
    for (int i=0; i<10; i++){
        std::cout<<a [i]<<",";
    }
    std::cin>>key;
    free (a);
    free (dev_a);
}

One of the most immediate problems in your kernel code is this:
data[tid] = myblock[tid];
I think you probably meant this:
data[tid] = tmp;
In addition, you're launching 1024 blocks of one thread each. This isn't a particularly effective way to use the GPU and it means that your tid variable in every threadblock is 0 (and only 0, since there is only one thread per threadblock.)
There are many problems with this approach, but one immediate problem will be encountered here:
tmp = (myblock[tid>0?tid-1:(N-1)] + myblock[tid<(N-1)?tid+1:0]) * 0.5f;
Since tid is always zero, and therefore no other values in your shared memory array (myblock) get populated, the logic in this line cannot be sensible. When tid is zero, you are selecting myblock[N-1] for the first term in the assignment to tmp, but myblock[1023] never gets populated with anything.
It seems that you don't understand various CUDA hierarchies:
a grid is all threads associated with a kernel launch
a grid is composed of threadblocks
each threadblock is a group of threads working together on a single SM
the shared memory resource is a per-SM resource, not a device-wide resource
__syncthreads() also operates on a threadblock basis (not device-wide)
threadIdx.x is a built-in variable that provides a unique thread ID for all threads within a threadblock, but not globally across the grid.
Instead you should break your problem into groups of reasonable-sized threadblocks (i.e. more than one thread). Each threadblock will then be able to behave in a fashion that is roughly as you have outlined. You will then need to special-case the behavior at the starting point and ending point (in your data) of each threadblock.
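A sketch of that decomposition, written as plain C++ so it can be run on the host: each iteration of the outer loop stands in for one threadblock, tile stands in for the block's shared memory array, and the two extra tile slots hold the neighbors that live in adjacent blocks. The names blocked_neighbor_square, block_size and tile are illustrative, the wrap-around at the ends of the data follows the kernel in the question, and the data length is assumed to be a non-zero multiple of block_size.

```cpp
#include <cstddef>
#include <vector>

// Host-side simulation of a blocked launch: one outer iteration per threadblock.
// Assumes in.size() is a non-zero multiple of block_size.
std::vector<float> blocked_neighbor_square(const std::vector<float>& in,
                                           std::size_t block_size) {
    const std::size_t n = in.size();
    std::vector<float> out(n);
    for (std::size_t block = 0; block < n / block_size; ++block) {
        const std::size_t base = block * block_size;
        // tile[0] and tile[block_size + 1] are halo cells for the neighbors
        // that belong to the previous and next block (wrapping at the ends).
        std::vector<float> tile(block_size + 2);
        tile[0] = in[(base + n - 1) % n];
        tile[block_size + 1] = in[(base + block_size) % n];
        for (std::size_t t = 0; t < block_size; ++t) {
            tile[t + 1] = in[base + t];  // each "thread" loads its own element
        }
        // On the GPU, __syncthreads() would go here, before any tile reads.
        for (std::size_t t = 0; t < block_size; ++t) {
            const float avg = (tile[t] + tile[t + 2]) * 0.5f;
            out[base + t] = avg * avg + tile[t + 1] * tile[t + 1];
        }
    }
    return out;
}
```

In a real kernel the element index would be blockIdx.x * blockDim.x + threadIdx.x rather than threadIdx.x alone, and one or two threads per block would take on the extra halo loads.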
You're also not doing proper CUDA error checking, which is recommended, especially any time you're having trouble with CUDA code.
If you make the change I indicated first in your kernel code, and reverse the order of your block and grid kernel launch parameters:
compute_it<<<1,N>>>(dev_a);
As indicated by Kristof, you will get something that comes close to what you want, I think. However you will not be able to conveniently scale that beyond N=1024 without other changes to your code.
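To check the output after those two changes, a host-side reference in plain C++ can help. This is a sketch; neighbor_square_reference is an illustrative name, and the wrap-around at both ends copies the indexing in the question's kernel. Note that with N=1024 and input 0,1,2,... the wrap-around gives element 0 the value 512*512 = 262144, not the 25 listed in the question; 25 is what the same formula gives when the data wraps at N=10. Elements 1 through 9 do match the question's list (2, 8, 18, ...).

```cpp
#include <cstddef>
#include <vector>

// CPU reference: for each element, square the average of its two neighbors
// (wrapping around at the ends) and add the element's own square.
std::vector<float> neighbor_square_reference(const std::vector<float>& in) {
    const std::size_t n = in.size();
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float left = in[i > 0 ? i - 1 : n - 1];
        const float right = in[i < n - 1 ? i + 1 : 0];
        const float avg = (left + right) * 0.5f;
        out[i] = avg * avg + in[i] * in[i];
    }
    return out;
}
```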
This line of code is also not correct:
free (dev_a);
Since dev_a was allocated on the device using cudaMalloc you should free it like this:
cudaFree (dev_a);

Since you have only one thread per block, your tid will always be 0.
Try launching the kernel this way:
compute_it<<<1,N>>>(dev_a);
instead of
compute_it<<<N,1>>>(dev_a);

Related

Can a thread-local copy of select elements be created of a shared 2D array in a parallel region? (Shared, private, barrier: OpenMP)

I have a 2-D grid of nxn elements. In one iteration, I'm calculating the value of one element by averaging the values of its neighbors. That is:
for(int i=0;i<n;i++)
    for(int j=0;j<n;j++)
        grid[i][j] = (grid[i-1][j] + grid[i][j-1] + grid[i+1][j] + grid[i][j+1])/4.0;
And I need to run the above nested loop for iter number of iterations.
What I need is the following:
I need the threads to calculate this average, wait till all the threads have finished calculating and THEN update the grid in one go.
The loop with iter iterations will run sequentially, but during every iteration, the value of grid[i][j] for every i and j should be calculated in parallel.
In order to do that I have the following ideas and questions:
Maybe make grid shared and put a copy of the select 4 elements of the grid that is needed for calculating grid[i][j] by making only those 4 elements private to the thread. (Basically grid is shared by all threads, but there is a local copy of 4 iteration-specific elements in every thread too.) Is this possible?
Would a barrier be in fact needed for all the threads to finish and then start onto the next iteration?
I'm very new to the OpenMP way of thinking and I'm utterly lost in this simple problem. I'd be grateful if somebody could help resolve my confusion.

In practice, you'd want to have (much) fewer threads than grid points, so each thread will be calculating a whole bunch of points (for example, one row). There is a certain overhead associated with starting OpenMP (or any other kind of) threads, and your program will be memory-bound rather than CPU-bound anyway. So starting a thread per grid point will defeat the whole purpose of parallelizing the computation. Hence, your idea #1 is not recommended (I am not quite sure I understood it correctly though; maybe this is not what you were proposing).
I would recommend (also pointed out by others in OP comments) you allocate twice the memory needed to store the grid values and use two pointers that are swapped between iterations: one points to memory holding previous iteration values that are read only, the other one to new iteration values that are write-only. Note that you will only swap the pointers, not actually copy the memory. After your iteration is done, you can copy the final result into desired location.
Yes, you need to synchronize threads between iterations, however in OpenMP this is usually done implicitly simply by opening a parallel region within the iteration loop (there is an implicit barrier at the end of a parallel region):
for (int iter = 0; iter < niter; ++iter)
{
    #pragma omp parallel
    {
        // get range of points for current thread
        // loop over thread's points and apply the stencil
    }
}
or, using a parallel for construct:
const int np = n*n;
for (int iter = 0; iter < niter; ++iter)
{
    #pragma omp parallel for
    for (int ip = 0; ip < np; ++ip)
    {
        const int i = ip / n;
        const int j = ip % n;
        // apply the stencil to [i,j]
    }
}
The second version will auto-distribute the work evenly between the available threads, which is most likely what you want. In the first you have to do it manually.
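Putting the two-buffer idea and the parallel for loop together, a minimal sketch might look like the following. The function name relax, the flat row-major std::vector storage, and the choice to leave the boundary rows and columns fixed are assumptions made for the example, not part of the question. If OpenMP is not enabled at compile time, the pragma is ignored and the loop simply runs serially.

```cpp
#include <utility>
#include <vector>

// Runs niter sweeps over an n x n grid stored row-major in `prev`.
// Assumption: boundary cells stay fixed; only interior cells are updated.
std::vector<double> relax(std::vector<double> prev, int n, int niter) {
    std::vector<double> next(prev);  // write buffer; boundary values copied once
    for (int iter = 0; iter < niter; ++iter) {
        // Rows are distributed over the threads; prev is only read, next only written.
        #pragma omp parallel for
        for (int i = 1; i < n - 1; ++i) {
            for (int j = 1; j < n - 1; ++j) {
                next[i * n + j] = (prev[(i - 1) * n + j] + prev[i * n + j - 1] +
                                   prev[(i + 1) * n + j] + prev[i * n + j + 1]) / 4.0;
            }
        }
        // The implicit barrier at the end of the parallel loop means all
        // writes are finished here, so the buffers can change roles.
        std::swap(prev, next);  // constant time: no element is copied
    }
    return prev;  // after the last swap, prev holds the newest values
}
```

Because every thread reads only from the previous iteration's buffer, no per-thread private copy of the four neighbors is needed.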

Does `mem_fence` provide consistency between work-groups?

I am trying to implement the bounding-box calculation as described here. Long story short, I have a binary tree of bounding boxes. The leaf nodes are all filled in, and now it is time to calculate the internal nodes. In addition to the nodes (each defining the child/parent indices), there is a counter for each internal node.
Starting at each leaf node, the parent node is visited and its flag atomically incremented. If this is the first visit to the node, the thread exits (as only one child is guaranteed to have been initialized). If it is the second visit, then both children are initialized, its bounding box is calculated and we continue with that node's parents.
Is the mem_fence between reading the flag and reading the data of its children sufficient to guarantee the data in the children will be visible?
kernel void internalBounds(global struct Bound * const bounds,
                           global unsigned int * const flags,
                           const global struct Node * const nodes) {
    const unsigned int n = get_global_size(0);
    const size_t D = 3;
    const size_t leaf_start = n - 1;
    size_t node_idx = leaf_start + get_global_id(0);
    do {
        node_idx = nodes[node_idx].parent;
        write_mem_fence(CLK_GLOBAL_MEM_FENCE);
        // Mark node as visited, both children initialized on second visit
        if (atomic_inc(&flags[node_idx]) < 1)
            break;
        read_mem_fence(CLK_GLOBAL_MEM_FENCE);
        const global unsigned int * child_idxs = nodes[node_idx].internal.children;
        for (size_t d = 0; d < D; d++) {
            bounds[node_idx].min[d] = min(bounds[child_idxs[0]].min[d],
                                          bounds[child_idxs[1]].min[d]);
            bounds[node_idx].max[d] = max(bounds[child_idxs[0]].max[d],
                                          bounds[child_idxs[1]].max[d]);
        }
    } while (node_idx != 0);
}
I am limited to OpenCL 1.2.

No, it doesn't. CLK_GLOBAL_MEM_FENCE only provides consistency within the work-group when accessing global memory. There is no inter-work-group synchronization in OpenCL 1.x.
Try to use a single, large workgroup and iterate over the data. And/or start with some small trees that will fit inside a single work group.
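For contrast, here is how the same bottom-up pattern looks on a CPU, where C++11 atomics provide the cross-thread ordering that, per the answer above, OpenCL 1.x fences do not provide across work-groups. This is an analogy in plain C++, not OpenCL code: the Node and Bound layouts, the one-dimensional bounds, and the function name propagate_from_leaf are made up for the sketch. The acq_rel increment is what makes the first child's writes visible to the thread that arrives second.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

struct Bound { float min, max; };  // 1-D bounds keep the sketch short
struct Node { std::size_t parent, left, right; };

// Walks from one leaf towards the root (node 0). Intended to be safe to call
// concurrently for different leaves: the second arrival at a node computes
// that node's bound from its two children.
void propagate_from_leaf(std::size_t leaf, const std::vector<Node>& nodes,
                         std::vector<Bound>& bounds,
                         std::vector<std::atomic<unsigned>>& flags) {
    std::size_t idx = leaf;
    do {
        idx = nodes[idx].parent;
        // Release publishes this thread's earlier writes; acquire makes the
        // other child's writes visible if this thread is the second visitor.
        if (flags[idx].fetch_add(1, std::memory_order_acq_rel) < 1) break;
        const Bound& a = bounds[nodes[idx].left];
        const Bound& b = bounds[nodes[idx].right];
        bounds[idx] = {std::min(a.min, b.min), std::max(a.max, b.max)};
    } while (idx != 0);
}
```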

https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/mem_fence.html
mem_fence(...) orders memory accesses only for a single work-item. Even if all work-items execute this line, they may not reach it (and continue past it) at the same time.
barrier(...) does synchronize all work-items in a work-group and makes them wait for the slowest one (with respect to the memory type given as the parameter), but only among the work-items of its own work-group (for example only 64 or 256 on AMD and Intel, and maybe 1024 on NVIDIA). The reason is that an OpenCL implementation may be designed to finish all resident wavefronts before loading the next batch, because all work-items of a large global range would simply not fit in on-chip memory. For example, 64M work-items each using 1 kB of local memory would need 64 GB; even software emulation would need hundreds or thousands of passes and would reduce performance to the level of a single CPU core.
Global synchronization (across all work-groups) is not possible.
In case the meanings of work-item, work-group and processing element get mixed up, see:
OpenCL: Work items, Processing elements, NDRange
The atomic function you put there already accesses global memory, so adding group-scope synchronization shouldn't matter.
Also check the generated machine code to see whether
bounds[child_idxs[0]].min[d]
loads the whole bounds[child_idxs[0]] struct into private memory before accessing min[d]. If it does, you can split min out into an independent array and access its items directly to get 100% more memory bandwidth for it.
Test on Intel HD 400 with more than 100000 threads:
__kernel void fenceTest( __global float *c,
                         __global int *ctr)
{
    int id=get_global_id(0);
    if(id<128000)
        for(int i=0;i<20000;i++)
        {
            c[id]+=ctr[0];
            mem_fence(CLK_GLOBAL_MEM_FENCE);
        }
    ctr[0]++;
}
2900ms (c array has garbage)
__kernel void fenceTest( __global float *c,
                         __global int *ctr)
{
    int id=get_global_id(0);
    if(id<128000)
        for(int i=0;i<20000;i++)
        {
            c[id]+=ctr[0];
        }
    ctr[0]++;
}
500 ms (c array has garbage). At 500 ms the fenceless version has roughly 6x the performance of the fence version. (My laptop has single-channel 4 GB RAM, which gives only 5-10 GB/s, while its iGPU local memory reaches nearly 38 GB/s: 64 B per cycle at 600 MHz.) The local-fence version takes 700 ms, so it seems the fenceless version does not even touch cache or local memory for some iterations.
Without the loop it takes 8-9 ms, so I suppose the compiler was not optimizing the loop away in these kernels.
Edit:
int id=get_global_id(0);
if(id==0)
{
    atom_inc(&ctr[0]);
    mem_fence(CLK_GLOBAL_MEM_FENCE);
}
mem_fence(CLK_GLOBAL_MEM_FENCE);
c[id]+=ctr[0];
behaves exactly as
int id=get_global_id(0);
if(id==0)
{
    ctr[0]++;
    mem_fence(CLK_GLOBAL_MEM_FENCE);
}
mem_fence(CLK_GLOBAL_MEM_FENCE);
c[id]+=ctr[0];
on this Intel iGPU, but only by chance. It shows that the changed memory is visible to "all" trailing threads in this run, but it does not prove that this always happens (for example, the first compute unit could hiccup and the second could start first), and the plain increment is not atomic when more than a single thread accesses it.

CUDA triple nested for loop assignment

I'm trying to convert C++ code into CUDA code, and I've got the following triple nested for loop that fills an array for later OpenGL rendering (I'm simply creating an array of vertex coordinates):
for(int z=0;z<263;z++) {
    for(int y=0;y<170;y++) {
        for(int x=0;x<170;x++) {
            g_vertex_buffer_data_3[i]=(float)x+0.5f;
            g_vertex_buffer_data_3[i+1]=(float)y+0.5f;
            g_vertex_buffer_data_3[i+2]=-(float)z+0.5f;
            i+=3;
        }
    }
}
I would like this to run faster, so I want to use CUDA for operations like the one above. I want to create one block for each iteration of the outermost loop and, since the two inner loops make 170 * 170 = 28900 iterations in total, assign one thread to each innermost loop iteration. I converted the C++ code into this (it's just a small program that I made to understand how to use CUDA):
__global__ void mykernel(int k, float *buffer) {
    int idz=blockIdx.x;
    int idx=threadIdx.x;
    int idy=threadIdx.y;
    buffer[k]=idx+0.5;
    buffer[k+1]=idy+0.5;
    buffer[k+2]=idz+0.5;
    k+=3;
}
int main(void) {
    int dim=3*170*170*263;
    float* g_vertex_buffer_data_2 = new float[dim];
    float* g_vertex_buffer_data_3;
    int i=0;
    HANDLE_ERROR(cudaMalloc((void**)&g_vertex_buffer_data_3, sizeof(float)*dim));
    dim3 dimBlock(170, 170);
    dim3 dimGrid(263);
    mykernel<<<dimGrid, dimBlock>>>(i, g_vertex_buffer_data_3);
    HANDLE_ERROR(cudaMemcpy(&g_vertex_buffer_data_2,g_vertex_buffer_data_3,sizeof(float)*dim,cudaMemcpyDeviceToHost));
    for(int j=0;j<100;j++){
        printf("g_vertex_buffer_data_2[%d]=%f\n",j,g_vertex_buffer_data_2[j]);
    }
    cudaFree(g_vertex_buffer_data_3);
    return 0;
}
When I try to launch it, I get a segmentation fault. Do you know what I am doing wrong?
I think the problem is that threadIdx.x and threadIdx.y grow at the same time, while I would like threadIdx.x to be the inner one and threadIdx.y to be the outer one.

There is a lot wrong here, but the source of the segfault is this:
cudaMemcpy(&g_vertex_buffer_data_2,g_vertex_buffer_data_3,
sizeof(float)*dim,cudaMemcpyDeviceToHost);
You either want
cudaMemcpy(&g_vertex_buffer_data_2[0],g_vertex_buffer_data_3,
sizeof(float)*dim,cudaMemcpyDeviceToHost);
or
cudaMemcpy(g_vertex_buffer_data_2,g_vertex_buffer_data_3,
sizeof(float)*dim,cudaMemcpyDeviceToHost);
Once you fix that, you will notice that the kernel never actually launches; the launch fails with an invalid launch configuration error. This is because a block size of (170,170), i.e. 28900 threads, is illegal. CUDA has a limit of 1024 threads per block on all current hardware.
There might well be other problems in your code. I stopped looking after I found these two.
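One way around the block-size limit is to give each thread a global (x, y, z) coordinate and derive the buffer offset from it, instead of passing in a running index k that every thread sees with the same value. The arithmetic can be checked on the host first. This sketch is plain C++; vertex_offset, write_vertex and the kDimX/kDimY/kDimZ constants are illustrative names, and in a kernel x, y and z would be computed from blockIdx, blockDim and threadIdx using blocks of at most 1024 threads.

```cpp
#include <cstddef>
#include <vector>

constexpr int kDimX = 170, kDimY = 170, kDimZ = 263;

// Offset of the first of the three floats written for vertex (x, y, z),
// matching the order in which the nested loops in the question advance i.
constexpr std::size_t vertex_offset(int x, int y, int z) {
    return 3 * ((static_cast<std::size_t>(z) * kDimY + y) * kDimX + x);
}

// What one thread would do for its own (x, y, z); no shared counter needed.
void write_vertex(std::vector<float>& buffer, int x, int y, int z) {
    const std::size_t i = vertex_offset(x, y, z);
    buffer[i] = static_cast<float>(x) + 0.5f;
    buffer[i + 1] = static_cast<float>(y) + 0.5f;
    buffer[i + 2] = -static_cast<float>(z) + 0.5f;
}
```

Since each thread's offset depends only on its own coordinates, the order in which threads run no longer matters.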

Efficiently Initializing Shared Memory Array in CUDA

Note that this shared memory array is never written to, only read from.
As I have it, my shared memory gets initialized like:
__shared__ float TMshared[2592];
for (int i = 0; i< 2592; i++)
{
    TMshared[i] = TM[i];
}
__syncthreads();
(TM is passed into all threads from kernel launch)
You might have noticed that this is highly inefficient as there is no parallelization going on and threads within the same block are writing to the same location.
Can someone please recommend a more efficient approach/comment on if this issue really needs optimization since the shared array in question is relatively small?
Thanks!

Use all threads to write independent locations; it will probably be quicker.
This example assumes a 1D threadblock/grid:
#define SSIZE 2592
__shared__ float TMshared[SSIZE];
int lidx = threadIdx.x;
while (lidx < SSIZE){
    TMshared[lidx] = TM[lidx];
    lidx += blockDim.x;
}
__syncthreads();
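To see which elements each thread ends up copying in that loop, the index pattern can be replayed on the host. This is a sketch in plain C++, with block_dim standing in for blockDim.x and tid for threadIdx.x; the function name is made up for the example.

```cpp
#include <vector>

// Indices that thread `tid` of a 1D block with `block_dim` threads would copy
// in a block-stride loop over `size` elements.
std::vector<int> indices_for_thread(int tid, int block_dim, int size) {
    std::vector<int> indices;
    for (int lidx = tid; lidx < size; lidx += block_dim) {
        indices.push_back(lidx);
    }
    return indices;
}
```

With 256 threads and 2592 elements, thread 0 copies indices 0, 256, 512, ..., 2560 (11 elements) while thread 255 copies 255, 511, ..., 2559 (10 elements). In each pass adjacent threads touch adjacent addresses, and every element is written by exactly one thread.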

Trying to understand prefix sum execution

I am trying to understand the scan implementation scan-then-fan mentioned in the book: The CUDA Handbook.
Can someone explain the device function scanWarp? Why the negative indexes? Could you give a numerical example?
I have the same question about the line warpPartials[16+warpid] = sum. How does the assignment happen?
What is the contribution of this line: if ( warpid==0 ) {scanWarp<T,bZeroPadded>( 16+warpPartials+tid ); }
Could someone please explain sum += warpPartials[16+warpid-1]; ? A numerical example would be highly appreciated.
Finally, a more C++-oriented question: how do we know which indexes are used in *sPartials = sum; to store values in sPartials?
PS: A numerical example that demonstrates the whole execution would be very helpful.
template < class T, bool bZeroPadded >
inline __device__ T
scanBlock( volatile T *sPartials ){
    extern __shared__ T warpPartials[];
    const int tid = threadIdx.x;
    const int lane = tid & 31;
    const int warpid = tid >> 5;
    //
    // Compute this thread's partial sum
    //
    T sum = scanWarp<T,bZeroPadded>( sPartials );
    __syncthreads();
    //
    // Write each warp's reduction to shared memory
    //
    if ( lane == 31 ) {
        warpPartials[16+warpid] = sum;
    }
    __syncthreads();
    //
    // Have one warp scan reductions
    //
    if ( warpid==0 ) {
        scanWarp<T,bZeroPadded>( 16+warpPartials+tid );
    }
    __syncthreads();
    //
    // Fan out the exclusive scan element (obtained
    // by the conditional and the decrement by 1)
    // to this warp's pending output
    //
    if ( warpid > 0 ) {
        sum += warpPartials[16+warpid-1];
    }
    __syncthreads();
    //
    // Write this thread's scan output
    //
    *sPartials = sum;
    __syncthreads();
    //
    // The return value will only be used by caller if it
    // contains the spine value (i.e. the reduction
    // of the array we just scanned).
    //
    return sum;
}
template < class T >
inline __device__ T
scanWarp( volatile T *sPartials ){
    const int tid = threadIdx.x;
    const int lane = tid & 31;
    if ( lane >= 1 ) sPartials[0] += sPartials[- 1];
    if ( lane >= 2 ) sPartials[0] += sPartials[- 2];
    if ( lane >= 4 ) sPartials[0] += sPartials[- 4];
    if ( lane >= 8 ) sPartials[0] += sPartials[- 8];
    if ( lane >= 16 ) sPartials[0] += sPartials[-16];
    return sPartials[0];
}

The scan-then-fan strategy is applied at two levels. For the grid-level scan (which operates on global memory), partials are written to the temporary global memory buffer allocated in the host code, scanned by recursively calling the host function, then added to the eventual output with a separate kernel invocation. For the block-level scan (which operates on shared memory), partials are written to the base of shared memory (warpPartials[]), scanned by one warp, then added to the eventual output of the block-level scan. The code that you are asking about is doing the block-level scan.
The implementation of scanWarp that you are referencing is called with a shared memory pointer that has already had threadIdx.x added to it, so each thread's version of sPartials points to a different shared memory element. Using a fixed index on sPartials causes adjacent threads to operate on adjacent shared memory elements. Negative indices are okay as long as they do not result in out-of-bounds array indexing. This implementation borrowed from the optimized version that pads shared memory with zeros, so every thread can unconditionally use a fixed negative index and threads below a certain index just read zeros. (Listing 13.14) It could just as easily have predicated execution on the lowest threads in the warp and used positive indices.
The last thread (lane 31) of each 32-thread warp contains that warp's partial sum, which has to be stored somewhere in order to be scanned and then added to the output. warpPartials[] aliases shared memory from the first element, so it can be used to hold each warp's partial sum. You could use any part of shared memory to do this calculation, because each thread already has its own scan value in registers (the assignment T sum = scanWarp...).
Some warp (it could be any warp, so it might as well be warp 0) has to scan the partials that were written to warpPartials[]. At most one warp is needed because there is a hardware limitation of 1024 threads per block, i.e. 1024/32 = 32 warps. So this code is taking advantage of the coincidence that the maximum number of threads per block, divided by the warp size, is no larger than the number of threads per warp.
This code is adding the scanned per-warp partials to each output element. The first warp already has the correct values, so the addition is done only by the second and subsequent warps. Another way to look at this is that it's adding the exclusive scan of the warp partials to the output.
scanBlock is a device function - the address arithmetic gets done by its caller, scanAndWritePartials: volatile T *myShared = sPartials+tid;

(Answer rewritten now I have more time)
Here's an example (based on an implementation I wrote in C++ AMP, not CUDA). To make the diagram smaller each warp is 4 elements wide and a block is 16 elements.
The following paper is also pretty useful Efficient Parallel Scan Algorithms for GPUs. As is Parallel Scan for Stream Architectures.
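A numerical walk-through with the same sizes (4-wide warps, 16-element block) can also be replayed in plain C++. This is a sketch of the scan-then-fan steps, not the CUDA code from the question: scan_block_simulated and its parameters are illustrative, each warp's scan is done with a simple serial loop, and the input length is assumed to be a multiple of a non-zero warp_size.

```cpp
#include <cstddef>
#include <vector>

// Block-level scan-then-fan, simulated on the host.
std::vector<int> scan_block_simulated(const std::vector<int>& in,
                                      std::size_t warp_size) {
    std::vector<int> sum(in);  // one "sum" register per thread
    const std::size_t num_warps = in.size() / warp_size;
    // 1. Inclusive scan inside each warp.
    for (std::size_t w = 0; w < num_warps; ++w) {
        for (std::size_t lane = 1; lane < warp_size; ++lane) {
            sum[w * warp_size + lane] += sum[w * warp_size + lane - 1];
        }
    }
    // 2. The last lane of each warp publishes the warp's total.
    std::vector<int> warp_partials(num_warps);
    for (std::size_t w = 0; w < num_warps; ++w) {
        warp_partials[w] = sum[w * warp_size + warp_size - 1];
    }
    // 3. One warp scans the totals.
    for (std::size_t w = 1; w < num_warps; ++w) {
        warp_partials[w] += warp_partials[w - 1];
    }
    // 4. Fan out: each warp after the first adds the scanned total of all
    //    warps before it (hence the index w - 1).
    for (std::size_t w = 1; w < num_warps; ++w) {
        for (std::size_t lane = 0; lane < warp_size; ++lane) {
            sum[w * warp_size + lane] += warp_partials[w - 1];
        }
    }
    return sum;
}
```

For sixteen 1s and a warp size of 4: step 1 leaves 1,2,3,4 in every warp; step 2 collects the partials 4,4,4,4; step 3 scans them to 4,8,12,16; step 4 adds 4 to warp 1 (5,6,7,8), 8 to warp 2 (9,...,12) and 12 to warp 3 (13,...,16).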
