Trying to understand prefix sum execution - c++

I am trying to understand the scan-then-fan scan implementation described in the book The CUDA Handbook.
Can someone explain the device function scanWarp? Why the negative indices? Could you please give a numerical example?
I have the same question about the line warpPartials[16+warpid] = sum. How does that assignment happen?
What is the contribution of the line if ( warpid==0 ) { scanWarp<T,bZeroPadded>( 16+warpPartials+tid ); }?
Could someone please explain sum += warpPartials[16+warpid-1];? A numerical example would be highly appreciated.
Finally, a more C++-oriented question: how do we know which indices are used in *sPartials = sum; to store values into sPartials?
PS: A numerical example that demonstrates the whole execution would be very helpful.
template < class T, bool bZeroPadded >
inline __device__ T
scanBlock( volatile T *sPartials ){
    extern __shared__ T warpPartials[];
    const int tid = threadIdx.x;
    const int lane = tid & 31;
    const int warpid = tid >> 5;
    //
    // Compute this thread's partial sum
    //
    T sum = scanWarp<T,bZeroPadded>( sPartials );
    __syncthreads();
    //
    // Write each warp's reduction to shared memory
    //
    if ( lane == 31 ) {
        warpPartials[16+warpid] = sum;
    }
    __syncthreads();
    //
    // Have one warp scan reductions
    //
    if ( warpid==0 ) {
        scanWarp<T,bZeroPadded>( 16+warpPartials+tid );
    }
    __syncthreads();
    //
    // Fan out the exclusive scan element (obtained
    // by the conditional and the decrement by 1)
    // to this warp's pending output
    //
    if ( warpid > 0 ) {
        sum += warpPartials[16+warpid-1];
    }
    __syncthreads();
    //
    // Write this thread's scan output
    //
    *sPartials = sum;
    __syncthreads();
    //
    // The return value will only be used by caller if it
    // contains the spine value (i.e. the reduction
    // of the array we just scanned).
    //
    return sum;
}

template < class T >
inline __device__ T
scanWarp( volatile T *sPartials ){
    const int tid = threadIdx.x;
    const int lane = tid & 31;
    if ( lane >=  1 ) sPartials[0] += sPartials[- 1];
    if ( lane >=  2 ) sPartials[0] += sPartials[- 2];
    if ( lane >=  4 ) sPartials[0] += sPartials[- 4];
    if ( lane >=  8 ) sPartials[0] += sPartials[- 8];
    if ( lane >= 16 ) sPartials[0] += sPartials[-16];
    return sPartials[0];
}

The scan-then-fan strategy is applied at two levels. For the grid-level scan (which operates on global memory), partials are written to the temporary global memory buffer allocated in the host code, scanned by recursively calling the host function, then added to the eventual output with a separate kernel invocation. For the block-level scan (which operates on shared memory), partials are written to the base of shared memory (warpPartials[]), scanned by one warp, then added to the eventual output of the block-level scan. The code that you are asking about is doing the block-level scan.
The implementation of scanWarp that you are referencing is called with a shared memory pointer that has already had threadIdx.x added to it, so each thread's version of sPartials points to a different shared memory element. Using a fixed index on sPartials causes adjacent threads to operate on adjacent shared memory elements. Negative indices are okay as long as they do not result in out-of-bounds array indexing. This implementation was borrowed from the optimized version (Listing 13.14) that pads shared memory with zeros, so every thread can unconditionally use a fixed negative index and the lowest threads simply read zeros. It could just as easily have predicated execution on the lowest threads in the warp and used positive indices.
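Since the question asks for a numerical example, here is a small host-side C++ sketch (my illustration, not code from the book) that reproduces, pass by pass, what the lockstep warp scan does to a toy 8-lane "warp" whose elements are all 1. Iterating the lanes from high to low mimics the SIMD behaviour in which every lane reads its neighbour's value from before the current pass:
#include <cstdio>

int main() {
    const int W = 8;                              // toy warp width (real warps are 32)
    int s[W] = {1, 1, 1, 1, 1, 1, 1, 1};
    // Each pass mimics "if (lane >= offset) sPartials[0] += sPartials[-offset]".
    for (int offset = 1; offset < W; offset *= 2) {
        for (int lane = W - 1; lane >= offset; --lane)
            s[lane] += s[lane - offset];          // s[lane - offset] still holds its pre-pass value
        printf("after offset %d:", offset);
        for (int lane = 0; lane < W; ++lane) printf(" %d", s[lane]);
        printf("\n");
    }
    // Prints:
    //   after offset 1: 1 2 2 2 2 2 2 2
    //   after offset 2: 1 2 3 4 4 4 4 4
    //   after offset 4: 1 2 3 4 5 6 7 8   <- the inclusive prefix sum of the input
    return 0;
}
In the real kernel the "previous value" reads work because the 32 lanes of a warp execute each line together, and a negative index simply means "the element owned by the lane that many places to my left".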
The last thread (lane 31) of each 32-thread warp contains that warp's partial sum, which has to be stored somewhere in order to be scanned and then added to the output. warpPartials[] aliases shared memory starting at the first element, so it can be used to hold each warp's partial sum. You could use any part of shared memory for this calculation, because each thread already has its own scan value in registers (the assignment T sum = scanWarp...).
Some warp (it could be any warp, so it might as well be warp 0) has to scan the partials that were written to warpPartials[]. At most one warp is needed because of the hardware limit of 1024 threads per block, i.e. 1024/32 = 32 warps, so there are at most 32 partials to scan. In other words, this code takes advantage of the coincidence that the maximum number of threads per block, divided by the warp size, is no larger than the warp size.
This code adds the scanned per-warp partials to each output element. The first warp already has the correct values, so the addition is done only by the second and subsequent warps. Another way to look at it is that it adds the exclusive scan of the warp partials to the output. As a numerical example: with 128 threads (4 warps) all scanning the value 1, each warp's inclusive scan ends in 32, so warpPartials[16..19] holds {32,32,32,32}; after warp 0 scans it, it holds {32,64,96,128}, and warps 1, 2 and 3 then add 32, 64 and 96 respectively, giving every thread the value warpid*32 + lane + 1, which is the inclusive scan of 128 ones.
scanBlock is a device function - the address arithmetic gets done by its caller, scanAndWritePartials: volatile T *myShared = sPartials+tid;

(Answer rewritten now that I have more time.)
Here's an example (based on an implementation I wrote in C++ AMP, not CUDA). To make the diagram smaller, each warp is 4 elements wide and a block is 16 elements.
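The original diagram is not reproduced here, but a small host-side C++ sketch (my own reconstruction, with the input 1..16 chosen purely for illustration) shows the same four stages on a 16-element block made of 4-element warps:
#include <cstdio>

int main() {
    const int W = 4, N = 16, NW = N / W;          // warp width, block size, number of warps
    int x[N], partials[NW];
    for (int i = 0; i < N; ++i) x[i] = i + 1;     // 1 2 3 ... 16

    // 1) inclusive scan inside each warp
    for (int w = 0; w < NW; ++w)
        for (int l = 1; l < W; ++l) x[w*W + l] += x[w*W + l - 1];
    // x = 1 3 6 10 | 5 11 18 26 | 9 19 30 42 | 13 27 42 58

    // 2) the last lane of each warp publishes its warp total
    for (int w = 0; w < NW; ++w) partials[w] = x[w*W + W - 1];
    // partials = 10 26 42 58

    // 3) one warp scans the warp totals
    for (int w = 1; w < NW; ++w) partials[w] += partials[w - 1];
    // partials = 10 36 78 136

    // 4) fan out: warp w (w > 0) adds partials[w-1] to each of its elements
    for (int w = 1; w < NW; ++w)
        for (int l = 0; l < W; ++l) x[w*W + l] += partials[w - 1];

    for (int i = 0; i < N; ++i) printf("%d ", x[i]);
    printf("\n");  // 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120 136
    return 0;
}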
The following paper is also pretty useful: Efficient Parallel Scan Algorithms for GPUs. As is Parallel Scan for Stream Architectures.

CUDA lane ID vs threadIdx.x based computation

It's easiest to explain via cub::LaneId() or a function like the following:
inline __device__ unsigned get_lane_id() {
    unsigned ret;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(ret));
    return ret;
}
Versus computing the lane ID as threadIdx.x & 31.
Do these 2 approaches produce the same value in a 1D grid?
__ballot_sync() documentation speaks of lane IDs in its mask parameter, and as I understand it returns the bits set per lane ID. So would the following asserts never fail?
int nWarps = /*...*/;
bool condition = /*...*/;
if(threadIdx.x < nWarps) {
    assert(__activemask() == ((1u<<nWarps)-1));
    uint32_t res = __ballot_sync(__activemask(), condition);
    assert(bool(res & (1<<threadIdx.x)) == condition);
}
From the PTX ISA documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers-laneid
%laneid A predefined, read-only special register that returns the thread's lane within the warp. The lane identifier ranges from zero to WARP_SZ-1.
This register will always contain the correct value, whereas threadIdx.x & 31 assumes that the warp size is 32. However, for all GPU generations to date the warp size has been 32, so for both old and current architectures the computed lane will be identical. There is no guarantee that this will always be the case, though.
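If you want to avoid hard-coding 32 without resorting to inline PTX, a third option (a sketch, with a hypothetical helper name) is the built-in warpSize variable. Note that warpSize is not a compile-time constant, so the compiler cannot reduce the modulo to a cheap AND the way it can with a literal 31:
__device__ unsigned lane_id_portable() {
    // warpSize is a built-in device-side variable, so this stays correct even
    // if the warp size ever changes, at the cost of a runtime modulo.
    return threadIdx.x % warpSize;
}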
On your question regarding the assertions: with independent thread scheduling, there is no guarantee that all threads in a warp will execute __activemask() at the same time, so I think the assertion may fail.
Quoting from the programming guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#independent-thread-scheduling-7-x
Note that threads within a warp can diverge even within a single code path. As a result, __activemask() and __ballot(1) may return only a subset of the threads on the current code path.
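One way to sidestep __activemask() entirely is to have every lane of the warp participate in the ballot before the divergent branch. A sketch, reusing the question's nWarps and condition, and assuming the block size is a multiple of 32 and that every thread can evaluate condition:
// All 32 lanes of a full warp call __ballot_sync with the full mask, so the
// result is well defined regardless of how the threads are scheduled.
uint32_t res = __ballot_sync(0xFFFFFFFFu, threadIdx.x < nWarps && condition);
if (threadIdx.x < nWarps) {
    // bit k of res is set iff lane k satisfied (threadIdx.x < nWarps && condition)
    bool mine = res & (1u << threadIdx.x);
}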
Do these 2 approaches produce the same value in a 1D grid?
Yes (while CUDA's warp size is 32). See also this question:
What's the most efficient way to calculate the warp id / lane id in a 1-D grid?
But I'd write it this way:
enum { warp_size = 32 };
// ...
inline __device__ unsigned lane_id() {
    constexpr const auto lane_id_mask = warp_size - 1;
    return threadIdx.x & lane_id_mask;
}
and if you want to be extra-pedantic, you could always static-assert to ensure the warp size is a power of 2 :-P
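For completeness, that assertion could be a one-liner next to the enum above (assuming C++11 or later):
static_assert((warp_size & (warp_size - 1)) == 0,
              "lane_id() relies on warp_size being a power of 2");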
So would the following asserts never fail?
That code looks weird. Why would you left-shift by the thread ID or the number of warps? Don't see why that shouldn't fail.

How to move image data from within a searching window to local memory OpenCL

I've implemented an exhaustive block matching algorithm in parallel using OpenCL and am now trying to optimise the algorithm by moving the searching window into local memory. The code I have so far is as follows:
//loop through whole search space and move to local
for (int i = -searchWindow-blockSize; i <= searchWindow+blockSize; i++) {
    for (int j = -searchWindow-blockSize; j <= searchWindow+blockSize; j++) {
        tgid = (cache[lid].x + i) + (cache[lid].y + j) * imWidth;
        nlid = (blockSize+searchWindow + i) + (blockSize+searchWindow + j) * ((searchWindow+blockSize) * 2 + 1);
        prevCache[nlid] = prevFrame[tgid];
        nextCache[nlid] = nextFrame[tgid];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
The nested for loops iterate over the searching window, and as the image data is 1-dimensional I need to convert the coordinates to a 1D index by doing x + y * width to get tgid and nlid. cache[lid] is of type float2 and stores the x and y coordinates of the current point of reference for the kernel location in the whole image. lid is the local work-item id from get_local_id(0). So I am fetching the data from the searching window around the point of reference in the whole image and moving it into a local prevCache and nextCache, on which I can then run my block matching algorithm.
The issue I am getting is that the correct data is not always being assigned to the cache. To test this, I made prevCache and nextCache store the same data from only nextFrame like the code below:
//loop through whole search space and move to local
for (int i = -searchWindow-blockSize; i <= searchWindow+blockSize; i++) {
    for (int j = -searchWindow-blockSize; j <= searchWindow+blockSize; j++) {
        tgid = (cache[lid].x + i) + (cache[lid].y + j) * imWidth;
        nlid = (blockSize+searchWindow + i) + (blockSize+searchWindow + j) * ((searchWindow+blockSize) * 2 + 1);
        prevCache[nlid] = nextFrame[tgid];
        nextCache[nlid] = nextFrame[tgid];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
if (prevCache[220] != nextCache[220])
printf("test");
This if statement at the bottom should always be false, as prevCache and nextCache should contain the same data. However, when I run the code, 'test' is printed to the console several times. It seems to me that it is a synchronisation problem or something, because when I change cache[lid].x and cache[lid].y to a fixed number, 'test' is never printed, but I have no clue at this point. Any help would be greatly appreciated!
Concerning local memory for optimization in general, let me paste something from my thesis here for you to think about:
Local memory is smaller but faster than global memory. Every compute unit has its own local memory, which is accessible to every processing element within this compute unit. Kernels can trigger a synchronized bulk loading of data from global memory into local memory. This allows fast and efficient access to that bulk of data for the processing elements, which can under certain conditions prove beneficial for the runtime. Let us assume that bulk loading a chunk of data of size s into local memory takes t_s time units. Every work-item will perform k random-access read or write actions within this chunk of data. The time needed for one read/write action to local memory is denoted t_l and the time needed for one read/write action from the global memory cache is denoted t_g. We assume that the relation t_l < t_g holds. Loading a chunk of global memory into local memory is then beneficial to the runtime if the number of read/write actions k is large enough to satisfy:
k · t_g > t_s + k · t_l
In general it is better to use the synchronized bulk-copy operation async_work_group_copy provided by OpenCL to copy from global to local memory than to copy word by word manually.
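For example, a minimal OpenCL C sketch of such a bulk copy (the kernel and parameter names are placeholders, not from the question):
__kernel void cache_row(__global const float *image,
                        __local float *localRow,
                        int rowStart,
                        int rowLen)
{
    // Every work-item in the group must reach this call with identical arguments.
    event_t e = async_work_group_copy(localRow, image + rowStart, rowLen, 0);
    wait_group_events(1, &e);
    // From here on, every work-item in the group can read localRow[0..rowLen-1].
}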
Concerning your code and your problem:
The processing elements in your work-group write to the same slice of local memory but from different points in global memory, effectively overwriting each other's data (a sketch of one possible fix follows these points). Proof:
If your work-group size is at least 2, then there is one work-item with lid=0 and one with lid=1.
If cache[0].x != cache[1].x or cache[0].y != cache[1].y, then tgid has a different value for those two processing elements.
Your index nlid is the same for all your processing elements, as long as the state of (i,j) is synchronized across your processing elements (which you guaranteed with the local memory fence).
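One common fix is to have the work-group cooperatively load a single shared region, with each work-item writing a distinct, strided subset of the local buffer. A sketch, reusing the variable names from your kernel plus placeholders (windowW, windowH, groupOriginX, groupOriginY) that your code would have to define:
int lsize   = get_local_size(0);
int winSize = windowW * windowH;            // size of the region cached for the whole group
for (int k = lid; k < winSize; k += lsize) {
    int wx = k % windowW;                   // position inside the cached window
    int wy = k / windowW;
    int gx = groupOriginX + wx;             // corresponding global pixel
    int gy = groupOriginY + wy;
    prevCache[k] = prevFrame[gx + gy * imWidth];
    nextCache[k] = nextFrame[gx + gy * imWidth];
}
barrier(CLK_LOCAL_MEM_FENCE);               // one barrier after the whole copy
Because k starts at lid and advances by the work-group size, no two work-items ever write the same local element.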
Finally, a note concerning your nested loops.
If you really want to do it this way, try an approach that reads memory that lies close together in one go. Let's say your image is organized like this:
1 2 3 ... w
w+1 w+2 w+3 ... 2w
2w+1 2w+2 2w+3 ... 3w
....
If you wanted to read a 3-by-3 window starting in the upper left corner, you should construct your loop like this:
for (int row = start_pos.y; row < start_pos.y+3; row++) {
    for (int col = start_pos.x; col < start_pos.x+3; col++) {
        tmp = data[col+w*row];
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}
This way you read the data in this order: 1, 2, 3, w+1, w+2, w+3, 2w+1, 2w+2, 2w+3.
OpenCL would cache a chunk of global memory, your processing elements would read 1, 2, 3 from that cache, OpenCL would cache the next chunk, your kernel would read the next words, and so on.
The way you wrote it, the data would be read in this order: 1, w+1, 2w+1, 2, w+2, 2w+2, 3, w+3, 2w+3.
For this, OpenCL will again cache global memory, your processing elements read their first data word, OpenCL starts caching again, your PEs read their second data word, and so on. This leads to loading the same three chunks of global memory into the global memory cache multiple times, when loading each of them once would suffice.

Does `mem_fence` provide consistency between work-groups?

I am trying to implement the bounding-box calculation as described here. Long story short, I have a binary tree of bounding boxes. The leaf nodes are all filled in, and now it is time to calculate the internal nodes. In addition to the nodes (each defining the child/parent indices), there is a counter for each internal node.
Starting at each leaf node, the parent node is visited and its flag atomically incremented. If this is the first visit to the node, the thread exits (as only one child is guaranteed to have been initialized). If it is the second visit, then both children are initialized, its bounding box is calculated and we continue with that node's parents.
Is the mem_fence between reading the flag and reading the data of its children sufficient to guarantee the data in the children will be visible?
kernel void internalBounds(global struct Bound * const bounds,
                           global unsigned int * const flags,
                           const global struct Node * const nodes) {
    const unsigned int n = get_global_size(0);
    const size_t D = 3;
    const size_t leaf_start = n - 1;
    size_t node_idx = leaf_start + get_global_id(0);
    do {
        node_idx = nodes[node_idx].parent;
        write_mem_fence(CLK_GLOBAL_MEM_FENCE);
        // Mark node as visited, both children initialized on second visit
        if (atomic_inc(&flags[node_idx]) < 1)
            break;
        read_mem_fence(CLK_GLOBAL_MEM_FENCE);
        const global unsigned int * child_idxs = nodes[node_idx].internal.children;
        for (size_t d = 0; d < D; d++) {
            bounds[node_idx].min[d] = min(bounds[child_idxs[0]].min[d],
                                          bounds[child_idxs[1]].min[d]);
            bounds[node_idx].max[d] = max(bounds[child_idxs[0]].max[d],
                                          bounds[child_idxs[1]].max[d]);
        }
    } while (node_idx != 0);
}
I am limited to OpenCL 1.2.
No, it doesn't. CLK_GLOBAL_MEM_FENCE only provides consistency within the work-group when accessing global memory. There is no inter-workgroup synchronization in OpenCL 1.x.
Try to use a single, large workgroup and iterate over the data. And/or start with some small trees that will fit inside a single work group.
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/mem_fence.html
mem_fence(...) only orders the memory accesses of a single work-item. Even if all work-items contain this line, they may not reach (and pass) it at the same time.
barrier(...) does synchronize all work-items in a work-group and makes them wait for the slowest one (with respect to accesses to the memory space given as its parameter), but only for the work-items of its own work-group (e.g. 64 or 256 on AMD/Intel, up to 1024 on NVIDIA). An OpenCL driver implementation may be designed to finish all in-flight wavefronts before launching new ones, because all global work-items simply would not fit on the chip at once (e.g. 64M work-items each using 1 kB of local memory would need 64 GB; even software emulation would need hundreds or thousands of passes and drop performance to single-core CPU level).
Global sync (where all work-groups are synchronized) is not possible.
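A minimal OpenCL C sketch of the difference (an illustration only, not a fix for the tree code in the question):
__kernel void illustrate(__global int *g, __local int *l) {
    int lid = get_local_id(0);
    l[lid] = g[get_global_id(0)];
    // mem_fence only orders THIS work-item's own memory operations; it does not
    // make the other work-items' writes to l[] complete or visible.
    mem_fence(CLK_LOCAL_MEM_FENCE);
    // barrier makes every work-item of the group wait here, so afterwards all of
    // the l[] writes above are visible to the whole work-group.
    barrier(CLK_LOCAL_MEM_FENCE);
    int neighbour = l[(lid + 1) % get_local_size(0)];
    g[get_global_id(0)] = neighbour;
}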
Just in case the meanings of work-item, work-group and processing element get mixed up, see:
OpenCL: Work items, Processing elements, NDRange
The atomic function you put there already accesses global memory, so adding group-scope synchronization shouldn't be important.
Also check the machine code to see whether
bounds[child_idxs[0]].min[d]
loads the whole bounds[child_idxs[0]] struct into private memory before accessing min[d]. If it does, you can separate min out into an independent array and access its items directly to get up to 100% more memory bandwidth for it.
A test on an Intel HD 400, with more than 100000 threads:
__kernel void fenceTest( __global float *c,
                         __global int *ctr)
{
    int id = get_global_id(0);
    if (id < 128000)
        for (int i = 0; i < 20000; i++)
        {
            c[id] += ctr[0];
            mem_fence(CLK_GLOBAL_MEM_FENCE);
        }
    ctr[0]++;
}
2900 ms (the c array contains garbage)
__kernel void fenceTest( __global float *c,
                         __global int *ctr)
{
    int id = get_global_id(0);
    if (id < 128000)
        for (int i = 0; i < 20000; i++)
        {
            c[id] += ctr[0];
        }
    ctr[0]++;
}
500 ms (the c array again contains garbage). That is roughly 6x the performance of the fenced version (my laptop has single-channel 4 GB RAM at only 5-10 GB/s, but its iGPU's local memory reaches nearly 38 GB/s: 64 B per cycle at 600 MHz). A local-fence version takes 700 ms, so it seems the fenceless version isn't even touching the cache or local memory for some iterations.
Without the loop it takes 8-9 ms, so I suppose the loop wasn't being optimized away in these kernels.
Edit:
int id = get_global_id(0);
if (id == 0)
{
    atom_inc(&ctr[0]);
    mem_fence(CLK_GLOBAL_MEM_FENCE);
}
mem_fence(CLK_GLOBAL_MEM_FENCE);
c[id] += ctr[0];
behaves exactly as
int id = get_global_id(0);
if (id == 0)
{
    ctr[0]++;
    mem_fence(CLK_GLOBAL_MEM_FENCE);
}
mem_fence(CLK_GLOBAL_MEM_FENCE);
c[id] += ctr[0];
on this Intel iGPU device (only by chance; it shows that the changed memory is visible to "all" trailing threads, but it doesn't prove that this always happens (the first compute unit could hiccup and the second could start first, for example), and it is not atomic when more than a single thread accesses it).

CUDA coalesced one warp on multiple data

I have a basic question on coalesced CUDA access.
For example, I have an Array of 32 Elements and 32 threads, each thread accesses one element.
__global__ void co_acc ( int A[32], int B[32] ) {
    int inx = threadIdx.x + (blockIdx.x * blockDim.x);
    B[inx] = A[inx];
}
Now, what I want to know: if I have the 32 threads but an array of 64 elements, each thread has to copy 2 elements. To keep the access coalesced, I should shift the index for the array access by the number of threads I have, e.g. the thread with ID 0 will access A[0] and A[0+32]. Am I right with this assumption?
__global__ void co_acc ( int A[64], int B[64] ) {
    int inx = threadIdx.x + (blockIdx.x * blockDim.x);
    int actions = 64/blockDim.x;
    for ( int i = 0; i < actions; ++i )
        B[inx+(i*blockDim.x)] = A[inx+(i*blockDim.x)];
}
To keep the access coalesced, I should shift the index for the array access by the number of threads I have, e.g. the thread with ID 0 will access A[0] and A[0+32]. Am I right with this assumption?
Yes, that's a correct approach.
Strictly speaking it's not should but rather could: any memory access will be coalesced as long as all threads within a warp request addresses that fall within the same (aligned) 128 byte line. This means you could permute the thread indices and your accesses would still be coalesced (but why do complicated when you can do simple).
Another solution would be to have each thread load an int2:
__global__ void co_acc ( int A[64], int B[64] ) {
    int inx = threadIdx.x + (blockIdx.x * blockDim.x);
    reinterpret_cast<int2*>(B)[inx] = reinterpret_cast<int2*>(A)[inx];
}
This is (in my opinion) simpler and clearer code, and might give marginally better performance as this may reduce the number of instructions emitted by the compiler and the latency between memory requests (disclaimer: I have not tried it).
Note: as Robert Crovella has mentioned in his comment, if you really are using thread blocks of 32 threads, then you are likely seriously underusing the capacity of your GPU.
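If you do scale up, the usual pattern is to launch many blocks and let each thread handle several elements with a grid-stride loop. A sketch (my illustration, with n the element count passed in by the caller):
__global__ void copy_gridstride(const int *A, int *B, int n)
{
    int stride = gridDim.x * blockDim.x;                 // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        B[i] = A[i];                                     // consecutive threads touch consecutive ints
}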

CUDA shared memory programming is not working

all:
I am learning how shared memory accelerates GPU programs. I am using the code below to calculate, for each element, its squared value plus the squared value of the average of its left and right neighbors.
The code runs, however, the result is not as expected.
The first 10 results printed out are 0,1,2,3,4,5,6,7,8,9, while I am expecting 25,2,8,18,32,50,72,98,128,162.
The code is as follows, with reference to here:
Would you please tell me which part goes wrong? Your help is very much appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <cuda.h>
const int N=1024;
__global__ void compute_it(float *data)
{
    int tid = threadIdx.x;
    __shared__ float myblock[N];
    float tmp;

    // load the thread's data element into shared memory
    myblock[tid] = data[tid];

    // ensure that all threads have loaded their values into
    // shared memory; otherwise, one thread might be computing
    // on uninitialized data.
    __syncthreads();

    // compute the average of this thread's left and right neighbors
    tmp = (myblock[tid>0?tid-1:(N-1)] + myblock[tid<(N-1)?tid+1:0]) * 0.5f;
    // square the previous result and add my value, squared
    tmp = tmp*tmp + myblock[tid]*myblock[tid];

    // write the result back to global memory
    data[tid] = myblock[tid];
    __syncthreads();
}
int main (){
    char key;
    float *a;
    float *dev_a;

    a = (float*)malloc(N*sizeof(float));
    cudaMalloc((void**)&dev_a, N*sizeof(float));

    for (int i=0; i<N; i++){
        a[i] = i;
    }

    cudaMemcpy(dev_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
    compute_it<<<N,1>>>(dev_a);
    cudaMemcpy(a, dev_a, N*sizeof(float), cudaMemcpyDeviceToHost);

    for (int i=0; i<10; i++){
        std::cout<<a[i]<<",";
    }

    std::cin>>key;
    free (a);
    free (dev_a);
}
One of the most immediate problems in your kernel code is this:
data[tid] = myblock[tid];
I think you probably meant this:
data[tid] = tmp;
In addition, you're launching 1024 blocks of one thread each. This isn't a particularly effective way to use the GPU, and it means that your tid variable in every threadblock is 0 (and only 0, since there is only one thread per threadblock).
There are many problems with this approach, but one immediate problem will be encountered here:
tmp = (myblock[tid>0?tid-1:(N-1)] + myblock[tid<31?tid+1:0]) * 0.5f;
Since tid is always zero, and therefore no other values in your shared memory array (myblock) get populated, the logic in this line cannot be sensible. When tid is zero, you are selecting myblock[N-1] for the first term in the assignment to tmp, but myblock[1023] never gets populated with anything.
It seems that you don't understand various CUDA hierarchies:
a grid is all threads associated with a kernel launch
a grid is composed of threadblocks
each threadblock is a group of threads working together on a single SM
the shared memory resource is a per-SM resource, not a device-wide resource
__syncthreads() also operates on a threadblock basis (not device-wide)
threadIdx.x is a built-in variable that provides a unique thread ID for all threads within a threadblock, but not globally across the grid.
Instead you should break your problem into groups of reasonably-sized threadblocks (i.e. more than one thread). Each threadblock will then be able to behave in a fashion that is roughly as you have outlined. You will then need to special-case the behavior at the starting point and ending point (in your data) of each threadblock; a sketch of this follows below.
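A sketch of what that could look like (my illustration, not code from this answer; it assumes n is a multiple of blockDim.x and writes to a separate output array so that neighbouring blocks never race on the boundary elements they still need to read):
__global__ void compute_it_tiled(const float *in, float *out, int n)
{
    extern __shared__ float tile[];        // blockDim.x + 2 floats, sized at launch
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;             // +1 leaves room for the left halo element

    tile[lid] = in[gid];
    if (threadIdx.x == 0)                  // left halo, wrapping at element 0
        tile[0] = in[gid == 0 ? n - 1 : gid - 1];
    if (threadIdx.x == blockDim.x - 1)     // right halo, wrapping at element n-1
        tile[lid + 1] = in[gid == n - 1 ? 0 : gid + 1];
    __syncthreads();

    float avg = 0.5f * (tile[lid - 1] + tile[lid + 1]);
    out[gid] = avg * avg + tile[lid] * tile[lid];
}
// launched e.g. as:
// compute_it_tiled<<<n / threads, threads, (threads + 2) * sizeof(float)>>>(dev_in, dev_out, n);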
You're also not doing proper CUDA error checking, which is recommended, especially any time you're having trouble with a CUDA code.
If you make the change I indicated first in your kernel code, and reverse the order of your block and grid kernel launch parameters:
compute_it<<<1,N>>>(dev_a);
As indicated by Kristof, you will get something that comes close to what you want, I think. However you will not be able to conveniently scale that beyond N=1024 without other changes to your code.
This line of code is also not correct:
free (dev_a);
Since dev_a was allocated on the device using cudaMalloc you should free it like this:
cudaFree (dev_a);
Since you have only one thread per block, your tid will always be 0.
Try launching the kernel this way:
compute_it<<<1,N>>>(dev_a);
instead of
compute_it<<<N,1>>>(dev_a);