How can I use shared memory here in my CUDA kernel? - c++

I have the following CUDA kernel:
__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, double investment, double profitability) {
// Use a grid-stride loop.
// Reference: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
for (int i = blockIdx.x * blockDim.x + threadIdx.x;
i < strategyCount;
i += blockDim.x * gridDim.x)
{
strategies[i].backtest(data, investment, profitability);
}
}
TL;DR I would like to find a way to store data in shared (__shared__) memory. What I don't understand is how to fill the shared variable using multiple threads.
I have seen examples like this one where data is copied to shared memory thread by thread (e.g. myblock[tid] = data[tid]), but I'm not sure how to do this in my situation. The issue is that each thread needs access to an entire "row" (flattened) of data with each iteration through the data set (see further below where the kernel is called).
I'm hoping for something like this:
__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, int propertyCount, double investment, double profitability) {
__shared__ double sharedData[propertyCount];
// Use a grid-stride loop.
// Reference: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
for (int i = blockIdx.x * blockDim.x + threadIdx.x;
i < strategyCount;
i += blockDim.x * gridDim.x)
{
strategies[i].backtest(sharedData, investment, profitability);
}
}
Here are more details (if more information is needed, please ask!):
strategies is a pointer to a list of Strategy objects, and data is a pointer to an allocated flattened data array.
In backtest() I access data like so:
data[0]
data[1]
data[2]
...
Unflattened, data is a fixed size 2D array similar to this:
[[87.6, 85.4, 88.2, 86.1],
[84.1, 86.5, 86.7, 85.9],
[86.7, 86.5, 86.2, 86.1],
...]
As for the kernel call, I iterate over the data items and call it n times for n data rows (about 3.5 million):
int dataCount = 3500000;
int propertyCount = 4;
for (i=0; i<dataCount; i++) {
unsigned int dataPointerOffset = i * propertyCount;
// Notice pointer arithmetic.
optimizer_backtest<<<32, 1024>>>(devData + dataPointerOffset, devStrategies, strategyCount, investment, profitability);
}

As confirmed in your comment, you want to apply 20k strategies (this number is from your previous question) to every one of the 3.5m data rows and examine the 20k x 3.5m results.
Without shared memory you have to read all the data 20k times, or all the strategies 3.5m times, from global memory.
Shared memory can speed up your program by reducing global memory accesses. Say you can read 1k strategies and 1k data rows into shared memory each time, examine the 1k x 1k results, and repeat until all combinations have been examined. This way you reduce global memory traffic to 20 reads of all the data and 3.5k reads of all the strategies. The situation is similar to a vector-vector cross product. You can find reference code for that for more detail.
However, each of your data rows is large (an 838-D vector), and the strategies may be large too. You may not be able to cache many of them in shared memory (only ~48 KB per block, depending on the device type). So the situation becomes more like matrix-matrix multiplication. For this, you may get some hints from the matrix multiplication code at the following link.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory
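To make the tiling idea concrete, here is a minimal sketch. It is not the asker's real code: ToyStrategy, TILE_ROWS, backtest_tiled and run_backtest are hypothetical names, and the toy strategy only counts rows whose first property exceeds a threshold. The point is the pattern: each block caches a few rows in shared memory, every strategy processes the cached rows, and a single kernel launch walks the whole data set instead of one launch per row.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the real Strategy class: it counts the rows
// whose first property exceeds `threshold`.
struct ToyStrategy {
    double threshold;
    int hits;
    __device__ void backtest(const double *row) {
        if (row[0] > threshold) ++hits;
    }
};

// Rows cached per tile (assumed value). 4 rows of 838 doubles is about 26 KB,
// which stays under a 48 KB per-block shared memory limit.
constexpr int TILE_ROWS = 4;

__global__ void backtest_tiled(const double *data, int rowCount, int propertyCount,
                               ToyStrategy *strategies, int strategyCount)
{
    extern __shared__ double tile[];  // TILE_ROWS * propertyCount doubles

    for (int rowBase = 0; rowBase < rowCount; rowBase += TILE_ROWS) {
        int rowsInTile = min(TILE_ROWS, rowCount - rowBase);
        int tileSize = rowsInTile * propertyCount;

        // All threads of the block copy the tile together.
        for (int k = threadIdx.x; k < tileSize; k += blockDim.x)
            tile[k] = data[(size_t)rowBase * propertyCount + k];
        __syncthreads();

        // Grid-stride loop over strategies; each one sees every cached row.
        for (int s = blockIdx.x * blockDim.x + threadIdx.x; s < strategyCount;
             s += blockDim.x * gridDim.x)
            for (int r = 0; r < rowsInTile; ++r)
                strategies[s].backtest(tile + r * propertyCount);
        __syncthreads();  // finish reading before the tile is overwritten
    }
}

// Hypothetical host helper: runs every threshold against every row.
std::vector<int> run_backtest(const std::vector<double> &rows, int propertyCount,
                              const std::vector<double> &thresholds)
{
    int rowCount = static_cast<int>(rows.size()) / propertyCount;
    int strategyCount = static_cast<int>(thresholds.size());
    std::vector<ToyStrategy> host(strategyCount);
    for (int s = 0; s < strategyCount; ++s) host[s] = {thresholds[s], 0};

    double *devData = nullptr;
    ToyStrategy *devStrategies = nullptr;
    cudaMalloc(&devData, rows.size() * sizeof(double));
    cudaMalloc(&devStrategies, host.size() * sizeof(ToyStrategy));
    cudaMemcpy(devData, rows.data(), rows.size() * sizeof(double),
               cudaMemcpyHostToDevice);
    cudaMemcpy(devStrategies, host.data(), host.size() * sizeof(ToyStrategy),
               cudaMemcpyHostToDevice);

    size_t sharedBytes = TILE_ROWS * propertyCount * sizeof(double);
    backtest_tiled<<<32, 256, sharedBytes>>>(devData, rowCount, propertyCount,
                                             devStrategies, strategyCount);

    cudaMemcpy(host.data(), devStrategies, host.size() * sizeof(ToyStrategy),
               cudaMemcpyDeviceToHost);
    cudaFree(devData);
    cudaFree(devStrategies);

    std::vector<int> hits(strategyCount);
    for (int s = 0; s < strategyCount; ++s) hits[s] = host[s].hits;
    return hits;
}
```

Every block still reads the whole data set once, so global reads of the data scale with the number of blocks rather than with the number of strategies. How many rows fit in a tile depends on the row size and the device, so treat TILE_ROWS as something to tune.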

For people in the future in search of a similar answer, here is what I ended up with for my kernel function:
__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, double investment, double profitability) {
__shared__ double sharedData[838];
if (threadIdx.x < 838) {
sharedData[threadIdx.x] = data[threadIdx.x];
}
__syncthreads();
// Use a grid-stride loop.
// Reference: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
for (int i = blockIdx.x * blockDim.x + threadIdx.x;
i < strategyCount;
i += blockDim.x * gridDim.x)
{
strategies[i].backtest(sharedData, investment, profitability);
}
}
Note that I use both .cuh and .cu files in my application, and I put this in the .cu file. Also note that I use --device-c in my Makefile when compiling object files. I don't know if that's how things should be done, but that's what worked for me.
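If the row length is only known at run time (as with propertyCount in the question), a hard-coded 838 can be avoided with dynamically sized shared memory. The sketch below is an illustration under assumptions, not the original application: scale_row_sum and scaled_row_sums are made-up names, and the "backtest" is replaced by a trivial computation so the example stays self-contained. The strided copy also works when the row is longer than the block.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Stand-in for backtest(): thread i writes (i + 1) * sum(row).
// Only the shared-memory handling is the point here.
__global__ void scale_row_sum(const double *row, int propertyCount,
                              double *out, int outCount)
{
    // The size comes from the third <<<...>>> launch argument.
    extern __shared__ double sharedData[];

    // Strided copy: correct even when propertyCount exceeds blockDim.x.
    for (int k = threadIdx.x; k < propertyCount; k += blockDim.x)
        sharedData[k] = row[k];
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < outCount;
         i += blockDim.x * gridDim.x) {
        double sum = 0.0;
        for (int k = 0; k < propertyCount; ++k) sum += sharedData[k];
        out[i] = (i + 1) * sum;
    }
}

// Hypothetical host helper that launches the kernel for one row.
std::vector<double> scaled_row_sums(const std::vector<double> &row, int outCount)
{
    int propertyCount = static_cast<int>(row.size());
    double *devRow = nullptr, *devOut = nullptr;
    cudaMalloc(&devRow, propertyCount * sizeof(double));
    cudaMalloc(&devOut, outCount * sizeof(double));
    cudaMemcpy(devRow, row.data(), propertyCount * sizeof(double),
               cudaMemcpyHostToDevice);

    size_t sharedBytes = propertyCount * sizeof(double);
    scale_row_sum<<<32, 256, sharedBytes>>>(devRow, propertyCount, devOut, outCount);

    std::vector<double> out(outCount);
    cudaMemcpy(out.data(), devOut, outCount * sizeof(double),
               cudaMemcpyDeviceToHost);
    cudaFree(devRow);
    cudaFree(devOut);
    return out;
}
```

For example, scaled_row_sums({1.0, 2.0, 3.0, 4.0}, 3) yields 10, 20 and 30.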

Related

For loop based kernel vs If statement Kernel - Cuda

I have seen the Cuda Kernel started two separate ways:
1.
for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x; i < length; i += blockDim.x * gridDim.x)
{
// do stuff
}
2.
uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
if(i < length)
{
// do stuff
}
Both versions are launched with kernel<<<num_blocks, threads_per_block>>> where the threads per block are maximized for our device (1024) and the number of blocks (2) for a length of 1025, for example.
The obvious difference is that the for loop allows the kernel to loop when it is launched with fewer threads; for example, with 512 threads, 2 blocks, and a length of 1025, it loops twice.
From previous research I've gathered that Nvidia suggests that we do not try and load balance ourselves (read as loop within the kernel like this), for instance, giving a kernel less threads or less blocks to reserve space for other kernels on the device because the load balancing that is built in is supposed to handle this in a more globally optimized way.
So my question is why would we want to use the for loop vs the if statement form of kernel? Is there a benefit to either at run time?
Given my understanding of Nvidia's stance on load balancing, the only value I can see is the ability to debug synchronously via 1 thread and 1 block setting <<<1, 1>>> when we launch the kernel in the for loop version or not having to precompute the # of blocks needed (and/or threads).
This is the test project I ran:
#include <cstdint>
#include <cstdio>
__global__
inline void kernel(int length)
{
int counter = 0;
for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x; i < length; i += blockDim.x * gridDim.x)
{
printf("%u: | i+: %u | tid: %u | counter: %u \n", i, blockDim.x * gridDim.x, threadIdx.x, counter++);
}
}
__global__
inline void kernel2(int length)
{
uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
if(i < length)
printf("%u: | i+: %u | tid: %u | \n", i, blockDim.x * gridDim.x, threadIdx.x);
}
int main()
{
//kernel<<<2, 1024>>>(1025);
kernel2<<<2, 1024>>>(1025);
cudaDeviceSynchronize();
}
So my question is why would we want to use the for loop vs the if statement form of kernel? Is there a benefit to either at run time?
Yes, there is. Every CUDA thread needs to:
Read all of its parameters from constant memory
Read grid and thread information from special registers: blockDim, blockIdx, threadIdx (or at least their .x components)
Do the arithmetic for computing its global index.
That takes a bit of time. It's not a lot; but if your kernel is very simple (e.g. something like adding up two arrays), then - yes, that has a cost. And of course, if you perform your own preliminary computation that is used with all items in the sequence - each thread has to take the time to do that as well.
From previous research I've gathered that Nvidia suggests that we do not try and load balance ourselves (read as loop within the kernel like this)
I doubt that. The question of whether to iterate a large sequence with a single "CUDA thread" per item or with fewer threads, each working on multiple items, depends on what is done for individual items in the sequence.
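A small sketch of the practical difference, with hypothetical names (scale_monolithic, scale_grid_stride, scale_on_gpu): the if-style kernel only touches elements that have their own thread, while the grid-stride kernel covers the whole array with whatever grid it is given, paying the per-thread setup once per thread rather than once per element.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per element: elements beyond the launched thread count are skipped.
__global__ void scale_monolithic(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Grid-stride loop: any grid size covers all n elements.
__global__ void scale_grid_stride(float *x, float a, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        x[i] *= a;
}

// Hypothetical host helper: scales x by a with the chosen kernel and grid.
std::vector<float> scale_on_gpu(std::vector<float> x, float a,
                                int blocks, int threads, bool gridStride)
{
    int n = static_cast<int>(x.size());
    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    if (gridStride)
        scale_grid_stride<<<blocks, threads>>>(dev, a, n);
    else
        scale_monolithic<<<blocks, threads>>>(dev, a, n);

    cudaMemcpy(x.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return x;
}
```

With 1025 elements and a 2 x 512 launch, the grid-stride version scales element 1024 as well, whereas the if-style version leaves it untouched unless a third block is launched.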

Count values from array CUDA

I have an array of float values, namely life, of which I want to count the number of entries with a value greater than 0 in CUDA.
On the CPU, the code would look like this:
int numParticles = 0;
for(int i = 0; i < MAX_PARTICLES; i++){
if(life[i]>0){
numParticles++;
}
}
Now in CUDA, I've tried something like this:
__global__ void update(float* life, int* numParticles){
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (life[idx]>0){
(*numParticles)++;
}
}
//life is a filled device pointer
int launchCount(float* life)
{
int numParticles = 0;
int* numParticles_d = 0;
cudaMalloc((void**)&numParticles_d, sizeof(int));
update<<<MAX_PARTICLES/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(life, numParticles_d);
cudaMemcpy(&numParticles, numParticles_d, sizeof(int), cudaMemcpyDeviceToHost);
std::cout << "numParticles: " << numParticles << std::endl;
}
But for some reason the CUDA attempt always returns 0 for numParticles. How come?
This:
if (life[idx]>0){
(*numParticles)++;
}
is a read-after write hazard. Multiple threads will be simultaneously attempting to read and write from numParticles. The CUDA execution model does not guarantee anything about the order of simultaneous transactions.
You could make this work by using atomic memory transactions, for example:
if (life[idx]>0){
atomicAdd(numParticles, 1);
}
This will serialize the memory transactions and make the calculation correct. It will also have a big negative effect on performance.
You might want to investigate having each block calculate a local sum using a reduction type calculation and then sum the block local sums atomically or on the host, or in a second kernel.
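A sketch of that block-local approach, under a few assumptions: the names count_positive_kernel, count_positive and THREADS are made up for illustration, the block size is a fixed power of two, and the block sums are combined with one atomicAdd per block. Note that the counter is zeroed with cudaMemset before the launch, since cudaMalloc does not initialize memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int THREADS = 256;  // block size; must be a power of two here

__global__ void count_positive_kernel(const float *life, int n, int *total)
{
    __shared__ int partial[THREADS];

    // Each thread counts its own elements privately.
    int local = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        if (life[i] > 0.0f) ++local;
    partial[threadIdx.x] = local;
    __syncthreads();

    // Tree reduction of the per-thread counts within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // One atomic per block instead of one per positive element.
    if (threadIdx.x == 0) atomicAdd(total, partial[0]);
}

// Hypothetical host helper: copies the data over and returns the count.
int count_positive(const std::vector<float> &life)
{
    int n = static_cast<int>(life.size());
    float *devLife = nullptr;
    int *devTotal = nullptr;
    cudaMalloc(&devLife, n * sizeof(float));
    cudaMalloc(&devTotal, sizeof(int));
    cudaMemcpy(devLife, life.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(devTotal, 0, sizeof(int));  // the counter must start at zero

    int blocks = (n + THREADS - 1) / THREADS;
    if (blocks < 1) blocks = 1;
    if (blocks > 64) blocks = 64;  // the grid-stride loop covers the rest
    count_positive_kernel<<<blocks, THREADS>>>(devLife, n, devTotal);

    int total = 0;
    cudaMemcpy(&total, devTotal, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(devLife);
    cudaFree(devTotal);
    return total;
}
```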
Your code is actually launching MAX_PARTICLES threads, and multiple thread blocks are executing (*numParticles)++; concurrently. This is a race condition. So you get the result 0 or, if you are lucky, sometimes a little more than 0.
Since you are trying to sum up life[i]>0 ? 1 : 0 over all i, you could follow the CUDA parallel reduction approach to implement your kernel, or use a Thrust reduction to simplify your life.
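For the Thrust route, a minimal sketch could look like the following. IsPositive and count_positive_thrust are illustrative names, and the example copies host data into a thrust::device_vector for simplicity; an existing raw device pointer could instead be wrapped with thrust::device_pointer_cast.

```cuda
#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <vector>

// Predicate evaluated on the device for every element.
struct IsPositive {
    __host__ __device__ bool operator()(float x) const { return x > 0.0f; }
};

// Hypothetical helper: counts entries greater than zero on the GPU.
int count_positive_thrust(const std::vector<float> &life)
{
    // Copies the host data to the device.
    thrust::device_vector<float> devLife(life.begin(), life.end());
    return static_cast<int>(
        thrust::count_if(devLife.begin(), devLife.end(), IsPositive()));
}
```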

CUDA coalesced one warp on multiple data

I have a basic question on coalesced cuda access.
For example, I have an Array of 32 Elements and 32 threads, each thread accesses one element.
__global__ void co_acc ( int A[32], int B[32] ) {
int inx = threadIdx.x + (blockIdx.x * blockDim.x);
B[inx] = A[inx];
}
Now, what I want to know: If I have the 32 threads, but an array of 64 elements, each thread has to copy 2 elements. To keep a coalesced access, I should shift the index for the array access by the number of threads I have.
eg: Thread with ID 0 will access A[0] and A[0+32]. Am I right with this assumption?
__global__ void co_acc ( int A[64], int B[64] ) {
int inx = threadIdx.x + (blockIdx.x * blockDim.x);
int actions = 64/blockDim.x;
for ( int i = 0; i < actions; ++i )
B[inx+(i*blockDim.x)] = A[inx+(i*blockDim.x)];
}
To keep a coalesced access, I should shift the index for the array access by the number of threads I have. eg: Thread with ID 0 will access A[0] and A[0+32]. Am I right with this assumption?
Yes, that's a correct approach.
Strictly speaking it's not should but rather could: any memory access will be coalesced as long as all threads within a warp request addresses that fall within the same (aligned) 128 byte line. This means you could permute the thread indices and your accesses would still be coalesced (but why do complicated when you can do simple).
Another solution would be to have each thread load an int2:
__global__ void co_acc ( int A[64], int B[64] ) {
int inx = threadIdx.x + (blockIdx.x * blockDim.x);
reinterpret_cast<int2*>(B)[inx] = reinterpret_cast<int2*>(A)[inx];
}
This is (in my opinion) simpler and clearer code, and might give marginally better performance as this may reduce the number of instructions emitted by the compiler and the latency between memory requests (disclaimer: I have not tried it).
Note: as Robert Crovella has mentioned in his comment, if you really are using thread blocks of 32 threads, then you are likely seriously underusing the capacity of your GPU.
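To contrast the two index layouts discussed here, a small sketch follows; copy_interleaved, copy_blocked and copy_on_gpu are hypothetical names. Both kernels produce the same result. The difference is only in which addresses the threads of a warp touch on each pass: neighbouring ints in the interleaved version, ints a whole chunk apart in the blocked version, which is likely to need more memory transactions per request.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Interleaved: on every pass, neighbouring threads read neighbouring ints,
// so a warp's request falls into as few 128-byte lines as possible.
__global__ void copy_interleaved(const int *A, int *B, int n)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        B[i] = A[i];
}

// Blocked: each thread owns a contiguous chunk, so the threads of a warp
// are `chunk` ints apart on every pass.
__global__ void copy_blocked(const int *A, int *B, int n)
{
    int threads = blockDim.x * gridDim.x;
    int chunk = (n + threads - 1) / threads;
    int first = (blockIdx.x * blockDim.x + threadIdx.x) * chunk;
    for (int i = first; i < first + chunk && i < n; ++i)
        B[i] = A[i];
}

// Hypothetical host helper: copies src through the GPU with one warp.
std::vector<int> copy_on_gpu(const std::vector<int> &src, bool interleaved)
{
    int n = static_cast<int>(src.size());
    int *devA = nullptr, *devB = nullptr;
    cudaMalloc(&devA, n * sizeof(int));
    cudaMalloc(&devB, n * sizeof(int));
    cudaMemcpy(devA, src.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    if (interleaved)
        copy_interleaved<<<1, 32>>>(devA, devB, n);
    else
        copy_blocked<<<1, 32>>>(devA, devB, n);

    std::vector<int> dst(n);
    cudaMemcpy(dst.data(), devB, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(devA);
    cudaFree(devB);
    return dst;
}
```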

Efficiently Initializing Shared Memory Array in CUDA

Note that this shared memory array is never written to, only read from.
As I have it, my shared memory gets initialized like:
__shared__ float TMshared[2592];
for (int i = 0; i< 2592; i++)
{
TMshared[i] = TM[i];
}
__syncthreads();
(TM is passed into all threads from kernel launch)
You might have noticed that this is highly inefficient as there is no parallelization going on and threads within the same block are writing to the same location.
Can someone please recommend a more efficient approach/comment on if this issue really needs optimization since the shared array in question is relatively small?
Thanks!
Use all threads to write independent locations, it will probably be quicker.
Example assumes 1D threadblock/grid:
#define SSIZE 2592
__shared__ float TMshared[SSIZE];
int lidx = threadIdx.x;
while (lidx < SSIZE){
TMshared[lidx] = TM[lidx];
lidx += blockDim.x;
}
__syncthreads();

CUDA shared memory programming is not working

all:
I am learning how shared memory accelerates GPU programs. I am using the code below to calculate the squared value of each element plus the squared value of the average of its left and right neighbors.
The code runs; however, the result is not as expected.
The first 10 results printed are 0,1,2,3,4,5,6,7,8,9, while I am expecting 25,2,8,18,32,50,72,98,128,162.
The code is as follows, with reference to here.
Would you please tell me which part goes wrong? Your help is very much appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <cuda.h>
const int N=1024;
__global__ void compute_it(float *data)
{
int tid = threadIdx.x;
__shared__ float myblock[N];
float tmp;
// load the thread's data element into shared memory
myblock[tid] = data[tid];
// ensure that all threads have loaded their values into
// shared memory; otherwise, one thread might be computing
// on uninitialized data.
__syncthreads();
// compute the average of this thread's left and right neighbors
tmp = (myblock[tid>0?tid-1:(N-1)] + myblock[tid<(N-1)?tid+1:0]) * 0.5f;
// square the previous result and add my value, squared
tmp = tmp*tmp + myblock[tid]*myblock[tid];
// write the result back to global memory
data[tid] = myblock[tid];
__syncthreads();
}
int main (){
char key;
float *a;
float *dev_a;
a = (float*)malloc(N*sizeof(float));
cudaMalloc((void**)&dev_a,N*sizeof(float));
for (int i=0; i<N; i++){
a [i] = i;
}
cudaMemcpy(dev_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
compute_it<<<N,1>>>(dev_a);
cudaMemcpy(a, dev_a, N*sizeof(float), cudaMemcpyDeviceToHost);
for (int i=0; i<10; i++){
std::cout<<a [i]<<",";
}
std::cin>>key;
free (a);
free (dev_a);
}
One of the most immediate problems in your kernel code is this:
data[tid] = myblock[tid];
I think you probably meant this:
data[tid] = tmp;
In addition, you're launching 1024 blocks of one thread each. This isn't a particularly effective way to use the GPU and it means that your tid variable in every threadblock is 0 (and only 0, since there is only one thread per threadblock.)
There are many problems with this approach, but one immediate problem will be encountered here:
tmp = (myblock[tid>0?tid-1:(N-1)] + myblock[tid<(N-1)?tid+1:0]) * 0.5f;
Since tid is always zero, and therefore no other values in your shared memory array (myblock) get populated, the logic in this line cannot be sensible. When tid is zero, you are selecting myblock[N-1] for the first term in the assignment to tmp, but myblock[1023] never gets populated with anything.
It seems that you don't understand various CUDA hierarchies:
a grid is all threads associated with a kernel launch
a grid is composed of threadblocks
each threadblock is a group of threads working together on a single SM
the shared memory resource is a per-SM resource, not a device-wide resource
__syncthreads() also operates on a threadblock basis (not device-wide)
threadIdx.x is a built-in variable that provides a unique thread ID for all threads within a threadblock, but not globally across the grid.
Instead you should break your problem into groups of reasonable-sized threadblocks (i.e. more than one thread). Each threadblock will then be able to behave in a fashion that is roughly as you have outlined. You will then need to special-case the behavior at the starting point and ending point (in your data) of each threadblock.
You're also not doing proper cuda error checking which is recommended, especially any time you're having trouble with a CUDA code.
If you make the change I indicated first in your kernel code, and reverse the order of your block and grid kernel launch parameters:
compute_it<<<1,N>>>(dev_a);
As indicated by Kristof, you will get something that comes close to what you want, I think. However you will not be able to conveniently scale that beyond N=1024 without other changes to your code.
This line of code is also not correct:
free (dev_a);
Since dev_a was allocated on the device using cudaMalloc you should free it like this:
cudaFree (dev_a);
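Putting these points together, one possible multi-block version is sketched below. It is an illustration under assumptions rather than the only way to do it: compute_it_tiled, compute_on_gpu, BLOCK and the CUDA_CHECK macro are names chosen here, each block loads its elements plus one halo value on each side, and the result goes to a separate output buffer so that no block overwrites a value another block still needs as a neighbor.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <vector>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

constexpr int BLOCK = 256;  // threads per block (assumed launch size)

__global__ void compute_it_tiled(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2];  // block elements plus one halo per side
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int lid = threadIdx.x + 1;                        // index inside the tile

    if (gid < n) {
        tile[lid] = in[gid];
        // Special cases at the block edges: fetch the neighbors, wrapping
        // around at the ends of the array as the original kernel does.
        if (threadIdx.x == 0)
            tile[0] = in[gid > 0 ? gid - 1 : n - 1];
        if (threadIdx.x == blockDim.x - 1 || gid == n - 1)
            tile[lid + 1] = in[gid < n - 1 ? gid + 1 : 0];
    }
    __syncthreads();

    if (gid < n) {
        float avg = (tile[lid - 1] + tile[lid + 1]) * 0.5f;
        out[gid] = avg * avg + tile[lid] * tile[lid];
    }
}

// Hypothetical host helper with error checking on every runtime call.
std::vector<float> compute_on_gpu(const std::vector<float> &a)
{
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);
    float *dev_in = nullptr, *dev_out = nullptr;
    CUDA_CHECK(cudaMalloc(&dev_in, bytes));
    CUDA_CHECK(cudaMalloc(&dev_out, bytes));
    CUDA_CHECK(cudaMemcpy(dev_in, a.data(), bytes, cudaMemcpyHostToDevice));

    int blocks = (n + BLOCK - 1) / BLOCK;
    compute_it_tiled<<<blocks, BLOCK>>>(dev_in, dev_out, n);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    std::vector<float> result(n);
    CUDA_CHECK(cudaMemcpy(result.data(), dev_out, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev_in));
    CUDA_CHECK(cudaFree(dev_out));
    return result;
}
```

Because the grid size is derived from n, this scales past 1024 elements without further changes.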
Since you have only one thread per block, your tid will always be 0.
Try launching the kernel this way:
compute_it<<<1,N>>>(dev_a);
instead of
compute_it<<<N,1>>>(dev_a);