CUDA - no blocks, just threads for undefined dimensions - c++

I have some matrices whose sizes are not known in advance and vary from 10 to 20,000 in both dimensions.
I designed a CUDA kernel with a grid of (x, y) blocks and (x, y) threads per block.
Since the matrix width/height aren't multiples of my block dimensions, it was a terrible pain to get things to work, and the code is becoming more and more complicated just to keep memory reads coalesced.
Besides all of that, the kernel keeps growing and uses more and more registers just to check for correctness... so I think this is not the approach I should take.
My question is: what if I eliminate blocks entirely and just create a grid of (x, y) threads? Will an SM have problems without many blocks?
Can I eliminate blocks and use a large number of threads, or is the block subdivision necessary?

You can't really just make a "grid of threads", since you have to organize threads into blocks, and you can have a maximum of 512 threads per block. You could effectively do this by using 1 thread per block, which results in an X by Y grid of 1x1 blocks. However, this will give pretty terrible performance due to several factors:
According to the CUDA Programming Guide, an SM can handle a maximum of 8 blocks at any time. This will limit you to 8 threads per SM, which isn't enough to fill even a single warp. If you have, say, 48 SMs, you will only be able to handle 8 × 48 = 384 threads at any given time.
With only 8 threads available on an SM, there will be too few warps to hide memory latencies. The GPU will spend most of its time waiting for memory accesses to complete rather than doing any computation.
You will be unable to coalesce memory reads and writes, resulting in poor memory bandwidth usage.
You will be effectively unable to leverage shared memory, as this is a shared resource between threads in a block.
While having to ensure correctness for threads in a block is annoying, your performance will be vastly better than with your "grid of threads" idea.
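For the 2D case in your question, that correctness check does not have to be expensive. A minimal sketch (kernel and variable names are made up) could look like this:
__global__ void matrixKernel(float* mat, int width, int height)
{
    // global 2D coordinates of this thread
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // threads that fall outside the matrix simply exit
    if (x >= width || y >= height) return;
    mat[y * width + x] += 1.0f;   // do the actual work on element (x, y)
}
// host side: round the grid up so every element is covered
void launchMatrixKernel(float* d_mat, int width, int height)
{
    dim3 block(16, 16);
    dim3 grid((width  + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    matrixKernel<<<grid, block>>>(d_mat, width, height);
}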

Here's the code I use to divide a task requiring num_threads threads into a block and grid configuration. Yes, you might end up launching too many blocks (though only a few extra) and you will probably end up with more actual threads than required, but it's easy and efficient this way. See the second code example below for my simple in-kernel boundary check.
PS: I always use block_size == 128 because it has been a good tradeoff between multiprocessor occupancy, register usage, shared memory requirements and coalesced access for all of my kernels.
Code to calculate a good grid size (host):
#include <cassert>

#define GRID_SIZE 65535

//ceil division (rounding up)
unsigned int kernelUtilCeilDiv(unsigned int numerator, unsigned int denominator) {
    return (numerator + denominator - 1) / denominator;
}

//calculate grid size (store result in grid/block)
void kernelUtilCalcGridSize(unsigned int num_threads, unsigned int block_size, dim3* grid, dim3* block) {
    //block
    block->x = block_size;
    block->y = 1;
    block->z = 1;
    //number of blocks
    unsigned int num_blocks = kernelUtilCeilDiv(num_threads, block_size);
    unsigned int total_threads = num_blocks * block_size;
    assert(total_threads >= num_threads);
    //calculate grid size
    unsigned int gy = kernelUtilCeilDiv(num_blocks, GRID_SIZE);
    unsigned int gx = kernelUtilCeilDiv(num_blocks, gy);
    unsigned int total_blocks = gx * gy;
    assert(total_blocks >= num_blocks);
    //grid
    grid->x = gx;
    grid->y = gy;
    grid->z = 1;
}
Code to calculate the unique thread id and check boundaries (device):
//some kernel
__global__ void kernelFoo(unsigned int num_threads, ...) {
    //calculate unique id
    const unsigned int thread_id = threadIdx.x;
    const unsigned int block_id  = blockIdx.x + blockIdx.y * gridDim.x;
    const unsigned int unique_id = thread_id + block_id * blockDim.x;
    //check range
    if (unique_id >= num_threads) return;
    //do the actual work
    ...
}
I don't think that's a lot of effort/registers/lines-of-code to check for correctness.
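For completeness, this is roughly how the two pieces fit together on the host side (a sketch; the remaining kernel arguments are elided just as in the kernel above):
dim3 grid, block;
kernelUtilCalcGridSize(num_threads, 128, &grid, &block);
kernelFoo<<<grid, block>>>(num_threads, ...);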

Related

OpenCL Memory Bandwidth/Coalescing

Summary:
I'm trying to write a memory bound OpenCL program that comes close to the advertised memory bandwidth on my GPU. In reality I'm off by a factor of ~50.
Setup:
I only have a relatively old Polaris card (RX 580), so I can't use CUDA and have to settle on OpenCL for now. I know this is suboptimal, and I can't get any debugging/performance counters to work, but it's all I have.
I'm new to GPU computing and want to get a feel for the performance I can expect from GPU vs CPU. The first thing to work on for me is memory bandwidth.
I wrote a very small OpenCL kernel, which reads from strided memory locations in such a way that all workers in a wavefront together perform contiguous memory accesses over a large memory segment, coalescing the accesses. All the kernel then does with the loaded data is sum the values up and write the sum back to another memory location at the very end. The code (which for the most part I shamelessly copied together from various sources) is quite simple:
__kernel void ThroughputTestKernel(
    __global float* vInMemory,
    __global float* vOutMemory,
    const int iNrOfIterations,
    const int iNrOfWorkers
)
{
    const int gtid = get_global_id(0);
    __private float fAccumulator = 0.0;

    for (int k = 0; k < iNrOfIterations; k++) {
        fAccumulator += vInMemory[gtid + k * iNrOfWorkers];
    }

    vOutMemory[gtid] = fAccumulator;
}
I spawn iNrOfWorkers instances of this kernel and measure the time it takes them to finish processing. For my tests I set iNrOfWorkers = 1024 and iNrOfIterations = 64*1024. From the processing time and iMemorySize = iNrOfWorkers * iNrOfIterations * sizeof(float) I calculate a memory bandwidth of around 5 GByte/s.
Expectations:
My problem is that memory accesses seem to be one to two orders of magnitude slower than the 256GByte/s that I was led to believe I have available.
The GCN ISA Manual [1] has me assuming that I have 36 CUs, each of which contains 4 SIMD units, each of which processes vectors of 16 elements. Therefore I should have 36 × 4 × 16 = 2304 processing elements available.
I spawn fewer than that, namely 1024 global work items ("threads"). The threads access memory locations in order, 1024 locations apart, so that in each iteration of the loop, the entire wavefront accesses 1024 consecutive elements. Therefore I believe that the GPU should be able to produce consecutive memory address accesses with no breaks in between.
My guess is that, instead of 1024, it only spawns very few threads, one per CU maybe? That way it would have to re-read the data over and over again. I don't know how I would be able to verify that, though.
[1] http://developer.amd.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf
A few issues with your approach:
You don't saturate the GPU. To get peak performance, you need to launch many more threads than your GPU has execution units. "Much more" here means >10,000,000.
Your loop contains integer index computation (for the coalesced access pattern). Here this is probably not enough to push you into the compute limit, but it's generally better to unroll the small loop with #pragma unroll; then the compiler does all the index calculation in advance. You can also bake the constants iNrOfIterations and iNrOfWorkers right into the OpenCL code with #define iNrOfIterations 16 / #define iNrOfWorkers 15728640, either via C++ string concatenation or by hardcoding.
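A sketch of the string-concatenation approach (using the OpenCL C++ bindings that the timing code below also assumes; kernelSource and context are placeholders for your own program setup):
#include <string>
#include <CL/cl.hpp>  // or CL/cl2.hpp, whichever bindings you already use

// Prepend the #defines to the kernel source so the compiler sees them as constants.
// kernelSource is assumed to hold the kernel text without the two constants.
cl::Program buildWithBakedConstants(const cl::Context& context, const std::string& kernelSource) {
    const std::string defines =
        "#define iNrOfIterations 16\n"
        "#define iNrOfWorkers 15728640\n";
    return cl::Program(context, defines + kernelSource, true);  // third argument triggers the build
}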
There are 4 different memory bandwidths depending on your access pattern: coalesced reads, coalesced writes, misaligned reads and misaligned writes. Coalesced is much faster than misaligned, and the performance penalty for misaligned reads is smaller than for misaligned writes. Only coalesced memory access gets you anywhere near the advertised bandwidth. Your kernel measures iNrOfIterations coalesced reads and 1 coalesced write. To measure all four types separately, you can use this:
#define def_N 15728640
#define def_M 16

kernel void benchmark_1(global float* data) {
    const uint n = get_global_id(0);
    #pragma unroll
    for(uint i=0; i<def_M; i++) data[i*def_N+n] = 0.0f; // M coalesced writes
}

kernel void benchmark_2(global float* data) {
    const uint n = get_global_id(0);
    float x = 0.0f;
    #pragma unroll
    for(uint i=0; i<def_M; i++) x += data[i*def_N+n]; // M coalesced reads
    data[n] = x; // 1 coalesced write (to prevent compiler optimization)
}

kernel void benchmark_3(global float* data) {
    const uint n = get_global_id(0);
    #pragma unroll
    for(uint i=0; i<def_M; i++) data[n*def_M+i] = 0.0f; // M misaligned writes
}

kernel void benchmark_4(global float* data) {
    const uint n = get_global_id(0);
    float x = 0.0f;
    #pragma unroll
    for(uint i=0; i<def_M; i++) x += data[n*def_M+i]; // M misaligned reads
    data[n] = x; // 1 coalesced write (to prevent compiler optimization)
}
Here the data array has size N*M and each kernel is executed across the range N. For the bandwidth calculation, execute each kernel a few hundred times (for a better average) and take the average execution times time1, time2, time3 and time4. The bandwidths are then computed like this (the time1/M term subtracts the cost of the single write in the read kernels, estimated as one M-th of the all-writes kernel):
coalesced read bandwidth (GB/s) = 4.0E-9*M*N/(time2-time1/M)
coalesced write bandwidth (GB/s) = 4.0E-9*M*N/(time1)
misaligned read bandwidth (GB/s) = 4.0E-9*M*N/(time4-time1/M)
misaligned write bandwidth (GB/s) = 4.0E-9*M*N/(time3)
Edit: How to measure kernel execution time:
Clock
#include <chrono>
using namespace std;

class Clock {
private:
    typedef chrono::high_resolution_clock clock;
    chrono::time_point<clock> t;
public:
    Clock() { start(); }
    void start() { t = clock::now(); }
    double stop() const { return chrono::duration_cast<chrono::duration<double>>(clock::now()-t).count(); }
};
Time measurement of K executions of a kernel
const int K = 128; // execute kernel 128 times and average execution time
NDRange range_local  = NDRange(256); // thread block size
NDRange range_global = NDRange(N);   // N must be divisible by the thread block size
Clock clock;
clock.start();
for(int k=0; k<K; k++) {
    queue.enqueueNDRangeKernel(kernel_1, NullRange, range_global, range_local);
    queue.finish();
}
const double time1 = clock.stop()/(double)K;
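With time2, time3 and time4 measured the same way for the other three kernels, plugging into the formulas above is then just this (a sketch, using the def_N/def_M values from above):
const double N = 15728640.0, M = 16.0;                                   // def_N and def_M
const double bwCoalescedWrite  = 4.0E-9 * M * N /  time1;                // GB/s
const double bwCoalescedRead   = 4.0E-9 * M * N / (time2 - time1 / M);   // GB/s
const double bwMisalignedWrite = 4.0E-9 * M * N /  time3;                // GB/s
const double bwMisalignedRead  = 4.0E-9 * M * N / (time4 - time1 / M);   // GB/s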

It's slower to calculate integral image using CUDA than CPU code

I am implementing an integral image calculation module in CUDA to improve performance.
But it is slower than the CPU module.
Please let me know what I did wrong.
The CUDA kernels and host code follow.
And also, another problem is...
In the kernel SumH, using texture memory is slower than using global memory; imageTexture was defined as below.
texture<unsigned char, 1> imageTexture;
cudaBindTexture(0, imageTexture, pbImage);
// kernels to scan the image horizontally and vertically.
__global__ void SumH(unsigned char* pbImage, int* pnIntImage, __int64* pn64SqrIntImage, float rVSpan, int nWidth)
{
    int nStartY, nEndY, nIdx;
    if (!threadIdx.x)
    {
        nStartY = 1;
    }
    else
        nStartY = (int)(threadIdx.x * rVSpan);
    nEndY = (int)((threadIdx.x + 1) * rVSpan);

    for (int i = nStartY; i < nEndY; i ++)
    {
        for (int j = 1; j < nWidth; j ++)
        {
            nIdx = i * nWidth + j;
            pnIntImage[nIdx] = pnIntImage[nIdx - 1] + pbImage[nIdx - nWidth - i];
            pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - 1] + pbImage[nIdx - nWidth - i] * pbImage[nIdx - nWidth - i];
            //pnIntImage[nIdx] = pnIntImage[nIdx - 1] + tex1Dfetch(imageTexture, nIdx - nWidth - i);
            //pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - 1] + tex1Dfetch(imageTexture, nIdx - nWidth - i) * tex1Dfetch(imageTexture, nIdx - nWidth - i);
        }
    }
}
__global__ void SumV(unsigned char* pbImage, int* pnIntImage, __int64* pn64SqrIntImage, float rHSpan, int nHeight, int nWidth)
{
    int nStartX, nEndX, nIdx;
    if (!threadIdx.x)
    {
        nStartX = 1;
    }
    else
        nStartX = (int)(threadIdx.x * rHSpan);
    nEndX = (int)((threadIdx.x + 1) * rHSpan);

    for (int i = 1; i < nHeight; i ++)
    {
        for (int j = nStartX; j < nEndX; j ++)
        {
            nIdx = i * nWidth + j;
            pnIntImage[nIdx] = pnIntImage[nIdx - nWidth] + pnIntImage[nIdx];
            pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - nWidth] + pn64SqrIntImage[nIdx];
        }
    }
}
// host code
int nW = image_width;
int nH = image_height;
unsigned char* pbImage;
int* pnIntImage;
__int64* pn64SqrIntImage;

cudaMallocManaged(&pbImage, nH * nW);
// assign image gray values to pbImage
cudaMallocManaged(&pnIntImage, sizeof(int) * (nH + 1) * (nW + 1));
cudaMallocManaged(&pn64SqrIntImage, sizeof(__int64) * (nH + 1) * (nW + 1));

float rHSpan, rVSpan;
int nHThreadNum, nVThreadNum;
if (nW + 1 <= 1024)
{
    rHSpan = 1;
    nVThreadNum = nW + 1;
}
else
{
    rHSpan = (float)(nW + 1) / 1024;
    nVThreadNum = 1024;
}
if (nH + 1 <= 1024)
{
    rVSpan = 1;
    nHThreadNum = nH + 1;
}
else
{
    rVSpan = (float)(nH + 1) / 1024;
    nHThreadNum = 1024;
}

SumH<<<1, nHThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rVSpan, nW + 1);
cudaDeviceSynchronize();
SumV<<<1, nVThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rHSpan, nH + 1, nW + 1);
cudaDeviceSynchronize();
Regarding the code that is currently in the question, there are two things I'd like to mention: launch parameters and timing methodology.
1) Launch parameters
When you launch a kernel there are two main arguments that specify the amount of threads you are launching. These are between the <<< and >>> sections, and are the number of blocks in the grid, and the number of threads per block as follows:
foo <<< numBlocks, numThreadsPerBlock >>> (args);
For a single kernel to be efficient on a current GPU, you can use the rule of thumb that numBlocks * numThreadsPerBlock should be at least 10,000, i.e. 10,000 pieces of work. This is a rule of thumb, so you may get good results with only 5,000 threads (it varies with GPU: cheaper GPUs can get away with fewer threads), but this is the order of magnitude you need to be looking at as a minimum. You are running 1024 threads. This is almost certainly not enough (hint: the loops inside your kernel look like scan primitives, and these can be done in parallel).
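As a rough sketch of what that means in numbers for this problem, with one thread per output element (foo stands for a kernel written to use the combined block/thread index; your SumH/SumV as written assume a single block):
int N = (nH + 1) * (nW + 1);                                          // one piece of work per output element
int numThreadsPerBlock = 256;
int numBlocks = (N + numThreadsPerBlock - 1) / numThreadsPerBlock;    // round up so all N elements are covered
foo<<<numBlocks, numThreadsPerBlock>>>(args);                         // far more than 10,000 threads for any realistic image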
Further to this there are a few other things to consider.
The number of blocks should be large compared to the number of SMs on your GPU. A Kepler K40 has 15 SMs, and to avoid a significant tail effect you'd probably want at least ~100 blocks on this GPU. Other GPUs have fewer SMs, but you haven't specified which you have, so I can't be more specific.
The number of threads per block should not be too small. You can only have so many blocks on each SM, so if your blocks are too small you will use the GPU suboptimally. Furthermore, on newer GPUs up to four warps can receive instructions on an SM simultaneously, so it is often a good idea to have block sizes that are multiples of 128.
2) Timing
I'm not going to go into so much depth here, but make sure your timing is sane. GPU code tends to have a one-time initialisation delay. If this is within your timing, you will see erroneously large runtimes for codes designed to represent a much larger code. Similarly, data transfer between the CPU and GPU takes time. In a real application you may only do this once for thousands of kernel calls, but in a test application you may do it once per kernel launch.
If you want to get accurate timings, you must make your example more representative of the final code, or you must be sure that you are only timing the regions that will be repeated.
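One way to time only the kernel itself is with CUDA events; a sketch applied to the SumH launch from the question:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
SumH<<<1, nHThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rVSpan, nW + 1);
cudaEventRecord(stop);
cudaEventSynchronize(stop);                        // wait for the kernel to finish
float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start, stop);  // kernel time only, no allocation or copy overhead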
The only way to be sure is to profile the code, but in this case we can probably make a reasonable guess.
You're basically just doing a single scan through some data, and doing extremely minimal processing on each item.
Given how little processing you're doing on each item, the bottleneck when you process the data with the CPU is probably just reading the data from memory.
When you do the processing on the GPU, the data still needs to be read from memory and copied into the GPU's memory. That means we still have to read all the data from main memory, just like if the CPU did the processing. Worse, it all has to be written to the GPU's memory, causing a further slowdown. By the time the GPU even gets to start doing real processing, you've already used up more time than it would have taken the CPU to finish the job.
For CUDA to make sense, you generally need to be doing a lot more processing on each individual data item. In this case, the CPU is probably already nearly idle most of the time, waiting for data from memory. In such a case, the GPU is unlikely to be of much help unless the input data is already in the GPU's memory so the GPU can do the processing without any extra copying.
When working with CUDA there are a few things you should keep in mind.
Copying from host memory to device memory is 'slow': once you have copied some data from the host to the device, you should do as much calculation as possible (do all the work) before you copy it back to the host.
On the device there are several memory spaces - global, shared, and registers. Ranked by speed: global < shared < registers (registers are the fastest).
Reading from consecutive memory blocks is faster than random access. When working with an array of structures, you would like to transpose it into a structure of arrays.
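A minimal illustration of that transposition (the type names here are made up):
struct PixelAoS { float r, g, b; };        // array of structures: channels interleaved in memory
PixelAoS* pixelsAoS;                       // pixelsAoS[i].r  -> threads read with a stride of 3 floats

struct PixelsSoA { float *r, *g, *b; };    // structure of arrays: each channel stored contiguously
PixelsSoA pixelsSoA;                       // pixelsSoA.r[i]  -> consecutive threads read consecutive floats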
You can always consult the CUDA Visual Profiler to show you the bottleneck of your program.
The above-mentioned GTX 750 has 512 CUDA cores (these are the same as the shader units, just driven in a different mode).
http://www.nvidia.de/object/geforce-gtx-750-de.html#pdpContent=2
The task of creating integral images can only be partially parallelized, since any value in the result array depends on a large number of its predecessors. Furthermore, it involves only a tiny amount of math per memory transfer, so the unavoidable memory transfers, rather than ALU throughput, are likely to be the bottleneck. Such an accelerator may provide some speedup, but not a thrilling one, because the task itself does not allow it.
If you were to compute multiple variations of integral images on the same input data, you would be much more likely to see the "thrill", due to much greater parallelism and a higher number of math operations per transfer. But that would be a different task.
As a wild guess from a Google search - others have already worked on this: http://dspace.mit.edu/openaccess-disseminate/1721.1/71883

How do you iterate through a pitched CUDA array?

Having parallelized with OpenMP before, I'm trying to wrap my head around CUDA, which doesn't seem too intuitive to me. At this point, I'm trying to understand exactly how to loop through an array in a parallelized fashion.
CUDA by Example is a great start.
The snippet on page 43 shows:
__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x; // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
Whereas in OpenMP the programmer chooses the number of times the loop will run and OpenMP splits that into threads for you, in CUDA you have to tell it (via the number of blocks and number of threads in <<<...>>>) to run it enough times to iterate through your array, using a thread ID number as an iterator. In other words, you can have a CUDA kernel that always runs 10,000 times, which means the above code will work for any array up to N = 10,000 (and of course for smaller arrays you're wasting cycles dropping out at if (tid < N)).
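Concretely, for the snippet above that means a launch like the following sketch, where dev_a, dev_b and dev_c stand for the device copies of the arrays:
#define N 10000
// one block per element, matching tid = blockIdx.x in the kernel above;
// surplus blocks for smaller arrays simply drop out at the if (tid < N) check
add<<<N, 1>>>( dev_a, dev_b, dev_c );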
For pitched memory (2D and 3D arrays), the CUDA Programming Guide has the following example:
// Host code
int width = 64, height = 64;
float* devPtr; size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);
// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
This example doesn't seem too useful to me. First they declare an array that is 64 x 64, then the kernel is set to execute 512 x 100 times. That's fine, because the kernel does nothing other than iterate through the array (so it runs 51,200 loops through a 64 x 64 array).
According to this answer the iterator for when there are blocks of threads going on will be
int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
So if I wanted to run the first snippet in my question for a pitched array, I could just make sure I had enough blocks and threads to cover every element including the padding that I don't care about. But that seems wasteful.
So how do I iterate through a pitched array without going through the padding elements?
In my particular application I have a 2D FFT and I'm trying to calculate arrays of the magnitude and angle (on the GPU to save time).
After reviewing the valuable comments and answers from JackOLantern, and re-reading the documentation, I was able to get my head straight. Of course the answer is "trivial" now that I understand it.
In the code below, I define CFPtype (Complex Floating Point) and FPtype so that I can quickly change between single and double precision. For example, #define CFPtype cufftComplex.
I still can't wrap my head around the number of threads used to call the kernel. If it's too large, it simply won't go into the function at all. The documentation doesn't seem to say anything about what number should be used - but this is all for a separate question.
The key in getting my whole program to work (2D FFT on pitched memory and calculating magnitude and argument) was realizing that even though CUDA gives you plenty of "apparent" help in allocating 2D and 3D arrays, everything is still in units of bytes. It's obvious in a malloc call that the sizeof(type) must be included, but I totally missed it in calls of the type allocate(width, height). Noob mistake, I guess. Had I written the library I would have made the type size a separate parameter, but whatever.
So given an image of dimensions width x height in pixels, this is how it comes together:
Allocating memory
I'm using pinned memory on the host side because it's supposed to be faster. That's allocated with cudaHostAlloc which is straightforward. For pitched memory, you need to store the pitch for each different width and type, because it could change. In my case the dimensions are all the same (complex to complex transform) but I have arrays that are real numbers so I store a complexPitch and a realPitch. The pitched memory is done like this:
cudaMallocPitch(&inputGPU, &complexPitch, width * sizeof(CFPtype), height);
To copy memory to/from pitched arrays you cannot use cudaMemcpy; you have to use cudaMemcpy2D:
cudaMemcpy2D(inputGPU, complexPitch,                 // destination and destination pitch
             inputPinned, width * sizeof(CFPtype),   // source and source pitch (= width because it's not padded)
             width * sizeof(CFPtype), height,        // width (in bytes) and height of the copied region
             cudaMemcpyKind::cudaMemcpyHostToDevice);
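Copying back out of a pitched array is symmetric, with the roles of the pitches swapped (a sketch; outputGPU is the FFT output below, and outputPinned stands for a matching pinned host buffer):
cudaMemcpy2D(outputPinned, width * sizeof(CFPtype),  // destination and destination pitch (host buffer, not padded)
             outputGPU, complexPitch,                // source and source pitch (pitched device array)
             width * sizeof(CFPtype), height,        // width (in bytes) and height of the copied region
             cudaMemcpyKind::cudaMemcpyDeviceToHost);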
FFT plan for pitched arrays
JackOLantern provided this answer, which I couldn't have done without. In my case the plan looks like this:
int n[] = {height, width};
int nembed[] = {height, (int)(complexPitch / sizeof(CFPtype))};
result = cufftPlanMany(
    &plan,
    2, n,          // transform rank and dimensions
    nembed, 1,     // input array physical dimensions and stride
    1,             // input distance to next batch (irrelevant because we are only doing 1)
    nembed, 1,     // output array physical dimensions and stride
    1,             // output distance to next batch
    cufftType::CUFFT_C2C, 1);
Executing the FFT is trivial:
cufftExecC2C(plan, inputGPU, outputGPU, CUFFT_FORWARD);
So far I have had little to optimize. Now I wanted to get magnitude and phase out of the transform, hence the question of how to traverse a pitched array in parallel. First I define a function to call the kernel with the "correct" threads per block and enough blocks to cover the entire image. As suggested by the documentation, creating 2D structures for these numbers is a great help.
void GPUCalcMagPhase(CFPtype *data, size_t dataPitch, int width, int height, FPtype *magnitude, FPtype *phase, size_t magPhasePitch, int cudaBlockSize)
{
    dim3 threadsPerBlock(cudaBlockSize, cudaBlockSize);
    dim3 numBlocks((unsigned int)ceil(width  / (double)threadsPerBlock.x),
                   (unsigned int)ceil(height / (double)threadsPerBlock.y));
    CalcMagPhaseKernel<<<numBlocks, threadsPerBlock>>>(data, dataPitch, width, height, magnitude, phase, magPhasePitch);
}
Setting the blocks and threads per block is equivalent to writing the (up to 3) nested for-loops. So you have to have enough blocks * threads to cover the array, and then in the kernel you must make sure that you are not exceeding the array size. By using 2D elements for threadsPerBlock and numBlocks, you avoid having to go through the padding elements in the array.
Traversing a pitched array in parallel
The kernel uses the standard pointer arithmetic from the documentation:
__global__ void CalcMagPhaseKernel(CFPtype *data, size_t dataPitch, int width, int height,
                                   FPtype *magnitude, FPtype *phase, size_t magPhasePitch)
{
    int threadX = threadIdx.x + blockDim.x * blockIdx.x;
    if (threadX >= width)
        return;

    int threadY = threadIdx.y + blockDim.y * blockIdx.y;
    if (threadY >= height)
        return;

    CFPtype *threadRow = (CFPtype *)((char *)data + threadY * dataPitch);
    CFPtype complex = threadRow[threadX];

    FPtype *magRow = (FPtype *)((char *)magnitude + threadY * magPhasePitch);
    FPtype *magElement = &(magRow[threadX]);

    FPtype *phaseRow = (FPtype *)((char *)phase + threadY * magPhasePitch);
    FPtype *phaseElement = &(phaseRow[threadX]);

    *magElement = sqrt(complex.x*complex.x + complex.y*complex.y);
    *phaseElement = atan2(complex.y, complex.x);
}
The only wasted threads here are for the cases where the width or height are not multiples of the number of threads per block.

A CUDA kernel to optimize

Hi, I recently got a CUDA kernel to optimize. Here is the original kernel:
__global__ void kernel_base( float *data, int x_dim, int y_dim )
{
    int ix  = blockIdx.x;
    int iy  = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*x_dim + ix;

    float tmp = data[idx];

    if( ix % 2 )
    {
        tmp += sqrtf( sinf(tmp) + 1.f );
    }
    else
    {
        tmp += sqrtf( cosf(tmp) + 1.f );
    }

    data[idx] = tmp;
}

dim3 block( 1, 512 );
dim3 grid( 2048/1, 2048/512 );
kernel_base<<<grid,block>>>( d_data, 2048, 2048 );
The basic problem here is the dilemma between memory coalescing and thread divergence. The original code processes the array in column-major order, so it has a strided memory access pattern but no divergence. I could change it to row-major order, which in turn has the problem of thread divergence.
So does anyone have better idea how to maximize the performance?
Thread divergence here isn't a big problem compared to the strided memory access, in terms of performance. I would go for coalescing. Furthermore, your data storage has an implicit AoS ordering. If you can reorder the data to SoA, you can solve both problems.
So I would reorder this kernel to first handle things in a row-major fashion. This solves the coalescing problem but introduces warp divergence.
If you're unable to re-order the data, I would then consider eliminating warp divergence by modifying the indexing scheme, so that even warps handle even elements, and odd warps handle odd elements.
This will eliminate warp divergence, but will break perfect coalescing again, but the caches should help with this issue. In the case of Fermi, the L1 cache should smooth over this pattern pretty well. I would then compare this case against the warp divergent case, to see which is faster.
Take into account that
sin(x) = cos(x - pi/2)
Accordingly, you can replace the if ... else branch with
tmp += sqrtf( cosf(tmp - (ix % 2) * pi/2) + 1.f );
avoiding branch divergence.
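Putting the row-major reordering suggested above together with this branch-free formula, a sketch of what the kernel could look like (the 32x8 block shape is just an example; CUDART_PI_F comes from math_constants.h):
#include <math_constants.h>   // for CUDART_PI_F

__global__ void kernel_opt( float *data, int x_dim, int y_dim )
{
    // row-major indexing: consecutive threads in x touch consecutive addresses (coalesced)
    int ix  = blockIdx.x*blockDim.x + threadIdx.x;
    int iy  = blockIdx.y*blockDim.y + threadIdx.y;
    if( ix >= x_dim || iy >= y_dim ) return;
    int idx = iy*x_dim + ix;

    float tmp = data[idx];
    // sin(x) = cos(x - pi/2): odd columns get sin, even columns get cos, without a branch
    tmp += sqrtf( cosf( tmp - (ix % 2) * (0.5f*CUDART_PI_F) ) + 1.f );
    data[idx] = tmp;
}

dim3 block( 32, 8 );
dim3 grid( 2048/32, 2048/8 );
kernel_opt<<<grid,block>>>( d_data, 2048, 2048 );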
If I were doing this, I would make the block sizes 16 x 16 or some other shape with a lower aspect ratio. I would use shared memory to grab 2 blocks worth of data (each idx grabs 2 elements from data, probably separated by blockDim.x elements), then have each block do its assigned "odd" rows followed by the "even" rows. You'll have to recompute ix, and iy, (and probably idx as well) and you'll use 1/2 as many blocks, but there should be coalesced memory access followed by non-divergent code.

Co-occurrence matrix creation with cuda

//This is my kernel function
__global__ void createSCM(Pixel* pixelMat,   //image
                          int imgRows,       //image dimensions
                          int imgCols,
                          int* matrizSCM,    //co-occurrence matrix
                          int numNiveles,    //co-occurrence matrix levels = 256
                          int delta_R,       //value = {-1, 0 or 1}
                          int delta_C)       //value = {-1, 0 or 1}
{
    int i = blockIdx.y*blockDim.y+threadIdx.y;
    int j = blockIdx.x*blockDim.x+threadIdx.x;
    int cols = numNiveles;
    int posx, posy;

    if ( (j + delta_C) < imgCols && (i + delta_R) < imgRows &&
         ((j + delta_C) >= 0) && ((i + delta_R) >= 0) )
    {
        posx = pixelMat[i*imgCols+j].channel_0;
        posy = pixelMat[(i + delta_R)*imgCols+(j + delta_C)].channel_0;

        matrizSCM[posx*cols+posy]++;
        matrizSCM[posy*cols+posx]++;
    }
}

struct Pixel {
    int channel_0;
};
I get counting errors in the co-occurrence matrix, because
pixelMat[i*imgCols+j] and pixelMat[(i + delta_R)*imgCols+(j + delta_C)]
access different positions from the same thread.
This is my kernel call:
int Grid_Dim_x = imagenTest.rows, Grid_Dim_y = imagenTest.cols;
int Block_Dim_x = 1, Block_Dim_y = 1;
dim3 Grid(Grid_Dim_x, Grid_Dim_y);
dim3 Block(Block_Dim_x, Block_Dim_y);
createSCM<<<Grid,Block>>>(...)
There is just one thread in each block, and each block represents a pixel.
Is there a nice solution to this problem?
Thanks :)
Reading from different memory cells of immutable input incurs no parallel hazard that you would have to deal with. The problem lies within the matrizSCM where the same memory cell can be incremented by multiple threads at once.
An atomicAdd(addr, 1) is a quick fix --- it should make the algorithm correct, but it may be fairly slow. Making it correct should be the first step; then you can look at available examples of histogram computation and parallel reduction on the web and check whether they can be applied to your problem.
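A sketch of that quick fix, touching only the two increments in the kernel above:
// replace the plain increments with atomic ones so concurrent threads
// updating the same cell of matrizSCM do not lose counts
atomicAdd(&matrizSCM[posx*cols+posy], 1);
atomicAdd(&matrizSCM[posy*cols+posx], 1);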
Finally, as Robert pointed out in the comment, launching just one thread in a block is very inefficient. You need a multiple of 32 to utilize the hardware SIMD unit, and usually about 256 threads to hide various memory latencies.
Also, if your image is big and you still need thousands of 256-thread blocks, you may consider launching fewer blocks (around 60-120) but having each block process multiple pixels sequentially. If you do that, you might be able to put a copy of matrizSCM in shared memory. This gives each block its own copy of matrizSCM, resulting in fewer atomic conflicts between blocks. Obviously, at the end of the kernel, each block still needs to "submit" its partial result into the global matrix, but that is a single additional step.