CUDA combining thread independent(??) variables during execution - c++

Guys I apologize if the title is confusing. I though long and hard and couldn't come up with proper way to phrase the question in a single line. So here's more detail. I am doing a basic image subtraction where the second image has been modified and I need to find the ratio of how much change was done to the image. for this I used the following code. Both images are 128x1024.
for(int i = 0; i < 128; i++)
{
for(int j = 0; j < 1024; j++)
{
den++;
diff[i * 1024 + j] = orig[i * 1024 + j] - modified[i * 1024 + j];
if(diff[i * 1024 + j] < error)
{
num++;
}
}
}
ratio = num/den;
The above code works fine on the CPU but I want to try to do this on CUDA. For this I can setup CUDA to do the basic subtraction of the images (code below) but I can't figure out how to do the conditional if statement to get my ratio out.
__global__ void calcRatio(float *orig, float *modified, int size, float *result)
{
int index = threadIdx.x + blockIdx.x * blockDim.x;
if(index < size)
result[index] = orig[index] - modified[index];
}
So, up to this point it works but I cannot figure out how to parrallelize the num and den counters in each thread to calculate the ratio at the end of all the thread executions. To me it feels like the num and den counders are independent of the threads as every time I have tried to use them it seems they get incremented only once.
Any help will be appreciated as I am just starting out in CUDA and every example I see online never seems to apply to what I need to do.
EDIT: Fixed my naive code. Forgot to type one of the main condition in the code. It was a long long day.
for(int i = 0; i < 128; i++)
{
for(int j = 0; j < 1024; j++)
{
if(modified[i * 1024 + j] < 400.0) //400.0 threshold value to ignore noise
{
den++;
diff[i * 1024 + j] = orig[i * 1024 + j] - modified[i * 1024 + j];
if(diff[i * 1024 + j] < error)
{
num++;
}
}
}
}
ratio = num/den;

The operation you need to use to perform global summation across all the threads is known as a "parallel reduction". While you could use atomic operations to do this, I would not recommend it. There is a reduction kernel and a very good paper discussing the technique in the CUDA SDK, it is worth reading.
If I were writing code to do what you want, it would probably look like this:
template <int blocksize>
__global__ void calcRatio(float *orig, float *modified, int size, float *result,
int *count, const float error)
{
__shared__ volatile float buff[blocksize];
int index = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
int count = 0;
for(int i=index; i<n; i+=stride) {
val = orig[index] - modified[index];
count += (val < error);
result[index] = val;
}
buff[threadIdx.x] = count;
__syncthreads();
// Parallel reduction in shared memory using 1 warp
if (threadId.x < warpSize) {
for(int i=threadIdx.x + warpSize; i<blocksize; i+= warpSize) {
buff[threadIdx.x] += buff[i];
if (threadIdx.x < 16) buff[threadIdx.x] +=buff[threadIdx.x + 16];
if (threadIdx.x < 8) buff[threadIdx.x] +=buff[threadIdx.x + 8];
if (threadIdx.x < 4) buff[threadIdx.x] +=buff[threadIdx.x + 4];
if (threadIdx.x < 2) buff[threadIdx.x] +=buff[threadIdx.x + 2];
if (threadIdx.x == 0) count[blockIdx.x] = buff[0] + buff[1];
}
}
The first stanza does what your serial code does - computes a difference and a thread local total of elements which are less than error. Note I have written this version so that each thread is designed to process more than one entry of the input data. This has been done to help offset the computational cost of the parallel reduction that follows, and the idea is that you would use fewer blocks and threads than there were input data set entries.
The second stanza is the reduction itself, done in shared memory. It is effectively a "tree like" operation where the size of the set of thread local subtotals within a single block of threads is first summed down to 32 subtotals, then the subtotals are combined until there is the final subtotal for the block, and that is then stored is the total for the block. You will wind up with a small list of sub totals in count, one for each block you launched, which can be copied back to the host and the final result you need calculated there.
Please note I coded this in the browser and haven't compiled it, there might be errors, but it should give an idea about how an "advanced" version of what you are trying to do would work.

The denominator is pretty simple, since it's just the size.
The numerator is more troublesome, since its value for a given thread depends on all previous values. You're going to have to do that operation serially.
The thing you're looking for is probably atomicAdd. It's very slow, though.
I think you'd find this question relevant. Your num is basically global data.
CUDA array-to-array sum
Alternatively, you could dump the results of the error check into an array. Counting the results could then be parallelized. It would be a little tricky, but I think something like this would scale up: http://tekpool.wordpress.com/2006/09/25/bit-count-parallel-counting-mit-hakmem/

Related

Optimizing MatMult with AVX

I decided to play a little bit with AVX. For this reason I wrote a simple matrix multiplication "benchmark code" and started applying some optimizations to it - just to see how fast I can make it. Below is my naive implementation, followed by the simplest AVX one I could think of:
void mmult_naive()
{
int i, j, k = 0;
// Traverse through each row element of Matrix A
for (i = 0; i < SIZE; i++) {
// Traverse through each column element of Matrix B
for (j = 0; j < SIZE; j++) {
for (k = 0; k < SIZE; k++) {
matrix_C[i][j] += matrix_A[i][k] * matrix_B[k][j];
}
}
}
}
AVX:
void mmult_avx_transposed()
{
__m256 row_vector_A;
__m256 row_vector_B;
int i, j, k = 0;
__m256 int_prod;
// Transpose matrix B
transposeMatrix();
for (i = 0; i < SIZE; i++) {
for (j = 0; j < SIZE; j++) {
int_prod = _mm256_setzero_ps();
for (k = 0; k < (SIZE / 8); k++) {
row_vector_A = _mm256_load_ps(&matrix_A[i][k * 8]);
row_vector_B = _mm256_load_ps(&T_matrix[j][k * 8]);
int_prod = _mm256_fmadd_ps(row_vector_A, row_vector_B, int_prod);
}
matrix_C[i][j] = hsum_single_avx(int_prod);
}
}
}
I chose to transpose the second matrix to make it easier to load the values from the memory to the vector registers. This part works fine, gives all the nice
expected speed up and makes me happy.
While measuring the execution time for some larger matrix sizes (NxN matrices, N>1024) I thought the transpose might not be necessary if I find a "smarter" way to access the elements. The transpose function itself was roughly a 4-5% of the execution time, so it looked like a low-hanging fruit.
I replaced the second _mm256_load_ps with the following line and got rid of the transposeMatrix():
row_vector_B = _mm256_setr_ps(matrix_B[k * 8][j], matrix_B[(k * 8) + 1][j], matrix_B[(k * 8) + 2][j], matrix_B[(k * 8) + 3][j],
matrix_B[(k * 8) + 4][j], matrix_B[(k * 8) + 5][j], matrix_B[(k * 8) + 6][j], matrix_B[(k * 8) + 7][j]);
But now the code runs even worse! The results I got are the following:
MMULT_NAIVE execution time: 195,499 us
MMULT_AVX_TRANSPOSED execution time: 127,802 us
MMULT_AVX_INDEXED execution time: 1,482,524 us
I wanted to see if I had better luck with clang but it only made things worse:
MMULT_NAIVE execution time: 2,027,125 us
MMULT_AVX_TRANSPOSED execution time: 125,781 us
MMULT_AVX_INDEXED execution time: 1,798,410 us
My questions are in fact two:
Why the indexed part runs slower?
What's going on with clang? Apparently, even the "slow version" is much slower.
Everything was compiled with -O3, -mavx2 and -march=native on an i7-8700, with Arch Linux. g++ was in version 12.1.0 and clang in version 14.0.6.

Kernels Synchronisation

I'm new to Cuda programming and I'm implementing the classical Floyd APSP Algorithm. This algorithm consists in 3 nested loops and all the code inside the two inner loops can be executed in parallel.
As main parts of my code, here is the kernel code:
__global__ void dfloyd(double *dM, size_t k, size_t n)
{
unsigned int x = threadIdx.x + blockIdx.x * blockDim.x;
unsigned int y = threadIdx.y + blockIdx.y * blockDim.y;
unsigned int index = y * n + x;
double d;
if (x < n && y < n)
{
d=dM[x+k*n] + dM[k+y*n];
if (d<dM[index])
dM[index]=d;
}
}
and here is the part from the main function where the kernels are launched (for readability I omitted error handling code):
double *dM;
cudaMalloc((void **)&dM, sizeof_M);
cudaMemcpy(dM, hM, sizeof_M, cudaMemcpyHostToDevice);
int dimx = 32;
int dimy = 32;
dim3 block(dimx, dimy);
dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
for (size_t k=0; k<n; k++)
{
dfloyd<<<grid, block>>>(dM, k, n);
cudaDeviceSynchronize();
}
cudaMemcpy(hM, dM, sizeof_M, cudaMemcpyDeviceToHost);
[For the understanding, dM is referring to the distance matrix stored in the device side and hM in the host side and n is referring to the number of nodes.]
Kernels inside the k-loop have to be executed serially, this explains why I write the cudaDeviceSynchronize() instruction after each kernel execution.
However, I notice that putting this synchro instruction outside the loop leads to the same result.
Now, my question. Do the two following pieces of code
for (size_t k=0; k<n; k++)
{
dfloyd<<<grid, block>>>(dM, k, n);
cudaDeviceSynchronize();
}
and
for (size_t k=0; k<n; k++)
{
dfloyd<<<grid, block>>>(dM, k, n);
}
cudaDeviceSynchronize();
are equivalent?
They are not equivalent but will give the same results. The first one will make the host wait after each kernel call until the kernel has returned, while the other one will make it wait only once.
Maybe the confusing part is why does it work; in CUDA, two consecutive kernel calls on the same stream (in your case, default stream) are guaranteed to be executed serially.
Performance wise, it is advised to use the second version, as synchronisation with the host adds overhead.
Edit: in that specific case, you do not even need to call cudaDeviceSynchronize() because the cudaMemcpy will synchronize.

'Official' CUDA Reduction function can't accept certain numbers?

currently trying to use the Reduction #3 outline in the CUDA pdf here.
Here is how my Reduction function looks
template <typename T>
__device__ void offsetReduction(planet<T> *bodies, T *outdata, int arrayIdent, int nbodies){
extern __shared__ T sdata[];
unsigned int tID = threadIdx.x;
unsigned int i = tID + blockIdx.x * blockDim.x;
if (arrayIdent == 1){
if (i < nbodies){
sdata[tID] = bodies[i].vx * bodies[i].mass;
}
__syncthreads();
}
if (arrayIdent == 2){
if (i < nbodies){
sdata[tID] = (bodies[i].vy * bodies[i].mass);
}
__syncthreads();
}
if (arrayIdent == 3){
if (i < nbodies){
sdata[tID] = (bodies[i].vz * bodies[i].mass);
}
__syncthreads();
}
for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>=1)
{
if (tID < stride)
{
sdata[tID] += sdata[tID + stride];
}
__syncthreads();
}
if (tID == 0)
{
outdata[blockIdx.x] = sdata[0];
}
However, it doesn't seem to be working correctly so I did some calculations.
I launch the same number of threads as 'int nbodies', and in my case I have chosen 5. So each of the 5 threads comes in and adds a value to sdata[] no problem. However once it gets to the addition part it goes wrong.
On the first iteration Thread 0 accesses sdata[3], Thread 1 accesses sdata[4] and the other threads do nothing. On the second iteration Thread 0 accesses sdata1 and the other threads do nothing. The addition is then 'finished' and the kernel finishes. But sdata[2] is never added so I get an incorrect value stored at sdata[0].
Am I missing something really obvious? (I have been staring at this for a while so I probably have.
This reduction code, like any other "tree-like" reduction operation, requires that the number of threads that participate in the shared memory reduction be equal to a power of 2 to work correctly.
Note that means you could design a reduction kernel which would run correctly for any multiple of 2 threads per block by having the nearest smaller power of 2 threads perform the actual reduction. The code you have posted cannot, however, work like that.

Weird but close fft and ifft of image in c++

I wrote a program that loads, saves, and performs the fft and ifft on black and white png images. After much debugging headache, I finally got some coherent output only to find that it distorted the original image.
input:
fft:
ifft:
As far as I have tested, the pixel data in each array is stored and converted correctly. Pixels are stored in two arrays, 'data' which contains the b/w value of each pixel and 'complex_data' which is twice as long as 'data' and stores real b/w value and imaginary parts of each pixel in alternating indices. My fft algorithm operates on an array structured like 'complex_data'. After code to read commands from the user, here's the code in question:
if (cmd == "fft")
{
if (height > width) size = height;
else size = width;
N = (int)pow(2.0, ceil(log((double)size)/log(2.0)));
temp_data = (double*) malloc(sizeof(double) * width * 2); //array to hold each row of the image for processing in FFT()
for (i = 0; i < (int) height; i++)
{
for (j = 0; j < (int) width; j++)
{
temp_data[j*2] = complex_data[(i*width*2)+(j*2)];
temp_data[j*2+1] = complex_data[(i*width*2)+(j*2)+1];
}
FFT(temp_data, N, 1);
for (j = 0; j < (int) width; j++)
{
complex_data[(i*width*2)+(j*2)] = temp_data[j*2];
complex_data[(i*width*2)+(j*2)+1] = temp_data[j*2+1];
}
}
transpose(complex_data, width, height); //tested
free(temp_data);
temp_data = (double*) malloc(sizeof(double) * height * 2);
for (i = 0; i < (int) width; i++)
{
for (j = 0; j < (int) height; j++)
{
temp_data[j*2] = complex_data[(i*height*2)+(j*2)];
temp_data[j*2+1] = complex_data[(i*height*2)+(j*2)+1];
}
FFT(temp_data, N, 1);
for (j = 0; j < (int) height; j++)
{
complex_data[(i*height*2)+(j*2)] = temp_data[j*2];
complex_data[(i*height*2)+(j*2)+1] = temp_data[j*2+1];
}
}
transpose(complex_data, height, width);
free(temp_data);
free(data);
data = complex_to_real(complex_data, image.size()/4); //tested
image = bw_data_to_vector(data, image.size()/4); //tested
cout << "*** fft success ***" << endl << endl;
void FFT(double* data, unsigned long nn, int f_or_b){ // f_or_b is 1 for fft, -1 for ifft
unsigned long n, mmax, m, j, istep, i;
double wtemp, w_real, wp_real, wp_imaginary, w_imaginary, theta;
double temp_real, temp_imaginary;
// reverse-binary reindexing to separate even and odd indices
// and to allow us to compute the FFT in place
n = nn<<1;
j = 1;
for (i = 1; i < n; i += 2) {
if (j > i) {
swap(data[j-1], data[i-1]);
swap(data[j], data[i]);
}
m = nn;
while (m >= 2 && j > m) {
j -= m;
m >>= 1;
}
j += m;
};
// here begins the Danielson-Lanczos section
mmax = 2;
while (n > mmax) {
istep = mmax<<1;
theta = f_or_b * (2 * M_PI/mmax);
wtemp = sin(0.5 * theta);
wp_real = -2.0 * wtemp * wtemp;
wp_imaginary = sin(theta);
w_real = 1.0;
w_imaginary = 0.0;
for (m = 1; m < mmax; m += 2) {
for (i = m; i <= n; i += istep) {
j = i + mmax;
temp_real = w_real * data[j-1] - w_imaginary * data[j];
temp_imaginary = w_real * data[j] + w_imaginary * data[j-1];
data[j-1] = data[i-1] - temp_real;
data[j] = data[i] - temp_imaginary;
data[i-1] += temp_real;
data[i] += temp_imaginary;
}
wtemp = w_real;
w_real += w_real * wp_real - w_imaginary * wp_imaginary;
w_imaginary += w_imaginary * wp_real + wtemp * wp_imaginary;
}
mmax=istep;
}}
My ifft is the same only with the f_or_b set to -1 instead of 1. My program calls FFT() on each row, transposes the image, calls FFT() on each row again, then transposes back. Is there maybe an error with my indexing?
Not an actual answer as this question is Debug only so some hints instead:
your results are really bad
it should look like this:
first line is the actual DFFT result
Re,Im,Power is amplified by a constant otherwise you would see a black image
the last image is IDFFT of the original not amplified Re,IM result
the second line is the same but the DFFT result is wrapped by half size of image in booth x,y to match the common results in most DIP/CV texts
As you can see if you IDFFT back the wrapped results the result is not correct (checker board mask)
You have just single image as DFFT result
is it power spectrum?
or you forget to include imaginary part? to view only or perhaps also to computation somewhere as well?
is your 1D **DFFT working?**
for real data the result should be symmetric
check the links from my comment and compare the results for some sample 1D array
debug/repair your 1D FFT first and only then move to the next level
do not forget to test Real and complex data ...
your IDFFT looks BW (no gray) saturated
so did you amplify the DFFT results to see the image and used that for IDFFT instead of the original DFFT result?
also check if you do not round to integers somewhere along the computation
beware of (I)DFFT overflows/underflows
If your image pixel intensities are big and the resolution of image too then your computation could loss precision. Newer saw this in images but if your image is HDR then it is possible. This is a common problem with convolution computed by DFFT for big polynomials.
Thank you everyone for your opinions. All that stuff about memory corruption, while it makes a point, is not the root of the problem. The sizes of data I'm mallocing are not overly large, and I am freeing them in the right places. I had a lot of practice with this while learning c. The problem was not the fft algorithm either, nor even my 2D implementation of it.
All I missed was the scaling by 1/(M*N) at the very end of my ifft code. Because the image is 512x512, I needed to scale my ifft output by 1/(512*512). Also, my fft looks like white noise because the pixel data was not rescaled to fit between 0 and 255.
Suggest you look at the article http://www.yolinux.com/TUTORIALS/C++MemoryCorruptionAndMemoryLeaks.html
Christophe has a good point but he is wrong about it not being related to the problem because it seems that in modern times using malloc instead of new()/free() does not initialise memory or select best data type which would result in all problems listed below:-
Possibly causes are:
Sign of a number changing somewhere, I have seen similar issues when a platform invoke has been used on a dll and a value is passed by value instead of reference. It is caused by memory not necessarily being empty so when your image data enters it will have boolean maths performed on its values. I would suggest that you make sure memory is empty before you put your image data there.
Memory rotating right (ROR in assembly langauge) or left (ROL) . This will occur if data types are being used which do not necessarily match, eg. a signed value entering an unsigned data type or if the number of bits is different in one variable to another.
Data being lost due to an unsigned value entering a signed variable. Outcomes are 1 bit being lost because it will be used to determine negative or positive, or at extremes if twos complement takes place the number will become inverted in meaning, look for twos complement on wikipedia.
Also see how memory should be cleared/assigned before use. http://www.cprogramming.com/tutorial/memory_debugging_parallel_inspector.html

CUDA - Optimize mean of matrix rows calculation using shared memory

I am trying to optimize the computation of the mean of each row in my 512w x 1024h image, and then subtract the mean from the row from which it was computed. I wrote a piece of code which does it in 1.86 ms, but I want to reduce the speed. This piece of code works fine, but does not use shared memory, and it utilizes for loops. I want to do away with them.
__global__ void subtractMean (const float *__restrict__ img, float *lineImg, int height, int width) {
// height = 1024, width = 512
int tidy = threadIdx.x + blockDim.x * blockIdx.x;
float sum = 0.0f;
float sumDiv = 0.0f;
if(tidy < height) {
for(int c = 0; c < width; c++) {
sum += img[tidy*width + c];
}
sumDiv = (sum/width)/2;
//__syncthreads();
for(int cc = 0; cc < width; cc++) {
lineImg[tidy*width + cc] = img[tidy*width + cc] - sumDiv;
}
}
__syncthreads();
I called the above kernel using:
subtractMean <<< 2, 512 >>> (originalImage, rowMajorImage, actualImHeight, actualImWidth);
However, the following code I wrote uses shared memory to optimize. But, it does not work as expected. Any thoughts on what the problem might be?
__global__ void subtractMean (const float *__restrict__ img, float *lineImg, int height, int width) {
extern __shared__ float perRow[];
int idx = threadIdx.x; // set idx along x
int stride = width/2;
while(idx < width) {
perRow[idx] = 0;
idx += stride;
}
__syncthreads();
int tidx = threadIdx.x; // set idx along x
int tidy = blockIdx.x; // set idx along y
if(tidy < height) {
while(tidx < width) {
perRow[tidx] = img[tidy*width + tidx];
tidx += stride;
}
}
__syncthreads();
tidx = threadIdx.x; // reset idx along x
tidy = blockIdx.x; // reset idx along y
if(tidy < height) {
float sumAllPixelsInRow = 0.0f;
float sumDiv = 0.0f;
while(tidx < width) {
sumAllPixelsInRow += perRow[tidx];
tidx += stride;
}
sumDiv = (sumAllPixelsInRow/width)/2;
tidx = threadIdx.x; // reset idx along x
while(tidx < width) {
lineImg[tidy*width + tidx] = img[tidy*width + tidx] - sumDiv;
tidx += stride;
}
}
__syncthreads();
}
The shared memory function was called using:
subtractMean <<< 1024, 256, sizeof(float)*512 >>> (originalImage, rowMajorImage, actualImHeight, actualImWidth);
2 blocks is hardly enough to saturate GPU use. You are going towards the right approach with utilizing more blocks, however, you are using Kepler and I would like to present an option that does not use shared memory at all.
Start with 32 threads in a block (this can be changed later using 2D blocks)
With those 32 threads you should do something along the lines of this:
int rowID = blockIdx.x;
int tid = threadIdx.x;
int stride= blockDim.x;
int index = threadIdx.x;
float sum=0.0;
while(index<width){
sum+=img[width*rowID+index];
index+=blockDim.x;
}
at this point you will have 32 threads that have a partial sum in each of them. You next need to add them all together. You can do this without the use of shared memory (since we are within a warp) by utilizing a shuffle reduction. For details on that look here: http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/ what you want is the shuffle warp reduce, but you need to change it to use the full 32 threads.
Now that thread 0 in each warp has the sum of every row, you can divide that by the width cast to a float, and broadcast it to the rest of the warp using shfl using shfl(average, 0);. http://docs.nvidia.com/cuda/cuda-c-programming-guide/#warp-description
With the average found and the warps synchronized implicitly and explicitly (with shfl), you can continue on in a similar method with the subtract.
Possible further optimizations would be to include more than one warp in a block to improve occupancy, and to manually unroll the loops over the width to improve instruction level parallelism.
Good Luck.