Karatsuba - polynomials multiplication with CUDA - c++

I'm using CUDA for the iterative Karatsuba algorithm, and I would like to ask why one line always computes a different result.
First, I implemented this function, which computed the result always correctly:
__global__ void kernel_res_main(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if( i > 0 && i < resultSize - 1){
        TYPE start = (i >= size) ? (i % size ) + 1 : 0;
        TYPE end = (i + 1) / 2;
        for(TYPE inner = start; inner < end; inner++){
            result[i] += ( A[inner] + A[i - inner] ) * ( B[inner] + B[i - inner] );
            result[i] -= ( D[inner] + D[i-inner] );
        }
    }
}
Now I would like to use the 2D grid and use CUDA for the for-loop, so I changed my function to this:
__global__ void kernel_res_nested(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    TYPE rtmp = result[i];
    if( i > 0 && i < resultSize - 1){
        TYPE start = (i >= size) ? (i % size ) + 1 : 0;
        TYPE end = (i + 1) >> 1;
        if(j >= start && j <= end ){
            // WRONG
            rtmp += ( A[j] + A[i - j] ) * ( B[j] + B[i - j] ) - ( D[j] + D[i - j] );
        }
    }
    result[i] = rtmp;
}
I am calling this function like this:
dim3 block( 32, 8 );
dim3 grid( (resultSize+1/32) , (resultSize+7/8) );
kernel_res_nested <<<grid, block>>> (devA, devB, devD, devResult, size, resultSize);
The result is always wrong and differs from run to run. I can't understand why the second implementation computes wrong results. I can't see any logical problem connected with a data dependency. Does anyone know how I can solve this problem?

For a question like this, you are supposed to provide an MCVE. For example, I don't know what type is indicated by TYPE, and it does matter for the correctness of the solution I will propose.
In your first kernel, only one thread in your entire grid was reading and writing location result[i]. But in your second kernel, you now have multiple threads writing to the result[i] location. They are conflicting with each other. CUDA doesn't specify the order in which threads will run, and some may run before, after, or at the same time as others. In this case, several threads may read the same old value of result[i], each add their own term to it, and then write it back, so whichever thread writes last overwrites the updates of the others. The outcome may vary from run to run. You have a race condition there (execution order dependency, not data dependency).
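The lost update can be reproduced on the host without any threads. The sketch below is plain C++, not CUDA. The function names are made up for illustration, TYPE is assumed to be int, and it replays just one possible ordering, the one in which every thread reads result[i] before any thread writes it back.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Replays one possible interleaving of the read-modify-write in
// kernel_res_nested for a single result[i]: every simulated thread reads the
// old value first, then all of them write back.
// Hypothetical helper, for illustration only.
int lost_update(int initial, const std::vector<int>& contributions) {
    assert(!contributions.empty());
    std::vector<int> rtmp(contributions.size());
    for (std::size_t t = 0; t < contributions.size(); ++t) {
        rtmp[t] = initial + contributions[t];  // rtmp = result[i]; rtmp += term;
    }
    int result = initial;
    for (std::size_t t = 0; t < rtmp.size(); ++t) {
        result = rtmp[t];                      // result[i] = rtmp; last writer wins
    }
    return result;
}

// What the single-threaded loop (or an atomic add) produces instead.
int accumulated(int initial, const std::vector<int>& contributions) {
    int result = initial;
    for (int c : contributions) {
        result += c;
    }
    return result;
}
```

With an initial value of 0 and the contributions {1, 2, 3}, lost_update returns 3 while accumulated returns 6. On a GPU the ordering is not fixed, which is why the wrong value also changes between runs.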
The canonical method to sort this out would be to employ a reduction technique.
However for simplicity, I will suggest that atomics could help you sort it out. This is easier to implement based on what you have shown, and will help confirm the race condition. After that, if you want to try a reduction method, there are plenty of tutorials for that (one is linked above) and plenty of questions here on the cuda tag about it.
You could modify your kernel to something like this, to sort out the race condition:
__global__ void kernel_res_nested(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    if( i > 0 && i < resultSize - 1){
        TYPE start = (i >= size) ? (i % size ) + 1 : 0;
        TYPE end = (i + 1) >> 1;
        if(j >= start && j < end ){ // see note below
            atomicAdd(result+i, (( A[j] + A[i - j] ) * ( B[j] + B[i - j] ) - ( D[j] + D[i - j] )));
        }
    }
}
Note that depending on your GPU type, and the actual type of TYPE you are using, this may not work (may not compile) as-is. But since you had previously used TYPE as a loop variable, I am assuming it is an integer type, and the necessary atomicAdd for those should be available.
A few other comments:
This may not be giving you the grid size you expect:
dim3 grid( (resultSize+1/32) , (resultSize+7/8) );
I think the usual calculations there would be:
dim3 grid( (resultSize+31)/32, (resultSize+7)/8 );
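To see why the two expressions differ, here is a small host-side sketch in plain C++. The helper names are hypothetical. Because 1/32 and 7/8 are integer divisions that evaluate to 0, the expressions from the question reduce to resultSize itself.

```cpp
#include <cassert>

// Number of blocks needed to cover n elements with blockSize threads each
// (integer ceiling division). Hypothetical helper name.
unsigned int ceil_div(unsigned int n, unsigned int blockSize) {
    assert(blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// The x expression from the question, kept as written: 1 / 32 is an integer
// division and evaluates to 0, so this simply returns n.
unsigned int grid_x_as_written(unsigned int n) {
    return n + 1 / 32;
}

// The y expression from the question: 7 / 8 is also 0.
unsigned int grid_y_as_written(unsigned int n) {
    return n + 7 / 8;
}
```

For resultSize = 100, ceil_div gives 4 blocks in x and 13 blocks in y, whereas the expressions as written ask for 100 blocks in each dimension.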
I always recommend proper CUDA error checking and running your codes with cuda-memcheck (compute-sanitizer in newer CUDA toolkits), any time you are having trouble with a CUDA code, to make sure there are no runtime errors.
It also looks to me like this:
if(j >= start && j <= end ){
should be this:
if(j >= start && j < end ){
to match your for-loop range. I am also making an assumption that size is less than resultSize (again, a MCVE would help).
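To check the atomic version against known-good values, a host-side reference of your first kernel can help. This is only a sketch in plain C++ under some assumptions of mine: TYPE is int, A, B and D each hold size elements, and resultSize is at most 2*size - 1. The name reference_res_main is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential host-side reference for kernel_res_main, assuming TYPE is int.
// result is updated in place, as in the kernel; its length plays the role of
// resultSize.
void reference_res_main(const std::vector<int>& A, const std::vector<int>& B,
                        const std::vector<int>& D, std::vector<int>& result,
                        int size) {
    assert(size > 0);
    const std::size_t n = static_cast<std::size_t>(size);
    assert(A.size() == n && B.size() == n && D.size() == n);
    assert(result.size() + 1 <= 2 * n);  // assumption: resultSize <= 2*size - 1
    const int resultSize = static_cast<int>(result.size());
    for (int i = 1; i < resultSize - 1; ++i) {
        const int start = (i >= size) ? (i % size) + 1 : 0;
        const int end = (i + 1) / 2;
        for (int inner = start; inner < end; ++inner) {
            result[i] += (A[inner] + A[i - inner]) * (B[inner] + B[i - inner]);
            result[i] -= (D[inner] + D[i - inner]);
        }
    }
}
```

For A = {1, 2}, B = {3, 4}, D = {3, 8} and a zero-initialized result of length 3, result[1] becomes 10, the middle coefficient of (1 + 2x)(3 + 4x). Copying the GPU result back and comparing it element by element with this reference should show whether the race is gone.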

how to further optimize this code using Openmp multithreading

I came across this code snippet and I'm trying to use OpenMP to make it run faster than the original version. However, it takes about the same amount of time as the older version. I'm not sure why this multithreading approach does not speed it up. What can I do to make it run faster?
void sobel(unsigned char *data_out,
           unsigned char *data_in, unsigned height,
           unsigned width)
{
    /* Sobel matrices for convolution */
    int sobelv[3][3] = { {-1, -2, -1}, {0, 0, 0}, {1, 2, 1} };
    int sobelh[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
    unsigned int size, i, j;
    int lay;
    size = height * width;
#ifdef OPENMP
#pragma omp parallel for collapse(64) shared (data_in,data_out,sobelv, sobelh,size) private (i,j,lay)
#endif
    for (lay = 0; lay < 3; lay++) {
        for (i = 1; i < height - 1; ++i) {
            for (j = 1; j < width - 1; j++) {
                int sumh, sumv;
                int k = -1, l = -1;
                sumh = 0;
                sumv = 0;
                /* Convolution part */
                for ( k = -1; k < 2; k++)
                    for (l = -1; l < 2; l++) {
                        sumh =
                            sumh + sobelh[k + 1][l + 1] *(int) data_in[lay * size + (i + k) * width +(j + l)];
                        sumv =
                            sumv + sobelv[k + 1][l +1] * (int) data_in[lay *size +(i +k) *width + (j +l)];
                    }
                int temp = abs(sumh / 8) + abs(sumv / 8);
                data_out[lay * size + i * width + j] =
                    (temp > 255? 255: temp);
            }
        }
    }
}
the main function is simply calling this function like this:
sobel(data_out, data_in, header.height, header.width);
any help would be appreciated!! :)
The best optimization you can apply is to vectorize the code. Compilers can often auto-vectorize code when it is sufficiently simple, but this one is too complex for most compilers (including GCC and Clang) to vectorize.
Manual code vectorization is cumbersome, error-prone and often makes the code (more) dependent on a specific architecture (e.g. x86-64). However, you can help the compiler to generate vectorized code for you. To do that, it is better to:
avoid mixing signed/unsigned types and type of different size;
use the smallest possible types fitting your needs;
avoid loops and conditions in the vectorized loop;
access data contiguously;
avoid integer multiplication/division with small types (on x86-64 and/or with some compilers);
prefer using local short-scoped variables when this is possible;
enable advanced optimizations like -O3 for GCC/Clang, possibly coupled with -mavx2 if your target platform supports the AVX-2 instruction set, or with -march=native if your target platform is the one where the program is built;
be careful about aliasing (possibly using temporary arrays, strict aliasing rules, memcpy calls, restrict compiler extensions, etc.) [thanks to @Laci].
You can check the generated assembly code to see if the code is vectorized or not.
Moreover, using collapse(2) should be enough here to get a good speed-up. collapse(3) can introduce some unwanted overhead due to the innermost loop being shared amongst threads. collapse(64) is not correct (the argument cannot be bigger than the number of nested loops).
Here is the resulting untested code:
#include <stdlib.h> /* abs() for integer types */
void sobel(unsigned char *data_out,
           unsigned char *data_in, int height,
           int width)
{
    const int size = height * width;
#ifdef OPENMP
#pragma omp parallel for collapse(2) shared(data_in,data_out,size)
#endif
    for (int lay = 0; lay < 3; lay++)
    {
        for (int i = 1; i < height - 1; ++i)
        {
            for (int j = 1; j < width - 1; j++)
            {
                short a11 = data_in[lay * size + (i-1) * width + (j-1)];
                short a12 = data_in[lay * size + (i-1) * width + j];
                short a13 = data_in[lay * size + (i-1) * width + (j+1)];
                short a21 = data_in[lay * size + i * width + (j-1)];
                short a23 = data_in[lay * size + i * width + (j+1)];
                short a31 = data_in[lay * size + (i+1) * width + (j-1)];
                short a32 = data_in[lay * size + (i+1) * width + j];
                short a33 = data_in[lay * size + (i+1) * width + (j+1)];
                short sumh = a13 - a11 + (a23 - a21) + (a23 - a21) + a33 - a31;
                short sumv = a31 + a32 + a32 + a33 - (a11 + a12 + a12 + a13);
                short temp = (abs(sumh) >> 3) + (abs(sumv) >> 3);
                data_out[lay * size + i * width + j] = (temp > 255? 255: temp);
            }
        }
    }
}
I expect the code to be several times faster (especially the sequential version), typically about 10 times faster with AVX-2 since the processor can work on 16 values at once (despite a bit more work related to SIMD instructions).
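Since the rewritten loop is untested, one cheap way to gain confidence is to compare its per-pixel arithmetic with the matrix form from the question on a single 3x3 neighbourhood. The following C++ sketch does that; the helper names and the flat 9-element neighbourhood layout (row-major) are my own choices for illustration, not part of either version of the code.

```cpp
#include <cassert>
#include <cstdlib>

// Matrix form from the question, applied to one 3x3 neighbourhood stored
// row-major in p[0..8]. Hypothetical helper, for checking only.
int sobel_pixel_matrix(const unsigned char* p) {
    static const int sobelv[3][3] = { {-1, -2, -1}, {0, 0, 0}, {1, 2, 1} };
    static const int sobelh[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
    int sumh = 0, sumv = 0;
    for (int k = 0; k < 3; ++k) {
        for (int l = 0; l < 3; ++l) {
            sumh += sobelh[k][l] * p[k * 3 + l];
            sumv += sobelv[k][l] * p[k * 3 + l];
        }
    }
    const int temp = std::abs(sumh / 8) + std::abs(sumv / 8);
    return temp > 255 ? 255 : temp;
}

// Unrolled form, mirroring the body of the rewritten loop.
int sobel_pixel_unrolled(const unsigned char* p) {
    const short a11 = p[0], a12 = p[1], a13 = p[2];
    const short a21 = p[3],             a23 = p[5];
    const short a31 = p[6], a32 = p[7], a33 = p[8];
    const short sumh = a13 - a11 + (a23 - a21) + (a23 - a21) + a33 - a31;
    const short sumv = a31 + a32 + a32 + a33 - (a11 + a12 + a12 + a13);
    const short temp = (std::abs(sumh) >> 3) + (std::abs(sumv) >> 3);
    return temp > 255 ? 255 : temp;
}

// Compares both forms on all 3^9 neighbourhoods built from three grey levels.
bool forms_agree() {
    const unsigned char levels[3] = {0, 128, 255};
    for (int code = 0; code < 19683; ++code) {
        unsigned char p[9];
        int c = code;
        for (int n = 0; n < 9; ++n) {
            p[n] = levels[c % 3];
            c /= 3;
        }
        if (sobel_pixel_matrix(p) != sobel_pixel_unrolled(p)) {
            return false;
        }
    }
    return true;
}
```

This only checks the arithmetic of one output pixel, not the indexing or the OpenMP pragma, so it complements rather than replaces a comparison of full output images.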
Another possible optimization you can do is called register blocking. The idea is to change the loop so that you work on small fixed-size tiles (e.g. 2x2 or 4x2 SIMD values). This should reduce the number of L1-cache loads and the number of char-to-short/short-to-char conversions to perform. However, it is hard to steer the compiler into doing this optimization correctly on such code. It is probably better to use SIMD intrinsics if performance is critical and do the register blocking yourself.

Need help understanding how to work with 2D/3D glyphs

Here's the code snippet I'd like help understanding
for (i = 0; i < samplesX; i++)
    for (j = 0; j < samplesY; j++)
    {
        newI = DIM * i / samplesX;
        newJ = DIM * j / samplesY;
        idx = (round(newJ) * DIM) + round(newI);
        if (color_dir == 1 && draw_vecs == 1) {
            direction_to_color(vx[idx], vy[idx], color_dir);
        }
        if (color_dir == 1 && draw_vecs == 2) {
            direction_to_color(fx[idx], fy[idx], color_dir);
        }
        else if (color_dir == 2) {
            scalar = rho[idx];
            set_colormap(scalar, min, max, clampLow, clampHigh);
        }
        else if (color_dir == 3) {
            scalar = sqrt(vx[idx] * vx[idx] + vy[idx] * vy[idx]);
            set_colormap(scalar, min, max, clampLow, clampHigh);
        }
        else if (color_dir == 4) {
            scalar = sqrt(fx[idx] * fx[idx] + fy[idx] * fy[idx]);
            set_colormap(scalar, min, max, clampLow, clampHigh);
        }
        /*if (draw_vecs == 1) {
            glVertex2f(wn + (fftw_real)newI * wn, hn + (fftw_real)newJ * hn);
            glVertex2f((wn + (fftw_real)newI * wn) + vec_scale * vx[idx], (hn + (fftw_real)newJ * hn) + vec_scale * vy[idx]);
        }
        else if (draw_vecs == 2) {
            glVertex2f(wn + (fftw_real)newI * wn, hn + (fftw_real)newJ * hn);
            glVertex2f((wn + (fftw_real)newI * wn) + vec_scale * fx[idx], (hn + (fftw_real)newJ * hn) + vec_scale * fy[idx]);
        }*/
        if (draw_vecs == 1) {
            glVertex2f(wn + (fftw_real)i * wn, hn + (fftw_real)j * hn);
            glVertex2f((wn + (fftw_real)i * wn) + vec_scale * vx[idx], (hn + (fftw_real)j * hn) + vec_scale * vy[idx]);
        }
        else if (draw_vecs == 2) {
            glVertex2f(wn + (fftw_real)i * wn, hn + (fftw_real)j * hn);
            glVertex2f((wn + (fftw_real)i * wn) + vec_scale * fx[idx], (hn + (fftw_real)j * hn) + vec_scale * fy[idx]);
        }
    }
glEnd();
}
What this currently does, as far as I understand, is display two-dimensional lines/arrows (hedgehogs) that visualize force/velocity in 2D.
Sadly, my understanding of linear algebra, calculus and computer graphics in general only goes so far and I'm having trouble dissecting this piece.
Ideally I'd like to understand this and also understand how I can take this pre-existing code and also add in functionality that can display two other glyph types that show a vector and/or scalar field such as
three-dimensional cones
three-dimensional ellipsoids
If I'm missing anything here, please let me know!
Some of the variables included in the above snippet:
const int DIM = 50; //size of simulation grid
int color_dir = 0; //use direction color-coding or not
float scalar;
int newI, newJ;
float temp;
float vec_scale = 1000; //scaling of hedgehogs
int draw_vecs = 1; //draw the vector field or not
The code snippet you have there could have been written more simply (it also takes some educated guessing to work out what some of the variables and functions mean).
Let's break it down.
The first two lines are easy to understand; they're the standard stanza to iterate over a 2D array
for (i = 0; i < samplesX; i++)
for (j = 0; j < samplesY; j++)
i and j are running indices that iterate over every discrete coordinate tuple (i,j) ∈ [0, samplesX) × [0, samplesY). The next two lines remap the 2D indices into a new value range, specifically [0, samplesX)×[0, samplesY) → [0, DIM)×[0, DIM). A missing piece of information is what type DIM has. It would make sense for it to be some floating point type.
newI = DIM * i / samplesX;
newJ = DIM * j / samplesY;
The next line is bug-prone. It translates newI and newJ into a running 1D index into a flat array that is logically addressed by two indices.
Why is this problematic? Because information may have been lost in the conversion to DIM-space. This kind of information loss may lead to security bugs(!). As a matter of fact, Skia, the rendering library used by Google Chrome, Android and other projects, had exactly this kind of bug recently; the writeup is a worthwhile read: https://googleprojectzero.blogspot.com/2019/02/the-curious-case-of-convexity-confusion.html
The correct way to implement this is to have DIM be an integer and perform fixed point arithmetic on it, eventually truncating the fractional digits. But I digress. The next block is essentially performing a poor man's lookup table lookup. vx, vy and fx, fy are flattened 2D arrays, accessed through a 1D index, and direction_to_color presumably maps the value to a call of glColor; the same probably also goes for set_colormap. This is a bad use of OpenGL.
The whole remapping from i and j to DIM and the subsequent lookups are just a poor implementation of a texture lookup. OpenGL already has textures. Just load it as a texture coordinate array and enable texturing.
Finally, for each spine, two calls of glVertex are made: one with the starting point, which lies on the grid position (wn + i*wn, hn + j*hn), and one with the end point, which is offset from it by vec_scale times the vector looked up at idx.
My verdict of that code: Utter garbage! All of this could have been done far more elegantly, even back in 1994 with OpenGL-1.0, which this code seems to have been written for. If you want to implement your own vector field plot, don't use this as a starting point.
These days we have programmable GPUs with shaders. All of that bulk up there can be done in a few lines of shader code.
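Coming back to the index computation: a minimal sketch of an all-integer remapping, in plain C++. The helper names sample_to_cell and flat_index are made up for illustration, and I am assuming that DIM, samplesX and samplesY are small positive ints, so that the product does not overflow.

```cpp
#include <cassert>

// Maps a sample index in [0, samples) to a grid cell in [0, dim) using only
// integer arithmetic. The division truncates, so the result never reaches dim.
// Hypothetical helper name.
int sample_to_cell(int sample, int samples, int dim) {
    assert(samples > 0 && dim > 0);
    assert(sample >= 0 && sample < samples);
    return (dim * sample) / samples;
}

// Row-major index into a flattened dim x dim array such as vx, vy or rho.
// Hypothetical helper name.
int flat_index(int cellI, int cellJ, int dim) {
    assert(cellI >= 0 && cellI < dim);
    assert(cellJ >= 0 && cellJ < dim);
    return cellJ * dim + cellI;
}
```

With DIM = 50 and 20 samples per axis, sample 19 lands in cell 47, and flat_index(47, 47, 50) is 2397, which stays inside the 2500-element array. The assertions make an out-of-range index fail loudly instead of silently reading past the end.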

Cuda global to shared memory and constant memory

I just started learning cuda and I'm having an issue converting some code to use shared memory and another to use constant memory, for comparison purposes.
__global__ void CUDA(int *device_array_Image1, int *device_array_Image2,int *device_array_Image3, int *device_array_kernel, int *device_array_Result1,int *device_array_Result2,int *device_array_Result3){
    int i = blockIdx.x;
    int j = threadIdx.x;
    int ArraySum1 = 0 ; // set sum = 0 initially
    int ArraySum2 = 0 ;
    int ArraySum3 = 0 ;
    for (int N = -1 ; N <= 1 ; N++)
    {
        for (int M = -1 ; M <= 1 ; M++)
        {
            ArraySum1 = ArraySum1 + (device_array_Image1[(i + N) * Image_Size + (j + M)]* device_array_kernel[(N + 1) * 3 + (M + 1)]);
            ArraySum2 = ArraySum2 + (device_array_Image2[(i + N) * Image_Size + (j + M)]* device_array_kernel[(N + 1) * 3 + (M + 1)]);
            ArraySum3 = ArraySum3 + (device_array_Image3[(i + N) * Image_Size + (j + M)]* device_array_kernel[(N + 1) * 3 + (M + 1)]);
        }
    }
    device_array_Result1[i * Image_Size + j] = ArraySum1;
    device_array_Result2[i * Image_Size + j] = ArraySum2;
    device_array_Result3[i * Image_Size + j] = ArraySum3;
}
This is what I have done so far, but I'm having trouble understanding shared and constant memory. If anyone could help with the code or point me in the right direction, I'd be really grateful.
Thanks for any help.
a) Shared memory: This memory is visible only to the threads of one block. It is useful if that block accesses the same data more than once. So for squaring a number it will not be useful, but for matrix multiplication it is.
b) Constant memory: Data is stored in device global memory and is read through the multiprocessor's constant cache. The device has 64 KB of constant memory, and each multiprocessor has a small constant cache (8 KB on many GPUs). Data is broadcast to all threads in a warp, so if all the threads in the warp request the same value, that value is delivered in a single cycle.
Below links helped me in understanding constant and shared memory
1) http://cuda-programming.blogspot.in/2013/01/what-is-constant-memory-in-cuda.html
2) http://cuda-programming.blogspot.in/2013/01/shared-memory-and-synchronization-in.html
3) https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
Please refer to these links.

Is something like this possible in CUDA

Let's say I have a matrix with values of 0 or 1. Is it possible in CUDA to do something like this:
__global__ void kernel(float *matrix, float *count)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int column = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= MATRIXSIZE || column >= MATRIXSIZE)
    {
        return;
    }
    if (matrix[MATRIXSIZE * row + column] == 1)
    {
        count[0]++;
    }
}
So that in the end I get the number of ones in the matrix. I know this is a very simple example, but if this is possible, then other variants should be too ...
There are highly optimized libraries for CUDA that perform these types of operations, called reductions. Look into CUDA Thrust or CUB. In Thrust, you can use reduce to sum up all the values or count to count number of instances of a particular value.
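If you really want to do this yourself, you should use an atomic add, atomicAdd. Since count is a float pointer in your kernel, that would be atomicAdd(&count[0], 1.0f) in place of count[0]++.
But this may cause performance issues, because every thread that finds a 1 updates the same address.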

OpenCV Python binds incredibly slow iterations through image data

I recently took some code that tracked an object based on color in OpenCV c++ and rewrote it in the python bindings.
The overall results and method were the same, minus syntax obviously. But when I run the code below on each frame of a video, it takes almost 2-3 seconds to complete, whereas the c++ variant, also below, is instant in comparison and I can iterate between frames as fast as my finger can press a key.
Any ideas or comments?
cv.PyrDown(img, dsimg)
for i in range( 0, dsimg.height ):
    for j in range( 0, dsimg.width):
        if dsimg[i,j][1] > ( _RED_DIFF + dsimg[i,j][2] ) and dsimg[i,j][1] > ( _BLU_DIFF + dsimg[i,j][0] ):
            res[i,j] = 255
        else:
            res[i,j] = 0
for( int i =0; i < (height); i++ )
{
    for( int j = 0; j < (width); j++ )
    {
        if( ( (data[i * step + j * channels + 1]) > (RED_DIFF + data[i * step + j * channels + 2]) ) &&
            ( (data[i * step + j * channels + 1]) > (BLU_DIFF + data[i * step + j * channels]) ) )
            data_r[i *step_r + j * channels_r] = 255;
        else
            data_r[i * step_r + j * channels_r] = 0;
    }
}
Thanks
Try using numpy to do your calculation, rather than nested loops. You should get C-like performance for simple calculations like this from numpy.
For example, your nested for loops can be replaced with a couple of numpy expressions...
I'm not terribly familiar with opencv, but I think the python bindings now have a numpy array interface, so your example above should be as simple as:
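import numpy as np
cv.PyrDown(img, dsimg)
# Widen to a signed type so that adding the thresholds cannot wrap around in uint8
data = np.asarray(dsimg).astype(np.int16)
# Split the channels along the last axis (B, G, R order, as in the loops above)
blue, green, red = data[..., 0], data[..., 1], data[..., 2]
res = (green > (_RED_DIFF + red)) & (green > (_BLU_DIFF + blue))
res = res.astype(np.uint8) * 255
res = cv.fromarray(res)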
(Completely untested, of course...) Again, I'm really not terribly familiar with opencv, but nested python for loops are not the way to go about modifying an image element-wise, regardless...
Hope that helps a bit, anyway!