Optimizing the CUDA kernel with stencil pattern - c++

In the process of writing a program for processing digital images, I wrote a CUDA kernel that runs slowly. The code is given below:
__global__ void Kernel ( int* inputArray, float* outputArray, float3* const col_image, int height, int width, int kc2 ) {
float G, h;
float fx[3];
float fy[3];
float g[2][2];
float k10 = 0.0;
float k11 = 0.0;
float k12 = 0.0;
float k20 = 0.0;
float k21 = 0.0;
float k22 = 0.0;
float k30 = 0.0;
float k31 = 0.0;
float k32 = 0.0;
int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
if ((xIndex < width - kc2/2) && (xIndex >= kc2/2) && (yIndex < height - kc2/2) && (yIndex >= kc2/2))
{
int idx0 = yIndex * width + xIndex;
if (inputArray[idx0] > 0)
{
for (int i = 0; i < kc2; i++)
{
for (int j = 0; j < kc2; j++)
{
int idx1 = (yIndex + i - kc2/2) * width + (xIndex + j - kc2/2);
float3 rgb = col_image[idx1];
k10 = k10 + constMat1[i * kc2 + j] * rgb.x;
k11 = k11 + constMat1[i * kc2 + j] * rgb.y;
k12 = k12 + constMat1[i * kc2 + j] * rgb.z;
k20 = k20 + constMat2[i * kc2 + j] * rgb.x;
k21 = k21 + constMat2[i * kc2 + j] * rgb.y;
k22 = k22 + constMat2[i * kc2 + j] * rgb.z;
k30 = k30 + constMat3[i * kc2 + j] * rgb.x;
k31 = k31 + constMat3[i * kc2 + j] * rgb.y;
k32 = k32 + constMat3[i * kc2 + j] * rgb.z;
}
}
fx[0] = kc2 * (k30 - k20);
fx[1] = kc2 * (k31 - k21);
fx[2] = kc2 * (k32 - k22);
fy[0] = kc2 * (k10 - k20);
fy[1] = kc2 * (k11 - k21);
fy[2] = kc2 * (k12 - k22);
g[0][0] = fx[0] * fx[0] + fx[1] * fx[1] + fx[2] * fx[2];
g[0][1] = fx[0] * fy[0] + fx[1] * fy[1] + fx[2] * fy[2];
g[1][0] = g[0][1];
g[1][1] = fy[0] * fy[0] + fy[1] * fy[1] + fy[2] * fy[2]
G = g[0][0] * g[1][1] - g[0][1] * g[1][0];
h = g[0][0] + g[1][1];
// Output
int idx2 = (yIndex - kc2/2) * (width - kc2) + (xIndex - kc2/2);
outputArray[idx2] = (h * h) / G;
}
}
}
Only some (positive) values of inputArray are processed. The array col_image contains the color components in the RGB model. If the value of inputArray satisfies the condition, we compute the coefficients k_{ij} over a kc2 x kc2 neighborhood centered at the point under consideration (kc2 is either 3 or 5). The values of constMat[1,2,3] are stored in the device's constant memory:
__device__ __constant__ float constMat[];
Then we calculate the characteristics fx, fy, g_{ij}, h, G and write the resulting value in the corresponding cell of outputArray.
Importantly, all of this data is stored in global memory, and the input array can be quite large (about 40 million points). Both of these directly affect the speed of the kernel.
How do we speed up the execution of this kernel (any techniques are welcome: use of shared memory / textures, use of stencil templates, etc.)?

The "standard" suggestion here is to use shared memory to buffer a block of col_image for use (and reuse) by the threadblock.
According to my tests, it seems to offer a substantial improvement. Since you have not provided a complete code, or any sort of data set or results verification, I will skip all those also. What follows then is a not-really-tested implementation of shared memory into your existing code, to "buffer" a (threadblockwidth + kc2)*(threadblockheight+kc2) "patch" of the col_image input data into a shared memory buffer. Thereafter, during the double-nested for-loops, the data is read out of the shared memory buffer.
A 2D shared memory stencil operation like this is an exercise in indexing as well as an exercise in handling edge cases. Your code is somewhat simpler in that we need only consider the edges to the "right" and "downward" when considering the needed "halo" of data to be buffered into shared memory.
I have not attempted to verify that this code is perfect. However, it should give you a "roadmap" for how to implement a 2D shared memory buffer system, with some motivation for the effort: I see roughly a 5x speedup by doing so, although YMMV, and it's entirely possible I've made a performance mistake.
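To make the size of that shared memory "patch" concrete, here is a small host-side sketch in plain C++ (it is not part of the kernel). It computes the footprint of a (thx + KC2) x (thy + KC2) tile and compares it with the number of pixel reads a block issues without tiling. The names Pixel3, tileElems, tileBytes and naiveLoads are made up for this illustration, and Pixel3 is only assumed to match the 12-byte layout of float3.

```cpp
#include <cassert>
#include <cstddef>

// Host-side stand-in for CUDA's float3: three packed floats (12 bytes).
struct Pixel3 { float x, y, z; };

// Elements in a tile of blockW x blockH threads padded by k extra rows
// and columns, matching s_col_image[thy+KC2][thx+KC2].
inline std::size_t tileElems(std::size_t blockW, std::size_t blockH, std::size_t k) {
    assert(blockW > 0 && blockH > 0);
    return (blockW + k) * (blockH + k);
}

// Shared memory footprint of that tile in bytes.
inline std::size_t tileBytes(std::size_t blockW, std::size_t blockH, std::size_t k) {
    return tileElems(blockW, blockH, k) * sizeof(Pixel3);
}

// Pixel reads per block without tiling: every thread reads k*k pixels.
inline std::size_t naiveLoads(std::size_t blockW, std::size_t blockH, std::size_t k) {
    return blockW * blockH * k * k;
}
```

For a 32 x 32 block with KC2 = 5, tileElems gives 37 x 37 = 1369 elements and tileBytes gives 16428 bytes, against 25600 individual pixel reads per block in the untiled loop. That footprint sits under the 48 KB of static shared memory per block that many GPUs allow by default, but the limit is worth checking for your device.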
Here's a worked example, showing the speedup on Pascal Titan X, CUDA 8.0.61, Linux:
$ cat t390.cu
#include <stdio.h>
#include <iostream>
const int adim = 6000;
const int KC2 = 5;
const int thx = 32;
const int thy = 32;
__constant__ float constMat1[KC2*KC2];
__constant__ float constMat2[KC2*KC2];
__constant__ float constMat3[KC2*KC2];
__global__ void Kernel ( int* inputArray, float* outputArray, float3* const col_image, int height, int width, int kc2 ) {
float G, h;
float fx[3];
float fy[3];
float g[2][2];
float k10 = 0.0;
float k11 = 0.0;
float k12 = 0.0;
float k20 = 0.0;
float k21 = 0.0;
float k22 = 0.0;
float k30 = 0.0;
float k31 = 0.0;
float k32 = 0.0;
int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
int idx0 = yIndex * width + xIndex;
#ifdef USE_SHARED
__shared__ float3 s_col_image[thy+KC2][thx+KC2];
int idx = xIndex;
int idy = yIndex;
int DATAHSIZE= height;
int WSIZE = kc2;
int DATAWSIZE = width;
float3 *input = col_image;
int BLKWSIZE = thx;
int BLKHSIZE = thy;
if ((idx < DATAHSIZE+WSIZE) && (idy < DATAWSIZE+WSIZE))
s_col_image[threadIdx.y][threadIdx.x]=input[idx0];
if ((idx < DATAHSIZE+WSIZE) && (idy < DATAWSIZE) && (threadIdx.y > BLKWSIZE - WSIZE))
s_col_image[threadIdx.y + (WSIZE-1)][threadIdx.x] = input[idx0+(WSIZE-1)*width];
if ((idx < DATAHSIZE) && (idy < DATAWSIZE+WSIZE) && (threadIdx.x > BLKHSIZE - WSIZE))
s_col_image[threadIdx.y][threadIdx.x + (WSIZE-1)] = input[idx0+(WSIZE-1)];
if ((idx < DATAHSIZE) && (idy < DATAWSIZE) && (threadIdx.x > BLKHSIZE - WSIZE) && (threadIdx.y > BLKWSIZE - WSIZE))
s_col_image[threadIdx.y + (WSIZE-1)][threadIdx.x + (WSIZE-1)] = input[idx0+(WSIZE-1)*width + (WSIZE-1)];
__syncthreads();
#endif
if ((xIndex < width - kc2/2) && (xIndex >= kc2/2) && (yIndex < height - kc2/2) && (yIndex >= kc2/2))
{
if (inputArray[idx0] > 0)
{
for (int i = 0; i < kc2; i++)
{
for (int j = 0; j < kc2; j++)
{
#ifdef USE_SHARED
float3 rgb = s_col_image[threadIdx.y + i][threadIdx.x + j]; // index the tile by the stencil offsets (i, j)
#else
int idx1 = (yIndex + i - kc2/2) * width + (xIndex + j - kc2/2);
float3 rgb = col_image[idx1];
#endif
k10 = k10 + constMat1[i * kc2 + j] * rgb.x;
k11 = k11 + constMat1[i * kc2 + j] * rgb.y;
k12 = k12 + constMat1[i * kc2 + j] * rgb.z;
k20 = k20 + constMat2[i * kc2 + j] * rgb.x;
k21 = k21 + constMat2[i * kc2 + j] * rgb.y;
k22 = k22 + constMat2[i * kc2 + j] * rgb.z;
k30 = k30 + constMat3[i * kc2 + j] * rgb.x;
k31 = k31 + constMat3[i * kc2 + j] * rgb.y;
k32 = k32 + constMat3[i * kc2 + j] * rgb.z;
}
}
fx[0] = kc2 * (k30 - k20);
fx[1] = kc2 * (k31 - k21);
fx[2] = kc2 * (k32 - k22);
fy[0] = kc2 * (k10 - k20);
fy[1] = kc2 * (k11 - k21);
fy[2] = kc2 * (k12 - k22);
g[0][0] = fx[0] * fx[0] + fx[1] * fx[1] + fx[2] * fx[2];
g[0][1] = fx[0] * fy[0] + fx[1] * fy[1] + fx[2] * fy[2];
g[1][0] = g[0][1];
g[1][1] = fy[0] * fy[0] + fy[1] * fy[1] + fy[2] * fy[2]; // had a missing semicolon
G = g[0][0] * g[1][1] - g[0][1] * g[1][0];
h = g[0][0] + g[1][1];
// Output
int idx2 = (yIndex - kc2/2) * (width - kc2) + (xIndex - kc2/2); // possible indexing bug here
outputArray[idx2] = (h * h) / G;
}
}
}
int main(){
int *d_inputArray;
int height = adim;
int width = adim;
float *d_outputArray;
float3 *d_col_image;
int kc2 = KC2;
cudaMalloc(&d_inputArray, height*width*sizeof(int));
cudaMemset(d_inputArray, 1, height*width*sizeof(int));
cudaMalloc(&d_col_image, (height+kc2)*(width+kc2)*sizeof(float3));
cudaMalloc(&d_outputArray, height*width*sizeof(float));
dim3 threads(thx,thy);
dim3 blocks((adim+threads.x-1)/threads.x, (adim+threads.y-1)/threads.y);
Kernel<<<blocks,threads>>>( d_inputArray, d_outputArray, d_col_image, height, width, kc2 );
cudaDeviceSynchronize();
}
$ nvcc -arch=sm_61 -o t390 t390.cu
$ cuda-memcheck ./t390
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$ nvprof ./t390
==1473== NVPROF is profiling process 1473, command: ./t390
==1473== Profiling application: ./t390
==1473== Profiling result:
Time(%) Time Calls Avg Min Max Name
97.29% 34.705ms 1 34.705ms 34.705ms 34.705ms Kernel(int*, float*, float3*, int, int, int)
2.71% 965.14us 1 965.14us 965.14us 965.14us [CUDA memset]
==1473== API calls:
Time(%) Time Calls Avg Min Max Name
88.29% 310.69ms 3 103.56ms 550.23us 309.46ms cudaMalloc
9.86% 34.712ms 1 34.712ms 34.712ms 34.712ms cudaDeviceSynchronize
1.05% 3.6801ms 364 10.110us 247ns 453.59us cuDeviceGetAttribute
0.70% 2.4483ms 4 612.07us 547.62us 682.25us cuDeviceTotalMem
0.08% 284.32us 4 71.079us 63.098us 79.616us cuDeviceGetName
0.01% 29.533us 1 29.533us 29.533us 29.533us cudaMemset
0.01% 21.189us 1 21.189us 21.189us 21.189us cudaLaunch
0.00% 5.2730us 12 439ns 253ns 1.1660us cuDeviceGet
0.00% 3.4710us 6 578ns 147ns 2.4820us cudaSetupArgument
0.00% 3.1090us 3 1.0360us 340ns 2.1660us cuDeviceGetCount
0.00% 1.0370us 1 1.0370us 1.0370us 1.0370us cudaConfigureCall
$ nvcc -arch=sm_61 -o t390 t390.cu -DUSE_SHARED
$ cuda-memcheck ./t390
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$ nvprof ./t390
==1545== NVPROF is profiling process 1545, command: ./t390
==1545== Profiling application: ./t390
==1545== Profiling result:
Time(%) Time Calls Avg Min Max Name
86.17% 5.4181ms 1 5.4181ms 5.4181ms 5.4181ms Kernel(int*, float*, float3*, int, int, int)
13.83% 869.94us 1 869.94us 869.94us 869.94us [CUDA memset]
==1545== API calls:
Time(%) Time Calls Avg Min Max Name
96.13% 297.15ms 3 99.050ms 555.80us 295.90ms cudaMalloc
1.76% 5.4281ms 1 5.4281ms 5.4281ms 5.4281ms cudaDeviceSynchronize
1.15% 3.5664ms 364 9.7970us 247ns 435.92us cuDeviceGetAttribute
0.86% 2.6475ms 4 661.88us 642.85us 682.42us cuDeviceTotalMem
0.09% 266.42us 4 66.603us 62.005us 77.380us cuDeviceGetName
0.01% 29.624us 1 29.624us 29.624us 29.624us cudaMemset
0.01% 19.147us 1 19.147us 19.147us 19.147us cudaLaunch
0.00% 4.8560us 12 404ns 248ns 988ns cuDeviceGet
0.00% 3.3390us 6 556ns 134ns 2.3510us cudaSetupArgument
0.00% 3.1190us 3 1.0390us 331ns 2.0780us cuDeviceGetCount
0.00% 1.1940us 1 1.1940us 1.1940us 1.1940us cudaConfigureCall
$
We see that the kernel execution time is ~35ms in the non-shared case and ~5.5ms in the shared case. For this case I set kc2=5. For the kc2=3 case, the performance gain will be smaller.
A few notes:
Your posted code was missing a semicolon on one line. I've added that and marked the line in my code.
I suspect you may have an indexing error on the "output" write to outputArray. Your indexing is like this:
int idx2 = (yIndex - kc2/2) * (width - kc2) + (xIndex - kc2/2);
whereas I would have expected this:
int idx2 = (yIndex - kc2/2) * width + (xIndex - kc2/2);
However, I haven't thought carefully about it, so I may be wrong here.
In the future, if you want help with a problem like this, I'd advise you to at least provide the level of complete code scaffolding and description that I have. Provide a complete code that somebody else could immediately pick up and test, without having to write their own code. Also define what platform you are on and what your performance measurement is.
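To illustrate the indexing question raised in the notes above, the following host-side sketch (plain C++, with hypothetical helper names outIndex and validColumns) evaluates the output index for a given row stride. It only shows that with a stride of width - kc2, two different valid pixels can land on the same output cell. Which stride is actually intended depends on how outputArray is laid out, which the question does not say.

```cpp
#include <cassert>

// Output index as in the kernel, but with the row stride passed in
// so that different candidate strides can be compared.
inline int outIndex(int x, int y, int kc2, int stride) {
    assert(kc2 % 2 == 1);  // the question uses kc2 = 3 or 5
    const int r = kc2 / 2;
    return (y - r) * stride + (x - r);
}

// Number of columns that pass the kernel's bounds test:
// kc2/2 <= x < width - kc2/2.
inline int validColumns(int width, int kc2) {
    return width - 2 * (kc2 / 2);
}
```

With width = 8 and kc2 = 3, the last valid pixel of the first row (x = 6, y = 1) and the first valid pixel of the next row (x = 1, y = 2) both map to index 5 when the stride is width - kc2 = 5. A stride of width (as suggested above), or of width - kc2 + 1 for a tightly packed output, keeps them apart.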

Related

How do I get correct answers using my code with the barycentric formula?

My function getHeightOfTerrain() calls a barycentric formula function that does not return the correct height for the single test height set in heightMapFromArray[][].
I've tried watching the OpenGL Java game tutorials 14, 21 and 22 by "ThinMatrix", and I am confused about how to use my array heightMapforBaryCentric in both of the supplied functions, and how to set the arguments passed to the baryCentric() function so that I can solve the problem.
int creaateTerrain(int height, int width)
{
float holderY[6] = { 0.f ,0.f,0.f,0.f,0.f,0.f };
float scaleit = 1.5f;
float holder[6] = { 0.f,0.f,0.f,0.f,0.f,0.f };
for (int z = 0, z2 =0; z < iterationofHeightMap;z2++)
{
//each loop is two iterations and creates one quad (two triangles)
//however because each iteration is by two (i.e. : x=x+2) om bottom
//the amount of triangles is half the x value
//
//number of vertices : 80 x 80 x 6.
//column
for (int x = 0, x2 = 0; x < iterationofHeightMap;x2++)
{
//relevant - A : first triangle - on left triangle
//[row] [colum[]
holder[0] = heightMapFromArray[z][x];
//holder[0] = (float)imageData[(z / 2 * MAP_Z + (x / 2)) * 3];
//holder[0] = holder[0] / 255;// *scaleit;
vertices.push_back(glm::vec3(x, holder[0], z));
//match height map with online barycentric use
heightMapforBaryCentric[x2][z2] = holder[0];
holder[1] = heightMapFromArray[z+2][x];
//holder[1] = (float)imageData[(((z + 2) / 2 * MAP_Z + ((x) / 2))) * 3];
//holder[1] = holder[1] / 255;// 6 * scaleit;
vertices.push_back(glm::vec3(x, holder[1], z + 2));
//match height map with online barycentric use
heightMapforBaryCentric[x2][z2+1] = holder[1];
holder[2] = heightMapFromArray[z+2][x+2];
//holder[2] = (float)imageData[(((z + 2) / 2 * MAP_Z + ((x + 2) / 2))) * 3];
//holder[2] = holder[2] / 255;// *scaleit;
vertices.push_back(glm::vec3(x + 2, holder[2], z + 2));
////match height map with online barycentric use
heightMapforBaryCentric[x2+1][z2+1] = holder[2];
//relevant - B - second triangle (on right side)
holder[3] = heightMapFromArray[z][x];
//holder[3] = (float)imageData[((z / 2)*MAP_Z + (x / 2)) * 3];
//holder[3] = holder[3] / 255;// 256 * scaleit;
vertices.push_back(glm::vec3(x, holder[3], z));
holder[4] = heightMapFromArray[x+2][z+2];
//holder[4] = (float)imageData[(((z + 2) / 2 * MAP_Z + ((x + 2) / 2))) * 3];
//holder[4] = holder[4] / 255;// *scaleit;
vertices.push_back(glm::vec3(x + 2, holder[4], z + 2));
holder[5] = heightMapFromArray[x+2][z];
//holder[5] = (float)imageData[((z / 2)*MAP_Z + ((x + 2) / 2)) * 3];
//holder[5] = holder[5] / 255;// *scaleit;
vertices.push_back(glm::vec3(x + 2, holder[5], z));
x = x + 2;
}
z = z + 2;
}
return(1);
}
float getHeightOfTerrain(float worldX, float worldZ) {
float terrainX = worldX;
float terrainZ = worldZ;
int gridSquareSize = 2.0f;
gridX = (int)floor(terrainX / gridSquareSize);
gridZ = (int)floor(terrainZ / gridSquareSize);
xCoord = ((float)(fmod(terrainX, gridSquareSize)) / (float)gridSquareSize);
zCoord = ((float)(fmod(terrainZ, gridSquareSize)) / (float)gridSquareSize);
if (xCoord <= (1 - zCoord))
{
answer = baryCentric(
//left triangle
glm::vec3(0.0f, heightMapforBaryCentric[gridX][gridZ], 0.0f),
glm::vec3(0.0f, heightMapforBaryCentric[gridX][gridZ+1], 1.0f),
glm::vec3(1.0f, heightMapforBaryCentric[gridX+1][gridZ+1], 1.0f),
glm::vec2(xCoord, zCoord));
// if (answer != 1)
// {
// fprintf(stderr, "Z:gridx: %d gridz: %d answer: %f\n", gridX, gridZ,answer);
//
// }
}
else
{
//right triangle
answer = baryCentric(glm::vec3(0, heightMapforBaryCentric[gridX][gridZ], 0),
glm::vec3(1,heightMapforBaryCentric[gridX+1][gridZ+1], 1),
glm::vec3(1,heightMapforBaryCentric[gridX+1][gridZ], 0),
glm::vec2(xCoord, zCoord));
}
if (answer == 1)
{
answer = 0;
}
//answer = abs(answer - 1);
return(answer);
}
float baryCentric(glm::vec3 p1, glm::vec3 p2, glm::vec3 p3 , glm::vec2 pos) {
float det = (p2.z - p3.z) * (p1.x - p3.x) + (p3.x - p2.x) * (p1.z - p3.z);
float l1 = ((p2.z - p3.z) * (pos.x - p3.x) + (p3.x - p2.x) * (pos.y - p3.z)) / det;
float l2 = ((p3.z - p1.z) * (pos.x - p3.x) + (p1.x - p3.x) * (pos.y - p3.z)) / det;
float l3 = 1.0f - l1 - l2;
return (l1 * p1.y + l2 * p2.y + l3 * p3.y);
}
I expected the height at the center of the test grid to be the set value of 0.5, falling off gradually as the heights declined. Instead, the returned heights were all the same, varied, or increasing. Usually these heights were below one.

What am I doing wrong when executing the sobel filter function in c++

Here is my Sobel filter function, performed on a grayscale image. Apparently I'm not doing my calculations correctly, because I keep getting an all-black image. I have already turned in the project, but it is bothering me that the results aren't right.
int sobelH[3][3] = { -1, 0, 1,
-2, 0, 2,
-1, 0, 1 },
sobelV[3][3] = { 1, 2, 1,
0, 0, 0,
-1, -2, -1 };
//variable declaration
int mag;
int pix_x, pix_y = 0;
int img_x, img_y;
for (img_x = 0; img_x < img->x; img_x++)
{
for (img_y = 0; img_y < img->y; img_y++)
{
pix_x = 0;
pix_y = 0;
//calculating the X and Y convolutions
for (int i = -1; i <= 1; i++)
{
for (int j = -1; j <= 1; j++)
{
pix_x += (img->data[img_y * img->x + img_x].red + img->data[img_y * img->x + img_x].green + img->data[img_y * img->x + img_x].blue) * sobelH[1 + i][1 + j];
pix_y += (img->data[img_y * img->x + img_x].red + img->data[img_y * img->x + img_x].green + img->data[img_y * img->x + img_x].blue) * sobelV[1 + i][1 + j];
}
}
//Gradient magnitude
mag = sqrt((pix_x * pix_x) + (pix_y * pix_y));
if (mag > RGB_COMPONENT_COLOR)
mag = 255;
if (mag < 0)
mag = 0;
//Setting the new pixel value
img->data[img_y * img->x + img_x].red = mag;
img->data[img_y * img->x + img_x].green = mag;
img->data[img_y * img->x + img_x].blue = mag;
}
}
Although your code could use some improvement, the main problem is that you compute the convolution at a constant img_y and img_x, so every kernel tap reads the same pixel. You need to offset the pixel index by the kernel offsets, for example (and likewise for pix_y with sobelV):
pix_x += (img->data[(img_y + i) * img->x + (img_x + j)].red + img->data[(img_y + i) * img->x + (img_x + j)].green + img->data[(img_y + i) * img->x + (img_x + j)].blue) * sobelH[1 + i][1 + j];
Indeed, the coefficients of each Sobel kernel sum to zero, so applying them to a constant value always gives zero, which results in an all-black image.
Note that the example above does not take the border of the image into account. Make sure you do not access pixels outside your pixel array.
Another mistake is that you're writing into the input image. You write at location (x,y), then compute the filter result for location (x+1,y) using the modified value at (x,y), which is the wrong value to use.
You need to write your result to a new image.
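Putting the points above together (offsetting the pixel index, staying inside the image, and writing to a new image), a minimal sketch could look like the following. It assumes a hypothetical single-channel Gray struct rather than the RGB img type from the question, and it simply leaves the one-pixel border at 0.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical grayscale image: row-major, one byte per pixel.
struct Gray {
    int w, h;
    std::vector<unsigned char> px;
};

// Sobel gradient magnitude written to a separate output image.
// Border pixels stay at 0 because the 3x3 window does not fit there.
inline Gray sobel(const Gray& in) {
    assert(in.px.size() == static_cast<std::size_t>(in.w) * in.h);
    static const int kx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static const int ky[3][3] = {{1, 2, 1}, {0, 0, 0}, {-1, -2, -1}};
    Gray out{in.w, in.h, std::vector<unsigned char>(in.px.size(), 0)};
    for (int y = 1; y + 1 < in.h; ++y) {
        for (int x = 1; x + 1 < in.w; ++x) {
            int gx = 0, gy = 0;
            for (int i = -1; i <= 1; ++i) {
                for (int j = -1; j <= 1; ++j) {
                    // Offset the pixel by (i, j): row y+i, column x+j.
                    const int v = in.px[(y + i) * in.w + (x + j)];
                    gx += v * kx[1 + i][1 + j];
                    gy += v * ky[1 + i][1 + j];
                }
            }
            const double mag = std::sqrt(static_cast<double>(gx * gx + gy * gy));
            // Clamp to the 8-bit range before storing.
            out.px[y * in.w + x] = static_cast<unsigned char>(mag > 255.0 ? 255.0 : mag);
        }
    }
    return out;
}
```

A uniformly colored input produces an all-zero output here, while a sharp vertical edge produces saturated values along the edge and zero elsewhere.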

OpenCV: Custom pixelwise alpha compositing: is this correct?

As far as I know, OpenCV does not offer a pixelwise add() but only addWeighted(), which applies one scalar to all pixels. Even with C-style array access, which is the fastest of all the means of pixel access, my custom alpha compositing function is still slow as hell: it took nearly 2 seconds for a 1400x900 image. I don't think building in release mode helps optimization... Is there a way to increase the speed?
I'm writing alphaCompositeLayers(), an alpha compositing function that multiplies each pixel of the background cv::Mat by the alpha value of the corresponding pixel of the foreground cv::Mat. Both cv::Mats are CV_8UC4 based (unsigned char, 4 channels):
// mat1 in foreground, mat0 in background
cv::Mat alphaCompositeLayers(cv::Mat mat0, cv::Mat mat1) {
cv::Mat res = mat0.clone();
int nRows = res.rows;
int nCols = res.cols * res.channels();
if (res.isContinuous()) {
nCols *= nRows;
nRows = 1;
}
for (int u = 0; u < nRows; u++) {
unsigned char *resrgb = res.ptr<unsigned char>(u);
unsigned char *matrgb = mat1.ptr<unsigned char>(u);
for (int v = 0; v < nCols; v += 4) {
unsigned char newalpha = cv::saturate_cast<unsigned char>(resrgb[v + 3] * (255.0f - matrgb[v + 3]) + matrgb[v + 3]);
resrgb[v] = cv::saturate_cast<unsigned char>((resrgb[v] * resrgb[v + 3] / 255.0f * (255 - matrgb[v + 3]) / 255.0f + matrgb[v] * matrgb[v + 3] / 255.0f)); // / newalpha);
resrgb[v + 1] = cv::saturate_cast<unsigned char>((resrgb[v + 1] * resrgb[v + 3] / 255.0f * (255 - matrgb[v + 3]) / 255.0f + matrgb[v + 1] * matrgb[v + 3] / 255.0f)); // / newalpha);
resrgb[v + 2] = cv::saturate_cast<unsigned char>((resrgb[v + 2] * resrgb[v + 3] / 255.0f * (255 - matrgb[v + 3]) / 255.0f + matrgb[v + 2] * matrgb[v + 3] / 255.0f)); // / newalpha);
resrgb[v + 3] = newalpha;
resrgb[v + 3] = cv::saturate_cast<unsigned char>(rand() % 256);
}
}
return res;
}
Here's another function multiplyLayerByAlpha() that multiplies each pixel by its alpha value (0% opacity = black, 100% opacity = pixel color):
cv::Mat multiplyLayerByAlpha(cv::Mat mat) {
cv::Mat res = mat.clone();
int nRows = res.rows;
int nCols = res.cols * res.channels();
if (res.isContinuous()) {
nCols *= nRows;
nRows = 1;
}
for (int u = 0; u < nRows; u++) {
unsigned char *resrgb = res.ptr<unsigned char>(u);
for (int v = 0; v < nCols; v += 4) {
resrgb[v] = cv::saturate_cast<unsigned char>(resrgb[v] * resrgb[v + 3] / 255.0f);
resrgb[v + 1] = cv::saturate_cast<unsigned char>(resrgb[v + 1] * resrgb[v + 3] / 255.0f);
resrgb[v + 2] = cv::saturate_cast<unsigned char>(resrgb[v + 2] * resrgb[v + 3] / 255.0f);
}
}
return res;
}
Given an array of cv::Mats, for example {mat0, mat1, mat2} with mat2 foremost (on top of all 3), I basically run this:
cv::Mat resultingCvMat = multiplyLayerByAlpha(
alphaCompositeLayers(
mat0,
alphaCompositeLayers(mat1, mat2)
)
);
How can I make the program compute the resultingCvMat faster? With C++ ways like multi-threading (then how)? Or with OpenCV functions and ways (again, then how)?

how can i change the b-spline curves from 4 point to 6?

I have C++ code that draws a B-spline curve with 4 points. If I want to change it to 6 points, what should I change in the code?
You can check the code:
#include "graphics.h"
#include <math.h>
int main(void) {
int gd, gm, page = 0;
gd = VGA;
gm = VGAMED;
initgraph(&gd, &gm, "");
point2d pontok[4] = { 100, 100, 150, 200, 170, 130, 240, 270 }; //pontok means points
int ap;
for (;;) {
setactivepage(page);
cleardevice();
for (int i = 0; i < 4; i++)
circle(integer(pontok[i].x), integer(pontok[i].y), 3);
double t = 0;
moveto((1.0 / 6) * (pontok[0].x * pow(1 - t, 3) +
pontok[1].x * (3 * t * t * t - 6 * t * t + 4) +
pontok[2].x * (-3 * t * t * t + 3 * t * t + 3 * t + 1) +
pontok[3].x * t * t * t),
(1.0 / 6) * (pontok[0].y * pow(1 - t, 3) +
pontok[1].y * (3 * t * t * t - 6 * t * t + 4) +
pontok[2].y * (-3 * t * t * t + 3 * t * t + 3 * t + 1) +
pontok[3].y * t * t * t));
for (t = 0; t <= 1; t += 0.01)
lineto(
(1.0 / 6) * (pontok[0].x * pow(1 - t, 3) +
pontok[1].x * (3 * t * t * t - 6 * t * t + 4) +
pontok[2].x * (-3 * t * t * t + 3 * t * t + 3 * t + 1) +
pontok[3].x * t * t * t),
(1.0 / 6) * (pontok[0].y * pow(1 - t, 3) +
pontok[1].y * (3 * t * t * t - 6 * t * t + 4) +
pontok[2].y * (-3 * t * t * t + 3 * t * t + 3 * t + 1) +
pontok[3].y * t * t * t));
/* Egerkezeles */ //Egerkezeles means mouse event handling
if (!balgomb)
ap = getactivepoint((point2d *)pontok, 4, 5);
if (ap >= 0 && balgomb) { //balgomb means left mouse button
pontok[ap].x = egerx; //eger means mouse
pontok[ap].y = egery;
}
/* Egerkezeles vege */
setvisualpage(page);
page = 1 - page;
if (kbhit())
break;
}
getch();
closegraph();
return 0;
}
From your formula, it looks like you are trying to draw a cubic Bezier curve. But the formula does not seem entirely correct. You can google "cubic Bezier curve" to find the correct formula. The Wikipedia page contains the formula for any degree of Bezier curve. You can find the "6-points" formula from there by using degree = 5.
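A Bezier curve through six control points has degree 5. As a rough sketch of what the suggestion above could look like in code, de Casteljau's algorithm evaluates a Bezier curve of any degree by repeated linear interpolation, so the same function covers both 4 and 6 points. The Point2 type and the bezier name are made up for this example and stand in for point2d from graphics.h.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point2 { double x, y; };

// Evaluate a Bezier curve of degree ctrl.size() - 1 at parameter t in [0, 1]
// using de Casteljau's algorithm (repeated linear interpolation).
// ctrl is taken by value so it can be used as scratch space.
inline Point2 bezier(std::vector<Point2> ctrl, double t) {
    assert(!ctrl.empty());
    for (std::size_t n = ctrl.size() - 1; n > 0; --n) {
        for (std::size_t i = 0; i < n; ++i) {
            ctrl[i].x = (1.0 - t) * ctrl[i].x + t * ctrl[i + 1].x;
            ctrl[i].y = (1.0 - t) * ctrl[i].y + t * ctrl[i + 1].y;
        }
    }
    return ctrl[0];
}
```

In the drawing loop, the arguments to moveto and lineto would then come from bezier(points, t) for t running from 0 to 1, with points holding all six control points.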

2d rotation opengl

Here is the code I am using.
#define ANGLETORADIANS 0.017453292519943295769236907684886f // PI / 180
#define RADIANSTOANGLE 57.295779513082320876798154814105f // 180 / PI
rotation = rotation *ANGLETORADIANS;
cosRotation = cos(rotation);
sinRotation = sin(rotation);
for(int i = 0; i < 3; i++)
{
px[i] = (vec[i].x + centerX) * (cosRotation - (vec[i].y + centerY)) * sinRotation;
py[i] = (vec[i].x + centerX) * (sinRotation + (vec[i].y + centerY)) * cosRotation;
printf("num: %i, px: %f, py: %f\n", i, px[i], py[i]);
}
So far it seems my Y value is being flipped. Say I enter X = 1 and Y = 1 with a 45 degree rotation: you should see about x = 0 and y = 1.25 ish, but I get x = 0 and y = -1.25.
Also, my 90 degree rotation always returns x = 0 and y = 0.
P.S. I know I'm only centering my values and not putting them back where they came from. It's not needed to put them back, as all I need to know is the value I'm getting now.
Your bracket placement doesn't look right to me. I would expect:
px[i] = (vec[i].x + centerX) * cosRotation - (vec[i].y + centerY) * sinRotation;
py[i] = (vec[i].x + centerX) * sinRotation + (vec[i].y + centerY) * cosRotation;
Your brackets are wrong. It should be
px[i] = ((vec[i].x + centerX) * cosRotation) - ((vec[i].y + centerY) * sinRotation);
py[i] = ((vec[i].x + centerX) * sinRotation) + ((vec[i].y + centerY) * cosRotation);
instead
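As a quick check of the corrected formula, here is a minimal standalone sketch. The Vec2 type and the rotateDeg name are made up for this example, and the centerX/centerY offsets from the question are left out.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { double x, y; };

// Rotate p counter-clockwise about the origin by angleDegrees.
inline Vec2 rotateDeg(Vec2 p, double angleDegrees) {
    assert(std::isfinite(angleDegrees));
    const double kPi = 3.14159265358979323846;
    const double a = angleDegrees * kPi / 180.0;
    const double c = std::cos(a);
    const double s = std::sin(a);
    // Each product is formed first, then the two terms are combined.
    return Vec2{p.x * c - p.y * s, p.x * s + p.y * c};
}
```

With this form, (1, 1) rotated by 45 degrees comes out at roughly (0, 1.414), and (1, 0) rotated by 90 degrees at roughly (0, 1).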