Convolution implementation in C++

I want to implement a 2D convolution function in C++ myself, without using filter2D(). I'm trying to iterate over all pixels of the input image and the kernel, then assign a new value to each pixel of dst.
However, I get this error:
Thread 1: EXC_BAD_ACCESS (code=1, address=0x0)
I found that this error means I'm dereferencing a null pointer, but I could not solve the problem. Here is my C++ code.
cv::Mat_<float> spatialConvolution(const cv::Mat_<float>& src, const cv::Mat_<float>& kernel)
{
    // declare variables
    Mat_<float> dst;
    Mat_<float> flipped_kernel;
    float tmp = 0.0;
    // flip kernel
    flip(kernel, flipped_kernel, -1);
    // multiply and integrate
    // input rows
    for(int i=0;i<src.rows;i++){
        // input columns
        for(int j=0;j<src.cols;j++){
            // kernel rows
            for(int k=0;k<flipped_kernel.rows;k++){
                // kernel columns
                for(int l=0;l<flipped_kernel.cols;l++){
                    tmp += src.at<float>(i,j) * flipped_kernel.at<float>(k,l);
                }
            }
            dst.at<float>(i,j) = tmp;
        }
    }
    return dst.clone();
}

To simplify, let's suppose you have a 3x3 kernel:
k(0,0) k(0,1) k(0,2)
k(1,0) k(1,1) k(1,2)
k(2,0) k(2,1) k(2,2)
To calculate the convolution you scan the input image (denoted I) from left to right, from top to bottom,
and for every pixel of the input image you compute one value from the formula below:
newValue(y,x) = I(y-1,x-1) * k(0,0) + I(y-1,x) * k(0,1) + I(y-1,x+1) * k(0,2)
              + I(y,x-1)   * k(1,0) + I(y,x)   * k(1,1) + I(y,x+1)   * k(1,2)
              + I(y+1,x-1) * k(2,0) + I(y+1,x) * k(2,1) + I(y+1,x+1) * k(2,2)
------------------x------------>
|
|
| [k(0,0) k(0,1) k(0,2)]
y [k(1,0) k(1,1) k(1,2)]
| [k(2,0) k(2,1) k(2,2)]
|
(y,x) of the input image I is the anchor point of the kernel: to compute the new value of I(y,x)
you multiply every kernel coefficient by the corresponding pixel of I - your code doesn't do that.
First you need to create a dst matrix with the same dimensions as the original image and the same pixel type.
Then you need to rewrite your loops to reflect the formula described above:
cv::Mat_<float> spatialConvolution(const cv::Mat_<float>& src, const cv::Mat_<float>& kernel)
{
    Mat dst(src.rows, src.cols, src.type());
    Mat_<float> flipped_kernel;
    flip(kernel, flipped_kernel, -1);
    const int dx = kernel.cols / 2;
    const int dy = kernel.rows / 2;
    for (int i = 0; i < src.rows; i++)
    {
        for (int j = 0; j < src.cols; j++)
        {
            float tmp = 0.0f;
            for (int k = 0; k < flipped_kernel.rows; k++)
            {
                for (int l = 0; l < flipped_kernel.cols; l++)
                {
                    // position of the source pixel under kernel element (k,l)
                    int x = j - dx + l;
                    int y = i - dy + k;
                    // skip samples falling outside the image (zero border)
                    if (x >= 0 && x < src.cols && y >= 0 && y < src.rows)
                        tmp += src.at<float>(y, x) * flipped_kernel.at<float>(k, l);
                }
            }
            dst.at<float>(i, j) = saturate_cast<float>(tmp);
        }
    }
    return dst.clone();
}

Your memory access error is presumably happening on the line:
dst.at<float>(i,j) = tmp;
because dst is never initialized. You can't assign to an element of a matrix that has no size/data. Initialize the matrix first: Mat_<float> dst; is only a declaration, not an initialization. Use one of the Mat constructors that takes a cv::Size or a row/column count (see the docs). For example, you can initialize dst with:
Mat dst{src.size(), src.type()};
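As a quick sanity check, here is a minimal usage sketch for the corrected function (the 3x3 box kernel and the test image values are made up for illustration):

#include <opencv2/opencv.hpp>
#include <iostream>

int main()
{
    // small test image: a single bright pixel in the middle (illustrative)
    cv::Mat_<float> img = (cv::Mat_<float>(3, 3) << 0, 0, 0,
                                                    0, 9, 0,
                                                    0, 0, 0);
    // 3x3 box-blur kernel
    cv::Mat_<float> box = cv::Mat_<float>::ones(3, 3) / 9.0f;
    cv::Mat_<float> out = spatialConvolution(img, box);
    std::cout << out << std::endl; // every output pixel becomes 1: each 3x3 window sees the 9
    return 0;
}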

Related

Is it possible to find a small image in a big image faster than this way?

I need to get the coordinates of an image on the screen.
I use GDI to capture the screen to get the big image data, and I load the small image data from a file.
I compare the two images with the following code, but it is too slow.
Is there a faster way, with no loss of accuracy?
I tried a grayscale transform, comparing a hash of every column, and other approaches; they are faster but not precise.
// input: big image, small image, sim, dfcolor, rc
// output: a POINT; {-1,-1} means not found
PBYTE pSrc = _src.getBytes(); // _src is the big image, pSrc points to its data
PBYTE pPic = pic->getBytes(); // pic is the small image, pPic points to its data
int max_error = (1. - sim) * pic->width() * pic->height();
int error_count = 0;
bool bad = false;
// rc is a rect; because of multithreading, every thread handles a block of the big image
for (int i = rc.y1; i < rc.y2; ++i) {
    for (int j = rc.x1; j < rc.x2; ++j) {
        // stop is a std::atomic_bool used to notify the other threads to stop once found
        if (stop) {
            return { -1, -1 };
        }
        // image data is stored as BGRA; I only compare BGR
        // dfcolor is the allowed color deviation
        for (int y1 = 0; y1 < pic->height() && !bad; ++y1) {
            for (int x1 = 0; x1 < pic->width(); ++x1) {
                int index1 = ((i + y1) * _src.width() + j + x1) << 2;
                int index2 = (y1 * pic->width() + x1) << 2;
                if (abs(*(pSrc + index1) - *(pPic + index2)) >= dfcolor.b ||
                    abs(*(pSrc + index1 + 1) - *(pPic + index2 + 1)) >= dfcolor.g ||
                    abs(*(pSrc + index1 + 2) - *(pPic + index2 + 2)) >= dfcolor.r) {
                    ++error_count;
                    if (error_count > max_error) {
                        bad = true;
                        break;
                    }
                }
            }
        }
        // not found at (i, j): reset and continue
        if (bad) {
            error_count = 0;
            bad = false;
            continue;
        }
        // found
        stop = true;
        return { i, j };
    }
}
return { -1, -1 };
Not sure how smart your compiler is, but index1 and index2 in the innermost (very deeply nested) loop advance by exactly 4 bytes per iteration.
You could simply compute pointers into your source and pattern images and advance those, instead of redoing the fairly involved index math.
The effect is at least two-fold: you save some time on that math, and (more importantly) the compiler may notice that it can vectorize that if statement.
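A minimal sketch of that idea for the innermost loop (variable names match the question; treat it as illustrative rather than drop-in code):

// Hoist the row base addresses out of the x1 loop; both pointers then
// advance by 4 bytes (one BGRA pixel) per iteration, with no index math.
const BYTE* rowSrc = pSrc + (((i + y1) * _src.width() + j) << 2);
const BYTE* rowPic = pPic + ((y1 * pic->width()) << 2);
for (int x1 = 0; x1 < pic->width(); ++x1, rowSrc += 4, rowPic += 4) {
    if (abs(rowSrc[0] - rowPic[0]) >= dfcolor.b ||
        abs(rowSrc[1] - rowPic[1]) >= dfcolor.g ||
        abs(rowSrc[2] - rowPic[2]) >= dfcolor.r) {
        if (++error_count > max_error) {
            bad = true;
            break;
        }
    }
}

With the per-iteration multiplications gone, the loop body is a plain sequential byte comparison, which gives the compiler a much better shot at vectorizing it.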

Issue with reading pixel RGB values in OpenCL

I need to read pixels from two parts of an image with the same width and height (e.g. the squares ([0,0], [300,300]) and ([400,0], [700,300])) and compute the difference for each pixel pair.
This is the C (pseudo)code:
/**
 * @param img Input image
 * @param pos Integer position of the top-left corner of the second square (in this case 400)
 */
double getSum(Image& img, int pos)
{
    const int width_of_cut = 300;
    int right_bottom = pos + width_of_cut;
    Rgb first, second;
    double ret_val = 0.0;
    for(int i = 0; i < width_of_cut; i++)
    {
        for(int j = 0; j < width_of_cut; j++)
        {
            first = img.getPixel( i, j );
            second = img.getPixel( i + pos, j );
            ret_val += ( first.R - second.R ) +
                       ( first.G - second.G ) +
                       ( first.B - second.B );
        }
    }
    return ret_val;
}
But my kernel (called with the same arguments, and with the __global float* output buffer zeroed in host code) gives me completely different values:
__constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP_TO_EDGE |
                               CLK_FILTER_NEAREST;

__kernel void getSum( __read_only image2d_t input,
                      const int x_coord,
                      __global float* output )
{
    int width = get_image_width( input );
    int height = get_image_height( input );
    int2 pixelcoord = (int2)(get_global_id(0), get_global_id(1)); // image coordinates
    const int width_of_cut = 300;
    const int right_bottom = x_coord + width_of_cut;
    int a, b;
    a = (int)(pixelcoord.x + x_coord);
    b = pixelcoord.y;
    if( a < right_bottom && b < width_of_cut )
    {
        float4 first = read_imagef(input, sampler, pixelcoord);
        float4 second = read_imagef(input, sampler, (int2)(a, b));
        output[get_global_id(0)] += ((first.x - second.x) +
                                     (first.y - second.y) +
                                     (first.z - second.z));
    }
}
I am new to OpenCL and I have no idea what I am doing wrong.
Update (1D image):
I changed the kernel code. Now I'm reading a 1D image in one loop, but I'm still not getting the correct values. I'm not sure I know how to read pixels from a 1D image correctly.
__kernel void getSum( __read_only image1d_t input,
                      const int x_coord,
                      __global float* output,
                      const int img_width )
{
    const int width_of_cut = 300;
    int i = (int)(get_global_id(0));
    for(int j = 0; j < width_of_cut; j++)
    {
        int f = ( img_width*i + j );
        int s = f + x_coord;
        float4 first = read_imagef( input, sampler, f );  // pixel from the 1st square
        float4 second = read_imagef( input, sampler, s ); // pixel from the 2nd square
        output[get_global_id(0)] += ((first.x - second.x) +
                                     (first.y - second.y) +
                                     (first.z - second.z));
    }
}
Race condition.
All work items in the same column are accessing the same output memory (output[get_global_id(0)] +=), and not atomically. The results are therefore likely incorrect (e.g., two work items read the same value, add something to it, and write it back; only one write wins).
If your device supports it, you could make this an atomic operation, but it would be slow. You'd be better off running a 1D kernel with a loop that accumulates vertically (i.e., the j loop from your C example).
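A minimal sketch of that 1D approach (my own illustration, assuming a 2D image and one work item per row of the cut; the private accumulator removes the shared += entirely):

__kernel void getSumRows( __read_only image2d_t input,
                          const int x_coord,
                          __global float* output )
{
    const int width_of_cut = 300;
    int row = get_global_id(0); // one work item per row
    float acc = 0.0f;           // private accumulator: no race
    for (int col = 0; col < width_of_cut; ++col)
    {
        float4 first  = read_imagef(input, sampler, (int2)(col, row));
        float4 second = read_imagef(input, sampler, (int2)(col + x_coord, row));
        acc += (first.x - second.x) + (first.y - second.y) + (first.z - second.z);
    }
    output[row] = acc; // single write per work item
}

The host (or a small reduction kernel) then sums the width_of_cut partial results.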

HOG optimization using SIMD

There are several attempts to optimize the calculation of the HOG descriptor using SIMD instructions: OpenCV, Dlib, and Simd. All of them use scalar code to add the resulting magnitudes to the HOG histogram:
float histogram[height/8][width/8][18];
float ky[height], kx[width];
int idx[size];
float val[size];
for(size_t i = 0; i < size; ++i)
{
    histogram[y/8][x/8][idx[i]]         += val[i]*ky[y]*kx[x];
    histogram[y/8][x/8 + 1][idx[i]]     += val[i]*ky[y]*kx[x + 1];
    histogram[y/8 + 1][x/8][idx[i]]     += val[i]*ky[y + 1]*kx[x];
    histogram[y/8 + 1][x/8 + 1][idx[i]] += val[i]*ky[y + 1]*kx[x + 1];
}
The value of size depends on the implementation, but in general the meaning is the same.
I know that histogram calculation does not have a simple and effective SIMD solution in general. But in this case the histogram is small (18 bins). Can that help in SIMD optimization?
I have found a solution: a temporary buffer. First we accumulate into the temporary buffer (this operation can be vectorized), then we add the sums from the buffer to the output histogram (this operation can also be vectorized):
float histogram[height/8][width/8][18];
float ky[height], kx[width];
int idx[size];
float val[size];
float buf[18][4] = {}; // the accumulation buffer must start at zero
for(size_t i = 0; i < size; ++i)
{
    buf[idx[i]][0] += val[i]*ky[y]*kx[x];
    buf[idx[i]][1] += val[i]*ky[y]*kx[x + 1];
    buf[idx[i]][2] += val[i]*ky[y + 1]*kx[x];
    buf[idx[i]][3] += val[i]*ky[y + 1]*kx[x + 1];
}
for(size_t i = 0; i < 18; ++i)
{
    histogram[y/8][x/8][i]         += buf[i][0];
    histogram[y/8][x/8 + 1][i]     += buf[i][1];
    histogram[y/8 + 1][x/8][i]     += buf[i][2];
    histogram[y/8 + 1][x/8 + 1][i] += buf[i][3];
}
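To make the "can be vectorized" claim concrete for the second loop, here is a hedged SSE sketch. Note that it assumes the buffer is transposed to buf[4][18], so each destination cell receives a contiguous run of floats; the function and layout are my illustration, not code from any of the libraries mentioned:

#include <immintrin.h>

// Flush an 18-bin accumulation buffer, laid out as buf[4][18], into the four
// neighbouring histogram cells. With this layout each cell gets a contiguous
// run of 18 floats, so plain unaligned SSE loads/adds/stores apply.
void flushBuffer(float* cells[4], const float buf[4][18])
{
    for (int c = 0; c < 4; ++c)
    {
        int i = 0;
        for (; i + 4 <= 18; i += 4) // 4 floats per __m128
        {
            __m128 h = _mm_loadu_ps(cells[c] + i);
            __m128 b = _mm_loadu_ps(buf[c] + i);
            _mm_storeu_ps(cells[c] + i, _mm_add_ps(h, b));
        }
        for (; i < 18; ++i) // scalar tail (18 % 4 == 2)
            cells[c][i] += buf[c][i];
    }
}

Here cells[0..3] would point at &histogram[y/8][x/8][0], &histogram[y/8][x/8+1][0], &histogram[y/8+1][x/8][0] and &histogram[y/8+1][x/8+1][0].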
You can do a partial optimisation by using SIMD to calculate all the (flattened) histogram indices and the bin increments, then process these in a scalar loop afterwards. You probably also want to strip-mine this so that you process one row at a time, keeping the temporary bin indices and increments in cache. It might appear that this would be inefficient, due to the temporary intermediate buffers, but in practice I have seen a useful overall gain in similar scenarios.
uint32_t i = 0;
for (y = 0; y < height; ++y) // for each row
{
    uint32_t indices[width * 4]; // flattened histogram indices for this row
    float vals[width * 4];       // histogram bin increments for this row

    // SIMD loop for this row - calculate flattened histogram indices and bin
    // increments (scalar code shown for reference - converting this loop to
    // SIMD is left as an exercise for the reader...)
    for (x = 0; x < width; ++x, ++i)
    {
        indices[4*x]   = (y/8)*(width/8)*18   + (x/8)*18     + idx[i];
        indices[4*x+1] = (y/8)*(width/8)*18   + (x/8 + 1)*18 + idx[i];
        indices[4*x+2] = (y/8+1)*(width/8)*18 + (x/8)*18     + idx[i];
        indices[4*x+3] = (y/8+1)*(width/8)*18 + (x/8 + 1)*18 + idx[i];
        vals[4*x]   = val[i]*ky[y]*kx[x];
        vals[4*x+1] = val[i]*ky[y]*kx[x+1];
        vals[4*x+2] = val[i]*ky[y+1]*kx[x];
        vals[4*x+3] = val[i]*ky[y+1]*kx[x+1];
    }

    // scalar loop for this row
    float * const histogram_base = &histogram[0][0][0]; // pointer to flattened histogram
    for (x = 0; x < width * 4; ++x) // for each of the 4*width index/increment pairs in this row
    {
        histogram_base[indices[x]] += vals[x]; // update the (flattened) histogram
    }
}

Discrete Fourier Transform implementation gives different results than OpenCV DFT

We have implemented the DFT and wanted to test it against OpenCV's implementation. The results are different.
Our DFT's results are ordered from smallest to biggest, whereas OpenCV's results are not in any particular order.
The first (0th) value is the same for both calculations, as in this case the complex part is 0 (since e^0 = 1 in the formula). The other values are different; for example, OpenCV's results contain negative values, whereas ours do not.
This is our implementation of the DFT:
// complex number
std::complex<float> j;
j = -1;
j = std::sqrt(j);
std::complex<float> result;
std::vector<std::complex<float>> fourier; // output
// this->N = length of the contour, 512 in our case
// for each Fourier descriptor
for (int n = 0; n < this->N; ++n)
{
    // summation in the formula
    for (int t = 0; t < this->N; ++t)
    {
        result += (this->centroidDistance[t] * std::exp((-j*PI2 *((float)n)*((float)t)) / ((float)N)));
    }
    fourier.push_back((1.0f / this->N) * result);
}
and this is how we calculate the DFT with OpenCV:
std::vector<std::complex<float>> fourierCV; // output
cv::dft(std::vector<float>(centroidDistance, centroidDistance + this->N), fourierCV, cv::DFT_SCALE | cv::DFT_COMPLEX_OUTPUT);
The variable centroidDistance is calculated in a previous step.
Note: please avoid answers saying use OpenCV instead of your own implementation.
You forgot to initialise result for each iteration of n:
for (int n = 0; n < this->N; ++n)
{
    result = 0.0f; // initialise `result` to 0 here <<<
    // summation in the formula
    for (int t = 0; t < this->N; ++t)
    {
        result += (this->centroidDistance[t] * std::exp((-j*PI2 *((float)n)*((float)t)) / ((float)N)));
    }
    fourier.push_back((1.0f / this->N) * result);
}
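If it helps with testing, here is a small verification sketch (my own addition, assuming the fixed loop above has run and both fourier and fourierCV are filled). Single-precision sums of 512 terms will not match bit-for-bit, so compare with a tolerance:

#include <cassert>
#include <complex>
#include <vector>

void compareSpectra(const std::vector<std::complex<float>>& fourier,
                    const std::vector<std::complex<float>>& fourierCV)
{
    assert(fourier.size() == fourierCV.size());
    for (size_t n = 0; n < fourier.size(); ++n)
    {
        // magnitude of the element-wise difference
        float diff = std::abs(fourier[n] - fourierCV[n]);
        assert(diff < 1e-3f); // loose tolerance for float round-off
    }
}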

Transposing the rectangular part of a matrix in CUDA

I'm writing a CUDA program to transpose a square matrix. The idea is to do it in two parts depending on the matrix size: the matrix is cut into an even-sized part handled with tiles, and the remaining rectangular strip is transposed separately. For example, for a 67x67 matrix with tile size 32, the 64x64 part is transposed first, then the remaining 3x67 strip.
My problem is in the rectangular part.
The code below shows the main function with the defined values:
const int TILE_DIM = 32;
const int BLOCK_ROWS = 8;
const int NUM_REPS = 100;
const int Nx = 2024; // size of the matrix
const int Ny = 2024;

int main(int argc, char **argv)
{
    const int nx = Nx;
    const int ny = Ny;                      // size of the arrays
    const int mem_size = nx*ny*sizeof(int); // size of the original array
    int *h_idata = (int*)malloc(mem_size);  // original host array
    int *d_idata;                           // device array
    checkCuda(cudaMalloc(&d_idata, mem_size));
    dim3 dimGridX(nx / TILE_DIM, 1, 1);     // grid dimension used
    dim3 dimBlockX(TILE_DIM, 1, 1);         // number of threads used
    // the kernel launch for only the rectangular part
    EdgeTransposeX << < dimGridX, dimBlockX >> >(d_idata);
    cudaEventRecord(startEvent, 0);
    cudaEventRecord(stopEvent, 0);
    cudaEventSynchronize(stopEvent);
    cudaEventElapsedTime(&ms, startEvent, stopEvent);
    cudaMemcpy(h_idata, d_idata, mem_size, cudaMemcpyDeviceToHost);
For the kernel code, I was advised not to use shared memory; below is how I've done it:
__global__ void EdgeTransposeX(int *idata)
{
    int tile_C[Edge][Nx];
    int tile_V[Nx][Edge];
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    if (x == (nEven - 1))
    {
        // copy the edge rows and columns into per-thread arrays
        for (int j = 0; j < Nx; j++)
            for (int i = 1; i <= Edge; i++)
            {
                tile_V[j][i - 1] = idata[j*Nx + (x + i)];
                tile_C[i - 1][j] = idata[(x + i)*Nx + j];
            }
        __syncthreads();
        // write the strips back, swapped
        for (int j = 0; j < Nx; j++)
            for (int i = 1; i <= Edge; i++)
            {
                idata[j*Nx + (x + i)] = tile_C[i - 1][j];
                idata[(x + i)*Nx + j] = tile_V[j][i - 1];
            }
    }
}
The code works okay until the matrix size reaches 1025; after that it stops working. Any idea why? Am I missing something here?
Your two-dimensional arrays tile_C and tile_V are physically stored in the GPU's local memory. The amount of local memory per thread is 512KB. Verify that you are not using more than 512KB of local memory per thread.
An automatic variable declared in device code without any of the __device__, __shared__ and __constant__ qualifiers described in this section generally resides in a register. However, in some cases the compiler might choose to place it in local memory. This fragment was taken from the "CUDA C Programming Guide" (2015), page 89.
My suggestion is that you use the visual profiler to check the occupancy, register and local memory usage.
This link may be helpful for you: link.
I implemented the transpose of a square matrix using CUDA surfaces in 2D; it works fine for sizes from 2 to 16384 in power-of-two increments. If you don't mind implementing a non-tiled version, I recommend this approach.
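As an alternative hedged sketch (my own addition, not the answerer's surface-based code): the edge strip can also be transposed with no per-thread arrays at all, one element swap per thread, which sidesteps the local memory limit entirely. Here nEven is the size of the tiled part and Edge = Nx - nEven, matching the question's setup:

// Each thread swaps one below-diagonal pair (row, col) with (col, row),
// where row lies in the edge strip [nEven, nx). Every pair is handled by
// exactly one thread, so no synchronization or scratch storage is needed.
__global__ void EdgeTransposeSwap(int *idata, int nx, int nEven)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x; // column index
    int row = nEven + blockIdx.y;                    // one edge row per grid row
    if (row < nx && col < row) // strictly below the diagonal: swap once
    {
        int tmp = idata[row * nx + col];
        idata[row * nx + col] = idata[col * nx + row];
        idata[col * nx + row] = tmp;
    }
}

// Launch sketch: enough threads to cover every column, one grid row per edge row.
// dim3 block(256);
// dim3 grid((Nx + block.x - 1) / block.x, Edge);
// EdgeTransposeSwap<<<grid, block>>>(d_idata, Nx, nEven);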