Manual Implementation of Sobel Operator in OpenCV


I am trying to apply a Sobel operator by iterating through an image and applying a mask to the surrounding pixels.
For now, I am trying to apply the vertical portion of the mask, which is:
-1 0 1
-2 0 2
-1 0 1
In my implementation, I am iterating through the rows and columns as follows:
for (int i = 1; i < image.rows-1; i++){
for (int j = 1; j < image.cols-1; j++){
int pixel1 = image.at<Vec3b>(i-1,j-1)[0] * -1;
int pixel2 = image.at<Vec3b>(i,j-1)[0] * 0;
int pixel3 = image.at<Vec3b>(i+1,j-1)[0] * 1;
int pixel4 = image.at<Vec3b>(i-1,j)[0] * -2;
int pixel5 = image.at<Vec3b>(i,j)[0] * 0;
int pixel6 = image.at<Vec3b>(i+1,j)[0] * 2;
int pixel7 = image.at<Vec3b>(i-1,j+1)[0] * -1;
int pixel8 = image.at<Vec3b>(i,j+1)[0] * 0;
int pixel9 = image.at<Vec3b>(i+1,j+1)[0] * 1;
int sum = pixel1 + pixel2 + pixel3 + pixel4 + pixel5 + pixel6 + pixel7 + pixel8 + pixel9;
verticalSobel.at<Vec3b>(i,j)[0] = sum;
verticalSobel.at<Vec3b>(i,j)[1] = sum;
verticalSobel.at<Vec3b>(i,j)[2] = sum;
}
}
Where the pixels are labeled as:
1 2 3
4 5 6
7 8 9
However, the resulting image is far off of what it should look like.
For reference, the resulting image is: [image not included]
whereas it should look similar to: [image not included]
The guide I am using is: https://www.tutorialspoint.com/dip/sobel_operator.htm
I am not sure if I am simply implementing the operator incorrectly, or just iterating through the image incorrectly.
Any help would be greatly appreciated. Thanks!

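You seem to have problems where the sum is negative. The sum is an int, but each channel of a Vec3b is an 8-bit unsigned value, so a negative sum (or one above 255) wraps around when it is stored. Take the absolute value of sum and clamp it to 255 (or, instead of taking the absolute value, clamp it to 0, depending on what you want to achieve). A "full" Sobel operator usually combines both directions with the 2D distance formula, so a horizontal-only or vertical-only variant should use the absolute value.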
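The fix described above can be sketched without OpenCV on a plain row-major 8-bit grayscale buffer. The function name sobelAbs and the buffer layout are assumptions made for this illustration, not part of the original code; border pixels are simply left at 0.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper: applies the mask
//   -1 0 1
//   -2 0 2
//   -1 0 1
// to a row-major 8-bit grayscale image and stores |sum| clamped to 255.
// Border pixels are left at 0.
std::vector<std::uint8_t> sobelAbs(const std::vector<std::uint8_t>& img,
                                   int rows, int cols) {
    std::vector<std::uint8_t> out(img.size(), 0);
    // Read one pixel as int so the weighted sum can go negative.
    auto p = [&](int r, int c) { return static_cast<int>(img[r * cols + c]); };
    for (int i = 1; i < rows - 1; ++i) {
        for (int j = 1; j < cols - 1; ++j) {
            int sum = -p(i - 1, j - 1) + p(i - 1, j + 1)
                      - 2 * p(i, j - 1) + 2 * p(i, j + 1)
                      - p(i + 1, j - 1) + p(i + 1, j + 1);
            // Absolute value first, then clamp into the 8-bit range.
            out[i * cols + j] =
                static_cast<std::uint8_t>(std::min(std::abs(sum), 255));
        }
    }
    return out;
}
```

With cv::Mat the same two steps (absolute value, then clamp) would be applied to sum before writing it to the output pixel.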

Related

convolution implementation in c++

I want to implement a 2D convolution function in C++ myself, without using filter2D(). I'm trying to iterate over all pixels of the input image and the kernel, then assign a new value to each pixel of dst.
However, I get this error:
Thread 1: EXC_BAD_ACCESS (code=1, address=0x0)
I found that this error means I'm accessing a null pointer, but I could not solve the problem. Here is my C++ code.
cv::Mat_<float> spatialConvolution(const cv::Mat_<float>& src, const cv::Mat_<float>& kernel)
{
// declare variables
Mat_<float> dst;
Mat_<float> flipped_kernel;
float tmp = 0.0;
// flip kernel
flip(kernel, flipped_kernel, -1);
// multiply and integrate
// input rows
for(int i=0;i<src.rows;i++){
// input columns
for(int j=0;j<src.cols;j++){
// kernel rows
for(int k=0;k<flipped_kernel.rows;k++){
// kernel columns
for(int l=0;l<flipped_kernel.cols;l++){
tmp += src.at<float>(i,j) * flipped_kernel.at<float>(k,l);
}
}
dst.at<float>(i,j) = tmp;
}
}
return dst.clone();
}

To simplify, suppose you have a 3x3 kernel:
k(0,0) k(0,1) k(0,2)
k(1,0) k(1,1) k(1,2)
k(2,0) k(2,1) k(2,2)
To calculate the convolution, you scan the input image (marked as I) from left to right and from top to bottom,
and for every pixel of the input image you assign one value calculated from the formula below:
newValue(y,x) = I(y-1,x-1) * k(0,0) + I(y-1,x) * k(0,1) + I(y-1,x+1) * k(0,2)
+ I(y,x-1) * k(1,0) + I(y,x) * k(1,1) + I(y,x+1) * k(1,2)
+ I(y+1,x-1) * k(2,0) + I(y+1,x) * k(2,1) + I(y+1,x+1) * k(2,2)
------------------x------------>
|
|
| [k(0,0) k(0,1) k(0,2)]
y [k(1,0) k(1,1) k(1,2)]
| [k(2,0) k(2,1) k(2,2)]
|
(y,x) of the input image (I) is the anchor point of the kernel. To assign a new value to I(y,x),
you need to multiply every k coefficient by the corresponding point of I; your code doesn't do this.
First, you need to create the dst matrix with the same dimensions and pixel type as the original image.
Then you need to rewrite your loops to reflect the formula described above:
cv::Mat_<float> spatialConvolution(const cv::Mat_<float>& src, const cv::Mat_<float>& kernel)
{
Mat dst(src.rows,src.cols,src.type());
Mat_<float> flipped_kernel;
flip(kernel, flipped_kernel, -1);
const int dx = kernel.cols / 2;
const int dy = kernel.rows / 2;
for (int i = 0; i<src.rows; i++)
{
for (int j = 0; j<src.cols; j++)
{
float tmp = 0.0f;
for (int k = 0; k<flipped_kernel.rows; k++)
{
for (int l = 0; l<flipped_kernel.cols; l++)
{
int x = j - dx + l;
int y = i - dy + k;
if (x >= 0 && x < src.cols && y >= 0 && y < src.rows)
tmp += src.at<float>(y, x) * flipped_kernel.at<float>(k, l);
}
}
dst.at<float>(i, j) = saturate_cast<float>(tmp);
}
}
return dst.clone();
}

Your memory access error is presumably caused by the line:
dst.at<float>(i,j) = tmp;
because dst is not initialized. You can't assign to an index of a matrix that has no size or data. Declaring Mat_<float> dst; creates an empty matrix with no data allocated, so initialize the matrix first. Use one of the Mat constructors that takes a cv::Size or the rows and columns (see the docs). For example, you can initialize dst with:
Mat dst{src.size(), src.type()};

HOG optimization with using SIMD

There have been several attempts to optimize the calculation of the HOG descriptor using SIMD instructions: OpenCV, Dlib, and Simd. All of them use scalar code to add the resulting magnitude to the HOG histogram:
float histogram[height/8][width/8][18];
float ky[height], kx[width];
int idx[size];
float val[size];
for(size_t i = 0; i < size; ++i)
{
histogram[y/8][x/8][idx[i]] += val[i]*ky[y]*kx[x];
histogram[y/8][x/8 + 1][idx[i]] += val[i]*ky[y]*kx[x + 1];
histogram[y/8 + 1][x/8][idx[i]] += val[i]*ky[y + 1]*kx[x];
histogram[y/8 + 1][x/8 + 1][idx[i]] += val[i]*ky[y + 1]*kx[x + 1];
}
The value of size depends on the implementation, but in general the meaning is the same.
I know that the problem of histogram calculation using SIMD does not have a simple and effective solution. But in this case the histogram is small (18 bins). Can that help with SIMD optimization?

I have found a solution: a temporary buffer. First we sum the histogram contributions into the temporary buffer (this operation can be vectorized). Then we add the sums from the buffer to the output histogram (this operation can also be vectorized):
float histogram[height/8][width/8][18];
float ky[height], kx[width];
int idx[size];
float val[size];
float buf[18][4] = {}; // the buffer must start at zero
for(size_t i = 0; i < size; ++i)
{
buf[idx[i]][0] += val[i]*ky[y]*kx[x];
buf[idx[i]][1] += val[i]*ky[y]*kx[x + 1];
buf[idx[i]][2] += val[i]*ky[y + 1]*kx[x];
buf[idx[i]][3] += val[i]*ky[y + 1]*kx[x + 1];
}
for(size_t i = 0; i < 18; ++i)
{
histogram[y/8][x/8][i] += buf[i][0];
histogram[y/8][x/8 + 1][i] += buf[i][1];
histogram[y/8 + 1][x/8][i] += buf[i][2];
histogram[y/8 + 1][x/8 + 1][i] += buf[i][3];
}

You can do a partial optimisation by using SIMD to calculate all the (flattened) histogram indices and the bin increments. Then process these in a scalar loop afterwards. You probably also want to strip-mine this such that you process one row at a time, in order to keep the temporary bin indices and increments in cache. It might appear that this would be inefficient, due to the use of temporary intermediate buffers, but in practice I have seen a useful overall gain in similar scenarios.
uint32_t i = 0;
for (y = 0; y < height; ++y) // for each row
{
uint32_t indices[width * 4]; // flattened histogram indices for this row
float vals[width * 4]; // histogram bin increments for this row
// SIMD loop for this row - calculate flattened histogram indices and bin
// increments (scalar code shown for reference - converting this loop to
// SIMD is left as an exercise for the reader...)
for (x = 0; x < width; ++x, ++i)
{
indices[4*x] = (y/8)*(width/8)*18+(x/8)*18+idx[i];
indices[4*x+1] = (y/8)*(width/8)*18+(x/8 + 1)*18+idx[i];
indices[4*x+2] = (y/8+1)*(width/8)*18+(x/8)*18+idx[i];
indices[4*x+3] = (y/8+1)*(width/8)*18+(x/8 + 1)*18+idx[i];
vals[4*x] = val[i]*ky[y]*kx[x];
vals[4*x+1] = val[i]*ky[y]*kx[x+1];
vals[4*x+2] = val[i]*ky[y+1]*kx[x];
vals[4*x+3] = val[i]*ky[y+1]*kx[x+1];
}
// scalar loop for this row
float * const histogram_base = &histogram[0][0][0]; // pointer to flattened histogram
for (x = 0; x < width * 4; ++x) // for each set of 4 indices/increments in this row
{
histogram_base[indices[x]] += vals[x]; // update the (flattened) histogram
}
}

fftshift / ifftshift in terms of circshift

I am trying to relate fftshift/ifftshift to circular shift.
N = 5
Y = 0:N-1
X = [0 1 2 3 4]
When I fftshift(X), I get
[3 4 0 1 2]
When I ifftshift(X), I get
[2 3 4 0 1]
How do I relate fftshift/ifftshift to circular shift? Is it simply moving the numbers in X about in different directions?
I need to know this as I'm trying to implement these two functions in terms of circular shift in C++, which is a function I already have done.
Many thanks.

After looking at the MATLAB code, which doesn't use a circular shift directly but rather MATLAB indexing syntax, I arrived at the following.
Say N = number of elements.
To implement fftshift,
circularShiftRightBy = floor(N/2)
To implement ifftshift,
circularShiftRightBy = ceil(N/2)
Because the shift is N/2, fftshift and ifftshift differ only if N is odd.
The circular shift code is:
template<typename ty>
void circshift(ty *out, const ty *in, int xdim, int ydim, int xshift, int yshift)
{
for (int i =0; i < xdim; i++) {
int ii = (i + xshift) % xdim;
if (ii<0) ii = xdim + ii;
for (int j = 0; j < ydim; j++) {
int jj = (j + yshift) % ydim;
if (jj<0) jj = ydim + jj;
out[ii * ydim + jj] = in[i * ydim + j];
}
}
}
(Modified from the fftshift/ifftshift C/C++ source code to support left (negative) shifting as well.)
EDIT: I've since found a better way to do this: https://kerpanic.wordpress.com/2016/04/08/more-efficient-ifftshift-fftshift-in-c/
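The floor/ceil rule above can be checked on the 1-D example from the question with a short sketch based on std::rotate. The names circshiftRight, fftshift1d and ifftshift1d are made up for this illustration and are not from the linked post.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Circularly shifts v to the right by k positions.
template <typename T>
std::vector<T> circshiftRight(std::vector<T> v, std::size_t k) {
    if (!v.empty()) {
        k %= v.size();
        // The element k places before the end becomes the new first element.
        std::rotate(v.begin(), v.end() - static_cast<std::ptrdiff_t>(k), v.end());
    }
    return v;
}

template <typename T>
std::vector<T> fftshift1d(const std::vector<T>& v) {
    return circshiftRight(v, v.size() / 2);        // floor(N/2)
}

template <typename T>
std::vector<T> ifftshift1d(const std::vector<T>& v) {
    return circshiftRight(v, (v.size() + 1) / 2);  // ceil(N/2)
}
```

For X = [0 1 2 3 4], fftshift1d gives [3 4 0 1 2] and ifftshift1d gives [2 3 4 0 1], matching the MATLAB output in the question; applying one after the other returns the original vector.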

How to optimize simple gaussian filter for performance?

I am trying to write an Android app which needs to calculate Gaussian and Laplacian pyramids for multiple full-resolution images. I wrote it in C++ with the NDK. The most critical part of the code is applying a Gaussian filter to the images, and I am applying this filter horizontally and vertically.
The filter is (0.0625, 0.25, 0.375, 0.25, 0.0625)
Since I am working with integers, I am calculating (1, 4, 6, 4, 1)/16
dst[index] = ( src[index-2] + src[index-1]*4 + src[index]*6+src[index+1]*4+src[index+2])/16;
I have made a few simple optimizations; however, it is still slower than expected, and I was wondering if there are any other optimization options that I am missing.
PS: I should mention that I have tried to write this filter part with inline ARM assembly; however, it gave 2x slower results.
//horizontal filter
for(unsigned y = 0; y < height; y++) {
for(unsigned x = 2; x < width-2; x++) {
int index = y*width+x;
dst[index].r = (src[index-2].r+ src[index+2].r + (src[index-1].r + src[index+1].r)*4 + src[index].r*6)>>4;
dst[index].g = (src[index-2].g+ src[index+2].g + (src[index-1].g + src[index+1].g)*4 + src[index].g*6)>>4;
dst[index].b = (src[index-2].b+ src[index+2].b + (src[index-1].b + src[index+1].b)*4 + src[index].b*6)>>4;
}
}
//vertical filter
for(unsigned y = 2; y < height-2; y++) {
for(unsigned x = 0; x < width; x++) {
int index = y*width+x;
dst[index].r = (src[index-2*width].r + src[index+2*width].r + (src[index-width].r + src[index+width].r)*4 + src[index].r*6)>>4;
dst[index].g = (src[index-2*width].g + src[index+2*width].g + (src[index-width].g + src[index+width].g)*4 + src[index].g*6)>>4;
dst[index].b = (src[index-2*width].b + src[index+2*width].b + (src[index-width].b + src[index+width].b)*4 + src[index].b*6)>>4;
}
}

The index multiplication can be factored out of the inner loop, since the multiplication only needs to happen when y changes:
for (unsigned y ...
{
int index = y * width;
for (unsigned int x...
You may gain some speed by loading variables before you use them. This would make the processor load them in the cache:
for (unsigned x = ...
{
register YOUR_DATA_TYPE a, b, c, d, e;
a = src[index - 2].r;
b = src[index - 1].r;
c = src[index + 0].r; // The " + 0" is to show a pattern.
d = src[index + 1].r;
e = src[index + 2].r;
dest[index].r = (a + e + (b + d) * 4 + c * 6) >> 4;
// ...
Another trick would be to "cache" the values of the src so that only a new one is added each time because the value in src[index+2] may be used up to 5 times.
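So here is an example of the concepts:
//horizontal filter
for(unsigned y = 0; y < height; y++)
{
int index = y*width + 2; // first pixel of this row that has two neighbours on each side
YOUR_DATA_TYPE a, b, c, d, e;
a = src[index - 2].r;
b = src[index - 1].r;
c = src[index + 0].r; // The " + 0" is to show a pattern.
d = src[index + 1].r;
e = src[index + 2].r;
for(unsigned x = 2; x < width-2; x++)
{
dest[index - 2 + x].r = (a + e + (b + d) * 4 + c * 6) >> 4;
// slide the window one pixel to the right
a = b;
b = c;
c = d;
d = e;
if (x + 3 < width) // do not read past the end of the row
e = src[index + x + 1].r; // the only new sample entering the window
}
// ... same for .g and .b
}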

I'm not sure how your compiler would optimize all this, but I tend to work in pointers. Assuming your struct is 3 bytes... You can start with pointers in the right places (the edge of the filter for source, and the destination for target), and just move them through using constant array offsets. I've also put in an optional OpenMP directive on the outer loop, as this can also improve things.
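#pragma omp parallel for
for(unsigned y = 0; y < height; y++) {
const int rowindex = y * width;
// unsigned char, so that values above 127 are not read as negative
unsigned char * dpos = (unsigned char*)&dest[rowindex+2];
const unsigned char * spos = (const unsigned char*)&src[rowindex];
const unsigned char *end = (const unsigned char*)&src[rowindex+width-4];
for( ; spos != end; spos++, dpos++) {
// with 3 bytes per pixel, the same channel of the 5 neighbouring pixels sits at byte offsets 0, 3, 6, 9, 12
*dpos = (spos[0] + spos[12] + ((spos[3] + spos[9])<<2) + spos[6]*6) >> 4;
}
}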
Similarly for the vertical loop.
const int scanwidth = width * 3; // bytes per row
const int row1 = scanwidth;
const int row2 = row1+scanwidth;
const int row3 = row2+scanwidth;
const int row4 = row3+scanwidth;
#pragma omp parallel for
for(unsigned y = 2; y < height-2; y++) {
const int rowindex = y * width;
unsigned char * dpos = (unsigned char*)&dest[rowindex];
// start two rows above the destination row (src is indexed in pixels, not bytes)
const unsigned char * spos = (const unsigned char*)&src[rowindex-2*width];
const unsigned char *end = spos + scanwidth;
for( ; spos != end; spos++, dpos++) {
*dpos = (spos[0] + spos[row4] + ((spos[row1] + spos[row3])<<2) + spos[row2]*6) >> 4;
}
}
This is how I do convolutions, anyway. It sacrifices readability a little, and I've never tried measuring the difference. I just tend to write them that way from the outset. See if that gives you a speed-up. The OpenMP definitely will if you have a multicore machine, and the pointer stuff might.
I like the comment about using SSE for these operations.

Some of the more obvious optimizations exploit the symmetry of the kernel:
a=*src++; b=*src++; c=*src++; d=*src++; e=*src++; // init
LOOP (n/5) times:
z=(a+e)+((b+d)<<2)+c*6; *dst++=z>>4; // then reuse the local variables
a=*src++;
z=(b+a)+((c+e)<<2)+d*6; *dst++=z>>4; // registers have been read only once...
b=*src++;
z=(c+b)+((d+a)<<2)+e*6; *dst++=z>>4;
c=*src++; // ...and so on for the remaining two rotations
The second thing is that one can perform multiple additions using a single integer. When the values to be filtered are unsigned, one can fit two channels in a single 32-bit integer (or 4 channels in a 64-bit integer); it's the poor man's SIMD.
a= 0x[0011][0034] <-- split to two
b= 0x[0031][008a]
----------------------
sum 0042 00be
>>4 0004 200b <-- mask off
mask 00ff 00ff
-------------------
0004 000b <-- result
(The simulated SIMD shows one addition followed by a shift by 4.)
Here's a kernel that calculates 3 rgb operations in parallel (easy to modify for 6 rgb operations in 64-bit architectures...)
#define MASK (255+(255<<10)+(255<<20))
#define KERNEL(a,b,c,d,e) { \
a=((a+e+(c<<1))>>2) & MASK; a=(a+b+c+d)>>2 & MASK; *DATA++ = a; a=DATA[4]; }
void calc_5_rgbs(unsigned int *DATA)
{
register unsigned int a = DATA[0], b=DATA[1], c=DATA[2], d=DATA[3], e=DATA[4];
KERNEL(a,b,c,d,e);
KERNEL(b,c,d,e,a);
KERNEL(c,d,e,a,b);
KERNEL(d,e,a,b,c);
KERNEL(e,a,b,c,d);
}
Works best on ARM and on 64-bit IA with 16 registers... Needs heavy assembler optimizations to overcome register shortage in 32-bit IA (e.g. use ebp as GPR). And just because of that it's an inplace algorithm...
There are just 2 guard bits between every 8 bits of data, which is just enough to get exactly the same result as in the integer calculation.
And BTW: it's faster to just run through the array byte by byte than by r,g,b elements
unsigned char *s=(unsigned char *) source_array;
unsigned char *d=(unsigned char *) dest_array;
for (j=0;j<3*N;j++) d[j]=(s[j]+s[j+16]+s[j+8]*6+s[j+4]*4+s[j+12]*4)>>4;
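The packed-integer trick above can be checked in isolation. The sketch below packs three 8-bit channels into 10-bit fields of one 32-bit word and applies the same two-step divide-by-4 as the KERNEL macro; the helper names pack, blur5Packed and blur5Scalar are assumptions made for this illustration. Each intermediate field sum is at most 4*255 = 1020, which fits in 10 bits, so no carry crosses into the neighbouring channel.

```cpp
#include <cstdint>

// Three 8-bit channels in 10-bit fields: 2 guard bits above each channel.
const std::uint32_t kMask = 255u + (255u << 10) + (255u << 20);

inline std::uint32_t pack(std::uint32_t r, std::uint32_t g, std::uint32_t b) {
    return r | (g << 10) | (b << 20);
}

// (a + 4b + 6c + 4d + e) / 16 on all three channels at once, done as two
// divide-by-4 steps so that every field stays below 1024.
inline std::uint32_t blur5Packed(std::uint32_t a, std::uint32_t b,
                                 std::uint32_t c, std::uint32_t d,
                                 std::uint32_t e) {
    std::uint32_t t = ((a + e + (c << 1)) >> 2) & kMask;  // (a + 2c + e) / 4
    return ((t + b + c + d) >> 2) & kMask;                // (t + b + c + d) / 4
}

// Scalar reference for a single channel.
inline std::uint32_t blur5Scalar(std::uint32_t a, std::uint32_t b,
                                 std::uint32_t c, std::uint32_t d,
                                 std::uint32_t e) {
    return (a + 4 * b + 6 * c + 4 * d + e) >> 4;
}
```

Unpacking a result with (out >> 10) & 255 for green, and similarly for the other fields, gives the same value as blur5Scalar on that channel, because dividing by 4 twice with truncation equals dividing by 16 with truncation.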

normalizing a list of doubles to range -1 to 1 or 0 - 255

I have a list of doubles in the range of roughly -1.396655 to 1.74707; it could be even higher or lower, but either way I would know the Min and Max values before normalizing. My question is: how can I normalize these values to the range -1 to 1, or better yet, convert them from double values to char values of 0 to 255?
Any help would be appreciated.
double range = (double)(max - min);
value = 255 * (value - min)/range

You need a mapping of the form y = mx + c, and you need to find an m and a c. You have two fixed data-points, i.e.:
1 = m * max + c
-1 = m * min + c
From there, it's simple algebra.

The easiest thing is to first shift all the values so that min is 0, by subtracting Min from each number. Then multiply by 255/(Max-Min), so that the shifted Max will get mapped to 255, and everything else will scale linearly. So I believe your equation would look like this:
newval = (unsigned char) ((oldval - Min)*(255/(Max-Min)))
You may want to round a bit more carefully before casting to char.
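One way to do that rounding is sketched below: scale, round to the nearest integer, then clamp before the cast. The function name toByte is an assumption for this illustration, and the clamp also covers inputs that fall slightly outside [Min, Max].

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Maps value from [lo, hi] to 0..255 with rounding to nearest.
// Assumes hi > lo; out-of-range inputs are clamped.
std::uint8_t toByte(double value, double lo, double hi) {
    const double scaled = 255.0 * (value - lo) / (hi - lo);
    const long rounded = std::lround(scaled);
    return static_cast<std::uint8_t>(std::max(0L, std::min(255L, rounded)));
}
```

A plain cast truncates toward zero, so without the rounding step only an input exactly equal to Max would ever produce 255.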

There are two changes to be made.
First, use 256 as the limit.
Second, clamp the result so that value == max maps to 255 instead of 256.
public int GetRangedValue(double value, double min, double max)
{
int outputLimit = 256;
double range = max - min;
// Scale so that rangedValue >= 0 and rangedValue <= 1
double rangedValue = (value - min) / range;
// value == max would give 256, so clamp it into the top bin
return Math.Min(outputLimit - 1, (int)(outputLimit * rangedValue));
}
With these two changes, you will get the correct distribution in your output.

I solved this problem when I dove into some convolution work in C++.
Hopefully my code can serve as a useful reference :)
bool normalize(uint8_t*& dst, double* src, int width, int height) {
dst = new uint8_t[sizeof(uint8_t)*width*height];
if (dst == NULL)
return false;
memset(dst, 0, sizeof(uint8_t)*width*height);
double max = std::numeric_limits<double>::lowest(); // most negative double; min() is the smallest positive value
double min = std::numeric_limits<double>::max();
double range = std::numeric_limits<double>::max();
double norm = 0.0;
//find the boundary
for (int j=0; j<height; j++) {
for (int i=0; i<width; i++) {
if (src[i+j*width] > max)
max = src[i+j*width];
if (src[i+j*width] < min) // checked independently, so a rising sequence still updates min
min = src[i+j*width];
}
}
//normalize double matrix to be an uint8_t matrix
range = max - min;
for (int j=0; j<height; j++) {
for (int i=0; i<width; i++) {
norm = src[i+j*width];
norm = 255.0*(norm-min)/range;
dst[i+j*width] = (uint8_t)norm;
}
}
return true;
}
Basically, the output (which the caller receives through 'dst') is in the range [0, 255].
