SSE optimization of Gaussian blur - c++

I'm working on a school project where I have to optimize part of the code with SSE, but I've been stuck on one part for a few days now.
I don't see any smart way of using SSE vector instructions (inline assembler or intrinsics) in this code (it's part of a Gaussian blur algorithm). I would be glad if somebody could give me just a small hint.
for (int x = x_start; x < x_end; ++x) // vertical blur...
{
    float sum = image[x + (y_start - radius - 1)*image_w];
    float dif = -sum;
    for (int y = y_start - 2*radius - 1; y < y_end; ++y)
    { // inner vertical Radius loop
        float p = (float)image[x + (y + radius)*image_w]; // next pixel
        buffer[y + radius] = p; // buffer pixel
        sum += dif + fRadius*p;
        dif += p; // accumulate pixel blur
        if (y >= y_start)
        {
            float s = 0, w = 0; // border blur correction
            sum -= buffer[y - radius - 1]*fRadius; // addition for fraction blur
            dif += buffer[y - radius] - 2*buffer[y]; // sum up differences: +1, -2, +1
            // cut off accumulated blur area of pixel beyond the border
            // assume: added pixel values beyond border = value at border
            p = (float)(radius - y); // top part to cut off
            if (p > 0)
            {
                p = p*(p-1)/2 + fRadius*p;
                s += buffer[0]*p;
                w += p;
            }
            p = (float)(y + radius - image_h + 1); // bottom part to cut off
            if (p > 0)
            {
                p = p*(p-1)/2 + fRadius*p;
                s += buffer[image_h - 1]*p;
                w += p;
            }
            new_image[x + y*image_w] = (unsigned char)((sum - s)/(weight - w)); // set blurred pixel
        }
        else if (y + radius >= y_start)
        {
            dif -= 2*buffer[y];
        }
    } // for y
} // for x

One more feature you can use is logical operations and masks.
For example, instead of:
// process only 1
if (p > 0)
p = p*(p-1)/2 + fRadius*p;
you can write:
// processes 4 floats (p and fRadius are __m128; the arithmetic operators assume overloads for __m128)
const __m128 mask = _mm_cmpgt_ps(p, _mm_setzero_ps()); // all bits set in the lanes where p > 0
const __m128 p_tmp = ( p*(p-1)/2 + fRadius*p );
p = _mm_or_ps(_mm_and_ps(mask, p_tmp), _mm_andnot_ps(mask, p)); // = (p_tmp & mask) | (p & ~mask)
I can also recommend using a library that overloads the arithmetic operators for SSE types.
The dif variable makes the iterations of the inner loop depend on each other, so you should try to parallelize the outer loop instead. Without operator overloading, however, the code will become unmanageable.
Also consider rethinking the whole algorithm, since the current one doesn't look parallel. Maybe you can sacrifice some precision, or increase the scalar time a bit?
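To make the mask idea concrete, here is a minimal sketch that uses only plain SSE intrinsics (no operator overloads). The function names cutoff_scalar and cutoff_sse are made up for this example, and it assumes the values of p for several pixels have already been gathered into a contiguous float array whose length is a multiple of 4, which the original loop does not do by itself.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE intrinsics

// Scalar reference: the conditional update from the question.
inline float cutoff_scalar(float p, float fRadius) {
    if (p > 0) p = p * (p - 1) / 2 + fRadius * p;
    return p;
}

// Same update for n floats, 4 lanes at a time, without a branch.
// Assumption: n is a multiple of 4 (hypothetical helper, not from the question).
inline void cutoff_sse(float* p, int n, float fRadius) {
    assert(n % 4 == 0);
    const __m128 zero = _mm_setzero_ps();
    const __m128 one  = _mm_set1_ps(1.0f);
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 r    = _mm_set1_ps(fRadius);
    for (int i = 0; i < n; i += 4) {
        __m128 v    = _mm_loadu_ps(p + i);
        __m128 mask = _mm_cmpgt_ps(v, zero); // all bits set where v > 0
        // t = v*(v-1)/2 + fRadius*v, computed for every lane
        __m128 t = _mm_add_ps(
            _mm_mul_ps(_mm_mul_ps(v, _mm_sub_ps(v, one)), half),
            _mm_mul_ps(r, v));
        // take t where the mask is set, keep the old value elsewhere
        v = _mm_or_ps(_mm_and_ps(mask, t), _mm_andnot_ps(mask, v));
        _mm_storeu_ps(p + i, v);
    }
}
```

Both branches are evaluated for every lane and the mask only selects the result, so this pays off when the computation is cheap, as it is here.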


Drawing an image along a slope in OpenGL

I'm writing a program that can draw a line between two points with filled circles. The circles:
- shouldn't overlap each other
- be as close together as possible
- and the centre of each circle should be on the line.
I've written a function to produce the circles, but I'm having trouble calculating the position of each circle so that they are correctly lined up:
void addCircles(scrPt endPt1, scrPt endPt2)
{
    float xLength, yLength, length, cSquare, slope;
    int numberOfCircles;
    // Get the x distance between the two points
    xLength = abs(endPt1.x - endPt2.x);
    // Get the y distance between the two points
    yLength = abs(endPt1.y - endPt2.y);
    // Get the length between the points
    cSquare = pow(xLength, 2) + pow(yLength, 2);
    length = sqrt(cSquare);
    // calculate the slope
    slope = (endPt2.y - endPt1.y) / (endPt2.x - endPt1.x);
    // Find how many circles fit inside the length
    numberOfCircles = round(length / (radius * 2) - 1);
    // set the position of each circle
    for (int i = 0; i < numberOfCircles; i++)
    {
        scrPt circPt;
        circPt.x = endPt1.x + ((radius * 2) * i);
        circPt.y = endPt1.y + (((radius * 2) * i) * slope);
        drawCircle (circPt.x, circPt.y);
    }
}
This is what the above code produces:
I'm quite certain that the issue lies with this line, which sets the y value of the circle:
circPt.y = endPt1.y + (((radius * 2) * i) * slope);
Any help would be greatly appreciated
I recommend calculating the direction of the line as a unit vector:
float xDist = endPt2.x - endPt1.x;
float yDist = endPt2.y - endPt1.y;
float length = sqrt(xDist*xDist + yDist *yDist);
float xDir = xDist / length;
float yDir = yDist / length;
Calculate the distance from one center point to the next. Note that numberOfSegments is the number of sections between circles, not the number of circles:
int numberOfSegments = (int)trunc( length / (radius * 2) );
float distCpt = numberOfSegments == 0 ? 0.0f : length / (float)numberOfSegments;
The center point of a circle is calculated by adding a vector to the start point of the line. The vector points in the direction of the line, and its length is the distance between 2 circles multiplied by the "index" of the circle:
for (int i = 0; i <= numberOfSegments; i++)
{
    float cpt_x = endPt1.x + xDir * distCpt * (float)i;
    float cpt_y = endPt1.y + yDir * distCpt * (float)i;
    drawCircle(cpt_x , cpt_y);
}
Note that the last circle on a line may be redrawn by the first circle of the next line. You can avoid this by changing the loop condition from <= to <:
for (int i = 0; i < numberOfSegments; i++)
In this case, no circle is drawn at the end of the line.
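Put together, the steps above can be sketched as one function that returns the centers instead of drawing them. Pt is a stand-in for the question's scrPt, and circleCenters is a name made up for this example; radius is passed in as a parameter rather than read from a global.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Pt { float x, y; }; // stand-in for the question's scrPt

// Centers of non-overlapping circles spread evenly from a to b (inclusive).
std::vector<Pt> circleCenters(Pt a, Pt b, float radius) {
    assert(radius > 0.0f);
    const float dx = b.x - a.x;
    const float dy = b.y - a.y;
    const float length = std::sqrt(dx * dx + dy * dy);
    // number of gaps between neighbouring centers, not the number of circles
    const int segments = static_cast<int>(std::trunc(length / (2.0f * radius)));
    std::vector<Pt> centers;
    if (segments == 0) { // line shorter than one diameter: a single circle at the start
        centers.push_back(a);
        return centers;
    }
    const float step = length / static_cast<float>(segments); // >= 2 * radius
    const float xDir = dx / length; // unit direction of the line
    const float yDir = dy / length;
    for (int i = 0; i <= segments; ++i) {
        centers.push_back({a.x + xDir * step * i, a.y + yDir * step * i});
    }
    return centers;
}
```

Because step is the full length divided by a whole number of segments, the first and last centers land exactly on the end points, and neighbouring circles are at least one diameter apart.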

Transform images with bezier curves

I'm using this article (nonlingr) as a source to understand non-linear transformations. In the section GLYPHS ALONG A PATH the author explains how to use a parametric curve to transform an image. I'm trying to apply a cubic Bezier curve to an image, but so far I have been unsuccessful. This is my code:
OUT.aloc(IN.width(), IN.height());
//get the control points...
wVector p0(values[vindex], values[vindex+1], 1);
wVector p1(values[vindex+2], values[vindex+3], 1);
wVector p2(values[vindex+4], values[vindex+5], 1);
wVector p3(values[vindex+6], values[vindex+7], 1);
//this is to calculate t based on x
double trange = 1 / (OUT.width()-1);
//curve coefficients
double A = (-p0[0] + 3*p1[0] - 3*p2[0] + p3[0]);
double B = (3*p0[0] - 6*p1[0] + 3*p2[0]);
double C = (-3*p0[0] + 3*p1[0]);
double D = p0[0];
double E = (-p0[1] + 3*p1[1] - 3*p2[1] + p3[1]);
double F = (3*p0[1] - 6*p1[1] + 3*p2[1]);
double G = (-3*p0[1] + 3*p1[1]);
double H = p0[1];
//apply the transformation
for(long i = 0; i < OUT.height(); i++){
for(long j = 0; j < OUT.width(); j++){
//t = x / width
double t = trange * j;
//apply the article given formulas
double x_path_d = 3*t*t*A + 2*t*B + C;
double y_path_d = 3*t*t*E + 2*t*F + G;
double angle = 3.14159265/2.0 + std::atan(y_path_d / x_path_d);
mapped_point.Set((t*t*t)*A + (t*t)*B + t*C + D + i*std::cos(angle),
(t*t*t)*E + (t*t)*F + t*G + H + i*std::sin(angle),
//test if the point is inside the image
if(mapped_point[0] < 0 ||
mapped_point[0] >= OUT.width() ||
mapped_point[1] < 0 ||
mapped_point[1] >= IN.height())
IN.getPixel(j, i));
Applying this code to a 300x196 RGB image, all I get is a black screen, no matter which control points I use. It is hard to find information about this kind of transformation: searching for parametric curves, all I find is how to draw them, not how to apply them to images. Can someone help me transform an image with a Bezier curve?
IMHO, applying a curve to an image sounds like using a LUT. You need to evaluate the curve for each possible image value and then replace the image value with the one on the curve. So create a look-up table covering each possible value in the image (e.g. 0, 1, ..., 255 for an 8-bit grayscale image), that is, a 256x2 matrix whose first column holds the values from 0 to 255 and whose second column holds the corresponding values of the curve.
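A minimal sketch of the look-up-table idea, assuming the curve is read as a tone curve that maps a normalized input value in [0, 1] to an output value in [0, 1]. The names cubicBezier1D, buildLut and applyLut are made up for this example, and using the input value directly as the curve parameter t is a simplification. Note that this remaps pixel values; it does not move pixels the way the article's path transformation does.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// One coordinate of a cubic Bezier curve with control values y0..y3, t in [0, 1].
inline double cubicBezier1D(double y0, double y1, double y2, double y3, double t) {
    const double u = 1.0 - t;
    return u * u * u * y0 + 3 * u * u * t * y1 + 3 * u * t * t * y2 + t * t * t * y3;
}

// Sample the curve once per possible 8-bit value; the array index is the
// "first column" (0..255) and the stored value is the "second column".
template <typename Curve>
std::array<std::uint8_t, 256> buildLut(Curve curve) {
    std::array<std::uint8_t, 256> lut{};
    for (int v = 0; v < 256; ++v) {
        double y = curve(v / 255.0);
        y = std::min(1.0, std::max(0.0, y)); // clamp to the valid range
        lut[v] = static_cast<std::uint8_t>(std::lround(y * 255.0));
    }
    return lut;
}

// Replace every pixel value with its table entry.
inline void applyLut(std::vector<std::uint8_t>& pixels,
                     const std::array<std::uint8_t, 256>& lut) {
    for (std::uint8_t& p : pixels) p = lut[p];
}
```

Building the table costs 256 curve evaluations no matter how large the image is, which is the main reason to use a LUT here.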

c++ YUYV 422 Horizontal and Vertical Flipping

I have a uint8_t YUYV 422 (Interleaved) image array in memory and I want to be able to flip it both vertically and horizontally. I have successfully implemented a vertical flip but I'm having a problem with flipping both horizontally and vertically at the same time.
My code for the vertical flip, below, works perfectly.
int counter = 0;
int array_width = 2; // YUYV: 2 bytes per pixel
for (int h = (m_Width * m_Height * array_width) - m_Width * array_width; h >= 0; h -= m_Width * array_width)
{
    for (int w = 0; w < m_Width * array_width; w++)
    {
        flipped[counter++] = buffer[h + w];
    }
}
However, the following vertical and horizontal flip code appears to work but there is a loss of definition. To better understand what I am referring to, please see my sample images.
int x = 0;
for (int n = m_Width * m_Height * 2 - 1; n >= 0; n -= 4)
{
    flipped[x] = buffer[n - 3]; // Y0
    flipped[x + 1] = buffer[n - 2]; // U
    flipped[x + 2] = buffer[n - 1]; // Y1
    flipped[x + 3] = buffer[n]; // V
    x += 4;
}
As you can see, I am moving the YUYV components and keeping them in the same order. I don't believe that I am dropping pixels so I don't understand why I am losing definition. To reiterate, I don't see this problem when flipping vertically (Using the first code snippet).
Here is the reference image, please note the stem of the lamp:
This is the flipped image, the stem of the lamp has lost definition:
You also need to swap Y0 and Y1 in your loop.
int x = 0;
for (int n = m_Width * m_Height * 2 - 1; n >= 3; n -= 4)
{
    flipped[x] = buffer[n - 1]; // Y1->Y0
    flipped[x + 1] = buffer[n - 2]; // U
    flipped[x + 2] = buffer[n - 3]; // Y0->Y1
    flipped[x + 3] = buffer[n]; // V
    x += 4;
}
While I was at it, since you're accessing n - 3 I changed the loop condition to be absolutely sure it was safe.
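For reference, the same idea as a self-contained function that can be checked on a tiny buffer. This is a sketch that assumes the image width is even, so every row consists of whole Y0 U Y1 V groups; flipBoth is a name made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Flip a YUYV 4:2:2 buffer both horizontally and vertically (a 180 degree turn).
// Each 4-byte group Y0 U Y1 V covers two pixels that share one U and one V, so
// reversing the group order must also swap the two luma samples inside a group.
std::vector<std::uint8_t> flipBoth(const std::vector<std::uint8_t>& src,
                                   int width, int height) {
    assert(width % 2 == 0); // assumption: rows hold whole Y0 U Y1 V groups
    const std::size_t size = static_cast<std::size_t>(width) * height * 2;
    assert(src.size() == size);
    std::vector<std::uint8_t> dst(size);
    std::size_t in = size; // read position, walks backwards one group at a time
    for (std::size_t out = 0; out < size; out += 4) {
        in -= 4;
        dst[out]     = src[in + 2]; // Y1 -> Y0
        dst[out + 1] = src[in + 1]; // U stays with the pair
        dst[out + 2] = src[in];     // Y0 -> Y1
        dst[out + 3] = src[in + 3]; // V stays with the pair
    }
    return dst;
}
```

Applying the function twice gives back the original buffer, which makes a convenient sanity check.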
m_Width * m_Height * 2 is not a multiple of 4 (the number of data blocks in the YUYV format). Try changing '2' into '4', and also array_width.

Meanshift algorithm for tracking objects issue computing centroid update of search window

I have been trying to implement the meanshift algorithm for tracking objects, and have gone through the concepts involved.
So far I have managed to generate a backprojected stream from my camera using a single-channel hue ROI histogram and a single-channel hue video stream, and it looks fine. I know there is a meanshift function in the OpenCV library, but I am trying to implement one myself using the data structures provided by OpenCV, calculating the moments and computing the mean centroid of the search window.
For some reason I am unable to locate the problem in my code: it keeps converging to the upper left corner of my video stream for any input ROI (region of interest) to be tracked. Below is the function that calculates the centroid of the search window, which is where I feel the problem lies, but I am not sure what it is. I would really appreciate it if someone could point me in the right direction:
void moment(Mat &backproj, Rect &win){
int x_c, y_c, x_c_new, y_c_new;
int idx_row, idx_col;
double m00 = 0.0 , m01 = 0.0 , m10 = 0.0 ;
double res = 1.0, TOL = 0.003 ;
//Set the center of search window as the center of the probabilistic image:
y_c = (int) backproj.rows / 2 ;
x_c = (int) backproj.cols / 2 ;
//Centroid search solver until residual below certain tolerance:
while (res > TOL){
win.width = (int) 80;
win.height = (int) 60;
//First array element at position (x,y) "lower left corner" of the search window:
win.x = (int) (x_c - win.width / 2) ;
win.y = (int) (y_c - win.height / 2);
//Modulo correction since modulo of negative integer is negative in C:
if (win.x < 0)
win.x = win.x % backproj.cols + backproj.cols ;
if (win.y < 0)
win.y = win.y % backproj.rows + backproj.rows ;
for (int i = 0; i < win.height; i++ ){
//Traverse along y-axis (height) i.e. rows ensuring wrap around top/bottom boundaries:
idx_row = (win.y + i) % (int)backproj.rows ;
for (int j = 0; j < win.width; j++ ){
//Traverse along x-axis (width) i.e. cols ensuring wrap around left/right boundaries:
idx_col = (win.x + j) % (int)backproj.cols ;
//Compute Moments:
m00 += (double) backproj.at<uchar>(idx_row, idx_col) ;
m10 += (double) backproj.at<uchar>(idx_row, idx_col) * i ;
m01 += (double) backproj.at<uchar>(idx_row, idx_col) * j ;
}
}
//Compute new centroid coordinates of the search window:
x_c_new = (int) ( m10 / m00 ) ;
y_c_new = (int) ( m01 / m00 );
//Compute the residual:
res = sqrt( pow((x_c_new - x_c), 2.0) + pow((y_c_new - y_c), 2.0) ) ;
//Set new search window centroid coordinates:
x_c = x_c_new;
y_c = y_c_new;
}
}
It's my second ever query on stackoverflow so please excuse me for any guidelines that I forgot to follow.
I changed m00, m01 and m10 to block-level variables inside the while loop instead of function-level variables (thanks to Daniel Strul for pointing it out), but the problem still remains. Now the search window jumps around the frame boundaries instead of focusing on the ROI.
void moment(Mat &backproj, Rect &win){
int x_c, y_c, x_c_new, y_c_new;
int idx_row, idx_col;
double m00 , m01 , m10 ;
double res = 1.0, TOL = 0.003 ;
//Set the center of search window as the center of the probabilistic image:
y_c = (int) backproj.rows / 2 ;
x_c = (int) backproj.cols / 2 ;
//Centroid search solver until residual below certain tolerance:
while (res > TOL){
m00 = 0.0 , m01 = 0.0 , m10 = 0.0 ;
win.width = (int) 80;
win.height = (int) 60;
//First array element at position (x,y) "lower left corner" of the search window:
win.x = (int) (x_c - win.width / 2) ;
win.y = (int) (y_c - win.height / 2);
//Modulo correction since modulo of negative integer is negative in C:
if (win.x < 0)
win.x = win.x % backproj.cols + backproj.cols ;
if (win.y < 0)
win.y = win.y % backproj.rows + backproj.rows ;
for (int i = 0; i < win.height; i++ ){
//Traverse along y-axis (height) i.e. rows ensuring wrap around top/bottom boundaries:
idx_row = (win.y + i) % (int)backproj.rows ;
for (int j = 0; j < win.width; j++ ){
//Traverse along x-axis (width) i.e. cols ensuring wrap around left/right boundaries:
idx_col = (win.x + j) % (int)backproj.cols ;
//Compute Moments:
m00 += (double) backproj.at<uchar>(idx_row, idx_col) ;
m10 += (double) backproj.at<uchar>(idx_row, idx_col) * i ;
m01 += (double) backproj.at<uchar>(idx_row, idx_col) * j ;
}
}
//Compute new centroid coordinates of the search window:
x_c_new = (int) ( m10 / m00 ) ;
y_c_new = (int) ( m01 / m00 );
//Compute the residual:
res = sqrt( pow((x_c_new - x_c), 2.0) + pow((y_c_new - y_c), 2.0) ) ;
//Set new search window centroid coordinates:
x_c = x_c_new;
y_c = y_c_new;
}
}
The reason your algorithm always converges to the upper left corner, independently of the input data, is that m00, m10 and m01 are never reset to zero:
- On iteration 0, for each moment variable m00, m10 and m01, you compute the right value m0.
- Between iteration 0 and iteration 1, the moment variables are not reset and keep their previous value.
- Thus, on iteration 1, for each moment variable m00, m10 and m01, you actually add the new moment to the old one and obtain ( m0 + m1 ).
- On iteration 2, you carry on summing the new moments on top of the previous ones and obtain ( m0 + m1 + m2 ).
- And so on, iteration by iteration.
At the very least, the moment variables should be reset at the beginning of each iteration.
Ideally, they should not be function-level variables but block-level variables, as they have no use outside the loop iterations (except for debugging purposes):
while (res > TOL){
double m00 = 0.0, m01 = 0.0, m10 = 0.0;
for (int i = 0; i < win.height; i++ ){
The reason for the second problem you encounter (the ROI jumping all around the place) is that the computations of the moments are based on the relative coordinates i and j.
Thus, what you compute is [ avg(j) , avg(i) ], whereas what you really want is [ avg(y) , avg(x) ]. To solve this issue, I had proposed a first solution. I've replaced it with a much simpler solution below.
The simplest solution is to add the coordinates of the ROI corner right at the end of each iteration:
x_c_new = win.x + (int) ( m10 / m00 ) ;
y_c_new = win.y + (int) ( m01 / m00 );
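Both fixes together can be sketched without OpenCV on a plain row-major 8-bit buffer. Window, Centroid and windowCentroid are names made up for this example. The sketch skips pixels outside the image instead of wrapping around as the question does, and it accumulates m10 with the column offset (x) and m01 with the row offset (y), which is the usual convention for image moments.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Window { int x, y, width, height; };  // stand-in for cv::Rect
struct Centroid { int x, y; bool valid; };   // valid == false: window has no mass

// One centroid update on a row-major single-channel image.
Centroid windowCentroid(const std::vector<std::uint8_t>& img, int cols, int rows,
                        const Window& win) {
    assert(img.size() == static_cast<std::size_t>(cols) * rows);
    double m00 = 0.0, m10 = 0.0, m01 = 0.0; // local, so reset on every call
    for (int i = 0; i < win.height; ++i) {
        const int row = win.y + i;
        if (row < 0 || row >= rows) continue; // assumption: clip, do not wrap
        for (int j = 0; j < win.width; ++j) {
            const int col = win.x + j;
            if (col < 0 || col >= cols) continue;
            const double v = img[static_cast<std::size_t>(row) * cols + col];
            m00 += v;
            m10 += v * j; // x moment, relative to the window corner
            m01 += v * i; // y moment, relative to the window corner
        }
    }
    if (m00 == 0.0) { // avoid dividing by zero on an empty window
        return {win.x + win.width / 2, win.y + win.height / 2, false};
    }
    // add the window corner to turn relative offsets into image coordinates
    return {win.x + static_cast<int>(m10 / m00),
            win.y + static_cast<int>(m01 / m00), true};
}
```

The caller would move the window so that it is centered on the returned point and repeat until the centroid stops moving.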

efficient way of accessing opencv Mat elements

I'm trying to play around with some OpenCV and thought up an interesting little scenario to work on.
Basically, I want to take a pixel, add the colour values of its 3 neighbouring pixels (so (x, y), (x+1, y), (x, y+1) and (x+1, y+1) in total) and divide the result by 4 to get an average colour value. The next set of pixels I process is then (x+2, y+2) with its 3 neighbours.
I then also want to be able to do a similar thing, but with 9 pixels (with the chosen co-ordinate to work from being the centre).
Initially I started with a Gaussian-blur-type mask, but that's not the result I want to achieve: from those calculations I just want to get 1 pixel value, so the output image will be 1/4 or 1/9 of the size. For now I've got it working by literally writing out the calculation in a for loop:
for (int i = 1; i < myImage.rows -1; i++)
{
    b = 0;
    for (int k = 1; k < myImage.cols -1; k++)
    {
        //9 pixel radius
        Result.at<Vec3b>(a, b)[1] = (myImage.at<Vec3b>(i-1, k-1)[1] + myImage.at<Vec3b>(i-1, k)[1] + myImage.at<Vec3b>(i+1, k)[1] + myImage.at<Vec3b>(i, k)[1] + myImage.at<Vec3b>(i, k-1)[1] + myImage.at<Vec3b>(i, k+1)[1] + myImage.at<Vec3b>(i + 1, k+1)[1] + myImage.at<Vec3b>(i-1, k + 1)[1] + myImage.at<Vec3b>(i + 1, k - 1)[1]) / 9;
        Result.at<Vec3b>(a, b)[2] = (myImage.at<Vec3b>(i-1, k-1)[2] + myImage.at<Vec3b>(i-1, k)[2] + myImage.at<Vec3b>(i+1, k)[2] + myImage.at<Vec3b>(i, k)[2] + myImage.at<Vec3b>(i, k-1)[2] + myImage.at<Vec3b>(i, k+1)[2] + myImage.at<Vec3b>(i + 1, k+1)[2] + myImage.at<Vec3b>(i-1, k + 1)[2] + myImage.at<Vec3b>(i + 1, k - 1)[2]) / 9;
        Result.at<Vec3b>(a, b)[0] = (myImage.at<Vec3b>(i-1, k-1)[0] + myImage.at<Vec3b>(i-1, k)[0] + myImage.at<Vec3b>(i+1, k)[0] + myImage.at<Vec3b>(i, k)[0] + myImage.at<Vec3b>(i, k-1)[0] + myImage.at<Vec3b>(i, k+1)[0] + myImage.at<Vec3b>(i + 1, k+1)[0] + myImage.at<Vec3b>(i-1, k + 1)[0] + myImage.at<Vec3b>(i + 1, k - 1)[0]) / 9;
        //4 pixel radius
        //Result.at<Vec3b>(a, b)[1] = (myImage.at<Vec3b>(i, k)[1] + myImage.at<Vec3b>(i + 1, k)[1] + myImage.at<Vec3b>(i, k + 1)[1] + myImage.at<Vec3b>(i, k - 1)[1] + myImage.at<Vec3b>(i - 1, k)[1]) / 5;
        //Result.at<Vec3b>(a, b)[2] = (myImage.at<Vec3b>(i, k)[2] + myImage.at<Vec3b>(i + 1, k)[2] + myImage.at<Vec3b>(i, k + 1)[2] + myImage.at<Vec3b>(i, k - 1)[2] + myImage.at<Vec3b>(i - 1, k)[2]) / 5;
        //Result.at<Vec3b>(a, b)[0] = (myImage.at<Vec3b>(i, k)[0] + myImage.at<Vec3b>(i + 1, k)[0] + myImage.at<Vec3b>(i, k + 1)[0] + myImage.at<Vec3b>(i, k - 1)[0] + myImage.at<Vec3b>(i - 1, k)[0]) / 5;
    }
}
Obviously, it's possible to set up the two options as different functions, but I'm wondering if there's a more efficient way of achieving this that would let the size of the mask be changed.
Thanks for any help!
I'm assuming that you want to do this without built-in functions (like resize, mean, or filter2D) and just want to address the image directly using at. There are further optimizations that can be made, but this is intended as a reasonable and understandable improvement on the original code.
Also, it should be noted that I ignore any extra rows/columns when the image size is not exactly divisible by the scale factor. You'll need to specify the expected behavior if you want something different.
The first thing I'd do is change what you think of as the target pixel. Assume you have a 3x3 neighborhood like so:
1 2 3
4 5 6
7 8 9
We're going to take the mean value of all of these pixels anyway, so whether we call pixel 5 the target or pixel 1 makes no difference to the resulting image. I'm going to call pixel 1 the target because it makes the math cleaner.
The 1 pixel will always be on coordinates divisible by the scaling factor. If the scaling factor is 2, the coordinates of 1 will always be even.
Second, rather than loop over the original image dimensions, which actually results in recalculating the same pixel in Result numerous times, I'm going to loop over the dimensions of Result and figure out which pixels in the original image contribute to each pixel in the result.
So to find the neighborhood in the original image that corresponds to pixel (x, y) in the result image, we just have to look for pixel 1 of that neighborhood. Since its coordinates are multiples of the scaling factor, it's just
(x * scaleFactor, y * scaleFactor)
Finally, we need to add two more nested loops to loop over the scaleFactor x scaleFactor window. This is the part that avoids having to type out those long calculations.
In the 3x3 example above, for example, pixel 9 in the neighborhood of (x, y) will be:
(x * scaleFactor + 2, y * scaleFactor + 2)
I also do the mean calculation directly in a vector rather than doing each channel individually. This means that our results will overflow a uchar, so I use Vec3i and cast it back to a Vec3b after the division. This is one place where you should consider using a built-in function mean to calculate the average over the window as it will remove the need for these new loops.
So, if our original image is myImage, we have:
int scaleFactor = 3;
Mat Result(myImage.rows/scaleFactor, myImage.cols/scaleFactor,
myImage.type(), Scalar::all(0));
for (int i = 0; i < Result.rows; i++)
{
    for (int k = 0; k < Result.cols; k++)
    {
        // make sum an int vector so it can hold
        // value = scaleFactor x scaleFactor x 255
        Vec3i areaSum = Vec3i(0,0,0);
        for (int m = 0; m < scaleFactor; m++)
        {
            for (int n = 0; n < scaleFactor; n++)
            {
                areaSum += myImage.at<Vec3b>(i*scaleFactor+m, k*scaleFactor+n);
            }
        }
        Result.at<Vec3b>(i,k) = Vec3b(areaSum/(scaleFactor*scaleFactor));
    }
}
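If you want to try the indexing without OpenCV, the same loop structure can be sketched on a plain interleaved 3-channel byte buffer. Image and downscale are names made up for this example and are not OpenCV types; as in the code above, leftover rows and columns that do not fill a whole window are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal stand-in for a 3-channel 8-bit image, row-major and interleaved.
struct Image {
    int rows, cols;
    std::vector<std::uint8_t> data; // size == rows * cols * 3
};

// Average every scale x scale block of src into one pixel of the result.
Image downscale(const Image& src, int scale) {
    assert(scale > 0);
    assert(src.data.size() == static_cast<std::size_t>(src.rows) * src.cols * 3);
    Image dst{src.rows / scale, src.cols / scale, {}};
    dst.data.assign(static_cast<std::size_t>(dst.rows) * dst.cols * 3, 0);
    for (int i = 0; i < dst.rows; ++i) {
        for (int k = 0; k < dst.cols; ++k) {
            int sum[3] = {0, 0, 0}; // int so it can hold scale * scale * 255
            for (int m = 0; m < scale; ++m) {
                for (int n = 0; n < scale; ++n) {
                    // "pixel 1" of the block is (i * scale, k * scale)
                    const std::size_t at =
                        (static_cast<std::size_t>(i * scale + m) * src.cols +
                         (k * scale + n)) * 3;
                    for (int c = 0; c < 3; ++c) sum[c] += src.data[at + c];
                }
            }
            const std::size_t out = (static_cast<std::size_t>(i) * dst.cols + k) * 3;
            for (int c = 0; c < 3; ++c) {
                dst.data[out + c] = static_cast<std::uint8_t>(sum[c] / (scale * scale));
            }
        }
    }
    return dst;
}
```

The division is an integer division, so averages are truncated rather than rounded.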
Here are a couple of samples...
scaleFactor = 2:
scaleFactor = 3:
scaleFactor = 5: