c++ YUYV 422 Horizontal and Vertical Flipping - c++

I have a uint8_t YUYV 422 (Interleaved) image array in memory and I want to be able to flip it both vertically and horizontally. I have successfully implemented a vertical flip but I'm having a problem with flipping both horizontally and vertically at the same time.
My code for the vertical flip, below, works perfectly.
int counter = 0;
int array_width = 2; // YUYV
for (int h = (m_Width * m_Height * array_width) - m_Width * array_width; h > 0; h -= m_Width * array_width)
{
for (int w = 0; w < m_Width * array_width; w++)
{
flipped[counter] = buffer[h + w];
counter++;
}
}
However, the following vertical and horizontal flip code appears to work but there is a loss of definition. To better understand what I am referring to, please see my sample images.
int x = 0;
for (int n = m_Width * m_Height * 2 - 1; n >= 0; n -= 4)
{
flipped[x] = buffer[n - 3]; // Y0
flipped[x + 1] = buffer[n - 2]; // U
flipped[x + 2] = buffer[n - 1]; // Y1
flipped[x + 3] = buffer[n]; // V
x += 4;
}
As you can see, I am moving the YUYV components and keeping them in the same order. I don't believe that I am dropping pixels so I don't understand why I am losing definition. To reiterate, I don't see this problem when flipping vertically (Using the first code snippet).
Here is the reference image, please note the stem of the lamp:
This is the flipped image, the stem of the lamp has lost definition:

You also need to swap Y0 and Y1 in your loop.
int x = 0;
for (int n = m_Width * m_Height * 2 - 1; n >= 3; n -= 4)
{
flipped[x] = buffer[n - 1]; // Y1->Y0
flipped[x + 1] = buffer[n - 2]; // U
flipped[x + 2] = buffer[n - 3]; // Y0->Y1
flipped[x + 3] = buffer[n]; // V
x += 4;
}
While I was at it, since you're accessing n - 3 I changed the loop condition to be absolutely sure it was safe.

m_Width * m_Height * 2 is not a multiple of 4 (the number of data blocks in YUYV format. Try changing '2' into '4', an also array_width.

Related

Image Rotation without cropping

Dears,
With the below code, I rotate my cv::Mat object (I'm not using any Cv's functions, apart from load/save/convertionColor.., as this is a academic project) and I receive a cropped Image
rotation function:
float rads = angle*3.1415926/180.0;
float _cos = cos(-rads);
float _sin = sin(-rads);
float xcenter = (float)(src.cols)/2.0;
float ycenter = (float)(src.rows)/2.0;
for(int i = 0; i < src.rows; i++)
for(int j = 0; j < src.cols; j++){
int x = ycenter + ((float)(i)-ycenter)*_cos - ((float)(j)-xcenter)*_sin;
int y = xcenter + ((float)(i)-ycenter)*_sin + ((float)(j)-xcenter)*_cos;
if (x >= 0 && x < src.rows && y >= 0 && y < src.cols) {
dst.at<cv::Vec4b>(i ,j) = src.at<cv::Vec4b>(x, y);
}
else {
dst.at<cv::Vec4b>(i ,j)[3] = 0;
}
}
I would like to know, How I can keep my Full image every time I want to rotate it.
Am I missing something in my function maybe?
thanks in advance
The rotated image usually has to be large than the old image to store all pixel values.
Each point (x,y) is translated to
(x', y') = (x*cos(rads) - y*sin(rads), x*sin(rads) + y*cos(rads))
An image with height h and width w, center at (0,0) and corners at
(h/2, w/2)
(h/2, -w/2)
(-h/2, w/2)
(-h/2, -w/2)
has a new height of
h' = 2*y' = 2 * (w/2*sin(rads) + h/2*cos(rads))
and a new width of
w' = 2*x' = 2 * (w/2*cos(rads) + h/2*sin(rads))
for 0 <= rads <= pi/4. It is x * y <= x' * y' and for rads != k*pi/2 with k = 1, 2, ... it is x * y < x' * y'
In any case the area of the rotated image is same size as or larger than the area of the old image.
If you use the old size, you cut off the corners.
Example:
Your image has h=1, w=1 and rads=pi/4. You need a new image with h'=sqrt(2)=1.41421356237 and w'=sqrt(2)=1.41421356237 to store all pixel values. The pixel from (1,1) is translated to (0, sqrt(2)).

Is there a chance to make the bilinear interpolation faster?

First I want to provide you with some context.
I have two kind of images I need to merge. The first image is the background image with the format 8BppGrey and a resolution of 320x240. The second image is the forground image with the format 32BppRGBA and a resolution of 64x48.
Update
The github repo with an MVP is at the bottom of the question.
To do it I resize the second image with bilinear interpolation to the same size as the first one and then use blending to merge both to one image. Blending only happens when the alpha value of the second image is greater then 0.
I need to do it as fast as possible so my idea was to combine the resize and merge / blend process.
To achieve this I used the resize function from the writeablebitmapex repository and added merging / blending.
Everything works as expected but I want to decrease the execution time.
This are the current debug timings:
// CPU: Intel(R) Core(TM) i7-4810MQ CPU # 2.80GHz
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 5 ms.
MediaServer: Execution time in c++ 4 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 3 ms
MediaServer: Resizing took 3 ms.
MediaServer: Execution time in c++ 4 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 6 ms
MediaServer: Resizing took 6 ms.
MediaServer: Execution time in c++ 3 ms
MediaServer: Resizing took 3 ms.
Do I have any chance to increase the performance and lower the execution time of the resize / merge / blend process?
Are there some parts I maybe can parallelize?
Do I maybe have a chance to use some processor features?
A huge performance hit is the nested loop but I have no idea how I could write it better.
I would like to reach 1 or 2 ms for the whole process. Is this even possible?
Here's the modified visual c++ function I use.
pd is the backbuffer of the writeable bitmap I use to display the
result in wpf. The format I use is the default 32BppRGBA.
pixels is the int[] array of the 64x48 32BppRGBA image
widthSource and heightSource is the size of the pixels image
width and height is the target size of the output image
baseImage is the int[] array of the 320x240 8BppGray image
VC++ code:
unsigned int Resize(int* pd, int* pixels, int widthSource, int heightSource, int width, int height, byte* baseImage)
{
unsigned int start = clock();
float xs = (float)widthSource / width;
float ys = (float)heightSource / height;
float fracx, fracy, ifracx, ifracy, sx, sy, l0, l1, rf, gf, bf;
int c, x0, x1, y0, y1;
byte c1a, c1r, c1g, c1b, c2a, c2r, c2g, c2b, c3a, c3r, c3g, c3b, c4a, c4r, c4g, c4b;
byte a, r, g, b;
// Bilinear
int srcIdx = 0;
for (int y = 0; y < height; y++)
{
for (int x = 0; x < width; x++)
{
sx = x * xs;
sy = y * ys;
x0 = (int)sx;
y0 = (int)sy;
// Calculate coordinates of the 4 interpolation points
fracx = sx - x0;
fracy = sy - y0;
ifracx = 1.0f - fracx;
ifracy = 1.0f - fracy;
x1 = x0 + 1;
if (x1 >= widthSource)
{
x1 = x0;
}
y1 = y0 + 1;
if (y1 >= heightSource)
{
y1 = y0;
}
// Read source color
c = pixels[y0 * widthSource + x0];
c1a = (byte)(c >> 24);
c1r = (byte)(c >> 16);
c1g = (byte)(c >> 8);
c1b = (byte)(c);
c = pixels[y0 * widthSource + x1];
c2a = (byte)(c >> 24);
c2r = (byte)(c >> 16);
c2g = (byte)(c >> 8);
c2b = (byte)(c);
c = pixels[y1 * widthSource + x0];
c3a = (byte)(c >> 24);
c3r = (byte)(c >> 16);
c3g = (byte)(c >> 8);
c3b = (byte)(c);
c = pixels[y1 * widthSource + x1];
c4a = (byte)(c >> 24);
c4r = (byte)(c >> 16);
c4g = (byte)(c >> 8);
c4b = (byte)(c);
// Calculate colors
// Alpha
l0 = ifracx * c1a + fracx * c2a;
l1 = ifracx * c3a + fracx * c4a;
a = (byte)(ifracy * l0 + fracy * l1);
// Write destination
if (a > 0)
{
// Red
l0 = ifracx * c1r + fracx * c2r;
l1 = ifracx * c3r + fracx * c4r;
rf = ifracy * l0 + fracy * l1;
// Green
l0 = ifracx * c1g + fracx * c2g;
l1 = ifracx * c3g + fracx * c4g;
gf = ifracy * l0 + fracy * l1;
// Blue
l0 = ifracx * c1b + fracx * c2b;
l1 = ifracx * c3b + fracx * c4b;
bf = ifracy * l0 + fracy * l1;
// Cast to byte
float alpha = a / 255.0f;
r = (byte)((rf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
g = (byte)((gf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
b = (byte)((bf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
pd[srcIdx++] = (255 << 24) | (r << 16) | (g << 8) | b;
}
else
{
// Alpha, Red, Green, Blue
pd[srcIdx++] = (255 << 24) | (baseImage[srcIdx] << 16) | (baseImage[srcIdx] << 8) | baseImage[srcIdx];
}
}
}
unsigned int end = clock() - start;
return end;
}
Github repo
One action that may speed up your code is to avoid type conversions from integer to float and vice versa. This can be achieved by having an int value in the suitable range instead of floats on range 0..1
Something like this:
for (int y = 0; y < height; y++)
{
for (int x = 0; x < width; x++)
{
int sx1 = x * widthSource ;
int x0 = sx1 / width;
int fracx = (sx1 % width) ; // range 0..width - 1
which turns into something like
l0 = (fracx * c2a + (width - fracx) * c1a) / width ;
And so on. A bit tricky but doable
Thank you for all the help but the problem was the managed c++ project. I transfered the function now to my native c++ library and used the managed c++ part only as a wrapper for the c# application.
After the compiler optimization the function is now finished in 1ms.
Edit:
I will mark my own answer for now as the solution because the optimization from #marom leads to a broken image.
The common way to speedup a resize operation with bilinear interpolation is to:
Exploit the fact that x0 and fracx are independent from the row and that y0and fracy are independent from the column. Even though you haven't pulled out the computation of y0 and fracy out of the x-loop, compiler optimization should take care of that. However, for x0 and fracx, one needs to pre-compute the values for all columns and store them in an array. Complexity for computing x0 and fracx becomes O(width) compared to O(width*height) without pre-computation.
Do the whole processing with integers by replacing floating point arithmetics by integer arithmetics, thereby using shift operations instead of integer divisions.
For better readability, I did not implement the pre-computation of x0 and fracx in the following code. Pre-computation is straight-forward anyways.
Note that FACTOR = 2048 is the max you can do with 32-bit signed integers here (2048 * 2048 * 255 is just fine). For higher precision, you should switch to int64_t and then increase FACTOR and SHIFT, respectively.
I placed the border check into the inner loop for better readability. For an optimized implementation one should remove it by iterating in both loops just before this case happens and add special handling for the border pixels.
In case someone is wondering what the + (FACTOR * FACTOR / 2) is for, it is for rounding in conjunction with the subsequent division.
Finally note that (FACTOR * FACTOR / 2) and 2 * SHIFT are evaluated at compile time.
#define FACTOR 2048
#define SHIFT 11
const int xs = (int) ((double) FACTOR * widthSource / width + 0.5);
const int ys = (int) ((double) FACTOR * heightSource / height + 0.5);
for (int y = 0; y < height; y++)
{
const int sy = y * ys;
const int y0 = sy >> SHIFT;
const int fracy = sy - (y0 << SHIFT);
for (int x = 0; x < width; x++)
{
const int sx = x * xs;
const int x0 = sx >> SHIFT;
const int fracx = sx - (x0 << SHIFT);
if (x0 >= widthSource - 1 || y0 >= heightSource - 1)
{
// insert special handling here
continue;
}
const int offset = y0 * widthSource + x0;
target[y * width + x] = (unsigned char)
((source[offset] * (FACTOR - fracx) * (FACTOR - fracy) +
source[offset + 1] * fracx * (FACTOR - fracy) +
source[offset + widthSource] * (FACTOR - fracx) * fracy +
source[offset + widthSource + 1] * fracx * fracy +
(FACTOR * FACTOR / 2)) >> (2 * SHIFT));
}
}
For clarification, to match the variables used by the OP, for instance, in the case of the alpha channel it is:
a = (unsigned char)
((c1a * (FACTOR - fracx) * (FACTOR - fracy) +
c2a * fracx * (FACTOR - fracy) +
c3a * (FACTOR - fracx) * fracy +
c4a * fracx * fracy +
(FACTOR * FACTOR / 2)) >> (2 * SHIFT));

Bilinear interpolation in 2D transformation Qt

I'm currently working on 2D transformations (translation, scaling, shearing and rotation) in Qt. I have a problem with bilinear interpolation, which I want to use to cover the 'black pixels' in output image. I'm using matrix calculations to get new coordinates of pixels of input image. Then I use reverse matrix calculation to check which pixel of input image responds to output pixel. Result of that is some float number which I use to interpolation. I check the four neighbour points and calculate the value (color) of output pixel. I have checked my calculations 'by hand' and they seem to be good.
Can anyone find any bug in that code? (I cut out the parts of code which are responsible for interface such as sliders).
Geometric::Geometric(QWidget* parent) : QWidget(parent) {
resize(1000, 800);
displayLogoDefault = true;
a = shx = shy = x0 = y0 = 0;
scx = scy = 1;
tx = ty = 0;
x = 200, y = 200;
paintT = paintSc = paintR = paintShx = paintShy = false;
img = new QImage(600,600,QImage::Format_RGB32);
img2 = new QImage("logo.jpeg");
}
Geometric::~Geometric() {
delete img;
delete img2;
img = NULL;
img2 = NULL;
}
void Geometric::makeChange() {
displayLogoDefault = false;
// iteration through whole input image
for(int i = 0; i < img2->width(); i++) {
for(int j = 0; j < img2->height(); j++) {
// calculate new coordinates basing on given 2D transformations values
//I calculated that formula eariler by multiplying/adding matrixes
x = cos(a)*scx*(i-x0) - sin(a)*scy*(j-y0) + shx*sin(a)*scx*(i-x0) + shx*cos(a)*scy*(j-y0);
y = shy*(x) + sin(a)*scx*(i-x0) + cos(a)*scy*(j-y0);
// tx and ty goes for translation. scx and scy for scaling
// shx and shy for shearing and a is angle for rotation
x += (x0 + tx);
y += (y0 + ty);
if(x >= 0 && y >= 0 && x < img->width() && y < img->height()) {
// reverse matrix calculation formula to find proper pixel from input image
float tmx = x - x0 - tx;
float tmy = y - y0 - ty;
float recX = 1/scx * ( cos(-a)*( (tmx + shx*shy*tmx - shx*tmx) ) + sin(-a)*( shy*tmx - tmy ) ) + x0 ;
float recY = 1/scy * ( sin(-a)*(tmx + shx*shy*tmx - shx*tmx) - cos(-a)*(shy*tmx-tmy) ) + y0;
// here the interpolation starts. I calculate the color basing on four points from input image
// that points are taken from the reverse matrix calculation
float a = recX - floorf(recX);
float b = recY - floorf (recY);
if(recX + 1 > img2->width()) recX -= 1;
if(recY + 1 > img2->height()) recY -= 1;
QColor c1 = QColor(img2->pixel(recX, recY));
QColor c2 = QColor(img2->pixel(recX + 1, recY));
QColor c3 = QColor(img2->pixel(recX , recY + 1));
QColor c4 = QColor(img2->pixel(recX + 1, recY + 1));
float colR = b * ((1.0 - a) * (float)c3.red() + a * (float)c4.red()) + (1.0 - b) * ((1.0 - a) * (float)c1.red() + a * (float)c2.red());
float colG = b * ((1.0 - a) * (float)c3.green() + a * (float)c4.green()) + (1.0 - b) * ((1.0 - a) * (float)c1.green() + a * (float)c2.green());
float colB = b * ((1.0 - a) * (float)c3.blue() + a * (float)c4.blue()) + (1.0 - b) * ((1.0 - a) * (float)c1.blue() + a * (float)c2.blue());
if(colR > 255) colR = 255; if(colG > 255) colG = 255; if(colB > 255) colB = 255;
if(colR < 0 ) colR = 0; if(colG < 0 ) colG = 0; if(colB < 0 ) colB = 0;
paintPixel(x, y, colR, colG, colB);
}
}
}
// x0 and y0 are the starting point of image
x0 = abs(x-tx);
y0 = abs(y-ty);
repaint();
}
// function painting a pixel. It works directly on memory
void Geometric::paintPixel(int i, int j, int r, int g, int b) {
unsigned char *ptr = img->bits();
ptr[4 * (img->width() * j + i)] = b;
ptr[4 * (img->width() * j + i) + 1] = g;
ptr[4 * (img->width() * j + i) + 2] = r;
}
void Geometric::paintEvent(QPaintEvent*) {
QPainter p(this);
p.drawImage(0, 0, *img);
if (displayLogoDefault == true) p.drawImage(0, 0, *img2);
}

efficient way of accessing opencv Mat elements

I'm trying to play around with some OpenCV and thought up an interesting little scenario to work on.
Basically, I want to take a pixel, add the colour values from the 3 neighbouring pixels (so (x, y), (x+1, y) (x, y+1) and (x+1, y+1)) and divide the result by 4 to get an average colour value. Then the next set of pixels I process is (x+2, y+2) with it's 3 neighbours.
I then also want to be able to do a similar thing, but with 9 pixels (with the chosen co-ordinate to work from being the centre).
Initially I started with a gaussian blur type masking, but that's not the result I want to acheive. As from those calculations, I just want to get 1 pixel value. So the output image will be 1/4 or a 1/9 of the size. So for now I've got it working where I've literally written out the calculation in a for loop as:
for (int i = 1; i < myImage.rows -1; i++)
{
b = 0;
for (int k = 1; k < myImage.cols -1; k++)
{
//9 pixel radius
Result.at<Vec3b>(a, b)[1] = (myImage.at<Vec3b>(i-1, k-1)[1]+myImage.at<Vec3b>(i-1, k)[1]+myImage.at<Vec3b>(i+1, k)[1] + myImage.at<Vec3b>(i, k)[1]+myImage.at<Vec3b>(i, k-1)[1]+myImage.at<Vec3b>(i, k+1)[1] + myImage.at<Vec3b>(i + 1, k+1)[1] + myImage.at<Vec3b>(i-1, k + 1)[1] + myImage.at<Vec3b>(i + 1, k - 1)[1]) / 9;
Result.at<Vec3b>(a, b)[2] = (myImage.at<Vec3b>(i-1, k-1)[2]+myImage.at<Vec3b>(i-1, k)[2]+myImage.at<Vec3b>(i+1, k)[2] + myImage.at<Vec3b>(i, k)[2]+myImage.at<Vec3b>(i, k-1)[2]+myImage.at<Vec3b>(i, k+1)[2] + myImage.at<Vec3b>(i + 1, k+1)[2] + myImage.at<Vec3b>(i-1, k + 1)[2] + myImage.at<Vec3b>(i + 1, k - 1)[2]) / 9;
Result.at<Vec3b>(a, b)[0] = (myImage.at<Vec3b>(i-1, k-1)[0]+myImage.at<Vec3b>(i-1, k)[0]+myImage.at<Vec3b>(i+1, k)[0] + myImage.at<Vec3b>(i, k)[0]+myImage.at<Vec3b>(i, k-1)[0]+myImage.at<Vec3b>(i, k+1)[0] + myImage.at<Vec3b>(i + 1, k+1)[0] + myImage.at<Vec3b>(i-1, k + 1)[0] + myImage.at<Vec3b>(i + 1, k - 1)[0]) / 9;
//4 pixel radius
// Result.at<Vec3b>(a, b)[1] = (myImage.at<Vec3b>(i, k)[1] + myImage.at<Vec3b>(i + 1, k)[1] + myImage.at<Vec3b>(i, k + 1)[1] + myImage.at<Vec3b>(i, k - 1)[1] + myImage.at<Vec3b>(i - 1, k)[1]) / 5;
// Result.at<Vec3b>(a, b)[2] = (myImage.at<Vec3b>(i, k)[2] + myImage.at<Vec3b>(i + 1, k)[2] + myImage.at<Vec3b>(i, k + 1)[2] + myImage.at<Vec3b>(i, k - 1)[2] + myImage.at<Vec3b>(i - 1, k)[2]) / 5;
// Result.at<Vec3b>(a, b)[0] = (myImage.at<Vec3b>(i, k)[0] + myImage.at<Vec3b>(i + 1, k)[0] + myImage.at<Vec3b>(i, k + 1)[0] + myImage.at<Vec3b>(i, k - 1)[0] + myImage.at<Vec3b>(i - 1, k)[0]) / 5;
b++;
}
a++;
}
Obviously, it's possible to setup the two options as different function that is called, but I'm just wondering if there's a more efficient way of achieveing this, that would let the size of the mask be changed.
Thanks for any help!
I'm assuming that you want to do this all without built-in functions (like resize, mean, or filter2d) and just want to directly address the image using at. There are further optimizations that can be made, but this is intended as a reasonable and understandable improvement on the original code.
Also, it should be noted that I ignore any extra rows/columns when the image size is not exactly divisible by the scale factor. You'll need to specify the expected behavior if you want something different.
The first thing I'd do is change what you think of as the target pixel. Assume you have a 3x3 neighborhood like so:
1 2 3
4 5 6
7 8 9
We're going to take the mean value of all of these pixels anyway, so whether we call pixel 5 the target or pixel 1 makes no difference to the resulting image. I'm going to call pixel 1 the target because it makes the math cleaner.
The 1 pixel will always be on coordinates divisible by the scaling factor. If the scaling factor is 2, the coordinates of 1 will always be even.
Second, rather than loop over the original image dimensions, which actually results in recalculating the same pixel in Result numerous times, I'm going to loop over the dimensions of Result and figure out which pixels in the original image contribute to each pixel in the result.
So to find neighborhood in the original image that corresponds to pixel (x, y) in the result image, we just have to look for pixel 1 of that neighborhood. Since it's a multiple of the scaling factor, it's just
(x * scaleFactor, y * scaleFactor)
Finally, we need to add two more nested loops to loop over the scaleFactor x scaleFactor window. This is the part the avoids having to type out those long calculations.
In the 3x3 example above, for example, pixel 9 in the neighborhood of (x, y) will be:
(x * scaleFactor + 2, y * scaleFactor + 2)
I also do the mean calculation directly in a vector rather than doing each channel individually. This means that our results will overflow a uchar, so I use Vec3i and cast it back to a Vec3b after the division. This is one place where you should consider using a built-in function mean to calculate the average over the window as it will remove the need for these new loops.
So, if our original image is myImage, we have:
int scaleFactor = 3;
Mat Result(myImage.rows/scaleFactor, myImage.rows/scaleFactor,
myImage.type(), Scalar::all(0));
for (int i = 0; i < Result.rows; i++)
{
for (int k = 0; k < Result.cols; k++)
{
// make sum an int vector so it can hold
// value = scaleFactor x scaleFactor x 255
Vec3i areaSum = Vec3i(0,0,0);
for (int m = 0; m < scaleFactor; m++)
{
for (int n = 0; n < scaleFactor; n++)
{
areaSum += myImage.at<Vec3b>(i*scaleFactor+m, k*scaleFactor+n);
}
}
Result.at<Vec3b>(i,k) = Vec3b(areaSum/(scaleFactor*scaleFactor));
}
}
Here are a couple of samples...
Original:
scaleFactor = 2:
scaleFactor = 3:
scaleFactor = 5:

SSE optimization of Gaussian blur

I'm working on a school project , I have to optimize part of code in SSE, but I'm stuck on one part for few days now.
I dont see any smart way of using vector SSE instructions(inline assembler / instric f) in this code(its a part of guassian blur algorithm). I would be glad if somebody could give me just a small hint
for (int x = x_start; x < x_end; ++x) // vertical blur...
{
float sum = image[x + (y_start - radius - 1)*image_w];
float dif = -sum;
for (int y = y_start - 2*radius - 1; y < y_end; ++y)
{ // inner vertical Radius loop
float p = (float)image[x + (y + radius)*image_w]; // next pixel
buffer[y + radius] = p; // buffer pixel
sum += dif + fRadius*p;
dif += p; // accumulate pixel blur
if (y >= y_start)
{
float s = 0, w = 0; // border blur correction
sum -= buffer[y - radius - 1]*fRadius; // addition for fraction blur
dif += buffer[y - radius] - 2*buffer[y]; // sum up differences: +1, -2, +1
// cut off accumulated blur area of pixel beyond the border
// assume: added pixel values beyond border = value at border
p = (float)(radius - y); // top part to cut off
if (p > 0)
{
p = p*(p-1)/2 + fRadius*p;
s += buffer[0]*p;
w += p;
}
p = (float)(y + radius - image_h + 1); // bottom part to cut off
if (p > 0)
{
p = p*(p-1)/2 + fRadius*p;
s += buffer[image_h - 1]*p;
w += p;
}
new_image[x + y*image_w] = (unsigned char)((sum - s)/(weight - w)); // set blurred pixel
}
else if (y + radius >= y_start)
{
dif -= 2*buffer[y];
}
} // for y
} // for x
One more feature you can use is logical operations and masks:
for example instead of:
// process only 1
if (p > 0)
p = p*(p-1)/2 + fRadius*p;
you can write
// processes 4 floats
const __m128 &mask = _mm_cmplt_ps(p,0);
const __m128 &notMask = _mm_cmplt_ps(0,p);
const __m128 &p_tmp = ( p*(p-1)/2 + fRadius*p );
p = _mm_add_ps(_mm_and_ps(p_tmp, mask), _mm_and_ps(p, notMask)); // = p_tmp & mask + p & !mask
Also I can recommend you to use a special libraries, which overloads instructions. For example: http://code.compeng.uni-frankfurt.de/projects/vc
dif variable makes iterations of inner loop dependent. You should try to parallelize the outer loop. But with out instructions overloading the code will become unmanageable then.
Also consider rethinking the whole algorithm. Current one doesn't look paralell. May be you can neglect precision, or increase scalar time a bit?