Is there a chance to make the bilinear interpolation faster? - c++

First I want to provide you with some context.
I have two kinds of images I need to merge. The first image is the background image with the format 8BppGrey and a resolution of 320x240. The second image is the foreground image with the format 32BppRGBA and a resolution of 64x48.
Update
The github repo with an MVP is at the bottom of the question.
To do it I resize the second image with bilinear interpolation to the same size as the first one and then use blending to merge both into one image. Blending only happens when the alpha value of the second image is greater than 0.
I need to do it as fast as possible so my idea was to combine the resize and merge / blend process.
To achieve this I used the resize function from the writeablebitmapex repository and added merging / blending.
Everything works as expected but I want to decrease the execution time.
These are the current debug timings:
// CPU: Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 5 ms.
MediaServer: Execution time in c++ 4 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 3 ms
MediaServer: Resizing took 3 ms.
MediaServer: Execution time in c++ 4 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 6 ms
MediaServer: Resizing took 6 ms.
MediaServer: Execution time in c++ 3 ms
MediaServer: Resizing took 3 ms.
Do I have any chance to increase the performance and lower the execution time of the resize / merge / blend process?
Are there some parts I maybe can parallelize?
Do I maybe have a chance to use some processor features?
A huge performance hit is the nested loop but I have no idea how I could write it better.
I would like to reach 1 or 2 ms for the whole process. Is this even possible?
Here's the modified visual c++ function I use.
pd is the backbuffer of the writeable bitmap I use to display the
result in wpf. The format I use is the default 32BppRGBA.
pixels is the int[] array of the 64x48 32BppRGBA image
widthSource and heightSource are the size of the pixels image
width and height are the target size of the output image
baseImage is the int[] array of the 320x240 8BppGray image
VC++ code:
unsigned int Resize(int* pd, int* pixels, int widthSource, int heightSource, int width, int height, byte* baseImage)
{
unsigned int start = clock();
float xs = (float)widthSource / width;
float ys = (float)heightSource / height;
float fracx, fracy, ifracx, ifracy, sx, sy, l0, l1, rf, gf, bf;
int c, x0, x1, y0, y1;
byte c1a, c1r, c1g, c1b, c2a, c2r, c2g, c2b, c3a, c3r, c3g, c3b, c4a, c4r, c4g, c4b;
byte a, r, g, b;
// Bilinear
int srcIdx = 0;
for (int y = 0; y < height; y++)
{
for (int x = 0; x < width; x++)
{
sx = x * xs;
sy = y * ys;
x0 = (int)sx;
y0 = (int)sy;
// Calculate coordinates of the 4 interpolation points
fracx = sx - x0;
fracy = sy - y0;
ifracx = 1.0f - fracx;
ifracy = 1.0f - fracy;
x1 = x0 + 1;
if (x1 >= widthSource)
{
x1 = x0;
}
y1 = y0 + 1;
if (y1 >= heightSource)
{
y1 = y0;
}
// Read source color
c = pixels[y0 * widthSource + x0];
c1a = (byte)(c >> 24);
c1r = (byte)(c >> 16);
c1g = (byte)(c >> 8);
c1b = (byte)(c);
c = pixels[y0 * widthSource + x1];
c2a = (byte)(c >> 24);
c2r = (byte)(c >> 16);
c2g = (byte)(c >> 8);
c2b = (byte)(c);
c = pixels[y1 * widthSource + x0];
c3a = (byte)(c >> 24);
c3r = (byte)(c >> 16);
c3g = (byte)(c >> 8);
c3b = (byte)(c);
c = pixels[y1 * widthSource + x1];
c4a = (byte)(c >> 24);
c4r = (byte)(c >> 16);
c4g = (byte)(c >> 8);
c4b = (byte)(c);
// Calculate colors
// Alpha
l0 = ifracx * c1a + fracx * c2a;
l1 = ifracx * c3a + fracx * c4a;
a = (byte)(ifracy * l0 + fracy * l1);
// Write destination
if (a > 0)
{
// Red
l0 = ifracx * c1r + fracx * c2r;
l1 = ifracx * c3r + fracx * c4r;
rf = ifracy * l0 + fracy * l1;
// Green
l0 = ifracx * c1g + fracx * c2g;
l1 = ifracx * c3g + fracx * c4g;
gf = ifracy * l0 + fracy * l1;
// Blue
l0 = ifracx * c1b + fracx * c2b;
l1 = ifracx * c3b + fracx * c4b;
bf = ifracy * l0 + fracy * l1;
// Cast to byte
float alpha = a / 255.0f;
r = (byte)((rf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
g = (byte)((gf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
b = (byte)((bf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
pd[srcIdx++] = (255 << 24) | (r << 16) | (g << 8) | b;
}
else
{
// Alpha, Red, Green, Blue
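// Read the gray value first: mixing srcIdx and srcIdx++ in one expression is unsequenced before C++17
const byte gray = baseImage[srcIdx];
pd[srcIdx++] = (255 << 24) | (gray << 16) | (gray << 8) | gray;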
}
}
}
unsigned int end = clock() - start;
return end;
}
Github repo

One thing that may speed up your code is avoiding type conversions from integer to float and vice versa. This can be achieved by using int values in a suitable range instead of floats in the range 0..1.
Something like this:
for (int y = 0; y < height; y++)
{
for (int x = 0; x < width; x++)
{
int sx1 = x * widthSource ;
int x0 = sx1 / width;
int fracx = (sx1 % width) ; // range 0..width - 1
which turns into something like
l0 = (fracx * c2a + (width - fracx) * c1a) / width ;
And so on. A bit tricky but doable
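The idea above can be sketched for a single axis as follows. This is only a minimal illustration, not a drop-in replacement for the function in the question: lerpRatio, mapColumn and SourcePos are hypothetical names, and the truncating integer division rounds down, so results can differ by one level from the float version.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: source column and remainder weight for one destination column.
struct SourcePos { int index; int frac; };

// Map destination column x to the source: index is the left neighbour,
// frac is the weight of the right neighbour in the range 0..width - 1.
inline SourcePos mapColumn(int x, int widthSource, int width)
{
    assert(width > 0 && widthSource > 0 && x >= 0);
    const int scaled = x * widthSource;
    return { scaled / width, scaled % width };
}

// Interpolate between two 8-bit samples with an integer weight num/den.
// c1 gets weight (den - num), c2 gets weight num.
inline std::uint8_t lerpRatio(std::uint8_t c1, std::uint8_t c2, int num, int den)
{
    assert(den > 0 && num >= 0 && num < den);
    return static_cast<std::uint8_t>((c2 * num + c1 * (den - num)) / den);
}
```

For the 64 to 320 upscale in the question, destination column 7 maps to source column 1 with weight 128/320 for its right neighbour.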

Thank you for all the help, but the problem was the managed C++ project. I have now moved the function to my native C++ library and use the managed C++ part only as a wrapper for the C# application.
After the compiler optimization the function now finishes in 1 ms.
Edit:
I will mark my own answer as the solution for now because the optimization from @marom leads to a broken image.

The common way to speedup a resize operation with bilinear interpolation is to:
Exploit the fact that x0 and fracx are independent of the row and that y0 and fracy are independent of the column. Even though you haven't pulled the computation of y0 and fracy out of the x-loop, compiler optimization should take care of that. However, for x0 and fracx, one needs to pre-compute the values for all columns and store them in an array. The complexity of computing x0 and fracx then becomes O(width), compared to O(width*height) without pre-computation.
Do the whole processing with integers by replacing floating point arithmetics by integer arithmetics, thereby using shift operations instead of integer divisions.
For better readability, I did not implement the pre-computation of x0 and fracx in the following code. Pre-computation is straight-forward anyways.
Note that FACTOR = 2048 is the max you can do with 32-bit signed integers here (2048 * 2048 * 255 is just fine). For higher precision, you should switch to int64_t and then increase FACTOR and SHIFT, respectively.
I placed the border check into the inner loop for better readability. For an optimized implementation one should remove it by iterating in both loops just before this case happens and add special handling for the border pixels.
In case someone is wondering what the + (FACTOR * FACTOR / 2) is for, it is for rounding in conjunction with the subsequent division.
Finally note that (FACTOR * FACTOR / 2) and 2 * SHIFT are evaluated at compile time.
#define FACTOR 2048
#define SHIFT 11
const int xs = (int) ((double) FACTOR * widthSource / width + 0.5);
const int ys = (int) ((double) FACTOR * heightSource / height + 0.5);
for (int y = 0; y < height; y++)
{
const int sy = y * ys;
const int y0 = sy >> SHIFT;
const int fracy = sy - (y0 << SHIFT);
for (int x = 0; x < width; x++)
{
const int sx = x * xs;
const int x0 = sx >> SHIFT;
const int fracx = sx - (x0 << SHIFT);
if (x0 >= widthSource - 1 || y0 >= heightSource - 1)
{
// insert special handling here
continue;
}
const int offset = y0 * widthSource + x0;
target[y * width + x] = (unsigned char)
((source[offset] * (FACTOR - fracx) * (FACTOR - fracy) +
source[offset + 1] * fracx * (FACTOR - fracy) +
source[offset + widthSource] * (FACTOR - fracx) * fracy +
source[offset + widthSource + 1] * fracx * fracy +
(FACTOR * FACTOR / 2)) >> (2 * SHIFT));
}
}
For clarification, to match the variables used by the OP, for instance, in the case of the alpha channel it is:
a = (unsigned char)
((c1a * (FACTOR - fracx) * (FACTOR - fracy) +
c2a * fracx * (FACTOR - fracy) +
c3a * (FACTOR - fracx) * fracy +
c4a * fracx * fracy +
(FACTOR * FACTOR / 2)) >> (2 * SHIFT));
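The pre-computation of x0 and fracx that was left out above could look roughly like this for an 8-bit grayscale image. It is a sketch under assumptions: resizeGray, kShift and kFactor are hypothetical names, border pixels are handled by clamping the neighbour index instead of the special-case loop suggested above, and the images are assumed small enough that x * xs does not overflow a 32-bit int.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kShift = 11;
constexpr int kFactor = 1 << kShift; // 2048, same value as FACTOR above

// Hypothetical helper: fixed-point bilinear resize of an 8-bit grayscale image.
std::vector<std::uint8_t> resizeGray(const std::vector<std::uint8_t>& source,
                                     int widthSource, int heightSource,
                                     int width, int height)
{
    assert(widthSource > 0 && heightSource > 0 && width > 0 && height > 0);
    assert(source.size() == static_cast<std::size_t>(widthSource) * heightSource);

    const int xs = static_cast<int>(static_cast<double>(kFactor) * widthSource / width + 0.5);
    const int ys = static_cast<int>(static_cast<double>(kFactor) * heightSource / height + 0.5);

    // Per-column tables: computed once in O(width) and reused for every row.
    std::vector<int> x0s(width), x1s(width), fracxs(width);
    for (int x = 0; x < width; ++x) {
        const int sx = x * xs;
        x0s[x] = std::min(sx >> kShift, widthSource - 1);
        x1s[x] = std::min(x0s[x] + 1, widthSource - 1); // clamp at the right border
        fracxs[x] = sx & (kFactor - 1);                 // fractional part of sx
    }

    std::vector<std::uint8_t> target(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y) {
        const int sy = y * ys;
        const int y0 = std::min(sy >> kShift, heightSource - 1);
        const int y1 = std::min(y0 + 1, heightSource - 1); // clamp at the bottom border
        const int fracy = sy & (kFactor - 1);
        const std::uint8_t* row0 = source.data() + static_cast<std::size_t>(y0) * widthSource;
        const std::uint8_t* row1 = source.data() + static_cast<std::size_t>(y1) * widthSource;
        std::uint8_t* out = target.data() + static_cast<std::size_t>(y) * width;
        for (int x = 0; x < width; ++x) {
            const int fx = fracxs[x];
            // Weights sum to kFactor * kFactor; the last term rounds to nearest.
            const int sum = row0[x0s[x]] * (kFactor - fx) * (kFactor - fracy)
                          + row0[x1s[x]] * fx * (kFactor - fracy)
                          + row1[x0s[x]] * (kFactor - fx) * fracy
                          + row1[x1s[x]] * fx * fracy
                          + (kFactor * kFactor / 2);
            out[x] = static_cast<std::uint8_t>(sum >> (2 * kShift));
        }
    }
    return target;
}
```

Clamping keeps the inner loop free of branches at the cost of two extra table lookups; whether that beats a separate border pass would have to be measured.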

Related

Converting 1-d array to 2d

I am trying to understand this code:
void stencil(const int nx, const int ny, const int width, const int height,
double* image, double* tmp_image)
{
for (int j = 1; j < ny + 1; ++j) {
for (int i = 1; i < nx + 1; ++i) {
tmp_image[j + i * height] = image[j + i * height] * 3.0 / 5.0;
tmp_image[j + i * height] += image[j + (i - 1) * height] * 0.5 / 5.0;
tmp_image[j + i * height] += image[j + (i + 1) * height] * 0.5 / 5.0;
tmp_image[j + i * height] += image[j - 1 + i * height] * 0.5 / 5.0;
tmp_image[j + i * height] += image[j + 1 + i * height] * 0.5 / 5.0;
}
}
}
The 1-d array notation is very confusing. I am trying to convert it to a 2-d notation (which I find easier to read). Could someone point me in the right direction as to how I can accomplish this?
All this code is doing is creating a new image from an original image by taking 60% from the corresponding pixel and 10% from each neighboring pixel.
When you see tmp_image[j + i * height], read it as tmp_image[i][j].
Changing the code to literally use 2D syntax may require knowing at least one of the dimensions at compile time, whereas now it is a runtime argument. So that might be a non-starter, unless you're using C++ and want to write or use a matrix class instead of plain arrays.
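As a rough sketch of the matrix-class route in C++, a small non-owning view can give the flat buffer a 2D-style accessor. Grid2D and stencil2d are hypothetical names, and the sketch assumes width and height describe the full buffer including the one-pixel border that the loops skip. The loops are swapped so that j, the contiguous index, runs innermost, and the neighbours are summed in a different order than in the original, so the last floating-point bit may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical non-owning 2D view: grid(i, j) maps to data[j + i * height].
class Grid2D {
public:
    Grid2D(double* data, int width, int height)
        : data_(data), width_(width), height_(height)
    {
        assert(data != nullptr && width > 0 && height > 0);
    }

    double& operator()(int i, int j)
    {
        assert(i >= 0 && i < width_ && j >= 0 && j < height_);
        return data_[j + static_cast<std::size_t>(i) * height_];
    }

private:
    double* data_;
    int width_;
    int height_;
};

// Same stencil as above, written with 2D indexing.
void stencil2d(int nx, int ny, int width, int height,
               double* image, double* tmp_image)
{
    Grid2D in(image, width, height);
    Grid2D out(tmp_image, width, height);
    for (int i = 1; i < nx + 1; ++i) {
        for (int j = 1; j < ny + 1; ++j) {
            // 60% of the pixel itself plus 10% of each of its four neighbours.
            out(i, j) = in(i, j) * 3.0 / 5.0
                      + (in(i - 1, j) + in(i + 1, j) + in(i, j - 1) + in(i, j + 1)) * 0.5 / 5.0;
        }
    }
}
```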

C++ and Eigen: How do I handle this 1x1 matrix?

Consider this excerpt:
for(int i = 0; i < 600*100*100; i++) {
( 1 / 2 * (1 - a) / a * x.transpose() * y * (z + (1 - a) *
z.transpose() * y(i) / z.sum() ) * x.transpose() * z );
}
In the code above, x, y, z are objects of the class MatrixXd in Eigen and a is a double. Through these multiplications, the outcome is eventually a scalar. The entire for loop took less than a second.
However, if I change my code:
for(int i = 0; i < 600*100*100; i++) {
F(i) = F(i) + ( 1 / 2 * (1 - a) / a * x.transpose() * y * (z + (1 - a) *
z.transpose() * y(i) / z.sum() ) * x.transpose() * z );
}
The for loop then takes 6 seconds. F is an ArrayXd. I'm trying to update each element of F in a loop, and in each iteration I do a series of simple matrix multiplications (which result in a scalar).
I'm not sure what's wrong. How can I speed it up? I tried to use .noalias() but that didn't help. This could have to do with the fact that the outcome of the series of matrix multiplication results in a 1x1 MatrixXd and Eigen is having issues adding a MatrixXd to a number.
Update
Per @mars, I tried eval():
for(int i = 0; i < 600*100*100; i++) {
( 1 / 2 * (1 - a) / a * x.transpose() * y * (z + (1 - a) *
z.transpose() * y(i) / z.sum() ) * x.transpose() * z ).eval();
}
And it takes ~6 seconds as well. Does that mean there's no way to optimize?
Also, I used -O3 to compile.

Bilinear interpolation in 2D transformation Qt

I'm currently working on 2D transformations (translation, scaling, shearing and rotation) in Qt. I have a problem with bilinear interpolation, which I want to use to cover the 'black pixels' in the output image. I'm using matrix calculations to get the new coordinates of the pixels of the input image. Then I use the inverse matrix calculation to check which pixel of the input image corresponds to the output pixel. The result is a float number which I use for interpolation. I check the four neighbouring points and calculate the value (color) of the output pixel. I have checked my calculations 'by hand' and they seem to be good.
Can anyone find any bug in that code? (I cut out the parts of code which are responsible for interface such as sliders).
Geometric::Geometric(QWidget* parent) : QWidget(parent) {
resize(1000, 800);
displayLogoDefault = true;
a = shx = shy = x0 = y0 = 0;
scx = scy = 1;
tx = ty = 0;
x = 200, y = 200;
paintT = paintSc = paintR = paintShx = paintShy = false;
img = new QImage(600,600,QImage::Format_RGB32);
img2 = new QImage("logo.jpeg");
}
Geometric::~Geometric() {
delete img;
delete img2;
img = NULL;
img2 = NULL;
}
void Geometric::makeChange() {
displayLogoDefault = false;
// iteration through whole input image
for(int i = 0; i < img2->width(); i++) {
for(int j = 0; j < img2->height(); j++) {
// calculate new coordinates basing on given 2D transformations values
// I calculated that formula earlier by multiplying/adding matrices
x = cos(a)*scx*(i-x0) - sin(a)*scy*(j-y0) + shx*sin(a)*scx*(i-x0) + shx*cos(a)*scy*(j-y0);
y = shy*(x) + sin(a)*scx*(i-x0) + cos(a)*scy*(j-y0);
// tx and ty goes for translation. scx and scy for scaling
// shx and shy for shearing and a is angle for rotation
x += (x0 + tx);
y += (y0 + ty);
if(x >= 0 && y >= 0 && x < img->width() && y < img->height()) {
// reverse matrix calculation formula to find proper pixel from input image
float tmx = x - x0 - tx;
float tmy = y - y0 - ty;
float recX = 1/scx * ( cos(-a)*( (tmx + shx*shy*tmx - shx*tmx) ) + sin(-a)*( shy*tmx - tmy ) ) + x0 ;
float recY = 1/scy * ( sin(-a)*(tmx + shx*shy*tmx - shx*tmx) - cos(-a)*(shy*tmx-tmy) ) + y0;
// here the interpolation starts. I calculate the color basing on four points from input image
// that points are taken from the reverse matrix calculation
float a = recX - floorf(recX);
float b = recY - floorf (recY);
if(recX + 1 > img2->width()) recX -= 1;
if(recY + 1 > img2->height()) recY -= 1;
QColor c1 = QColor(img2->pixel(recX, recY));
QColor c2 = QColor(img2->pixel(recX + 1, recY));
QColor c3 = QColor(img2->pixel(recX , recY + 1));
QColor c4 = QColor(img2->pixel(recX + 1, recY + 1));
float colR = b * ((1.0 - a) * (float)c3.red() + a * (float)c4.red()) + (1.0 - b) * ((1.0 - a) * (float)c1.red() + a * (float)c2.red());
float colG = b * ((1.0 - a) * (float)c3.green() + a * (float)c4.green()) + (1.0 - b) * ((1.0 - a) * (float)c1.green() + a * (float)c2.green());
float colB = b * ((1.0 - a) * (float)c3.blue() + a * (float)c4.blue()) + (1.0 - b) * ((1.0 - a) * (float)c1.blue() + a * (float)c2.blue());
if(colR > 255) colR = 255; if(colG > 255) colG = 255; if(colB > 255) colB = 255;
if(colR < 0 ) colR = 0; if(colG < 0 ) colG = 0; if(colB < 0 ) colB = 0;
paintPixel(x, y, colR, colG, colB);
}
}
}
// x0 and y0 are the starting point of image
x0 = abs(x-tx);
y0 = abs(y-ty);
repaint();
}
// function painting a pixel. It works directly on memory
void Geometric::paintPixel(int i, int j, int r, int g, int b) {
unsigned char *ptr = img->bits();
ptr[4 * (img->width() * j + i)] = b;
ptr[4 * (img->width() * j + i) + 1] = g;
ptr[4 * (img->width() * j + i) + 2] = r;
}
void Geometric::paintEvent(QPaintEvent*) {
QPainter p(this);
p.drawImage(0, 0, *img);
if (displayLogoDefault == true) p.drawImage(0, 0, *img2);
}

c++ YUYV 422 Horizontal and Vertical Flipping

I have a uint8_t YUYV 422 (Interleaved) image array in memory and I want to be able to flip it both vertically and horizontally. I have successfully implemented a vertical flip but I'm having a problem with flipping both horizontally and vertically at the same time.
My code for the vertical flip, below, works perfectly.
int counter = 0;
int array_width = 2; // YUYV
for (int h = (m_Width * m_Height * array_width) - m_Width * array_width; h > 0; h -= m_Width * array_width)
{
for (int w = 0; w < m_Width * array_width; w++)
{
flipped[counter] = buffer[h + w];
counter++;
}
}
However, the following vertical and horizontal flip code appears to work but there is a loss of definition. To better understand what I am referring to, please see my sample images.
int x = 0;
for (int n = m_Width * m_Height * 2 - 1; n >= 0; n -= 4)
{
flipped[x] = buffer[n - 3]; // Y0
flipped[x + 1] = buffer[n - 2]; // U
flipped[x + 2] = buffer[n - 1]; // Y1
flipped[x + 3] = buffer[n]; // V
x += 4;
}
As you can see, I am moving the YUYV components and keeping them in the same order. I don't believe that I am dropping pixels so I don't understand why I am losing definition. To reiterate, I don't see this problem when flipping vertically (Using the first code snippet).
Here is the reference image, please note the stem of the lamp:
This is the flipped image, the stem of the lamp has lost definition:
You also need to swap Y0 and Y1 in your loop.
int x = 0;
for (int n = m_Width * m_Height * 2 - 1; n >= 3; n -= 4)
{
flipped[x] = buffer[n - 1]; // Y1->Y0
flipped[x + 1] = buffer[n - 2]; // U
flipped[x + 2] = buffer[n - 3]; // Y0->Y1
flipped[x + 3] = buffer[n]; // V
x += 4;
}
While I was at it, since you're accessing n - 3 I changed the loop condition to be absolutely sure it was safe.
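Wrapped into a self-contained function, the corrected loop might look like the sketch below. flipBothYUYV is a hypothetical name, and the sketch assumes a tightly packed frame with an even width and no row padding, so that every 4-byte group is one Y0 U Y1 V macropixel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: flip a packed YUYV 4:2:2 frame both horizontally and vertically.
std::vector<std::uint8_t> flipBothYUYV(const std::vector<std::uint8_t>& buffer,
                                       int width, int height)
{
    assert(width > 0 && height > 0 && width % 2 == 0);
    const std::size_t size = static_cast<std::size_t>(width) * height * 2;
    assert(buffer.size() == size);

    std::vector<std::uint8_t> flipped(size);
    std::size_t dst = 0;
    // Walk the source macropixels back to front, 4 bytes (2 pixels) at a time.
    for (std::size_t src = size; src >= 4; dst += 4) {
        src -= 4;
        flipped[dst]     = buffer[src + 2]; // Y1 -> Y0 (the two pixels swap places)
        flipped[dst + 1] = buffer[src + 1]; // U stays in the U slot
        flipped[dst + 2] = buffer[src];     // Y0 -> Y1
        flipped[dst + 3] = buffer[src + 3]; // V stays in the V slot
    }
    return flipped;
}
```

Reversing the macropixel order alone leaves each pair of pixels in its original left-to-right order, which is the loss of definition described in the question; swapping the two luma bytes restores full horizontal resolution.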
m_Width * m_Height * 2 is not a multiple of 4 (the number of data blocks in YUYV format). Try changing '2' into '4', and also array_width.

How to speed up bilinear interpolation of image?

I'm trying to rotate an image with interpolation, but it's too slow for real time on big images.
The code is something like:
for(int y=0;y<dst_h;++y)
{
for(int x=0;x<dst_w;++x)
{
//do inverse transform
fPoint pt(Transform(Point(x, y)));
//in coor of src
int x1= (int)floor(pt.x);
int y1= (int)floor(pt.y);
int x2= x1+1;
int y2= y1+1;
if((x1>=0&&x1<src_w&&y1>=0&&y1<src_h)&&(x2>=0&&x2<src_w&&y2>=0&&y2<src_h))
{
Mask[y][x]= 1; //show pixel
float dx1= pt.x-x1;
float dx2= 1-dx1;
float dy1= pt.y-y1;
float dy2= 1-dy1;
//bilinear
pd[x].blue= (dy2*(ps[y1*src_w+x1].blue*dx2+ps[y1*src_w+x2].blue*dx1)+
dy1*(ps[y2*src_w+x1].blue*dx2+ps[y2*src_w+x2].blue*dx1));
pd[x].green= (dy2*(ps[y1*src_w+x1].green*dx2+ps[y1*src_w+x2].green*dx1)+
dy1*(ps[y2*src_w+x1].green*dx2+ps[y2*src_w+x2].green*dx1));
pd[x].red= (dy2*(ps[y1*src_w+x1].red*dx2+ps[y1*src_w+x2].red*dx1)+
dy1*(ps[y2*src_w+x1].red*dx2+ps[y2*src_w+x2].red*dx1));
//nearest neighbour
//pd[x]= ps[((int)pt.y)*src_w+(int)pt.x];
}
else
Mask[y][x]= 0; //transparent pixel
}
pd+= dst_w;
}
How can I speed up this code? I tried to parallelize it, but there seems to be no speedup because of the memory access pattern (?).
The key is to do most of your computations as ints. The only thing that is necessary to do as a float is the weighting. See here for a good resource.
From that same resource:
int px = (int)x; // floor of x
int py = (int)y; // floor of y
const int stride = img->width;
const Pixel* p0 = img->data + px + py * stride; // pointer to first pixel
// load the four neighboring pixels
const Pixel& p1 = p0[0 + 0 * stride];
const Pixel& p2 = p0[1 + 0 * stride];
const Pixel& p3 = p0[0 + 1 * stride];
const Pixel& p4 = p0[1 + 1 * stride];
// Calculate the weights for each pixel
float fx = x - px;
float fy = y - py;
float fx1 = 1.0f - fx;
float fy1 = 1.0f - fy;
int w1 = fx1 * fy1 * 256.0f;
int w2 = fx * fy1 * 256.0f;
int w3 = fx1 * fy * 256.0f;
int w4 = fx * fy * 256.0f;
// Calculate the weighted sum of pixels (for each color channel)
int outr = p1.r * w1 + p2.r * w2 + p3.r * w3 + p4.r * w4;
int outg = p1.g * w1 + p2.g * w2 + p3.g * w3 + p4.g * w4;
int outb = p1.b * w1 + p2.b * w2 + p3.b * w3 + p4.b * w4;
int outa = p1.a * w1 + p2.a * w2 + p3.a * w3 + p4.a * w4;
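The excerpt stops at the weighted sums, which are still scaled by 256. A self-contained sketch of the whole sampling step, including the final shift back to 8 bits, could look like this. The Pixel layout and the name sampleBilinear are assumptions, and two details differ from the excerpt: the fourth weight is taken as the remainder so that the weights always sum to exactly 256, and 128 is added before the shift to round to nearest.

```cpp
#include <cassert>
#include <cstdint>

struct Pixel { std::uint8_t r, g, b, a; }; // hypothetical pixel layout

// Sample an image at the fractional position (x, y) using 8-bit fixed-point weights.
// Requires 0 <= x < width - 1 and 0 <= y < height - 1 so all four neighbours exist.
Pixel sampleBilinear(const Pixel* data, int width, int height, float x, float y)
{
    assert(data != nullptr);
    assert(x >= 0.0f && y >= 0.0f && x < width - 1 && y < height - 1);

    const int px = static_cast<int>(x); // floor, since x >= 0
    const int py = static_cast<int>(y);
    const Pixel* p0 = data + px + py * width;
    const Pixel& p1 = p0[0];
    const Pixel& p2 = p0[1];
    const Pixel& p3 = p0[width];
    const Pixel& p4 = p0[width + 1];

    const float fx = x - px;
    const float fy = y - py;
    const int w1 = static_cast<int>((1.0f - fx) * (1.0f - fy) * 256.0f);
    const int w2 = static_cast<int>(fx * (1.0f - fy) * 256.0f);
    const int w3 = static_cast<int>((1.0f - fx) * fy * 256.0f);
    const int w4 = 256 - w1 - w2 - w3; // remainder keeps the total at exactly 256

    // Weighted sum per channel, rounded and scaled back to 0..255.
    auto mix = [=](int c1, int c2, int c3, int c4) {
        return static_cast<std::uint8_t>((c1 * w1 + c2 * w2 + c3 * w3 + c4 * w4 + 128) >> 8);
    };
    return Pixel{ mix(p1.r, p2.r, p3.r, p4.r), mix(p1.g, p2.g, p3.g, p4.g),
                  mix(p1.b, p2.b, p3.b, p4.b), mix(p1.a, p2.a, p3.a, p4.a) };
}
```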
Wow, you are doing a lot inside the innermost loop, for example:
1. float to int conversions
You can do everything on floats; they are pretty fast these days.
The conversion is what is killing you.
You are also mixing floats and ints together (if I see it right), which costs the same ...
2. Transform(x, y)
Any unnecessary call causes heap thrashing and slows things down.
Instead, add two variables xx, yy and interpolate them inside your for loops.
3. if ...
Why the heck are you adding an if?
Limit the for ranges before the loop, not inside it.
The background can be filled by other for loops before or afterwards.
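Point 2 can be sketched as follows, assuming the inverse transform is affine, so the source position changes by a constant step for every destination pixel. Affine and warpIncremental are hypothetical names. To keep the focus on the stepping, the sketch samples a grayscale image with nearest neighbour and keeps the bounds check inside the loop; a bilinear sample and pre-clipped loop ranges would slot in at the marked line.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical inverse transform: sx = a * x + b * y + tx, sy = c * x + d * y + ty.
struct Affine { float a, b, c, d, tx, ty; };

std::vector<std::uint8_t> warpIncremental(const std::vector<std::uint8_t>& src,
                                          int srcW, int srcH, int dstW, int dstH,
                                          const Affine& inv, std::uint8_t fill)
{
    assert(srcW > 0 && srcH > 0 && dstW > 0 && dstH > 0);
    assert(src.size() == static_cast<std::size_t>(srcW) * srcH);

    std::vector<std::uint8_t> dst(static_cast<std::size_t>(dstW) * dstH, fill);
    for (int y = 0; y < dstH; ++y) {
        // Source position of destination pixel (0, y), computed once per row
        // so that rounding error does not accumulate across rows.
        float sx = inv.b * y + inv.tx;
        float sy = inv.d * y + inv.ty;
        for (int x = 0; x < dstW; ++x) {
            const int ix = static_cast<int>(std::floor(sx));
            const int iy = static_cast<int>(std::floor(sy));
            if (ix >= 0 && ix < srcW && iy >= 0 && iy < srcH) {
                // A bilinear sample of src around (sx, sy) would go here.
                dst[static_cast<std::size_t>(y) * dstW + x] =
                    src[static_cast<std::size_t>(iy) * srcW + ix];
            }
            // One step to the right in the destination is a constant step in the source.
            sx += inv.a;
            sy += inv.c;
        }
    }
    return dst;
}
```

Adding floats repeatedly drifts slightly over long rows; stepping in fixed-point integers instead would avoid that.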