Scaling issues with OpenMP - C++

I have written code for a special type of 3D CFD simulation, the lattice Boltzmann method (quite similar to the code supplied with the book "The Lattice Boltzmann Method" by Timm Krüger et al.).
When multithreading the program with OpenMP, I have run into behaviour that I can't quite understand: the performance proves to be strongly dependent on the overall domain size.
The basic principle is that each cell of a 3D domain is assigned values for 19 distribution functions (0-18) in discrete directions. They are laid out in two linear arrays allocated on the heap (population 0 is laid out in a separate array): the 18 populations of a given cell are contiguous in memory, the values for consecutive x-coordinates lie next to each other, and so on (sort of row-major: populations -> x -> y -> z).
Those distribution functions redistribute according to certain values within the cell and then get streamed to the neighbouring cells. For this reason I have two population arrays, f1 and f2. The algorithm takes the values from f1, redistributes them and writes them into f2. Then the pointers are swapped and the algorithm starts again.
The code works perfectly fine on a single core, but when I try to parallelise it on multiple cores the performance depends on the overall size of the domain: for very small domains (10^3 cells) the algorithm is comparatively slow at 15 million cells per second, for moderately small domains (30^3 cells) it is quite fast at over 60 million cells per second, and for anything larger than that the performance drops again to about 30 million cells per second. Executing the code on a single core yields the same performance of about 15 million cells per second. These results of course vary between different processors, but qualitatively the same issue remains!
The core of the code boils down to this parallelised loop, which is executed over and over again with the pointers to f1 and f2 swapped between iterations:
#pragma omp parallel for default(none) shared(f0,f1,f2) schedule(static)
for(unsigned int z = 0; z < NZ; ++z)
{
    for(unsigned int y = 0; y < NY; ++y)
    {
        for(unsigned int x = 0; x < NX; ++x)
        {
            /// temporary populations
            double ft0  = f0[D3Q19_ScalarIndex(x,y,z)];
            double ft1  = f1[D3Q19_FieldIndex(x,y,z, 1)];
            double ft2  = f1[D3Q19_FieldIndex(x,y,z, 2)];
            double ft3  = f1[D3Q19_FieldIndex(x,y,z, 3)];
            double ft4  = f1[D3Q19_FieldIndex(x,y,z, 4)];
            double ft5  = f1[D3Q19_FieldIndex(x,y,z, 5)];
            double ft6  = f1[D3Q19_FieldIndex(x,y,z, 6)];
            double ft7  = f1[D3Q19_FieldIndex(x,y,z, 7)];
            double ft8  = f1[D3Q19_FieldIndex(x,y,z, 8)];
            double ft9  = f1[D3Q19_FieldIndex(x,y,z, 9)];
            double ft10 = f1[D3Q19_FieldIndex(x,y,z,10)];
            double ft11 = f1[D3Q19_FieldIndex(x,y,z,11)];
            double ft12 = f1[D3Q19_FieldIndex(x,y,z,12)];
            double ft13 = f1[D3Q19_FieldIndex(x,y,z,13)];
            double ft14 = f1[D3Q19_FieldIndex(x,y,z,14)];
            double ft15 = f1[D3Q19_FieldIndex(x,y,z,15)];
            double ft16 = f1[D3Q19_FieldIndex(x,y,z,16)];
            double ft17 = f1[D3Q19_FieldIndex(x,y,z,17)];
            double ft18 = f1[D3Q19_FieldIndex(x,y,z,18)];
            /// microscopic to macroscopic
            double r = ft0 + ft1 + ft2 + ft3 + ft4 + ft5 + ft6 + ft7 + ft8 + ft9 + ft10 + ft11 + ft12 + ft13 + ft14 + ft15 + ft16 + ft17 + ft18;
            double rinv = 1.0/r;
            double u = rinv*(ft1 - ft2 + ft7 + ft8 + ft9 + ft10 - ft11 - ft12 - ft13 - ft14);
            double v = rinv*(ft3 - ft4 + ft7 - ft8 + ft11 - ft12 + ft15 + ft16 - ft17 - ft18);
            double w = rinv*(ft5 - ft6 + ft9 - ft10 + ft13 - ft14 + ft15 - ft16 + ft17 - ft18);
            /// collision & streaming
            double trw0 = omega*r*w0; // temporary variables
            double trwc = omega*r*wc;
            double trwd = omega*r*wd;
            double uu = 1.0 - 1.5*(u*u+v*v+w*w);
            double bu = 3.0*u;
            double bv = 3.0*v;
            double bw = 3.0*w;
            // calculate x,y,z coordinates of neighbouring cells
            unsigned int xp = (x + 1) % NX;
            unsigned int yp = (y + 1) % NY;
            unsigned int zp = (z + 1) % NZ;
            unsigned int xm = (NX + x - 1) % NX;
            unsigned int ym = (NY + y - 1) % NY;
            unsigned int zm = (NZ + z - 1) % NZ;
            // redistribute distribution functions and stream to neighbouring cells
            f0[D3Q19_ScalarIndex(x,y,z)] = bomega*ft0 + trw0*(uu);
            double cu = bu;
            f2[D3Q19_FieldIndex(xp,y, z,  1)] = bomega*ft1  + trwc*(uu + cu*(1.0 + 0.5*cu));
            cu = -bu;
            f2[D3Q19_FieldIndex(xm,y, z,  2)] = bomega*ft2  + trwc*(uu + cu*(1.0 + 0.5*cu));
            cu = bv;
            f2[D3Q19_FieldIndex(x, yp,z,  3)] = bomega*ft3  + trwc*(uu + cu*(1.0 + 0.5*cu));
            cu = -bv;
            f2[D3Q19_FieldIndex(x, ym,z,  4)] = bomega*ft4  + trwc*(uu + cu*(1.0 + 0.5*cu));
            cu = bw;
            f2[D3Q19_FieldIndex(x, y, zp, 5)] = bomega*ft5  + trwc*(uu + cu*(1.0 + 0.5*cu));
            cu = -bw;
            f2[D3Q19_FieldIndex(x, y, zm, 6)] = bomega*ft6  + trwc*(uu + cu*(1.0 + 0.5*cu));
            cu = bu+bv;
            f2[D3Q19_FieldIndex(xp,yp,z,  7)] = bomega*ft7  + trwd*(uu + cu*(1.0 + 0.5*cu));
            cu = bu-bv;
            f2[D3Q19_FieldIndex(xp,ym,z,  8)] = bomega*ft8  + trwd*(uu + cu*(1.0 + 0.5*cu));
            cu = bu+bw;
            f2[D3Q19_FieldIndex(xp,y, zp, 9)] = bomega*ft9  + trwd*(uu + cu*(1.0 + 0.5*cu));
            cu = bu-bw;
            f2[D3Q19_FieldIndex(xp,y, zm,10)] = bomega*ft10 + trwd*(uu + cu*(1.0 + 0.5*cu));
            cu = -bu+bv;
            f2[D3Q19_FieldIndex(xm,yp,z, 11)] = bomega*ft11 + trwd*(uu + cu*(1.0 + 0.5*cu));
            cu = -bu-bv;
            f2[D3Q19_FieldIndex(xm,ym,z, 12)] = bomega*ft12 + trwd*(uu + cu*(1.0 + 0.5*cu));
            cu = -bu+bw;
            f2[D3Q19_FieldIndex(xm,y, zp,13)] = bomega*ft13 + trwd*(uu + cu*(1.0 + 0.5*cu));
            cu = -bu-bw;
            f2[D3Q19_FieldIndex(xm,y, zm,14)] = bomega*ft14 + trwd*(uu + cu*(1.0 + 0.5*cu));
            cu = bv+bw;
            f2[D3Q19_FieldIndex(x, yp,zp,15)] = bomega*ft15 + trwd*(uu + cu*(1.0 + 0.5*cu));
            cu = bv-bw;
            f2[D3Q19_FieldIndex(x, yp,zm,16)] = bomega*ft16 + trwd*(uu + cu*(1.0 + 0.5*cu));
            cu = -bv+bw;
            f2[D3Q19_FieldIndex(x, ym,zp,17)] = bomega*ft17 + trwd*(uu + cu*(1.0 + 0.5*cu));
            cu = -bv-bw;
            f2[D3Q19_FieldIndex(x, ym,zm,18)] = bomega*ft18 + trwd*(uu + cu*(1.0 + 0.5*cu));
        }
    }
}
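(For completeness: the swap between time steps mentioned above is just an exchange of the two pointers, e.g. std::swap(f1, f2); from <utility>.)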
It would be awesome if someone could give me tips on how to track down the reason for this particular behaviour, or even has an idea of what could be causing it.
If needed I can supply a full version of the simplified code!
Thanks a lot in advance!

Achieving scaling on shared memory systems (threaded code on a single machine) is quite tricky, and often requires large amounts of tuning. What's likely happening in your code is that part of the domain for each thread fits into cache for the "quite small" problem size, but as the problem size increases in NX and NY, the data per thread stops fitting into cache.
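To put rough numbers on that (an estimate, assuming 8-byte doubles and the layout described in the question): each cell reads 19 values from f1 and writes 19 values to f2 per time step, i.e. about 2 * 19 * 8 = 304 bytes of traffic per cell. A 10^3 domain is therefore only ~0.3 MB in total, so small that thread start-up and synchronization overhead dominates; a 30^3 domain is ~8 MB, which split across a few threads is roughly the size of a CPU's caches; anything much larger no longer fits in any cache, and the loop becomes bound by main-memory bandwidth.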
To avoid issues like this, it is better to decompose the domain into fixed-size blocks that change in number, rather than in size, with the domain:
// needs <cmath> (std::ceil) and <algorithm> (std::min)
const unsigned int numBlocksZ = std::ceil(static_cast<double>(NZ) / BLOCK_SIZE);
const unsigned int numBlocksY = std::ceil(static_cast<double>(NY) / BLOCK_SIZE);
const unsigned int numBlocksX = std::ceil(static_cast<double>(NX) / BLOCK_SIZE);
const unsigned int numBlocks  = numBlocksX * numBlocksY * numBlocksZ;
#pragma omp parallel for default(none) shared(f0,f1,f2) schedule(static,1)
for(unsigned int block = 0; block < numBlocks; ++block)
{
    // decode the linear block index into the block's start/end coordinates
    unsigned int startZ = BLOCK_SIZE * (block / (numBlocksX * numBlocksY));
    unsigned int endZ   = std::min(startZ + BLOCK_SIZE, NZ);
    unsigned int startY = BLOCK_SIZE * ((block % (numBlocksX * numBlocksY)) / numBlocksX);
    unsigned int endY   = std::min(startY + BLOCK_SIZE, NY);
    unsigned int startX = BLOCK_SIZE * (block % numBlocksX);
    unsigned int endX   = std::min(startX + BLOCK_SIZE, NX);
    for(unsigned int z = startZ; z < endZ; ++z)
    {
        for(unsigned int y = startY; y < endY; ++y)
        {
            for(unsigned int x = startX; x < endX; ++x)
            {
                ...
            }
        }
    }
}
An approach like the above should also increase cache locality by using 3D blocking (assuming this is a 3D stencil operation), and further improve your performance. You'll need to tune BLOCK_SIZE to find what gives you the best performance on a given system (I'd start small and increase in powers of two, e.g., 4, 8, 16, ...).

Related

Converting 1-d array to 2d

I am trying to understand this code:
void stencil(const int nx, const int ny, const int width, const int height,
             double* image, double* tmp_image)
{
    for (int j = 1; j < ny + 1; ++j) {
        for (int i = 1; i < nx + 1; ++i) {
            tmp_image[j + i * height]  = image[j + i * height] * 3.0 / 5.0;
            tmp_image[j + i * height] += image[j + (i - 1) * height] * 0.5 / 5.0;
            tmp_image[j + i * height] += image[j + (i + 1) * height] * 0.5 / 5.0;
            tmp_image[j + i * height] += image[j - 1 + i * height] * 0.5 / 5.0;
            tmp_image[j + i * height] += image[j + 1 + i * height] * 0.5 / 5.0;
        }
    }
}
The 1-d array notation is very confusing. I am trying to convert it to a 2-d notation (which I find easier to read). Could someone point me in the right direction as to how I can accomplish this?
All this code is doing is creating a new image from an original image by taking 60% from the corresponding pixel and 10% from each neighboring pixel.
When you see tmp_image[j + i * height], read it as tmp_image[i][j].
Changing the code to literally use 2D syntax may require knowing at least one of the dimensions at compile time, whereas now it is a runtime argument. So that might be a non-starter, unless you're using C++ and want to write or use a matrix class instead of plain arrays.
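As a sketch of that last option (my own illustration; View2D is a hypothetical helper, not part of the original code), a thin wrapper class can give you 2D syntax over the existing 1D buffer while the dimensions stay runtime values:
#include <cstddef>
// Minimal 2D view over the 1D buffer, matching the indexing used
// above: view(i, j) accesses data[j + i * height].
struct View2D {
    double* data;
    std::size_t height; // stride between consecutive i values
    double& operator()(std::size_t i, std::size_t j) { return data[j + i * height]; }
};
With that, the stencil body reads, for example:
View2D img{image, (std::size_t)height}, tmp{tmp_image, (std::size_t)height};
tmp(i, j) = img(i, j) * 3.0 / 5.0; // and so on for the neighbours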

How do I resolve a collision's position properly in 2D collision detection?

My current implementation looks like this:
if (shapesCollide) {
    if (velocity.y > 0) entity.position.y = other.position.y - entity.size.y;
    else entity.position.y = other.position.y + other.size.y;
    velocity.y = 0;
    if (velocity.x > 0) entity.position.x = other.position.x - entity.size.x;
    else entity.position.x = other.position.x + other.size.x;
    velocity.x = 0;
}
However, this leads to weird handling when movement happens on both axes: for example, with the entity moving downward to the left of the object, moving it to collide with the object will correctly resolve the horizontal collision, but will break the vertical movement.
I previously simply went
if (shapesCollide) {
    position = oldPosition;
    velocity = { 0, 0 };
}
But this led to another multi-axis issue: if I have my entity resting atop the object, it will be unable to move, as the gravity-induced movement will constantly cancel out both velocities. I also tried considering both axes separately, but this led to issues whenever the collision only occurs when both velocities are taken into account.
What is the best solution to resolving collision on two axes?
I assume that the entities can be considered more or less round, and that size is the radius of an entity?
We probably need a little vector math to resolve this. (The square-root function used below is sqrt, from <cmath>.) Try replacing your code inside if (shapesCollide) with this and see how it works for you.
float rEntity = sqrt(entity.size.x * entity.size.x + entity.size.y * entity.size.y);
float rOther = sqrt(other.size.x * other.size.x + other.size.y * other.size.y);
float midX = (entity.position.x + other.position.x) / 2.0;
float midY = (entity.position.y + other.position.y) / 2.0;
float dx = entity.position.x - midX;
float dy = entity.position.y - midY;
float D = sqrt(dx * dx + dy * dy);
rEntity and rOther are the radii of the objects, and midX and midY are the coordinates of the midpoint between them. dx and dy are the offsets of the entity from that midpoint.
Then do:
entity.position.x = midX + dx * rEntity / D;
entity.position.y = midY + dy * rEntity / D;
other.position.x = midX - dx * rOther / D;
other.position.y = midY - dy * rOther / D;
You should probably check that D is not 0, and if it is, just set dx = 1, dy = 0, D = 1 or something like that.
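For instance, a sketch of that guard, placed just before the position updates:
if (D == 0.0f) // entities are exactly on top of each other
{
    dx = 1.0f;  // pick an arbitrary push-out direction
    dy = 0.0f;
    D = 1.0f;
}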
You should also still do:
velocity.x = 0;
velocity.y = 0;
if you want the entities to stop.
For more accurate modelling, you could also try the following:
float rEntity = sqrt(entity.size.x * entity.size.x + entity.size.y * entity.size.y);
float rOther = sqrt(other.size.x * other.size.x + other.size.y * other.size.y);
float midX = (entity.position.x * rOther + other.position.x * rEntity) / (rEntity + rOther);
float midY = (entity.position.y * rOther + other.position.y * rEntity) / (rEntity + rOther);
float dxEntity = entity.position.x - midX;
float dyEntity = entity.position.y - midY;
float dEntity = sqrt(dxEntity * dxEntity + dyEntity * dyEntity);
float dxOther = other.position.x - midX;
float dyOther = other.position.y - midY;
float dOther = sqrt(dxOther * dxOther + dyOther * dyOther);
entity.position.x = midX + dxEntity * rEntity / dEntity;
entity.position.y = midY + dyEntity * rEntity / dEntity;
other.position.x = midX + dxOther * rOther / dOther;
other.position.y = midY + dyOther * rOther / dOther;
which finds the midpoints when the radii are taken into account. But I won't guarantee that that works. Also, the signs on the last additions are important.
I hope this helps (and works). Let me know if something is unclear.

Is there a chance to make the bilinear interpolation faster?

First I want to provide you with some context.
I have two kinds of images I need to merge. The first image is the background image, with the format 8BppGrey and a resolution of 320x240. The second image is the foreground image, with the format 32BppRGBA and a resolution of 64x48.
Update
The GitHub repo with an MVP is at the bottom of the question.
To do this I resize the second image with bilinear interpolation to the same size as the first one, and then use blending to merge both into one image. Blending only happens when the alpha value of the second image is greater than 0.
I need to do it as fast as possible so my idea was to combine the resize and merge / blend process.
To achieve this I used the resize function from the writeablebitmapex repository and added merging / blending.
Everything works as expected but I want to decrease the execution time.
These are the current debug timings:
// CPU: Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 5 ms.
MediaServer: Execution time in c++ 4 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 3 ms
MediaServer: Resizing took 3 ms.
MediaServer: Execution time in c++ 4 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 6 ms
MediaServer: Resizing took 6 ms.
MediaServer: Execution time in c++ 3 ms
MediaServer: Resizing took 3 ms.
Do I have any chance to increase the performance and lower the execution time of the resize / merge / blend process?
Are there some parts I could maybe parallelize?
Do I maybe have a chance to use some processor features?
A huge performance hit is the nested loop, but I have no idea how I could write it better.
I would like to reach 1 or 2 ms for the whole process. Is this even possible?
Here's the modified Visual C++ function I use.
pd is the backbuffer of the writeable bitmap I use to display the result in WPF. The format I use is the default 32BppRGBA.
pixels is the int[] array of the 64x48 32BppRGBA image.
widthSource and heightSource are the size of the pixels image.
width and height are the target size of the output image.
baseImage is the byte[] array of the 320x240 8BppGray image.
VC++ code:
unsigned int Resize(int* pd, int* pixels, int widthSource, int heightSource, int width, int height, byte* baseImage)
{
    unsigned int start = clock();
    float xs = (float)widthSource / width;
    float ys = (float)heightSource / height;
    float fracx, fracy, ifracx, ifracy, sx, sy, l0, l1, rf, gf, bf;
    int c, x0, x1, y0, y1;
    byte c1a, c1r, c1g, c1b, c2a, c2r, c2g, c2b, c3a, c3r, c3g, c3b, c4a, c4r, c4g, c4b;
    byte a, r, g, b;
    // Bilinear
    int srcIdx = 0;
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            sx = x * xs;
            sy = y * ys;
            x0 = (int)sx;
            y0 = (int)sy;
            // Calculate coordinates of the 4 interpolation points
            fracx = sx - x0;
            fracy = sy - y0;
            ifracx = 1.0f - fracx;
            ifracy = 1.0f - fracy;
            x1 = x0 + 1;
            if (x1 >= widthSource)
            {
                x1 = x0;
            }
            y1 = y0 + 1;
            if (y1 >= heightSource)
            {
                y1 = y0;
            }
            // Read source color
            c = pixels[y0 * widthSource + x0];
            c1a = (byte)(c >> 24);
            c1r = (byte)(c >> 16);
            c1g = (byte)(c >> 8);
            c1b = (byte)(c);
            c = pixels[y0 * widthSource + x1];
            c2a = (byte)(c >> 24);
            c2r = (byte)(c >> 16);
            c2g = (byte)(c >> 8);
            c2b = (byte)(c);
            c = pixels[y1 * widthSource + x0];
            c3a = (byte)(c >> 24);
            c3r = (byte)(c >> 16);
            c3g = (byte)(c >> 8);
            c3b = (byte)(c);
            c = pixels[y1 * widthSource + x1];
            c4a = (byte)(c >> 24);
            c4r = (byte)(c >> 16);
            c4g = (byte)(c >> 8);
            c4b = (byte)(c);
            // Calculate colors
            // Alpha
            l0 = ifracx * c1a + fracx * c2a;
            l1 = ifracx * c3a + fracx * c4a;
            a = (byte)(ifracy * l0 + fracy * l1);
            // Write destination
            if (a > 0)
            {
                // Red
                l0 = ifracx * c1r + fracx * c2r;
                l1 = ifracx * c3r + fracx * c4r;
                rf = ifracy * l0 + fracy * l1;
                // Green
                l0 = ifracx * c1g + fracx * c2g;
                l1 = ifracx * c3g + fracx * c4g;
                gf = ifracy * l0 + fracy * l1;
                // Blue
                l0 = ifracx * c1b + fracx * c2b;
                l1 = ifracx * c3b + fracx * c4b;
                bf = ifracy * l0 + fracy * l1;
                // Blend with the gray background and cast to byte
                float alpha = a / 255.0f;
                r = (byte)((rf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
                g = (byte)((gf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
                b = (byte)((bf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
                pd[srcIdx++] = (255 << 24) | (r << 16) | (g << 8) | b;
            }
            else
            {
                // Alpha, Red, Green, Blue
                byte gray = baseImage[srcIdx]; // read before srcIdx is incremented
                                               // (reading and incrementing in one statement
                                               // is unsequenced and thus undefined behaviour)
                pd[srcIdx++] = (255 << 24) | (gray << 16) | (gray << 8) | gray;
            }
        }
    }
    unsigned int end = clock() - start;
    return end;
}
GitHub repo
One action that may speed up your code is to avoid type conversions from integer to float and vice versa. This can be achieved by working with an int value in a suitable range instead of floats in the range 0..1.
Something like this:
for (int y = 0; y < height; y++)
{
    for (int x = 0; x < width; x++)
    {
        int sx1 = x * widthSource;
        int x0 = sx1 / width;
        int fracx = sx1 % width; // range 0..width - 1
which turns into something like
l0 = (fracx * c2a + (width - fracx) * c1a) / width;
And so on. A bit tricky, but doable.
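Carried through for one channel (my sketch of the same idea, with sy1, y0 and fracy defined analogously for the y direction):
int sy1 = y * heightSource;
int y0 = sy1 / height;
int fracy = sy1 % height; // range 0..height - 1
// bilinear blend of the four corner values, entirely in integer math
int l0 = (fracx * c2a + (width - fracx) * c1a) / width;
int l1 = (fracx * c4a + (width - fracx) * c3a) / width;
int a = (fracy * l1 + (height - fracy) * l0) / height; // final alpha in 0..255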
Thank you for all the help, but the problem was the managed C++ project. I have now transferred the function to my native C++ library and use the managed C++ part only as a wrapper for the C# application.
After compiler optimization the function now finishes in 1 ms.
Edit:
I will mark my own answer as the solution for now, because the optimization from @marom leads to a broken image.
The common way to speed up a resize operation with bilinear interpolation is to:
1. Exploit the fact that x0 and fracx are independent of the row and that y0 and fracy are independent of the column. Even though you haven't pulled the computation of y0 and fracy out of the x-loop, compiler optimization should take care of that. However, for x0 and fracx, one needs to pre-compute the values for all columns and store them in an array. The complexity of computing x0 and fracx then becomes O(width), compared to O(width*height) without pre-computation.
2. Do the whole processing with integers by replacing floating-point arithmetic with integer arithmetic, using shift operations instead of integer divisions.
For better readability, I did not implement the pre-computation of x0 and fracx in the following code. Pre-computation is straightforward anyway.
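For reference, it would look something like this (my sketch, using the xs, FACTOR and SHIFT defined in the code below):
#include <vector>
std::vector<int> x0Tab(width), fracxTab(width); // one entry per output column
for (int x = 0; x < width; x++)
{
    const int sx = x * xs;
    x0Tab[x] = sx >> SHIFT;                 // integer source column
    fracxTab[x] = sx - (x0Tab[x] << SHIFT); // fractional part, scaled by FACTOR
}
The x-loop then reads x0 = x0Tab[x] and fracx = fracxTab[x] instead of recomputing them for every pixel of every row.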
Note that FACTOR = 2048 is the max you can do with 32-bit signed integers here (2048 * 2048 * 255 is just fine). For higher precision, you should switch to int64_t and then increase FACTOR and SHIFT, respectively.
I placed the border check into the inner loop for better readability. For an optimized implementation one should remove it by iterating in both loops just before this case happens and add special handling for the border pixels.
In case someone is wondering what the + (FACTOR * FACTOR / 2) is for, it is for rounding in conjunction with the subsequent division.
Finally note that (FACTOR * FACTOR / 2) and 2 * SHIFT are evaluated at compile time.
#define FACTOR 2048
#define SHIFT 11

const int xs = (int) ((double) FACTOR * widthSource / width + 0.5);
const int ys = (int) ((double) FACTOR * heightSource / height + 0.5);

for (int y = 0; y < height; y++)
{
    const int sy = y * ys;
    const int y0 = sy >> SHIFT;
    const int fracy = sy - (y0 << SHIFT);
    for (int x = 0; x < width; x++)
    {
        const int sx = x * xs;
        const int x0 = sx >> SHIFT;
        const int fracx = sx - (x0 << SHIFT);
        if (x0 >= widthSource - 1 || y0 >= heightSource - 1)
        {
            // insert special handling here
            continue;
        }
        const int offset = y0 * widthSource + x0;
        target[y * width + x] = (unsigned char)
            ((source[offset] * (FACTOR - fracx) * (FACTOR - fracy) +
              source[offset + 1] * fracx * (FACTOR - fracy) +
              source[offset + widthSource] * (FACTOR - fracx) * fracy +
              source[offset + widthSource + 1] * fracx * fracy +
              (FACTOR * FACTOR / 2)) >> (2 * SHIFT));
    }
}
For clarification, to match the variables used by the OP, for instance, in the case of the alpha channel it is:
a = (unsigned char)
    ((c1a * (FACTOR - fracx) * (FACTOR - fracy) +
      c2a * fracx * (FACTOR - fracy) +
      c3a * (FACTOR - fracx) * fracy +
      c4a * fracx * fracy +
      (FACTOR * FACTOR / 2)) >> (2 * SHIFT));

Bilinear interpolation in 2D transformation Qt

I'm currently working on 2D transformations (translation, scaling, shearing and rotation) in Qt. I have a problem with bilinear interpolation, which I want to use to cover the 'black pixels' in the output image. I'm using matrix calculations to get the new coordinates of the pixels of the input image. Then I use the reverse matrix calculation to check which pixel of the input image corresponds to a given output pixel. The result of that is a float value, which I use for the interpolation. I check the four neighbouring points and calculate the value (colour) of the output pixel. I have checked my calculations 'by hand' and they seem to be good.
Can anyone find a bug in that code? (I have cut out the parts of the code responsible for the interface, such as sliders.)
Geometric::Geometric(QWidget* parent) : QWidget(parent) {
    resize(1000, 800);
    displayLogoDefault = true;
    a = shx = shy = x0 = y0 = 0;
    scx = scy = 1;
    tx = ty = 0;
    x = 200, y = 200;
    paintT = paintSc = paintR = paintShx = paintShy = false;
    img = new QImage(600, 600, QImage::Format_RGB32);
    img2 = new QImage("logo.jpeg");
}

Geometric::~Geometric() {
    delete img;
    delete img2;
    img = NULL;
    img2 = NULL;
}

void Geometric::makeChange() {
    displayLogoDefault = false;
    // iterate through the whole input image
    for(int i = 0; i < img2->width(); i++) {
        for(int j = 0; j < img2->height(); j++) {
            // calculate new coordinates based on the given 2D transformation values
            // (I calculated this formula earlier by multiplying/adding matrices)
            x = cos(a)*scx*(i-x0) - sin(a)*scy*(j-y0) + shx*sin(a)*scx*(i-x0) + shx*cos(a)*scy*(j-y0);
            y = shy*(x) + sin(a)*scx*(i-x0) + cos(a)*scy*(j-y0);
            // tx and ty are for translation, scx and scy for scaling,
            // shx and shy for shearing, and a is the angle for rotation
            x += (x0 + tx);
            y += (y0 + ty);
            if(x >= 0 && y >= 0 && x < img->width() && y < img->height()) {
                // reverse matrix calculation formula to find the proper pixel of the input image
                float tmx = x - x0 - tx;
                float tmy = y - y0 - ty;
                float recX = 1/scx * ( cos(-a)*( (tmx + shx*shy*tmx - shx*tmx) ) + sin(-a)*( shy*tmx - tmy ) ) + x0;
                float recY = 1/scy * ( sin(-a)*(tmx + shx*shy*tmx - shx*tmx) - cos(-a)*(shy*tmx-tmy) ) + y0;
                // here the interpolation starts; I calculate the colour from four points
                // of the input image, taken from the reverse matrix calculation
                float a = recX - floorf(recX);
                float b = recY - floorf(recY);
                if(recX + 1 > img2->width()) recX -= 1;
                if(recY + 1 > img2->height()) recY -= 1;
                QColor c1 = QColor(img2->pixel(recX, recY));
                QColor c2 = QColor(img2->pixel(recX + 1, recY));
                QColor c3 = QColor(img2->pixel(recX, recY + 1));
                QColor c4 = QColor(img2->pixel(recX + 1, recY + 1));
                float colR = b * ((1.0 - a) * (float)c3.red() + a * (float)c4.red()) + (1.0 - b) * ((1.0 - a) * (float)c1.red() + a * (float)c2.red());
                float colG = b * ((1.0 - a) * (float)c3.green() + a * (float)c4.green()) + (1.0 - b) * ((1.0 - a) * (float)c1.green() + a * (float)c2.green());
                float colB = b * ((1.0 - a) * (float)c3.blue() + a * (float)c4.blue()) + (1.0 - b) * ((1.0 - a) * (float)c1.blue() + a * (float)c2.blue());
                if(colR > 255) colR = 255; if(colG > 255) colG = 255; if(colB > 255) colB = 255;
                if(colR < 0) colR = 0; if(colG < 0) colG = 0; if(colB < 0) colB = 0;
                paintPixel(x, y, colR, colG, colB);
            }
        }
    }
    // x0 and y0 are the starting point of the image
    x0 = abs(x - tx);
    y0 = abs(y - ty);
    repaint();
}

// function painting a pixel; it works directly on memory
void Geometric::paintPixel(int i, int j, int r, int g, int b) {
    unsigned char *ptr = img->bits();
    ptr[4 * (img->width() * j + i)] = b;
    ptr[4 * (img->width() * j + i) + 1] = g;
    ptr[4 * (img->width() * j + i) + 2] = r;
}

void Geometric::paintEvent(QPaintEvent*) {
    QPainter p(this);
    p.drawImage(0, 0, *img);
    if (displayLogoDefault == true) p.drawImage(0, 0, *img2);
}

How to speed up bilinear interpolation of image?

I'm trying to rotate an image with interpolation, but it's too slow for real time for big images.
The code is something like this:
for(int y = 0; y < dst_h; ++y)
{
    for(int x = 0; x < dst_w; ++x)
    {
        // do the inverse transform
        fPoint pt(Transform(Point(x, y)));
        // in coordinates of src
        int x1 = (int)floor(pt.x);
        int y1 = (int)floor(pt.y);
        int x2 = x1 + 1;
        int y2 = y1 + 1;
        if((x1 >= 0 && x1 < src_w && y1 >= 0 && y1 < src_h) && (x2 >= 0 && x2 < src_w && y2 >= 0 && y2 < src_h))
        {
            Mask[y][x] = 1; // show pixel
            float dx1 = pt.x - x1;
            float dx2 = 1 - dx1;
            float dy1 = pt.y - y1;
            float dy2 = 1 - dy1;
            // bilinear
            pd[x].blue  = (dy2*(ps[y1*src_w+x1].blue*dx2 + ps[y1*src_w+x2].blue*dx1) +
                           dy1*(ps[y2*src_w+x1].blue*dx2 + ps[y2*src_w+x2].blue*dx1));
            pd[x].green = (dy2*(ps[y1*src_w+x1].green*dx2 + ps[y1*src_w+x2].green*dx1) +
                           dy1*(ps[y2*src_w+x1].green*dx2 + ps[y2*src_w+x2].green*dx1));
            pd[x].red   = (dy2*(ps[y1*src_w+x1].red*dx2 + ps[y1*src_w+x2].red*dx1) +
                           dy1*(ps[y2*src_w+x1].red*dx2 + ps[y2*src_w+x2].red*dx1));
            // nearest neighbour
            //pd[x] = ps[((int)pt.y)*src_w + (int)pt.x];
        }
        else
            Mask[y][x] = 0; // transparent pixel
    }
    pd += dst_w;
}
How can I speed up this code? I tried to parallelize it, but there seems to be no speed-up, perhaps because of the memory access pattern(?).
The key is to do most of your computations as ints. The only thing that is necessary to do as a float is the weighting. See here for a good resource.
From that same resource:
int px = (int)x; // floor of x
int py = (int)y; // floor of y
const int stride = img->width;
const Pixel* p0 = img->data + px + py * stride; // pointer to first pixel
// load the four neighboring pixels
const Pixel& p1 = p0[0 + 0 * stride];
const Pixel& p2 = p0[1 + 0 * stride];
const Pixel& p3 = p0[0 + 1 * stride];
const Pixel& p4 = p0[1 + 1 * stride];
// Calculate the weights for each pixel
float fx = x - px;
float fy = y - py;
float fx1 = 1.0f - fx;
float fy1 = 1.0f - fy;
int w1 = fx1 * fy1 * 256.0f;
int w2 = fx * fy1 * 256.0f;
int w3 = fx1 * fy * 256.0f;
int w4 = fx * fy * 256.0f;
// Calculate the weighted sum of pixels (for each color channel)
int outr = p1.r * w1 + p2.r * w2 + p3.r * w3 + p4.r * w4;
int outg = p1.g * w1 + p2.g * w2 + p3.g * w3 + p4.g * w4;
int outb = p1.b * w1 + p2.b * w2 + p3.b * w3 + p4.b * w4;
int outa = p1.a * w1 + p2.a * w2 + p3.a * w3 + p4.a * w4;
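Note that the four weights sum to 256 (fx1*fy1 + fx*fy1 + fx1*fy + fx*fy = 1, scaled by 256), so each weighted sum still needs a final shift back into the 0..255 range. Something like this (my addition, not part of the quoted resource):
// normalize: the weights carry a factor of 256, so shift right by 8
Pixel out;
out.r = (unsigned char)(outr >> 8);
out.g = (unsigned char)(outg >> 8);
out.b = (unsigned char)(outb >> 8);
out.a = (unsigned char)(outa >> 8);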
Wow, you are doing a lot inside the innermost loop. For example:
1. Float-to-int conversions. You could do everything in floats; they are pretty fast these days, and the conversion is what is killing you. You are also mixing floats and ints together (if I see it right), which costs the same.
2. Transform(x, y). Any unnecessary per-pixel call slows things down. Instead, add two variables xx, yy and step/interpolate them inside your for loops (see the sketch below).
3. The if. Why on earth are you adding an if inside the loop? Limit the for ranges before the loop, not inside it; the background can be filled with other loops before or afterwards.
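A sketch of point 2, assuming Transform is an affine map such as the inverse rotation here (the names angle, cx and cy for the rotation angle and the transformed origin are hypothetical):
// Transform(x, y) = (cx + x*ca - y*sa, cy + x*sa + y*ca), so moving one
// pixel in x changes the result by the constant (ca, sa); the per-pixel
// function call disappears entirely.
const float ca = cos(angle), sa = sin(angle);
for (int y = 0; y < dst_h; ++y)
{
    float xx = cx - y * sa; // inverse transform of (0, y)
    float yy = cy + y * ca;
    for (int x = 0; x < dst_w; ++x)
    {
        // ... use (xx, yy) wherever pt.x and pt.y were used before ...
        xx += ca; // step to the inverse transform of (x + 1, y)
        yy += sa;
    }
}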