Compare roots of quadratic functions - c++

I need a function that quickly compares a root of a quadratic function with a given value, and a function that quickly compares two roots of two quadratic functions.
I wrote the first function:
bool isRootLessThanValue (bool sqrtDeltaSign, int a, int b, int c, int value) {
    bool ret;
    if(sqrtDeltaSign){
        if(a < 0){
            ret = (2*a*value + b < 0) || (a*value*value + b*value + c > 0);
        }else{
            ret = (2*a*value + b > 0) && (a*value*value + b*value + c > 0);
        }
    }else{
        if(a < 0){
            ret = (2*a*value + b < 0) && (a*value*value + b*value + c < 0);
        }else{
            ret = (2*a*value + b > 0) || (a*value*value + b*value + c < 0);
        }
    }
    return ret;
}
When I try to write the second function the same way, it grows very big and complicated...
bool isRoot1LessThanRoot2 (bool sqrtDeltaSign1, int a1, int b1, int c1, bool sqrtDeltaSign2, int a2, int b2, int c2) {
Do you have any suggestions for how I can simplify this function?
If you think this is a stupid idea for an optimization, please tell me why :)

Here is a simplified version of the first part of your code, which compares the greater root of the quadratic function with a given value:
#include <iostream>
#include <cmath> // for main testing
int isRootLessThanValue (int a, int b, int c, int value)
{
    if (a<0){ b *= -1; c *= -1; a *= -1;}
    int xt, delta;
    xt = 2 * a * value + b;
    if (xt < 0) return false; // value is left of the axis of symmetry
    delta = b*b - 4*a*c;
    // compare the squared distance between value and the axis with the discriminant
    return ( (xt * xt) > delta )? true: false;
}
In the test main() program, the roots are first calculated for clarity:
int main()
{
    int a, b, c, v;
    a = -2;
    b = 4;
    c = 3;
    double r1, r2, r, dt;
    dt = std::sqrt(b*b-4.0*a*c);
    r1 = (-b + dt) / (2.0*a);
    r2 = (-b - dt) / (2.0*a);
    r = (r1>r2)? r1 : r2;
    while (1)
    {
        std::cout << "Input the try value = ";
        std::cin >> v;
        if (isRootLessThanValue(a,b,c,v)) std::cout << v <<" > " << r << std::endl;
        else std::cout << v <<" < " << r << std::endl;
    }
    return 0;
}

The following assumes that both quadratics have real, mutually distinct roots, and that a1 = a2 = 1. This keeps the notation simpler, though similar logic can be used in the general case.
Suppose f(x) = x^2 + b1 x + c1 has the real roots u1 < u2, and g(x) = x^2 + b2 x + c2 has the real roots v1 < v2. Then there are 6 possible sort orders.
(1)   u1 < u2 < v1 < v2
(2)   u1 < v1 < u2 < v2
(3)   u1 < v1 < v2 < u2
(4)   v1 < u1 < u2 < v2
(5)   v1 < u1 < v2 < u2
(6)   v1 < v2 < u1 < u2
Let v be a root of g so that g(v) = v^2 + b2 v + c2 = 0 then v^2 = -b2 v - c2 and therefore f(v) = (b1 - b2) v + c1 - c2 = b12 v + c12 where b12 = b1 - b2 and c12 = c1 - c2.
It follows that Sf = f(v1) + f(v2) = b12 (v1 + v2) + 2 c12 and Pf = f(v1) f(v2) = b12^2 v1 v2 + b12 c12 (v1 + v2) + c12^2. By Vieta's relations, v1 v2 = c2 and v1 + v2 = -b2, so in the end Sf = f(v1) + f(v2) = -b12 b2 + 2 c12 and Pf = f(v1) f(v2) = b12^2 c2 - b12 c12 b2 + c12^2. Similar expressions can be calculated for Sg = g(u1) + g(u2) and Pg = g(u1) g(u2).
(Note that Sf, Pf, Sg, Pg above are arithmetic expressions in the coefficients and involve no square roots. There is, however, the potential for integer overflow. If that is an actual concern, then the calculations would have to be done in floating point instead of integers.)
If Pf = f(v1) f(v2) < 0 then exactly one root of f is between the roots v1, v2 of g.
If the axis of f is to the left of the axis of g, meaning -b1 < -b2, then it is the larger root u2 of f which is between v1 and v2, i.e. case (2).
Otherwise, if -b1 > -b2, it is the smaller root u1, i.e. case (5).
If Pf = f(v1) f(v2) > 0 then either both or none of the roots of f are between the roots of g. In this case f(v1) and f(v2) must have the same sign, and they will either be both negative if Sf = f(v1) + f(v2) < 0 or both positive if Sf > 0.
If f(v1) < 0 and f(v2) < 0 then both roots v1, v2 of g are between the roots of f i.e. case (3).
By symmetry, if Pg > 0 and Sg < 0 then g(u1) < 0 and g(u2) < 0, so both roots u1, u2 of f are between the roots of g i.e. case (4).
Otherwise the last combination left is f(v1), f(v2) > 0 and g(u1), g(u2) > 0 where the intervals (u1, u2) and (v1, v2) do not overlap. If -b1 < -b2 the axis of f is to the left of the g one i.e. case (1) else it's case (6).
Once the sort order between all roots is determined, comparing any particular pair of roots follows.
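The case analysis above can be sketched in code. This is only a minimal sketch, assuming monic quadratics with real, mutually distinct roots as stated at the start. The name rootOrder and the choice of 64-bit integers (to lower the overflow risk) are assumptions made for illustration. It relies on Pf and Pg being equal (both are the product of all differences v_j - u_i), so only Pf is computed, and on the observation that when exactly one root of f lies between v1 and v2 and the axis of f is to the left, the order must be u1 < v1 < u2 < v2.

```cpp
#include <cstdint>

// Returns the case number (1..6) from the list of sort orders above for
// f(x) = x^2 + b1 x + c1 (roots u1 < u2) and g(x) = x^2 + b2 x + c2 (roots v1 < v2).
// Assumption: both quadratics have real roots and all four roots are distinct.
int rootOrder(std::int64_t b1, std::int64_t c1, std::int64_t b2, std::int64_t c2) {
    const std::int64_t b12 = b1 - b2;
    const std::int64_t c12 = c1 - c2;
    const bool fAxisLeft = -b1 < -b2;  // axis of f is left of the axis of g

    // Pf = f(v1) f(v2); it has the same value as Pg = g(u1) g(u2).
    const std::int64_t Pf = b12 * b12 * c2 - b12 * c12 * b2 + c12 * c12;
    if (Pf < 0) return fAxisLeft ? 2 : 5;  // exactly one root of f inside (v1, v2)

    // Sf = f(v1) + f(v2): negative means v1, v2 both lie between u1 and u2.
    const std::int64_t Sf = -b12 * b2 + 2 * c12;
    if (Sf < 0) return 3;

    // Sg = g(u1) + g(u2), obtained by swapping the roles of f and g.
    const std::int64_t Sg = b12 * b1 - 2 * c12;
    if (Sg < 0) return 4;

    return fAxisLeft ? 1 : 6;  // the intervals (u1, u2) and (v1, v2) do not overlap
}
```

For example, f = (x - 1)(x - 3) and g = (x - 2)(x - 4) give rootOrder(-4, 3, -6, 8), which falls into case (2).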

We are definitely talking micro-optimization here, but consider making calculations before performing the comparison:
bool isRootLessThanValue (bool sqrtDeltaSign, int a, int b, int c, int value)
{
    const int a_value = a * value;
    const int two_a_b_value = 2 * a_value + b;
    const int a_squared_b = a_value * value + b * value + c;
    const bool two_ab_less_zero = (two_a_b_value < 0);
    bool ret = false;
    if (sqrtDeltaSign)
    {
        const bool a_squared_b_greater_zero = (a_squared_b > 0);
        if (a < 0)
            ret = two_ab_less_zero || a_squared_b_greater_zero;
        else
            ret = !two_ab_less_zero && a_squared_b_greater_zero;//(edited)
    }
    else
    {
        const bool a_squared_b_less_zero = (a_squared_b < 0);
        if (a < 0)
            ret = two_ab_less_zero && a_squared_b_less_zero;
        else
            ret = !two_ab_less_zero || a_squared_b_less_zero;//(edited)
    }
    return ret;
}
Note also that each boolean expression is calculated and stored in a variable, so it may be compiled as a data-processing instruction (depending on the compiler and processor).
Compare the assembly language of this function to yours. Also benchmark. As I said, I'm not expecting much time savings here, but I don't know how many times this function is called in your code.

I'm reorganising my code and have found some simplifications :)
When calculating a, b and c I can arrange things so that a > 0 always holds :)
and I know whether I want the smaller or the larger root :)
so the function that compares a root to a value reduces to the form below
bool isRootMinLessThanValue (int a, int b, int c, int value) {
    const int a_value = a * value;
    const int u = 2*a_value + b;
    const int v = a_value*value + b*value + c;
    return u > 0 || v < 0 ;
}
bool isRootMaxLessThanValue (int a, int b, int c, int value) {
    const int a_value = a*value;
    const int u = 2*a_value + b;
    const int v = a_value*value + b*value + c;
    return u > 0 && v > 0;
}
When I benchmark it, it is faster than calculating the roots the traditional way (because of the assumptions I cannot say by how much).
Below is the code for the fast comparison of a root to a value without those assumptions, together with the slow traditional version:
bool isRootLessThanValue (bool sqrtDeltaSign, int a, int b, int c, int value) {
    const int a_value = a*value;
    const int u = 2*a_value + b;
    const int v = a_value*value + b*value + c;
    const bool s = sqrtDeltaSign;
    return ( a < 0 && s && u < 0 ) ||
           ( a < 0 && s && v > 0 ) ||
           ( a < 0 && !s && u < 0 && v < 0) ||
           (!(a < 0) && !s && u > 0 ) ||
           (!(a < 0) && !s && v < 0 ) ||
           (!(a < 0) && s && u > 0 && v > 0);
}
bool isRootLessThanValueTraditional (bool sqrtDeltaSign, int a, int b, int c, int value) {
    double delta = b*b - 4.0*a*c;
    double calculatedRoot = sqrtDeltaSign ? (-b + sqrt(delta))/(2.0*a) : (-b - sqrt(delta))/(2.0*a);
    return calculatedRoot < value;
}
benchmark results below:
isRootLessThanValue (optimized): 10000000000 compares in 152.922s
isRootLessThanValueTraditional : 10000000000 compares in 196.168s
Any suggestions on how I can simplify the isRootLessThanValue function even more? :)
I will try to prepare a function to compare two roots of different equations.
bool isRootLessThanValue (bool sqrtDeltaSign, int a, int b, int c, int value) {
    const int a_value = a*value;
    const int u = 2*a_value + b;
    const int v = a_value*value + b*value + c;
    return sqrtDeltaSign ?
        (( a < 0 && (u < 0 || v > 0) ) || (u > 0 && v > 0)) :
        (( a > 0 && (u > 0 || v < 0) ) || (u < 0 && v < 0));
}
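Before relying on the simplified sign test, it can be checked against the traditional computation by brute force. The sketch below is only an illustration: the names fastLess, slowLess and countMismatches are hypothetical, and the check covers small integer coefficients only, where the double-precision reference is reliable and no integer overflow can occur.

```cpp
#include <cmath>

// Sign-based comparison, same logic as the simplified function above.
bool fastLess(bool sqrtDeltaSign, int a, int b, int c, int value) {
    const int a_value = a * value;
    const int u = 2 * a_value + b;
    const int v = a_value * value + b * value + c;
    return sqrtDeltaSign ?
        ((a < 0 && (u < 0 || v > 0)) || (u > 0 && v > 0)) :
        ((a > 0 && (u > 0 || v < 0)) || (u < 0 && v < 0));
}

// Reference implementation that actually computes the root.
bool slowLess(bool sqrtDeltaSign, int a, int b, int c, int value) {
    const double delta = b * b - 4.0 * a * c;
    const double root = sqrtDeltaSign ? (-b + std::sqrt(delta)) / (2.0 * a)
                                      : (-b - std::sqrt(delta)) / (2.0 * a);
    return root < value;
}

// Counts disagreements for all coefficients and values in [-range, range],
// skipping a == 0 (not a quadratic) and negative discriminants (no real root).
int countMismatches(int range) {
    int mismatches = 0;
    for (int a = -range; a <= range; ++a) {
        if (a == 0) continue;
        for (int b = -range; b <= range; ++b) {
            for (int c = -range; c <= range; ++c) {
                if (b * b - 4 * a * c < 0) continue;
                for (int value = -range; value <= range; ++value) {
                    for (int s = 0; s < 2; ++s) {
                        if (fastLess(s != 0, a, b, c, value) !=
                            slowLess(s != 0, a, b, c, value)) {
                            ++mismatches;
                        }
                    }
                }
            }
        }
    }
    return mismatches;
}
```

If the sign analysis is right, countMismatches should report no disagreements for small ranges such as 4. A non-zero count would point at a case the boolean expression does not cover.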


Looking for nbit adder in c++

I am trying to build a 17-bit adder; when overflow occurs, the result should wrap around just like it does for int32.
E.g. for an int32 add, if a = 2^31 - 1
int res = a+1
res = -2^31
This is the code I tried. It is not working; is there a better way? Do I need to convert decimal to binary and then perform the 17-bit operation?
int addOvf(int32_t result, int32_t a, int32_t b)
{
    int max = (-(0x01<<16));
    int min = ((0x01<<16) -1);
    int range_17bit = (0x01<<17);
    if (a >= 0 && b >= 0 && (a > max - b)) {
        printf("...OVERFLOW.........a=%0d b=%0d",a,b);
    }
    else if (a < 0 && b < 0 && (a < min - b)) {
        printf("...UNDERFLOW.........a=%0d b=%0d",a,b);
    }
    result = a+b;
    if(result<min) {
        while(result<min){ result=result + range_17bit; }
    }
    else if(result>min){
        while(result>max){ result=result - range_17bit; }
    }
    return result;
}
int main()
{
    int32_t res,x,y;
    res =addOvf(res,x,y);
    printf("Value of x=%0d y=%0d res=%0d",x,y,res);
    return 0;
}
You have your constants for max/min int17 reversed and off by one. They should be
max_int17 = (1 << 16) - 1 = 65535
min_int17 = -(1 << 16) = -65536.
Then I believe that max_int_n + m == min_int_n + (m-1) and min_int_n - m == max_int_n - (m-1), where n is the bit count and m is some integer in [min_int_n, ... ,max_int_n]. So putting that all together the function to treat two int32's as though they are int17's and add them would be like
int32_t add_as_int17(int32_t a, int32_t b) {
    static const int32_t max_int17 = (1 << 16) - 1;
    static const int32_t min_int17 = -(1 << 16);
    auto sum = a + b;
    if (sum < min_int17) {
        auto m = min_int17 - sum;
        return max_int17 - (m - 1);
    } else if (sum > max_int17) {
        auto m = sum - max_int17;
        return min_int17 + (m - 1);
    }
    return sum;
}
There is probably some more clever way to do that but I believe the above is correct, assuming I understand what you want.
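One such alternative, offered only as a sketch, is to do the addition in unsigned arithmetic, keep the low n bits, and sign-extend the result. The name addWrapped and the bits parameter are assumptions for illustration; the sketch assumes 2 <= bits <= 31 and a two's complement wrap-around is what is wanted.

```cpp
#include <cstdint>

// Adds a and b and wraps the sum into the signed range of a `bits`-wide integer.
// Assumption: 2 <= bits <= 31.
std::int32_t addWrapped(std::int32_t a, std::int32_t b, unsigned bits) {
    const std::uint32_t mask = (1u << bits) - 1u;   // keeps the low `bits` bits
    const std::uint32_t sign = 1u << (bits - 1u);   // position of the sign bit
    // Unsigned addition wraps modulo 2^32, so there is no signed overflow here.
    const std::uint32_t low =
        (static_cast<std::uint32_t>(a) + static_cast<std::uint32_t>(b)) & mask;
    // Sign-extend: flipping the sign bit and then subtracting it maps
    // [0, 2^bits) onto [-2^(bits-1), 2^(bits-1)).
    return static_cast<std::int32_t>(low ^ sign) - static_cast<std::int32_t>(sign);
}
```

With bits = 17, addWrapped(65535, 1, 17) wraps to -65536 and addWrapped(-65536, -1, 17) wraps to 65535, matching the constants given above. Unlike the branch-based version, this also handles sums that wrap more than once.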

Absolute value in objective function of linear optimization

I'm trying to find the solution for the following expression
Objective function:
minimize(| x - c0 | + | y - c1 |)
0 < x < A
0 < y < B
where c0, c1, A, B are positive constants
Following the conversion given in
I reworded the expression to
(x - c0) <= xbar
-1 *(x - c0) <= xbar
(y - c1) <= ybar
-1 *(y - c1) <= ybar
0 < x < A
0 < y < B
Objective function:
minimize(xbar + ybar)
However, I'm not able to implement this.
I tried the following snippet
#include "ortools/linear_solver/linear_solver.h"
#include "ortools/linear_solver/linear_expr.h"
MPSolver solver("distanceFinder", MPSolver::GLOP_LINEAR_PROGRAMMING);
MPVariable* x = solver.MakeNumVar(0, A, "x");
MPVariable* y = solver.MakeNumVar(0, B, "y");
const LinearExpr e = x;
const LinearExpr f = y;
LinearExpr X;
LinearExpr Y;
LinearRange Z = slope * e + offset == f; // Where 'slope' & 'offset' are real numbers.
const LinearRange r = -1 * (e - c0) <= X;
const LinearRange s = (e - c0) <= X ;
const LinearRange m = -1 * (f - c1) <= Y;
const LinearRange k = (f - c1) <= Y ;
MPObjective* const objective = solver.MutableObjective();
I'm getting the error,
E0206 16:41:08.889048 80935] No solution exists. MPSolverInterface::result_status_ = MPSOLVER_INFEASIBLE
My use cases always produce feasible solutions (I'm trying to find the least Manhattan distance between a point and a line).
I'm very new to Google OR-Tools. Please suggest any simpler solution I might have overlooked.
Any help will be appreciated.
Here is a working example. You mixed up the variables in your code:
const double A = 10.0;
const double B = 8.0;
const double c0 = 6.0;
const double c1 = 3.5;
MPSolver solver("distanceFinder", MPSolver::GLOP_LINEAR_PROGRAMMING);
MPVariable* x = solver.MakeNumVar(0, A, "x");
MPVariable* y = solver.MakeNumVar(0, B, "y");
MPVariable* xbar = solver.MakeNumVar(0, A, "xbar");
MPVariable* ybar = solver.MakeNumVar(0, B, "ybar");
LinearExpr X(x);
LinearExpr Y(y);
const LinearRange r = -1 * (X - c0) <= xbar;
const LinearRange s = (X - c0) <= xbar;
const LinearRange m = -1 * (Y - c1) <= ybar;
const LinearRange k = (Y - c1) <= ybar;
MPObjective *const objective = solver.MutableObjective();
objective->MinimizeLinearExpr(LinearExpr(xbar) + LinearExpr(ybar));
It computes
x = 6
y = 3.5
xbar = 0
ybar = -0
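The reformulation works because, for a fixed x, the smallest xbar that satisfies both (x - c0) <= xbar and -(x - c0) <= xbar is max(x - c0, -(x - c0)), which is |x - c0|; minimizing xbar + ybar therefore minimizes the original objective. The small check below is independent of OR-Tools and only illustrates that identity; the function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>

// Smallest bar satisfying both (x - c) <= bar and -(x - c) <= bar.
double smallestBar(double x, double c) {
    return std::max(x - c, -(x - c));
}

// Value of xbar + ybar when both bars are as small as the constraints allow.
double reformulatedObjective(double x, double y, double c0, double c1) {
    return smallestBar(x, c0) + smallestBar(y, c1);
}

// The original objective |x - c0| + |y - c1|, for comparison.
double originalObjective(double x, double y, double c0, double c1) {
    return std::fabs(x - c0) + std::fabs(y - c1);
}
```

At the solution reported above (x = 6, y = 3.5 with c0 = 6, c1 = 3.5) both functions evaluate to 0, which agrees with xbar = ybar = 0.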

Square Root in C/C++

I am trying to implement my own square root function which gives square root's integral part only e.g. square root of 3 = 1.
I saw the method here and tried to implement the method
int mySqrt(int x)
{
    int n = x;
    x = pow(2, ceil(log(n) / log(2)) / 2);
    int y=0;
    while (y < x)
    {
        y = (x + n / x) / 2;
        x = y;
    }
    return x;
}
The above method fails for input 8. Also, I don't get why it should work.
Also, I tried the method here
int mySqrt(int x)
{
    if (x == 0) return 0;
    int x0 = pow(2, (log(x) / log(2))/2) ;
    int y = x0;
    int diff = 10;
    while (diff>0)
    {
        x0 = (x0 + x / x0) / 2; diff = y - x0;
        y = x0;
        if (diff<0) diff = diff * (-1);
    }
    return x0;
}
In this second way, for input 3 the loop continues ... indefinitely (x0 toggles between 1 and 2).
I am aware that both are essentially versions of Newton's method, but I can't figure out why they fail in certain cases and how I could make them work for all cases. I believe the logic of my implementation is correct. I debugged my code but still can't find a way to make it work.
This one works for me:
uintmax_t zsqrt(uintmax_t x)
{
    if(x==0) return 0;
    uintmax_t yn = x; // The 'next' estimate
    uintmax_t y = 0; // The result
    uintmax_t yp; // The previous estimate
    do{
        yp = y;
        y = yn;
        yn = (y + x/y) >> 1; // Newton step
    }while(yn ^ yp); // (yn != yp) shortcut for dumb compilers
    return y;
}
returns floor(sqrt(x))
Instead of testing for 0 with a single estimate, test with 2 estimates.
When I was writing this, I noticed the result estimate would sometimes oscillate. This is because, if the exact result is a fraction, the algorithm could only jump between the two nearest values. So, terminating when the next estimate is the same as the previous will prevent an infinite loop.
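A different stopping rule that also avoids the oscillation, shown here only as a sketch: if the initial estimate is at or above the true root, the integer Newton estimates decrease until they reach floor(sqrt(n)), so the loop can stop as soon as an estimate fails to decrease. The name isqrtDescending and the starting value n/2 + 1 are assumptions chosen for illustration (the starting value also keeps x + n / x from overflowing).

```cpp
#include <cstdint>

// Integer square root: returns floor(sqrt(n)).
std::uint64_t isqrtDescending(std::uint64_t n) {
    if (n < 2) return n;                    // 0 and 1 are their own roots
    std::uint64_t x = n / 2 + 1;            // initial estimate, never below the root
    std::uint64_t y = (x + n / x) / 2;      // first Newton step
    while (y < x) {                         // keep going while the estimate decreases
        x = y;
        y = (x + n / x) / 2;
    }
    return x;                               // first estimate that did not decrease
}
```

For the two inputs from the question this gives isqrtDescending(8) == 2 and isqrtDescending(3) == 1. In both failing attempts the initial guess can start below the root, in which case the first step jumps above it and the simple loop conditions no longer hold.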
Try this
int n,i;//n is the input number
for(i=1;;i++) // linear search for the first i with i*i >= n
    if((i*i)==n)
        { cout<<"The number has exact root : "<<i<<endl; break; }
    else if((i*i)>n)
        { cout<<"The integer part is "<<(i-1)<<endl; break; }
Hope this helps.
You can try these C sqrt implementations:
// return the number that was multiplied by itself to reach N.
unsigned square_root_1(const unsigned num) {
    unsigned a, b, c, d;
    for (b = a = num, c = 1; a >>= 1; ++c);
    for (c = 1 << (c & -2); c; c >>= 2) {
        d = a + c;
        a >>= 1;
        if (b >= d)
            b -= d, a += c;
    }
    return a;
}
// return the number that was multiplied by itself to reach N.
unsigned square_root_2(unsigned n){
    unsigned a = n > 0, b;
    if (n > 3)
        for (a = n >> 1, b = (a + n / a) >> 1; b < a; a = b, b = (a + n / a) >> 1);
    return a ;
}
Example of usage:
#include <assert.h>
int main(void){
    unsigned num, res ;
    num = 1847902954, res = square_root_1(num), assert(res == 42987);
    num = 2, res = square_root_2(num), assert(res == 1);
    num = 0, res = square_root_2(num), assert(res == 0);
}

CUDA not returning result

I am trying to make a fraction calculator that calculates on a CUDA device. Below is first the sequential version and then my attempt at a parallel version.
It runs without error, but for some reason it does not give the result back. I have been trying to get this to work for 2 weeks now, but can't find the error!
Serialized version
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <ctime>
int f(int x, int c, int n);
int gcd(unsigned int u, unsigned int v);
int main ()
{
    clock_t start = clock();
    srand ( time(NULL) );
    int x = 1;
    int y = 2;
    int d = 1;
    int c = rand() % 100;
    int n = 323;
    if(n % y == 0)
        d = y;
    while(d == 1)
    {
        x = f(x, c, n);
        y = f(f(y, c, n), c, n);
        int abs = x - y;
        if(abs < 0)
            abs = abs * -1;
        d = gcd(abs, n);
        if(d == n)
        {
            printf("\nd == n");
            c = 0;
            while(c == 0 || c == -2)
                c = rand() % 100;
            x = 2;
            y = 2;
        }
    }
    int d2 = n/d;
    printf("\nTime elapsed: %f", ((double)clock() - start) / CLOCKS_PER_SEC);
    printf("\nResult: %d", d);
    printf("\nResult2: %d", d2);
    int dummyReadForPause;
}
int f(int x, int c, int n)
{
    return (int)(pow((float)x, 2) + c) % n;
}
int gcd(unsigned int u, unsigned int v){
    int shift;
    /* GCD(0,x) := x */
    if (u == 0 || v == 0)
        return u | v;
    /* Let shift := lg K, where K is the greatest power of 2
       dividing both u and v. */
    for (shift = 0; ((u | v) & 1) == 0; ++shift) {
        u >>= 1;
        v >>= 1;
    }
    while ((u & 1) == 0)
        u >>= 1;
    /* From here on, u is always odd. */
    do {
        while ((v & 1) == 0) /* Loop X */
            v >>= 1;
        /* Now u and v are both odd, so diff(u, v) is even.
           Let u = min(u, v), v = diff(u, v)/2. */
        if (u < v) {
            v -= u;
        } else {
            int diff = u - v;
            u = v;
            v = diff;
        }
        v >>= 1;
    } while (v != 0);
    return u << shift;
}
Parallel version
#define threads 512
#define MaxBlocks 65535
#define RunningTheads (512*100)
__device__ int gcd(unsigned int u, unsigned int v)
{
    int shift;
    if (u == 0 || v == 0)
        return u | v;
    for (shift = 0; ((u | v) & 1) == 0; ++shift) {
        u >>= 1;
        v >>= 1;
    }
    while ((u & 1) == 0)
        u >>= 1;
    do {
        while ((v & 1) == 0)
            v >>= 1;
        if (u < v) {
            v -= u;
        } else {
            int diff = u - v;
            u = v;
            v = diff;
        }
        v >>= 1;
    } while (v != 0);
    return u << shift;
}
__device__ bool cuda_found;
__global__ void cudaKernal(int *cArray, int n, int *outr)
{
    int index = blockIdx.x * threads + threadIdx.x;
    int x = 1;
    int y = 2;
    int d = 4;
    int c = cArray[index];
    while(d == 1 && !cuda_found)
    {
        x = (int)(pow((float)x, 2) + c) % n;
        y = (int)(pow((float)y, 2) + c) % n;
        y = (int)(pow((float)y, 2) + c) % n;
        int abs = x - y;
        if(abs < 0)
            abs = abs * -1;
        d = gcd(abs, n);
    }
    if(d != 1 && !cuda_found)
    {
        cuda_found = true;
        outr = &d;
    }
}
int main ()
{
    int n = 323;
    int cArray[RunningTheads];
    cArray[0] = 1;
    for(int i = 1; i < RunningTheads-1; i++)
        cArray[i] = i+2;
    int dresult = 0;
    int *dev_cArray;
    int *dev_result;
    HANDLE_ERROR(cudaMalloc((void**)&dev_cArray, RunningTheads*sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_result, sizeof(int)));
    HANDLE_ERROR(cudaMemcpy(dev_cArray, cArray, RunningTheads*sizeof(int), cudaMemcpyHostToDevice));
    int TotalBlocks = ceil((float)RunningTheads/(float)threads);
    if(TotalBlocks > MaxBlocks)
        TotalBlocks = MaxBlocks;
    printf("Blocks: %d\n", TotalBlocks);
    printf("Threads: %d\n\n", threads);
    cudaKernal<<<TotalBlocks,threads>>>(dev_cArray, n, dev_result);
    HANDLE_ERROR(cudaMemcpy(&dresult, dev_result, sizeof(int), cudaMemcpyDeviceToHost));
    if(dresult == 0)
        dresult = 1;
    int d2 = n/dresult;
    printf("\nResult: %d", dresult);
    printf("\nResult2: %d", d2);
    int dummyReadForPause;
}
Let's have a look at your kernel code:
__global__ void cudaKernal(int *cArray, int n, int *outr)
{
    int index = blockIdx.x * threads + threadIdx.x;
    int x = 1;
    int y = 2;
    int d = 4;
    int c = cArray[index];
    while(d == 1 && !cuda_found) // always false because d is always 4
    {
        x = (int)(pow((float)x, 2) + c) % n;
        y = (int)(pow((float)y, 2) + c) % n;
        y = (int)(pow((float)y, 2) + c) % n;
        int abs = x - y;
        if(abs < 0)
            abs = abs * -1;
        d = gcd(abs, n); // never writes to d because the loop won't
                         // be executed
    }
    if(d != 1 && !cuda_found) // may be true if cuda_found was initialized
                              // with false
    {
        cuda_found = true; // Memory race here.
        outr = &d; // you are changing the address that outr
                   // points to; the host code does not see this
                   // change. Your cudaMemcpy dev -> host will copy
                   // back from the device exactly the values that
                   // were uploaded by cudaMemcpy host -> dev.
                   // If you want to set the result to 4, then write:
                   // *outr = d;
    }
}
One of the problems is that you don't return the result. In your code you just change outr, which has local scope in your kernel function (i.e. changes are not seen outside this function). You should write *outr = d; to change the value of the memory that outr points to.
Also, I'm not sure if CUDA initializes global variables with zero. Are you sure cuda_found is always initialized to false?
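The difference between outr = &d and *outr = d is not specific to CUDA; it is ordinary pointer semantics, since the pointer itself is passed by value. A minimal host-side C++ sketch of the same mistake follows (all function names here are hypothetical, chosen only for the illustration):

```cpp
// Reassigning the parameter only changes the local copy of the pointer,
// mirroring `outr = &d;` in the kernel. The caller's memory is untouched.
void storeWrong(int *out, int value) {
    int d = value;
    out = &d;
    (void)out;  // silence "set but not used" warnings
}

// Writing through the pointer changes the memory the caller passed in,
// mirroring `*outr = d;`.
void storeRight(int *out, int value) {
    *out = value;
}

// Each demo starts with result == 0 and returns what the caller sees afterwards.
int demoWrong() {
    int result = 0;
    storeWrong(&result, 4);
    return result;
}

int demoRight() {
    int result = 0;
    storeRight(&result, 4);
    return result;
}
```

Here demoWrong() still returns 0 while demoRight() returns 4, which is the same effect the host sees after the cudaMemcpy back from the device.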

Optimizing C++ code for performance

Can you think of some way to optimize this piece of code? It's meant to run on an ARMv7 processor (iPhone 3GS):
4.0% inline float BoxIntegral(IplImage *img, int row, int col, int rows, int cols)
{
0.7% float *data = (float *) img->imageData;
1.4% int step = img->widthStep/sizeof(float);
// The subtraction by one for row/col is because row/col is inclusive.
1.1% int r1 = std::min(row, img->height) - 1;
1.0% int c1 = std::min(col, img->width) - 1;
2.7% int r2 = std::min(row + rows, img->height) - 1;
3.7% int c2 = std::min(col + cols, img->width) - 1;
float A(0.0f), B(0.0f), C(0.0f), D(0.0f);
8.5% if (r1 >= 0 && c1 >= 0) A = data[r1 * step + c1];
11.7% if (r1 >= 0 && c2 >= 0) B = data[r1 * step + c2];
7.6% if (r2 >= 0 && c1 >= 0) C = data[r2 * step + c1];
9.2% if (r2 >= 0 && c2 >= 0) D = data[r2 * step + c2];
21.9% return std::max(0.f, A - B - C + D);
3.8% }
All this code is taken from the OpenSURF library. Here's the context of the function (some people were asking for the context):
//! Calculate DoH responses for supplied layer
void FastHessian::buildResponseLayer(ResponseLayer *rl)
{
    float *responses = rl->responses; // response storage
    unsigned char *laplacian = rl->laplacian; // laplacian sign storage
    int step = rl->step; // step size for this filter
    int b = (rl->filter - 1) * 0.5 + 1; // border for this filter
    int l = rl->filter / 3; // lobe for this filter (filter size / 3)
    int w = rl->filter; // filter size
    float inverse_area = 1.f/(w*w); // normalisation factor
    float Dxx, Dyy, Dxy;
    for(int r, c, ar = 0, index = 0; ar < rl->height; ++ar)
    {
        for(int ac = 0; ac < rl->width; ++ac, index++)
        {
            // get the image coordinates
            r = ar * step;
            c = ac * step;
            // Compute response components
            Dxx = BoxIntegral(img, r - l + 1, c - b, 2*l - 1, w)
                - BoxIntegral(img, r - l + 1, c - l * 0.5, 2*l - 1, l)*3;
            Dyy = BoxIntegral(img, r - b, c - l + 1, w, 2*l - 1)
                - BoxIntegral(img, r - l * 0.5, c - l + 1, l, 2*l - 1)*3;
            Dxy = + BoxIntegral(img, r - l, c + 1, l, l)
                  + BoxIntegral(img, r + 1, c - l, l, l)
                  - BoxIntegral(img, r - l, c - l, l, l)
                  - BoxIntegral(img, r + 1, c + 1, l, l);
            // Normalise the filter responses with respect to their size
            Dxx *= inverse_area;
            Dyy *= inverse_area;
            Dxy *= inverse_area;
            // Get the determinant of hessian response & laplacian sign
            responses[index] = (Dxx * Dyy - 0.81f * Dxy * Dxy);
            laplacian[index] = (Dxx + Dyy >= 0 ? 1 : 0);
#ifdef RL_DEBUG
            // create list of the image coords for each response
#endif
        }
    }
}
Some questions:
Is it a good idea that the function is inline?
Would using inline assembly provide a significant speedup?
Specialize for the edges so that you don't need to check for them in every row and column. I assume that this call is in a nested loop and is called a lot. This function would become:
inline float BoxIntegralNonEdge(IplImage *img, int row, int col, int rows, int cols)
{
    float *data = (float *) img->imageData;
    int step = img->widthStep/sizeof(float);
    // The subtraction by one for row/col is because row/col is inclusive.
    int r1 = row - 1;
    int c1 = col - 1;
    int r2 = row + rows - 1;
    int c2 = col + cols - 1;
    float A(data[r1 * step + c1]), B(data[r1 * step + c2]), C(data[r2 * step + c1]), D(data[r2 * step + c2]);
    return std::max(0.f, A - B - C + D);
}
You get rid of a conditional and branch for each min and two conditionals and a branch for each if. You can only call this function if you already meet the conditions -- check that in the caller for the whole row once instead of each pixel.
I wrote up some tips for optimizing image processing when you have to do work on each pixel:
Other things from the blog:
1. You are recalculating a position in the image data with 2 multiplies (indexing is multiplication); you should be incrementing a pointer.
2. Instead of passing in img, row, col, rows and cols, pass in pointers to the exact pixels to process, which you get from incrementing pointers, not indexing.
3. If you don't do the above, step is the same for all pixels, so calculate it in the caller and pass it in. If you do 1 and 2, you won't need step at all.
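A rough sketch of the pointer-based idea for the interior (non-edge) case is shown below. It is only an illustration: the name boxIntegralInterior, the raw const float* interface and the demo image are assumptions, and the caller is assumed to have checked that row >= 1, col >= 1 and that the box lies inside the image.

```cpp
#include <algorithm>

// Box sum over an integral image, interior case only.
// `data` points at the first element, `step` is the row stride in floats.
inline float boxIntegralInterior(const float *data, int step,
                                 int row, int col, int rows, int cols) {
    // Corner A at (row - 1, col - 1); the other three corners are fixed offsets
    // from it, so the index arithmetic is done once instead of four times.
    const float *top = data + (row - 1) * step + (col - 1);
    const float *bottom = top + rows * step;
    return std::max(0.0f, top[0] - top[cols] - bottom[0] + bottom[cols]);
}

// Demo on the integral image of a 3x3 image filled with ones:
// the 2x2 box starting at row 1, col 1 should sum to 4.
float demoBoxSum() {
    const float integral[9] = {1, 2, 3,
                               2, 4, 6,
                               3, 6, 9};
    return boxIntegralInterior(integral, 3, 1, 1, 2, 2);
}
```

In a loop over a whole row, top and bottom could simply be incremented per pixel instead of being recomputed, which is what removes the multiplications from the inner loop. Whether that is measurably faster on the target device would have to be profiled.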
There are a few places to reuse temporary variables, but whether it would improve performance would have to be measured as dirkgently stated:
if (r1 >= 0 && c1 >= 0) A = data[r1 * step + c1];
if (r1 >= 0 && c2 >= 0) B = data[r1 * step + c2];
if (r2 >= 0 && c1 >= 0) C = data[r2 * step + c1];
if (r2 >= 0 && c2 >= 0) D = data[r2 * step + c2];
could become:
if (r1 >= 0) {
    int r1Step = r1 * step;
    if (c1 >= 0) A = data[r1Step + c1];
    if (c2 >= 0) B = data[r1Step + c2];
}
if (r2 >= 0) {
    int r2Step = r2 * step;
    if (c1 >= 0) C = data[r2Step + c1];
    if (c2 >= 0) D = data[r2Step + c2];
}
You may actually end up doing the temporary multiplications too often if your if statements rarely evaluate to true.
You aren't interested in four variables A, B, C, D, but only the combination A - B - C + D.
float result(0.0f);
if (r1 >= 0 && c1 >= 0) result += data[r1 * step + c1];
if (r1 >= 0 && c2 >= 0) result -= data[r1 * step + c2];
if (r2 >= 0 && c1 >= 0) result -= data[r2 * step + c1];
if (r2 >= 0 && c2 >= 0) result += data[r2 * step + c2];
if (result > 0.0f) return result;
return 0.0f;
The compiler probably handles inlining automatically where it's appropriate.
Without any knowledge about the context. Is the if(r1 >= 0 && c1 >= 0) check necessary?
Isn't it required that the row and col parameters are > 0?
float BoxIntegral(IplImage *img, int row, int col, int rows, int cols)
{
    assert(row > 0 && col > 0);
    float *data = reinterpret_cast<float*>(img->imageData); // Don't use C-style casts
    int step = img->widthStep/sizeof(float);
    // Is the min check really necessary?
    int r1 = std::min(row, img->height) - 1;
    int c1 = std::min(col, img->width) - 1;
    int r2 = std::min(row + rows, img->height) - 1;
    int c2 = std::min(col + cols, img->width) - 1;
    int r1_step = r1 * step;
    int r2_step = r2 * step;
    float A = data[r1_step + c1];
    float B = data[r1_step + c2];
    float C = data[r2_step + c1];
    float D = data[r2_step + c2];
    return std::max(0.0f, A - B - C + D);
}
Some of the examples say to initialize A, B, C and D directly and skip the initialization with 0, but this is functionally different than your original code in some ways. I would do this however:
inline float BoxIntegral(IplImage *img, int row, int col, int rows, int cols) {
const float *data = (float *) img->imageData;
const int step = img->widthStep/sizeof(float);
// The subtraction by one for row/col is because row/col is inclusive.
const int r1 = std::min(row, img->height) - 1;
const int r2 = std::min(row + rows, img->height) - 1;
const int c1 = std::min(col, img->width) - 1;
const int c2 = std::min(col + cols, img->width) - 1;
const float A = (r1 >= 0 && c1 >= 0) ? data[r1 * step + c1] : 0.0f;
const float B = (r1 >= 0 && c2 >= 0) ? data[r1 * step + c2] : 0.0f;
const float C = (r2 >= 0 && c1 >= 0) ? data[r2 * step + c1] : 0.0f;
const float D = (r2 >= 0 && c2 >= 0) ? data[r2 * step + c2] : 0.0f;
return std::max(0.f, A - B - C + D);
}
Like your original code, this will make A, B, C and D have a value either from data[] if the condition is true or 0.0f if the condition is false. Also, I would (as I have shown) use const wherever it is appropriate. Many compilers aren't able to improve code much based on const-ness, but it certainly can't hurt to give the compiler more information about the data it is operating on. Finally, I have reordered the r1/r2/c1/c2 variables to encourage reuse of the fetched width and height.
Obviously you would need to profile to determine if any of this is actually an improvement.
I am not sure if your problem lends itself to SIMD but this could potentially allow you to perform multiple operations on your image at once and give you a good performance improvement. I am assuming you are inlining and optimizing because you are performing the operation multiple times. Take a look at:
Compilers do have some support for NEON if the correct flags are enabled, but you will probably need to write some of it yourself.
To get compiler support for neon you will need to use the compiler flag -mfpu=neon